Return-Path: comp.mail.sendmail
From: rickert@mp.cs.niu.edu (Neil Rickert)
Date: Mon, 29 Apr 1991 14:40:19 GMT
Newsgroup: comp.mail.sendmail/2224
Message-Id: <1991Apr29.144019.22206@mp.cs.niu.edu>
Subject: Sendmail address rewrite sequence (was Re: From: , relaying and FAQ)

		 ----------------------------

                  ADDRESS REWRITING IN SENDMAIL.

 Prepared by Neil Rickert, Northern Illinois University, Apr 29, 1991.
 <rickert@cs.niu.edu>

 The author of this document has no relation whatsoever with the developers of 
sendmail at Berkeley.  Consequently any errors are in no way the fault of
Berkeley.  No guarantee is made as to the accuracy of this document.  Those
who wish guaranteed accurate information must read the source code themselves.

 Any comments or corrections are welcome.

                     -------------------------

 This description is intended to be a more thorough description of address 
parsing than is contained in the usual manual.  Some basic familiarity with 
the manual is assumed.  In particular, we do not attempt to explain the 
rewriting which occurs in a single rule, or the process in which an address 
passes from one rule to the next. 

 The description contains a few oversimplifications, such as omitting mention 
of some special cases during internal processing of alias expansions.  These 
simplifications are not directly related to the main issues covered.

 I make no guarantees as to the accuracy of this document.  It is based on 
experience with developing and debugging rulesets, and with reading parts of 
the code.  It mostly reflects my understanding of the status of sendmail based 
on the 5.65+IDA versions but with some cross checks with the Berkeley 5.65 
code. 

1.  A BRIEF OVERVIEW OF THE REWRITING PROCESS.

    Each address is initially rewritten by ruleset 3.  Thereafter the 
    processing depends on whether this is a recipient address or a sender 
    address.  A sender address is processed by ruleset 1, then by the ruleset 
    declared in the mailer, and finally by ruleset 4.  A recipient address, 
    after processing by ruleset 3, is normally processed by ruleset 2, the 
    ruleset declared in the mailer, and by ruleset 4. 

    The above description is, of course, a gross oversimplification.  We shall 
    fill in some of the details below. 

2.  BASIC ADDRESS PARSING.

    In order to send a message, the recipient address must be parsed, so as to 
    determine the transport mechanism and the next step in the chain of mail 
    relays.  This is handled in the function parseaddr(), which is in 
    parseaddr.c.

    The basic parsing strategy is as follows:

    As with all rewriting, the address is first rewritten by ruleset 3.  The 
    result of this rewrite is now rewritten by ruleset 0.  The purpose of 
    ruleset 0 is to resolve the address into a triple: Mailer, Host, 
    User_address.  This resolution occurs when the left hand side matches a 
    suitable rule, and the RHS is rewritten as: 

             $# MAILER_NAME $@ HOST_NAME $: USER_ADDRESS

    Apart from the exceptions noted below, every address must resolve into 
    such a triple.  If not, 'sendmail' will complain.  We should clarify, 
    however, that the user address is allowed to contain a host component.  
    Thus a UUCP address of john@uunet.UUCP might resolve to:
                      $# UUCP $@ uunet $: john 
    while uunet!seismo!harry might resolve to
                       $# UUCP $@ uunet $: seismo!harry
    (In all the above examples the spacing is for readability.  We will comment 
    more on spacing when we discuss the function prescan() later in this 
    document). 

    It should also be clarified that the HOST_NAME as returned by ruleset 0 
    does not necessarily have anything to do with the domain component of the 
    original address.  It is the name of the host to which the message will 
    be next sent as known to the transmission software.  For the tcp mailers, 
    this is the fully qualified domain name.  For other mailers it may be 
    different.  In our example above, although the fully qualified uucp name 
    of uunet may be uunet.UUCP, the name known to the uucp software is just 
    uunet, so that must be the host name returned from ruleset 0.

    There are two exceptions to the above description.  The local mailer, and 
    the ERROR mailer. Both of these must resolve to just MAILER_NAME and 
    USER_ADDRESS, since a host does not make sense.  (The IDA versions of 
    'sendmail' permit the use of a triple with the local mailer, but ignore 
    the host).  In the case of the ERROR mailer, the USER_ADDRESS is actually 
    the error message.  Thus one could imagine a rewrite rule in ruleset 0. 

    R$+@annex1.$D           $#ERROR $:terminal servers don't accept mail 

    It is a basic principle of 'sendmail' that ruleset 0 must select a mailer, 
    even if only the ERROR mailer. 

    An address which resolves to the ERROR mailer is said to be 'unparseable' 
    and normally leads to bounced mail. 
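
    As an illustration only, the resolution described in this section can be 
    sketched in Python.  The function name split_resolution and the idea of 
    passing the output of ruleset 0 as a list of tokens are assumptions made 
    for this sketch; this is not how parseaddr() is written.  The host part is 
    treated as optional, to cover the local and ERROR mailers.

```python
def split_resolution(tokens):
    """Split the output of ruleset 0 into (mailer, host, user_tokens).

    `tokens` is assumed to look like
        ["$#", "UUCP", "$@", "uunet", "$:", "seismo", "!", "harry"]
    The "$@ host" part may be absent (local and ERROR mailers).
    """
    if len(tokens) < 2 or tokens[0] != "$#":
        raise ValueError("ruleset 0 did not select a mailer")
    mailer = tokens[1]
    rest = tokens[2:]
    host = None
    if rest and rest[0] == "$@":
        if "$:" not in rest:
            raise ValueError("missing $: user part")
        colon = rest.index("$:")
        host = "".join(rest[1:colon])   # host tokens glued back together
        rest = rest[colon:]
    if not rest or rest[0] != "$:":
        raise ValueError("missing $: user part")
    return mailer, host, rest[1:]


print(split_resolution(["$#", "UUCP", "$@", "uunet", "$:", "seismo", "!", "harry"]))
print(split_resolution(["$#", "local", "$:", "john"]))
```

    The user part is left as a token list, since the user address may itself 
    contain a host component, as in the seismo!harry example.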

3.  DIFFERENT CLASSES OF ADDRESSES: ENVELOPES AND HEADERS.

    'sendmail' must deal with addresses on the message envelope, and addresses 
    on the headers inside the message.  The distinction between envelope and 
    header is quite important for an understanding of how 'sendmail' 
    functions.  However it can be confusing at first. 

    In basic terms, the "header recipients" are the addresses on the "To:" and 
    "Cc:" header lines.  (We could include the "Bcc:" header, but this is 
    usually discarded.)  The "header senders" are the addresses on the "From:" 
    and the "Reply-To:" headers.  Sometimes there may also be "Resent-To:", 
    "Resent-Cc:", "Resent-From:" headers with additional header addresses. 

    The "envelope recipients" are the addresses the message is really being 
    sent to.  They are often different from the "header recipients".  The 
    "envelope sender" is the real sender of the message, and is occasionally 
    different from the "header senders". 

    To the novice it is often not obvious why there need be a distinction 
    between header and envelope addresses.  When a typical message originates, 
    the envelope and header addresses are the same.  However, much can change 
    during processing. A message might be sent to two recipients A and B on 
    different machines, host-A and host-B.  Initially both addresses are in 
    the envelope.  However when the message is relayed to host-A, only 
    recipient A remains in the envelope.  Likewise the copy of the message 
    sent to host-B will contain only B in the envelope.  Roughly speaking, as 
    a message is delivered to a recipient address, that address is deleted 
    from the envelope.  However all recipient addresses remain in the headers, 
    as these provide documentation to the human reader of the message as to 
    whom the message was sent. 

    As another example, you may have a '.forward' file in your home directory 
    to forward your mail.  The forwarding action applies only to the envelope 
    recipients of a message, while the addresses on the header are not 
    affected. 

    The distinction between header senders and envelope sender is sometimes 
    harder to explain.  We start with an example unrelated to computing.  
    Suppose as a gift you purchase for a friend a subscription to "Scientific 
    American".  As part of the process you fill out a gift card which the 
    publishers will send.  When the gift card arrives, the card says it is 
    from you.  However the return address on the envelope is that of 
    Scientific American.  Essentially you were the header sender, but 
    Scientific American was the envelope sender.  For a comparable example in 
    computing, suppose you post an article in Usenet news.  This article comes 
    from you, and this is shown in the "From:" header.  However at some other 
    site, a copy of your article is mailed to someone without direct Usenet 
    news access.  The Usenet news system on the mailing host is the real 
    sender, or envelope sender, while as the author of the article you remain 
    the header sender. 

    In simple terms, the addresses in the headers are for humans to read, 
    while the addresses in the envelope are for machines to read. 
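
    The two-recipient example above can be made concrete with a small Python 
    sketch.  The names split_envelope_by_host, headers and envelope, and the 
    sender me@host-C, are made up for the illustration, and every address is 
    assumed to have the simple form user@host.

```python
from collections import defaultdict


def split_envelope_by_host(envelope_recipients):
    """Group envelope recipients by the host each copy is relayed to."""
    per_host = defaultdict(list)
    for addr in envelope_recipients:
        _user, _, host = addr.rpartition("@")
        per_host[host].append(addr)
    return dict(per_host)


# Hypothetical message: the headers list both recipients for the human reader.
headers = {"From": "me@host-C", "To": "A@host-A, B@host-B"}
envelope = ["A@host-A", "B@host-B"]

copies = split_envelope_by_host(envelope)
for host, rcpts in sorted(copies.items()):
    # Each copy carries the full headers but only its own envelope recipients.
    print(host, rcpts, headers["To"])
```

    The headers are never touched by the split; only the envelope shrinks as 
    each copy moves toward its destination.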

4.  WHERE DO ENVELOPES COME FROM AND WHERE DO THEY GO.

    When mail is sent or received over the network with SMTP, the envelope 
    sender is transmitted first in the 'MAIL FROM:' SMTP command.  Then a 
    sequence of 'RCPT TO:' commands transfers the envelope recipients. 

    When you send a normal message on your system, the Mail User Agent (MUA) 
    invokes sendmail as:

    /usr/lib/sendmail recipient1 recipient2 recipient3 ...

    Thus the envelope recipients are passed as parameters on the argument 
    list.  The envelope sender, in that case, is taken from the uid of the 
    person sending the message.  It is also possible to invoke sendmail as 

    /usr/lib/sendmail -fsender recipient1 recipient2 recipient3 ...

    and thereby give a separate envelope sender.  However sendmail normally 
    will ignore the '-fsender' operand unless given by a trusted user such as 
    'root' or 'uucp' or 'daemon'. 

    When 'sendmail' invokes /bin/mail for delivery to your mailbox, it uses a 
    parameter list something like: 

    /bin/mail -r sender -d recipient1 recipient2 recipient3 ...

    Thus the envelope information is passed along to /bin/mail in a manner 
    similar to the way 'sendmail' received it.  If mail is sent out through 
    UUCP, the envelope sender is recorded on the unix "From " line which is 
    the first line of the message, and the envelope recipients become the 
    command operands to 'rmail' on the remote system. 
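
    The following Python sketch shows the same envelope expressed in two of 
    the forms described in this section: as SMTP commands (spelled MAIL FROM: 
    and RCPT TO: as in RFC 821) and as a sendmail argument list.  The function 
    names are invented for the sketch, and nothing here talks to a real 
    mailer.

```python
def smtp_envelope_commands(sender, recipients):
    """Return the SMTP commands that carry the envelope."""
    commands = ["MAIL FROM:<%s>" % sender]
    commands += ["RCPT TO:<%s>" % rcpt for rcpt in recipients]
    return commands


def sendmail_argv(recipients, sender=None):
    """Return the argument list an MUA might use to invoke sendmail.

    When `sender` is None, sendmail takes the envelope sender from the uid
    of the caller, as described in the text.
    """
    argv = ["/usr/lib/sendmail"]
    if sender is not None:
        argv.append("-f" + sender)
    return argv + list(recipients)


print(smtp_envelope_commands("john@host-A", ["mary@host-B", "bob@host-C"]))
print(sendmail_argv(["mary@host-B", "bob@host-C"], sender="john@host-A"))
```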

5.  TOKENIZING AND REBUILDING ADDRESSES.  prescan() AND cataddr().

    Before an address is processed by one of the rewrite rulesets, it must be 
    tokenized.  The rewriting rules only examine complete tokens and strings 
    of tokens during their matching.  After the rewriting is complete the 
    tokenized address must be converted back to a character string.  The 
    function prescan() does the tokenizing, while cataddr() converts a 
    tokenized address back to a character string. 

    The output of prescan() is an array of strings, with one array entry for 
    each token.  The tokenizing is done based on the characters in $o, defined 
    in the line beginning 'Do' in 'sendmail.cf'.  There are, additionally, a 
    few characters whose special handling is built into prescan().

    In an address such as the following (from RFC822):

         Wilt . (the Stilt) Chamberlain@NBA.US

    the parenthesized comments "(the Stilt)" are filtered out by prescan().  
    (However when such an address appears on a header, the comments are saved 
    for reinsertion after the address has been rewritten).  prescan() also 
    insists on properly balanced parentheses and properly balanced <angle 
    brackets>.  A string between "double quotes" becomes a single token.  With 
    one exception, an escaped character (such as \@) loses any special 
    properties.  Thus the address 'user@host' would ordinarily become the 
    three tokens "user", "@", "host", while the address 'user\@host' becomes a 
    single token.  The one exception to the backslash escaping is \!, which is 
    simply converted to !.  The special handling is an attempt to be somewhat 
    forgiving to csh users who sometimes become overly zealous in escaping 
    their bangs. 

    In tokenizing, special characters (those characters defined in $o) become 
    single character special tokens.  A string of ordinary characters becomes 
    a single token.  The space character, however, is never part of a token 
    except when in a quoted string or when backslash escaped.  The space 
    character is a token separator.  Thus the string AB becomes one token "AB", 
    while the string A B becomes two tokens "A", "B".  A space before or after 
    a special character, however, is completely superfluous.  Thus user @ host 
    is tokenized as "user","@","host", exactly as would be user@host.
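
    The tokenizing rules above can be approximated by the following Python 
    sketch.  It is a simplified model, not a transcription of prescan(): the 
    operator set is passed as a parameter (the default shown is only a typical 
    $o value, assumed here), angle brackets are not checked, and a comment is 
    assumed to act as a token separator.

```python
def prescan(address, operators=".:%@!^=/[]"):
    """Tokenize an address along the lines described in the text."""
    tokens = []
    current = []

    def flush():
        if current:
            tokens.append("".join(current))
            del current[:]

    i, n, depth = 0, len(address), 0
    while i < n:
        c = address[i]
        if depth > 0:                      # inside a (comment): skip it
            if c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
            i += 1
        elif c == "(":
            flush()
            depth = 1
            i += 1
        elif c == "\\" and i + 1 < n:
            if address[i + 1] == "!":      # \! is simply converted to !
                i += 1
            else:                          # escaped char loses special meaning
                current.append(address[i:i + 2])
                i += 2
        elif c == '"':                     # quoted string is a single token
            flush()
            end = address.find('"', i + 1)
            if end < 0:
                raise ValueError("unbalanced quotes")
            tokens.append(address[i:end + 1])
            i = end + 1
        elif c.isspace():                  # space only separates tokens
            flush()
            i += 1
        elif c in operators:               # special chars are 1-char tokens
            flush()
            tokens.append(c)
            i += 1
        else:
            current.append(c)
            i += 1
    if depth:
        raise ValueError("unbalanced parentheses")
    flush()
    return tokens


print(prescan("Wilt . (the Stilt) Chamberlain@NBA.US"))
print(prescan("user @ host"), prescan("user\\@host"))
```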

    The function cataddr() sounds simple.  Just concatenate all the tokens to 
    form a new string.  This is almost what it does.  But you have to make a 
    special case where there are two ordinary tokens.  If 'A  B' is tokenized 
    as "A","B", then just concatenating the strings would produce the 
    incorrect 'AB'.  Instead, when cataddr() discovers two consecutive 
    ordinary tokens, it inserts between them the space substitute character 
    defined on the 'OB' line of 'sendmail.cf'.  In most versions of 
    'sendmail.cf' this character is the period '.' leading to the effect that 
    the original 'A B' was tokenized to "A","B", then is untokenized to 'A.B'.  
    If you don't like this, you can define the space substitute as a blank.  
    If you do so, but forward the message to another 'sendmail' it will 
    probably be converted to a period again.  (One problem with defining the 
    space substitute as a blank is that this can easily become invisible in 
    'sendmail.cf', and if it is accidentally deleted you might find you are 
    accidentally defining the space substitute to be '\n' or '\0'). 
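
    A minimal Python sketch of the rebuilding step follows.  It assumes that a 
    token is "special" exactly when it is a single character from the operator 
    set, and it takes the space substitute as a parameter rather than reading 
    the 'OB' line; both parameter names are inventions of this sketch.

```python
def cataddr(tokens, operators=".:%@!^=/[]", space_sub="."):
    """Rebuild an address string from tokens.

    The space substitute is inserted only between two consecutive
    ordinary (non-operator) tokens.
    """
    out = []
    prev_ordinary = False
    for tok in tokens:
        ordinary = not (len(tok) == 1 and tok in operators)
        if ordinary and prev_ordinary:
            out.append(space_sub)
        out.append(tok)
        prev_ordinary = ordinary
    return "".join(out)


print(cataddr(["user", "@", "host"]))        # no substitute needed
print(cataddr(["A", "B"]))                   # substitute between A and B
print(cataddr(["A", "B"], space_sub=" "))    # blank as the substitute
```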


6.  THE C-FLAG PROCESSING.

    Certain mailers have the C flag set in their mailer definition.  It 
    affects the processing of addresses as follows:

    After an address has been rewritten by ruleset 3, a check is done to see 
    if the address now contains an "@".  If it already contains an "@", then 
    C-FLAG processing has no effect.  If there is no "@", the 'receiving 
    mailer' is checked.  (The determination of the receiving mailer is 
    described below).  If the receiving mailer has the C-FLAG defined, and if 
    the sender address in $f contains an '@', everything from the '@' onward 
    in $f is appended to the end of the address as outputted by ruleset 3, and 
    the modified address is now reprocessed by ruleset 3.  The actual check 
    for the '@' in $f is done while the sender address is still tokenized, so 
    there is no additional call to prescan() in the C-FLAG processing.
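
    The decision just described can be sketched in Python as follows.  The 
    function name apply_c_flag, the use of token lists, and the representation 
    of the mailer flags as a string are all assumptions of this sketch.  The 
    caller is expected to pass the result through ruleset 3 again when the 
    second return value is true.

```python
def apply_c_flag(address_tokens, sender_tokens, mailer_flags):
    """Return (tokens, changed) after C-FLAG processing.

    address_tokens: the address as output by ruleset 3
    sender_tokens:  the tokenized sender address (the source of $f)
    mailer_flags:   the F= flags of the receiving mailer, e.g. "mDFMuXC"
    """
    if "@" in address_tokens:
        return list(address_tokens), False     # already has a domain part
    if "C" not in mailer_flags or "@" not in sender_tokens:
        return list(address_tokens), False
    at = sender_tokens.index("@")
    # Append everything from the '@' onward in the sender address.
    return list(address_tokens) + list(sender_tokens[at:]), True


sender = ["mary", "@", "cs", ".", "niu", ".", "edu"]
print(apply_c_flag(["john"], sender, "C"))
print(apply_c_flag(["john", "@", "uunet"], sender, "C"))
```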


7.  MAILER-SPECIFIC RULESETS.

    A typical mailer definition looks something like the following:

    Mlocal, P=/bin/mail, F=MFlusr, S=10, R=12, A=mail -d $u

    Here the S= and R= operands specify address rewrite rulesets to be used 
    for mail sent by this mailer.  The S= operand is for sender addresses, and 
    the R= is for recipient addresses.  We shall refer to these as the mailer-
    specific rulesets.  In the IDA version of sendmail, these can optionally 
    be specified as say S=13/15, etc.  This would use ruleset 13 for envelope 
    sender addresses and ruleset 15 for header sender addresses.  The simple 
    definition S=10 is equivalent to S=10/10.

    If the mailer specific ruleset is omitted, or is defined as 0, this means 
    that there is no mailer-specific ruleset.  Ruleset 0 itself is never used 
    as a mailer-specific ruleset.
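
    As a rough illustration, a mailer definition line can be picked apart with 
    a few lines of Python.  This is only a sketch: parse_mailer is an invented 
    name, it assumes that no field value contains a comma, and the second 
    definition below (Mdemo) is hypothetical.  A ruleset number of 0 stands 
    for "no mailer-specific ruleset", as described above.

```python
def parse_mailer(line):
    """Parse 'Mname, K=V, K=V, ...' into a dict.

    S and R become (envelope, header) pairs, so that the IDA form
    S=13/15 and the plain form S=10 are handled uniformly.
    """
    name, _, rest = line[1:].partition(",")
    mailer = {"name": name.strip()}
    for field in rest.split(","):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        mailer[key] = value
    for key in ("S", "R"):
        env, _, hdr = mailer.get(key, "0").partition("/")
        mailer[key] = (int(env), int(hdr) if hdr else int(env))
    return mailer


print(parse_mailer("Mlocal, P=/bin/mail, F=MFlusr, S=10, R=12, A=mail -d $u"))
print(parse_mailer("Mdemo, P=/bin/true, S=13/15, A=true $u")["S"])
```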

8.  A MORE DETAILED LOOK AT SENDMAIL PROCESSING.

8a. Parsing the sender address.

    One of the first steps is to parse the envelope sender address.  This 
    follows the procedure of address parsing as described above.

    prescan()
    rewrite with ruleset 3.
    rewrite with ruleset 0.
    rewrite the user address portion with ruleset 2, the mailer specific 
    ruleset, and ruleset 4.
    process the output of ruleset 4 with cataddr()

    If the address proves unparseable a message is written to the log, and the 
    address 'Postmaster' is parsed in its place.  One effect is that should 
    the mail be undeliverable, and the sender address is unparseable, any 
    bounced mail will in this case be delivered to Postmaster.

    The mailer returned from ruleset 0 is called the receiving mailer, and is 
    examined for the presence of a C flag as discussed above in our 
    description of C-FLAG processing.

    If the receiving mailer is the local mailer, the sender is assumed local.  
    In that case the user address is looked up in the password file to find 
    the full name (for possible use on the 'From:' header) and the home 
    directory (in case the message should be stored in dead.letter).

8b. Defining $f.

    The sender address is processed with

    prescan()
    rulesets 3,1,4

    Search for an '@' in the address, and save a copy of the portion of the 
    tokenized address starting with the '@', in case needed for C-FLAG 
    processing.
    
    apply cataddr() to the output of ruleset 4, and save as the value of $f

    NOTE:  Even if the original address was determined to be unparseable in 
    step 8a, and replaced by Postmaster, it is still the original address and 
    not 'Postmaster' which is used for defining $f.  However there is one 
    anomaly here.  If the message cannot be delivered immediately but must be 
    queued, and if the original sender address was unparseable, the original 
    sender address is not saved in the queue file, so the sender will become 
    'Postmaster' in that case. 

8c. Building the recipient list.

    Each envelope recipient address is now added to an internal list of 
    recipients.  Before an address is added to the recipient list, it goes 
    through the following procedures: 

    prescan()
    rewrite with ruleset 3
    rewrite with ruleset 0
     The mailer and host returned are saved.  The user portion is further
    processed:
    rewrite with ruleset 2
    rewrite with the mailer specific ruleset (for the mailer returned by 
       ruleset 0)
    rewrite with ruleset 4.
    cataddr()

    Next the recipient list is searched to see if this address already 
    appears.  The search is based on a comparison of the mailer returned by 
    ruleset 0, the host returned by ruleset 0 (except the host is ignored if 
    the 'l' flag is set for the mailer), and the user name as output from 
    cataddr().  If the address is a duplicate it is not actually added to the 
    recipient list.
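
    The duplicate test can be sketched in Python as a comparison key.  The 
    names recipient_key and add_recipient, and the use of a set to remember 
    earlier entries, are choices made for this sketch and say nothing about 
    how sendmail stores its recipient list.

```python
def recipient_key(mailer_name, mailer_flags, host, user):
    """Build the key used to detect duplicate recipients.

    The host is ignored when the mailer has the 'l' flag set.
    """
    if "l" in mailer_flags:
        host = None
    return (mailer_name, host, user)


def add_recipient(recipients, seen, mailer_name, mailer_flags, host, user):
    """Append the recipient unless an identical one is already listed."""
    key = recipient_key(mailer_name, mailer_flags, host, user)
    if key in seen:
        return False
    seen.add(key)
    recipients.append(key)
    return True


recipients, seen = [], set()
print(add_recipient(recipients, seen, "local", "lsm", "hostA", "john"))
print(add_recipient(recipients, seen, "local", "lsm", "hostB", "john"))
```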

    Next the mailer is checked to see if it is the local mailer.  If so, the address 
    is looked up in the aliases database.  If there is an aliases entry the 
    current address is flagged QDONTSEND so that mail will not be sent to it, 
    and each entry in the alias expansion recursively goes through the same 
    process for adding to the recipient list.  An additional flag is set for 
    aliases to indicate they are indeed aliases. 
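
    A toy Python model of the recursive expansion is shown below.  It treats 
    every address as local and represents the aliases database as a plain 
    dictionary; the name expand_aliases and the sample aliases are invented, 
    and the special cases mentioned in the introduction are ignored.

```python
def expand_aliases(address, aliases, recipients=None):
    """Return a dict mapping each address seen to its 'do not send' flag.

    An address that has an alias entry is kept in the list but flagged
    (the analogue of QDONTSEND); its expansion is added recursively.
    An address already in the list is not added a second time.
    """
    if recipients is None:
        recipients = {}
    if address in recipients:
        return recipients
    targets = aliases.get(address)
    recipients[address] = targets is not None
    for target in targets or []:
        expand_aliases(target, aliases, recipients)
    return recipients


aliases = {"all": ["staff", "john"], "staff": ["john", "mary"]}
result = expand_aliases("all", aliases)
print(sorted(a for a, dont_send in result.items() if not dont_send))
```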

    Next, if the mailer is the local mailer, and if the QDONTSEND flag is not 
    set, there is a test to see if the user address begins with '|' after 
    removal of quotes.  If so, and if the address is from an alias expansion, 
    the mailer is changed to the 'prog' mailer, and the initial '|' is 
    removed. 

    Next, if the mailer is local, the name is looked up in /etc/passwd and the 
    home directory is searched for a '.forward' file which, if found, is 
    processed in much the same way as an aliases entry.

    If at any stage something goes wrong, the address is flagged as bad for a 
    later bounce message.

8d. Beginning the delivery phase.

    Once the recipient list is complete, sendmail is ready to attempt 
    delivery.  This involves running down the recipient list.  As an address 
    is selected for a delivery attempt, it is marked QDONTSEND which is 
    approximately the equivalent of deleting it from the recipient list.  This 
    is to ensure it will not be sent twice.

    Once an address is found for a delivery attempt, a check is made to see if 
    the 'm' flag is set in the mailer.  If so, sendmail will attempt to send 
    to as many recipients with the same mailer/host combination as possible in 
    a single operation.

    If delivery requires sending to several hosts, the next few steps will be 
    repeated several times.

8e. Determining the envelope sender for delivery.

    Start with the expansion of $f
    prescan()
    rewrite with rulesets 3, C-FLAG processing, 1, the mailer specific 
    ruleset, and 4.
    cataddr()
    The result is assigned to $g

    Note that, because of the way $f was originally determined, this means the 
    original incoming address has been processed by 3,1,4,3,1,mailer 
    specific,4 

8f. Determining the command. 

    The command (the P= and A= operands) from the mailer definition is now 
    expanded, with $h evaluating to the host.  If there are multiple 
    recipients, and $u is in the last argument, multiple such arguments are 
    created, one for each recipient for this delivery transaction.  If there 
    is no $u in the argument list, the SMTP protocol is used instead.
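
    The argument expansion can be sketched in Python as below.  Only $h and $u 
    are handled, arguments are split on white space, and a $u that appears 
    before the last argument is replaced by the first recipient; that last 
    point is an assumption of the sketch, as are the function name and the 
    uux template used in the example.  The recipient list is assumed to be 
    non-empty.

```python
def expand_argv(template, host, recipients):
    """Expand an A= template for one delivery transaction.

    Returns None when the template has no $u, which the text says
    means the SMTP protocol is used instead.
    """
    args = [a.replace("$h", host) for a in template.split()]
    if not any("$u" in a for a in args):
        return None
    *head, last = args
    head = [a.replace("$u", recipients[0]) for a in head]
    if "$u" in last:
        # One copy of the last argument per recipient.
        tail = [last.replace("$u", rcpt) for rcpt in recipients]
    else:
        tail = [last]
    return head + tail


print(expand_argv("mail -d $u", "localhost", ["john", "mary"]))
print(expand_argv("uux - -r $h!rmail ($u)", "uunet", ["a", "b"]))
print(expand_argv("IPC $h", "cs.niu.edu", ["john"]))
```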

    For SMTP use only, the recipient address is now processed by
    prescan(), ruleset 3, C-FLAG processing, 2, mailer specific, 4, cataddr(). 
    The IDA versions, however, do not do this additional rewriting step which 
    seems superfluous, and which could conceivably cause mailing loops because 
    of the C-FLAG processing. 

8g. Header rewriting.

    The headers are now sent to the mailer program as part of the message.  
    Before sending them they are subject to any rewriting.  If the required 
    'From:' header does not exist it is created from the definition in 
    'sendmail.cf'.  If the required 'To:' is missing an 'Apparently-To:' 
    header is created. 

    A special word on the 'From:' header.  If the incoming message has a 
    'From:' header whose contents are identical in all respects (except 
    leading and trailing white space) to the envelope sender, that 'From:' 
    header is deleted.  The assumption is that it will be recreated with a 
    chance of adding the full name.

8h. Header sender rewriting.

    The address is extracted from the 'From:' header and other similar headers 
    such as 'Reply-To:' and 'Resent-From:', carefully saving the comments for 
    later use.  The address is processed with
    prescan(), 3, C-FLAG processing, 1, Mailer specific, 4, cataddr().

    The IDA versions use ruleset 5 in place of ruleset 1.  This is part of the 
    IDA strategy of allowing a distinction between the formatting of headers 
    and the formatting of the envelope.

    If there was no "From:", or if it was deleted and must be recreated, the 
    usual definition of "From:" in most versions of 'sendmail.cf' begins with 
    $g, or with $q which is defined in terms of $g.  In that case, and 
    remembering how $f and then $g are determined, the address on the 'From:' 
    goes through the following steps: 

    incoming envelope sender
    prescan(), rulesets 3,1,4, cataddr()
    prescan(), rulesets 3, C-FLAG, 1, Mailer specific, 4, cataddr()
    prescan(), rulesets 3, C-FLAG, 1, Mailer specific, 4, cataddr()

    You will note the large amount of redundancy.  If designing rulesets you 
    must keep this in mind.  In particular you should be wary of approaches 
    which give different results depending on how many times the address is 
    passed through rulesets 3,1,Mailer-specific,4.  In the current IDA/NIU 
    rulesets, the $q variable is defined in terms of $f instead of $g, in 
    order to eliminate the most troublesome one of these extra rewrites, in 
    which a header address is rewritten with a mailer specific ruleset 
    intended for envelope addresses only. 

8i. Recipient header rewriting.

    Each recipient address on a "To:" or "Cc:" or "Apparently-To:" or 
    "Resent-To:" or "Resent-Cc:" header goes through the following steps: 

    prescan(), rulesets 3, C-FLAG, 2, Mailer specific, 4, cataddr().

    In the above, the IDA versions use ruleset 6 in place of ruleset 2.  Again 
    this is to allow header addresses to be formatted differently from 
    envelope addresses.
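
    The delivery-time sequences from sections 8e through 8i can be collected 
    into one small Python lookup.  This is merely a summary of the text in 
    code form; the function name and the string labels are invented.  The 
    'envelope-recipient' case refers to the SMTP-only rewrite of section 8f, 
    which the text says the IDA versions omit.

```python
def ruleset_sequence(kind, ida=False):
    """Return the rewriting steps applied at delivery time.

    kind is one of 'envelope-sender', 'envelope-recipient',
    'header-sender' or 'header-recipient'.
    """
    role, _, side = kind.partition("-")
    if role not in ("envelope", "header") or side not in ("sender", "recipient"):
        raise ValueError("unknown address kind: %r" % kind)
    if ida and kind == "envelope-recipient":
        return []                      # IDA skips this extra rewrite
    if side == "sender":
        middle = "5" if (ida and role == "header") else "1"
    else:
        middle = "6" if (ida and role == "header") else "2"
    return ["3", "C-FLAG", middle, "mailer-specific", "4"]


for kind in ("envelope-sender", "header-sender", "header-recipient"):
    print(kind, ruleset_sequence(kind), ruleset_sequence(kind, ida=True))
```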

9.  GENERAL COMMENTS ON DESIGNING RULESETS.

    If you plan on designing your own 'sendmail.cf', or modifying an existing 
    one to add more functionality, here are some things to keep in mind:

9a. Remember the C-FLAG.

    Because of the C-FLAG processing, it is desirable that every address which 
    contains a host name should contain an '@' by the end of ruleset 3, and 
    addresses without a hostname should not have one added in ruleset 3. 

    This also means that if you want to rewrite 'user' as 
    'user@your.full.domain' you should not do it in ruleset 3, but should do 
    it somewhat later. 

    This means that an address like 'uunet!seismo!harry' needs to be converted 
    to something like 'seismo!harry@uunet.UUCP' in ruleset 3, so as to ensure 
    that there is an '@' in the address and C-FLAG processing doesn't 
    incorrectly add another domain level. 

9b. Make major rewriting steps reversible.

    Ideally any address processed by ruleset 3 followed by ruleset 4 should 
    finish up in its original form.  Anything non-reversible done in ruleset 3 
    can never be fully compensated for later.  This is difficult to do in 
    practice, however.  In many versions of sendmail.cf, both 'uunet!john' and 
    'john@uunet.uucp' will be rewritten as 'john<@uunet.UUCP>' in ruleset 3, 
    so the original form cannot always be recovered by use of ruleset 4.  What 
    is probably more critical is that using rulesets 3,4,3,4 should yield the 
    same result as just using 3,4.  If your rulesets don't manage at least 
    this degree of consistency you are likely to run into major problems.
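
    The consistency requirement can be tested mechanically.  In the Python 
    sketch below, toy_ruleset3 and toy_ruleset4 are deliberately crude 
    stand-ins that work on plain strings; they are not real rulesets and only 
    mimic the 'john<@uunet.UUCP>' form mentioned above.

```python
def toy_ruleset3(addr):
    """Toy canonicalization: focus the host part in angle brackets."""
    if "<" in addr:
        return addr
    if "@" in addr:
        user, _, host = addr.rpartition("@")
        return "%s<@%s>" % (user, host)
    if "!" in addr:
        host, _, user = addr.partition("!")
        return "%s<@%s.UUCP>" % (user, host)
    return addr


def toy_ruleset4(addr):
    """Toy final rewrite: drop the angle brackets again."""
    return addr.replace("<", "").replace(">", "")


def rewrite_3_4(addr):
    return toy_ruleset4(toy_ruleset3(addr))


def is_idempotent(rewrite, addr):
    """True if applying the rewrite twice gives the same result as once."""
    once = rewrite(addr)
    return rewrite(once) == once


for addr in ("uunet!john", "john@uunet.UUCP", "uunet!seismo!harry", "john"):
    print(addr, "->", rewrite_3_4(addr), is_idempotent(rewrite_3_4, addr))
```

    With these toy rules 'uunet!john' is not recovered in its original form, 
    but 3,4,3,4 does give the same result as 3,4, which is the weaker 
    property the text asks for.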

    The IDA/NIU rulesets are pretty close to the ideal that ruleset 4 
    completely reverse ruleset 3.  But to achieve this they use an internal 
    form which only vaguely looks like the original address.  Thus in these 
    rulesets, ruleset 3 would rewrite 'uunet!john' as '<@uunet.UUCP>!john' 
    while it would rewrite 'john@uunet.UUCP' as '<@uunet.UUCP>,john'.  And, 
    of course, ruleset 4 is rather more complex also.  (This approach in 
    IDA/NIU is not 
    simply a matter of purism.  It has to do with the need to be able to 
    unambiguously merge two addresses in the pathalias file lookup.) 

9c. Allow for the fact that sender addresses may be rewritten multiple times.

    The habit of sendmail of reprocessing sender addresses can cause some 
    problems and result in incorrect addresses if you do not properly allow 
    for it.

9d. Be particularly cautious with how you handle envelope recipients. 

    Basically if you mess these up, the mail probably won't go anywhere, or 
    won't go where you want it.

    As an example, suppose I have a uucp neighbor 'uuhost'.  Suppose my 
    neighbor wants all mail to leave in the format 
    'uuhost!user@my.full.domain'.  Now it might be that I become a little too 
    vigorous in my changes, and I even rewrite the addresses that way in mail 
    sent to 'uuhost'.  If this happens to a header address the problem is not 
    very serious.  At the worst, when someone on 'uuhost' sends a reply, the 
    reply will be sent first to my system, then back to 'uuhost'.  But if I 
    make the same transformation to the envelope, I will cause a mailing loop.  
    In other words, if mail I see destined for 'uuhost!user' is sent to 
    'uuhost' with the user address of 'uuhost!user@my.full.domain', the mailer 
    on 'uuhost' is likely to just send it back.  Then the same thing happens 
    over and over again.


-- 
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
  Neil W. Rickert, Computer Science               <rickert@cs.niu.edu>
  Northern Illinois Univ.
  DeKalb, IL 60115                                   +1-815-753-6940