Return-Path: comp.mail.sendmail
From: rickert@mp.cs.niu.edu (Neil Rickert)
Date: Mon, 29 Apr 1991 14:40:19 GMT
Newsgroup: comp.mail.sendmail/2224
Message-Id: <1991Apr29.144019.22206@mp.cs.niu.edu>
Subject: Sendmail address rewrite sequence (was Re: From: , relaying and FAQ)

----------------------------

ADDRESS REWRITING IN SENDMAIL.

Prepared by Neil Rickert, Northern Illinois University, Apr 29, 1991.

The author of this document has no relation whatsoever with the developers of sendmail at Berkeley. Consequently any errors are in no way the fault of Berkeley.

No guarantee is made as to the accuracy of this document. Those who wish guaranteed accurate information must read the source code themselves.

Any comments or corrections are welcome.

-------------------------

This description is intended to be a more thorough description of address parsing than is contained in the usual manual. Some basic familiarity with the manual is assumed. In particular, we do not attempt to explain the rewriting which occurs in a single rule, or the process in which an address passes from one rule to the next.

The description contains a few oversimplifications, such as omitting mention of some special cases during internal processing of alias expansions. These simplifications are not directly related to the main issues covered.

I make no guarantees as to the accuracy of this document. It is based on experience with developing and debugging rulesets, and with reading parts of the code. It mostly reflects my understanding of the status of sendmail based on the 5.65+IDA versions but with some cross checks with the Berkeley 5.65 code.

1. A BRIEF OVERVIEW OF THE REWRITING PROCESS.

Each address is initially rewritten by ruleset 3. Thereafter the processing depends on whether this is a recipient address or a sender address. A sender address is processed by ruleset 1, then by the ruleset declared in the mailer, and finally by ruleset 4.
A recipient address, after processing by ruleset 3, is normally processed by ruleset 2, the ruleset declared in the mailer, and by ruleset 4.

The above description is, of course, a gross oversimplification. We shall fill in some of the details below.

2. BASIC ADDRESS PARSING.

In order to send a message, the recipient address must be parsed, so as to determine the transport mechanism and the next step in the chain of mail relays. This is handled in the function parseaddr(), which is in parseaddr.c.

The basic parsing strategy is as follows:

As with all rewriting, the address is first rewritten by ruleset 3. The result of this rewrite is then rewritten by ruleset 0. The purpose of ruleset 0 is to resolve the address into a triple: Mailer, Host, User_address. This resolution occurs when the left hand side matches a suitable rule, and the RHS is rewritten as:

    $# MAILER_NAME $@ HOST_NAME $: USER_ADDRESS

Apart from the exceptions noted below, every address must resolve into such a triple. If not, 'sendmail' will complain.

We should clarify, however, that the user address is allowed to contain a host component. Thus a UUCP address of john@uunet.UUCP might resolve to:

    $# UUCP $@ uunet $: john

while uunet!seismo!harry might resolve to

    $# UUCP $@ uunet $: seismo!harry

(In all the above examples the spacing is for readability. We will comment more on spacing when we discuss the function prescan() later in this document.)

It should also be clarified that the HOST_NAME as returned by ruleset 0 does not necessarily have anything to do with the domain component of the original address. It is the name of the host to which the message will next be sent, as known to the transmission software. For the tcp mailers, this is the fully qualified domain name. For other mailers it may be different. In our example above, although the fully qualified uucp name of uunet may be uunet.UUCP, the name known to the uucp software is just uunet, so that must be the host name returned from ruleset 0.
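To make the shape of a resolved address concrete, here is a minimal sketch in Python (it is not sendmail's own code) which splits a ruleset 0 result into its mailer, host and user parts. The function name parse_resolution is hypothetical, and the sketch assumes the three markers appear in the order shown above, with the host part treated as optional.

```python
import re

# Assumption: mailer and host names contain neither spaces nor '$'.
_RESOLVED = re.compile(
    r"\s*\$#\s*([^\s$]+)\s*"        # $# MAILER_NAME
    r"(?:\$@\s*([^\s$]+)\s*)?"      # $@ HOST_NAME (optional in this sketch)
    r"\$:\s*(.*?)\s*"               # $: USER_ADDRESS (rest of the string)
)


def parse_resolution(rhs):
    """Split '$# MAILER $@ HOST $: USER' into (mailer, host, user).

    host is None when no '$@' part is present.
    """
    m = _RESOLVED.fullmatch(rhs)
    if m is None:
        raise ValueError("address did not resolve to a mailer: %r" % rhs)
    return m.group(1), m.group(2), m.group(3)


if __name__ == "__main__":
    print(parse_resolution("$# UUCP $@ uunet $: seismo!harry"))
```

Note that the user part is returned untouched, so a host component left inside it (as in seismo!harry) survives.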
There are two exceptions to the above description: the local mailer and the ERROR mailer. Both of these must resolve to just MAILER_NAME and USER_ADDRESS, since a host does not make sense. (The IDA versions of 'sendmail' permit the use of a triple with the local mailer, but ignore the host.) In the case of the ERROR mailer, the USER_ADDRESS is actually the error message. Thus one could imagine a rewrite rule in ruleset 0 such as:

    R$+@annex1.$D    $#ERROR $:terminal servers don't accept mail

It is a basic principle of 'sendmail' that ruleset 0 must select a mailer, even if only the ERROR mailer. An address which resolves to the ERROR mailer is said to be 'unparseable' and normally leads to bounced mail.

3. DIFFERENT CLASSES OF ADDRESSES: ENVELOPES AND HEADERS.

'sendmail' must deal with addresses on the message envelope, and addresses on the headers inside the message. The distinction between envelope and header is quite important for an understanding of how 'sendmail' functions. However it can be confusing at first.

In basic terms, the "header recipients" are the addresses on the "To:" and "Cc:" header lines. (We could include the "Bcc:" header, but this is usually discarded.) The "header senders" are the addresses on the "From:" and the "Reply-To:" headers. Sometimes there may also be "Resent-To:", "Resent-Cc:" and "Resent-From:" headers with additional header addresses.

The "envelope recipients" are the addresses the message is really being sent to. They are often different from the "header recipients". The "envelope sender" is the real sender of the message, and is occasionally different from the "header senders".

To the novice it is often not obvious why there need be a distinction between header and envelope addresses. When a typical message originates, the envelope and header addresses are the same. However much can change during processing. A message might be sent to two recipients A and B on different machines, host-A and host-B.
Initially both addresses are in the envelope. However when the message is relayed to host-A, only recipient A remains in the envelope. Likewise the copy of the message sent to host-B will contain only B in the envelope. Roughly speaking, as a message is delivered to a recipient address, that address is deleted from the envelope. However all recipient addresses remain in the headers, as these document for the human reader to whom the message was sent.

As another example, you may have a '.forward' file in your home directory to forward your mail. The forwarding action applies only to the envelope recipients of a message, while the addresses on the header are not affected.

The distinction between header senders and envelope sender is sometimes harder to explain. We start with an example unrelated to computing. Suppose as a gift you purchase for a friend a subscription to "Scientific American". As part of the process you fill out a gift card which the publishers will send. When the gift card arrives, the card says it is from you. However the return address on the envelope is that of Scientific American. Essentially you were the header sender, but Scientific American was the envelope sender.

For a comparable example in computing, suppose you post an article in Usenet news. This article comes from you, and this is shown in the "From:" header. However at some other site, a copy of your article is mailed to someone without direct Usenet news access. The Usenet news system on the mailing host is the real sender, or envelope sender, while as the author of the article you remain the header sender.

In simple terms, the addresses in the headers are for humans to read, while the addresses in the envelope are for machines to read.

4. WHERE DO ENVELOPES COME FROM AND WHERE DO THEY GO.

When mail is sent or received over the network with SMTP, the envelope sender is transmitted first in the 'MAIL FROM:' SMTP command.
Then a sequence of 'RCPT TO:' commands transfers the envelope recipients.

When you send a normal message on your system, the Mail User Agent (MUA) invokes sendmail as:

    /usr/lib/sendmail recipient1 recipient2 recipient3 ...

Thus the envelope recipients are passed as parameters on the argument list. The envelope sender, in that case, is taken from the uid of the person sending the message. It is also possible to invoke sendmail as

    /usr/lib/sendmail -fsender recipient1 recipient2 recipient3 ...

and thereby give a separate envelope sender. However sendmail normally will ignore the '-fsender' operand unless given by a trusted user such as 'root' or 'uucp' or 'daemon'.

When 'sendmail' invokes /bin/mail for delivery to your mailbox, it uses a parameter list something like:

    /bin/mail -r sender -d recipient1 recipient2 recipient3 ...

Thus the envelope information is passed along to /bin/mail in a manner similar to the way 'sendmail' received it.

If mail is sent out through UUCP, the envelope sender is recorded on the unix "From " line which is the first line of the message, and the envelope recipients become the command operands to 'rmail' on the remote system.

5. TOKENIZING AND REBUILDING ADDRESSES. prescan() AND cataddr().

Before an address is processed by one of the rewrite rulesets, it must be tokenized. The rewriting rules only examine complete tokens and strings of tokens during their matching. After the rewriting is complete the tokenized address must be converted back to a character string.

The function prescan() does the tokenizing, while cataddr() converts a tokenized address back to a character string.

The output of prescan() is an array of strings, with one array entry for each token. The tokenizing is done based on the characters in $o, defined in the line beginning 'Do' in 'sendmail.cf'. There are, additionally, a few characters whose special handling is built into prescan(). In an address such as the following (from RFC822): Wilt .
(the Stilt) Chamberlain@NBA.US the parenthesized comment "(the Stilt)" is filtered out by prescan(). (However when such an address appears on a header, the comments are saved for reinsertion after the address has been rewritten.) prescan() also insists on properly balanced parentheses and properly balanced angle brackets <>. A string between "double quotes" becomes a single token. With one exception, an escaped character (such as \@) loses any special properties. Thus the address 'user@host' would ordinarily become the three tokens "user", "@", "host", while the address 'user\@host' becomes a single token.

The one exception to the backslash escaping is \!, which is simply converted to !. This special handling is an attempt to be somewhat forgiving to csh users who sometimes become overly zealous in escaping their bangs.

In tokenizing, special characters (those characters defined in $o) become single character special tokens. A string of ordinary characters becomes a single token. The space character, however, is never part of a token except when in a quoted string or when backslash escaped. The space character is a token separator. Thus the string

    AB

becomes one token "AB", while the string

    A B

becomes two tokens "A", "B". A space before or after a special character, however, is completely superfluous. Thus

    user @ host

is tokenized as "user","@","host", exactly as would be

    user@host

The function cataddr() sounds simple. Just concatenate all the tokens to form a new string. This is almost what it does. But a special case must be made where there are two consecutive ordinary tokens. If 'A B' is tokenized as "A","B", then just concatenating the strings would produce the incorrect 'AB'. Instead, when cataddr() discovers two consecutive ordinary tokens, it inserts between them the space substitute character defined on the 'OB' line of 'sendmail.cf'. In most versions of 'sendmail.cf' this character is the period '.'
leading to the effect that the original 'A B', tokenized to "A","B", is then untokenized to 'A.B'. If you don't like this, you can define the space substitute as a blank. If you do so, but forward the message to another 'sendmail', it will probably be converted to a period again. (One problem with defining the space substitute as a blank is that this can easily become invisible in 'sendmail.cf', and if it is accidentally deleted you might find you are accidentally defining the space substitute to be '\n' or '\0'.)

6. THE C-FLAG PROCESSING.

Certain mailers have the C flag set in their mailer definition. It affects processing of addresses as follows:

After an address has been rewritten by ruleset 3, a check is done to see if the address now contains an "@". If it already contains an "@", then C-FLAG processing has no effect. If there is no "@", the 'receiving mailer' is checked. (The determination of the receiving mailer is described below.) If the receiving mailer has the C-FLAG defined, and if the sender address in $f contains an '@', everything from the '@' onward in $f is appended to the end of the address as output by ruleset 3, and the modified address is then reprocessed by ruleset 3.

The actual check for the '@' in $f is done while the sender address is still tokenized, so there is no additional call to prescan() in the C-FLAG processing.

7. MAILER-SPECIFIC RULESETS.

A typical mailer definition looks something like the following:

    Mlocal, P=/bin/mail, F=MFlusr, S=10, R=12, A=mail -d $u

Here the S= and R= operands specify address rewrite rulesets to be used for mail sent by this mailer. The S= operand is for sender addresses, and the R= operand is for recipient addresses. We shall refer to these as the mailer-specific rulesets. In the IDA version of sendmail, these can optionally be specified as, say, S=13/15. This would use ruleset 13 for envelope sender addresses and ruleset 15 for header sender addresses. The simple definition S=10 is equivalent to S=10/10.
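As an illustration only, the following Python sketch pulls the fields out of a mailer definition line like the one above, and splits an S= or R= value into its envelope and header rulesets. The names parse_mailer_line and parse_ruleset_spec are hypothetical, and the sketch assumes that no field value contains a comma, which holds for the example but not necessarily for every real 'sendmail.cf'.

```python
def parse_mailer_line(line):
    """Split 'Mname, K=V, K=V, ...' into (name, {K: V}).

    Assumption: the line starts with 'M' and no value contains a comma.
    """
    name, *fields = [part.strip() for part in line[1:].split(",")]
    options = dict(field.split("=", 1) for field in fields)
    return name, options


def parse_ruleset_spec(spec):
    """Parse 'N' or the IDA form 'N/M' into (envelope, header) rulesets.

    A plain 'N' is treated as 'N/N'.
    """
    envelope, sep, header = spec.partition("/")
    if not sep:
        header = envelope
    return int(envelope), int(header)


if __name__ == "__main__":
    name, opts = parse_mailer_line(
        "Mlocal, P=/bin/mail, F=MFlusr, S=10, R=12, A=mail -d $u")
    print(name, parse_ruleset_spec(opts["S"]), parse_ruleset_spec(opts["R"]))
```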
If the mailer specific ruleset is omitted, or is defined as 0, this means that there is no mailer-specific ruleset. Ruleset 0 itself is never used as a mailer-specific ruleset.

8. A MORE DETAILED LOOK AT SENDMAIL PROCESSING.

8a. Parsing the sender address.

One of the first steps is to parse the envelope sender address. This follows the procedure of address parsing as described above:

    prescan()
    rewrite with ruleset 3.
    rewrite with ruleset 0.
    rewrite the user address portion with ruleset 2, the mailer specific ruleset, and ruleset 4.
    process the output of ruleset 4 with cataddr()

If the address proves unparseable, a message is written to the log, and the address 'Postmaster' is parsed in its place. One effect is that if the mail is undeliverable and the sender address is unparseable, any bounced mail will be delivered to Postmaster.

The mailer returned from ruleset 0 is called the receiving mailer, and is examined for the presence of a C flag as discussed above in our description of C-FLAG processing.

If the receiving mailer is the local mailer, the sender is assumed local. In that case the user address is looked up in the password file to find the full name (for possible use on the 'From:' header) and the home directory (in case the message should be stored in dead.letter).

8b. Defining $f.

    The sender address is processed with prescan(), rulesets 3,1,4.
    Search for an '@' in the address, and save a copy of the portion of the tokenized address starting with the '@', in case it is needed for C-FLAG processing.
    Apply cataddr() to the output of ruleset 4, and save as the value of $f.

NOTE: Even if the original address was determined to be unparseable in step 8a, and replaced by Postmaster, it is still the original address and not 'Postmaster' which is used for defining $f. However there is one anomaly here.
If the message cannot be delivered immediately but must be queued, and if the original sender address was unparseable, the original sender address is not saved in the queue file, so the sender will become 'Postmaster' in that case.

8c. Building the recipient list.

Each envelope recipient address is now added to an internal list of recipients. Before an address is added to the recipient list, it goes through the following procedures:

    prescan()
    rewrite with ruleset 3
    rewrite with ruleset 0

The mailer and host returned are saved. The user portion is further processed:

    rewrite with ruleset 2
    rewrite with the mailer specific ruleset (for the mailer returned by ruleset 0)
    rewrite with ruleset 4.
    cataddr()

Next the recipient list is searched to see if this address already appears. The search is based on a comparison of the mailer returned by ruleset 0, the host returned by ruleset 0 (except the host is ignored if the 'l' flag is set for the mailer), and the user name as output from cataddr(). If the address is a duplicate it is not actually added to the recipient list.

Next the mailer is checked to see if it is the local mailer. If so, the address is looked up in the aliases database. If there is an aliases entry, the current address is flagged QDONTSEND so that mail will not be sent to it, and each entry in the alias expansion recursively goes through the same process for adding to the recipient list. An additional flag is set for aliases to indicate they are indeed aliases.

Next, if the mailer is the local mailer, and if the QDONTSEND flag is not set, there is a test to see if the user address begins with '|' after removal of quotes. If so, and if the address is from an alias expansion, the mailer is changed to the 'prog' mailer, and the initial '|' is removed.

Next, if the mailer is local, the name is looked up in /etc/passwd, and the home directory is searched for a '.forward' file which, if found, is processed in much the same way as an aliases entry.
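The duplicate test described above can be pictured as a lookup on a three-part key. The Python sketch below is an illustration under simplifying assumptions, not sendmail's actual data structure: the names recipient_key and add_recipient are hypothetical, the mailer flags are passed as a plain string, and the comparison is exact.

```python
def recipient_key(mailer, host, user, mailer_flags=""):
    """Build the (mailer, host, user) key used to spot duplicates.

    The host is dropped from the key when the mailer has the 'l' flag.
    """
    if "l" in mailer_flags:
        host = None
    return (mailer, host, user)


def add_recipient(recipients, mailer, host, user, mailer_flags=""):
    """Add an address to the dict `recipients` unless already present.

    Returns True if the address was added, False if it was a duplicate.
    """
    key = recipient_key(mailer, host, user, mailer_flags)
    if key in recipients:
        return False
    recipients[key] = (mailer, host, user)
    return True


if __name__ == "__main__":
    rcpts = {}
    print(add_recipient(rcpts, "UUCP", "uunet", "john"))   # first time
    print(add_recipient(rcpts, "UUCP", "uunet", "john"))   # duplicate
```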
If at any stage something goes wrong, the address is flagged as bad for a later bounce message.

8d. Beginning the delivery phase.

Once the recipient list is complete, sendmail is ready to attempt delivery. This involves running down the recipient list. As an address is selected for a delivery attempt, it is marked QDONTSEND, which is approximately the equivalent of deleting it from the recipient list. This is to ensure it will not be sent twice.

Once an address is found for a delivery attempt, a check is made to see if the 'm' flag is set in the mailer. If so, sendmail will attempt to send to as many recipients with the same mailer/host combination as possible in a single operation.

If delivery requires sending to several hosts, the next few steps will be repeated several times.

8e. Determining the envelope sender for delivery.

    Start with the expansion of $f
    prescan()
    rewrite with rulesets 3, C-FLAG processing, 1, the mailer specific ruleset, and 4.
    cataddr()

The result is assigned to $g.

Note that, because of the way $f was originally determined, this means the original incoming address has been processed by 3,1,4,3,1,mailer specific,4.

8f. Determining the command.

The command (the P= and A= operands) from the mailer definition is now expanded, with $h evaluating to the host. If there are multiple recipients, and $u is in the last argument, multiple such arguments are created, one for each recipient for this delivery transaction.

If there is no $u in the argument list, the SMTP protocol is used instead.

For SMTP use only, the recipient address is now processed by prescan(), ruleset 3, C-FLAG processing, 2, mailer specific, 4, cataddr(). The IDA versions, however, do not do this additional rewriting step, which seems superfluous, and which could conceivably cause mailing loops because of the C-FLAG processing.

8g. Header rewriting.

The headers are now sent to the mailer program as part of the message. Before being sent, they are subject to rewriting.
If the required 'From:' header does not exist it is created from the definition in 'sendmail.cf'. If the required 'To:' is missing an 'Apparently-To:' header is created.

A special word on the 'From:' header. If the incoming message has a 'From:' header whose contents are identical in all respects (except leading and trailing white space) to the envelope sender, that 'From:' header is deleted. The assumption is that it will be recreated with a chance of adding the full name.

8h. Header sender rewriting.

The address is extracted from the 'From:' header and other similar headers such as 'Reply-To:' and 'Resent-From:', carefully saving the comments for later use.

The address is processed with prescan(), 3, C-FLAG processing, 1, Mailer specific, 4, cataddr().

The IDA versions use ruleset 5 in place of ruleset 1. This is part of the IDA strategy of allowing a distinction between the formatting of headers and the formatting of the envelope.

If there was no "From:", or if it was deleted and must be recreated, the usual definition of "From:" in most versions of 'sendmail.cf' begins with $g, or with $q which is defined in terms of $g. In that case, and remembering how $f and then $g are determined, the address on the 'From:' goes through the following steps:

    incoming envelope sender
    prescan(), rulesets 3,1,4, cataddr()
    prescan(), rulesets 3, C-FLAG, 1, Mailer specific, 4, cataddr()
    prescan(), rulesets 3, C-FLAG, 1, Mailer specific, 4, cataddr()

You will note the large amount of redundancy. If designing rulesets you must keep this in mind. In particular you should be wary of approaches which give different results depending on how many times the address is passed through rulesets 3,1,Mailer-specific,4.

In the current IDA/NIU rulesets, the $q variable is defined in terms of $f instead of $g, in order to eliminate the most troublesome one of these extra rewrites, in which a header address is rewritten with a mailer specific ruleset intended for envelope addresses only.

8i.
Recipient header rewriting.

Each recipient address on a "To:" or "Cc:" or "Apparently-To:" or "Resent-To:" or "Resent-Cc:" header goes through the following steps:

    prescan(), rulesets 3, C-FLAG, 2, Mailer specific, 4, cataddr().

In the above, the IDA versions use ruleset 6 in place of ruleset 2. Again this is to allow header addresses to be formatted differently from envelope addresses.

9. GENERAL COMMENTS ON DESIGNING RULESETS.

If you plan on designing your own 'sendmail.cf', or modifying an existing one to add more functionality, here are some things to keep in mind:

9a. Remember the C-FLAG.

Because of the C-FLAG processing, it is desirable that every address which contains a host name should contain an '@' by the end of ruleset 3, and addresses without a host name should not have one added in ruleset 3.

This also means that if you want to rewrite 'user' as 'user@your.full.domain' you should not do it in ruleset 3, but should do it somewhat later.

This means that an address like 'uunet!seismo!harry' needs to be converted to something like 'seismo!harry@uunet.UUCP' in ruleset 3, so as to ensure that there is an '@' in the address and C-FLAG processing doesn't incorrectly add another domain level.

9b. Make major rewriting steps reversible.

Ideally any address processed by ruleset 3 followed by ruleset 4 should finish up in its original form. Anything non-reversible done in ruleset 3 can never be fully compensated for later.

This is difficult to do in practice, however. In many versions of sendmail.cf, both 'uunet!john' and 'john@uunet.uucp' will be rewritten as 'john<@uunet.UUCP>' in ruleset 3, so the original form cannot always be recovered by use of ruleset 4.

What is probably more critical is that using rulesets 3,4,3,4 should yield the same result as just using 3,4. If your rulesets don't manage at least this degree of consistency you are likely to run into major problems.
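The 3,4,3,4 consistency requirement can be checked mechanically. The Python sketch below is only a toy: toy_ruleset3 and toy_ruleset4 are hypothetical stand-ins which mimic the 'john<@uunet.UUCP>' style of rewriting mentioned above, not real sendmail rules, and is_stable simply tests whether applying a rewrite twice gives the same result as applying it once.

```python
import re


def toy_ruleset3(addr):
    """Toy canonicalizer: move the host part into 'user<@host>' form."""
    if "<@" in addr:                       # already in internal form
        return addr
    m = re.fullmatch(r"(.+)@([^@]+)", addr)
    if m:                                  # user@host -> user<@host>
        return "%s<@%s>" % (m.group(1), m.group(2))
    m = re.fullmatch(r"([^!]+)!(.+)", addr)
    if m:                                  # host!user -> user<@host.UUCP>
        return "%s<@%s.UUCP>" % (m.group(2), m.group(1))
    return addr                            # no host part: leave alone


def toy_ruleset4(addr):
    """Toy finalizer: strip the angle brackets added by toy_ruleset3."""
    return addr.replace("<@", "@").replace(">", "")


def once(addr):
    """One pass through the toy rulesets 3 then 4."""
    return toy_ruleset4(toy_ruleset3(addr))


def is_stable(rewrite, addr):
    """True if a second pass through `rewrite` changes nothing."""
    first = rewrite(addr)
    return rewrite(first) == first


if __name__ == "__main__":
    for a in ("uunet!john", "john@uunet.UUCP", "uunet!seismo!harry", "john"):
        print(a, "->", once(a), "stable:", is_stable(once, a))
```

Running such a check over a list of sample addresses is a cheap way to catch a rewrite that keeps adding something on every pass.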
The IDA/NIU rulesets are pretty close to the ideal that ruleset 4 completely reverses ruleset 3. But to achieve this they use an internal form which only vaguely looks like the original address. Thus in these rulesets, ruleset 3 would rewrite 'uunet!john' as '<@uunet.UUCP>!john' while it would rewrite 'john@uunet.UUCP' as '<@uunet.UUCP>,john'. And, of course, ruleset 4 is rather more complex also. (This approach in IDA/NIU is not simply a matter of purism. It has to do with the need to be able to unambiguously merge two addresses in the pathalias file lookup.)

9c. Allow for the fact that sender addresses may be rewritten multiple times.

The habit of sendmail of reprocessing sender addresses can cause some problems and result in incorrect addresses if you do not properly allow for it.

9d. Be particularly cautious with how you handle envelope recipients.

Basically if you mess these up, the mail probably won't go anywhere, or won't go where you want it.

As an example, suppose I have a uucp neighbor 'uuhost'. Suppose my neighbor wants all mail to leave in the format 'uuhost!user@my.full.domain'. Now it might be that I become a little too vigorous in my changes, and I even rewrite the addresses that way in mail sent to 'uuhost'. If this happens to a header address the problem is not very serious. At the worst, when someone on 'uuhost' sends a reply, the reply will be sent first to my system, then back to 'uuhost'. But if I make the same transformation to the envelope, I will cause a mailing loop. In other words, if mail I see destined for 'uuhost!user' is sent to 'uuhost' with the user address of 'uuhost!user@my.full.domain', the mailer on 'uuhost' is likely to just send it back. Then the same thing happens over and over again.

--
=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=*=
Neil W. Rickert, Computer Science
Northern Illinois Univ.
DeKalb, IL 60115
+1-815-753-6940