32. Retry configuration

Chapter 32 - Retry configuration

The “retry” section of the runtime configuration file contains a list of retry rules that control how often Exim tries to deliver messages that cannot be delivered at the first attempt. If there are no retry rules (the section is empty or not present), there are no retries. In this situation, temporary errors are treated as permanent. The default configuration contains a single, general-purpose retry rule (see section 7.5). The -brt command line option can be used to test which retry rule will be used for a given address, domain and error.

The most common cause of retries is temporary failure to deliver to a remote host because the host is down, or inaccessible because of a network problem. Exim’s retry processing in this case is applied on a per-host (strictly, per IP address) basis, not on a per-message basis. Thus, if one message has recently been delayed, delivery of a new message to the same host is not immediately tried, but waits for the host’s retry time to arrive. If the retry_defer log selector is set, the message “retry time not reached” is written to the main log whenever a delivery is skipped for this reason. Section 45.2 contains more details of the handling of errors during remote deliveries.

Retry processing applies to routing as well as to delivering, except as covered in the next paragraph. The retry rules do not distinguish between these actions. It is not possible, for example, to specify different behaviour for failures to route the domain snark.fict.example and failures to deliver to the host snark.fict.example. I didn’t think anyone would ever need this added complication, so did not implement it. However, although they share the same retry rule, the actual retry times for routing and transporting a given domain are maintained independently.

When a delivery is not part of a queue run (typically an immediate delivery on receipt of a message), the routers are always run, and local deliveries are always attempted, even if retry times are set for them. This makes for better behaviour if one particular message is causing problems (for example, causing quota overflow, or provoking an error in a filter file). If such a delivery suffers a temporary failure, the retry data is updated as normal, and subsequent delivery attempts from queue runs occur only when the retry time for the local address is reached.

1. Changing retry rules

If you change the retry rules in your configuration, you should consider whether or not to delete the retry data that is stored in Exim’s spool area in files with names like db/retry. Deleting any of Exim’s hints files is always safe; that is why they are called “hints”.

The hints retry data contains suggested retry times based on the previous rules. In the case of a long-running problem with a remote host, it might record the fact that the host has timed out. If your new rules increase the timeout time for such a host, you should definitely remove the old retry data and let Exim recreate it, based on the new rules. Otherwise Exim might bounce messages that it should now be retaining.

2. Format of retry rules

Each retry rule occupies one line and consists of three or four parts, separated by white space: a pattern, an error name, an optional list of sender addresses, and a list of retry parameters. The pattern and sender lists must be enclosed in double quotes if they contain white space. The rules are searched in order until one is found where the pattern, error name, and sender list (if present) match the failing host or address, the error that occurred, and the message’s sender, respectively.

The pattern is any single item that may appear in an address list (see section 10.19). It is in fact processed as a one-item address list, which means that it is expanded before being tested against the address that has been delayed. A negated address list item is permitted. Address list processing treats a plain domain name as if it were preceded by “*@”, which makes it possible for many retry rules to start with just a domain. For example,

lookingglass.fict.example        *  F,24h,30m;

provides a rule for any address in the lookingglass.fict.example domain, whereas

alice@lookingglass.fict.example  *  F,24h,30m;

applies only to temporary failures involving the local part alice. In practice, almost all rules start with a domain name pattern without a local part.

Warning: If you use a regular expression in a routing rule pattern, it must match a complete address, not just a domain, because that is how regular expressions work in address lists.

^\Nxyz\d+\.abc\.example$\N        *  G,1h,10m,2     Wrong
^\N[^@]+@xyz\d+\.abc\.example$\N  *  G,1h,10m,2     Right

3. Choosing which retry rule to use for address errors

When Exim is looking for a retry rule after a routing attempt has failed (for example, after a DNS timeout), each line in the retry configuration is tested against the complete address only if retry_use_local_part is set for the router. Otherwise, only the domain is used, except when matching against a regular expression, when the local part of the address is replaced with “*”. A domain on its own can match a domain pattern, or a pattern that starts with “*@”. By default, retry_use_local_part is true for routers where check_local_user is true, and false for other routers.

Similarly, when Exim is looking for a retry rule after a local delivery has failed (for example, after a mailbox full error), each line in the retry configuration is tested against the complete address only if retry_use_local_part is set for the transport (it defaults true for all local transports).

However, when Exim is looking for a retry rule after a remote delivery attempt suffers an address error (a 4xx SMTP response for a recipient address), the whole address is always used as the key when searching the retry rules. The rule that is found is used to create a retry time for the combination of the failing address and the message’s sender. It is the combination of sender and recipient that is delayed in subsequent queue runs until its retry time is reached. You can delay the recipient without regard to the sender by setting address_retry_include_sender false in the smtp transport but this can lead to problems with servers that regularly issue 4xx responses to RCPT commands.

4. Choosing which retry rule to use for host and message errors

For a temporary error that is not related to an individual address (for example, a connection timeout), each line in the retry configuration is checked twice. First, the name of the remote host is used as a domain name (preceded by “*@” when matching a regular expression). If this does not match the line, the domain from the email address is tried in a similar fashion. For example, suppose the MX records for a.b.c.example are

a.b.c.example  MX  5  x.y.z.example
               MX  6  p.q.r.example
               MX  7  m.n.o.example

and the retry rules are

p.q.r.example    *      F,24h,30m;
a.b.c.example    *      F,4d,45m;

and a delivery to the host x.y.z.example suffers a connection failure. The first rule matches neither the host nor the domain, so Exim looks at the second rule. This does not match the host, but it does match the domain, so it is used to calculate the retry time for the host x.y.z.example. Meanwhile, Exim tries to deliver to p.q.r.example. If this also suffers a host error, the first retry rule is used, because it matches the host.

In other words, temporary failures to deliver to host p.q.r.example use the first rule to determine retry times, but for all the other hosts for the domain a.b.c.example, the second rule is used. The second rule is also used if routing to a.b.c.example suffers a temporary failure.

Note: The host name is used when matching the patterns, not its IP address. However, if a message is routed directly to an IP address without the use of a host name, for example, if a manualroute router contains a setting such as:

route_list = *.a.example  192.168.34.23

then the “host name” that is used when searching for a retry rule is the textual form of the IP address.

5. Retry rules for specific errors

The second field in a retry rule is the name of a particular error, or an asterisk, which matches any error. The errors that can be tested for are:

auth_failed: Authentication failed when trying to send to a host in the hosts_require_auth list in an smtp transport.
data_4xx: A 4xx error was received for an outgoing DATA command, either immediately after the command, or after sending the message’s data.
mail_4xx: A 4xx error was received for an outgoing MAIL command.
rcpt_4xx: A 4xx error was received for an outgoing RCPT command.

For the three 4xx errors, either the first or both of the x’s can be given as specific digits, for example: mail_45x or rcpt_436. For example, to recognize 452 errors given to RCPT commands for addresses in a certain domain, and have retries every ten minutes with a one-hour timeout, you could set up a retry rule of this form:

the.domain.name  rcpt_452   F,1h,10m

These errors apply to both outgoing SMTP (the smtp transport) and outgoing LMTP (either the lmtp transport, or the smtp transport in LMTP mode).

lost_connection: A server unexpectedly closed the SMTP connection. There may, of course, legitimate reasons for this (host died, network died), but if it repeats a lot for the same host, it indicates something odd.
refused_MX: A connection to a host obtained from an MX record was refused.
refused_A: A connection to a host not obtained from an MX record was refused.
refused: A connection was refused.
timeout_connect_MX: A connection attempt to a host obtained from an MX record timed out.
timeout_connect_A: A connection attempt to a host not obtained from an MX record timed out.
timeout_connect: A connection attempt timed out.
timeout_MX: There was a timeout while connecting or during an SMTP session with a host obtained from an MX record.
timeout_A: There was a timeout while connecting or during an SMTP session with a host not obtained from an MX record.
timeout: There was a timeout while connecting or during an SMTP session.
tls_required: The server was required to use TLS (it matched hosts_require_tls in the smtp transport), but either did not offer TLS, or it responded with 4xx to STARTTLS, or there was a problem setting up the TLS connection.
quota: A mailbox quota was exceeded in a local delivery by the appendfile transport.
quota_<time>: A mailbox quota was exceeded in a local delivery by the appendfile transport, and the mailbox has not been accessed for <time>. For example, quota_4d applies to a quota error when the mailbox has not been accessed for four days.

The idea of quota_<time> is to make it possible to have shorter timeouts when the mailbox is full and is not being read by its owner. Ideally, it should be based on the last time that the user accessed the mailbox. However, it is not always possible to determine this. Exim uses the following heuristic rules:

If the mailbox is a single file, the time of last access (the “atime”) is used. As no new messages are being delivered (because the mailbox is over quota), Exim does not access the file, so this is the time of last user access.
For a maildir delivery, the time of last modification of the new subdirectory is used. As the mailbox is over quota, no new files are created in the new subdirectory, because no new messages are being delivered. Any change to the new subdirectory is therefore assumed to be the result of an MUA moving a new message to the cur directory when it is first read. The time that is used is therefore the last time that the user read a new message.
For other kinds of multi-file mailbox, the time of last access cannot be obtained, so a retry rule that uses this type of error field is never matched.

The quota errors apply both to system-enforced quotas and to Exim’s own quota mechanism in the appendfile transport. The quota error also applies when a local delivery is deferred because a partition is full (the ENOSPC error).

6. Retry rules for specified senders

You can specify retry rules that apply only when the failing message has a specific sender. In particular, this can be used to define retry rules that apply only to bounce messages. The third item in a retry rule can be of this form:

senders=<address list>

The retry timings themselves are then the fourth item. For example:

*   rcpt_4xx   senders=:   F,1h,30m

matches recipient 4xx errors for bounce messages sent to any address at any host. If the address list contains white space, it must be enclosed in quotes. For example:

a.domain  rcpt_452  senders="xb.dom : yc.dom"  G,8h,10m,1.5

Warning: This facility can be unhelpful if it is used for host errors (which do not depend on the recipient). The reason is that the sender is used only to match the retry rule. Once the rule has been found for a host error, its contents are used to set a retry time for the host, and this will apply to all messages, not just those with specific senders.

When testing retry rules using -brt, you can supply a sender using the -f command line option, like this:

exim -f "" -brt user@dom.ain

If you do not set -f with -brt, a retry rule that contains a senders list is never matched.

7. Retry parameters

The third (or fourth, if a senders list is present) field in a retry rule is a sequence of retry parameter sets, separated by semicolons. Each set consists of

<letter>,<cutoff time>,<arguments>

The letter identifies the algorithm for computing a new retry time; the cutoff time is the time beyond which this algorithm no longer applies, and the arguments vary the algorithm’s action. The cutoff time is measured from the time that the first failure for the domain (combined with the local part if relevant) was detected, not from the time the message was received.

The available algorithms are:

F: retry at fixed intervals. There is a single time parameter specifying the interval.
G: retry at geometrically increasing intervals. The first argument specifies a starting value for the interval, and the second a multiplier, which is used to increase the size of the interval at each retry.
H: retry at randomized intervals. The arguments are as for G. For each retry, the previous interval is multiplied by the factor in order to get a maximum for the next interval. The minimum interval is the first argument of the parameter, and an actual interval is chosen randomly between them. Such a rule has been found to be helpful in cluster configurations when all the members of the cluster restart at once, and may therefore synchronize their queue processing times.

When computing the next retry time, the algorithm definitions are scanned in order until one whose cutoff time has not yet passed is reached. This is then used to compute a new retry time that is later than the current time. In the case of fixed interval retries, this simply means adding the interval to the current time. For geometrically increasing intervals, retry intervals are computed from the rule’s parameters until one that is greater than the previous interval is found. The main configuration variable retry_interval_max limits the maximum interval between retries. It cannot be set greater than 24h, which is its default value.

A single remote domain may have a number of hosts associated with it, and each host may have more than one IP address. Retry algorithms are selected on the basis of the domain name, but are applied to each IP address independently. If, for example, a host has two IP addresses and one is unusable, Exim will generate retry times for it and will not try to use it until its next retry time comes. Thus the good IP address is likely to be tried first most of the time.

Retry times are hints rather than promises. Exim does not make any attempt to run deliveries exactly at the computed times. Instead, a queue runner process starts delivery processes for delayed messages periodically, and these attempt new deliveries only for those addresses that have passed their next retry time. If a new message arrives for a deferred address, an immediate delivery attempt occurs only if the address has passed its retry time. In the absence of new messages, the minimum time between retries is the interval between queue runner processes. There is not much point in setting retry times of five minutes if your queue runners happen only once an hour, unless there are a significant number of incoming messages (which might be the case on a system that is sending everything to a smart host, for example).

The data in the retry hints database can be inspected by using the exim_dumpdb or exim_fixdb utility programs (see chapter 50). The latter utility can also be used to change the data. The exinext utility script can be used to find out what the next retry times are for the hosts associated with a particular mail domain, and also for local deliveries that have been deferred.

8. Retry rule examples

Here are some example retry rules:

alice@wonderland.fict.example quota_5d  F,7d,3h
wonderland.fict.example       quota_5d
wonderland.fict.example       *         F,1h,15m; G,2d,1h,2;
lookingglass.fict.example     *         F,24h,30m;
*                 refused_A   F,2h,20m;
*                 *           F,2h,15m; G,16h,1h,1.5; F,5d,8h

The first rule sets up special handling for mail to alice@wonderland.fict.example when there is an over-quota error and the mailbox has not been read for at least 5 days. Retries continue every three hours for 7 days. The second rule handles over-quota errors for all other local parts at wonderland.fict.example; the absence of a local part has the same effect as supplying “*@”. As no retry algorithms are supplied, messages that fail are bounced immediately if the mailbox has not been read for at least 5 days.

The third rule handles all other errors at wonderland.fict.example; retries happen every 15 minutes for an hour, then with geometrically increasing intervals until two days have passed since a delivery first failed. After the first hour there is a delay of one hour, then two hours, then four hours, and so on (this is a rather extreme example).

The fourth rule controls retries for the domain lookingglass.fict.example. They happen every 30 minutes for 24 hours only. The remaining two rules handle all other domains, with special action for connection refusal from hosts that were not obtained from an MX record.

The final rule in a retry configuration should always have asterisks in the first two fields so as to provide a general catch-all for any addresses that do not have their own special handling. This example tries every 15 minutes for 2 hours, then with intervals starting at one hour and increasing by a factor of 1.5 up to 16 hours, then every 8 hours up to 5 days.

9. Timeout of retry data

Exim timestamps the data that it writes to its retry hints database. When it consults the data during a delivery it ignores any that is older than the value set in retry_data_expire (default 7 days). If, for example, a host hasn’t been tried for 7 days, Exim will try to deliver to it immediately a message arrives, and if that fails, it will calculate a retry time as if it were failing for the first time.

This improves the behaviour for messages routed to rarely-used hosts such as MX backups. If such a host was down at one time, and happens to be down again when Exim tries a month later, using the old retry data would imply that it had been down all the time, which is not a justified assumption.

If a host really is permanently dead, this behaviour causes a burst of retries every now and again, but only if messages routed to it are rare. If there is a message at least once every 7 days the retry data never expires.

10. Long-term failures

Special processing happens when an email address has been failing for so long that the cutoff time for the last algorithm is reached. For example, using the default retry rule:

* * F,2h,15m; G,16h,1h,1.5; F,4d,6h

the cutoff time is four days. Reaching the retry cutoff is independent of how long any specific message has been failing; it is the length of continuous failure for the recipient address that counts.

When the cutoff time is reached for a local delivery, or for all the IP addresses associated with a remote delivery, a subsequent delivery failure causes Exim to give up on the address, and a bounce message is generated. In order to cater for new messages that use the failing address, a next retry time is still computed from the final algorithm, and is used as follows:

For local deliveries, one delivery attempt is always made for any subsequent messages. If this delivery fails, the address fails immediately. The post-cutoff retry time is not used.

If the delivery is remote, there are two possibilities, controlled by the delay_after_cutoff option of the smtp transport. The option is true by default. Until the post-cutoff retry time for one of the IP addresses is reached, the failing email address is bounced immediately, without a delivery attempt taking place. After that time, one new delivery attempt is made to those IP addresses that are past their retry times, and if that still fails, the address is bounced and new retry times are computed.

In other words, when all the hosts for a given email address have been failing for a long time, Exim bounces rather then defers until one of the hosts’ retry times is reached. Then it tries once, and bounces if that attempt fails. This behaviour ensures that few resources are wasted in repeatedly trying to deliver to a broken destination, but if the host does recover, Exim will eventually notice.

If delay_after_cutoff is set false, Exim behaves differently. If all IP addresses are past their final cutoff time, Exim tries to deliver to those IP addresses that have not been tried since the message arrived. If there are no suitable IP addresses, or if they all fail, the address is bounced. In other words, it does not delay when a new message arrives, but tries the expired addresses immediately, unless they have been tried since the message arrived. If there is a continuous stream of messages for the failing domains, setting delay_after_cutoff false means that there will be many more attempts to deliver to permanently failing IP addresses than when delay_after_cutoff is true.

11. Deliveries that work intermittently

Some additional logic is needed to cope with cases where a host is intermittently available, or when a message has some attribute that prevents its delivery when others to the same address get through. In this situation, because some messages are successfully delivered, the “retry clock” for the host or address keeps getting reset by the successful deliveries, and so failing messages remain on the queue for ever because the cutoff time is never reached.

Two exceptional actions are applied to prevent this happening. The first applies to errors that are related to a message rather than a remote host. Section 45.2 has a discussion of the different kinds of error; examples of message-related errors are 4xx responses to MAIL or DATA commands, and quota failures. For this type of error, if a message’s arrival time is earlier than the “first failed” time for the error, the earlier time is used when scanning the retry rules to decide when to try next and when to time out the address.

The exceptional second action applies in all cases. If a message has been on the queue for longer than the cutoff time of any applicable retry rule for a given address, a delivery is attempted for that address, even if it is not yet time, and if this delivery fails, the address is timed out. A new retry time is not computed in this case, so that other messages for the same address are considered immediately.

<-previous next->

Exim Internet Mailer