- The server is dying (core dump or DrWatson), what should I do?
(or, What should I send to DMail Support?)
If one of the DMail servers (DPOP, DSMTP or DList) is dying, it will be evident
in a number of ways:
- users cannot connect to the server
- the dwatch resurrector may be emailing you as the system administrator.
- you may get 'core' files appearing in the server directories, e.g. /usr/local/dmail
(UNIX platforms only)
- you may get a DrWatson Window popping up
Note: dwatch is supposed to restart the servers when this happens, but by default it only
does this 5 times and then gives up watching that server.
When one of the servers is dying, we at DMail support will of course want to know about it
because it means that there is a serious bug in our software.
See the next FAQ for suggestions on what to send DMail Support...
- What to Send DMail Support:
Here is a list of the things that it might be appropriate to send us. But
please don't send us a huge email with lots of large attachments; just pick the best
information that you have. Sending us your config file and a log or back trace is usually
sufficient. Don't forget to tell us your platform and the version you are using.
- Your dmail.conf file - almost always send us this
- The log file (on debug log level if possible, and maybe with log_data true)
- A 'ded' file from the dwatch directory
- A DrWatson log file, e.g. \winnt\drwtsn32.log
- A back trace from a core dump (don't send a core file)
- A trace.log file from the dwatch_path directory (check that the date is valid)
And email those to dmail-support@netwinsite.com
Here are some pointers on gathering the above information.
- Set your logging level to debug as soon as you are suspicious that something is going
wrong. In order to do this, edit your dmail.conf file,
/etc/dmail.conf or \winnt\system32\dmail.conf
so that the setting log_level looks like this:
log_level debug
save the file and then reload both DPOP and DSMTP
- If the bug is something to do with the TCPIP connections on DSMTP, you may want to set
log_data true
so that DSMTP logs all TCPIP connections.
(or 'log_data some' on versions 2.7 and above, so that your log does not end up filled
with attachment information)
- Send us the relevant log file, e.g. dsmtp.log which you will find in the log_path directory.
- Send us the relevant 'ded' file, e.g. d_1dsmtp.ded from the dwatch_path directory. These
are the log files as copied by dwatch when it noticed that the server had died. If a server
has crashed a number of times then a couple of these are useful to see if the last thing in
the log is the same each time - i.e. they can answer the question, is the server dying on
the same thing each time?
- The most useful thing is a back trace. This shows us which function within our program
was being run when it died.
Getting a back trace on NT:
DrWatson will create a back trace and put it in the file, \winnt\drwtsn32.log, if it
notices any program dying. NB: It may ask you whether you want a log to be created, which you
should make it do.
DrWatson should be on by default, but you can turn it on in the DMAdmin utility. Click
on, Config Dwatch, then select any server and click
on the 'Set DrWatson as debugger' button in the pop up window.
NB: If the DrWatson pop up box comes up and waits for you to click OK, then dwatch will
not notice that the server has died and so will not restart it until you click on the OK button.
So click 'don't popup window when any program dies'; DrWatson will then be set so that it
automatically creates the log file for you and then closes the dying program. This allows
DWatch to restart the server, but you still get the log.
Getting a back trace on UNIX based platforms
Hopefully if one of our programs dies you will find a file named 'core' (or core.program on
some platforms) in the same directory that the program is running in. So look in the following
default server path directories,
/usr/local/dmail for DSMTP and DPOP
/usr/local/dmail/dlist for DList
/usr/local/dmail/dwatch for dwatch itself
PLEASE do not send us the core file. Valid information can only be read from it by analysing
it on the machine on which it was created.
So, in order to analyse the core file and get the back trace, here are a couple of common examples:
Most Boxes (using DSMTP as an example):
1. cd to the program directory,
cd /usr/local/dmail
2. run gdb with arg1 being the program (the executable) and arg2 being the name of the core file,
gdb /usr/local/dmail/dsmtp /usr/local/dmail/core
3. now that gdb should be running enter,
bt
this should display a back trace. Send us a cut and paste of the whole gdb session rather
than just the back trace bit.
4. enter quit to close gdb
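If you would rather capture the whole gdb session to a file in one go, the steps above can be sketched as a small script. This is only a sketch under assumptions: it relies on gdb's standard -batch and -ex options, and the function names are ours for illustration, not part of DMail.

```python
import subprocess


def build_gdb_command(program, core):
    """Return a gdb command line that prints a back trace and then exits."""
    # -batch makes gdb quit once the commands given with -ex have run,
    # so there is no need to type 'bt' and 'quit' by hand.
    return ["gdb", "-batch", "-ex", "bt", program, core]


def capture_back_trace(program, core):
    """Run gdb and return everything it printed (stdout and stderr together)."""
    result = subprocess.run(
        build_gdb_command(program, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

For example, capture_back_trace("/usr/local/dmail/dsmtp", "/usr/local/dmail/core") returns the text that you would otherwise cut and paste from the gdb session.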
On AIX:
Same as above but use 'dbx' instead of gdb. You can also use the '-a pid' option to
attach to a running process.
On Solaris:
Same as above. Most customers seem to be able to install 'dbx' pretty easily but it is
also quite common to have 'adb', which has a '-c' option that may be the one to use.
On some platforms we had forgotten the compile flag, -g, in versions before 2.8. So
the back traces will be useless, e.g. a message like, 'no symbols found' will appear.
Sometimes it is useful to send us a truss of the program, as you can run this while the
program is still running, (truss -p pid). Note that
this only shows us the system calls (like disk access) that the server makes (as far
as we know - someone tell us if we are missing something :-) ). So it is not as
good as a back trace.
- Send us a trace.log file if the death is in DSMTP. This is a very basic back trace
which DSMTP generates when it dies, but it is not nearly as good as a real back trace. DSMTP
puts this file in the dwatch_path directory, usually, /usr/local/dmail/dwatch or \dmail\dwatch.
Note: you should delete the trace.log file as soon as you have copied it somewhere else, as DSMTP
will not always overwrite it if the death happens again.
- Lastly, check dates on files and look inside them to see that they contain information from
the time of the crash
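The two checks suggested above (are the file dates from the time of the crash, and do several 'ded' files end on the same thing?) can be sketched as follows. This is a rough illustration only: the function names are ours, and because log lines may carry a time stamp it is probably best to compare the printed last lines by eye rather than expect them to be identical.

```python
import os
import time


def last_nonblank_line(text):
    """Return the last non-blank line of a log's text, or '' if there is none."""
    for line in reversed(text.splitlines()):
        if line.strip():
            return line.strip()
    return ""


def summarise(paths):
    """Return (path, modification time, last line) for each file given."""
    rows = []
    for path in paths:
        # errors="replace" keeps going even if the log holds odd bytes
        with open(path, errors="replace") as handle:
            text = handle.read()
        stamp = time.ctime(os.path.getmtime(path))
        rows.append((path, stamp, last_nonblank_line(text)))
    return rows
```

Calling summarise() with the names of a couple of 'ded' files from the dwatch_path directory gives one row per file, which you can print and compare.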
- I tried to upgrade but it did not work . . .
Normally, if something does not upgrade correctly then
it means that the installation utility, dmsetup, was not able to
stop that part of the server in order to copy over it with the new
version.
So, in order to do the upgrade, you must stop that server or program and
then manually copy the new executable over the old one - make sure you
find the correct old executable to overwrite!
A few notes that might help:
1. On NT, remember to exit from DMAdmin before you do the upgrade.
2. On NT, if you want to stop the servers, and DMAdmin is not responding, you must stop the DWatch service that controls them - you can do
this from the control panel, 'services' dialog. If this doesn't
work then you must disable the DWatch service (in the same dialog) and
restart the machine,
so that when you restart, the servers are not running. At that point
dmsetup should be able to upgrade everything without any problems.
- What does this log (error) message mean?
See Deciphering Log Files.
- I am having a problem with the users ...
The following list is of things to try, given that you are having problems
with the user database. It assumes that you are using NWAuth, but most of
what it says applies to whichever database you are using.
I can add users (with NetAuth or whatever) but the servers don't
recognise them:
The most likely problem is that the users are being added in the wrong
form, i.e. with the wrong prefix or suffix. You should open up your user
database (for NWAuth that is nwauth.txt and/or nwauth.add)
with a text editor and see the form of username that has been added there. Then compare that with the username that DSMTP and DPOP are looking up
in the appropriate log file - obviously the two will need to match.
In order to get the log files that you need, edit the dmail.conf setting log_level
to read,
log_level debug
Then reload the servers (tellsmtp reload and tellpop reload) and then send in
a message to that user, or login to DPOP as that user. In the dsmtp.log file
(in the log_path directory) you are looking for the line:
"lookup username ..."
In the dpop.log file you are looking for:
"check username ...". It is the username that should match
with the username in the user database.
If they don't match, there are a number of settings
in dmail.conf that affect the prefix and suffix of a username in the user
database. In dmail.conf these are either vdomain (the prefix parameter) and
vdomain_separator OR authent_domain. The NetAuth manual has a
'Mail Server authentication setup' section with all of the possible settings and what you have to
set in each product. Note: if you are using DMAdmin to enter the usernames, you have to enter them exactly as you want them to appear in the
nwauth.txt file.
If the usernames do match, either NWAuth is returning a bad response,
or DPOP and DSMTP are not running the same NWAuth as you are.
So run NWAuth from the command line. e.g. Assuming that you have a user called
bob, that his password is 'pass' and that your authent_process setting
in dmail.conf is c:\dmail\nwauth.exe, enter:
c:\dmail\nwauth.exe
lookup bob
check bob pass
exit
The response should be '+OK ...' in each case.
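If you need to repeat this check for several users, the session above could be driven from a script. The sketch below assumes only what is described here: the module reads commands on standard input and answers '+OK ...' on success. The function names are hypothetical, and any banner line or reply to 'exit' that does not start with +OK will also be reported, so read the result with that in mind.

```python
import subprocess


def build_session(user, password):
    """Return the text that would be typed into the authentication module."""
    return "lookup {0}\ncheck {0} {1}\nexit\n".format(user, password)


def failed_replies(output):
    """Return the non-blank reply lines that do not start with +OK."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.startswith("+OK")]


def run_auth_module(path, user, password):
    """Run the module at 'path', feed it the session, and return its output."""
    result = subprocess.run(
        [path],
        input=build_session(user, password),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout
```

For example, failed_replies(run_auth_module(r"c:\dmail\nwauth.exe", "bob", "pass")) should come back empty if every reply was '+OK ...'.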
You can check that DSMTP is running the authentication process that
you have just run by entering,
tellsmtp config authent_process
It should respond with the value of that setting.
- I got a bounce (Delivery Status Notification) message from DSMTP ...
DSMTP creates a number of messages for sending back to the sender of a message,
explaining a delivery problem or notifying of delivery success. These are
called DSN (Delivery Status Notification) messages, and are generally
identified by the fact that the sender of the message is the 'postmaster@your_domain'.
There is a section of the manual on these, Bounces and DSNs.
Here is the start of a list explaining some common ones ...
(ask us to add to this list if you get a DSN that is not listed)
- Subject= Possible message loop
The error message is generated when DSMTP detects that a message
that it is receiving already has received headers
stamped on it 15 times, i.e. indicating that the message
has been through many servers, and hence probably gone around and around in a loop (as most messages only have 2 or 3 such
received lines).
As the message states, this is normally because a message
arrives for a given domain, e.g. bob.com. DSMTP looks at its
list of host_domains and vdomains and can't see bob.com there so it
considers it a non-local domain. Then it does a dns lookup
on bob.com and sends the message off to the resulting ip address,
which is itself. The same process happens again and again.
Normally in this situation, the result of the DNS lookup is
noted as pointing at itself so that it automatically adds that domain
to the host_domain list, and delivers the message locally.
However, sometimes it does not realise that the DNS lookup
points at itself so no auto host_domain addition occurs. The auto host addition
can also be turned off with the setting, no_autohost, so
check that that is not set to 'true' in your dmail.conf.
Also, if there is a server forwarding or gatewaying (routing) mail
for that domain to DSMTP, the DNS lookup will not point at DSMTP
directly, so again, it cannot automatically add the host_domain line.
The best way to see what is happening is to set...
log_level debug
log_data true
and then to do a tellsmtp reload, so that DSMTP logs the full message body which will show you the
received lines in the messages. From that you should be able to trace the path of the message.
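Once you have a copy of the looping message, counting and listing its Received headers is easy to script. The sketch below uses Python's standard email module; the function names are ours, and treating 15 or more headers as a loop is our reading of the description above, not a statement of exactly how DSMTP compares the count.

```python
from email import message_from_string


def received_headers(raw_message):
    """Return the Received headers in the order they appear in the message."""
    msg = message_from_string(raw_message)
    return msg.get_all("Received", [])


def looks_like_loop(raw_message, limit=15):
    """True if the message already carries 'limit' or more Received headers."""
    return len(received_headers(raw_message)) >= limit


def path_of_message(raw_message):
    """Return the Received headers oldest first, i.e. in travelling order."""
    # Each server adds its header at the top, so the newest one comes first.
    return list(reversed(received_headers(raw_message)))
```

Printing path_of_message() for the bounced message shows each hop in turn, which makes it easier to spot the point where the message comes back to your own server.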
If, for some reason, you wish to allow your messages to have more than 15 received headers then
you can set, max_rcvd to a number higher than 15. NB: we do not
recommend changing this setting, as in 99.9% of cases it indicates that you have something
misconfigured. This setting is
fairly new (probably 2.7k onwards) so enter,
tellsmtp config max_rcvd
to see if your version of DMail knows about that setting.
If you cannot find the solution, please email DMail Support
with the bounce message that you get, your dmail.conf and also a dsmtp.log file showing the received
headers (or a copy of the message showing the headers).
- What does this DPOP error message mean?
DPOP returns quite a small set of error messages when it does not
allow a user to log in. Good email clients pass these messages through to
the user, but note that some do not. Therefore, you should always check the
dpop.log file to see the real reason that a user cannot connect to the pop
server.
NB: a number of the DPOP error messages are simply the messages returned
by an external authentication module - this should be obvious in the dpop.log
file if it is the case. We're happy to edit the error responses of any of
our authentication modules if you wish to make suggestions.
Here is the start of a list explaining some common ones ...
(please ask us to add to this list if you get a message that is not listed)
- 'Database Down' or 'Out of Sync' message with External User Database ...
An error message in the dsmtp.log file such as,
...Out of sync reply from external auth (bob) isn't (fred)...
or similarly in a bounce message or server connection error message,
...User database is down
indicates that DSMTP (or DPOP) thinks that your authentication module is responding out of sync,
e.g. it looked up bob and thinks that it received back a response for fred.
The most likely reason for this is that your authentication module was delayed in
responding to the lookup request, so that DSMTP sees the response to that request when it
goes looking for the response to the following request.
The time that it waits is set by,
authent_timeout
which takes a timeout setting in seconds.
Also the settings,
tcp_timeout (DSMTP)
and
pop_timeout (DPOP)
set the timeout on TCPIP connections for DSMTP and DPOP respectively.
You could check that your authent_timeout setting is long enough to allow any normal
slow lookups by your authentication module, e.g. if your database regularly goes offline
for a few minutes each day. If you are unsure of what
to set it at, I would suggest that you set it at 30 seconds.
You also need to check that your tcp_timeout and pop_timeout settings are larger than
your authent_timeout setting. If they are not, the servers can drop the connection before they
have finished allowing the authentication module to do the user lookup. This can cause very strange
behaviour. We recommend that you leave both tcp_timeout (default 5 minutes) and pop_timeout (default 10 minutes)
at their default values.
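The rule above (tcp_timeout and pop_timeout must both be larger than authent_timeout) can be checked with a few lines of script. This sketch makes some assumptions you should verify: that all three values are expressed in the same unit, that dmail.conf holds one 'name value' pair per line as in the examples in this FAQ, and that lines starting with '#' are comments. The function names are ours.

```python
def parse_settings(conf_text):
    """Read 'name value' lines into a dict (a later line replaces an earlier one)."""
    settings = {}
    for line in conf_text.splitlines():
        parts = line.split(None, 1)
        # Assumption: a leading '#' marks a comment line
        if len(parts) == 2 and not parts[0].startswith("#"):
            settings[parts[0]] = parts[1].strip()
    return settings


def timeout_problems(authent_timeout, tcp_timeout, pop_timeout):
    """Return the names of the timeouts that are not larger than authent_timeout."""
    problems = []
    if tcp_timeout <= authent_timeout:
        problems.append("tcp_timeout")
    if pop_timeout <= authent_timeout:
        problems.append("pop_timeout")
    return problems
```

With an authent_timeout of 30 and the two connection timeouts at 300 and 600, timeout_problems(30, 300, 600) returns an empty list; any name it does return is a setting to raise (or authent_timeout is too high).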
In version 2.7n (2.7q is the corresponding release version) we did some work on this
so that in such a situation, DSMTP can 'get back in sync'.
Therefore, if you are using an older version you may want to upgrade to at least version 2.7q.
- On Windows, DMAdmin just shows lines like 'Lost connection to DSMTP (Select failed () Connection Refused)' ...
The messages you are seeing in DMAdmin indicate that the administration utility
cannot connect to the DSMTP (and/or DPOP) server(s).
It is important to realise that DMAdmin is just an administration utility that connects to the servers
when they are running. It may well be that they are running, but DMAdmin cannot talk to them for
some reason.
You can check whether the servers are really running by entering at a command prompt,
telnet localhost 110
quit
to which the DPOP server should respond if it is going.
Similarly, entering
telnet localhost 25
quit
checks for DSMTP.
If the servers are not running, please send
DMail Support your configuration file,
dmail.conf
(typically c:\winnt\system32\dmail.conf or /etc/dmail.conf)
and the following log files,
dsmtp.log
dpop.log
from the log_path directory (specified in dmail.conf).
NB: the most common cause of this is that there is another Mail server running! So please do
check that you do not have another SMTP or POP server running. When you do the telnet tests above,
the DSMTP and DPOP servers will respond with a line including the word 'DSMTP' or 'DPOP' respectively so
that you can tell that they are the server responding. Other servers will respond with similar lines
but of course will not mention the names of our products.
If some other servers are running then you need to shut them down and re-run the dmsetup installation
utility (which will do an upgrade, 2, this time). You will find dmsetup in the dmtemp directory.
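The telnet tests above can also be scripted, which is handy if you want to check both ports quickly. This is a minimal sketch: the function names are ours, and it relies only on what is said above, namely that the first line sent by our servers includes the word 'DSMTP' or 'DPOP'.

```python
import socket


def read_banner(host, port, timeout=10.0):
    """Connect to host:port and return the first chunk of text the server sends."""
    # Raises an OSError (e.g. connection refused) if nothing is listening
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(1024).decode("ascii", errors="replace")


def is_dmail_banner(banner, product):
    """True if the greeting line mentions the product name, e.g. 'DPOP'."""
    return product in banner
```

For example, is_dmail_banner(read_banner("localhost", 110), "DPOP") tells you whether the server answering on the POP port is DPOP, and the same call with port 25 and "DSMTP" checks the SMTP port.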
If the servers are running, and they are indeed the DMail servers, then
DMAdmin is probably just having trouble connecting to the servers. So send
DMail Support the same files as above, but also
add the dwatch.log file (from the log_path directory) and you can click on the 'debug output' check box
on the dwatch tab in dmadmin and send us the resulting dmadmin.log file (dmadmin will log to screen the
name of the log file it is using).
- What does the following System Administrator message mean?
Many System Administrator messages are simply copies of bounce or DSN messages, so in addition
to any messages listed below check the FAQ above,
I got a bounce message from DSMTP ...
(no system admin messages doc'd at present - email DMail Support
if you want one explained)
- We are having problems with 'try again later' or 'too many simultaneous connections' messages in DSMTP ...
> we are having problems with people getting 'try again later' SMTP
> errors. I know there is a setting somewhere to limit the number of
> connections, but is there anything I should be watching out for so that I can be sure
> I'm not just addressing the symptom of a greater problem?
The setting you want to change is,
tcp_max
(default is 200)
from which you need to take away half of
max_send
(default is 10)
DSMTP will never allow more incoming channels than,
tcp_max -(max_send/2)
so by default,
200 - (10/2) = 195
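If you are experimenting with different values, the same arithmetic can be written as a one-line function. The function name is ours, and the use of whole-number division for an odd max_send is an assumption; the text above only gives the even default case.

```python
def max_incoming_channels(tcp_max=200, max_send=10):
    """Incoming channel limit: tcp_max minus half of max_send."""
    # Assumption: half of an odd max_send is rounded down
    return tcp_max - max_send // 2
```

With the defaults this gives 195, matching the figure above.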
You are correct that you need to monitor the use of your channels closely, rather than just raising tcp_max.
The first thing to do is grep your logs for the error message that corresponds to,
try again later
which the users see. I just went to look this up and found that there are various errors with
that ending, and all the ones with 'try again later' as opposed to 'try later' seem to be because
of authentication module problems. So definitely do check the
log files to see the reason for the rejections.
For the rejection message,
435 Sorry only %d simultaneous users permited, try later
the log line is,
User rejected because too many users already connected
The other place to check is at the end of a tellsmtp status, where DSMTP displays counts of various error conditions. The text to look for there is,
Connection refused: too many simultaneous connections
and the number beside it is a 'count' of how many times DSMTP has given that response.
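If grep is not to hand (on NT, for example), counting the rejection lines in a log can be sketched in a few lines of script. The message text is the log line quoted above; the function name is ours.

```python
REJECT_TEXT = "User rejected because too many users already connected"


def count_rejections(lines):
    """Count the log lines that contain the 'too many users' rejection text."""
    return sum(1 for line in lines if REJECT_TEXT in line)
```

Open your dsmtp.log file (from the log_path directory) and pass the open file to count_rejections() to get the total for that log.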
Also, if you give the command,
tellsmtp showchans
then DSMTP will list all channels that are currently or have been active. You may want to set
up a cron job to get copies of that a few times a day, and/or save the output of that command (pipe to file) when you know it is in a period of rejecting connections.
You may be able to see problems in the showchans outputs,
but feel free to send them to us to be checked.
The same goes with the dsmtp.log files.
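The cron job idea above could look something like the sketch below, which saves each showchans listing to its own time-stamped file. It assumes that tellsmtp is on the PATH of the user running the job; the function names and the file naming scheme are ours.

```python
import os
import subprocess
import time


def snapshot_name(prefix, when):
    """Build a file name such as showchans-19991231-235959.txt."""
    return "{0}-{1}.txt".format(prefix, time.strftime("%Y%m%d-%H%M%S", when))


def save_showchans(directory):
    """Run 'tellsmtp showchans' and save its output under 'directory'."""
    result = subprocess.run(
        ["tellsmtp", "showchans"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    path = os.path.join(directory, snapshot_name("showchans", time.localtime()))
    with open(path, "w") as handle:
        handle.write(result.stdout)
    return path
```

Run save_showchans() from cron a few times a day, or by hand during a period when connections are being rejected, and send us the resulting files if you cannot see the problem yourself.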