Troubleshooting Fun with Exchange 2007 Queues
I recently resolved an issue involving two Exchange 2007 servers in two different AD sites. The symptom was simply slow email delivery when emailing from Site 'A' to Site 'B', and a quick check showed that both servers had backlogged mail queues with no obvious cause.
Both sites are part of the same domain, and both servers are identical in hardware (HP DL380 G5) and patch level (Windows Server 2003 Standard x64 R2 and Exchange 2007 SP2). Connectivity between the two sites tested perfectly, and talking to other servers in each site revealed no issues. It was only when the two Exchange servers attempted to communicate with each other that the issue occurred.
Mail in both queues reported one of two errors, either "451 4.4.0 Primary target IP address responded with: "421 4.4.2 Connection dropped." Attempted failover to alternate host, but that did not succeed. Either there are no alternate hosts or delivery failed to all alternate hosts." or simply "421 4.4.2 Connection dropped.", which seemed to point to network issues. Packet captures from both servers also showed a large number of retransmits on both SMTP and SMB communication:
SMTP:
338 XXXMAIL02 192.168.15.63 SMTP SMTP:Cmd EHLO XXXMAIL02.testdomain.com, 31 bytes
1197 192.168.15.63 XXXMAIL02 SMTP SMTP:Rsp 250 -YYYMAIL02.testdomain.com Hello [192.168.24.34], 255 bytes
1198 XXXMAIL02 192.168.15.63 SMTP SMTP:Data Payload, 16 bytes
4159 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
8142 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
11786 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
15476 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
17902 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
20735 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
23227 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]Flags=...AP..., SrcPort=44217, DstPort=SMTP(25), PayloadLen=16, Seq=3183382952 - 3183382968, Ack=1774779495, Win=65181
SMB:
1/5/2010 15:22 14560 {TCP:358, IPv4:16} XXXMAIL02 192.168.15.63 SMB SMB:R; Negotiate, Dialect is NT LM 0.12 (#5), SpnegoNegTokenInit
1/5/2010 15:22 14650 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
1/5/2010 15:22 14943 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
1/5/2010 15:23 15334 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
1/5/2010 15:23 15862 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
1/5/2010 15:23 16383 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
1/5/2010 15:23 17225 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
1/5/2010 15:24 18568 {TCP:358, IPv4:16} XXXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #14560] [Bad CheckSum]Flags=...AP..., SrcPort=Microsoft-DS(445), DstPort=44946, PayloadLen=186, Seq=3446414444 - 3446414630, Ack=2070264315, Win=65398 (scale factor 0x0) = 65398
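Counting retransmits by eye gets tedious in a longer capture. As a rough sketch (not part of the original troubleshooting), the Python below tallies how many times each original frame was retransmitted, assuming the capture has been exported as text lines in the same shape as the excerpts above. The function name `count_retransmits` and the shortened sample lines are illustrative only.

```python
import re
from collections import Counter

# Matches the "[ReTransmit #<frame>]" tag seen in the capture excerpts above.
RETRANSMIT = re.compile(r"\[ReTransmit #(\d+)\]")


def count_retransmits(lines):
    """Return a Counter mapping original frame number -> retransmit count."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines shortened from the SMTP excerpt; a real run would read
    # the exported capture file instead (assumption: one frame per line).
    sample = [
        "1198 XXXMAIL02 192.168.15.63 SMTP SMTP:Data Payload, 16 bytes",
        "4159 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]",
        "8142 XXXMAIL02 192.168.15.63 TCP TCP:[ReTransmit #1198] [Bad CheckSum]",
    ]
    for frame, n in count_retransmits(sample).most_common():
        print(f"frame {frame}: retransmitted {n} times")
```

A single frame being retransmitted many times with no acknowledgement, as with frames 1198 and 14560 above, is the pattern to look for.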
On revisiting the issue, we noticed that XXXMAIL02 had two network adapters in a team, while YYYMAIL02 was running off a single network adapter. Both servers also had old network card drivers (the cards are HP NC373i Multifunction Gigabit Adapters, which are rebadged Broadcom cards, and were using driver v2.8.13.0, dated 30/06/2006), so as part of the troubleshooting we upgraded these drivers to the latest available version (v5.0.13.0, 23/06/2009) at the next maintenance window. As part of the upgrade, XXXMAIL02 was changed from a network team to a single adapter, to match YYYMAIL02.
(Bootnote: We did the upgrade by installing the latest ProLiant Support Pack, and ran into a small issue of note while doing so. You can't upgrade the network drivers straight to v5.0.13.0; if you try, the installation fails with the error "HP Virtual Bus Device installation requires a newer version. Version 4.6.16.0 is required". The easy way around this is to download v4.6.16.0 from HP (64-bit here, 32-bit here), and install it prior to running the PSP.)
Within minutes of the upgrade being completed, mail and other traffic was flowing freely between both servers. A speed test run using iperf showed ~60 Mb/s (previously we were seeing ~557 bytes/s), and new emails were being delivered to the server within seconds.
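To put those two iperf figures side by side, here is a small back-of-the-envelope calculation in Python. It assumes "Mb/s" means megabits per second (10^6 bits), and the 1 MB message size is a hypothetical value chosen purely for illustration.

```python
def mbit_to_bytes_per_sec(mbit_per_sec):
    """Convert megabits per second (10**6 bits) to bytes per second."""
    return mbit_per_sec * 1_000_000 / 8


def transfer_seconds(size_bytes, bytes_per_sec):
    """Idealised transfer time, ignoring protocol overhead and retries."""
    return size_bytes / bytes_per_sec


if __name__ == "__main__":
    before = 557                        # bytes/s, measured before the upgrade
    after = mbit_to_bytes_per_sec(60)   # ~60 Mb/s, measured after the upgrade
    message = 1_000_000                 # hypothetical 1 MB email

    print(f"speed-up: about {after / before:,.0f}x")
    print(f"before: {transfer_seconds(message, before) / 60:.1f} minutes per message")
    print(f"after:  {transfer_seconds(message, after):.2f} seconds per message")
```

At the old rate, a single 1 MB message would need roughly 30 minutes even with no dropped connections, which fits with queues backing up on both servers.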
This was a tricky one to diagnose - but it proves how often simple things are overlooked, in search of a bigger problem!


February 9th, 2010 - 21:04
You were getting re-transmits which is weird. I note you didn’t mention what type of teaming you were using.
We’ve had problems with NICs in ‘Performance Teaming’ mode, when the switch isn’t LACP aware (think broadcast storm, but with a full tcp/ip stream of a backup taking place).
Most of the time (in my experience) it tends to be the switch freaking out, rather than the NICs.
February 9th, 2010 - 23:30
The teaming mode was standard ‘Automatic Load Balancing/Fault Tolerance’ mode that the HP Drivers offer.
I will be fair, and state that the switch we were using it on is a Cisco 2950 that’s well past its use-by date and I did think that it was related to the problem.
However, we are in the process of upgrading it, but due to its… remote… location, we have to work it in around scheduled visits etc. Once it’s replaced with something newer (probably a 3750G-series), we’ll revisit the teaming and see if it reoccurs.