Short Version : If you see above wait type in sysprocesses* for your Distributed Transaction check that DNS is correctly configured / working / pingable for that server (and Cluster Network name if applicable)
This week had that classic MSDTC problem , whereby customer was just getting MSDTC transaction failed when running a custom Transaction Replication setup routine between 2 SQL 2005 Clusters – From a server I will call Publisher to a server I will call Subscriber
It would just hang for around 10 mins and then produce the error message.
Hadn’t looked at MSDTC for a while , so Binged it and came up with couple of really good troubleshooting articles (see bottom). But had a couple of key learnings too. Process :
1. Checked Event Logs and Restarted MSDTC Resource on both Clusters
it was a UAT / Staging environment so could be a brutal. No joy.
(I actually failed here – because i should have checked the Event Log after restart as had a 9017 DNS Bad Key entry with EventId = 1119 , Source = ClusSvc)
2. Used DTCPing
to test the Network connectivity / Security settings between the 2 Servers
This worked ok. Running DTCPing from my client workstation to both Servers actually failed , but I parked this fact as I thought it must be some Firewall / Port restriction
3. Used DTCTester
in both directions (ie creating DSN / running from Publisher –> Subscriber and vice versa)
Note : You have to create a User (not System) DSN and if using Windows Authentication you can just spoof the user and password (2nd and 3rd params)
DTCTester DSNName beer beer
This failed with similar hanging behaviour for both scenarios – although the hang period was about 5 mins.
I then ran DTCTester from my Client workstation and interestingly it worked against the Publisher but failed against the Subscriber.
4. Distributed Transaction from SQL Server.
As we already had Linked Servers set up, I ran below from the Publisher. With hindsight I should have done this first/instead of step 3 to save faffing around with DSN’s
BEGIN DISTRIBUTED TRANSACTION
SELECT * FROM Subscriber.master.dbo.sysobjects
(where Subscriber was the linked server name for the , er , Subsciber) Again this was hanging as per the problem / DTC Tester .
I then connected to Subscriber and did below to see if I could see anything
SELECT * FROM master.dbo.sysprocesses WHERE hostname = ‘PUBLISHER’ and loginame = ‘andyjball’
interestingly this showed my spid from the publisher a wait type of 0x01B3 / PREEMPTIVE_OS_DTCOPS which I had never seen before . Bing search came up more or less blank.
5. Examining the Subscriber Configuration.
As per point 3 it seamed to be the culprit. I found 2 major things wrong
a) Pinging Cluster Network Name failed
resolved to a different IP address that was configured on the Cluster IP Address and didn’t reply
b) Had Status 9017 against DNS Status for the Parameters tab on the Cluster Network name.
(note this shows 9016 as I was repro’ing on another VM / Cluster by removing write permission from the DNS entry for the Cluster Name from the Cluster Service Account)
I don’t recall ever seeing these at anything other than 0. Below (confirmed with some other research) resolved to DNS BAD KEY
NET HELPMSG 9017
which made sense given a). So contacted our Windows team who fixed the DNS / name resolution issue , restarted MSDTC Resource on the Subscriber Cluster , retested and All working !
Network / Firewall / MSDTC Permissions http://blogs.msdn.com/b/distributedservices/archive/2011/11/22/troubleshooting-msdtc-communication-checklist.aspx
* Yeah I know about DMV’s