Attention DBAs – Don’t Forget the Network as a Bottleneck!

These days, when looking at performance issues with SQL Server, the root cause is generally DB concurrency, CPU, I/O or memory related, or some combination thereof. This is especially true of OLTP-style systems where you aren’t expecting to move large amounts of data around. Also, the client end of the connection tends to be on a good, fast link to SQL Server, as most of the time it’s on either an App or Web Server*

I once resolved a performance issue on a business-critical SQL Server where network queuing played a key part in the problem, but it turned out to be a SQL-related issue.

Key Facts (Abridged)

  • Access front end to SQL Server** application, with about 1,000 connections and around 100 active users.
  • Most users were on virtual desktops, hosted on servers in the same datacentre as the SQL Server.
  • Performance had degraded gradually over a period of a week, whereby a SELECT / UPDATE operation in the app had gone from a fairly consistent 5-second response time to around 50-120 seconds, mainly during the busiest time of day (10am to 12pm).
  • A new version of the Access software / SQL components had been released about 3 weeks before performance started to degrade (alarm bells!).
  • SQL Server 2005 x64, Windows Server 2003, 32 GB RAM, DL380 (can’t remember the G number!).
  • Consolidated*** SQL Server with 4 other active instances. sp_configure ‘max server memory’ set to 4 GB on each instance. CPU not throttled / affinitised on any of the instances.
  • Database approximately 2 GB of used space, so theoretically almost all of it would be cached in memory (as 4 GB was allocated to the instance) – a quick way to sanity-check that is sketched below.
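
As a quick sanity check on that last point, something along these lines (a rough sketch using sys.dm_os_buffer_descriptors, available from SQL Server 2005 onwards) shows how much of each database is actually sitting in an instance’s buffer pool:

  -- Rough sketch: how much of each database is currently cached in
  -- this instance's buffer pool (8 KB pages converted to MB).
  SELECT
      DB_NAME(database_id)  AS database_name,
      COUNT(*) * 8 / 1024   AS buffer_pool_mb
  FROM sys.dm_os_buffer_descriptors
  GROUP BY database_id
  ORDER BY buffer_pool_mb DESC;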

Troubleshooting (Abridged)

(see here for an excellent post on the art of troubleshooting – http://blogs.msdn.com/b/lucascan/archive/2009/04/14/troubleshooting-101.aspx)

  • As soon as I got the call (at around 11am), I kicked off a pre-configured SQLDiag on the server to capture Perfmon, Profiler and blocking data for later analysis.
  • In parallel, I started looking interactively at the issue. Given the consolidated nature of this server, I went straight for an interactive Perfmon to check OS counters like Processor Utilisation, I/O (Logical Disk / Avg. Disk sec/Transfer), Memory, and SQL counters like Processes Blocked (which was 0), Wait Statistics, etc.
  • These all looked OK apart from the Wait Statistics Network IO waits counter, which I had never previously seen at any value other than 0. I then looked at Network Interface\Bytes Sent/sec and Bytes Received/sec for the (teamed) public-facing network card. Bytes Sent/sec was averaging around 12 MB/sec (high for an OLTP server) and, more importantly, Output Queue Length was averaging around 50-60, confirming some queuing was happening at the network layer.
  • A conference call was convened:
    • The networks guy confirmed that they were seeing similar throughput (12 MB/sec x 8 = 96 Mbit/sec).
    • He also checked that the duplex settings were set correctly (FULL) and that there were no errors on the switch / for that port.
    • The backup guy confirmed that there was no active backup running and that the dedicated backup network card was configured correctly.
    • The Windows guy confirmed that the TCP Chimney / RSS patch was installed, and also that teaming was configured for redundancy rather than load balancing, as per company standard, and hadn’t been changed recently.
    • (I briefly tried to find out which process was causing the network I/O. Being Server 2003 there was no Resource Monitor, and the Process Explorer output from Sysinternals didn’t match Perfmon’s 12 MB/sec, so I did a cheeky Network Monitor trace for about a minute to look at later – obviously a tiny snapshot, but better than nowt.)
  • By then we had managed to organise an interactive Profiler trace, filtered by user, of a standard SELECT / UPDATE operation with a user who joined the conf call. Unfortunately by then it was after the busy period mentioned above, so it took about 15 seconds (not bad for the app), but at least it gave us a baseline.
  • I eyeballed the trace quickly and noticed that the longest-duration statement, at around 10 seconds, was something like:

  SELECT IdField FROM 2MillionRowTable

(Did you spot the missing WHERE clause?)

  • The developer confirmed that this gets executed every time, so that’s 4 bytes (int field) x 2 million rows = approximately 8 MB over the wire for each iteration! This was a lovely silver bullet: the code was changed immediately and the performance problem went away.
  • Later analysis of the Profiler trace from the SQLDiag I’d kicked off, using ReadTrace (part of SQLNexus), revealed that the problem code above was executed 3,500 times between 11:00am and 12:20pm (when I stopped the SQLDiag, as it was after the busy period), and that it accounted for about 90% of all query Duration in the trace – a rough way of doing a similar aggregation straight from the trace file is sketched below.
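
If you don’t have ReadTrace / SQLNexus to hand, a rough equivalent is to load the trace file with fn_trace_gettable and aggregate by statement text. This is only a sketch – the trace file path is made up, and unlike ReadTrace it doesn’t normalise parameter values, so identical statements with different literals won’t group together:

  -- Sketch: aggregate a Profiler / server-side trace by statement text.
  -- The file path is hypothetical; Duration is reported in microseconds.
  SELECT
      CAST(TextData AS NVARCHAR(4000))  AS statement_text,   -- truncated for grouping
      COUNT(*)                          AS executions,
      SUM(Duration) / 1000000.0         AS total_duration_sec
  FROM sys.fn_trace_gettable(N'C:\Traces\MyTrace.trc', DEFAULT)
  WHERE EventClass IN (10, 12)          -- RPC:Completed, SQL:BatchCompleted
  GROUP BY CAST(TextData AS NVARCHAR(4000))
  ORDER BY total_duration_sec DESC;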

Key Take-Aways

  • Even though the root cause of this issue was the code above, it is interesting that network queuing was the first indication that something was wrong server side. It’s always worth looking at the network-related Perfmon counters as a quick check at the start of troubleshooting.
  • The fact that there had been an Application / SQL code release a few weeks before was always near the front of my mind. The 2-million-row table mentioned above was part of that release and went from 0 to 2 million rows in about 4 weeks, hence the gradual degradation of performance. A timebomb / snowball effect.
  • I had 4 things on the go in parallel at various points. If I hadn’t had the SQLDiag ready to go, it probably wouldn’t have got done:
    • Kicking off the pre-configured SQLDiag (a simple batch file). This only took about 5 minutes, including checking it was working properly and not filling the disk.
    • Interactive monitoring using Perfmon / DMVs etc.
    • Attending a 2-hour conference call, with about 10 people across 2 continents, whilst the problem was happening.
    • Dealing with incoming emails on the issue from a multitude of people. After a while I Replied All saying “Sorry, I won’t respond to any mails for the next 30 minutes as I’m trying to diagnose the issue”.
  • Scoping and information gathering are, as always, a very important part of the troubleshooting process.
  • For most SQL performance problems I would normally look at sysprocesses / DMVs first, checking for blocking and aggregating wait types (a rough sketch of what I mean is shown after this list). Because I knew this was a consolidated server, I used Perfmon first, as I was guessing that it might be another instance monopolising a shared resource. The SQL Perfmon counters include the Wait Statistics object and the Processes Blocked counter under General Statistics, so you can get similar (though not as detailed) information there.
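
For completeness, this is roughly what I mean by the DMV route – a sketch only, using the standard SQL Server 2005 DMVs, and the list of benign wait types to exclude is far from exhaustive:

  -- Any requests currently blocked, and what they are waiting on?
  SELECT session_id, blocking_session_id, wait_type, wait_time, status
  FROM sys.dm_exec_requests
  WHERE blocking_session_id <> 0;

  -- Top aggregated wait types since the instance last started
  -- (excluding a few common benign waits - far from a complete list).
  SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms
  FROM sys.dm_os_wait_stats
  WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                          N'BROKER_TASK_STOP', N'SQLTRACE_BUFFER_FLUSH')
  ORDER BY wait_time_ms DESC;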

To Do

  • Understand why we were seeing a high Output Queue Length when running at only about 12 MB/sec, and what realistic maximum you can expect on a 1 Gbps switch (I know there are lots of other things that can come into play in between). I know an expert in this area, so hopefully this will only cost me a coffee or a beer. Good ROI, I think.
  • Understand why the Process Explorer network I/O figures differed so much from the Perfmon / network switch I/O data. I ran it really quickly, so I reckon I stuffed something up. This is important because the network I/O**** is still quite high (around 5 MB/sec) for a SQL Server box, so I’ve got a sneaking suspicion there is some other network-hungry process that could be tuned. It’s even more important because it is a consolidated server.
  • Small caveat re the pre-configured SQLDiag: I hadn’t used this on a multi-instance box before, so I had to edit the XML file to change from the default * (all instances) to just my specific instance. I need to parameterise the batch file to take an optional instance parameter.
  • I might do a follow-up post explaining how I used the various tools mentioned.

* Unlike the good old ’90s, when corporate networks were slower and more unreliable, and client-server was the default.

** Don’t shoot the messenger..

*** Each Application has its own instance.

**** I used the TopUsers Expert add-in for Network Monitor from CodePlex to see traffic by conversation – awesome.
