SQL Server 2008 R2 and Nehalem Processors

I want to share a note about our latest experiences with SQL Server 2008 R2 EE X64 and Nehalem Processors. We have an active/active SQL Server cluster solution running on Two Dell R710 machines and Equallogic Storage Solution using iSCSI. Each Server has Two Intel X5550 processors, 64 GB of RAM and is running under Windows Server 2008 R2 EE X64.

We experienced performance degradation first when requesting I/O from a SAS RAID 10 Stripe Set (Dell Equallogic) for a particular node. After rebooting both nodes the behavior will occur just on the second node. The inconsistency let us to the rebuilding of the solution as we might have hit a problem with the MPIO configuration.

After rebuilding, everything went well. SQLIO, SQLIOSIM and later standard SQL Server Load Testing was performed and we experienced no issues. We reevaluated our storage needs and added a Tier 2 RAID 50 Equallogic solution with 32 drives. We had no problems with SQLIO and SQLIOSIM, but they began to reoccur during our SQL Server tests. In this particular scenario only node 1 was underperforming. When the SQL instances were running on Node 2, everything ran perfectly.

I started asking questions under the Twitter #sqlhelp topic. I had some feedback from multiple folks but we were yet to find the culprit of our problems.

I had the privilege to have an online session with Kevin Boles (@TheSQLGuru) who took a quick look to our configurations and tests, and was able to pinpoint that we were indeed having an issue with the I/O subsystem. He agreed that there were no obvious explanations on to why Node 1 will under perform and Node 2 would not. Kevin then suggested we shall disable Hyper-Threading from our servers. I took his recommendations and decided to wait for our vender to give us a response regarding our findings.

We continued researching and were given no explanation from our vender on why only Node 2 performed very well. We went ahead and rebuilt Node 1 from scratch; fresh OS installation, fresh SQL Server installation, addition of the node to the cluster, etc. SQLIO and SQLIOSIM tests gave back excellent results, but once SQL Server kicked in performance degradation reoccurred.

We opened a case with Microsoft and ran tests. Microsoft did not find any issues with SQL Server instances and they suggested we followed these recommendations:

  1. Patch the systems for bugs in Windows Server 2008 R2:
    1. Windows 2008 R2/win7: http://support.microsoft.com/kb/976700. Application stops responding, experiences low performance, or experiences high privileged CPU usage if many large I/O operations are performed in Windows 7 or Windows Server 2008 R2
    2. Windows 2008 R2/win7: http://support.microsoft.com/kb/979149. Computer that is running Windows 7 or Windows Server 2008 R2 becomes unresponsive when you run a large application
  2. Ensure that the Power Plans are configured as “High Performance”
    1. http://support.microsoft.com/kb/2254067
    2. http://support.microsoft.com/kb/2207548

We also decided to follow up with our vendor one more time and they suggested we shall disable Hyper-Threading (HT) on node 1. After doing so we lowered our degree of parallelism and ran tests. We were able to insert 10 million rows in under 45 seconds every time without any performance degradation.

Contrary to our findings, researching on the web let me to understand that our processors shall perform well with HT enabled and it was even suggested by a well-known SQL Server MVP Glenn Berry (@glennalanberry).

I then started questioning about the processors in node 1 and if maybe they have an issue. Because we did not want to spend (or waste) more time we decided to perform the same exercise on Node 2. We went ahead and ran tests and we are able to obtain similar results produced by node 1.

Microsoft contacted us and asked to submit our configurations and the steps taken to solve the problem. They told me we “might” have found a bug. We shall confirm once they get back to us.

For now, we will stick with the best practice of disabling HT on any new server implementation, with Nehalem CPUs or not.

As a final thought, I have pinpointed to the team the bright side about our experiences in the past few weeks. We are now able to add SQL nodes to a clustered environment pretty quickly and hook them up to our SAN back-end.