SQL Server 2008 R2 and Nehalem Processors

      13 Comments on SQL Server 2008 R2 and Nehalem Processors

I want to share a note about our latest experiences with SQL Server 2008 R2 EE X64 and Nehalem Processors. We have an active/active SQL Server cluster solution running on Two Dell R710 machines and Equallogic Storage Solution using iSCSI. Each Server has Two Intel X5550 processors, 64 GB of RAM and is running under Windows Server 2008 R2 EE X64.

We experienced performance degradation first when requesting I/O from a SAS RAID 10 Stripe Set (Dell Equallogic) for a particular node. After rebooting both nodes the behavior will occur just on the second node. The inconsistency let us to the rebuilding of the solution as we might have hit a problem with the MPIO configuration.

After rebuilding, everything went well. SQLIO, SQLIOSIM and later standard SQL Server Load Testing was performed and we experienced no issues. We reevaluated our storage needs and added a Tier 2 RAID 50 Equallogic solution with 32 drives. We had no problems with SQLIO and SQLIOSIM, but they began to reoccur during our SQL Server tests. In this particular scenario only node 1 was underperforming. When the SQL instances were running on Node 2, everything ran perfectly.

I started asking questions under the Twitter #sqlhelp topic. I had some feedback from multiple folks but we were yet to find the culprit of our problems.

I had the privilege to have an online session with Kevin Boles (@TheSQLGuru) who took a quick look to our configurations and tests, and was able to pinpoint that we were indeed having an issue with the I/O subsystem. He agreed that there were no obvious explanations on to why Node 1 will under perform and Node 2 would not. Kevin then suggested we shall disable Hyper-Threading from our servers. I took his recommendations and decided to wait for our vender to give us a response regarding our findings.

We continued researching and were given no explanation from our vender on why only Node 2 performed very well. We went ahead and rebuilt Node 1 from scratch; fresh OS installation, fresh SQL Server installation, addition of the node to the cluster, etc. SQLIO and SQLIOSIM tests gave back excellent results, but once SQL Server kicked in performance degradation reoccurred.

We opened a case with Microsoft and ran tests. Microsoft did not find any issues with SQL Server instances and they suggested we followed these recommendations:

  1. Patch the systems for bugs in Windows Server 2008 R2:
    1. Windows 2008 R2/win7: http://support.microsoft.com/kb/976700. Application stops responding, experiences low performance, or experiences high privileged CPU usage if many large I/O operations are performed in Windows 7 or Windows Server 2008 R2
    2. Windows 2008 R2/win7: http://support.microsoft.com/kb/979149. Computer that is running Windows 7 or Windows Server 2008 R2 becomes unresponsive when you run a large application
  2. Ensure that the Power Plans are configured as “High Performance”
    1. http://support.microsoft.com/kb/2254067
    2. http://support.microsoft.com/kb/2207548

We also decided to follow up with our vendor one more time and they suggested we shall disable Hyper-Threading (HT) on node 1. After doing so we lowered our degree of parallelism and ran tests. We were able to insert 10 million rows in under 45 seconds every time without any performance degradation.

Contrary to our findings, researching on the web let me to understand that our processors shall perform well with HT enabled and it was even suggested by a well-known SQL Server MVP Glenn Berry (@glennalanberry).

I then started questioning about the processors in node 1 and if maybe they have an issue. Because we did not want to spend (or waste) more time we decided to perform the same exercise on Node 2. We went ahead and ran tests and we are able to obtain similar results produced by node 1.

Microsoft contacted us and asked to submit our configurations and the steps taken to solve the problem. They told me we “might” have found a bug. We shall confirm once they get back to us.

For now, we will stick with the best practice of disabling HT on any new server implementation, with Nehalem CPUs or not.

As a final thought, I have pinpointed to the team the bright side about our experiences in the past few weeks. We are now able to add SQL nodes to a clustered environment pretty quickly and hook them up to our SAN back-end.

Share
  • alen

    interesting if this was a HT bug or something with parallel processing. i played with the latter and had some crazy performance issues and didn’t have time to get the configuration right.

    might also be a driver issue. we use HP and regularly update HP drivers. our procedure for a new server is to use HP Smartstart that also installs all the HP drivers/management software. we’ve had a few bugs along the way that were caused by driver issues.

    • What threw us off was the fact that an identical box used as second node with same exact HW, SW and drivers was performing well. Microsoft has the case open still; they called me on Friday and said it was a “research in progress”.

      We have deployed the environment to prod on Saturday and is performing very well with HT off.

  • Pingback: Installing SQL Server Checklist | Sean Gallardy()

  • Pingback: SQL Server Links and news the week 17/09/10 | John Sansom - SQL Server DBA in the UK()

  • Pingback: Query Performance Issue in SQL Server 2008 R2 « Oscar Zamora (@zamorao)()

  • Pingback: Microsoft Says there is nothing wrong with SQL Server 2008 R2 « Oscar Zamora (@zamorao)()

  • Pingback: Summarizing Performance issues and workarounds after migrating from 2005 to 2008 R2 [SQL Server] | Oscar Zamora()

  • James Wood

    We’re having some issues on our reporting instances – exact same set up as you have R710s, equallogic etc (minus the raid 50).
    We noticed that one core on each box was generally running at very high utilisation. Initially thought it was some sql process affinity issue. However, in perfmon, we checked the box to show kernel times and voila – the usage was all privileged.
    I logged a call with dell who suggested disabling the c states in the bios. We followed the advice and since then it seems like a different system – kernel times low and percieved performance enhancements increased en fold.

    (Prior to this I had already disabled hyperthreading but TBH didn’t notice a big difference – so maybe try this too… don’t really think power management is needed on you sql boxes!)

    • Thanks for the information about the C States. We have been progressively switching all production boxes to use “High Power” and disabling HT. We are no longer having CPU issues.

      • You’ll want to test performance with HT enabled and disabled. My cousin runs a *nix datacenter and his new 6core quad socket(24 physical cores total) 128GB SQL server runs about 50% faster for their loads with HT enabled. So, HT itself is really good now, but if Win/SQL has issues, you’ll want to disable.

        • Definitely something to be tested. Because of the time constraints and the production schedule setback, we opted to leave it disabled as it fixed our problems.

          Thanks for sharing

    • Jaroslav Sabacky

      C-states really seems to be the problem.
      Today I was facing the same issue:
      Clean installation of SQL Server 2008 is running on clean installation of Windows Server 2008 R2. Everything is installed on Hyper-V R2 (everything patched, W2K8R2 with SP1, SQL 2008 with SP2 and last security rollup). SQL is clean and unused with Reporting services installed.
      After SQL server started, there have been permanent CPU peaks (but not seen as privileged CPU time).
      After DISABLING C2 and C3 states (C6 remained ENABLED) the peaks are gone. Hyper Threading remained ENABLED and whole server is working fine.
      Power Management remained set to Balanced.

  • Pingback: Migration to 2008 R2. Not as easy as Copy & Paste [Performance] | Oscar Zamora()