Migration to 2008 R2. Not as easy as Copy & Paste [Performance]

We have gone through several migrations already, mixing and matching sources (SQL Server 2000 and 2005, x86 and x64, Enterprise to Standard, etc.). We now have most of our databases running under SQL Server 2008 R2.

In every instance we have found performance issues related to Hyper-Threading, bad plans, missing registry keys, or outdated iSCSI drivers. We have reached a point where we are ready to analyze and solve these challenges soon after each migration is performed.

We recently migrated a core DB from 2005 to 2008 R2, from x86 to x64, and from Enterprise Edition to Standard Edition. The system was working, but the application was sporadically timing out. After digging through the logs, collecting info from server-side traces, and using Adam Machanic’s sp_whoisactive tool, we found the culprit: eight services pulling data through one stored procedure were bringing the box to its knees. With updated stats, better hardware, and more memory available, we never expected this to happen. But it did.
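For reference, this is roughly how we lean on sp_whoisactive during incidents like this one; the parameters shown are simply the ones I usually reach for, not necessarily the exact call we ran that day:

EXEC dbo.sp_WhoIsActive
    @get_plans = 1,           -- include the query plan for each active request
    @find_block_leaders = 1;  -- flag sessions sitting at the head of a blocking chain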

We added an index and narrowed the scope of a WHERE clause inside the stored procedure. The procedure went from ~30 minutes and ~100 million reads to 10 seconds and ~15,000 reads.
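Conceptually, the change looked something like the sketch below; the table, column, and index names are made up for illustration and the real objects are different:

-- Hypothetical names for illustration only.
CREATE NONCLUSTERED INDEX IX_ServiceQueue_TypeId_ModifiedDate
    ON dbo.ServiceQueue (TypeId, ModifiedDate)
    INCLUDE (Payload);
GO

-- Inside the stored procedure: narrow the WHERE clause so the optimizer
-- can seek on the new index instead of reading far more data than needed.
SELECT Payload, ModifiedDate
FROM dbo.ServiceQueue
WHERE TypeId = @TypeId
  AND ModifiedDate >= DATEADD(HOUR, -1, GETDATE());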

Small changes, huge differences. The key is to spot the issue, understand it, and work around it. Then schedule a production push and QA the process.

Reads out of hand

Things look very good now.

Allow me to sleep and focus on number crunching [#TSQL2sday]

T-SQL Tuesday #014

It’s 2011 already and the first T-SQL Tuesday for the year has arrived.

This time the event is hosted by Jen McCown (Blog|Twitter), who just received an MVP nomination (congratulations Jen).

I have come up with a list of things that I want covered by my team this year, but my main goals are two: allowing me to sleep properly and investing time in number crunching.

What does “allow me to sleep” mean?

Last year we started our project of migrating all production database servers to high availability solutions. We will continue this year and finalize it. By the same token, we will homogenize our environment so that every single business unit runs on SQL Server 2008 R2. We spent real time last year working around processes that cannot run on SQL Server 2005 (e.g. Change Data Capture), and I need to make sure that does not happen anymore.

On the storage side, there are two projects running in parallel. One is to allow growth on a scalable solution that is already in place (and stop worrying about hitting volume limits); the second is to reduce our data file footprint. The first project is trivial, as we can hook up additional trays and expand volumes. The second is more complex, as we need to identify tables that hold historical data, partition them (if applicable), phase out old data, and archive it without impacting production applications; a rough sketch of the phase-out piece follows.
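A minimal sketch of that phase-out step, assuming a table already partitioned by date and an archive table with an identical structure sitting on the same filegroup as the partition being switched out (all names are hypothetical):

-- Hypothetical objects. SWITCH is a metadata-only operation, so the impact
-- on the production table is minimal.
ALTER TABLE dbo.TransactionHistory
    SWITCH PARTITION 1 TO dbo.TransactionHistory_Archive;

-- The archive table can then be backed up, rolled up, or purged
-- without touching the production table.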

On the monitoring side, we are starting a new project to decide whether we stick with Management Data Warehouse or get a third-party tool to give us a bird's-eye view of our systems at the database level, define KPIs, and report based on them. We already have proper hardware and service-level monitoring in place.

What does “focus on number crunching” mean?

This has 2 phases.

The first one is to work on a master metadata repository project and figure out the relationships between different business entities, then work with business owners and data analysts to figure out what we should report on, how far back we should go, and what KPIs should be defined. This will allow us to continue building our data warehouse, extending ETLs, and letting business unit A correlate its data with business unit B.

This project is also tied to the data file footprint reduction, as we will be able to roll up information so it is available in our DSS without needing all the transactional data.

The second phase is analytics. This can be tied to data mining and statistics and will definitely involve some of the information we have stored in our cubes. We are still figuring this one out.

That is it for me on the technical side. On the managerial and personal side? I will write about those in future blog posts.

What are your resolutions for 2011?

What’s happening in 2011

2010 came by pretty quickly. It has been a very busy year with all kinds of expansions, migrations, and consolidations. We installed HP Blade Servers, VMware, SQL Server 2008 R2, Windows Server 2008 R2, CentOS 4.2 x64, F5s, Dell R710s/R610s, Dell EqualLogic storage, TFS 2010, Windows 7 x64 for all DBAs and BI developers, etc. The list goes on.

So what is happening next year? Many exciting things, with a core focus on HA, DR, analytics, SharePoint 2010, and VMware on production databases.

On the Database side:

  • Homogeneous environment. All instances running under SQL Server 2008 R2 (some edition downgrades in the mix)
  • Isolated Development, QA, and UAT environments with the proper editions (Developer, Standard, Enterprise). VMware and MSDN come into play.
  • Management Data Warehouse deployment to collect health information from all production instances. Bird's-eye view of all instances in one location.
  • Central Management Servers Implementation
  • Policy Based Management Implementation
  • Data-Tier Application Implementation
  • RML utilities for pre-production load testing
  • PowerShell

On the Analytics side:

  • All legacy cubes migrated to SQL Server Analysis Services 2008 R2
  • Deployment of Star Schemas for the remaining business units
  • Dedicated instance for transformations (SSIS)
  • Fire up 3rd and 4th VM for our Scaled-out Deployment of Reporting Services
  • Mastering DAX. PowerPivot for Excel 2010
  • GUI implementation for browsing our cubes. We are in the process of evaluating products.

High Availability & Disaster Recovery:

  • All instances to be moved to HA (2-, 3-, and 4-node clusters). No more single points of failure.
  • Zero-Downtime database migration with Mirroring
  • Block Level Backup and restore offsite
  • Full, log, and daily differential backups across the board.
  • Backup file testing by restoring to our UAT environments (see the sketch below)
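That last bullet boils down to something like this against a recent backup file; the paths, logical file names, and database names below are placeholders:

-- Placeholders for illustration only.
RESTORE VERIFYONLY
FROM DISK = N'\\backupserver\sqlbackups\CoreDB_FULL.bak';

RESTORE DATABASE CoreDB_UAT
FROM DISK = N'\\backupserver\sqlbackups\CoreDB_FULL.bak'
WITH MOVE N'CoreDB_Data' TO N'E:\SQLData\CoreDB_UAT.mdf',
     MOVE N'CoreDB_Log'  TO N'F:\SQLLog\CoreDB_UAT_log.ldf',
     STATS = 10;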

Storage:

  • Footprint Reduction by archiving historical data that can be rolled up
  • Storage Expansion to benefit from the spindle count
  • SSD for the most demanding applications (over iSCSI)

Training:

  • I am encouraging my team members to use Twitter, follow knowledgeable folks, and read tweets from the #sqlserver and #sqlhelp hashtags.
  • To read blog posts and subscribe to RSS feeds
  • To attend SQL Saturdays
  • To buy books (and e-books)
  • To get trained online and, if necessary, bring someone in to train us.
  • To attend the SQL Rally 2011 or SQL Pass Summit 2011
  • To read about SQL Server Denali and install it on a VM to test.

I see 2011 being an excellent year for us.

Happy Holidays everyone.


Table Structure, Trading Maintenance with Performance [SQL Server]

We recently migrated a database from SQL Server 2005 x86 to 2008 R2 x64. Running an Enterprise Edition license, we were able to take advantage of partitioning. The database had several tables with similar structures, with row counts north of 2 billion.

Old Structure:

  1. 3 fields
    1. TypeId INT
    2. ContentData VARCHAR(100)
    3. insertionDate DATETIME
  2. Unique constraint: TypeId, ContentData. Added InsertionDate to comply with partitioning requirements.
  3. Table had a partition per month (see the sketch after this list).
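As a rough sketch of the old layout (partition function, scheme, filegroup names, and boundary values are hypothetical; one partition per month, each on its own filegroup so older months could be flagged read-only and backed up once):

-- Hypothetical names and boundaries for illustration.
CREATE PARTITION FUNCTION pf_ByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('2010-11-01', '2010-12-01', '2011-01-01');

CREATE PARTITION SCHEME ps_ByMonth
AS PARTITION pf_ByMonth
TO (FG_2010_10, FG_2010_11, FG_2010_12, FG_2011_01);

CREATE TABLE dbo.sampletable
(
    TypeId        INT          NOT NULL,
    ContentData   VARCHAR(100) NOT NULL,
    insertionDate DATETIME     NOT NULL,
    -- insertionDate is in the key only to satisfy the partitioning requirement
    CONSTRAINT UQ_sampletable UNIQUE (TypeId, ContentData, insertionDate)
) ON ps_ByMonth (insertionDate);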

Pros

  1. During the first week of each new month, we marked the previous month's partition as read-only.
  2. A one-time file backup was performed for the previous month's partition.
  3. Deltas (new data) would always be located on the last partition
  4. The current month's partition would be the only piece left in read-write mode.

Cons

  1. Searching ContentData for a particular TypeId produced as many scans as there were partitions
  2. Bulk data comparison between new and old TypeId + ContentData performed an index seek on each partition
SELECT ContentData
INTO #temp
FROM sampletable
WHERE TypeId  = 41;

Table 'sampletable'. Scan count 92, logical reads 7613, physical reads 223, read-ahead reads 7808, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Original Plan

New Structure:

  1. 3 fields
    1. TypeId INT
    2. ContentData VARCHAR(100)
    3. insertionDate DATETIME
  2. Unique constraint: TypeId, ContentData.
  3. Table has a partition per TypeId (range of 50 items each, e.g. 1,2, … ,50).
  4. Index on insertionDate with TypeId as include
  5. Page Compression enabled (see the sketch after this list).
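And a hedged sketch of the new layout (again, names and boundary values are hypothetical; each partition covers a range of 50 TypeId values):

-- Hypothetical names for illustration.
CREATE PARTITION FUNCTION pf_ByTypeId (INT)
AS RANGE RIGHT FOR VALUES (51, 101, 151);

CREATE PARTITION SCHEME ps_ByTypeId
AS PARTITION pf_ByTypeId ALL TO ([PRIMARY]);

CREATE TABLE dbo.sampletable
(
    TypeId        INT          NOT NULL,
    ContentData   VARCHAR(100) NOT NULL,
    insertionDate DATETIME     NOT NULL,
    CONSTRAINT UQ_sampletable UNIQUE (TypeId, ContentData)
) ON ps_ByTypeId (TypeId)
WITH (DATA_COMPRESSION = PAGE);

-- Supporting index for delta retrieval, partition-aligned on TypeId.
CREATE NONCLUSTERED INDEX IX_sampletable_insertionDate
    ON dbo.sampletable (insertionDate)
    INCLUDE (TypeId)
    WITH (DATA_COMPRESSION = PAGE)
    ON ps_ByTypeId (TypeId);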

Pros

  1. Searching for a particular TypeId + ContentData generates a unique index seek plan, 1 scan count (1 partition).
  2. Scanning for ContentData for a particular TypeId generates an index scan for only 1 partition
  3. Bulk Data comparison between new and old TypeId + ContentData performs an index seek in one partition only
  4. Compression reduced I/O reads by 50%

Cons

  1. All data files are in read-write mode. Every file needs to be considered in the backup maintenance plan
  2. Any partition can suffer a block update, even if it holds data from 4 years ago (it cannot be tagged as historical)
  3. Retrieving deltas performs an index seek on every partition
SELECT ContentData
INTO #temp
FROM sampletable
WHERE TypeId  = 41;

Table 'sampletable'. Scan count 1, logical reads 5449, physical reads 66, read-ahead reads 5445, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
New Plan

Conclusion:

The table is mostly used for retrieving and comparing data. The retrieval call will always pass in TypeId as a parameter. The system now performs an index scan for 1 partition (1 scan count). For data comparison, an index seek is performed for a unique value under 1 partition only. Data compression reduced I/O reads by 50%.

In the past, data extraction and comparison scanned through all the partitions generating excessive I/O calls, but we did not need to worry about continuously backing up older partitions as they were already flagged as read-only and backed up.

With this exercise we have effectively increased performance at the expense of maintenance. Well worth the change.

I/O Issues on a newly deployed SQL Server cluster attached to a SAN [Windows 2008 R2]

We executed the migration of stand-alone database servers to an active/active cluster. This was for our enterprise-wide reporting solution which includes a scaled-out deployment of Reporting Services implementing F5 BIGIP as a load balancer.

I/O challenges:

We migrated over the Distributor instance and needed to re-snapshot several tables; we planned for a 36-hour downtime. Small tables were copied over quickly, but we were struggling with large tables. We saw IO_COMPLETION and ASYNC_NETWORK_IO as top waiters during the bulk copy. In this scenario, data was brought over to the Distributor instance and then bulk copied onto the Subscriber. Both instances reside on the same node, each with its own assigned storage on the Dell EqualLogic SAN.
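For reference, one quick way to check the top waiters is a query along these lines against sys.dm_os_wait_stats; the exclusion list below is illustrative, not exhaustive:

-- Top waits since the last service restart (or since the stats were cleared).
SELECT TOP (10)
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP',
                        N'SQLTRACE_BUFFER_FLUSH', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;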

The first thing we attempted was to correct the link negotiation from auto to 1000/full. The change was made and we saw no improvement. All instances bounced immediately after the change, which was a surprise, as we never thought that changing the configuration of one NIC would affect the database instances; it was neither the heartbeat nor the iSCSI NIC that was touched.

We continued with the bulk copy until we found that a 110 GB table was taking extremely long to complete. I decided to exclude the largest tables from replication and approach another plan for them (backup, restore, batch insert and partition switch-in – I might write about this in another blog post).

We continued with the remaining objects and declared success on a Sunday at 10 PM. Everything looked good except for four of the very large tables that were not included due to the time required to copy them over.

IO Pending Requests ms.

Re-Snapshot and subsequent Rollback:

Something interesting happened: a bulk copy was rolling back on Monday morning, which took around 8 hours to complete. The log file grew to 125 GB and hit the storage limit. We were not sure why, as the table had been bulk copied successfully the night before. We removed the table from replication and decided to continue troubleshooting the I/O problems.

I/O resolution:

I was able to reproduce the I/O symptoms at the OS level by copying files larger than 4.4 GB; I would see rates of 15 MB/s instead of the 500 MB/s I was seeing with smaller files. We failed over expecting to isolate the issue to one node, but the issue reoccurred. We upgraded the NIC drivers to the latest version, published a few weeks earlier, but saw no improvement. The SAN admin then found an article about disabling Windows Scaling heuristics and decided to give it a try.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
EnableWsd = 0

The registry was updated, the nodes were bounced (one at a time), and the tests were re-run. No more I/O contention, and replication was flowing with just a few ms of pending I/O requests instead of the 200-400 ms we were hitting before.

SELECT
    t1.database_id,
    t1.file_id,
    t1.io_stall,
    t2.io_pending_ms_ticks,
    t2.scheduler_address
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS t1
INNER JOIN sys.dm_io_pending_io_requests AS t2 ON (t1.file_handle = t2.io_handle);

Re-Snapshot and subsequent Rollback, again:

The I/O was fixed, so now we needed to re-snapshot the missing table that had rolled back. We did so on a Tuesday night, everything went well, and then on Wednesday morning a rollback was pending. We again hit the 125 GB storage limit for the log. I asked the SAN admin to extend it to 200 GB, asked one of the DBAs to exclude the table from replication, and let the undelivered commands flow to the Subscriber.

Undelivered Transactions:

The rollback took another 6 hours to complete, and at that point I was just waiting for replication to catch up. 30 minutes later, nothing had happened. We waited 30 more minutes and still nothing. Jobs were bounced and SQL Server Agent was restarted on the Distributor and the Subscriber, with no change. I asked a question on Twitter and it was suggested that I call Microsoft. Before I did, I called a friend of mine, Anthony Sammartino (Twitter | Blog), and left a message. While I was calling Microsoft, Anthony called me back and explained that there might be errors from DML attempted on the subscriber. He discussed it with one of the DBAs and we were able to get replication going by changing the Distribution Agent profile. We had over 5 million undelivered commands pending.
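One way to keep an eye on how much is still queued is to ask the distribution database through the documented sp_replmonitorsubscriptionpendingcmds procedure; all server, database, and publication names below are placeholders:

-- Placeholders for illustration; run in the distribution database.
EXEC distribution.dbo.sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'PUBLISHER01',
    @publisher_db      = N'CoreDB',
    @publication       = N'CoreDB_Pub',
    @subscriber        = N'SUBSCRIBER01',
    @subscriber_db     = N'CoreDB_Reporting',
    @subscription_type = 0;  -- 0 = push subscription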

Stabilization of the Environment:

By 8 PM we were up to date with all tables except the one whose bulk copy had rolled back. I decided to stop making changes to the environment for the long Thanksgiving weekend and made sure it was stable enough to continue working on the following Monday.

Re-Snapshot and subsequent Rollback resolution:

We figured out that a filter was throwing an exception, which caused the publisher to reinitialize replication for the object and issue a re-snapshot. It had been performing bulk copies and then rolling back several times during the night. The filter was defined as column NOT LIKE ‘%value%’.
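For context, a horizontal filter like that is typically defined on the article with something along the lines of sp_articlefilter; the publication, article, and column names below are placeholders, not our actual objects:

-- Placeholders for illustration; run at the Publisher on the published database.
EXEC sp_articlefilter
    @publication   = N'CoreDB_Pub',
    @article       = N'SomeTable',
    @filter_name   = N'FLT_SomeTable',
    @filter_clause = N'SomeColumn NOT LIKE ''%value%''';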

We decided to exclude the filter and re-snapshot the table today. Fifteen minutes later, 60 million rows had been inserted into the subscriber. No issues so far.

Things are now looking much better.