I/O Issues on a newly deployed SQL Server cluster attached to a SAN [Windows 2008 R2]

We executed the migration of stand-alone database servers to an active/active cluster. This was for our enterprise-wide reporting solution which includes a scaled-out deployment of Reporting Services implementing F5 BIGIP as a load balancer.

I/O challenges:

We migrated over the Distributor Instance and needed to re-snapshot several tables; planned for a 36 hour downtime. Small tables were copied over quickly but we were struggling with large tables. We had IO_COMPLETION and ASYNC_NETWORK_IO as top waiters during the bulk copy. In this scenario, data was brought over to the distributor instance and then bulk copied onto the subscriber. Both instances reside on the same node, each one with their properly assigned storage on Dell Equallogic SAN.

The first thing we attempted was to correct the link negotiation from auto to 1000/full. Changes were made and we experienced no change. All instances bounced immediately after the change, a surprise as we never thought that changing the configuration for 1 NIC will affect the database instances. It was neither the heartbeat nor the iSCSI NICs the ones affected.

We continued with the bulk copy until we found that a 110 GB table was taking extremely long to complete. I decided to exclude the largest tables from replication and approach another plan for them (backup, restore, batch insert and partition switch-in – I might write about this in another blog post).

We continued with the remaining objects and declared success on a Sunday at 10 PM. Everything looked good except for four of the very large tables that were not included due to the time required to copy them over.

IO Pending Requests ms.

IO Pending Requests ms.

Re-Snapshot and subsequent Rollback:

Something interesting happened. A bulk copy was rolling back on Monday morning which took around 8 hours to complete. The log file grew to 125 GB and hit the storage limit. We were not sure why as the table was bulk copied successfully a night before. We removed the table from replication and decided to continue troubleshooting the I/O problems.

I/O resolution:

I was able to reproduce the I/O symptoms at the OS level by copying files larger than 4.4 GB. I would see rates of 15 MB/s instead of the 500 MB/s I was seeing with smaller files. We failed over expecting to isolate the issue to 1 node, but the issue reoccurred. We decided to upgrade the NIC drivers to the latest published a few weeks ago, but no improvements were experienced. The SAN admin then found an article about disabling Windows Scaling heuristics and decided to give it a try.

1
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Tcpip\ParametersEnableWsd=0

Registry was updated, nodes were bounced (1 at a time) and tests were re-ran. No more I/O contention and replication was just flowing with a a few ms. of IO pending requests instead of 200 ~ 400 we were hitting before.

1
2
3
4
5
6
7
8
select
database_id,
file_id,
io_stall,
io_pending_ms_ticks,
scheduler_address
from sys.dm_io_virtual_file_stats(NULL, NULL) AS t1
INNER JOIN sys.dm_io_pending_io_requests AS t2 ON (t1.file_handle = t2.io_handle)

Re-Snapshot and subsequent Rollback, again:

I/O was fixed, so now we needed to re-snapshot the missing table that rolled back. We did on a Tuesday night, everything went well and then on Wednesday morning a rollback was pending. We again hit the storage limit of 125 GB for log. I asked the SAN admin to extend it to 200 GB, asked one of the DBAs to exclude the table for replication and let the undelivered commands flow to the subscriber.

Undelivered Transactions:

The rollback took another 6 hours to complete and at that point I was just waiting for replication to catch up. 30 minutes later, nothing happened. We continued waiting 30 more minutes and nothing happened either. Jobs were bounced, SQL Server Agent restarted on the Disitributor and Subscriber and nothing happened either. I asked a question on Twitter and was suggested to call Microsoft. Before I did, I called a friend of mine, Anthony Sammartino (Twitter | Blog), and left a message. While calling Microsoft, Anthony called me back and explained that there might be some errors with DMLs attempted on the subscriber. He discussed it with one of the DBAs and we were able to get replication going by changing the distributor agent profile. We had over 5 million undelivered commands pending.

Stabilization of the Environment:

By 8 PM we were up to date with all tables except the bulk copy that rolled back. I decided to stop making changes to the environment for the long Thanksgiving weekend and ensured that the environment was stable enough to continue working on it on the next Monday.

Re-Snapshot and subsequent Rollback resolution:

We figured that a filter was later giving an exception letting the publisher reinitialize replication for the object issuing a re-snapshot. It had been performing bulk copies and then rolling back several times during the night. The filter value was defined as column NOT LIKE ‘%value%’.

We decided to exclude the filter and re-snapshot the table today. 15 minutes later, 60 million rows were inserted into the subscriber. No issues so far.

Things are now looking much better.

MediaWhiz’s BI Solution Goes Live [Analytics]

Over a year ago we realized that the DBA team spent a very long time digging through data we had everywhere, counting billions and billions of rows for various systems, and that needed to be beautified in Excel with charts and the whole nine yards to be delivered to business owners.

That prompted us to take the decision and move to the next level. At the beginning, we called it our future Data Warehouse, then DataMart (as we were going to start with just one subject area), then Reporting Solution and finally decided to just call it “Business Intelligence”.

To speed up the process, we hired 2 consultants. A Data Analyst and a BI Architect. Everything started pretty good on the first week by deciding on what needed to be done first for us to start gathering requirements. Unfortunately other high priority tasks took over and the Data Analyst switched their duties to something else. That left us with just one BI Architect.

Anthony Sammartino (Twitter | Blog), of TekPartners Business Intelligence Solutions, assumed the role of Data Analyst, Project Manager and BI Architect. This was not easy as he needed to spend time creating proper questionnaires and meeting with different business owners to start gathering requirements. Then he needed to come up with proper schedules and plan the work.

As not everything worked according to plan (business owners needed to prioritize other work) we could not gather enough requirements from all of  the different businesses. Plus there was still this fuzzy understanding of what BI meant.

We approached a different strategy and started pinpointing the type of reports that were generated mostly from the transactional data. Anthony came with the great idea of replicating it to a new DB instance and let our users connect and just query it all day long. We setup a dedicated Distribution and Subscriber Servers under SQL Server 2008 Evaluation Edition. We named the Subscriber our Operational Data Store (ODS).

As we continued moving data from different Publishers we realized that the ODS had become a somewhat important system as more users started using it, and even the DBA team realized that was easier to join data from multiple businesses in one instance. We were still in Beta phase as we needed to prove that this was a viable purchase.

Then SQL Pass Summit 2009 came by, I went to Seattle, learned a ton of new stuff about Performance Tuning and BI. I came back to Florida and planned a migration of our Distribution and ODS to SQL Server 2008 R2 November CTP Edition. I had just two things in mind, PowerPivot and the new features provided by Reporting Services.

We executed the plan, rebuilt both servers with the latest CTP release, re-installed the SSL certificate for SSRS and we were done. This was performed on a weekend and was completely transparent to the end users.

We continued working with this environment, faced some challenges with multi-million row partitioned tables with no primary keys which could not be replicated, with the lack of redundancy and of course the CTP edition we were running on. But again, this was still in a Beta Phase.

Months came by, we continued deploying more subscribers, enhancing transactional reports and enabling filtered indexes and compression on our ODS to reduce the amount of I/Os.

A decision was taken as we wanted this system to go live with the RTM version of SQL Server 2008 R2. Licensing was purchased and new a hardware was queued up to be purchased later.

While this happened, we focused our efforts again on BI. We brought in Robin Haropulos from TekPartners. She eventually assumed the Project Manager Role and concentrated on one of our largest business units. The one that brought in 100 million records a day.

The team sat down with the business owners, end users, analysts and pretty much everyone that accessed that data, crunched from legacy transactional systems.

Pivoted Data

I decided that we needed a good ETL resource that knew SSIS inside and out. Luis Figueroa (Twitter| Blog), also from TekPartners, came on board, and while he was not yet an expert, he proved to us that he could master it in a matter of months. I bet on that and thankfully I was right.

I also wanted to have an additional resource on board who had a good knowledge on replication and the need to learn SSAS. Rich Bartho (Twitter| Blog) switched his main duties and started helping us out on this new project.

We came up with 28 dimensions, and 4 facts for the first release. We made sure the SSIS performed all its transformation as an isolated entity and only touched the source to extract the data. We also took advantage of Change Data Capture and Replication in order to have the systems pull from a replicated site instead of the main source and minimize production overhead.

After the BI architecture design, project plan, and first fact and dimension requirements  were complete, Anthony trained the entire team in the areas of SSAS cube design and development, Replication design, and BI Project Management to ensure we were on the correct path to complete the project successfully before he completed his assignment with us at Mediawhiz. I quickly moved another resource, James Yagielo (also from TekPartners), to help us out with the ETL packages. The project was running on track, the Infrastructure team finalized with the build of 4 new instances and a High Availability Active/Active Cluster with plenty of RAM and I/O to spare. We were hooked up to 48 spindles and 3 GB/s teamed dedicated network connections to the SAN. They also took care of the Reporting Services instances and created 4 VMs.

We went ahead and soft launched over 10 days ago, fired up a new distribution instance, a ODS/Historical instance, Dimensional instance and of course the SSIS/SSAS dedicated box. This solution can scale to multiple nodes if additional CPU cycles are required; one of the advantages of clustering. Reporting Services was launched as a scaled out deployment into multiple VMs and load balanced by F5.

We ensured that all the data was consistent, checked that the packages were efficient, performed health checks by looking at the wait stats and tuned the instances based on what we found.

Two days ago, we presented the solution to the business owners, with multiple sample reports generated by SSMS, Regular Pivot Tables and PowerPivot for Excel. We sliced and diced at the client level, server level and the results were coming back pretty fast. They were extremely happy as they just found a way to slice “everything by everything”. I even heard the statement ‎”This will be the company’s biggest asset”.

We will continue adding more facts in the next 3 months and be done with the bulk for this business unit. Then we will extend the current solution to include additional business units.

Stacked ChartWe will also finalize with the migration of the remaining objects, like Reports running on the old instance, remaining ODS tables, new replication deployments and of course the migration of a legacy SQL Server 2005 SSAS Cube.

We had a cube already? Yes, but we never called it BI, we called it just enhanced reporting. The Cube itself is fine but the problem is in the legacy ETL. That code will be wiped out and redeployed with best practices onto our SSIS instance.

All our solutions are running under SQL Server 2008 R2 RTM, Enterprise Edition.

Well, that’s it for now. I am lucky to be surrounded by a talented team that has allowed us to get to the next level. And this is just the beginning.