I believe you are seeing the symptoms of an issue with Windows 2003, which requires contiguous memory and causes running processes, including Analysis Services, to trim their working set memory. With large amounts of memory allocated, this trim can take a significant amount of time to complete, and while it is running new allocations are blocked, causing the process to stall.
This issue is fixed from Windows 2008 onwards. This support article is for SQL Server but has links to all the relevant pages: http://support.microsoft.com/kb/918483. Note in particular this quote:
In Windows Server 2008, the allocation of physically contiguous memory
is greatly enhanced. Requests to allocate contiguous memory are much
more likely to succeed because the memory manager now dynamically
replaces pages, typically without trimming the working set or
performing I/O operations. In addition, many more types of pages—such
as kernel stacks and file system metadata pages, among others—are now
candidates for replacement. Consequently, more contiguous memory is
generally available at any given time. In addition, the cost to obtain
such allocations is greatly reduced.
To address this issue, I believe you will either have to upgrade to Windows 2008 or later, or test granting the Lock Pages In Memory privilege to the Analysis Services service account. I have also had memory allocation issues on Windows 2003 that required a reboot to clear.
This SQLCAT blog compares the changes between Windows Server 2003 & 2008.
This blog mentions some of the issues with LPIM and Analysis Services (near the end).
Also see this Knowledge Base article on working set trimming causing performance problems, and this article, which shows how to identify drivers that allocate contiguous memory.
Others have already pointed out the culprit: SQL Server accumulates updates in memory (in the buffer pool) and only flushes them out periodically (at checkpoints). The two options suggested (-k and the checkpoint interval) are complementary.
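For illustration, here is a minimal sketch of applying both knobs; the 1-minute interval and the 100 MB/sec rate are placeholder values you would tune for your workload:

```sql
-- Shorten the recovery interval so each checkpoint has fewer
-- dirty pages to flush at once (smaller, more frequent flushes).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval (min)', 1;  -- placeholder value
RECONFIGURE;

-- The -k option is not T-SQL: it is a startup parameter added in
-- SQL Server Configuration Manager, e.g. -k100 to cap checkpoint
-- I/O at roughly 100 MB/sec (placeholder rate).
```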
But I did not respond only to regurgitate the fine comments you have received so far :)
What you're seeing is, unfortunately, very typical behavior for queued processing. Whether you use Service Broker queues or opt for the tables-as-queues approach (a minimal sketch of which follows below), the system is very prone to this kind of behavior. This is because queue-based processing is write heavy, even more write heavy than OLTP processing. Both the enqueue and dequeue primitives are write operations, and there are almost no read operations. Simply put, queue processing will generate the most writes (= most dirty pages, and most log) of any workload, even OLTP (i.e. a TPC-C-like workload).
Very importantly, the writes of a queue workload follow an insert/delete pattern: every row inserted is very quickly deleted. This is important to distinguish from the append-only pattern of an insert-heavy (ETL) workload. You are basically feeding the ghost cleanup task a full meal, and you can easily outrun it. Think about what that means:
- enqueue is an insert: it will create a dirty page
- dequeue is a delete: it will dirty the same page again (it may catch the page before the checkpoint and so avoid a double flush, but only if it is lucky)
- ghost cleanup will clean up the page, making it dirty yet again
Yes, it really means that you may end up writing a page three times to disk, in three different IO requests, for each message you process (worst case). It also means that the random IO of checkpoints will be truly random, as the write point of the page will be revisited by those moving heads between two checkpoints (compare with many OLTP workloads, which tend to group their writes on a few 'hot spots'; queues do not).
So you have these three write points, racing to mark the same page dirty again and again. And that is before we consider any page splits, which queue processing may also be prone to because of the insert key order. By comparison, 'typical' OLTP workloads have a much more balanced read/write ratio, and the OLTP writes are distributed across inserts/updates/deletes, often with updates ('status' changes) and inserts taking the lion's share. Queue processing writes are exclusively inserts/deletes with, by definition, a 50/50 split.
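To make the insert/delete pattern concrete, here is a minimal tables-as-queues sketch (the table and column names are hypothetical); every page the enqueue dirties, the dequeue dirties again:

```sql
-- Hypothetical queue table: clustered key in insert order.
CREATE TABLE dbo.WorkQueue (
    Id      bigint IDENTITY PRIMARY KEY CLUSTERED,
    Payload varbinary(max) NOT NULL
);

-- Enqueue: an insert, dirtying a page.
INSERT INTO dbo.WorkQueue (Payload) VALUES (0x01);

-- Dequeue: a delete of the oldest row, usually dirtying the same
-- page again; READPAST lets concurrent consumers skip locked rows.
WITH NextItem AS (
    SELECT TOP (1) Id, Payload
    FROM dbo.WorkQueue WITH (ROWLOCK, READPAST)
    ORDER BY Id
)
DELETE FROM NextItem
OUTPUT deleted.Id, deleted.Payload;
```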
Some consequences follow:
- Checkpoint becomes a very hot issue (no longer a surprise for you)
- You'll see heavy fragmentation (the fragmentation per se won't matter much, as you are not going to do range scans, but your IO efficiency suffers and ghost cleanup has more work to do, slowing it down even more; see the query after this list)
- Your MDF storage random IO throughput is going to be your bottleneck
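If you want to see the second point for yourself, here is a quick check of fragmentation and of how far ghost cleanup is falling behind; the database name and the dbo.WorkQueue table continue the hypothetical example above:

```sql
-- Fragmentation and ghost record backlog for the queue table.
-- ghost_record_count is only populated in SAMPLED/DETAILED modes.
SELECT index_id,
       avg_fragmentation_in_percent,
       ghost_record_count,
       version_ghost_record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'MyDb'),                  -- placeholder database
         OBJECT_ID(N'dbo.WorkQueue'),
         NULL, NULL, 'DETAILED');
```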
My recommendation comes in three letters: S, S and D. Move your MDF to storage that can handle fast random IO. SSD. Fusion-IO if you have the money. Unfortunately, this is one of those symptoms that cannot be resolved by adding more cheap RAM...
Edit:
As Mark points out, you have two logical disks backed by one physical disk. Perhaps you tried to follow best practice and split the log onto D: and the data onto C:, but alas, it is to no avail: C: and D: are the same disk. Between checkpoints you achieve sequential throughput, but as soon as a checkpoint starts, the disk heads start to move and your log throughput collapses, taking the entire application throughput down with it. Make sure you separate the DB log so that it is not affected by data IO (a separate disk).
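One way to confirm whether log writes are stalling behind data IO on the shared spindle is to look at the per-file stall counters; a sketch:

```sql
-- Per-file IO stalls: if the LOG file shows high write stalls
-- that spike around checkpoints, the shared spindle is the issue.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.type_desc,                      -- ROWS (data) vs LOG
       mf.physical_name,
       vfs.io_stall_read_ms,
       vfs.io_stall_write_ms,
       vfs.num_of_writes
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY vfs.io_stall_write_ms DESC;
```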
Best Answer
For SSIS you can implement logging on the packages. A good place to start looking would be the `ProcessInput` and `PrimeOutput` steps of the components within the packages. The `PipelineComponentTime` event will be useful for checking these steps, as it records the time spent in each of them. You can read more about logging for SSIS here.
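If you point the package at the SSIS log provider for SQL Server, those events land in a table you can query; a sketch, assuming that provider is configured (it creates dbo.sysssislog in whichever database you chose):

```sql
-- Time spent per pipeline component, as recorded by the
-- PipelineComponentTime event (SSIS log provider for SQL Server).
SELECT source,        -- component name
       starttime,
       endtime,
       message        -- reports time in ProcessInput/PrimeOutput etc.
FROM dbo.sysssislog
WHERE event = N'PipelineComponentTime'
ORDER BY starttime;
```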
For SSAS you can use Extended Events to monitor performance while the processing is being done. The `ProgressReportEnd` and `ResourceUsage` events are two that you may want to examine during the times the cubes are processed, as well as the `RequestProcessBegin` and `RequestProcessEnd` events. This link further outlines the use of Extended Events in SSAS.