Holy cow, you've got a lot of questions in here:
How will the table be affected? During this import (updates and inserts), what could happen to other processes trying to read or write data to this table?
With careful index design and partitioning, you can get away with minimal disruption to other queries. Discussing how to implement partitioning is a little beyond what we can talk through here - check out Louis Davidson's book on SQL 2008 database design. It covers concurrency pretty well.
I know bulk insert functionality to be extremely fast compared to normal inserts. Does SQL Server 2008 provide something similar for UPDATES?
Not really. You can't update in bulk because you could be changing rows all over the table. This is where index design comes in (as explained in Louis's book).
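If the real problem is keeping the update's lock footprint manageable, the usual workaround is to do the update in chunks yourself. Here is a minimal sketch, assuming a hypothetical staging table dbo.StagingRows that was bulk-loaded with the new values and a hypothetical target table dbo.ImportTarget keyed by Id:

-- Update in batches of 5000 so each transaction stays short
-- and below the lock escalation threshold.
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) t
    SET t.Quantity = s.Quantity
    FROM dbo.ImportTarget AS t
    INNER JOIN dbo.StagingRows AS s ON s.Id = t.Id
    WHERE t.Quantity <> s.Quantity;   -- already-updated rows drop out, so the loop makes progress

    IF @@ROWCOUNT = 0 BREAK;
END;

This does not get you minimal logging the way a bulk insert can, but it keeps any single transaction small, which is usually what people are really after.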
All the rows to be imported will have the same UserId, so partitioning this table by UserId will improve performance, right? All the queries related to other users, which don't belong to the affected partition, will not be affected... Is this correct?
Only if your indexes are also partitioned by UserId, and that's rarely the case. Usually you'd want indexes arranged in a different manner - but again, that's covered in the book.
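For completeness, here is roughly what an aligned, partitioned setup looks like; every name and boundary value below is a placeholder, not a recommendation:

CREATE PARTITION FUNCTION pfUserId (int)
AS RANGE RIGHT FOR VALUES (10000, 20000, 30000);

CREATE PARTITION SCHEME psUserId
AS PARTITION pfUserId ALL TO ([PRIMARY]);

-- The table and its indexes live on the same scheme, so they are "aligned";
-- queries for other users only touch their own partitions.
CREATE TABLE dbo.ImportTarget
(
    Id int NOT NULL,
    UserId int NOT NULL,
    Quantity int NOT NULL,
    CONSTRAINT PK_ImportTarget PRIMARY KEY (UserId, Id)
) ON psUserId (UserId);

CREATE INDEX IX_ImportTarget_Quantity
ON dbo.ImportTarget (UserId, Quantity)
ON psUserId (UserId);

Note that this only isolates per-user activity because UserId leads both the keys and the partitioning; the moment a query needs an index ordered by something else first, you are back to the trade-off described above.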
This is not a complete answer; I do not have the time right now, so let me just share a few thoughts. The complete answer would be huge, and I am not sure you want all the details.
I have been working with various temporal queries for several years already, and learned a lot in the process. As such, I would rather not have to optimize your query in my production system. I would try very hard to avoid solving it with T-SQL. It is a complex problem. Itzik Ben-Gan has written about "gaps and islands" several times, including a chapter in his latest book on OLAP functions. Your problem is a variation of gaps and islands.
First, I would consider reading all the data to the client and solve it there using loops. I know, it requires sending data over the network, but fast loops in Java/C++/C# work very well for me most of the time. For instance, once I was struggling with a query involving time series and temporal data. When I moved most of the logic to the client, the C# solution was several times shorter and it ran 20,000 times faster. That's not a typo - twenty thousand times faster.
There is another problem with solving this kind of thing in T-SQL: performance may be unstable. If a query is complex, the optimizer can all of a sudden choose a different plan that runs many times slower, and you have to optimize all over again.
Alternatively, I would consider storing data differently. Right now I see two possible approaches.
First, instead of storing intervals we could use this table:
ClientId,
PrescriptionId,
DrugId,
Date
We can use trusted constraints to make sure each PrescriptionId covers a range of dates without gaps or overlaps, so that each interval is stored as one unbroken sequence of dates.
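Here is one way to do that with nothing but trusted constraints; the table name (dbo.PrescriptionDays) and the PreviousDate column are my inventions for the sketch. Each row points at the previous day of the same prescription, the CHECK forces the step to be exactly one day, and the UNIQUE constraint allows only one chain start (NULL PreviousDate) per prescription:

CREATE TABLE dbo.PrescriptionDays
(
    ClientId int NOT NULL,
    PrescriptionId int NOT NULL,
    DrugId int NOT NULL,
    Date date NOT NULL,
    PreviousDate date NULL,
    CONSTRAINT PK_PrescriptionDays PRIMARY KEY (PrescriptionId, Date),
    -- at most one row per prescription can start the chain
    CONSTRAINT UNQ_PrescriptionDays_Prev UNIQUE (PrescriptionId, PreviousDate),
    -- consecutive days only: no gaps inside a prescription
    CONSTRAINT CHK_PrescriptionDays_Step
        CHECK (PreviousDate IS NULL OR Date = DATEADD(DAY, 1, PreviousDate)),
    -- the previous day must actually exist for the same prescription
    CONSTRAINT FK_PrescriptionDays_Chain
        FOREIGN KEY (PrescriptionId, PreviousDate)
        REFERENCES dbo.PrescriptionDays (PrescriptionId, Date)
);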
Note: I know you are using DISTINCT in your first subquery, so you are assuming that one person can take one drug on one day under more than one prescription. For simplicity, I am not making this assumption. Are you sure it is a correct assumption? If yes, we will have to change the design.
Once we have this table, we can essentially materialize your first subquery as an indexed view (again, the names are just placeholders):
CREATE VIEW dbo.DistinctDrugsPerDay WITH SCHEMABINDING AS
SELECT ClientId, Date, COUNT_BIG(*) AS DistinctDrugs
FROM dbo.PrescriptionDays
GROUP BY ClientId, Date;
GO
CREATE UNIQUE CLUSTERED INDEX IX_DistinctDrugsPerDay ON dbo.DistinctDrugsPerDay (ClientId, Date);
That done, you can use your second subquery to group the data points into intervals, or just run it on the client, where it can be solved with one trivial loop.
Second approach: instead of intervals I would store a sequence of events. There would be two kinds of events: interval start and interval end. With every event I would store a running total: the number of open events after this event has happened. Essentially, this running total is the number of prescriptions active after the event has happened.
As in the previous approach, much of the data you currently calculate on the fly every time you run your query is precalculated in this table. We can use trusted constraints to ensure the integrity of the precalculated data.
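A sketch of what I mean, using the same previous-row chaining trick to keep the running total trustworthy (all names are placeholders, and I am assuming at most one event per client, drug and day):

CREATE TABLE dbo.PrescriptionEvents
(
    ClientId int NOT NULL,
    DrugId int NOT NULL,
    EventDate date NOT NULL,
    EventType smallint NOT NULL,      -- +1 = interval start, -1 = interval end
    ActiveCount int NOT NULL,         -- prescriptions active after this event
    PreviousEventDate date NULL,
    PreviousActiveCount int NULL,
    CONSTRAINT PK_PrescriptionEvents PRIMARY KEY (ClientId, DrugId, EventDate),
    CONSTRAINT UNQ_PrescriptionEvents UNIQUE (ClientId, DrugId, EventDate, ActiveCount),
    CONSTRAINT CHK_PrescriptionEvents_Type CHECK (EventType IN (1, -1)),
    -- the running total must agree with the previous event's total
    CONSTRAINT CHK_PrescriptionEvents_Total
        CHECK (ActiveCount >= 0 AND ActiveCount = COALESCE(PreviousActiveCount, 0) + EventType),
    -- the previous event must actually exist with exactly that total
    CONSTRAINT FK_PrescriptionEvents_Chain
        FOREIGN KEY (ClientId, DrugId, PreviousEventDate, PreviousActiveCount)
        REFERENCES dbo.PrescriptionEvents (ClientId, DrugId, EventDate, ActiveCount)
);

With this in place, the number of prescriptions active at any point is a simple lookup of ActiveCount rather than something you recompute every time.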
I can describe it in more detail later if you are interested.
Best Answer
It definitely could.
It requires locks just like any other insert operation. If enough locks are taken it will escalate to a full table lock (assuming the table allows it). Any insert operation like this would block anything else that was trying to read the data, unless NOLOCK was specified for those queries (which I am not recommending here).
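If that escalation is the part that worries you, SQL Server 2008 lets you control it per table. For example (using a made-up table name):

-- TABLE is the default; AUTO escalates per partition on partitioned tables;
-- DISABLE mostly prevents escalation to the table level.
ALTER TABLE dbo.ImportTarget SET (LOCK_ESCALATION = AUTO);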
You can actually set up a trace to see what types of locks are being taken if you are interested. Just set up a Profiler trace that captures the Lock:Acquired and Lock:Escalation events, and then do a bulk insert.
Word of warning: Profiler itself can cause a performance hit, so I would either run the trace on a dev server or, if you have to run it in production, export the script and execute it as a server-side trace.
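If even a server-side trace feels too heavy, you can also watch the locks from a second connection while the import runs; sys.dm_tran_locks is cheap to query. For example (52 is a placeholder for the spid of the session doing the import, e.g. from sp_who2):

SELECT resource_type, request_mode, request_status, COUNT(*) AS lock_count
FROM sys.dm_tran_locks
WHERE request_session_id = 52
GROUP BY resource_type, request_mode, request_status
ORDER BY lock_count DESC;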