Batching a large data set into smaller batches, e.g. 100,000 or 1,000,000 rows, will make the load run better than one big insert. The same is true of SSIS, of course, since it batches inserts as well.
The difference in time that you show in your example is fairly small. That does not give much hope for dramatic speed improvements, but it does suggest that other import methods are worth trying.
EDITS included below.
If you are willing to first extract to a .csv file, you could use BCP to extract from a view to the .csv file, and then use BULK INSERT to load that file into your database. BULK INSERT is described here: http://msdn.microsoft.com/en-us/library/ms188365.aspx
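As a rough sketch of the extract step (the view name, output path, and server name below are placeholders; -T uses Windows authentication, -c exports character data, and -t sets the field terminator):
bcp "SELECT col1, col2 FROM dbo.SourceView" queryout "C:\staging\source_data.csv" -c -t"," -S MyServer -T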
If you want to do the imports all in T-SQL then you might use the BULK INSERT command. When you look at its parameters, you will see that it includes KILOBYTES_PER_BATCH and ROWS_PER_BATCH as ways of controlling batch size.
Because of the speed of BCP and BULK INSERT, along with that control of the batch size, I believe this would be one of the quicker methods to use. But it does require the intermediate .csv file.
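A minimal sketch of the load step, assuming the hypothetical file and table names used above (the batch value is illustrative):
BULK INSERT dbo.TargetTable
FROM 'C:\staging\source_data.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '\n',
    ROWS_PER_BATCH = 100000, -- approximate number of rows in the file; helps the server optimize the load
    TABLOCK                  -- table-level lock, which can enable minimal logging into a heap
);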
Other issues: is your data coming from the same server, or from another server? If the same server, then access is fairly easy. If it is on another server and you do not want the intermediate .csv file, you can create a linked server and query the data from the other server.
However, as you noted, this will mean that you have to manage the batches yourself. This is similar to what you described as a 'cursor loop', although it probably does not need a cursor, just a loop selecting the next 'n' rows that you want to copy, as sketched below. If the data is on another server, the overhead of this approach will be higher.
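A minimal sketch of such a loop, assuming hypothetical source and target tables that share an increasing integer id key, with the target starting out empty and 100,000 rows copied per batch:
DECLARE @batch_size int = 100000;
DECLARE @last_id int = 0;
DECLARE @rows_copied int = 1;

WHILE @rows_copied > 0
BEGIN
    -- copy the next batch of rows above the current watermark
    INSERT INTO dbo.TargetTable (id, col1, col2)
    SELECT TOP (@batch_size) s.id, s.col1, s.col2
    FROM dbo.SourceTable AS s
    WHERE s.id > @last_id
    ORDER BY s.id;

    SET @rows_copied = @@ROWCOUNT;

    -- advance the watermark to the highest id copied so far
    SELECT @last_id = ISNULL(MAX(id), @last_id) FROM dbo.TargetTable;
END;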
I think it could be helpful for you to think of MapReduce as (essentially) a distributed query engine. I know it isn't one-to-one, but with SELECT and aggregate functions such as sum(), max(), etc. being very much like a Map() operation, and GROUP BY being very much like a Reduce(), there is quite a bit of similarity.
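For instance, in a hypothetical aggregate query against your locations table, the per-row evaluation and aggregate functions play the Map role and the grouping plays the Reduce role:
SELECT
criteria_3,                    -- grouping key, analogous to the key emitted by a Map() step
count(*) AS matching_rows,     -- aggregation within each group, analogous to Reduce()
max(criteria_1) AS max_criteria_1
FROM locations
GROUP BY criteria_3;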
In your case, on the same hardware, I think the only benefit you will be able to gain from using MapReduce is the distribution of the work across multiple cores. Remember that query processes in PostgreSQL only use a single core per query, so a lot of your 64 cores go to waste if you're running this one query at a time.
Parallelization
Perhaps you'd be able to break up the query so that it runs over segments of your locations table, and then run those queries in parallel using a connection pooler like pgBouncer?
For some good info about accessing data in sequential chunks (pagination), check out this amazing blog post by Markus Winand.
Here is an example of how you might break this into chunks (warning: untested SQL ahead):
SELECT
id
FROM
(SELECT * FROM locations LIMIT x OFFSET y) AS locs_fragment
WHERE
a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
criteria_2 = 23 AND
criteria_3 = 'ABCD' AND
(criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
criteria_5 = 'A' AND
criteria_6 = 1;
Note that the sub-select SELECT * FROM locations LIMIT x OFFSET y, given no additional criteria, will extract rows in ctid order. By selecting appropriate LIMIT and OFFSET values for a range of queries, you can essentially run the query in parallel.
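For example (a chunk size of 1,000,000 rows is purely illustrative), each of four parallel sessions would use a different slice of the table in the sub-select, keeping the outer WHERE clause the same:
SELECT * FROM locations LIMIT 1000000 OFFSET 0;       -- session 1: first million rows
SELECT * FROM locations LIMIT 1000000 OFFSET 1000000; -- session 2: second million rows
SELECT * FROM locations LIMIT 1000000 OFFSET 2000000; -- session 3: third million rows
SELECT * FROM locations LIMIT 1000000 OFFSET 3000000; -- session 4: fourth million rows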
Denormalization
Don't rule out denormalization as an option as well. In this case, it can help you get the results you want much more quickly. Of course, denormalization will only really help if your query criteria are fairly stable over time, so that you can effectively pre-calculate result sets.
I'd say there are two decent options:
(1) Use partial indexing, where a given index is designed around some set of criteria, for example:
CREATE INDEX idx_locations_criteria_set_0
ON locations (id)
WHERE
a_lat BETWEEN SYMMETRIC (38.5565 - 0.00055533333334) AND (38.5565 + 0.00055533333334) AND
a_lon BETWEEN SYMMETRIC (-77.2797222222222 - 0.00055533333334) AND (-77.2797222222222 + 0.00055533333334) AND
criteria_1 BETWEEN SYMMETRIC (42.7 - 1.5) AND (42.7 + 1.5) AND
criteria_2 = 23 AND
criteria_3 = 'ABCD' AND
(criteria_4 = '0014980726' OR criteria_4 IS NULL) AND
criteria_5 = 'A' AND
criteria_6 = 1;
With such an index defined, if you make a query with a matching predicate, an index scan is performed rather than having to perform all the calculations for the SELECT statement.
(2) Another denormalization option is to use triggers during INSERT on your locations table to verify whether certain conditions are met and, when they are, store the id of that locations entry in another table, thus making a simple table scan all you need to do at a later time to find matching entries.
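A rough sketch of that approach, with a made-up side table and trigger function, and with the predicate abbreviated to just the equality checks from the query above:
CREATE TABLE locations_criteria_set_0 (id integer PRIMARY KEY);

CREATE OR REPLACE FUNCTION record_matching_location() RETURNS trigger AS $$
BEGIN
    -- only the equality conditions are shown; the full predicate would mirror the query above
    IF NEW.criteria_2 = 23 AND NEW.criteria_3 = 'ABCD' AND NEW.criteria_5 = 'A' AND NEW.criteria_6 = 1 THEN
        INSERT INTO locations_criteria_set_0 (id) VALUES (NEW.id);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_locations_criteria_set_0
AFTER INSERT ON locations
FOR EACH ROW EXECUTE PROCEDURE record_matching_location();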
Both of these cases will require additional storage space, as well as some minor INSERT overhead for predicate checking, but they allow you to distribute your predicate checking over time rather than being forced to do it all at once when a query is issued.
Conclusions
As with any advice on the internet, take mine with a grain of salt! :P I've done some work in both of these areas, but that doesn't make me an expert in your application area. Run some small scale tests on either approach to see what works best for your application.
Best Answer
You say that you're loading data from one table to another in the same database without transformations. SSIS isn't really the right tool for the job. With a reasonable table structure, properly written T-SQL inserts will perform better than loading through SSIS, because you avoid the overhead of sending data to another server, passing it through buffers, and sending it back to the destination server. Behind the scenes SSIS is just doing a BULK INSERT, which isn't going to be faster than a T-SQL INSERT except in edge cases.
In the interest of giving a complete answer, I'll describe a few of the features of SSIS which can be helpful from a data loading performance perspective. Consider a table in SQL Server with a complex five-column clustered index and many nonclustered indexes, and suppose you can't disable the NCIs. You might get the best possible loading performance by splitting the insert into batches, because smaller sorts will be used to do the NCI maintenance. It may be difficult to code an efficient batching solution in SQL Server with a five-column clustered key, but this is trivial to do in SSIS: it can read the data in order and split it into batches for you just by changing a parameter. So in that case you may get better performance with SSIS, because the T-SQL code wasn't written to be as efficient as possible.
For another example, a BULK INSERT is able to take a BU lock on a target heap. BU locks are compatible with other BU locks, so you can get multiple concurrent bulk inserts all loading minimally logged data into the same heap without blocking. That cannot be accomplished with T-SQL alone, because the BU lock is not available in that context. SQL Server 2017 offers parallel inserts into heaps in T-SQL, so the benefit of a BU lock is diminished when loading data from a table with an identical structure. An example where a BU lock is better than anything in T-SQL would be loading from a table with a clustered index into a heap on SQL Server 2014 Standard Edition: in T-SQL you can only run a single insert at a time, but in SSIS you could have many concurrent inserts running.
In summary, SSIS is almost certainly the wrong tool for the scenario that you described. You should expect to see better performance with properly written T-SQL code, with the exception of a few edge cases, because it avoids the overhead inherent in moving data between servers.
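For reference, the properly written T-SQL path being recommended here often amounts to a single INSERT ... SELECT; a minimal sketch with placeholder table and column names:
-- TABLOCK can enable minimal logging when the target is a heap (recovery model permitting)
-- and, on recent versions, a parallel insert plan
INSERT INTO dbo.TargetTable WITH (TABLOCK) (col1, col2, col3)
SELECT col1, col2, col3
FROM dbo.SourceTable;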