So here's the scenario: I have an XML file that is 500 GB in size, holding around 600 million rows of data (once it's loaded into a database table). I'm using SSIS for the operation, and since an SSIS component (i.e. the XML Source) consumes a REALLY large amount of memory, it might cause a timeout (correct me if I'm wrong, but as far as I know, the SSIS components load the content of the XML into memory, and with a file that big that will surely cause errors). My approach, then, is:
- Use a Script Task to parse the XML data using XmlReader (XmlReader is by far the best approach, since it parses the XML in a forward-only, non-cached fashion)
- Insert the data into a DataTable
- Every 500,000 rows in the DataTable, insert its contents into the database using SqlBulkCopy, then clear the DataTable (a rough sketch of this loop follows the list)
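Here is a minimal sketch of what that Script Task loop might look like; the record element name ("row"), the column and attribute names, the file path, the connection string and the target table are all made-up placeholders, not the real schema:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Xml;

class XmlBulkLoader
{
    static void Main()
    {
        const int BatchSize = 500000;

        // Buffer table; columns are placeholders for the real schema.
        var table = new DataTable();
        table.Columns.Add("Col1", typeof(string));
        table.Columns.Add("Col2", typeof(string));

        using (var connection = new SqlConnection("Server=.;Database=TargetDb;Integrated Security=true"))
        {
            connection.Open();

            using (var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "dbo.TargetTable" })
            using (var reader = XmlReader.Create(@"C:\data\huge.xml"))
            {
                // Forward-only, non-cached pass over the XML.
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
                    {
                        var row = table.NewRow();
                        row["Col1"] = reader.GetAttribute("col1");
                        row["Col2"] = reader.GetAttribute("col2");
                        table.Rows.Add(row);

                        // Flush to the database every 500,000 rows, then clear the buffer.
                        if (table.Rows.Count >= BatchSize)
                        {
                            bulkCopy.WriteToServer(table);
                            table.Clear();
                        }
                    }
                }

                // Write whatever is left in the final partial batch.
                if (table.Rows.Count > 0)
                {
                    bulkCopy.WriteToServer(table);
                }
            }
        }
    }
}
```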
My problem is that when I tried this approach on another file of around 200 GB, it ran at roughly 13.5 million rows per hour, and I don't know whether that run time is acceptable. It does solve my problem, but it isn't very elegant; I mean, there should be other ways.
I'm looking at other approaches, such as:
- Dividing the large XML file into smaller CSV pieces (around 20 GB each), then using an SSIS Data Flow Task (a rough sketch of the splitting step follows this list)
- Using an INSERT statement for every new row
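For the CSV option, a splitter could reuse the same streaming XmlReader pass and simply roll over to a new file every N rows. The element/attribute names, paths and chunk size below are placeholders, and this sketch ignores proper CSV escaping:

```csharp
using System.IO;
using System.Xml;

class XmlToCsvSplitter
{
    static void Main()
    {
        const int RowsPerFile = 10000000;   // tune so each piece lands near the target size
        int rowCount = 0, fileIndex = 0;
        StreamWriter writer = null;

        using (var reader = XmlReader.Create(@"C:\data\huge.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
                {
                    // Start a new CSV piece every RowsPerFile rows.
                    if (rowCount % RowsPerFile == 0)
                    {
                        if (writer != null) writer.Dispose();
                        writer = new StreamWriter(string.Format(@"C:\data\chunks\part_{0:D4}.csv", fileIndex++));
                        writer.WriteLine("Col1,Col2");   // header row for the Flat File Source
                    }

                    writer.WriteLine("{0},{1}", reader.GetAttribute("col1"), reader.GetAttribute("col2"));
                    rowCount++;
                }
            }
        }

        if (writer != null) writer.Dispose();
    }
}
```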
Can you help me decide which is best, or suggest any other solutions?
Every answer will be very much appreciated.
EDIT
I forgot to mention: my approach needs to be dynamic. I mean, there are many tables that will be populated from large XML files, so using a Script Component as a source might not be that useful, since I would still need to define the output columns. But I'll still give it a try.
Best Answer
Have you considered using a Data Flow Task with a Script Component acting as a source? You could just do an AddRow for each row you consume out of the file, and let SSIS handle the batches.
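A minimal sketch of what that might look like, assuming the output columns (Col1 and Col2 here) have already been defined in the Script Component's metadata and that this override sits in the ScriptMain class SSIS generates (the element/attribute names and the file path are placeholders):

```csharp
using System.Xml;

public override void CreateNewOutputRows()
{
    using (XmlReader reader = XmlReader.Create(@"C:\data\huge.xml"))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "row")
            {
                // One AddRow per XML record; SSIS buffers the rows and
                // pushes them downstream in batches on its own.
                Output0Buffer.AddRow();
                Output0Buffer.Col1 = reader.GetAttribute("col1");
                Output0Buffer.Col2 = reader.GetAttribute("col2");
            }
        }
    }
}
```

This way you never hold more than one buffer's worth of rows yourself; the data flow engine decides when to send batches on to the destination.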