SQL Server – BULK INSERT: Trials and Tribulations

bulk-insert, python, sql-server-2008-r2

I recently ran into several problems with a BULK INSERT task, so I thought I'd share my experiences. I'm not looking for help, just offering it to others who may run into something similar in the future.

I have a Python script on a network drive, where the path on my "workstation" is something like z:\pythonscript\script.py, with a data subdirectory at z:\pythonscript\data. The network drive is a physical drive on the server (MS Server 2008 R2) that hosts the database, but on that machine the path is S:\blah\shared\pythonscript\script.py.

Lessons Learned

  1. BULK INSERT requires a specific permission in SQL Server (see the GRANT below).
  2. If a column isn't present in your data file because the table fills it in itself (e.g. an autoincrement/IDENTITY field), bulk load through a view that excludes that column.
  3. The data file needs to be local to the database server, and you must reference it by its path from the server's perspective.
  4. Grant the SQL Server service account(s) read access to the locations where the data files live.
  5. If writing error files, (a) they cannot already exist and (b) the SQL Server service account(s) need write permission on the target directory.
  6. SQL Server 2008 R2 won't expand system environment variables such as %TEMP% in file paths.

Best Answer

The Circuitous Path to Learning (i.e. a semi-sequential log of what happened)

Sending a BULK INSERT query - referencing the z:\ path of my workstation - via the Python script returned an error.

The query is a variation of

BULK INSERT [dbo].[tStaging] FROM '/path/to/file.csv' WITH (
  FIELDTERMINATOR = ',', MAXERRORS = 1024, FIRSTROW = 1, KEEPNULLS,
  ERRORFILE = '/path/to/file.err', TABLOCK )

The error was

Cannot bulk load because the file "Z:/pythonscript/data/file.csv" could not be opened. Operating system error code 3(The system cannot find the path specified.).

Switching to the S:\-based path gives me

Cannot bulk load because the file "S:/blah/shared/pythonscript/data/file.csv" could not be opened. Operating system error code 5(Access is denied.)

Hmm, okay. I'll follow my own advice here and enable read and execute permission for the SQL Server service accounts on the path specified.
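Knowing which accounts to grant means finding the service account names first. A hedged aside: if the instance is at 2008 R2 SP1 or later, you can query them directly (on earlier builds it's a trip to SQL Server Configuration Manager):

-- Which Windows accounts do the SQL Server services run as?
-- (sys.dm_server_services requires SQL Server 2008 R2 SP1 or later.)
SELECT servicename, service_account
FROM sys.dm_server_services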

Then if I specify an error file via the ERRORFILE = '{}' parameter, I may get:

Cannot bulk load because the file "S:/blah/shared/pythonscript/data/file.err" could not be opened. Operating system error code 80(The file exists.)

I cannot specify an "append" mode. As the MS documentation points out (had I bothered to read it), BULK INSERT simply returns an error if the error file already exists. That's frustrating. Okay...I say, I'll just delete the file (and the file.err.Error.txt that is also created) and re-run.
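Clearing them from the Python side is trivial (a minimal sketch; the helper name and the companion-file handling are my own framing):

import os

def clear_error_files(err_path):
    # BULK INSERT refuses to overwrite an existing ERRORFILE, and it
    # also writes a companion '<name>.Error.txt' file; remove both.
    for path in (err_path, err_path + '.Error.txt'):
        if os.path.exists(path):
            os.remove(path)

After the cleanup, the re-run gives: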

Cannot bulk load because the file "S:/blah/shared/pythonscript/data/file.err" could not be opened. Operating system error code 5(Access is denied.)

So...write access too, then? Silly me. I go through the same exercise again, this time granting write access.

About now I hit an error suggesting that my login didn't have permission for BULK INSERT operations at all, so that had to be granted:

USE master
GO
GRANT ADMINISTER BULK OPERATIONS TO pySqlUser
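(An aside rather than something I did at the time: membership in the bulkadmin fixed server role confers the same right, and on 2008 R2 the role is managed with sp_addsrvrolemember.)

-- Alternative to the GRANT above: add the same login to the
-- bulkadmin fixed server role.
EXEC sp_addsrvrolemember @loginame = 'pySqlUser', @rolename = 'bulkadmin'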

Solving that, I still get

SQL ERROR in BULK INSERT. #4864 Severity: 16 State: 1 Message: SQL Server message 4864, severity 16, state 1, procedure , line 1: Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 1 (eID)

eID is an autoincrement primary key that I'm trying to add to my data set in a staging table. I tried dropping and recreating the table with the eID column moved to the end. That doesn't appear to help:

SQL ERROR in BULK INSERT. #4866 Severity: 16 State: 1 Message: SQL Server message 4866, severity 16, state 1, procedure , line 1: The bulk load failed. The column is too long in the data file for row 1, column 12. Verify that the field terminator and row terminator are specified correctly.

I suspect that something is wrong with my row terminator. Expanding the field size of the 12th column, immediately before eID, just gets me more complaints about

Bulk load data conversion error (type mismatch or invalid character for the specified codepage)

Playing around with the ROWTERMINATOR option, I settle on building it in Python as ROWTERMINATOR = '{}'.format(chr(10)), i.e. embedding a literal line feed.

Still problems. Maybe if I add the KEEPIDENTITY keyword...

Bulk load failed. User does not have ALTER TABLE permission on table 'dbo.tStaging'. ALTER TABLE permission is required on the target table of a bulk load if the target table contains triggers or check constraints, but the 'FIRE_TRIGGERS' or 'CHECK_CONSTRAINTS' bulk hints are not specified. ALTER TABLE permission is also required if the 'KEEPIDENTITY' bulk hint is specified.

That's weird, I say to myself, why would you need ALTER TABLE for KEEPIDENTITY? Then I look up KEEPIDENTITY again and realize it's basically the opposite of what I want. I want a "GENERATEIDENTITY" hint instead, but that's not an option. One option is an XML format file that tells BULK INSERT the eID column isn't coming from the imported data. Nah, I'll try a view instead.
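For illustration, the view is nothing more than a column-limited SELECT over the staging table (a sketch; the colNN names are hypothetical stand-ins for my real columns):

-- Expose every staging column except the IDENTITY column eID;
-- bulk loading through the view lets SQL Server generate eID.
CREATE VIEW [dbo].[tStagingView]
AS
SELECT col01, col02, /* ... */ col12
FROM [dbo].[tStaging]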

Once the view was written, with all columns - minus eID - in the same order as the underlying table, I could execute the following

qstring = "BULK INSERT {} FROM '{}' WITH ( FIELDTERMINATOR = '{}', ROWTERMINATOR = '{}', MAXERRORS = 1024, FIRSTROW = {}, KEEPNULLS, ERRORFILE = '{}', TABLOCK )".format(viewname, filename, fieldSep, chr(10), 1, ename)

which generates e.g.

BULK INSERT [dbo].[tStagingView] FROM 'S:/path/to/file.csv' WITH (
  FIELDTERMINATOR = ',', ROWTERMINATOR = '[newline here]',
  MAXERRORS = 1024, FIRSTROW = 1, KEEPNULLS, 
  ERRORFILE = 'S:/path/to/file.err', TABLOCK )

I have to police the directory for error dump files and remove them before the SQL call, or else name my error files something like "{}.err".format(gmtime()) and trash them manually, but frequently. I had intended to use the system temp directory via ERRORFILE = '%TEMP%/file.err', but that gives:

Cannot bulk load because the file "%TEMP%/file.err" could not be opened. Operating system error code 3(The system cannot find the path specified.).

So despite SQL Server happily resolving Linux-style forward slashes in paths, it can't handle a simple environment-variable expansion. Maybe I'll hard-code my server's temp directory path into my Python code so I don't have to do any manual purges of my data directory. An alternative would be to create a cross-drive junction/symlink that points to the system temp directory, but that seems to be asking for trouble.
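If I do go the hard-coded route, the Python side might look something like this (a sketch: the temp path and both helper names are my own inventions, not anything SQL Server dictates):

import glob
import os
import time

# Hard-coded server-side temp directory -- an assumption about this
# server; the SQL Server service account still needs write
# permission here.
ERR_DIR = r'C:\Windows\Temp'

def new_error_file():
    # Timestamped name, so BULK INSERT never trips over an existing
    # file (operating system error code 80 above).
    return os.path.join(ERR_DIR, '{}.err'.format(int(time.time())))

def purge_old_error_files(max_age_seconds=86400):
    # Trash stale .err files and their .Error.txt companions.
    cutoff = time.time() - max_age_seconds
    for path in glob.glob(os.path.join(ERR_DIR, '*.err*')):
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

The timestamped names keep BULK INSERT from ever hitting error code 80, and the periodic purge keeps the directory from silting up.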