SQL Server – BULK INSERT: Trials and Tribulations

bulk-insert, python, sql-server-2008-r2

I recently ran into several problems with a BULK INSERT task, so I thought I'd share my experiences. I'm not looking for help, just offering it to others who may run into something similar in the future.

I have a Python script on a network drive, where the path on my "workstation" is something like z:\pythonscript\script.py, with a data subdirectory at z:\pythonscript\data. The network drive is a physical drive on the server (MS Server 2008 R2) that hosts the database, but on that machine the path is S:\blah\shared\pythonscript\script.py.

Lessons Learned

  1. BULK INSERT requires a specific permission in SQL Server (see the GRANT below).
  2. If a column isn't present in your data file because the table fills it in itself (e.g. an autoincrement/IDENTITY field), bulk load through a view that excludes that column.
  3. The data file needs to be local to the database server, and you must reference it by its path from the server's perspective.
  4. Grant the SQL Server service account(s) read access to the locations where the data files live.
  5. If writing error files, (a) they cannot already exist and (b) the SQL Server service account(s) need write permission on the target directory.
  6. SQL Server 2008 R2 won't expand system environment variables such as %TEMP% in file paths.

Best Answer

The Circuitous Path to Learning (i.e. a semi-sequential log of what happened)

Sending a BULK INSERT query - referencing the z:\ path of my workstation - via the Python script returned an error.

The query is a variation of

BULK INSERT [dbo].[tStaging] FROM '/path/to/file.csv' WITH (
  FIELDTERMINATOR = ',', MAXERRORS = 1024, FIRSTROW = 1, KEEPNULLS,
  ERRORFILE = '/path/to/file.err', TABLOCK )

The error was

Cannot bulk load because the file "Z:/pythonscript/data/file.csv" could not be opened. Operating system error code 3(The system cannot find the path specified.).

Switching to the S:\-based path gives me

Cannot bulk load because the file "S:/blah/shared/pythonscript/data/file.csv" could not be opened. Operating system error code 5(Access is denied.)

Hmm, okay. I'll follow my own advice here and enable read and execute permission for the SQL Server service accounts on the path specified.
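Knowing which accounts to grant means finding the service account names first. A hedged aside: if the instance is at 2008 R2 SP1 or later, you can query them directly (on earlier builds it's a trip to SQL Server Configuration Manager):

-- Which Windows accounts do the SQL Server services run as?
-- (sys.dm_server_services requires SQL Server 2008 R2 SP1 or later.)
SELECT servicename, service_account
FROM sys.dm_server_services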

Then if I specify an error file via the ERRORFILE = '{}' parameter, I may get:

Cannot bulk load because the file "S:/blah/shared/pythonscript/data/file.err" could not be opened. Operating system error code 80(The file exists.)

I cannot specify an "append" mode. As the MS documentation points out (had I bothered to read it), BULK INSERT simply returns an error if the error file already exists. That's frustrating. Okay...I say, I'll just delete the file (and the file.err.Error.txt that is also created) and re-run.
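Clearing them from the Python side is trivial (a minimal sketch; the helper name and the companion-file handling are my own framing):

import os

def clear_error_files(err_path):
    # BULK INSERT refuses to overwrite an existing ERRORFILE, and it
    # also writes a companion '<name>.Error.txt' file; remove both.
    for path in (err_path, err_path + '.Error.txt'):
        if os.path.exists(path):
            os.remove(path)

After the cleanup, the re-run gives: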

Cannot bulk load because the file "S:/blah/shared/pythonscript/data/file.err" could not be opened. Operating system error code 5(Access is denied.)

So...write access too, then? Silly me. I go through the same exercise again, this time granting write access.

About now I hit an error suggesting that my login didn't have permission for BULK INSERT operations at all, so that had to be granted:

USE master
GO
GRANT ADMINISTER BULK OPERATIONS TO pySqlUser
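(An aside rather than something I did at the time: membership in the bulkadmin fixed server role confers the same right, and on 2008 R2 the role is managed with sp_addsrvrolemember.)

-- Alternative to the GRANT above: add the same login to the
-- bulkadmin fixed server role.
EXEC sp_addsrvrolemember @loginame = 'pySqlUser', @rolename = 'bulkadmin'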

Solving that, I still get

SQL ERROR in BULK INSERT. #4864 Severity: 16 State: 1 Message: SQL Server message 4864, severity 16, state 1, procedure , line 1: Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 1 (eID)

eID is an autoincrement primary key that I'm trying to add to my data set in a staging table. I tried dropping and recreating the table with the eID column moved to the end. That doesn't appear to help:

SQL ERROR in BULK INSERT. #4866 Severity: 16 State: 1 Message: SQL Server message 4866, severity 16, state 1, procedure , line 1: The bulk load failed. The column is too long in the data file for row 1, column 12. Verify that the field terminator and row terminator are specified correctly.

I suspect that something is wrong with my row terminator. Expanding the field size of the 12th column, immediately before eID, just gets me more complaints about

Bulk load data conversion error (type mismatch or invalid character for the specified codepage)

Playing around with the ROWTERMINATOR option, I settle on building it in Python as ROWTERMINATOR = '{}'.format(chr(10)), i.e. embedding a literal line feed.

Still problems. Maybe if I add the KEEPIDENTITY keyword...

Bulk load failed. User does not have ALTER TABLE permission on table 'dbo.tStaging'. ALTER TABLE permission is required on the target table of a bulk load if the target table contains triggers or check constraints, but the 'FIRE_TRIGGERS' or 'CHECK_CONSTRAINTS' bulk hints are not specified. ALTER TABLE permission is also required if the 'KEEPIDENTITY' bulk hint is specified.

That's weird, I say to myself, why would you need ALTER TABLE for KEEPIDENTITY? Then I look up KEEPIDENTITY again and realize it's basically the opposite of what I want. I want a "GENERATEIDENTITY" hint instead, but that's not an option. One option is an XML format file that tells BULK INSERT the eID column isn't coming from the imported data. Nah, I'll try a view instead.
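For illustration, the view is nothing more than a column-limited SELECT over the staging table (a sketch; the colNN names are hypothetical stand-ins for my real columns):

-- Expose every staging column except the IDENTITY column eID;
-- bulk loading through the view lets SQL Server generate eID.
CREATE VIEW [dbo].[tStagingView]
AS
SELECT col01, col02, /* ... */ col12
FROM [dbo].[tStaging]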

Once the view was written, with all columns - minus eID - in the same order as the underlying table, I could execute the following

qstring = "BULK INSERT {} FROM '{}' WITH ( FIELDTERMINATOR = '{}', ROWTERMINATOR = '{}', MAXERRORS = 1024, FIRSTROW = {}, KEEPNULLS, ERRORFILE = '{}', TABLOCK )".format(viewname, filename, fieldSep, chr(10), 1, ename)

which generates e.g.

BULK INSERT [dbo].[tStagingView] FROM 'S:/path/to/file.csv' WITH (
  FIELDTERMINATOR = ',', ROWTERMINATOR = '[newline here]',
  MAXERRORS = 1024, FIRSTROW = 1, KEEPNULLS, 
  ERRORFILE = 'S:/path/to/file.err', TABLOCK )

I have to police the directory for error dump files and remove them before the SQL call, or else name my error files something like "{}.err".format(gmtime()) and trash them manually, but frequently. I had intended to use the system temp directory via ERRORFILE = '%TEMP%/file.err', but that gives:

Cannot bulk load because the file "%TEMP%/file.err" could not be opened. Operating system error code 3(The system cannot find the path specified.).

So despite SQL Server happily resolving Linux-style forward slashes in paths, it can't handle a simple environment-variable expansion. Maybe I'll hard-code my server's temp directory path into my Python code so I don't have to do any manual purges of my data directory. An alternative would be to create a cross-drive junction/symlink that points to the system temp directory, but that seems to be asking for trouble.
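If I do go the hard-coded route, the Python side might look something like this (a sketch: the temp path and both helper names are my own inventions, not anything SQL Server dictates):

import glob
import os
import time

# Hard-coded server-side temp directory -- an assumption about this
# server; the SQL Server service account still needs write
# permission here.
ERR_DIR = r'C:\Windows\Temp'

def new_error_file():
    # Timestamped name, so BULK INSERT never trips over an existing
    # file (operating system error code 80 above).
    return os.path.join(ERR_DIR, '{}.err'.format(int(time.time())))

def purge_old_error_files(max_age_seconds=86400):
    # Trash stale .err files and their .Error.txt companions.
    cutoff = time.time() - max_age_seconds
    for path in glob.glob(os.path.join(ERR_DIR, '*.err*')):
        if os.path.getmtime(path) < cutoff:
            os.remove(path)

The timestamped names keep BULK INSERT from ever hitting error code 80, and the periodic purge keeps the directory from silting up.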