Mysql – Determining the optimal Column Widths from a .csv File

datatypesimportMySQL

I want to import a .CSV file into a MySQL Table. I would like to determine the optimal column widths to use based on the actual data contained in the CSV file. Are there any scripts or utilities to accomplish this?

Best Answer

Trying to guess appropriate column lengths is best avoided if possible - if you can get the source of the CSV to specify the expected format in more detail (column X will contain no more than Y characters) then try that.

I would suggest initially going for long variable length types - this may be less efficient than more constrained types but will ensure that you do not experence errors in later imports if some fields grow larger than expected (or worse silently lose data because mysql silently truncates something). Under MSSQL text fields would want to be nvarchar(max) or varchar(max) in this case, unless they need to be part of an index in which case they'll need to be more constrained (no index can be over more than 900 bytes of data under MSSQL). Unless you can get more garantees of data format from the source, this is the safest way, though of course you can use a little guesswork (a field containing a national insurance number shouldn't ever be longer than 11 characters for instance, though be careful still as a field may be incorrectly named nad/or overloaded to store things other than what its name suggests).

If you really have no choice but to guess and using larger types feels too inefficient to you then to help the guestimating process import the file(s) you have into a table where all fields are maximum length variable text types, then perform some rudementry analysis such as SELECT MAX(LEN(<field1>)), MAX(LEN(<field2>)) FROM <table> to get indications to base your guesswork on, and for non-string fields you can use more complex pattern matching to determine expected ranges (i.e. field X is always a data in ISO format, field Y is a number between -500 and 45,123 with up to two decimal places, and so forth).

With regard to "optimal": while it may be inefficient to use larger variable-length types where a short fixed length will do, efficient and optimal are not neccesarily the same: being efficient will be little benefit if you expereience errors or lose data due to accidentally setting the type too restrictively.