I am new to pgAdmin and Azure database. I have a huge csv file with around 220 columns and I want to create a table out of it in pgAdmin4 to push it to Azure.
However, I don't know how to automatically detect the column types.
Here is an example:
IDI GKID S01Q01 S02Q01_Gender ...
100093 enq030059569748fc89091fdd91cc337cac44eca90 Yes, I agree Female ...
I'm writing the script to create the table, but given the number of columns I'd like to automate generating it, so that I can load the csv into the database in pgAdmin4 for Microsoft Azure.
Then, after transposing the csv header, I get:
IDI
GKID
S01Q01
S02Q01_Gender
...
Given the number of columns, in the best case I would like to automate detecting each column's type so I can write it into the right part of the statement; in the worst case I would fall back to a generic type such as TEXT.
So far, I've tried:
# Append a type to every column name listed (one per line) in columns.txt,
# rewriting the file in place.
# NOTE: TINYTEXT is a MySQL type; the PostgreSQL equivalent of a generic
# string type is TEXT.
file_name = "columns.txt"
string_to_add = " TINYTEXT,"
with open(file_name, 'r') as f:
    file_lines = [''.join([x.strip(), string_to_add, '\n']) for x in f.readlines()]
with open(file_name, 'w') as f:
    f.writelines(file_lines)
It gives me back:
IDI TINYTEXT,
GKID TINYTEXT,
S01Q01 TINYTEXT,
S02Q01_Gender TINYTEXT,
...
And then I can do:
CREATE TABLE my_table (
IDI TINYTEXT,
GKID TINYTEXT,
S01Q01 TINYTEXT,
S02Q01_Gender TINYTEXT,
...
But I'm not sure this is enough to make a table that can receive my csv file.
Best Answer
There are command-line tools that do field type inferencing.
One is xsv: https://github.com/BurntSushi/xsv/
Its stats subcommand infers a type for every field; running it against this sample csv (https://gist.githubusercontent.com/aborruso/3b1af402f0d2ed49465f218d19be81d9/raw/c0e95b320924e9e49902633d16e7ab253046ca16/input.csv) reports the detected type of each column.
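For instance, a pipeline along these lines prints the field name and inferred type for every column (input.csv here is an assumed local copy of that file):
# compute per-column statistics, keep only the field name and inferred type,
# and render the result as an aligned table
xsv stats input.csv | xsv select field,type | xsv table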
Using csvkit (https://csvkit.readthedocs.io/en/latest/index.html), the csvsql command will generate a CREATE TABLE statement for the same file, with a type inferred for every column.
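For example (the -i flag selects the SQL dialect of the generated DDL; input.csv is again an assumed local copy of the file):
csvsql -i postgresql input.csv
The result is a complete CREATE TABLE statement, typically with DECIMAL for numeric columns and VARCHAR for text ones, which you can paste straight into the Query Tool in pgAdmin4.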
csvkit also gives you the option of skipping the generated script entirely and loading the csv straight into the database: it creates the table with the inferred types and inserts the rows in one step.
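A sketch, assuming a SQLAlchemy-style connection string for your Azure Database for PostgreSQL server (user, password, host and dbname are placeholders; Azure normally enforces SSL, hence sslmode=require):
csvsql --db "postgresql://user:password@host:5432/dbname?sslmode=require" --insert input.csv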