Processing CSV Data on Import in PostgreSQL with PostGIS

importpostgispostgresql

I have an externally-produced CSV filled with geographic points and associated data, but I would like to run some functions on the columns before saving them. For example I need to digest longitude/latitude columns into points, run some numerical processing on other columns, and discarding a couple altogether.

What's the best practise here? Options I've considered:

Processing the data with a shell script before importing. Can be invoked directly from the COPY command with COPY FROM PROGRAM, but still leaves the problem of turning the lon/lat into a geographical type.
Having two tables, data and data_import, and updating data using functions on the columns of data_import.
Having a table where the imported data is as it appears in the CSV, working with it through a View that does the necessary processing.

Best Answer

The common way is to use a staging table (the table you called "data_import" which conventionally would be called "data_stg" or similar variation).
Preferably, no "updating" would be involved (maybe it was just a poor choice of words) but only a single insert into ... select ....

Related Solutions

PostGIS – Removing Latitude and Longitude from Table

The point of reference comes from the cafe in the center, so you can use a subquery to retrieve it from the addresses table instead of the manual input:

SELECT c.*, a.*, ST_Distance(t.lonlat, a.lonlat) AS distance -- pick columns you need
FROM   addresses a
JOIN   cafes     c ON c.id = a.cafe_id
    , (SELECT lonlat
       FROM   addresses
       WHERE  cafe_id = 10
       AND    type = 'SearchAddress') t -- center cafe
WHERE  ST_DWithin(t.lonlat, a.lonlat, 4828)
ORDER  BY distance;

Table alias was missing: cafes c.
Makes no sense to have a.cafe_id, c.id in the SELECT list, since those are the same by definition. Pick the columns you actually need.
The columns type or state shouldn't be type text (or character varying(255)), I suggest an enum for a hand full of types or a adr_type look-up table and adr_type_id int REFERENCES adr_type (adr_type_Id in addresses.
Either way, I also suggest a partial, covering index for a faster look-up of the center coordinate. Building on "There can only be one of those addresses with that type.":
```
CREATE addresses_search_idx ON addresses (cafe_id, lonlat)
WHERE type = 'SearchAddress';
```
A lot more details discussed in chat (don't know for how long the link will be good).

Postgresql – Numerical data with lots of NULLs: 6NF or document database

JSONB might be a good option. If the TYPES of your flags and sensors are always the same (let's say they are all floats and ints), you could also use another strategy:

CREATE TABLE tb
    (id           serial primary key, 
     timetag      timestamp default now(),
     sensor_nr    integer,
     sensor_value float,
     flag_nr      integer,
     flag_value   integer
    ) ;

If you have real values for sensors 2 and 1000, you would then do:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value) 
VALUES 
    (2, 123.456), 
    (1000, 0.123) ;

Or, if there are flags:

INSERT INTO 
    tb 
    (sensor_nr, sensor_value, flag_nr, flag_value)
VALUES
    (1000, 123.456, 1000, 0),
    (2, 234.567, 2, 1) ;

The NaN value would be represented by NULL; and the "no reading" would be represented by just the non-existence of the time/sensor_nr row.

If you need frequent range SELECTs of different types, you would index by timetag; and also by sensor_nr and sensor_value... The insert costs would be relatively high, but the SELECTs could be fast.

If the "new unexpected fields that pop at any type" are of different types (i.e.: you have some co-ordinate pairs (float,float) and not just simple floats) this approach won't be flexible enough. In that occasion, probably JSON(B) and the new indices is probably your best alternative; at the cost of losing (some) type safety.

Best Answer

Related Solutions

PostGIS – Removing Latitude and Longitude from Table

Postgresql – Numerical data with lots of NULLs: 6NF or document database

Related Question