PostgreSQL – Efficient GROUP BY on a sorted CSV using the PostgreSQL FDW

postgresql | postgresql-11 | postgresql-fdw

I have a large CSV file sorted on the first column. Here is an extremely simplified version:

a 7
a 4
b 6
c 3
c 9
c 2

I want to group by the first column and sum the second.

How do I tell the PostgreSQL foreign data wrapper (FDW) to assume that the file is sorted on the first column, so that it can stream the aggregates instead of scanning the entire file before producing any output?

I am using PostgreSQL 11.1.

Update: The above example is dramatically simplified. In the real-life use case the first column has a very high number of distinct values. It is the Google Ngrams data set, where the first column is a phrase, the second is a year, and the third is the number of occurrences in literature published that year. I want the phrases whose total count over all years is greater than, say, 10,000. A simple Python program can stream that result efficiently, so why not PostgreSQL? The FDW could simulate an index over the table if it were informed of the sort order, just as it simulates a table from the CSV file, so I am asking whether it does something like that.
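To illustrate why the sort order matters: a one-pass Python aggregation only works because `itertools.groupby` assumes consecutive rows with the same key form one group, which holds exactly when the input is sorted on that key. A minimal sketch (the function name and threshold are mine, not from any library):

```python
import itertools

def phrases_over_threshold(rows, threshold):
    """Yield (phrase, total) for phrases whose summed count exceeds threshold.

    Assumes rows are sorted on the phrase column, so each phrase forms
    one contiguous run and only one running total is held in memory.
    """
    for phrase, group in itertools.groupby(rows, key=lambda r: r[0]):
        total = sum(int(r[-1]) for r in group)
        if total > threshold:
            yield phrase, total

# Usage with the simplified example above (threshold lowered to fit the data):
rows = [("a", 7), ("a", 4), ("b", 6), ("c", 3), ("c", 9), ("c", 2)]
print(list(phrases_over_threshold(rows, 10)))  # [('a', 11), ('c', 14)]
```

For the real data set the `rows` iterable would come from a `csv.reader` over the file, so memory use stays constant regardless of file size. This is the streaming plan the question hopes PostgreSQL could pick if it knew about the sort order.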

My main concern is efficiency. I previously did this with Unix command-line tools such as awk, but I would like to avoid juggling Unix scripts, which are essentially hand-crafted execution plans. Save me, PostgreSQL!

Best Answer

I'm afraid not.

There is no such thing as a "sorted CSV" file format; all you can have is a CSV file that happens to be sorted. I could take your example file, move the top line to the bottom, and it would still be a valid CSV.

If the FDW trusted your assertion and you had made a mistake, it would return incorrect results, which is obviously unacceptable.

So you will want to either import the file into a regular table if you are doing lots of processing, convert it to a sorted database file format that some FDW supports, or pre-summarize the data if that is all you are doing.
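If pre-summarizing is the route taken, the same single pass over the sorted file can emit one (phrase, total) row per phrase, producing a much smaller file that loads quickly with `COPY`. A sketch under the assumption of a tab-separated input sorted on the first column (the paths and function name are placeholders):

```python
import csv
import itertools

def summarize(reader, writer):
    """Collapse a phrase-sorted stream of (phrase, ..., count) rows
    into one (phrase, total) row per phrase. Returns the row count."""
    n = 0
    for phrase, group in itertools.groupby(reader, key=lambda r: r[0]):
        writer.writerow([phrase, sum(int(r[-1]) for r in group)])
        n += 1
    return n

# Usage (hypothetical paths; the output is ready for PostgreSQL's COPY):
# with open("ngrams.tsv", newline="") as src, \
#         open("summary.tsv", "w", newline="") as dst:
#     summarize(csv.reader(src, delimiter="\t"),
#               csv.writer(dst, delimiter="\t"))
```

The summary file can then be loaded into a regular table and indexed, after which the threshold query is an ordinary indexed scan.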