Note: this will only work if you make data.id a PRIMARY KEY
SELECT
  data.id
FROM
  data,
  LATERAL (SELECT DISTINCT unnest("values")) no_duplicates
GROUP BY
  data.id
HAVING
  array_length("values", 1) > COUNT(no_duplicates)
Here's an SQL Fiddle.
This works by converting your array into a recordset/table (which I've called "no_duplicates") using unnest(), and removing duplicates using DISTINCT:

LATERAL (SELECT DISTINCT unnest("values")) no_duplicates
Then I GROUP BY the original data table's ID and compare the length of the original, unfiltered array with the number of rows in the new, de-duplicated recordset. If the original array is longer, it contained duplicates, so we should select that row:

array_length("values", 1) > COUNT(no_duplicates)
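For example, here's a minimal, self-contained sketch with made-up data (not from the original question; note that values is a reserved word, so the column has to be double-quoted):

-- Hypothetical table: id is the PRIMARY KEY, "values" holds the array
CREATE TABLE data (
  id       int PRIMARY KEY,
  "values" int[]
);

INSERT INTO data VALUES
  (1, '{1,2,2}'),   -- contains a duplicate
  (2, '{1,2,3}'),   -- no duplicates
  (3, '{4,4,4}');   -- contains duplicates

SELECT
  data.id
FROM
  data,
  LATERAL (SELECT DISTINCT unnest("values")) no_duplicates
GROUP BY
  data.id
HAVING
  array_length("values", 1) > COUNT(no_duplicates);
-- returns ids 1 and 3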
Idea 1
Judging by their names, the columns "denormalizedData" and "hugeText" seem to be comparatively big, probably many times as big as the columns involved in your query. Size matters for big queries like this. Very big values (> 2 kB) for text or jsonb get "toasted", which can avert the worst. But even the remaining, smaller values stored inline can be several times as big as the columns relevant to your query, which span around 100 bytes.
Splitting columns relevant to the query into a separate 1:1 table might go a long way. (Depends on the complete situation. You add some storage overhead for another row header and another PK and writing to the tables gets a bit more complicated and expensive.)
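As a rough illustration only (these table names and the metric columns are invented; only the column names you mentioned are reused), the 1:1 split might look like this:

-- Hot columns queried for the top 50 stay in the main table
CREATE TABLE tbl_hot (
  id             bigint PRIMARY KEY,
  "textCol"      text,
  "timestampCol" timestamptz,
  metric1        int,    -- placeholders for the columns
  metric2        int     -- relevant to your query
);

-- Big, rarely needed columns move to a 1:1 side table
CREATE TABLE tbl_big (
  id                 bigint PRIMARY KEY REFERENCES tbl_hot (id),
  "denormalizedData" jsonb,
  "hugeText"         text
);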
Idea 2
Also (as you confirmed) only 4 columns are relevant to determine the top 50. You might have an angle there for a much smaller materialized view (MV) containing just those columns plus "timestampCol" and "textCol", and only the "last 2 weeks" or "last month" of data. Run a fast query on the MV to identify the top 50 "textCol" and only retrieve those rows from the big table. Or, to be precise, just the additional columns not contained in your MV - you get sums for those in the first step.
You only need an index on ("textCol") for the big table, and another one on ("timestampCol") for the MV - which would only be used for instances of your query with a selective WHERE clause. Else, it will be cheaper to sequentially scan the whole MV.
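A sketch of that two-step approach, with tbl and metric1 .. metric4 standing in for your actual table and the 4 relevant columns (assumed names, not your real ones):

-- Small MV with only the columns and the time range needed for the top 50
CREATE MATERIALIZED VIEW mv_recent AS
SELECT "textCol", "timestampCol", metric1, metric2, metric3, metric4
FROM   tbl
WHERE  "timestampCol" >= now() - interval '2 weeks';

CREATE INDEX ON mv_recent ("timestampCol");  -- only useful with a selective WHERE
CREATE INDEX ON tbl ("textCol");             -- for the second step on the big table

-- Step 1: find the top 50 "textCol" in the small MV
-- Step 2: fetch the remaining columns for just those 50 from the big table
SELECT t.*
FROM  (
   SELECT "textCol", sum(metric1) AS sum1
   FROM   mv_recent
   GROUP  BY "textCol"
   ORDER  BY sum1 DESC
   LIMIT  50
   ) top50
JOIN   tbl t USING ("textCol");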
If many of your queries cover the same period of time, you might go one step further: only save one row per "textCol" in the MV with pre-aggregated sums (maybe two or more MVs for a couple of frequently used time periods). You get the idea. That should be much faster still.
You might even create the MVs with the whole result set and refresh before the first new query for the day.
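For frequently reused periods, a pre-aggregated variant plus the daily refresh could look like this (again, placeholder names):

-- One pre-aggregated row per "textCol" for a fixed, popular time range
CREATE MATERIALIZED VIEW mv_agg_2weeks AS
SELECT "textCol", sum(metric1) AS sum1
FROM   tbl
WHERE  "timestampCol" >= now() - interval '2 weeks'
GROUP  BY "textCol";

-- Refresh once a day, before the first query of the day
REFRESH MATERIALIZED VIEW mv_agg_2weeks;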
Depending on exact numbers, you might combine both ideas.
Best Answer
As Daniel Vérité mentioned, there doesn't seem to be a generic solution. When loading data into a table from a file, the following technique can be used to track the progress of the load.
COPY command console progress bar
Create an empty table.
Create a data file with 10 million lines for loading into the table.
Load the data from the file into the table and display a progress bar.
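A sketch of those three steps; the database, table, and file names here are assumptions, and pv needs to be installed:

# 1. Create an empty, single-column table (hypothetical name "t")
psql -d mydb -c 'CREATE TABLE t (n bigint)'

# 2. Generate a data file with 10 million lines
seq 10000000 > /tmp/data.txt

# 3. Feed the file to COPY ... FROM STDIN through pv to get a progress bar
pv /tmp/data.txt | psql -d mydb -c 'COPY t FROM STDIN'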
How this works
By using the COPY command's STDIN option, we can feed in the data for the copy operation from another process. The pv command will output a file and track its progress, displaying a progress bar, ETA, total time elapsed, and the rate of data transfer.
COPY command graphical progress bar
Using the same general technique, we could display a progress bar in a graphical or web-based application. In Python, for example, the psycopg2 module lets you call the COPY command with a file object of your choosing. You could then track how much of the file object has been read and display a progress bar.
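A rough sketch of that idea; the connection string, table, and file names are placeholders, and psycopg2's copy_expert() pulls the data through the wrapper's read() method:

import os
import psycopg2

class ProgressFile:
    """Wrap a file and report how much of it has been read so far."""
    def __init__(self, path):
        self.size = os.path.getsize(path)
        self.f = open(path, 'rb')
        self.read_bytes = 0

    def read(self, n=-1):
        chunk = self.f.read(n)
        self.read_bytes += len(chunk)
        # Hook for a real GUI/web progress bar; here we just print the percentage
        print('\rloaded %.1f%%' % (100.0 * self.read_bytes / self.size), end='')
        return chunk

conn = psycopg2.connect('dbname=mydb')   # placeholder connection string
cur = conn.cursor()
cur.copy_expert('COPY t FROM STDIN', ProgressFile('/tmp/data.txt'))
conn.commit()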