PostgreSQL – Optimize user segmentation with a bytea column

Tags: bytea, postgresql, postgresql-performance

First, let me explain my problem:
I have around 10 million customers in the table. Based on customer attributes, we create segments.
The table that joins customers with segments looks like this:

customer_segment(customer_id int, segment_id int)

Every 3 hours a cron job refreshes the segments and reinserts them into this table.

Each refresh deletes all customers from a segment and inserts them again.
Deleting the customers is fast enough, but I have a problem with the insert: inserting 10 million rows into this table takes 20 minutes. That is too long, and I am trying to solve it.

The idea:

I am thinking about optimizing the inserts. My idea is to remove the customer_segment table and instead add a "customers" column to the segments table. This new column will be of type bytea.
To store whether a user is in a segment, I will each time generate a binary value where:

  • Bit position = customer_id
  • Bit value = whether the customer is in the segment or not.

By my calculation, 10 million customers need 10 million bits, which is only 1.25 MB. So the default value for this column will be 1.25 MB of zero bytes.
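As a quick sanity check on that sizing, here is a small JavaScript sketch (the variable names are mine):

```javascript
// One bit per customer: compute the bitmap size for 10 million customers.
const customers = 10_000_000;
const bytes = Math.ceil(customers / 8); // 8 customers per byte
console.log(bytes);                     // 1250000 bytes
console.log(bytes / 1e6 + " MB");       // 1.25 MB, as estimated above
```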

After that, if I want to add a single customer to the segment, I will execute this query:

UPDATE segments SET customers = set_bit(customers, 555, 1);

where 555 is my customer ID.

If I want to check whether a customer is in the segment, I can do it with this query:

SELECT COUNT(1) FROM segments WHERE get_bit(customers, 555) = 1 AND id = 1
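For reference, PostgreSQL's set_bit/get_bit on bytea place bit n in byte n / 8, counted from the least-significant bit of that byte. A small JavaScript model of that behaviour (the helper names are mine; worth verifying against your PostgreSQL version):

```javascript
// JavaScript model of PostgreSQL's set_bit/get_bit for bytea:
// bit n lives in byte n >> 3, at least-significant-bit position n & 7
// (assumed semantics, to be checked against your server).
function setBit(buf, n, value) {
  const byte = n >> 3, mask = 1 << (n & 7);
  buf[byte] = value ? (buf[byte] | mask) : (buf[byte] & ~mask);
}
function getBit(buf, n) {
  return (buf[n >> 3] >> (n & 7)) & 1;
}

const bitmap = new Uint8Array(1_250_000); // all zero: nobody in the segment
setBit(bitmap, 555, 1);                   // add customer 555
console.log(getBit(bitmap, 555)); // 1
console.log(getBit(bitmap, 554)); // 0
```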

Also, to simplify joins between customers and segments, I created this function:

CREATE OR REPLACE FUNCTION get_segment_customers(segment BYTEA) RETURNS SETOF integer AS
$$
// Walk the bitmap byte by byte and emit one row per set bit.
for (let i = 0; i < segment.length; i++) {
    if (0 === segment[i]) {
        continue; // no customers in this byte, skip all 8 bits
    }
    for (let bit = 0; bit < 8; bit++) {
        if (!(segment[i] & (1 << bit))) {
            continue; // this customer is not in the segment
        }
        plv8.return_next((i * 8) + bit); // customer_id = byte index * 8 + bit
    }
}
$$
LANGUAGE plv8;
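The same loop can be exercised outside the database with plain JavaScript, with plv8.return_next replaced by pushing into an array (this harness is mine, not part of the original function):

```javascript
// Same extraction logic as the plv8 function, as a standalone helper.
function getSegmentCustomers(segment) {
  const ids = [];
  for (let i = 0; i < segment.length; i++) {
    if (segment[i] === 0) continue;        // skip empty bytes quickly
    for (let bit = 0; bit < 8; bit++) {
      if (!(segment[i] & (1 << bit))) continue;
      ids.push(i * 8 + bit);               // customer_id = byte index * 8 + bit
    }
  }
  return ids;
}

const segment = new Uint8Array(3);
segment[0] = 0b00000101; // customers 0 and 2
segment[2] = 0b10000000; // customer 23
console.log(getSegmentCustomers(segment)); // [ 0, 2, 23 ]
```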

And my query to select the customers:

SELECT
    _c.*
FROM
    get_segment_customers((SELECT customers FROM segments WHERE id = 1)) _t
JOIN
    customer _c ON _c.id = _t;

I ran a test and it is fast.

The only thing I am worried about is this query:

UPDATE segments SET customers = set_bit(customers, 555, 1);

I am not sure whether PostgreSQL will rewrite the whole value on disk on each such update, or only the single bit.

If someone can answer the question above and give feedback on the idea and its potential problems, I will be happy.

Thank you

Best Answer

I'm not sure, if PostgreSQL each time of this update will overwrite full file or only one bit in the hard disk?

PostgreSQL will definitely rewrite the whole column on any change, as it cannot do in-place updates, let alone in-place intra-column updates.

But you could break the segment down into fixed-size sub-segments or partitions. For instance, with sub-segments of 2^16 bits, you would rewrite at most 65536 / 8 = 8192 bytes per update. Personally, I have been using this for custom full-text indexing (one vector per word, with document IDs used as bit numbers) and have found this one-bit-per-match-inside-bytea structure very efficient overall when segmented like that.
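One way to sketch that mapping in JavaScript (the 2^16 chunk size is from the example above; the segment_chunk table layout and all names here are my assumption, not something from the question):

```javascript
// Map a customer_id to a (chunk, bit-offset) pair for fixed
// 2^16-bit sub-segments, so an UPDATE rewrites only one ~8 KB value.
// Assumed table layout: segment_chunk(segment_id int, chunk_no int, bits bytea)
const CHUNK_BITS = 1 << 16; // 65536 bits = 8192 bytes per chunk

function locate(customerId) {
  return {
    chunkNo: Math.floor(customerId / CHUNK_BITS), // which bytea row
    bitOffset: customerId % CHUNK_BITS,           // bit inside that row
  };
}

console.log(locate(555));       // { chunkNo: 0, bitOffset: 555 }
console.log(locate(9_999_999)); // { chunkNo: 152, bitOffset: 38527 }
```

An update for customer 9999999 would then touch only one small row, e.g. UPDATE segment_chunk SET bits = set_bit(bits, 38527, 1) WHERE segment_id = 1 AND chunk_no = 152 (again assuming that hypothetical layout).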

Also, bytea in TOAST storage is compressed by default, so if there are large runs of identical bytes in the vectors, they will be stored in fewer bytes than the nominal size.