PostgreSQL – Optimize user segmentation with a bytea column

Tags: bytea, postgresql, postgresql-performance

First, let me explain my problem:
I have around 10 million customers in the table. Based on customer attributes, we create segments.
The table that joins customers with segments looks like this:

customer_segment(customer_id int, segment_id int)

Every 3 hours a cron job refreshes the segments and reinserts them into this table.

Each refresh deletes all customers from a segment and inserts them again.
Deleting the customers is fast enough, but I have a problem with the insert: inserting 10 million rows into this table takes 20 minutes. That is too long, and I am trying to solve it.

The idea:

I am thinking about optimizing the inserts. My idea is to remove the customer_segment table and instead add a "customers" column to the segments table. This new column will be of type bytea.
To store whether a user is in a segment, I will each time generate a binary value where:

  • Bit position = customer_id
  • Bit value = whether the customer is in the segment or not.

By my calculation, 10 million customers need 10 million bits, which is only 1.25 MB. So the default value for this column will be 1.25 MB of zero bytes.
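As a quick sanity check on that sizing, here is a small JavaScript sketch (the variable names are mine):

```javascript
// One bit per customer: compute the bitmap size for 10 million customers.
const customers = 10_000_000;
const bytes = Math.ceil(customers / 8); // 8 customers per byte
console.log(bytes);                     // 1250000 bytes
console.log(bytes / 1e6 + " MB");       // 1.25 MB, as estimated above
```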

After that, if I want to add a single customer to the segment, I will execute this query:

UPDATE segments SET customers = set_bit(customers, 555, 1);

where 555 is my customer ID.

If I want to check whether a customer is in the segment, I can do it with this query:

SELECT COUNT(1) FROM segments WHERE get_bit(customers, 555) = 1 AND id = 1
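For reference, PostgreSQL's set_bit/get_bit on bytea place bit n in byte n / 8, counted from the least-significant bit of that byte. A small JavaScript model of that behaviour (the helper names are mine; worth verifying against your PostgreSQL version):

```javascript
// JavaScript model of PostgreSQL's set_bit/get_bit for bytea:
// bit n lives in byte n >> 3, at least-significant-bit position n & 7
// (assumed semantics, to be checked against your server).
function setBit(buf, n, value) {
  const byte = n >> 3, mask = 1 << (n & 7);
  buf[byte] = value ? (buf[byte] | mask) : (buf[byte] & ~mask);
}
function getBit(buf, n) {
  return (buf[n >> 3] >> (n & 7)) & 1;
}

const bitmap = new Uint8Array(1_250_000); // all zero: nobody in the segment
setBit(bitmap, 555, 1);                   // add customer 555
console.log(getBit(bitmap, 555)); // 1
console.log(getBit(bitmap, 554)); // 0
```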

Also, to simplify joins between customers and segments, I created this function:

CREATE OR REPLACE FUNCTION get_segment_customers(segment BYTEA) RETURNS SETOF integer AS
$$
// Walk the bitmap byte by byte and emit one row per set bit.
for (let i = 0; i < segment.length; i++) {
    if (0 === segment[i]) {
        continue; // no customers in this byte, skip all 8 bits
    }
    for (let bit = 0; bit < 8; bit++) {
        if (!(segment[i] & (1 << bit))) {
            continue; // this customer is not in the segment
        }
        plv8.return_next((i * 8) + bit); // customer_id = byte index * 8 + bit
    }
}
$$
LANGUAGE plv8;
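The same loop can be exercised outside the database with plain JavaScript, with plv8.return_next replaced by pushing into an array (this harness is mine, not part of the original function):

```javascript
// Same extraction logic as the plv8 function, as a standalone helper.
function getSegmentCustomers(segment) {
  const ids = [];
  for (let i = 0; i < segment.length; i++) {
    if (segment[i] === 0) continue;        // skip empty bytes quickly
    for (let bit = 0; bit < 8; bit++) {
      if (!(segment[i] & (1 << bit))) continue;
      ids.push(i * 8 + bit);               // customer_id = byte index * 8 + bit
    }
  }
  return ids;
}

const segment = new Uint8Array(3);
segment[0] = 0b00000101; // customers 0 and 2
segment[2] = 0b10000000; // customer 23
console.log(getSegmentCustomers(segment)); // [ 0, 2, 23 ]
```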

And my query to select the customers:

SELECT
    _c.*
FROM
    get_segment_customers((SELECT customers FROM segments WHERE id = 1)) _t
JOIN
    customer _c ON _c.id = _t;

I ran a test and it is fast.

The only thing I am worried about is this query:

UPDATE segments SET customers = set_bit(customers, 555, 1);

I am not sure whether PostgreSQL will rewrite the whole value on disk on each such update, or only the single bit.

If someone can answer the question above and give feedback on the idea and its potential problems, I will be happy.

Thank you

Best Answer

I'm not sure, if PostgreSQL each time of this update will overwrite full file or only one bit in the hard disk?

PostgreSQL will definitely rewrite the whole column on any change, as it cannot do in-place updates, let alone in-place intra-column updates.

But you could break the segment down into fixed-size sub-segments or partitions. For instance, with sub-segments of 2^16 bits, you would rewrite at most 65536 / 8 = 8192 bytes per update. Personally, I have been using this for custom full-text indexing (one vector per word, with document IDs used as bit numbers) and have found this one-bit-per-match-inside-bytea structure very efficient overall when segmented like that.
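One way to sketch that mapping in JavaScript (the 2^16 chunk size is from the example above; the segment_chunk table layout and all names here are my assumption, not something from the question):

```javascript
// Map a customer_id to a (chunk, bit-offset) pair for fixed
// 2^16-bit sub-segments, so an UPDATE rewrites only one ~8 KB value.
// Assumed table layout: segment_chunk(segment_id int, chunk_no int, bits bytea)
const CHUNK_BITS = 1 << 16; // 65536 bits = 8192 bytes per chunk

function locate(customerId) {
  return {
    chunkNo: Math.floor(customerId / CHUNK_BITS), // which bytea row
    bitOffset: customerId % CHUNK_BITS,           // bit inside that row
  };
}

console.log(locate(555));       // { chunkNo: 0, bitOffset: 555 }
console.log(locate(9_999_999)); // { chunkNo: 152, bitOffset: 38527 }
```

An update for customer 9999999 would then touch only one small row, e.g. UPDATE segment_chunk SET bits = set_bit(bits, 38527, 1) WHERE segment_id = 1 AND chunk_no = 152 (again assuming that hypothetical layout).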

Also, bytea in TOAST storage is compressed by default, so if there are large runs of identical bytes in the vectors, they will be stored in fewer bytes than the nominal size.