Postgresql – Storing bitsets in PostgreSQL

postgresql

I need to calculate reach, grouped by different criterions, and would prefer to keep everything in my database. I'm looking for either a pre-built bitset implementation for PostgreSQL, or pointers on how I may achieve my goal.

What I'm currently doing is storing serialized Java instances of EWAHCompressedBitSet in bytea columns, but this means I have to round-trip to a Java process to continue my calculations.

Ideally, I'd like to be able to do the following:

-- Unique people reached during a time window
SELECT   service_name, last_updated_at, bitset_or(followers_bs)
FROM     followers
GROUP BY service_name, last_updated_at

I have used a commercial solution (Truviso) that had such a feature, and it was based on PostgreSQL as well.

Note that I do have a table that has the data in an uncompressed format, but because the data is so huge (10M rows and counting), JOINing on it simply takes too much time than I want to devote to it.

Best Answer

Can you use the "bit" type instead of "bytea", and use your own user-defined aggregate?

postgres=> CREATE AGGREGATE bitset_or ("bit")
postgres-> ( sfunc    = bitor,
postgres->   stype    = "bit",
postgres->   initcond =   '0'  );

Now bitset_or should work with your query as written (again, assuming bytea -> bit).