Linux – Recommended Database(s) for Selecting Random Rows

linuxrandomselect

I'm familiar with selecting random rows from a table in PostgreSQL. It's really slow. I've looked into some other databases and they all seem to be bad at it. Mostly because it's not a normal use case for any of them.

Is there any type of database that can select a random row quickly? The only hard requirement is that it be installable on Linux.

So far I have a table with 500 million rows and anticipate it growing to a few billion in the near future and would like to be able to select a random item. I'd also like to be able to delete the same row a short time later, so it'll need some sort of key/index for that.

Right now my psql schema has 3 columns: a text field, a date, and an integer sequence id. The date is not important and doesn't need to be in the new database.

Best Answer

I was using one of two ways to get a random row:

SELECT * FROM mytable ORDER BY RANDOM();

and

SELECT * FROM mytable LIMIT 1 OFFSET <random number>;

The offset method was faster for low offset numbers, otherwise super slow. An offset of 1 million took ~600 msec, while 100 million was 60 seconds (average query time for midpoint offset was 2.5 minutes). For order by random, all queries were ~5 minutes.

I ended up finding a good solution, similar to Mat's comment, but I don't need to keep track of how many rows there are.

I added an integer column called "randval" with a btree index. When a record is saved, I generate a random number between 1 and 2 billion. It took a while to migrate existing data (about half a day on slow hardware), but now random selects are super fast, typically around 1 millisecond using this query:

SELECT * FROM mytable WHERE randval >= <random number> ORDER BY randval LIMIT 1;

With 500 million rows, not every random number has a value, so the order by with limit gives us the next closest one and the index makes it quick, with a 150,000x speedup for what was an average 2.5-minute query.

Databases don't like doing random things, but they sure do well at specific things that just happen to have random values.