Mysql – the most efficient way to exclude records with a particular ID in a MySQL query for random records

join;MySQLPHPrandom

I have an api endpoint that returns X amount of random posts from a table called "posts". This endpoint is used by a mobile app. It retreives random posts by doing a SELECT COUNT(*) on the posts table and returning the amount of posts in the table. It then enters a for loop in which, at the start of each loop, a random number from 0 to the COUNT(*) is generated. A random post is then obtained using the handy OFFSET. This for loop goes until X amount of random posts obtained.

pseudocode:

FUNCTION getRandPosts :

$numOfPosts = queryExecute("SELECT COUNT(*) from posts");

for (iterations < numOfRandomPostsNeeded) {
    $randomNumber = rand(0, $numOfPosts)

    $randomPost = queryExecute("SELECT * from posts LIMIT 1, OFFSET $randomNumber")


}

Now, for each call to a getRandPosts function I want them to always retrieve a unique post that wasn't retrieved before. For this current getRandPosts call AND FOR PASTS CALLS.

I've thought of several ways of going about doing this:

  1. I could store the ID's of already seen posts in an array IN the app and then send this array to the getRandPosts function. The getRandPosts function then uses a "NOT IN" clause.

    FUNCTION getRandPosts(ARRAY $idsOfPostsAlreadySeen) :
    ...
    for ...
    
    $randomPost = queryExecute("
        SELECT * from posts
        WHERE id NOT IN $idsOfPostsAlreadySeen
        LIMIT 1, OFFSET $randomNumber")
    
    $idsOfPostsAlreadySeen.addToArray($randomPost.id)
    ...
    
  2. After each random post is retrieved, save that post id to a table called "seenPosts" where there is also a column called idOfApp which is used to distinguish users.

    FUNCTION getRandPosts(STRING appInstallID) :
    ...
    for ...
    
    
    $randomPost = queryExecute("
        SELECT posts.* FROM posts
        JOIN seenPosts.postID FROM seenPosts
        ON posts.id = seenPosts.postID
        WHERE seenPosts.postID = null AND seenPosts.appInstallID = $appInstallID
        LIMIT 1, OFFSET $randomNumber")
    
    queryExecute("INSERT INTO seenPosts SET postID = $randomPost.id, appInstallID = $appInstallID")
    ...
    

I'd also like to ask about the idea of, for idea number one, creating a temporary table from the array and using JOIN with that temp table vs using the array.

I'm avoiding using RAND() because it has been said it doesn't work for large tables. I wonder how large is large?

Consider the idea that, at most, 600 posts will have to be excluded for any given user (so an array of 600 ids in the case of the first idea $idsOfPostsAlreadySeen… or 600 records in the seenPosts for a particular person).

Best Answer

I'm avoiding using RAND() because it has been said it doesn't work for large tables. I wonder how large is large?

As for "doesn't work", I say that is false. RAND() does not repeat after 64K, like some random number generators.

As far as "how large", time the first query below -- if it is "too slow" then the table is "too large". Then skip to the link at the bottom.

Case 1 Done all at once, this gives you $N random, not repeated, rows:

$array = SELECT id FROM ... WHERE ... ORDER BY RAND() LIMIT $N;

Case 2 If you need to fetch the rows individually, such as with separate requests and minimal "memory" between requests:

$rnd = UNIX_TIMESTAMP();
$offs = $offs + 1;
$array = SELECT id FROM ... WHERE ... ORDER BY RAND($rnd) LIMIT $N OFFSET $offs;

By remembering $rnd and $x (eg passing via a URL or SESSION), you can get the "next" one in a predictable sequence.

Caveat 1: If rows are inserted/deleted, this sequence can mess up.
Caveat 2: If the table is 'large', the SELECT could be costly.

8 Faster ways to get random rows. None is 'perfect', and each depends on something not in evidence, such as whether there is an auto_increment id and whether that has gaps.