I have an api endpoint that returns X amount of random posts from a table called "posts". This endpoint is used by a mobile app. It retreives random posts by doing a SELECT COUNT(*) on the posts table and returning the amount of posts in the table. It then enters a for loop in which, at the start of each loop, a random number from 0 to the COUNT(*) is generated. A random post is then obtained using the handy OFFSET. This for loop goes until X amount of random posts obtained.
pseudocode:
FUNCTION getRandPosts :
$numOfPosts = queryExecute("SELECT COUNT(*) from posts");
for (iterations < numOfRandomPostsNeeded) {
$randomNumber = rand(0, $numOfPosts)
$randomPost = queryExecute("SELECT * from posts LIMIT 1, OFFSET $randomNumber")
}
Now, for each call to a getRandPosts function I want them to always retrieve a unique post that wasn't retrieved before. For this current getRandPosts call AND FOR PASTS CALLS.
I've thought of several ways of going about doing this:
-
I could store the ID's of already seen posts in an array IN the app and then send this array to the getRandPosts function. The getRandPosts function then uses a "NOT IN" clause.
FUNCTION getRandPosts(ARRAY $idsOfPostsAlreadySeen) : ... for ... $randomPost = queryExecute(" SELECT * from posts WHERE id NOT IN $idsOfPostsAlreadySeen LIMIT 1, OFFSET $randomNumber") $idsOfPostsAlreadySeen.addToArray($randomPost.id) ...
-
After each random post is retrieved, save that post id to a table called "seenPosts" where there is also a column called idOfApp which is used to distinguish users.
FUNCTION getRandPosts(STRING appInstallID) : ... for ... $randomPost = queryExecute(" SELECT posts.* FROM posts JOIN seenPosts.postID FROM seenPosts ON posts.id = seenPosts.postID WHERE seenPosts.postID = null AND seenPosts.appInstallID = $appInstallID LIMIT 1, OFFSET $randomNumber") queryExecute("INSERT INTO seenPosts SET postID = $randomPost.id, appInstallID = $appInstallID") ...
I'd also like to ask about the idea of, for idea number one, creating a temporary table from the array and using JOIN with that temp table vs using the array.
I'm avoiding using RAND() because it has been said it doesn't work for large tables. I wonder how large is large?
Consider the idea that, at most, 600 posts will have to be excluded for any given user (so an array of 600 ids in the case of the first idea $idsOfPostsAlreadySeen… or 600 records in the seenPosts for a particular person).
Best Answer
As for "doesn't work", I say that is false.
RAND()
does not repeat after 64K, like some random number generators.As far as "how large", time the first query below -- if it is "too slow" then the table is "too large". Then skip to the link at the bottom.
Case 1 Done all at once, this gives you $N random, not repeated, rows:
Case 2 If you need to fetch the rows individually, such as with separate requests and minimal "memory" between requests:
By remembering $rnd and $x (eg passing via a URL or SESSION), you can get the "next" one in a predictable sequence.
Caveat 1: If rows are inserted/deleted, this sequence can mess up.
Caveat 2: If the table is 'large', the
SELECT
could be costly.8 Faster ways to get random rows. None is 'perfect', and each depends on something not in evidence, such as whether there is an auto_increment id and whether that has gaps.