How are pseudorandom and truly random numbers different and why does it matter

random number generatorSecurity

I've never quite got this. Just say you write a small program in any language at all which rolls some dice (just using dice as an example). After 600,000 rolls, each number would have been rolled around 100,000 times, which is what I'd expect.

Why are there websites dedicated to 'true randomness'? Surely, given the observation above, the chances of getting any number are almost exactly 1 over how ever many numbers it can choose from.

I tried it in Python: Here's the result of 60 million rolls. The highest variation is like 0.15. Isn't that as random as it's going to get?

1 - 9997653 2347.0
2 - 9997789 2211.0
3 - 9996853 3147.0
4 - 10006533 -6533.0
5 - 10002774 -2774.0
6 - 9998398 1602.0

Best Answer

Let's play some computer poker, just you, me and a server we both trust. The server uses a pseudo-random number generator which is initialized with a 32-bit seed right before we play. So there are about four billion possible decks.

I get five cards in my hand -- apparently we are not playing Texas Hold 'Em. Suppose the cards are dealt out one to me, one to you, one to me, one to you, and so on. So I have the first, third, fifth, seventh and ninth cards in the deck.

Earlier I ran the pseudo-random number generator four billion times, once with each seed, and wrote down the first card generated for each into a database. Suppose my first card is the queen of spades. That only shows up one as the first card in one in every 52 of those possible decks, so we've cut down the possible decks from four billion to around 80 million or so.

Suppose my second card is the three of hearts. Now I run my RNG 80 million more times using the 80 million seeds that produce the queen of spades as the first number. This takes me a couple of seconds. I write down all the decks that produce the three of hearts as the third card -- the second card in my hand. That's again only about 2% of the decks, so now we're down to 2 million decks.

Suppose the third card in my hand is the 7 of clubs. I have a database of 2 million seeds that deal out my two cards; I run my RNG another 2 million times to find the 2% of those decks that produce the 7 of clubs as the third card, and we're down to only 40 thousand decks.

You see how this goes. I run my RNG 40000 more times to find all the seeds that produce my fourth card, and that gets us down to 800 decks, and then run it 800 more times to get the ~20 seeds that produce my fifth card, and now I just generate those twenty decks of cards and I know that you have one of twenty possible hands. Moreover, I have a very good idea of what I'm going to draw next.

Now do you see why true randomness is important? The way you describe it, you think that distribution is important, but distribution isn't what makes a process random. Unpredictability is what makes a process random.

UPDATE

Based on the (now deleted due to their unconstructive nature) comments, at least 0.3% of people who've read this are confused as to my point. When people argue against points I haven't made, or worse, argue for points that I did make on the assumption that I didn't make them, then I know that I need to explain more clearly and carefully.

There seems to be particular confusion around the word distribution so I want to call out usages carefully.

The questions at hand are:

  • How do pseudorandom numbers and truly random numbers differ?
  • Why is the difference important?
  • Do the differences have something to do with the distribution of the output of the PRNG?

Let's begin by considering the perfect way to generate a random deck of cards with which to play poker. Then we'll see how other techniques for generating decks are different, and if it is possible to take advantage of that difference.

Let's begin by supposing that we have a magic box labeled TRNG. As its input we give it an integer n greater than or equal to one, and as its output it gives us a truly random number between one and n, inclusive. The output of the box is entirely unpredictable (when given a number other than one) and any number between one and n is as likely as another; that is to say that the distribution is uniform. (There are other more advanced statistical checks of randomness that we could perform; I'm ignoring this point as it is not germane to my argument. TRNG is perfectly statistically random by assumption.)

We start with an unshuffled deck of cards. We ask the box for a number between one and 52 -- that is, TRNG(52). Whatever number it gives back, we count off that many cards from our sorted deck and remove that card. It becomes the first card in the shuffled deck. Then we ask for TRNG(51) and do the same to select the second card, and so on.

Another way to look at it is: there are 52! = 52 x 51 x 50 ... x 2 x 1 possible decks, which is roughly 2226. We have chosen one of them at truly at random.

Now we deal the cards. When I look at my cards I have no idea whatsoever what cards you have. (Aside from the obvious fact that you don't have any of the cards I have.) They could be any cards, with equal probability.

So let me make sure that I explain this clearly. We have uniform distribution of each individual output of TRNG(n); each one picks a number between 1 and n with probability 1/n. Also, the result of this process is that we have chosen one of 52! possible decks with a probability of 1/52!, so the distribution over the set of possible decks is also uniform.

All right.

Now let's suppose that we have a less magic box, labeled PRNG. Before you can use it, it must be seeded with a 32-bit unsigned number.

ASIDE: Why 32? Couldn't it be seeded with a 64- or 256- or 10000-bit number? Sure. But (1) in practice most off-the-shelf PRNGs are seeded with a 32-bit number, and (2) if you have 10000 bits of randomness to make the seed then why are you using a PRNG at all? You already have a source of 10000 bits of randomness!

Anyway, back to how the PRNG works: after it is seeded, you can use it the same way you use TRNG. That is, you pass it a number, n, and it gives you back a number between 1 and n, inclusive. Moreover, the distribution of that output is more or less uniform. That is, when we ask PRNG for a number between 1 and 6, we get 1, 2, 3, 4, 5 or 6 each roughly one sixth of the time, no matter what the seed was.

I want to emphasize this point several times because it seems to be the one that is confusing certain commenters. The distribution of the PRNG is uniform in at least two ways. First, suppose we choose any particular seed. We would expect that the sequence PRNG(6), PRNG(6), PRNG(6)... a million times would produce a uniform distribution of numbers between 1 and 6. And second, if we chose a million different seeds and called PRNG(6) once for each seed, again we would expect a uniform distribution of numbers from 1 to 6. The uniformity of the PRNG across either of these operations is not relevant to the attack I am describing.

This process is said to be pseudo-random because the behaviour of the box is actually fully deterministic; it chooses from one of 232 possible behaviours based on the seed. That is, once it is seeded, PRNG(6), PRNG(6), PRNG(6), ... produces a sequence of numbers with a uniform distribution, but that sequence is entirely determined by the seed. For a given sequence of calls, say, PRNG(52), PRNG(51) ... and so on, there are only 232 possible sequences. The seed essentially chooses which one we get.

To generate a deck the server now generates a seed. (How? We'll come back to that point.) Then they call PRNG(52), PRNG(51) and so on to generate the deck, similar to before.

This system is susceptible to the attack I described. To attack the server we first, ahead of time, seed our own copy of the box with 0 and ask for PRNG(52) and write that down. Then we re-seed with 1, ask for PRNG(52), and write that down, all the way up to 232-1.

Now, the poker server that is using PRNG to generate decks has to generate a seed somehow. It does not matter how they do so. They could call TRNG(2^32) to get a truly random seed. Or they could take the current time as a seed, which is hardly random at all; I know what time it is as much as you do. The point of my attack is that it does not matter, because I have my database. When I see my first card I can eliminate 98% of the possible seeds. When I see my second card I can eliminate 98% more, and so on, until eventually I can get down to a handful of possible seeds, and know with high likelihood what is in your hand.

Now, again, I want to emphasize that the assumption here is that if we called PRNG(6) a million times we would get each number roughly one sixth of the time. That distribution is (more or less) uniform, and if uniformity of that distribution is all you care about, that's fine. The point of the question was are there things other that distribution of PRNG(6) that we care about? and the answer is yes. We care about unpredictability as well.

Another way to look at the problem is that even though the distribution of a million calls to PRNG(6) might be fine, because the PRNG is choosing from only 232 possible behaviours, it cannot generate every possible deck. It can only generate 232 of the 2226 possible decks; a tiny fraction. So the distribution over the set of all decks is very bad. But again, the fundamental attack here is based on our being able to successfully predict the past and future behaviour of PRNG from a small sample of its output.

Let me say this a third or four time to make sure this sinks in. There are three distributions here. First, the distribution of the process that produces the random 32-bit seed. That can be perfectly random, unpredictable and uniform and the attack will still work. Second, the distribution of a million calls to PRNG(6). That can be perfectly uniform and the attack will still work. Third, the distribution of decks chosen by the pseudo-random process I have described. That distribution is extremely poor; only a tiny fraction of the IRL possible decks can possibly be chosen. The attack depends on the predictability of the behaviour of the PRNG based on partial knowledge of its output.

ASIDE: This attack requires that the attacker know or be able to guess what the exact algorithm used by the PRNG is. Whether that is realistic or not is an open question. However, when designing a security system you must design it to be secure against attacks even if the attacker knows all the algorithms in the program. Put another way: the portion of a security system which must remain secret in order for the system to be secure is called the "key". If your system depends for its security on the algorithms you use being a secret then your key contains those algorithms. That is an extremely weak position to be in!

Moving on.

Now let's suppose that we have a third magic box labeled CPRNG. It is a crypto-strength version of PRNG. It takes a 256-bit seed rather than a 32-bit seed. It shares with PRNG the property that the seed chooses from one of 2256 possible behaviours. And like our other machines, it has the property that a large number of calls to CPRNG(n) produce a uniform distribution of results between 1 and n: each happens 1/n of the time. Can we run our attack against it?

Our original attack requires us to store 232 mappings from seeds to PRNG(52). But 2256 is a much larger number; it is completely infeasible to run CPRNG(52) that many times and store the results.

But suppose there is some other way to take the value of CPRNG(52) and from that deduce a fact about the seed? We've been pretty dumb so far, just brute-forcing all the possible combinations. Can we look inside the magic box, figure out how it works, and deduce facts about the seed based on the output?

No. The details are too complicated to explain, but CPRNGs are cleverly designed so that it is infeasible to deduce any useful fact about the seed from the first output of CPRNG(52) or from any subset of the output, no matter how large.

OK, so now let's suppose the server is using CPRNG to generate decks. It needs a 256-bit seed. How does it choose that seed? If it chooses any value that an attacker can predict then suddenly the attack becomes viable again. If we can determine that of the 2256 possible seeds, only four billion of them are likely to be chosen by the server, then we are back in business. We can mount this attack again, only paying attention to the small number of seeds that can possibly be generated.

The server therefore should do work to ensure that the 256-bit number is uniformly distributed -- that is, each possible seed is chosen with probability of 1/2256. Basically the server should be calling TRNG(2^256)-1 to generate the seed for CPRNG.

What if I can hack the server and peer into it to see what seed was chosen? In that case, the attacker knows the complete past and future of the CPRNG. The author of the server needs to guard against this attack! (Of course if I can successfully mount this attack then I can probably also just transfer the money to my bank account directly, so maybe that's not that interesting. Point is: the seed has to be a hard-to-guess secret, and a truly random 256-bit number is pretty darn hard to guess.)

Returning to my earlier point about defense-in-depth: the 256-bit seed is the key to this security system. The idea of a CPRNG is that the system is secure as long as the key is secure; even if every other fact about the algorithm is known, as long as you can keep the key secret, the opponent's cards are unpredictable.

OK, so the seed should be both secret and uniformly distributed because if it is not, we can mount an attack. We have by assumption that the distribution of outputs of CPRNG(n) is uniform. What about the distribution over the set of all possible decks?

You might say: there are 2256 possible sequences output by the CPRNG, but there are only 2226 possible decks. Therefore there are more possible sequences than decks, so we're fine; every possible-IRL deck is now (with high probability) possible in this system. And that's a good argument except...

2226 is only an approximation of 52!. Divide it out. 2256/52! cannot possibly be a whole number because for one thing, 52! is divisible by 3 but no power of two is! Since this is not a whole number now we have the situation where all decks are possible, but some decks are more likely than others.

If that's not clear, consider the situation with smaller numbers. Suppose we have three cards, A, B and C. Suppose we use a PRNG with an 8-bit seed, so there are 256 possible seeds. There are 256 possible outputs of PRNG(3) depending on the seed; there's no way to have one third of them be A, one third of them be B and one third of them be C because 256 isn't evenly divisible by 3. There has to be a small bias towards one of them.

Similarly, 52 doesn't divide evenly into 2256, so there must be some bias towards some cards as the first card chosen and a bias away from others.

In our original system with a 32-bit seed there was a massive bias and the vast majority of possible decks were never produced. In this system all decks can be produced, but the distribution of decks is still flawed. Some decks are very slightly more likely than others.

Now the question is: do we have an attack based on this flaw? and the answer is in practice, probably not. CPRNGs are designed so that if the seed is truly random then it is computationally infeasible to tell the difference between CPRNG and TRNG.

OK, so let's sum up.

How do pseudorandom numbers and truly random numbers differ?

They differ in the level of predictability they exhibit.

  • Truly random numbers are not predictable.
  • All pseudo-random numbers are predictable if the seed can be determined or guessed.

Why is the difference important?

Because there are applications where the security of the system relies upon unpredictability.

  • If a TRNG is used to choose each card then the system is unassailable.
  • If a CPRNG is used to choose each card then the system is secure if the seed is both unpredictable and unknown.
  • If an ordinary PRNG with a small seed space is used then the system is not secure regardless of whether the seed is unpredictable or unknown; a small enough seed space is susceptible to brute-force attacks of the kind I have described.

Does the difference have something to do with the distribution of the output of the PRNG?

The uniformity of distribution or lack thereof for individual calls to RNG(n) is not relevant to the attacks I've described.

As we've seen, both a PRNG and CPRNG produce poor distributions of the probability of choosing any individual deck of all the possible decks. The PRNG is considerably worse, but both have problems.

One more question:

If TRNG is so much better than CPRNG, which is in turn so much better than PRNG, why does anyone use CPRNG or PRNG?

Two reasons.

First: expense. TRNG is expensive. Generating truly random numbers is difficult. CPRNGs give good results for arbitrarily many calls with only one call to TRNG for the seed. The down side is of course that you have to keep that seed a secret.

Second: sometimes we want predictability and all we care about is good distribution. If you're generating "random" data as program inputs for a test suite, and it shows up a bug, then it would be nice that running the test suite again produces the bug again!

I hope that is now much more clear.

Finally, if you enjoyed this then you might enjoy some further reading on the subject of randomness and permutations:

Related Question