I tried a bash script, but it took too long to create a simple 1 MB file. I think the answer lies in using /dev/random or /dev/urandom, but other posts here only show how to write all kinds of data to a file with them, whereas I want only numbers.
So, is there a command that I can use to create a random file of size 1 GB containing only numbers between 0 and 9?
Edit:
I want the output to be something like this
0 1 4 7 ..... 9
8 7 5 8 ..... 8
....
....
8 7 5 3 ..... 3
The range is 0 – 9, meaning only the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. I also need them space-separated, 100 per line, for n lines. I don't care what n is; I just want the final size to be 1 GB.
Edit:
I am using Ubuntu 16.04 LTS
Best Answer
This is partially a tongue-in-cheek answer, because of the title of the question.
When you look for "the fastest way to ...", the answer is almost always a specialized tool. This answer shows one such tool, just so you can experiment.
This is not a serious answer, because you should not look into specialized tools for jobs you do only once, or very rarely. You see, you'll end up spending more time looking for tools and learning about them than actually doing stuff. Shells and utilities like bash and awk are not the fastest, but you can usually write a one-liner to achieve the job, spending only seconds. Better scripting languages like perl can also be used, although the learning curve for perl is steep, and I hesitate to recommend it for such purposes, because I've been traumatized by awful perl projects. python, on the other hand, is slightly handicapped by its rather slow I/O; it is only an issue when you filter or generate gigabytes of data, though.
In any case, the following C89 example program (which uses POSIX.1 for a higher-accuracy clock, but only if available) should achieve about a 100 MB/s generation rate (tested on Linux on a laptop with an Intel i5-4200U processor, piping the output to /dev/null), using a pretty good pseudo-random number generator. (The output should pass all the BigCrush tests except the MatrixRank test, as the code uses xorshift64* and the exclusion method to avoid biasing the digits.)
decimal-digits.c:
We can make it a lot faster if we switch to a line buffer and fwrite() each line once, instead of outputting one digit at a time. Note that we still keep the stream fully buffered, to avoid partial (non-power-of-two) writes if the output is a block device.
Note: both examples were edited on 2016-11-18 to ensure a uniform distribution of digits (zero is excluded; see e.g. here for a comparison of, and details on, various pseudo-random number generators).
Compile it with your C compiler (with warnings and optimizations enabled), and optionally install it system-wide to /usr/bin. The program takes the number of digits per line and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes, divided by columns, divided by 2 bytes per digit), you can use it to generate the gigabyte-sized digits.txt the OP desires.
Note that the program itself is written more with readability than efficiency in mind. My intent here is not to showcase the efficiency of the code -- I'd use POSIX.1 and low-level I/O anyway, rather than generic C interfaces -- but to let you easily see what kind of balance there is between the effort spent developing dedicated tools and their performance, compared to one-liners or short shell or awk scriptlets.
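The exact compile and invocation commands are not preserved above; a plausible reconstruction, assuming GCC on Ubuntu and that the source is saved as decimal-digits.c in the current directory:

```shell
# Hypothetical commands; file name, flags, and install path are assumptions.
if [ -f decimal-digits.c ]; then
    gcc -Wall -O2 decimal-digits.c -o decimal-digits

    # Optionally install system-wide (assumed destination):
    # sudo install -m 0755 decimal-digits /usr/bin/decimal-digits

    # 1000000000 bytes / 100 digits per line / 2 bytes per digit = 5000000 lines:
    ./decimal-digits 100 5000000 > digits.txt
fi
```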
Using the GNU C library, calling the fputc() function for every character output incurs a very small overhead (an indirect function call, or conditionals -- the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, redirecting the output to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.
I happen to often write such programs and generators only because I like to play with huge datasets. I'm weird that way. For example, I once wrote a program to print all finite positive IEEE-754 floating-point values into a text file, with sufficient precision to yield the exact same value when parsed. The file was a few gigabytes in size (perhaps 4G or so); there are not as many finite positive floats as one might think. I used this to compare implementations that read and parse such data.
For normal use cases, like the OP's, shell scripts and scriptlets and one-liners are the better approach: less time spent to accomplish the overall task. (Except if they need a different file every day, or there are many people who need a different file, in which -- rare -- case a dedicated tool like the above might warrant the effort spent.)