Generate 1 GB Text File – Fastest Way to Create Random Digits

filesrandom

I tried a bash script, but it took too long to create a simple 1 MB file. I think the answer lies in using /dev/random or /dev/urandom, but other posts here only show how to add all kinds of data to a file using these things, but I want to add only numbers.

So, is there a command that I can use to create a random file of size 1 GB containing only numbers between 0 and 9?

Edit:
I want the output to be something like this

0 1 4 7 ..... 9
8 7 5 8 ..... 8
....
....
8 7 5 3 ..... 3

The range is 0 – 9 meaning only numbers 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. Also I need them to be space separated and 100 per line, up to n number of lines. This n is something I don't care, I want my final size to be 1 GB.

Edit:
I am using Ubuntu 16.04 LTS

Best Answer

This is partially a tongue-in-cheek answer, because of the title of the question.

When you look for "the fastest way to ...", the answer is almost always some specialized tool. This "answers" shows one such tool, just so you can experiment.

This is not a serious answer, because you should not look into specialized tools for jobs you only do once, or very rarely. You see, you'll end up spending more time looking for tools and learning about them, than actually doing stuff. Shells and utilities like bash and awk are not the fastest, but you can usually write a one-liner to achieve the job, spending only seconds. Better scripting languages like perl can also be used, although the learning curve for perl is steep, and I hesitate to recommend it for such purposes, because I've been traumatized by awful perl projects. python on the other hand is slightly handicapped by its rather slow I/O; it is only an issue when you filter or generate gigabytes of data, though.

In any case, the following C89 example program (which uses POSIX.1 for higher accuracy clock only if available) should achieve about 100 MB/s generation rate (tested in Linux on a laptop with an Intel i5-4200U processor, piping the output to /dev/null), using a pretty good pseudo-random number generator. (The output should pass all the BigCrunch tests, except the MatrixRank test, as the code uses xorshift64* and the exclusion method to avoid biasing the digits.)

decimal-digits.c:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

/* This program is licensed under the CC0 license,
       https://creativecommons.org/publicdomain/zero/1.0/
   In other words, this is dedicated to the public domain.
   There are no warranties either, so if something breaks,
   you only have yourself to blame.
*/

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    for (line = 0UL; line < lines; line++) {
        fputc('0' + digit(), stdout);
        for (col = 1UL; col < cols; col++) {
            fputc(' ', stdout);
            fputc('0' + digit(), stdout);
        }
        fputc('\n', stdout);

        /* Check for write errors. */
        if (ferror(stdout))
            return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
}

We can make it a lot faster, if we switch to a line buffer, and fwrite() it once instead of outputting each digit at a time. Note that we still keep the stream fully buffered, to avoid partial (non-power-of-two) writes if the output is a block device.

#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <locale.h>
#include <ctype.h>
#include <stdio.h>
#include <errno.h>
#include <time.h>

#if _POSIX_C_SOURCE-199309 >= 0
static uint64_t time_seed(void)
{
    struct timespec  ts;

    if (clock_gettime(CLOCK_REALTIME, &ts))
        return (uint64_t)time(NULL);

    return (uint64_t)ts.tv_sec
         ^ (((uint64_t)ts.tv_nsec) << 32);
}
#else
static uint64_t time_seed(void)
{
    return (uint64_t)time(NULL);
}
#endif

/* Preferred output I/O block size.
 * Currently, about 128k blocks yield
 * maximum I/O throughput on most devices.
 * Note that this is a heuristic value,
 * and may be increased in the future.
*/
#ifndef  IO_BLOCK_SIZE
#define  IO_BLOCK_SIZE  262144
#endif

/* This is the Xorshift* pseudo-random number generator.
 * See https://en.wikipedia.org/wiki/Xorshift#xorshift.2A
 * for details. This is an incredibly fast generator that
 * passes all but the MatrixRank test of the BigCrush
 * randomness test suite, with a period of 2^64-1.
 * Note that neither xorshift_state, nor the result of
 * this function, will ever be zero.
*/
static uint64_t xorshift_state;

static uint64_t xorshift_u64(void)
{
    xorshift_state ^= xorshift_state >> 12;
    xorshift_state ^= xorshift_state << 25;
    xorshift_state ^= xorshift_state >> 27;
    return xorshift_state * UINT64_C(2685821657736338717);
}

/* This function returns a number between (inclusive)
 * 0 and 999,999,999,999,999,999 using xorshift_u64()
 * above, using the exclusion method. Thus, there is
 * no bias in the results, and each digit should be
 * uniformly distributed in 0-9.
*/
static uint64_t quintillion(void)
{
    uint64_t result;

    do {
        result = xorshift_u64() & UINT64_C(1152921504606846975);
    } while (!result || result > UINT64_C(1000000000000000000));

    return result - UINT64_C(1);
}

/* This function returns a single uniformly random digit.
*/
static unsigned char digit(void)
{
    static uint64_t       digits_cache = 0;
    static unsigned char  digits_cached = 0;
    unsigned char         retval;

    if (!digits_cached) {
        digits_cache = quintillion();
        digits_cached = 17; /* We steal the first one! */
    } else
        digits_cached--;
    
    retval = digits_cache % (uint64_t)(10);
    digits_cache /= (uint64_t)(10);

    return retval;
}

static int parse_ulong(const char *src, unsigned long *to)
{
    const char   *end = src;
    unsigned long value;

    if (!src)
        return errno = EINVAL;

    errno = 0;
    value = strtoul(src, (char **)&end, 0);
    if (errno)
        return errno;

    if (end == src)
        return errno = EINVAL;
    while (*end)
        if (isspace(*end))
            end++;
        else
            return errno = EINVAL;

    if (to)
        *to = value;
    return 0;
}

int main(int argc, char *argv[])
{
    unsigned long lines, cols, line, col, seed;
    char         *oneline;
    
    /* When parsing the command-line parameters,
     * use locale conventions. */
    setlocale(LC_ALL, "");

    /* Standard output should be fully buffered, if possible.
     * This only affects output speed, so we're not too worried
     * if this happens to fail. */
    (void)setvbuf(stdout, NULL, _IOFBF, (size_t)IO_BLOCK_SIZE);

    if (argc < 3 || argc > 4 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s COLS LINES [ SEED ]\n", argv[0]);
        fprintf(stderr, "\n");
        fprintf(stderr, "This program generates random decimal digits\n");
        fprintf(stderr, "0 - 9, separated by spaces, COLS per line,\n");
        fprintf(stderr, "LINES lines.  In total, COLS*LINES*2 bytes\n");
        fprintf(stderr, "will be used.\n");
        fprintf(stderr, "\n");
        fprintf(stderr, "SEED is the optional seed for the Xorshift64*\n");
        fprintf(stderr, "pseudo-random number generator used in this program.\n");
        fprintf(stderr, "If omitted, current time is used as the seed.\n");
        fprintf(stderr, "\n");
        return EXIT_SUCCESS;
    }

    if (parse_ulong(argv[1], &cols) || cols < 1UL) {
        fprintf(stderr, "%s: Invalid number of digits per line.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (parse_ulong(argv[2], &lines) || lines < 1UL) {
        fprintf(stderr, "%s: Invalid number of lines.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (argc > 3) {
        if (parse_ulong(argv[3], &seed)) {
            fprintf(stderr, "%s: Invalid Xorshift64* seed.\n", argv[3]);
            return EXIT_FAILURE;
        }
    } else
        seed = time_seed();

    /* Since zero seed is invalid, we map it to ~0. */
    xorshift_state = seed;
    if (!xorshift_state)
        xorshift_state = ~(uint64_t)0;

    /* Discard first 1000 values to make the initial values unpredictable. */
    for (col = 0; col < 1000; col++)
        xorshift_u64();

    /* Allocate memory for a full line. */
    oneline = malloc((size_t)(2 * cols + 1));
    if (!oneline) {
        fprintf(stderr, "Not enough memory for %lu column buffer.\n", cols);
        return EXIT_FAILURE;
    }

    /* Set spaces and terminating newline. */
    for (col = 0; col < cols; col++)
        oneline[2*col + 1] = ' ';
    oneline[2*cols-1] = '\n';

    /* Not needed, but in case a code modification treats it as a string. */
    oneline[2*cols] = '\0';

    for (line = 0UL; line < lines; line++) {
        for (col = 0UL; col < cols; col++)
            oneline[2*col] = digit();

        if (fwrite(oneline, 2*cols, 1, stdout) != 1)
            return EXIT_FAILURE; 
    }

    /* Check for write errors. */
    if (ferror(stdout))
        return EXIT_FAILURE;

    return EXIT_SUCCESS;
}

Note: both examples edited on 2016-11-18 to ensure uniform distribution of digits (zero is excluded; see e.g. here for comparison and details on various pseudo-random number generators).

Compile using for example

gcc -Wall -O2 decimal-digits.c -o decimal-digits

and optionally install system-wide to /usr/bin using

sudo install -o root -g root -m 0755 decimal-digits /usr/bin

It takes the number of digits per line, and the number of lines. Because 1000000000 / 100 / 2 = 5000000 (five million; total bytes divided by columns divided by 2), you can use

./decimal-digits 100 5000000 > digits.txt

to generate the gigabyte-sized digits.txt as desired by the OP.

Note that the program itself is written more with readability than efficiency in mind. My intent here is not to showcase the efficiency of the code — I'd use POSIX.1 and low-level I/O anyway, rather than generic C interfaces — but to let you easily see what kind of balance there is with effort spent in developing dedicated tools versus their performance, compared to one-liners or short shell or awk scriptlets.

Using the GNU C library, calling the fputc() function for every character output incurs a very small overhead (of an indirect function call, or conditionals -- the FILE interface is actually pretty complex and versatile, you see). On this particular Intel Core i5-4200U laptop, redirecting the output to /dev/null, the first (fputc) version takes about 11 seconds, whereas the line-at-a-time version takes just 1.3 seconds.

I happen to often write such programs and generators only because I like to play with huge datasets. I'm weird that way. For example, I once wrote a program to print all finite positive IEEE-754 floating-point values into a text file, with sufficient precision to yield the exact same value when parsed. The file was a few gigabytes in size (perhaps 4G or so); there are not that many finite positive floats as one might think. I used this to compare implementations that read and parse such data.

For normal use cases, like the OP is having, shell scripts and scriptlets and one-liners are the better approach. Less time spent to accomplish the overall task. (Except if they need a different file every day or so, or there are many people who need a different file, in which — rare — case, a dedicated tool like above, might warrant the effort spent.)

Related Question