Text Processing – Remove First 300 Million Lines from a 700 GB txt File

awk, files, head, sed, tail

How do I remove the first 300 million lines from a 700 GB text file
on a system with 1 TB disk space total, with 300 GB available? 
(My system has 2 GB of memory.) 
The answers I found use sed, tail, or head.
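They typically look something like this (illustrative only; the +300000001 offset makes tail start at line 300,000,001, i.e. it skips the first 300 million lines):

tail -n +300000001 file > newFile
sed '1,300000000d' file > newFile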

But I think (please correct me if I'm wrong) I cannot use them, because the disk space is limited to 1 TB and they produce a new file and/or keep a temporary file during processing.

The file contains database records in JSON format.

Best Answer

If you have enough space to compress the file, doing so should free a significant amount of space and let you carry out further operations. You can try this:

gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz

That will first gzip the original input file (file) to create file.gz; gzip removes the uncompressed original after successful compression, which is what frees the space. Then you zcat the newly created file.gz, pipe it through tail -n +300000001 to remove the first 300 million lines, compress the result to save disk space, and save it as newFile.gz. The && ensures that you only continue if the gzip operation was successful (it will fail if you run out of space).
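Once newFile.gz exists and you have checked it, you could then remove the intermediate compressed original and, if you need the trimmed data back as plain text and the uncompressed result fits in the freed space, expand it again. A minimal sketch, assuming you no longer need file.gz:

rm file.gz          # drop the compressed original to reclaim space
gunzip newFile.gz   # optional: recreate a plain-text newFile, only if it fits on disk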

Note that text files are very compressible. For example, I created a test file using seq 400000000 > file, which prints the numbers from 1 to 400,000,000; this resulted in a 3.7G file. When I compressed it using the commands above, the compressed file was only 849M and the newFile.gz I created was only 213M.
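If you want to reproduce that test, the full sequence looks roughly like this (sizes are approximate and will vary with the gzip version and filesystem):

seq 400000000 > file                                    # ~3.7G of plain text
gzip file                                               # ~849M file.gz; removes file
zcat file.gz | tail -n +300000001 | gzip > newFile.gz   # ~213M
ls -lh file.gz newFile.gz                               # check the resulting sizes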
