I have a wordlist from which I'd like to extract every line that contains only numbers, into a new file. What do I do?
Delete all lines of text that aren't all numbers
grep, text-processing
Related Solutions
You can use csplit. It does the job well, except that it's somewhat inflexible regarding the output file names (you can only specify a prefix, not a suffix), and you need a first pass to calculate the number of pieces.
csplit -f text -- input.txt '//' "{$(wc -l <input.txt)}"
for x in text[0-9]*; do mv -- "$x" "$x.txt"; done
The GNU version, but not the OSX version, has extensions that solve both issues.
csplit -b '%d.txt' -f text -- input.txt '//' '{*}'
Alternatively, if csplit is too inflexible, you can use awk.
awk '{filename = sprintf("text%d.txt", NR); print >filename; close(filename)}' input.txt
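For example, on a made-up three-line input.txt (file contents here are invented for illustration):
$ printf 'foo\nbar\nbaz\n' >input.txt
$ awk '{filename = sprintf("text%d.txt", NR); print >filename; close(filename)}' input.txt
$ head text[0-9]*.txt
==> text1.txt <==
foo

==> text2.txt <==
bar

==> text3.txt <==
baz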
Here is an alternative method, along with a bit of benchmarking to add to that in Weijun Zhou's answer.
join
Assuming you have a data file you want to extract rows from and a line_numbers file that lists the numbers of the rows you want to extract, if the sorting order of the output is not important you can use:
join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | cut -d ' ' -f 2-
This will number the lines of your data file, join it with the padded_line_numbers file on the first field (the default) and print out the common lines (excluding the join field itself, which is cut away).
join needs the input files to be sorted alphabetically. The aforementioned padded_line_numbers file has to be prepared by left-padding each line of your line_numbers file. E.g.:
while IFS= read -r rownum; do
  printf '%.12d\n' "$rownum"
done <line_numbers >padded_line_numbers
The -w 12 -n rz options and arguments instruct nl to output 12-digit numbers with leading zeros.
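As a quick sanity check, both the padding loop and nl produce the same 12-digit, zero-padded keys (nl separates the number from the line with a tab):
$ printf '%.12d\n' 42
000000000042
$ printf 'foo\nbar\n' | nl -w 12 -n rz
000000000001	foo
000000000002	bar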
If the sorting order of the output has to match that of your line_numbers file, you can use:
join -1 2 -2 1 <(nl padded_line_numbers | sort -k 2,2) \
<(nl -w 12 -n rz data) |
sort -k 2,2n |
cut -d ' ' -f 3-
Here we number the padded_line_numbers file, sort the result alphabetically by its second field, join it with the numbered data file, and sort the output numerically back into the original order of padded_line_numbers.
Process substitution is used here for convenience. If you cannot or do not want to rely on it and, as is likely, you are not willing to waste the storage needed for regular files holding the intermediate results, you can leverage named pipes instead:
mkfifo padded_line_numbers
mkfifo numbered_data
while IFS= read -r rownum; do
  printf '%.12d\n' "$rownum"
done <line_numbers | nl | sort -k 2,2 >padded_line_numbers &
nl -w 12 -n rz data >numbered_data &
join -1 2 -2 1 padded_line_numbers numbered_data | sort -k 2,2n | cut -d ' ' -f 3-
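Once the pipeline has finished, the FIFOs can be removed like ordinary files, e.g.:
rm -- padded_line_numbers numbered_data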
Benchmarking
Since the peculiarity of your question is the number of rows in your data file, I thought it could be useful to test alternative approaches with a comparable amount of data.
For my tests I used a 3.2-billion-line data file. Each line is just 2 bytes of garbage coming from openssl enc, hex-encoded using od -An -tx1 -w2 and with spaces removed with tr -d ' ':
$ head -n 3 data
c15d
061d
5787
$ wc -l data
3221254963 data
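For the record, such a file can be generated with a pipeline along these lines (a sketch: the exact openssl enc arguments are not shown above, so the cipher and passphrase here are assumptions, and -v is added to keep od from collapsing repeated lines):
openssl enc -aes-256-ctr -pbkdf2 -nosalt -pass pass:seed </dev/zero 2>/dev/null |
  od -An -tx1 -w2 -v |       # hex-encode, two bytes per output line
  tr -d ' ' |                # remove the spaces od inserts
  head -n 3221254963 >data   # keep as many lines as needed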
The line_numbers file has been created by randomly choosing 10,000 numbers between 1 and 3,221,254,963, without repetition, using shuf from GNU Coreutils:
shuf -i 1-"$(wc -l <data)" -n 10000 >line_numbers
The testing environment was a laptop with an Intel i7-2670QM quad-core processor, 16 GiB of memory, SSD storage, GNU/Linux, bash 5.0 and GNU tools.
The only dimension I measured was the execution time, by means of the time shell builtin.
Here I'm considering:
- The sed solution from Weijun Zhou's answer.
- The awk solution from Micha's answer.
- The perl solution from wurtel's answer.
- The join solution above.
perl seems to be the fastest:
$ time perl_script line_numbers data | wc -l
10000
real 14m51.597s
user 14m41.878s
sys 0m9.299s
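The script from wurtel's answer is not reproduced here; a minimal sketch of the same idea (hash the wanted line numbers, then scan data once) might look like:
perl -ne '
    if (@ARGV) { $want{$_ + 0} = 1 }    # still reading line_numbers: remember each number
    else       { print if $want{$.} }   # reading data: $. is the current line number
    close ARGV if eof;                  # reset $. at the end of each input file
' line_numbers data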
awk's performance looks comparable:
$ time awk 'FNR==NR { seen[$0]++ }; FNR!=NR && FNR in seen' line_numbers data | wc -l
10000
real 29m3.808s
user 28m52.616s
sys 0m10.709s
join, too, appears to be comparable:
$ time join <(sort padded_line_numbers) <(nl -w 12 -n rz data) | wc -l
10000
real 28m24.053s
user 27m52.857s
sys 0m28.958s
Note that the order-preserving version mentioned above incurs essentially no performance penalty over this one.
Finally, sed appears to be significantly slower: I killed it after approximately nine hours:
$ time sed -nf <(sed 's/$/p/' line_numbers) data | wc -l
^C
real 551m12.747s
user 550m53.390s
sys 0m15.624s
Best Answer
To extract lines containing only digits:
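grep -x '[0-9][0-9]*' wordlist.txt >numbers.txt
where wordlist.txt stands for your word list and numbers.txt for the new file (the names are placeholders).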
The regular expression [0-9][0-9]* will match any line that contains at least one digit, and the -x option to grep requires that the whole line matches the given expression.

If your file doesn't have empty lines, you may change it to grep -x '[0-9]*'.

If you want lines with numbers, delete lines with alphabetic characters instead (easier than trying to construct a regular expression for a generic number):
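grep -v '[a-zA-Z]' wordlist.txt >numbers.txt
(this keeps any line that contains no letters, including lines with punctuation or spaces)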
Both variants using POSIX character classes:
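grep -x '[[:digit:]][[:digit:]]*' wordlist.txt >numbers.txt
grep -v '[[:alpha:]]' wordlist.txt >numbers.txt
(the character classes also behave sensibly in locales where ranges like [a-z] may not mean what you expect)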
Update: If you want to select the lines containing floating point numbers, you could use the (extended) regular expression ^[+-]?([0-9]*\.)?[0-9]+$:
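grep -E '^[+-]?([0-9]*\.)?[0-9]+$' wordlist.txt >numbers.txt
(matching, e.g., 42, -3.14 and .5, but not 3. or 1e6)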
It all comes down to what kind of "number" we're looking for.