Shell – Is there any faster way to get this output file in linux

grepshell-scripttext processing

cat file_1

my colour is red
my rose is red
my colour is blue
my rose id blue

cat file_2 
red
blue

cat output_file should be
my colour is red
my colour is blue

here i am using

cat file_2 | while read line;do cat file_1 | grep "$line" | head -1;done

here i am trying to get the top most line containing the pattern "red" and "blue" which is present in the file_2

is there any other way to do , as fast as possible, while loop is taking time

Best Answer

You can use a while construct to loop over the patterns from file2 and then use -m 1 with grep to stop after first match on file1:

while IFS= read -r i; do grep -Fm1 "$i" file1; done <file2

-F treats the pattern literally
-m 1 makes grep to exit after first match

Shell loops are usually not efficient, but given the pattern list is small it is usable in this case.

Faster alternative, xargs:

xargs -a file2 -n1 -P2 -I'{}' grep -Fm1 {} file1

Use more parallel processes (-P) for more patterns.

Example:

% while IFS= read -r i; do grep -Fm1 "$i" file1; done <file2
my colour is red
my colour is blue

% xargs -a file2 -n1 -P2 -I'{}' grep -Fm1 {} file1
my colour is blue
my colour is red

Related Solutions

Text Processing – Faster Way to Remove a Line by Line Number

What you could do to avoid writing a copy of the file is to write the file over itself like:

{
  sed "$l1,$l2 d" < file
  perl -le 'truncate STDOUT, tell STDOUT'
} 1<> file

Dangerous as you've no backup copy there.

Or avoiding sed, stealing part of manatwork's idea:

{
  head -n "$(($l1 - 1))"
  head -n "$(($l2 - $l1 + 1))" > /dev/null
  cat
  perl -le 'truncate STDOUT, tell STDOUT'
} < file 1<> file

That could still be improved because you're overwriting the first l1 - 1 lines over themselves while you don't need to, but avoiding it would mean a bit more involved programming, and for instance do everything in perl which may end up less efficient:

perl -ne 'BEGIN{($l1,$l2) = ($ENV{"l1"}, $ENV{"l2"})}
    if ($. == $l1) {$s = tell(STDIN) - length; next}
    if ($. == $l2) {seek STDOUT, $s, 0; $/ = \32768; next}
    if ($. > $l2) {print}
    END {truncate STDOUT, tell STDOUT}' < file 1<> file

Some timings for removing lines 1000000 to 1000050 from the output of seq 1e7:

sed -i "$l1,$l2 d" file: 16.2s
1st solution: 1.25s
2nd solution: 0.057s
3rd solution: 0.48s

They all work on the same principle: we open two file descriptors to the file, one in read-only mode (0) using < file short for 0< file and one in read-write mode (1) using 1<> file (<> file would be 0<> file). Those file descriptors point to two open file descriptions that will have each a current cursor position within the file associated with them.

In the second solution for instance, the first head -n "$(($l1 - 1))" will read $l1 - 1 lines worth of data from fd 0 and write that data to fd 1. So at the end of that command, the cursor on both open file descriptions associated with fds 0 and 1 will be at the start of the $l1th line.

Then, in head -n "$(($l2 - $l1 + 1))" > /dev/null, head will read $l2 - $l1 + 1 lines from the same open file description through its fd 0 which is still associated to it, so the cursor on fd 0 will move to the beginning of the line after the $l2 one.

But its fd 1 has been redirected to /dev/null, so upon writing to fd 1, it will not move the cursor in the open file description pointed to by {...}'s fd 1.

So, upon starting cat, the cursor on the open file description pointed to by fd 0 will be at the start of the next line after $l2, while the cursor on fd 1 will still be at the beginning of the $l1th line. Or said otherwise, that second head will have skipped those lines to remove on input but not on output. Now cat will overwrite the $l1th line with the next line after $l2 and so on.

cat will return when it reaches the end of file on fd 0. But fd 1 will point to somewhere in the file that has not been overwritten yet. That part has to go away, it corresponds to the space occupied by the deleted lines now shifted to the end of the file. What we need is to truncate the file at the exact location where that fd 1 points to now.

That's done with the ftruncate system call. Unfortunately, there's no standard Unix utility to do that, so we resort on perl. tell STDOUT gives us the current cursor position associated with fd 1. And we truncate the file at that offset using perl's interface to the ftruncate system call: truncate.

In the third solution, we replace the writing to fd 1 of the first head command with one lseek system call.

Bash – way to make this one-liner faster

There's a part you can easily improve, but it isn't the slowest part.

find /home/mydir/ -type f | sort | \
awk "/xml_20140207_000016.zip/,/xml_20140207_235938.zip/"

This is somewhat wasteful because it first lists all files, then sorts the file names and extracts the interesting ones. The find command has to run to completion before the sorting can begin.

It would be faster to list only the interesting files in the first place, or at least as small a superset as possible. If you need a finer-grained filter on names than find is capable of, pipe into awk, but don't sort: awk and other line-by-line filters can process lines one by one but sort needs the complete input.

find /home/mydir/ -name 'xml_20140207_??????.zip' -type f | \
awk 'match($0, /_[0-9]*.zip$/) &&
     (time = substr($0, RSTART+1, RLENGTH-5)) &&
     time >= 16 && time <= 235938' |
xargs -n 1 -P 10 zipgrep "my search string"

The part which is most obviously suboptimal is zipgrep. Here there is no easy way to improve performance because of the limitations of shell programming. The zipgrep script operates by listing the file names in the archive, and calling grep on each file's content, one by one. This means that the zip archive is parsed again and again for each file. A Java program (or Perl, or Python, or Ruby, etc.) can avoid this by processing the file only once.

If you want to stick to shell programming, you can try mounting each zip instead of using zipgrep.

… | xargs -n1 -P2 sh -c '
    mkdir "mnt$$-$1";
    fuse-zip "$1" "mnt$$-$1";
    grep -R "$0" "mnt$$-$1"
    fusermount -u "mnt$$-$1"
' "my search string"

Note that parallelism isn't going to help you much: the limiting factor on most setups will be disk I/O bandwidth, not CPU time.

I haven't benchmarked anything, but I think the biggest place for improvement would be to use a zipgrep implementation in a more powerful language.

Best Answer

Related Solutions

Text Processing – Faster Way to Remove a Line by Line Number

Bash – way to make this one-liner faster

Related Question