Shell – Best Way to Find Multiple Strings in Large Text File

awk grep ibm-unix-system-services sed shell

The short, general question is: In Unix/Linux, what is the best way to find a list of several (about 150) strings within a large text file?

I am asking this to all Unix/Linux experts as a general question, in the hopes that I can find a solution that pertains to my particular case: I have a feeling this is going to take a little tinkering.

I have a large text file (actually, an MVS dataset) on an IBM Unix System Services (USS) machine; I believe it is somewhere around 6GB.

I also have a list of about 150 5-character identifiers in the format AAAAA that I need to find within this file. That is, I'd like to extract the rows from the file that contain any one of the 150 specific identifiers that I am looking for.

The format of each line in the large file is:

00000000000A00000000000000000AAAAA\n

where 0 represents a digit, and A represents an alphanumeric character. The string that I'm searching for is always at the end of the row.

Working with datasets seems to be a little awkward in USS, and I am not able to copy it over into the Unix environment because it is too large. The standard Unix utilities don't all operate on datasets (dd for example); however sed, awk, and grep seem to work to some degree (although the command line switches seem to be a bit different).

I can grep the dataset as follows:

cat  "//'MVS.DATASET'" | grep -e"LOOKFOR1" -e"LOOKFOR2" -e"LOOKFOR3" > output_to_file.txt

However, it won't allow me to grep for all 150 items on one line. I could split the list up and run grep several times, but I feel like there should be a better way.

I tried using a sed script as follows, but I don't know sed at all, and I got an error that said "garbage after command". I saved the following in a file sed-script.txt:

s/AAA01/&/p
s/AAA30/&/p
s/AAA10/&/p
... etc ...

and then ran sed -f sed-script.txt "//'MVS.DATASET'"

Again, this failed with "sed: FSUM7294 garbage after command".

So:
1. How would one normally tackle this problem in the "average" Unix environment?
2. Do you have any specific insights into this particular case?

Best Answer

grep can read its patterns from a file, one per line, with -f, and it becomes more efficient if you also tell it to treat them as fixed strings rather than regular expressions (-F):

grep -F -f patterns.txt "//'MVS.DATASET'"
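
To try this on a small scale first, here is a minimal sketch that builds a pattern file and a three-line stand-in for the dataset, then runs the same grep. The file sample.txt, its contents, and the variable names are made up for illustration, and mktemp -d is assumed to be available (it is not in every environment, so substitute any scratch directory). Because grep -F matches the identifier anywhere on the line, the sketch also shows an awk variant that compares only the last five characters, which may be the safer choice if an identifier could also turn up in the leading part of a row.

```shell
# Scratch directory for the demo files (assumes mktemp -d exists here).
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# patterns.txt: one identifier per line, no blank lines, no trailing spaces.
printf '%s\n' AAA01 AAA30 AAA10 > "$workdir/patterns.txt"

# sample.txt: hypothetical stand-in for the dataset, same row layout.
printf '%s\n' \
  '00000000000A00000000000000000AAA01' \
  '00000000000B00000000000000000ZZZ99' \
  '00000000000C00000000000000000AAA10' \
  > "$workdir/sample.txt"

# Fixed-string match of every pattern against every row.
grep_matches=$(grep -F -f "$workdir/patterns.txt" "$workdir/sample.txt")

# Alternative: load the identifiers into an awk array while reading the
# first file, then print rows whose last five characters are in that array.
awk_matches=$(awk '
  NR == FNR { ids[$0]; next }
  (substr($0, length($0) - 4) in ids)
' "$workdir/patterns.txt" "$workdir/sample.txt")

printf '%s\n' "$grep_matches"
```

On this sample both commands select the rows ending in AAA01 and AAA10. Whether the USS versions of grep and awk accept exactly these options on a dataset name is something to check on that system.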