grep command options
I wanted to search my whole drive for a string. Following the accepted answer in Stack Overflow I used:
sudo time grep -rnw '/' -e 'Sony 50"'
and it took 53 hours to process 20 GB of data on one of the fastest PCIe NVMe M.2 SSDs around, a Samsung 960 Pro.
grep output log
When grep processes some files it generates error messages. These can be suppressed by appending 2>/dev/null to the command. However, the errors give feedback on the progress being made. Some of the sample output (it won't all fit) is below:
Binary file /home/Me/.config/google-chrome/Default/Sync Data/SyncData.sqlite3 matches
grep: /sys/kernel/security/ima/policy: Permission denied
grep: /sys/kernel/slab/:dt-0000008/alloc_calls: Function not implemented
grep: /sys/kernel/slab/:dt-0000008/free_calls: Function not implemented
(... SNIP ... 12 hours later PID 882 processed below...)
grep: /proc/882/task/922/attr/sockcreate: Invalid argument
grep: /proc/882/task/923/mem: Input/output error
(... SNIP ... 24 hours later PID 2954 below...)
grep: /proc/2598/attr/sockcreate: Invalid argument
grep: /proc/2954/task/2954/mem: Input/output error
(... SNIP ... 42 hours later PID 4396 below...)
grep: /proc/4389/attr/sockcreate: Invalid argument
grep: /proc/4396/task/4396/mem: Input/output error
(... SNIP ... After 53 hours `grep` finally finishes...)
grep: /run/user/1000/gvfs: Permission denied
Command exited with non-zero status 2
97355.34user 83223.12system 53:07:40elapsed 94%CPU (0avgtext+0avgdata 31116maxresident)k
593910020inputs+0outputs (1major+10731minor)pagefaults 0swaps
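Since the errors double as a progress indicator, one option is to send them to a file rather than discard them. The sketch below runs on a scratch directory instead of / and contrasts the two redirections; the names demo and errlog are made up for illustration.

```shell
# Scratch directory standing in for the real search root (hypothetical).
demo=$(mktemp -d)
errlog=$(mktemp)
printf 'TV: Sony 50"\n' > "$demo/notes.txt"

# A path that does not exist makes grep print an error on stderr and
# exit with status 2, like the "non-zero status 2" in the log above.
# 2>/dev/null discards the error; matches on stdout are unaffected.
quiet=$(grep -rn 'Sony 50"' "$demo" "$demo/missing" 2>/dev/null) || true

# Sending stderr to a file keeps the errors as a progress trail
# without cluttering the terminal.
logged=$(grep -rn 'Sony 50"' "$demo" "$demo/missing" 2>"$errlog") || true

echo "$quiet"
cat "$errlog"
```

During a long search, tail -f on the error file from a second terminal would show how far grep has got.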
grep gives the impression it's frozen
Sometimes I thought grep was frozen because the screen didn't update for an hour and the hard disk light didn't flash much. However, Conky showed it was still running and using 100% CPU on a single core, as seen in this GIF.
The Linux (Ubuntu 16.04.3 LTS) partition has 19.5 GiB in use out of 43.8 GiB, and about half of that space, 10 GB, is taken up by kernels. Downloading and testing kernels is my pastime.
This test took most of my weekend plus Monday to complete.
How can I speed up grep and still get what I'm looking for?
Best Answer
Exclude virtual file systems
Looking at the sample output log, we see that virtual file systems are included in the search, which is an unnecessary waste of time. Drop these and other directories from the search with the --exclude-dir option.
When grep parses the /proc directory chain it is uselessly looking at all the process IDs, which takes more than a day in my case.
Also, when processing /mnt it will be looking at mounted Windows NTFS drives and USB drives unnecessarily. /media holds the CD/DVD drive and external USB drives.
With these excluded, there you go: 56 seconds instead of 50 hours!
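A minimal sketch of the exclusion described above, run against a scratch tree instead of / so it is safe to try. The tree layout, the file names and the choice of four directories are assumptions for illustration, not the exact list behind the 56-second run.

```shell
# Scratch tree standing in for / (hypothetical layout).
root=$(mktemp -d)
mkdir -p "$root/proc/882" "$root/sys/kernel" "$root/mnt/c" \
         "$root/media/usb" "$root/home/Me"
for f in proc/882/cmdline sys/kernel/note mnt/c/list.txt \
         media/usb/list.txt home/Me/wishlist.txt; do
    printf 'TV: Sony 50"\n' > "$root/$f"
done

# One --exclude-dir per directory name. In bash the brace form
# --exclude-dir={proc,sys,mnt,media} expands to these same four options.
matches=$(grep -rnw \
    --exclude-dir=proc --exclude-dir=sys \
    --exclude-dir=mnt --exclude-dir=media \
    "$root" -e 'Sony 50"')

# Only the file under home/Me is reported.
echo "$matches"
```

Against the real file system the same options would slot into the original command, for example sudo time grep -rnw --exclude-dir=proc --exclude-dir=sys --exclude-dir=mnt --exclude-dir=media '/' -e 'Sony 50"'.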
Note that if you also exclude usr (containing 6.5 GB of files in my case) from the search, it takes only 8 seconds.
Interesting Notes
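The prefix behaviour covered in these notes is easier to see on a small tree. GNU grep compares each --exclude-dir pattern with the base name of every subdirectory it meets while recursing, so a bare tmp skips every directory called tmp at any depth. A sketch on a made-up scratch tree:

```shell
# Scratch tree with a tmp directory at two different depths (hypothetical).
root=$(mktemp -d)
mkdir -p "$root/tmp" "$root/home/Me/tmp" "$root/home/Me/docs"
printf 'TV: Sony 50"\n' > "$root/tmp/a.txt"
printf 'TV: Sony 50"\n' > "$root/home/Me/tmp/b.txt"
printf 'TV: Sony 50"\n' > "$root/home/Me/docs/c.txt"

# -l lists matching files only. The bare name tmp is matched against
# base names, so both tmp directories are skipped.
bare=$(grep -rlw --exclude-dir=tmp "$root" -e 'Sony 50"')

# Only home/Me/docs/c.txt is listed.
echo "$bare"
```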
Keeping out the system directories seems to keep grep on a better track, and it never hits 100% CPU on a single core. Plus the hard disk light flashes constantly, so you know grep is really working and not "thinking in circles".
If you don't prefix tmp with / then it will ignore any sub-directory named tmp, for example /home/Me/tmp. If you use --exclude-dir with /tmp then your directory /home/Me/tmp will be searched.
If on the other hand you prefix sys with / then the /sys directory is searched and errors are reported. The same is true for /proc. So you have to use sys,proc and not prefix them with /. The same is true for other system directories I tested.
Create alias grepall
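A minimal sketch of such a shortcut, written as a shell function rather than an alias because functions also work inside scripts; a roughly equivalent alias line is shown in the comment. The directory list is an assumption assembled from the directories this answer excludes, and the scratch tree exists only to show a call.

```shell
# Could be placed in ~/.bashrc. Roughly equivalent alias:
#   alias grepall='grep -rnw --exclude-dir=proc --exclude-dir=sys ...'
grepall() {
    grep -rnw \
        --exclude-dir=proc --exclude-dir=sys \
        --exclude-dir=media --exclude-dir=mnt \
        --exclude-dir=src --exclude-dir=modules \
        "$@"
}

# Demonstration on a scratch tree (hypothetical paths).
root=$(mktemp -d)
mkdir -p "$root/usr/src" "$root/home/Me"
printf 'TV: Sony 50"\n' > "$root/usr/src/header.h"
printf 'TV: Sony 50"\n' > "$root/home/Me/wishlist.txt"

# usr/src is skipped because its base name is src.
found=$(grepall "$root" -e 'Sony 50"')
echo "$found"
```

On the real system the call would be grepall / -e 'Sony 50"'. Because sudo runs external commands, a function like this has to be called without sudo or wrapped in a script.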
Consider setting up an alias in ~/.bashrc so you don't have to type the --exclude-dir parameter list every time.
Detailed time breakdown
This section breaks down how much time is saved by incrementally adding directories to the --exclude-dir parameter list:
/proc and /sys saving 52 hours
/media saving 3 minutes
/mnt saving 21 minutes
/usr/src (by specifying src) saving 53 seconds
/lib/modules (by specifying modules) saving 39 seconds
Exclude /proc and /sys directories
The /proc and /sys directories are the most time-consuming and the most useless to search, and they generate the most errors. They are "useless" because these two directories are dynamically created at run time and don't contain permanent files you would want to grep.
A great time saving is realized by excluding them: only 27 minutes this time, saving over 52 hours!
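The "over 52 hours" figure can be checked with shell arithmetic, taking the 53:07:40 elapsed time from the original time output and the rounded 27 minutes quoted here. Treating 27 minutes as exact is an assumption, so the result is approximate.

```shell
# Elapsed time of the full search, from the time output: 53:07:40.
full=$(( 53 * 3600 + 7 * 60 + 40 ))
# Run time with proc and sys excluded, rounded to 27 minutes as quoted.
trimmed=$(( 27 * 60 ))

saved=$(( full - trimmed ))
printf 'saved %dh %dm %ds\n' \
    $(( saved / 3600 )) $(( saved % 3600 / 60 )) $(( saved % 60 ))
# prints: saved 52h 40m 40s
```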
There are still errors though: in the /var directory, which is also a "virtual directory" created at run time; in the /run directory, which contains an Android cell phone; and in the /media directory, which contains an old broken laptop hard drive now connected via a USB external HDD enclosure.
Add /media to exclude list
The /media directory contains an old laptop drive connected via a USB 3.0 port. Smartctl reports errors on the drive daily, and it doesn't have the files we are looking for. We'll exclude it to save time and reduce error messages.
Excluding the faulty hard drive connected via the USB 3.0 enclosure only saved 3 minutes but reduced error messages.
Add /mnt (Windows NTFS partitions) to exclude list
The /mnt directory contains:
Windows NTFS partitions (C: and E:) on an SSD with 105 GiB of data
A Windows NTFS partition (D:) on an HDD with 42 GiB of data
There is nothing of interest in Windows so we'll exclude /mnt to save time.
Now grep only takes 2 minutes and 8 seconds. Excluding the Windows 10 partitions with 147 GiB of programs and data saves 21.5 minutes!
Add /usr/src Linux Headers to exclude list
The /usr/src directory contains Linux headers source code. In my case there are 20+ manually installed kernels, which take up considerable space. To specify the directory, though, the argument used is src.
Now grep only takes 1 minute and 15 seconds. Excluding /usr/src by specifying src on the --exclude-dir list saves 53 seconds.
Add /lib/modules Kernel modules to exclude list
The /lib/modules directory contains compiled kernel modules. To specify the directory, though, the argument used is modules.
By skipping 6 GB of kernel modules, our grep time is 36 seconds. Adding /lib/modules by specifying modules in the --exclude-dir parameter saves 39 seconds.
Miscellaneous directories
Summary list of other directories: