Searching for a large number of extensions with find

filesfind

I'm looking to create a baseline of file extensions and then search for the inverse of them (essentially scanning for new extensions and then reporting on them).

I have:

base_file=`find "/volume1/" -type f | grep -E ".*\.[a-zA-Z0-9]*$" | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' | sort | uniq -u`

to create my baseline – the initial search of the files on the volume.

For a small amount of files, using

find $dir -type f \( -not -name "foo*" -and -not -name "*bar" \) 

worked wonderfully. Alas, I have tons of files, though. If I pipe every unique extension into the find command, it does not work (understandably).

Ex. of output:

.acx .adb .aex .agt .ahs .alt .amsorm .ANI .ARTX .ASAX .ASDefs .asmdot .ASMDOT .ASPX .atb .atm .aus .auth .authd .awk .ben .Bin .BIO .bkp .bms .boo .bootstrap .bplist .bridgesupport .bto .btt .CBK .ccp .cd .cdm .cdrom .CFGOLD .cfm .cfp .CFS .cg .cidb .cilk .clk .cmptag .CMValidateMovieDataReferenceService .ColorSyncXPCAgent .common .con .CONFIG .COR .cpi .cpu .crc .crdownload .crmlog .cryptodev .csh .ctd .ctl .cue .cws .d .daeexportpreset .daeimportpreset .DATA .dbg .DBG .dbl .DCD .DCX .debug .defaults .defltools .defmtools .der .desktop .dfont .DGDLL .DGN .DictionaryServiceHelper .dig .django .dla .dlb .dlh .dlk .dLL .dlmp .DLO .DMP .DNP .dps .DriverHelper .DRWDOT .dsd .dtc .DTL .dwd .dwfx .dwG .e .eai .eapol .EDB .edc .edited .ENC .eng .ENV .epub .erl .esi .esm .EVM .EVP .ews .example .exv .fac .fatal .fbk .FBK .fbT .FCL .fe .file .fin .fl .FLL .font .FontDownloadHelper .for .fpk .fre .frT .FW .FXP .gadget .Gadget .gdb .generic .ger .gi .glo .gm .gpx .groovy .group .gsl .gss .gws .GZ .ham .hbs .hd .hidden .hkf .hpdata .hs .htb .HTT .hun .hx .hxd .hxx .HXX .IBM .ICNS .igb .IGS .iHB .imaging .IME .IMG .in .INP .install .Installsettings .int .IPConfiguration .IPMonitor .ITK .ITS .iuf .java .jnilib .job .JPEG .jqx .kd .keychainproxy .keys .kondo .krn .kscript .ksh .lfs .libraries .LID .lisp .liveReg .local .LOCAL .lok .lppi .lsl .lt .ltools .mak .mako .mapping .mappings .mas .masm .matlab .mbr .mch .MDE .mdmp .mdw .me .med .MediaLibraryService .mem .mholders .MIF .MIG .min .mk .mm .mno .mobileconfig .mom .mp .MPE .mpq .MPV .mpx .MPX .msdb .MSDefs .msilog .MSM .MSP .mtools .mup .nasm .netsa .new .nfm .nlog .nor .nsi .ntd .numbers .nut .nv .nvv .NWD .O .oai .oct .Ocx .oft .ogv .older .omo .ooc .openAndSavePanelService .ori .orignal .osf .override .pad .page .partial .pas .patch .pbb .pch .Pdf .PDFFileRefsValidator .pdn .PDR .pexe .pfw .phar .pif .pike .pix .PJT .PJX .PLS .plsql .po .pokki .pot .ppf .ppk .pptm .preferences .PRG .prm .PRN .pro .propdesc .prtdot .PRTDOT .prx .PSDefs .PST .psw .pta .ptb .ptg .python .r .rayhosts .rc .rcd .RCF .rd .RecentPictureService .regcccc .registerassistantservice .RLA .rnd .rpk .RPW .RSC .rst .rupldb .rus .salog .sap .SAP .sbt .sbx .sbxx .SCH .schemas .scm .SCR .sct .SDP .sds .sdu .Search .securityd .SEP .set .setup .Setuplog .SFV .sfx .sgi .sgn .sidb .sidd .sigs .sites .skin .slddrt .smc .SMC .smf .smilebox .SOL .spdc .speechsynthesisd .spn .sqfs .squashfs .srt .srx .ssi .st .ste .stg .styx .swb .swtag .TAR .TDC .tdf .tex .th .tib .time .tips .tmx .tpg .tpm .trace .transformed .trm .TSK .tst .Txt .txz .type .udf .ufm .ult .uninstall .upd .upstart .urf .user .User .UserDictionary .UserProfile .UserScriptService .usr .ux .v .vala .values .var .VAR .vbe .VBR .vcs .vcxproj .vdb .vdf .VERSION .VersionsUIHelper .vhdl .vms .vmsn .vmss .VOL .voucher .vps .vsb .vst .vvv .wax .wbt .Wdf .webp .WIZ .wnt .WPT .ws .wsc .wsdl .WSF .wsp .xap .xht .XLL .xlS .XLT .xmp .xpfwext .xtext .yaml .zipx .zz

How can I search for all of these, or inverse of, without running into issues? Or, more importantly, is there a better solution for this type of task?

Best Answer

You can use grep's -f option that allows you to search for a list of patterns stored in a file:

# find "$dir" -type f | grep -f ext_patterns.txt

Here file ext_patterns.txt must contain extensions as regex, like:

\.html$
\.java$
\.jpg$

You can create that file the same way you create your baseline. Here is a command using awk:

find -type f -name "*.*" \
| awk -F. '{ print "\\." $NF "$" }' \
| sort -u \
> ext_patterns.txt

Here find output filenames having an extension; awk prints the extension along with a leading (escaped) dot and a final $ (regex code meaning "end of line"); and sort -u makes each pattern unique.

Related Question