I'm looking to create a baseline of file extensions and then search for the inverse of them (essentially scanning for new extensions and then reporting on them).
I have:
base_file=`find "/volume1/" -type f | grep -E ".*\.[a-zA-Z0-9]*$" | sed -e 's/.*\(\.[a-zA-Z0-9]*\)$/\1/' | sort | uniq -u`
to create my baseline – the initial search of the files on the volume.
For a small number of files, using
find $dir -type f \( -not -name "foo*" -and -not -name "*bar" \)
worked wonderfully. Alas, I have tons of files. If I pipe every unique extension into the find command, it does not work (understandably).
Example output:
.acx .adb .aex .agt .ahs .alt .amsorm .ANI .ARTX .ASAX .ASDefs .asmdot .ASMDOT .ASPX .atb .atm .aus .auth .authd .awk .ben .Bin .BIO .bkp .bms .boo .bootstrap .bplist .bridgesupport .bto .btt .CBK .ccp .cd .cdm .cdrom .CFGOLD .cfm .cfp .CFS .cg .cidb .cilk .clk .cmptag .CMValidateMovieDataReferenceService .ColorSyncXPCAgent .common .con .CONFIG .COR .cpi .cpu .crc .crdownload .crmlog .cryptodev .csh .ctd .ctl .cue .cws .d .daeexportpreset .daeimportpreset .DATA .dbg .DBG .dbl .DCD .DCX .debug .defaults .defltools .defmtools .der .desktop .dfont .DGDLL .DGN .DictionaryServiceHelper .dig .django .dla .dlb .dlh .dlk .dLL .dlmp .DLO .DMP .DNP .dps .DriverHelper .DRWDOT .dsd .dtc .DTL .dwd .dwfx .dwG .e .eai .eapol .EDB .edc .edited .ENC .eng .ENV .epub .erl .esi .esm .EVM .EVP .ews .example .exv .fac .fatal .fbk .FBK .fbT .FCL .fe .file .fin .fl .FLL .font .FontDownloadHelper .for .fpk .fre .frT .FW .FXP .gadget .Gadget .gdb .generic .ger .gi .glo .gm .gpx .groovy .group .gsl .gss .gws .GZ .ham .hbs .hd .hidden .hkf .hpdata .hs .htb .HTT .hun .hx .hxd .hxx .HXX .IBM .ICNS .igb .IGS .iHB .imaging .IME .IMG .in .INP .install .Installsettings .int .IPConfiguration .IPMonitor .ITK .ITS .iuf .java .jnilib .job .JPEG .jqx .kd .keychainproxy .keys .kondo .krn .kscript .ksh .lfs .libraries .LID .lisp .liveReg .local .LOCAL .lok .lppi .lsl .lt .ltools .mak .mako .mapping .mappings .mas .masm .matlab .mbr .mch .MDE .mdmp .mdw .me .med .MediaLibraryService .mem .mholders .MIF .MIG .min .mk .mm .mno .mobileconfig .mom .mp .MPE .mpq .MPV .mpx .MPX .msdb .MSDefs .msilog .MSM .MSP .mtools .mup .nasm .netsa .new .nfm .nlog .nor .nsi .ntd .numbers .nut .nv .nvv .NWD .O .oai .oct .Ocx .oft .ogv .older .omo .ooc .openAndSavePanelService .ori .orignal .osf .override .pad .page .partial .pas .patch .pbb .pch .Pdf .PDFFileRefsValidator .pdn .PDR .pexe .pfw .phar .pif .pike .pix .PJT .PJX .PLS .plsql .po .pokki .pot .ppf .ppk .pptm .preferences .PRG .prm .PRN .pro .propdesc .prtdot .PRTDOT 
.prx .PSDefs .PST .psw .pta .ptb .ptg .python .r .rayhosts .rc .rcd .RCF .rd .RecentPictureService .regcccc .registerassistantservice .RLA .rnd .rpk .RPW .RSC .rst .rupldb .rus .salog .sap .SAP .sbt .sbx .sbxx .SCH .schemas .scm .SCR .sct .SDP .sds .sdu .Search .securityd .SEP .set .setup .Setuplog .SFV .sfx .sgi .sgn .sidb .sidd .sigs .sites .skin .slddrt .smc .SMC .smf .smilebox .SOL .spdc .speechsynthesisd .spn .sqfs .squashfs .srt .srx .ssi .st .ste .stg .styx .swb .swtag .TAR .TDC .tdf .tex .th .tib .time .tips .tmx .tpg .tpm .trace .transformed .trm .TSK .tst .Txt .txz .type .udf .ufm .ult .uninstall .upd .upstart .urf .user .User .UserDictionary .UserProfile .UserScriptService .usr .ux .v .vala .values .var .VAR .vbe .VBR .vcs .vcxproj .vdb .vdf .VERSION .VersionsUIHelper .vhdl .vms .vmsn .vmss .VOL .voucher .vps .vsb .vst .vvv .wax .wbt .Wdf .webp .WIZ .wnt .WPT .ws .wsc .wsdl .WSF .wsp .xap .xht .XLL .xlS .XLT .xmp .xpfwext .xtext .yaml .zipx .zz
How can I search for all of these, or for their inverse, without running into issues? Or, more importantly, is there a better solution for this type of task?
Best Answer
You can use grep's -f option, which lets you search for a list of patterns stored in a file. That file, called ext_patterns.txt here, must contain the extensions as regexes, one per line.
You can create that file the same way you create your baseline, with a pipeline of find, awk, and sort -u. In that pipeline:
find outputs the filenames that have an extension;
awk prints the extension along with a leading (escaped) dot and a final $ (the regex anchor meaning "end of line"); and
sort -u makes each pattern unique.
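A minimal sketch of this approach, run against a throwaway demo tree rather than /volume1/. The demo paths, the file names, and the exact awk program are assumptions made for illustration; only the find, awk, sort -u, and grep -f steps come from the description above. It also assumes extensions contain no regex metacharacters (an extension like c++ would need extra escaping) and that file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical demo tree standing in for /volume1/.
demo=$(mktemp -d)
mkdir -p "$demo/vol"
touch "$demo/vol/a.txt" "$demo/vol/b.TXT" "$demo/vol/c.tar.gz" "$demo/vol/noext"

# 1. Baseline: one anchored regex per known extension, e.g. \.txt$
#    -name '*.*' keeps only files whose name contains a dot;
#    awk takes the text after the last dot and wraps it as \.<ext>$ ;
#    sort -u removes duplicate patterns.
find "$demo/vol" -type f -name '*.*' |
  awk -F. '{ print "\\." $NF "$" }' |
  sort -u > "$demo/ext_patterns.txt"

# 2. Later on, new files show up on the volume.
touch "$demo/vol/d.txt" "$demo/vol/e.crypt"

# 3. Report files whose extension matches none of the baseline patterns.
#    -f reads the patterns from the file, -v inverts the match.
new_files=$(find "$demo/vol" -type f -name '*.*' |
  grep -v -f "$demo/ext_patterns.txt")
printf '%s\n' "$new_files"
```

Dropping -v from the last grep lists the files with known extensions instead. Matching is case-sensitive, so .txt and .TXT count as different extensions unless you add -i.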