Here's what I'm trying to do:
I have a huge mess of files (around ten thousand) in various formats. Each file belongs to a certain type (e.g. product sheet, business plan, offer, presentation). The files are in no particular order and might as well be treated as a single list. I'm interested in creating a catalogue by type.
The idea is that, for a given format and a given type, I know what keywords to look for in the file's contents. I would like a PowerShell script that runs a series of searches, each one finding all the files of a certain format that contain specific keywords and writing that list to a separate CSV. The crucial point is that the keyword will be in the content (body of a PDF, cell of an Excel sheet, etc.) and not in the filename. So far I've tried the following:
get-childitem -Recurse | where {!$_.PSIsContainer} |
select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file.csv -encoding default
That works nicely and gives me the complete list of files, including their size and extension. I'm looking for something similar, but filtering by content. Any ideas?
Edit: based on the solution below, here's the new code:
$searchstring = "foo"
$directory = Get-ChildItem -include ('*.pdf') -Path "C:\Users\Uzer\Searchfolder" -Recurse
foreach ($obj in $directory)
{Get-Content $obj.fullname | Where-Object {$_.Contains($searchstring)}| select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file2.csv -encoding default}
However, I get a bunch of these errors:
An object at the specified path C:[blabla]\filename.pdf does not exist, or has been filtered by the -Include or -Exclude parameter.
Best Answer
PowerShell using itextsharp.dll. This approach evaluates the text on each page of each PDF for keywords, then exports any matches to a CSV. You can build on it to rename files when matches are found, move them to categorized folders, and the like.
EDIT: The GitHub page for iTextSharp indicates it is end-of-life and links to iText 7 https://github.com/itext/itext7-dotnet (dual licensed as AGPL/commercial software; it seems to be free for non-commercial use).
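A minimal sketch of that approach, assuming iTextSharp 5.x, might look like the following. The function names (Get-KeywordHit, Search-PdfKeyword), the parameter names, and the paths in the usage line are hypothetical and chosen only for illustration; the iTextSharp calls used are PdfReader, NumberOfPages, and PdfTextExtractor.GetTextFromPage. It reuses the CSV columns and delimiter from the question and adds the page number and the matched keyword.

```powershell
# Hypothetical helper: returns every keyword that occurs in the given text.
function Get-KeywordHit {
    param(
        [string]   $Text,
        [string[]] $Keywords
    )
    foreach ($keyword in $Keywords) {
        # Literal, case-insensitive substring search (no regex or wildcards).
        if ($Text.IndexOf($keyword, [System.StringComparison]::OrdinalIgnoreCase) -ge 0) {
            $keyword
        }
    }
}

# Hypothetical wrapper: scans every PDF under $Folder page by page
# and writes one CSV row per (file, page, keyword) match.
function Search-PdfKeyword {
    param(
        [string]   $Folder,
        [string[]] $Keywords,
        [string]   $DllPath,   # assumption: wherever your copy of itextsharp.dll lives
        [string]   $CsvPath
    )
    Add-Type -Path $DllPath

    # @( ... ) keeps $results an (possibly empty) array so Export-Csv never receives $null.
    $results = @(foreach ($file in Get-ChildItem -Path $Folder -Filter *.pdf -Recurse -File) {
        $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList $file.FullName
        try {
            # iTextSharp page numbers start at 1.
            for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
                $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page)
                foreach ($hit in Get-KeywordHit -Text $text -Keywords $Keywords) {
                    [pscustomobject]@{
                        FullName      = $file.FullName
                        LastWriteTime = $file.LastWriteTime
                        Length        = $file.Length
                        Extension     = $file.Extension
                        Page          = $page
                        Keyword       = $hit
                    }
                }
            }
        }
        finally {
            # Release the file handle even if text extraction fails.
            $reader.Close()
        }
    })

    $results | Export-Csv -NoTypeInformation -Delimiter '|' -Path $CsvPath
}
```

Usage, with placeholder paths: Search-PdfKeyword -Folder 'C:\Users\Uzer\Searchfolder' -Keywords 'foo' -DllPath 'C:\path\to\itextsharp.dll' -CsvPath 'C:\Users\Uzer\Documents\file2.csv'. Calling it once per type, each time with its own keyword list and CSV path, gives one catalogue file per type. The sketch does not handle encrypted or damaged PDFs (the PdfReader constructor will throw on those), and scanned PDFs without a text layer will yield no matches.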