Search PDF contents with PowerShell and output a file list

Here's what I'm trying to do:

I have a huge mess of files (around ten thousand) of various formats. Each file can be defined as a certain type (ex: product sheet, business plan, offer, presentation, etc). The files are in no particular order and might as well be looked at as a single list. I'm interested in creating a catalogue by type.

The idea is that, for a certain format and a certain type, I know what keywords to look for in the file's contents. I would like a PowerShell script that runs a series of searches, each finding all the files of a certain format that contain specific keywords, and writes each list to a separate CSV. The crucial point here is that the keyword will be in the content (body of a PDF, cell of an Excel sheet, etc.) and not in the filename. So far I've tried the following:

get-childitem -Recurse | where {!$_.PSIsContainer} |
select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file.csv  -encoding default

That is nice and gives me the complete list of files including their size and extension. I'm looking for something similar but filtering by content. Any ideas?

Edit: based on the solution below, here's the new code:

$searchstring = "foo"
$directory = Get-ChildItem -include ('*.pdf') -Path "C:\Users\Uzer\Searchfolder" -Recurse

foreach ($obj in $directory)
{Get-Content $obj.fullname | Where-Object {$_.Contains($searchstring)}| select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file2.csv  -encoding default}

However I get a bunch of these errors:

 An object at the specified path C:[blabla]\filename.pdf does not exist, or has been filtered by the -Include or -Exclude parameter.

Best Answer

This uses PowerShell with itextsharp.dll. The script below evaluates the text on each page of each PDF for keywords, then exports any matches to a CSV. You can build on this to rename files when matches are found, move them to categorized folders, and the like.

Add-Type -Path "C:\path_to_dll\itextsharp.dll"
$pdfs = gci "C:\path_to_pdfs" *.pdf
$export = "C:\path_to_export\export.csv"
$results = @()
$keywords = @('Keyword1','Keyword2','Keyword3')

foreach($pdf in $pdfs)
{
    Write-Host "processing -" $pdf.FullName

    # prepare the pdf
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $pdf.FullName

    # for each page
    for($page = 1; $page -le $reader.NumberOfPages; $page++)
    {
        # set the page text
        $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)

        # if the page text contains any of the keywords we're evaluating
        foreach($keyword in $keywords)
        {
            if($pageText -match $keyword)
            {
                $response = @{
                    keyword = $keyword
                    file = $pdf.FullName
                    page = $page
                }
                $results += New-Object PSObject -Property $response
            }
        }
    }
    $reader.Close()
}

Write-Host ""
Write-Host "done"

$results | epcsv $export -NoTypeInformation
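
As one way to act on the matches, here is a minimal sketch of the "move them to categorized folders" idea. It assumes match rows with `keyword` and `file` properties, shaped like `$results` in the script. The function name `Move-MatchedFile`, its parameters, and the one-folder-per-keyword layout are hypothetical and not part of the original answer. A file that matches several keywords is moved once, to the folder of its first match.

```powershell
function Move-MatchedFile {
    # Hypothetical helper: moves each matched file into a folder named after
    # the first keyword it matched. Supports -WhatIf for a dry run.
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)] [object[]] $MatchRows,       # rows with .keyword and .file
        [Parameter(Mandatory)] [string]   $DestinationRoot  # parent of the keyword folders
    )

    # A file can appear in several rows; keep only its first row so it is moved once.
    $firstRowPerFile = $MatchRows |
        Group-Object -Property file |
        ForEach-Object { $_.Group[0] }

    foreach ($row in $firstRowPerFile) {
        $targetDir = Join-Path $DestinationRoot $row.keyword

        if ($PSCmdlet.ShouldProcess($row.file, "Move to $targetDir")) {
            New-Item -ItemType Directory -Path $targetDir -Force | Out-Null
            # -LiteralPath avoids wildcard interpretation of characters such as [ ] in file names.
            Move-Item -LiteralPath $row.file -Destination $targetDir
        }
    }
}
```

Running it first as `Move-MatchedFile -MatchRows $results -DestinationRoot 'C:\path_to_sorted' -WhatIf` (the destination path is a placeholder) lists the planned moves without touching any file.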

EDIT: The GitHub page for iTextSharp indicates it is end-of-life and links to iText 7: https://github.com/itext/itext7-dotnet (dual-licensed as AGPL/commercial software; it seems to be free for non-commercial use).

The console output:

processing - C:\path_to_pdfs\1.pdf
processing - C:\path_to_pdfs\2.pdf
processing - C:\path_to_pdfs\3.pdf
processing - C:\path_to_pdfs\4.pdf
processing - C:\path_to_pdfs\5.pdf

done
PS C:\>

The csv output:

keyword    page    file
Keyword2   14      C:\path_to_pdfs\3.pdf
Keyword3   22      C:\path_to_pdfs\3.pdf
Keyword1   6       C:\path_to_pdfs\5.pdf
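
Since the question asks for a separate list per type, rows like the ones above can be split into one CSV per keyword. The following is only a sketch under assumptions: the function name `Export-KeywordFileList`, its parameters, and the `File`/`Pages` column names are hypothetical, the `|` delimiter is borrowed from the question's own export, and each keyword is assumed to be usable as a file name.

```powershell
function Export-KeywordFileList {
    # Hypothetical helper: writes <keyword>.csv files, each listing the files
    # that matched that keyword, one row per file with its matching pages.
    param(
        [Parameter(Mandatory)] [object[]] $MatchRows,       # rows with .keyword, .file, .page
        [Parameter(Mandatory)] [string]   $OutputDirectory  # where the CSVs are written
    )

    New-Item -ItemType Directory -Path $OutputDirectory -Force | Out-Null

    foreach ($group in ($MatchRows | Group-Object -Property keyword)) {
        # Collapse repeated hits in the same file into a single row.
        $rows = $group.Group |
            Group-Object -Property file |
            ForEach-Object {
                [pscustomobject]@{
                    File  = $_.Name
                    Pages = ($_.Group.page | Sort-Object) -join ';'
                }
            }

        # Assumes the keyword contains no characters that are invalid in file names.
        $csvPath = Join-Path $OutputDirectory ("{0}.csv" -f $group.Name)
        $rows | Export-Csv -Path $csvPath -NoTypeInformation -Delimiter '|'
    }
}
```

Called as `Export-KeywordFileList -MatchRows $results -OutputDirectory 'C:\path_to_export'`, this would produce one catalogue file per keyword instead of a single combined export.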
