Search pdf contents with powershell and output a file list

pdfpowershell

Here's what I'm trying to do:

I have a huge mess of files (around ten thousand) of various formats. Each file can be defined as a certain type (ex: product sheet, business plan, offer, presentation, etc). The files are in no particular order and might as well be looked at as a single list. I'm interested in creating a catalogue by type.

The idea is that, for a certain format and a certain type, I know what keywords to look for in the file's contents. I would like to have a powershell script that basically executes a series of scripts looking for all the files of a certain format containing specific keywords and outputting each list to a separate csv. The crucial point here is that the keyword will be in the content (body of a pdf, cell of an excel etc.) and not in the filename. As of now I've tried the following:

get-childitem -Recurse | where {!$_.PSIsContainer} |
select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file.csv  -encoding default

That is nice and gives me the complete list of files including their size and extension. I'm looking for something similar but filtering by content. Any ideas?

Edit: based on the solution below her's the new code:

$searchstring = "foo"
$directory = Get-ChildItem -include ('*.pdf') -Path "C:\Users\Uzer\Searchfolder" -Recurse

foreach ($obj in $directory)
{Get-Content $obj.fullname | Where-Object {$_.Contains($searchstring)}| select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -delimiter '|' -path C:\Users\Uzer\Documents\file2.csv  -encoding default}

However I get a bunch of these errors:

 An object at the specified path C:[blabla]\filename.pdf does not exist, or has been filtered by the -Include or -Exclude parameter.

Best Answer

Powershell using itextsharp.dll. The below evaluates the text on each page of each pdf for keywords, then exports any matches to a csv. You can run with this to rename files if matches are found, move them to categorized folders, and the likes.

EDIT: Github page for itextsharp indicates it is end-of-life and links to Itext7 https://github.com/itext/itext7-dotnet (dual licensed as AGPL/Commercial software, seems free for non-commercial use.)

Add-Type -Path "C:\path_to_dll\itextsharp.dll"
$pdfs = gci "C:\path_to_pdfs" *.pdf
$export = "C:\path_to_export\export.csv"
$results = @()
$keywords = @('Keyword1','Keyword2','Keyword3')

foreach($pdf in $pdfs) {

    Write-Host "processing -" $pdf.FullName

    # prepare the pdf
    $reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $pdf.FullName

    # for each page
    for($page = 1; $page -le $reader.NumberOfPages; $page++) {
    
        # set the page text
        $pageText = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader,$page).Split([char]0x000A)

        # if the page text contains any of the keywords we're evaluating
        foreach($keyword in $keywords) {
            if($pageText -match $keyword) {
                $response = @{
                    keyword = $keyword
                    file = $pdf.FullName
                    page = $page
                }
                $results += New-Object PSObject -Property $response
            }
        }
    }
    $reader.Close()
}

Write-Host ""
Write-Host "done"

$results | epcsv $export -NoTypeInformation

The console output:

processing - C:\path_to_pdfs\1.pdf
processing - C:\path_to_pdfs\2.pdf
processing - C:\path_to_pdfs\3.pdf
processing - C:\path_to_pdfs\4.pdf
processing - C:\path_to_pdfs\5.pdf

done
PS C:\>

The csv output:

keyword    page    file
Keyword2   14      C:\path_to_pdfs\3.pdf
Keyword3   22      C:\path_to_pdfs\3.pdf
Keyword1   6       C:\path_to_pdfs\5.pdf
Related Question