Ubuntu – Convert PDF or FDF to CSV

libreofficepdf

It seems crazy, but while Libre Base can export a form as PDF, there is no way to import the completed form data from the PDF into Libre Base.

One idea is to extract PDF form data to FDF (or unfriendly CSV) and then convert that to a friendly CSV at which point one can copy and paste CSV data into Libre Base.

  • How can I convert a PDF, FDF, or unfriendly CSV file to a friendly CSV with:
    • Row 1: "FieldName1, FieldName2, …"
    • Row 2: "Value1, Value2, …"

Useful files on Ge.tt until 2015,02,25:

  1. Super basic blank pdf with "Submit format" as FDF
    • Libre Base > Forms > Edit > Open Database Object –> File > Export as PDF
  2. Filled PDF made from (1)
    • Type data manually
  3. Extracted CSV data from (2)
    • pdftk filled_form-submit_format_fdf.pdf dump_data_fields output filled_form.csv
  4. Extracted FDF data from (2)
    • pdftk filled_form-submit_format_fdf.pdf generate_fdf output filled_form.fdf

*Note: You can also use Libre Base to export a blank form as XML or HTML, but I don't know how to input data into those files nor import that data back into Base.

Best Answer

How to convert FDF to CSV?

Option A:

awk -F "[()]" '{ if ($1=="/V ") value[$2];} \
    END {printf( "CompanyName\tEmailAddress\t\tCountryOrRegion\n" ); \
           for (x in value)printf("%s\t", x);print "" ; \
    }' filled_form.fdf > filled_form.CSV

Option B:

grep -oP '(?<=\/T \(txt).*(?=\))' filled_form.fdf |awk '{ORS=(NR%3?",":RS)}1'; \
grep -oP '(?<=\/V \().*(?=\))' filled_form.fdf |awk '{ORS=(NR%3?",":RS)}1'; 

Short version of above command would be:

paste -sd, <(grep -oP '(?<=\/T \(txt).*(?=\))' filled_form.fdf) <(grep -oP '(?<=\/V \().*(?=\))' filled_form.fdf)

Option C:

awk 'NR%2==0{type[$0]} NR%2{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(grep -oP '(?<=\/T \(txt|\/V \().*(?=\))' filled_form.fdf)

How to convert unfriendly CSV to friendly CSV ?

Option A:

awk -F: '{ if ($1=="FieldValue") value[$2];} \
    END {printf( "CountryOrRegion\tCompanyName\tEmailAddress\n" ); \
           for (x in value)printf("%s\t", x) ;print ""; \
    }' filled_form.csv > friendly_filled_form.CSV

Option B:

grep -oP '(?<=FieldName: txt).*' filled_form.csv |awk '{ORS=(NR%3?",":RS)}1'; \
grep -oP '(?<=FieldValue: ).*' filled_form.csv |awk '{ORS=(NR%3?",":RS)}1'

*Note that this command is one linear. so for running it you have to type/copy both line.

And short version for this one would be:

paste -sd, <(grep -oP '(?<=FieldName: txt).*' filled_form.csv) <(grep -oP '(?<=FieldValue: ).*' filled_form.csv)

Option C:

awk 'NR%2{type[$0]} NR%2==0{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(grep -oP '(?<=FieldName: txt|FieldValue: ).*' filled_form.csv)

or even this one:

awk 'NR%2{type[$0]} NR%2==0{value[$0]} END{for (x in type)printf("%s\t", x);print "" ;for (y in value)printf("%s\t", y);print "" ;}' <(awk -F'FieldValue: |FieldName: txt' 'NF>1{print $2}' filled_form.csv)

How to convert PDF to CSV?

I will complete the solution with pdfgrep tomorrow but if you want to try it for yourself, here is the command:

pdfgrep 'CompanyName|CountryOrRegion|EmailAddress' filled_form-submit_format_fdf.pdf

It needs to work on output format. If you want to get only whole words, use the -C 0 option. good luck and I hope helps ;)