How to remove duplicate fonts in a PDF document with 150,000 embedded fonts

I was analyzing a PDF file to see why it is so large (400MB) and what options I have for reducing its size.

The PDF file is produced by merging about 15,000 smaller PDF files, which are generated from a template and should in theory share the same set of fonts. These small PDF files come from a third-party system that is a black box to us.

I was suspicious that each font is added multiple times in the merged PDF file, so I thought as a first step, I might be able to reduce its file size by removing duplicate fonts.

Using pdffonts I inspected the fonts used in one of the individual, small PDF files:

$ pdffonts small-00001.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  yes      6  0
[none]                               Type 3            Custom           yes no  yes     25  0
[none]                               Type 3            Custom           yes no  yes     56  0
[none]                               Type 3            Custom           yes no  yes     95  0
[none]                               Type 3            Custom           yes no  yes    125  0
[none]                               Type 3            Custom           yes no  yes    141  0
[none]                               Type 3            Custom           yes no  yes    188  0
[none]                               Type 3            Custom           yes no  yes    212  0
[none]                               Type 3            Custom           yes no  yes    237  0
[none]                               Type 3            Custom           yes no  yes    245  0

It is a bit strange that none of the fonts have names, but apparently that is permitted by the spec (see Table 112), as the PDF version of these files is 1.4. Anyway, this small PDF file has 10 fonts.

I had a look at the merged PDF file and:

$ pdffonts merged.pdf | wc -l
150000 # rounded number

Voila! The merged PDF file had more than 150K fonts!
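One caveat on the count: pdffonts prints two header lines (the column names and a dashed separator), so `wc -l` overstates the number of fonts by two. A minimal sketch of a more careful count, assuming the output layout shown above and that `pdffonts` is on the PATH; the function names `count_fonts` and `run_pdffonts` are illustrative only:

```python
import subprocess


def count_fonts(pdffonts_output: str) -> int:
    """Count font rows in pdffonts' text output, skipping the header."""
    lines = pdffonts_output.splitlines()
    for index, line in enumerate(lines):
        # The dashed separator marks the end of the two-line header.
        if line.startswith("---"):
            return sum(1 for row in lines[index + 1:] if row.strip())
    return 0  # no separator found, so no font table was printed


def run_pdffonts(pdf_path: str) -> str:
    """Run pdffonts on a file and return what it prints to stdout."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `count_fonts(run_pdffonts("merged.pdf"))` would then give the number of font rows without the header lines.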

The iText Java PDF library has a class named PdfSmartCopy for merging PDF files that has some smarts in it to avoid duplicate resources such as fonts in the merged output.

I wrote a program to merge the individual files using PdfSmartCopy, and it reduced the size of the merged PDF file from 400MB to less than 100MB. I was expecting pdffonts to show only 10 fonts in this "smartly" merged PDF, but to my surprise:

$ pdffonts
20000 # rounded number

So the number of embedded fonts was reduced from about 150K to 20K, but I was wondering whether there is a way to reduce it even further, or whether these 20K fonts are really distinct as far as PDF is concerned.

Opening the small PDF file in a text editor, I see these font definitions:

6 0 obj
<</FontBBox[-40 -200 1000 920]/LastChar 249/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 7 0 R/CharProcs 8 0 R/ToUnicode 24 0 R/Widths[280 1000 1000 1000 1000 1000 1000 1000 1000 1000 560 280 600 320 600 1000 680 1000 1000 1000 1000 1000 1000 1000 1000 1000 320 560 400 320 280 1000 320 280 1000 1000 1000 1000 1000 1000 1000 1000 1000 280 880 560 600 560 1000 1000 1000 1000 1000 1000 1000 1000 1000 320 280 560 1000 200 600 360 1000 560 560 520 560 560 280 560 560 240 1000 1000 1000 1000 1000 1000 1000 240 520 240 880 560 560 560 560 320 1000 1000 1000 1000 1000 1000 1000 1000 520 280 560 440 720 440 480 480 1000 1000 1000 280 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 280 1000 1000 1000 680 680 720 720 680 600 760 720 240 1000 1000 1000 1000 1000 1000 1000 520 680 560 840 720 760 680 760 720 1000 1000 1000 1000 1000 1000 560 1000 680 560 720 680 1040 680 640 600 1000 1000 1000 1000 1000 1000 560 560 560 560 560 560 560 560 560 560]/FirstChar 64>>
endobj

25 0 obj
<</FontBBox[-57 -228.86 1029 914]/LastChar 249/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 26 0 R/CharProcs 27 0 R/ToUnicode 55 0 R/Widths[286 1029 1029 1029 1029 1029 1029 1029 1029 1029 543 286 571 343 571 1029 657 1029 1029 1029 1029 1029 1029 1029 1029 1029 314 543 400 343 286 1029 343 286 1029 1029 1029 1029 1029 1029 1029 1029 1029 286 886 543 571 543 1029 1029 1029 1029 1029 1029 1029 1029 1029 343 286 543 1029 200 571 343 1029 514 543 514 543 514 286 543 543 200 1029 1029 1029 1029 1029 1029 1029 257 514 200 829 543 514 543 543 343 1029 1029 1029 1029 1029 1029 1029 1029 514 286 543 486 686 429 486 486 1029 1029 1029 286 1029 1029 1029 1029 1029 1029 1029 1029 1029 1029 1029 1029 1029 1029 1029 286 1029 1029 1029 657 657 714 714 657 600 771 714 257 1029 1029 1029 1029 1029 1029 1029 514 657 543 829 714 771 657 771 714 1029 1029 1029 1029 1029 1029 543 1029 657 600 714 657 1029 657 657 600 1029 1029 1029 1029 1029 1029 543 543 543 543 543 543 543 543 543 543]/FirstChar 64>>
endobj

56 0 obj
<</FontBBox[-30 -214.77 1002.5 939]/LastChar 227/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 57 0 R/CharProcs 58 0 R/ToUnicode 94 0 R/Widths[273 1030 1030 1030 1030 1030 1030 1030 1030 1030 545 273 576 333 576 1030 636 1030 1030 1030 1030 1030 1030 1030 1030 1030 333 545 394 333 273 1030 333 273 1030 1030 1030 1030 1030 1030 1030 1030 1030 273 879 545 576 545 1030 1030 1030 1030 1030 1030 1030 1030 1030 333 273 545 1030 182 576 364 1030 515 545 515 545 515 303 545 545 212 1030 1030 1030 1030 1030 1030 1030 212 485 212 818 545 515 545 545 333 1030 1030 1030 1030 1030 1030 1030 1030 515 273 545 455 667 455 455 485 1030 1030 1030 273 1030 1030 1030 1030 1030 1030 1030 1030 1030 1030 1030 1030 1030 1030 1030 273 1030 1030 1030 667 667 727 727 667 606 788 727 273 1030 1030 1030 1030 1030 1030 1030 515 667 545 818 727 788 667 788 727 1030 1030 1030 1030 1030 1030 515 1030 667 636]/FirstChar 64>>
endobj

95 0 obj
<</FontBBox[-52 -243.17 1036 914]/LastChar 232/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 96 0 R/CharProcs 97 0 R/ToUnicode 124 0 R/Widths[276 983 983 983 983 983 983 983 983 983 552 276 586 328 586 983 707 983 983 983 983 983 983 983 983 983 310 552 397 328 345 983 328 276 983 983 983 983 983 983 983 983 983 276 879 552 586 603 983 983 983 983 983 983 983 983 983 328 345 552 983 241 586 483 983 552 603 552 603 552 328 603 603 276 983 983 983 983 983 983 983 276 534 276 897 603 603 603 603 397 983 983 983 983 983 983 983 983 552 328 603 517 776 534 517 500 983 983 983 328 983 983 983 983 983 983 983 983 983 983 983 983 983 983 983 328 983 983 983 707 724 724 724 672 603 776 724 276 983 983 983 983 983 983 983 552 707 603 845 724 776 672 776 724 983 983 983 983 983 983 552 983 672 603 724 672 948 655 638]/FirstChar 64>>
endobj

125 0 obj
<</FontBBox[-48 -236.5 1021.8 929]/LastChar 230/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 126 0 R/CharProcs 127 0 R/ToUnicode 140 0 R/Widths[286 976 976 976 976 976 976 976 976 976 548 286 595 333 595 976 714 976 976 976 976 976 976 976 976 976 333 548 381 333 333 976 333 286 976 976 976 976 976 976 976 976 976 286 833 548 595 619 976 976 976 976 976 976 976 976 976 333 333 548 976 238 595 476 976 548 619 548 619 548 333 619 619 286 976 976 976 976 976 976 976 286 524 286 881 619 619 619 619 381 976 976 976 976 976 976 976 976 548 333 619 524 786 524 548 500 976 976 976 333 976 976 976 976 976 976 976 976 976 976 976 976 976 976 976 333 976 976 976 690 714 714 714 667 619 786 714 286 976 976 976 976 976 976 976 548 690 619 833 714 786 667 786 714 976 976 976 976 976 976 548 976 667 643 714 667 929]/FirstChar 64>>
endobj

141 0 obj
<</FontBBox[-53 -238.45 1028.2 895]/LastChar 249/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 142 0 R/CharProcs 143 0 R/ToUnicode 187 0 R/Widths[289 1026 1026 1026 1026 1026 1026 1026 1026 1026 553 289 579 342 579 1026 658 1026 1026 1026 1026 1026 1026 1026 1026 1026 289 553 395 342 289 1026 342 289 1026 1026 1026 1026 1026 1026 1026 1026 1026 289 895 553 579 553 1026 1026 1026 1026 1026 1026 1026 1026 1026 342 289 553 1026 184 579 342 1026 553 553 500 553 526 289 553 553 237 1026 1026 1026 1026 1026 1026 1026 237 500 237 868 553 553 553 553 342 1026 1026 1026 1026 1026 1026 1026 1026 500 289 553 500 684 447 500 474 1026 1026 1026 289 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 1026 289 1026 1026 1026 658 658 711 711 658 605 789 711 289 1026 1026 1026 1026 1026 1026 1026 500 658 553 816 711 789 658 789 711 1026 1026 1026 1026 1026 1026 553 1026 658 605 711 658 1000 658 658 605 1026 1026 1026 1026 1026 1026 553 553 553 553 553 553 553 553 553 553]/FirstChar 64>>
endobj

188 0 obj
<</FontBBox[-40 -240 1040 940]/LastChar 227/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 189 0 R/CharProcs 190 0 R/ToUnicode 211 0 R/Widths[280 980 980 980 980 980 980 980 980 980 560 280 580 340 580 980 720 980 980 980 980 980 980 980 980 980 300 560 380 340 340 980 340 280 980 980 980 980 980 980 980 980 980 280 840 560 580 620 980 980 980 980 980 980 980 980 980 340 340 560 980 240 580 480 980 560 620 560 620 560 340 620 600 280 980 980 980 980 980 980 980 280 540 280 880 600 620 620 620 380 980 980 980 980 980 980 980 980 560 340 600 520 780 540 540 500 980 980 980 340 980 980 980 980 980 980 980 980 980 980 980 980 980 980 980 340 980 980 980 700 720 720 720 660 620 780 720 260 980 980 980 980 980 980 980 560 700 620 820 720 780 660 780 720 980 980 980 980 980 980 560 980 660 640]/FirstChar 64>>
endobj

212 0 obj
<</FontBBox[-48 -212.73 998.26 905]/LastChar 249/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 213 0 R/CharProcs 214 0 R/ToUnicode 236 0 R/Widths[286 1024 1024 1024 1024 1024 1024 1024 1024 1024 548 286 595 333 595 1024 667 1024 1024 1024 1024 1024 1024 1024 1024 1024 333 548 381 333 286 1024 333 286 1024 1024 1024 1024 1024 1024 1024 1024 1024 286 881 548 595 548 1024 1024 1024 1024 1024 1024 1024 1024 1024 333 286 548 1024 190 595 357 1024 548 548 500 548 548 310 548 548 238 1024 1024 1024 1024 1024 1024 1024 238 500 238 857 548 548 548 548 333 1024 1024 1024 1024 1024 1024 1024 1024 476 286 548 548 667 524 500 500 1024 1024 1024 286 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024 286 1024 1024 1024 667 667 714 714 667 619 786 714 286 1024 1024 1024 1024 1024 1024 1024 500 667 548 833 714 786 667 786 714 1024 1024 1024 1024 1024 1024 548 1024 667 619 714 667 1000 643 667 619 1024 1024 1024 1024 1024 1024 548 548 548 548 548 548 548 548 548 548]/FirstChar 64>>
endobj

237 0 obj
<</FontBBox[-27 -253.67 733 800]/LastChar 243/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 238 0 R/CharProcs 239 0 R/ToUnicode 244 0 R/Widths[173 173 173 173]/FirstChar 240>>
endobj

245 0 obj
<</FontBBox[-53 -238.45 1028.6 947]/LastChar 249/FontMatrix[.001 0 0 .001 0 0]/Type/Font/Subtype/Type3/Encoding 246 0 R/CharProcs 247 0 R/ToUnicode 293 0 R/Widths[289 974 974 974 974 974 974 974 974 974 553 289 579 342 579 974 711 974 974 974 974 974 974 974 974 974 289 553 395 342 342 974 342 289 974 974 974 974 974 974 974 974 974 289 868 553 579 605 974 974 974 974 974 974 974 974 974 342 342 553 974 237 579 474 974 553 605 553 605 553 342 605 605 289 974 974 974 974 974 974 974 289 526 289 921 605 605 605 605 395 974 974 974 974 974 974 974 974 553 342 605 526 763 526 526 474 974 974 974 342 974 974 974 974 974 974 974 974 974 974 974 974 974 974 974 342 974 974 974 711 711 711 711 658 605 789 711 289 974 974 974 974 974 974 974 553 684 605 868 711 789 658 789 711 974 974 974 974 974 974 553 974 658 605 711 658 921 632 632 605 974 974 974 974 974 974 553 553 553 553 553 553 553 553 553 553]/FirstChar 64>>
endobj

1 0 obj
<</Contents 2 0 R/Type/Page/Resources<</XObject<</Im3 3 0 R/Im1 4 0 R/Im2 5 0 R>>/Font<</F1 6 0 R/F2 25 0 R/F10 56 0 R/F3 95 0 R/F4 125 0 R/F5 141 0 R/F6 188 0 R/F9 212 0 R/F8 237 0 R/F7 245 0 R>>>>/MediaBox[0 0 595.2 841.92]/Parent 294 0 R>>
endobj

Are these really 10-11 different fonts? Or are these 10 components of the same font?
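One way to approach this question mechanically is to pull a few comparable fields out of each font dictionary and compare them. The following is only a sketch: it assumes the dictionaries are readable as plain text (as in the dump above, not packed into compressed object streams), it uses regular expressions rather than a real PDF parser, and the name `parse_font_dict` is made up for illustration.

```python
import re


def parse_font_dict(obj_text: str) -> dict:
    """Extract a few comparable fields from one font dictionary's text."""

    def numbers(key):
        # Matches e.g. /Widths[173 173 173 173] or /FontBBox[-27 -253.67 733 800]
        match = re.search(r"/%s\s*\[([^\]]*)\]" % key, obj_text)
        return tuple(float(tok) for tok in match.group(1).split()) if match else ()

    def integer(key):
        # Matches e.g. /FirstChar 240
        match = re.search(r"/%s\s+(\d+)" % key, obj_text)
        return int(match.group(1)) if match else None

    def reference(key):
        # Matches an indirect reference such as /CharProcs 239 0 R
        match = re.search(r"/%s\s+(\d+)\s+(\d+)\s+R" % key, obj_text)
        return (int(match.group(1)), int(match.group(2))) if match else None

    first, last, widths = integer("FirstChar"), integer("LastChar"), numbers("Widths")
    return {
        "bbox": numbers("FontBBox"),
        "first_char": first,
        "last_char": last,
        "widths": widths,
        "char_procs": reference("CharProcs"),
        # Widths should hold one entry per code from FirstChar to LastChar.
        "widths_consistent": first is not None
        and last is not None
        and len(widths) == last - first + 1,
    }
```

Two dictionaries whose `widths` and `bbox` differ, as they do between objects 6 and 25 above, are at least not byte-identical copies of each other; whether their glyphs match still depends on the streams behind `char_procs`.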

Also according to the spec:

CharProcs: (Required) A dictionary in which each key shall be a glyph
name and the value associated with that key shall be a content stream
that constructs and paints the glyph for that character. The stream
shall include as its first operator either d0 or d1, followed by
operators describing one or more graphics objects, which may include
path, text, or image objects. See below for more details about Type 3
glyph descriptions.

But the CharProcs in this PDF file look to be too small to "be a content stream that constructs and paints the glyph for that character". For example:

237 0 obj
<<
/FontBBox[-27 -253.67 733 800]
/LastChar 243
/FontMatrix[.001 0 0 .001 0 0]
/Type/Font
/Subtype/Type3
/Encoding 238 0 R
/CharProcs 239 0 R
/ToUnicode 244 0 R
/Widths[173 173 173 173]
/FirstChar 240>>
endobj

  1. Does each of these /FontBBox snippets represent a font? Why are their /CharProcs so small?

  2. How can I confirm that these 20,000 fonts in the final PDF file are really distinct, and if they are not, how can I remove the remaining duplicates?

Best Answer

Type 3 fonts are incredibly rare in PDF files. They are PDF-native fonts, in that the glyphs are described with PDF page graphics operators. So you would never encounter a Type 3 font outside of a PDF, since only a PDF renderer would know what to do with it.

  1. The glyphs are defined in the object referenced by the CharProcs key, i.e. object 239 in your last example. The font dictionary only holds a reference to it, which is why it looks so small. The /FontBBox is normally just used for text selection. You could probably just union all the FontBBoxes.

  2. You could check the graphics operators, or even just hash the streams, to find matches. Then perhaps you could synthesize new fonts from the collected glyphs. However, you also need to check the encodings. If each font is encoded differently, with different character codes mapping to different glyphs, then you need to go back and also rewrite the page content streams to use the new character codes. Finally, you probably want to keep the ToUnicode mappings correct (if you want to preserve text selection/extraction), which means also tracking the character code to Unicode mappings and generating new ToUnicode CMaps.
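The "hash the streams" idea can be sketched as follows. This is an illustration under stated assumptions, not a working deduplicator: it assumes each font has already been extracted (with some PDF library) into a plain dictionary with the hypothetical keys `encoding` (character code to glyph name), `char_procs` (glyph name to decoded stream bytes), `widths` (character code to width) and `to_unicode` (character code to text). It also ignores /FontMatrix, /FontBBox and any /Resources the glyphs depend on.

```python
import hashlib
from collections import defaultdict


def _feed(digest, data: bytes) -> None:
    """Add one length-prefixed field so adjacent fields cannot run together."""
    digest.update(len(data).to_bytes(8, "big"))
    digest.update(data)


def font_fingerprint(font: dict) -> str:
    """Hash what each character code draws, how wide it is, and its Unicode text.

    Glyph names are deliberately left out: two fonts that draw the same
    thing for the same code are interchangeable whatever the glyphs are called.
    """
    digest = hashlib.sha256()
    for code in sorted(font["encoding"]):
        glyph_name = font["encoding"][code]
        _feed(digest, str(code).encode("ascii"))
        _feed(digest, font["char_procs"][glyph_name])
        _feed(digest, str(float(font["widths"].get(code, 0))).encode("ascii"))
        _feed(digest, font["to_unicode"].get(code, "").encode("utf-8"))
    return digest.hexdigest()


def group_duplicates(fonts: dict) -> list:
    """Return lists of font IDs whose fingerprints are identical."""
    groups = defaultdict(list)
    for font_id, font in fonts.items():
        groups[font_fingerprint(font)].append(font_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Fonts that land in the same group use the same codes for the same glyphs, so in principle they could share one font object without touching the page content streams. Fonts that only share some glyphs, or use different codes for them, are the harder case described above, where the content streams and ToUnicode CMaps have to be rewritten as well.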

In short, repairing as a post-processing step is non-trivial.

It is typically much better and easier to deal with the root of the issue, at the point where the PDF files are created and merged, so that the problem never arises.
