Ubuntu – Text inside files has squares with numbers in it

ascii, gedit, pdf, text, unicode

Some text files I come across have little squares with numbers in them (in place of certain characters). I am unable to copy and paste these in Ubuntu, but I can search and replace each character individually in gedit (substituting what I think is its best match). Obviously, this is only feasible if there are only a few types of square.

An example of several of the squares

I'm led to believe that these squares are displayed because I am missing certain fonts… My aim is to convert this into an ePub or PDF file.

My question is:

  • What type of encoding is this? And why does this happen?
  • If it is missing fonts, can I install them and will this solve the problem (allow me to convert symbols to PDF e.g. using Calibre)?
  • Is there an application to convert my text file to a text file without these squares, instead replacing them with a similar character? For example, one of the symbols (shown as an image in the original post) is pretty much a y, so I would like this function to replace each instance of that symbol with a y.

An example txt file is here and it originally looked like this (note that the inaccuracies were introduced by OCR).

Note: I couldn't get either uni2ascii or iconv to work (though I may not have been using the correct [options]), so please check with the given file before posting a solution!

Best Answer

The boxes mean "glyph not found"; the characters in the box are the hexadecimal representation of the character's Unicode codepoint.
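To find out which codepoints are behind the boxes in a given file, something like the following Python sketch may help. It is only an illustration: `report_unusual` is a hypothetical helper name, and the sample string is made up rather than taken from the poster's file.

```python
import unicodedata
from collections import Counter


def report_unusual(text):
    """Count characters that are non-ASCII or ASCII control characters.

    Newline, carriage return and tab are treated as ordinary text.
    """
    counts = Counter(
        ch for ch in text
        if ord(ch) > 126 or (ord(ch) < 32 and ch not in "\n\r\t")
    )
    # (codepoint label, Unicode name or placeholder, count), most frequent first.
    # Control characters have no formal name, so a placeholder is used for them.
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Made-up sample text containing two control characters and one accented letter.
    sample = "happ\u001f da\u001d caf\u00e9"
    for label, name, n in report_unusual(sample):
        print(label, name, n)
```

For a real file, the text could be read with `open(path, encoding="utf-8", errors="replace")` first, assuming the file is UTF-8; if the report shows many U+FFFD entries, that assumption is probably wrong and the encoding itself is the problem.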

There are two possibilities: the character encoding is garbled, or the font you are using doesn't have a glyph for that character. This is a great overview of character encoding if you really want to understand it: http://trochee.net/2011/05/character-encoding-tutorial/

Curiously, U+001F (unit separator) and U+001D (group separator) are control characters, really just glorified line breaks. It seems odd that OCR would return those.
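If the goal is simply to swap each boxed character for a readable stand-in, a translation table is one way to do it. The sketch below is an assumption-laden illustration: the `REPLACEMENTS` mapping and the `clean` function are hypothetical names, and replacing the two separators with a space is a guess, since the right substitute depends on what the OCR actually misread.

```python
# Hypothetical mapping from codepoint to replacement text.
# The choices here are guesses; adjust them after inspecting the file.
REPLACEMENTS = {
    0x1D: " ",  # U+001D group separator -> space (assumed)
    0x1F: " ",  # U+001F unit separator -> space (assumed)
}


def clean(text, replacements=REPLACEMENTS):
    """Return text with every mapped codepoint replaced by its stand-in."""
    # str.translate accepts a dict keyed by integer codepoints.
    return text.translate(replacements)


if __name__ == "__main__":
    print(clean("one\u001ftwo\u001dthree"))
```

Further entries can be added to the mapping as more boxed characters are identified, for example mapping a codepoint that looks like a y to `"y"`.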