Ubuntu – Text inside files has squares with numbers in it

Some text files I come across have little squares with numbers in them (in place of certain characters). I am unable to copy and paste these in Ubuntu, but I can search and replace each character individually in gedit (replacing it with what I think is its best match). Obviously, this is only feasible if there are only a few types of square.

An example of several of the squares

I'm led to believe that these squares are displayed because I am missing certain fonts. My aim is to convert this into an ePub or PDF file.

My questions are:

What type of encoding is this, and why does this happen?
If it is missing fonts, can I install them and will this solve the problem (allow me to convert symbols to PDF e.g. using Calibre)?
Is there an application to convert my text file to a text file without these squares, instead replacing them with a similar character? For example, the symbol is pretty much a y, so I would like this function to replace each instance of with a y.

An example txt file is here and it originally looked like this (note that the inaccuracies were introduced by OCR).

Note: I couldn't get either uni2ascii or iconv to work (though I may not have been using the correct [options]), so please check with the given file before posting a solution!

Best Answer

The boxes mean "glyph not found"; the characters in the box are the hexadecimal representation of the character's Unicode code point.
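
To see which code points are hiding behind the boxes, something like the following might help. It is only a minimal sketch: the function name `suspicious_codepoints` is my own, and it assumes the text has already been decoded correctly. It also only flags control, unassigned, and private-use characters, so an ordinary letter that your font simply lacks would not show up here.

```python
import unicodedata
from collections import Counter

def suspicious_codepoints(text):
    """Count control, unassigned, and private-use characters by code point."""
    counts = Counter()
    for ch in text:
        # Ordinary line breaks and tabs are expected in a text file.
        if ch in "\n\r\t":
            continue
        # Cc = control, Cn = unassigned, Co = private use.
        if unicodedata.category(ch) in ("Cc", "Cn", "Co"):
            counts[f"U+{ord(ch):04X}"] += 1
    return counts

# Made-up sample containing U+001F and U+001D.
sample = "m\x1f book\x1d"
print(suspicious_codepoints(sample))
```

The keys are written in the same hexadecimal form that appears inside the boxes, so they can be matched up by eye.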

There are two possibilities: the character encoding is garbled, or the font you are using doesn't have a glyph for that character. This is a great overview of character encoding if you really want to understand it: http://trochee.net/2011/05/character-encoding-tutorial/
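
One rough way to tell the two cases apart is to decode the raw bytes with a few candidate encodings and compare the results. This is a sketch under assumptions: `try_decodings` is a hypothetical helper, and the list of encodings is just a guess at common ones, not something taken from the example file.

```python
def try_decodings(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode raw bytes with each candidate encoding; None means it failed."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

# Made-up sample: "café" saved as Windows-1252 is not valid UTF-8.
raw = "café".encode("cp1252")
for enc, text in try_decodings(raw).items():
    print(enc, repr(text))
```

For a real file you would read the bytes with `open(path, "rb").read()`. If the text decodes cleanly and still shows boxes, the encoding is probably fine and the characters themselves are the problem.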

Curiously, U+001F (unit separator) and U+001D (group separator) are control characters, really just glorified line breaks. It seems odd that OCR would return those.
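
If you decide what each stray character should become, the replacement itself is easy to script. The following is a minimal sketch: the `REPLACEMENTS` table and the `clean` function are hypothetical, and the choice of what to substitute is a guess that you would need to adjust after looking at the original scan.

```python
# Hypothetical mapping from code point to replacement text.
# Which replacement is right for each character is an assumption.
REPLACEMENTS = {
    0x1F: "y",   # assumed: this square stands in for a "y"
    0x1D: "\n",  # assumed: treat the group separator as a line break
}

def clean(text, table=REPLACEMENTS):
    """Replace each mapped code point using str.translate."""
    return text.translate(table)

print(repr(clean("m\x1f book\x1d")))
```

`str.translate` accepts a dictionary keyed by code point, so every square type can be handled in a single pass instead of one search and replace per character.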
