Generally the page's own declared encoding is followed, unless the server specifies an encoding in the HTTP headers, in which case the header wins. As the <meta>
tag seems to specify what you're expecting, and as manually switching to that value helps, it sounds like the server you're getting the page from is sending an incorrect encoding (Windows-1252) in the headers to the browser.
The proper way to fix it is to configure the server correctly. For a company webserver, this probably means bugging the server admin to do it.
To see the (wrong) headers, if you're familiar with such tools, you can use things like Firebug's "Net" panel in Firefox, or Web Inspector's "Resources" panel in Chrome or Safari. Or, if you don't know these tools and the web site is publicly accessible, you can easily see the server's headers online using, for example, Web-Sniffer.
Assuming the login page specifies the same as the actual pages, this yields:

Content-Type: text/html

...without any value for charset. I'm not sure whether a browser should then still interpret the <meta> tag, but apparently Firefox is ignoring it and making some best guess.
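For illustration, here's how a header parser sees such a response. This sketch uses Python's stdlib email.message parser (which handles MIME-style headers like Content-Type); the header values are made up for illustration:

```python
# Sketch: how a Content-Type header with and without a charset parses.
# Header values below are made up for illustration.
from email.message import Message

no_charset = Message()
no_charset["Content-Type"] = "text/html"
print(no_charset.get_content_charset())  # None: the server declared no charset

with_charset = Message()
with_charset["Content-Type"] = "text/html; charset=Windows-1252"
print(with_charset.get_content_charset())  # windows-1252 (normalized to lowercase)
```

In the first case the browser has to fall back on the <meta> tag or on guessing, which is exactly the ambiguity described above.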
Firefox ignoring it might be caused by the HTML source. The <meta>
tag should always be specified within <head>
before anything else, as it might also apply to the title, scripts, CSS, and so on. On this site it doesn't, and even worse, the HTML is a total mess:
<SCRIPT LANGUAGE=JavaScript SRC="/dergi/_ScriptLibrary/pm.js"></SCRIPT>
<SCRIPT LANGUAGE=JavaScript>
thisPage._location = "/dergi/giris/login.asp";
</SCRIPT>
<FORM name=thisForm METHOD=post>
<HTML>
<style type="text/css">
<!--
[..]
-->
</style>
<HEAD>
[..]
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=WINDOWS-1254">
<META NAME="GENERATOR" CONTENT="Microsoft FrontPage 5.0">
<META NAME="AUTHOR" CONTENT="[removed to protect the innocent...]">
<TITLE>YAYSAT DERGÄ° RAPORLARI</TITLE>
</HEAD>
<BODY>
<center>
[..]
</center>
</body>
<INPUT type=hidden name="_method">
<INPUT type=hidden name="_thisPage_state" value="">
</FORM>
</html>
Huge developer fail.
(Incidentally, Web-Sniffer shows <meta http-equiv=content-type content="text/html; charset=ISO-8859-1">, but that is due to its own values for Accept-Charset. Firebug shows <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="TEXT/HTML; CHARSET=WINDOWS-1254">, just like in the question.)
Best Answer
To start with basics: everything is based on US-ASCII, a 7-bit code with 128 code points in the set, numbered hex 00 through 7F (decimal 0-127). These are mapped to control codes, the English alphanumerics, and basic punctuation characters.
Adding 1 bit to this for an 8-bit code (a byte) gives us another 128 code points, known as Extended ASCII.
Character sets/code pages were needed early on to change how the code points in the upper 128 positions mapped to characters, so as to cover the alphabet of the particular language you wished to represent. This works reasonably well for most western European languages. ISO 8859-1/Latin-1 is an example of such a character set; another is Windows-1252, which alters ISO 8859-1 to cover more or different characters.
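A quick sketch of that ambiguity, using Python's built-in codecs: the very same byte decodes to different characters under different code pages (Windows-1254 is the Turkish code page from the question):

```python
# One byte, three interpretations, depending on the code page.
raw = b"\xf0"
print(raw.decode("latin-1"))  # 'ð' in ISO 8859-1
print(raw.decode("cp1252"))   # 'ð' in Windows-1252 as well
print(raw.decode("cp1254"))   # 'ğ' in Windows-1254 (Turkish g-breve)
```

Without knowing which code page the bytes were written in, there is no way to pick the right character.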
Languages with larger character repertoires, like Chinese, Japanese, and Korean, exceed the capacity of a 256-code-point set and use double-byte encodings to represent their characters.
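For example, Shift_JIS (a common Japanese double-byte encoding) spends two bytes on a single character; a minimal sketch in Python:

```python
# Double-byte encodings: one character, two bytes.
text = "あ"  # Japanese hiragana 'a'
encoded = text.encode("shift_jis")
print(encoded)       # b'\x82\xa0'
print(len(encoded))  # 2 bytes for 1 character
```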
UTF-8 is a multi-byte Unicode encoding scheme (1-4 bytes per character) whose first 128 code points are identical to US-ASCII, so plain ASCII text is already valid UTF-8 (and Unicode's first 256 code points in turn match ISO 8859-1/Latin-1). It has room for over 1 million code points, which means each code point can represent a single, unambiguous character, unlike the mucking around done with Extended ASCII, where a code point maps to a different character depending on the character set/code page/encoding.
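The variable-width behaviour is easy to see in Python: ASCII characters stay one byte, while characters further up the Unicode range take two, three, or four:

```python
# UTF-8 byte lengths grow with the code point.
for ch in ("A", "é", "€", "𝄞"):  # U+0041, U+00E9, U+20AC, U+1D11E
    print(repr(ch), "->", len(ch.encode("utf-8")), "byte(s)")
# 1, 2, 3 and 4 bytes respectively
```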
Fonts are collections of glyphs, mapped to code points, that visually represent characters. The contents of a font depend on which languages it was originally meant to cover. You can use Character Map to see which glyphs a font contains.
Unicode fonts don't necessarily cover all the code points; you need to check where they were intended to be used. For example, in Windows 7, fire up Character Map and view the characters in Calibri, then compare them to Ebrima, Meiryo and Raavi. Note that they are vastly different, because each one is tailored to a different geographic region.
As to Unicode fonts and the Windows-1252 character set: Windows uses a mapping table to translate Windows-1252 to Unicode. Where Windows-1252 doesn't match ISO 8859-1, a "Best Fit" mapping is applied, and some characters in the Windows-1252 character set may not display at all.
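A sketch of the two interesting cases, again with Python's codecs: byte 0x80 exists in Windows-1252 (as the euro sign, U+20AC) even though ISO 8859-1 has only a control code there, while a handful of bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are simply undefined in Windows-1252 and have nothing to display:

```python
# Windows-1252 vs Unicode: most bytes map cleanly, a few don't exist at all.
print(b"\x80".decode("cp1252"))  # '€' (U+20AC); ISO 8859-1 has no euro at 0x80

try:
    b"\x81".decode("cp1252")     # 0x81 is unassigned in Windows-1252
except UnicodeDecodeError:
    print("0x81 has no character in Windows-1252")
```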