Character Sets and Languages

So what is a “character set,” anyway?

A long time ago, way back in the Sixties, ASCII was invented. The name stands for American Standard Code for Information Interchange. Since computers can only deal with numbers internally, ASCII provides an agreed-upon standard so that numbers can represent letters, punctuation, spaces, and all the other characters you see on your computer screen. For example, in ASCII the number 65 stands for an uppercase ‘A’, 97 stands for a lowercase ‘a’, 32 represents a space, and so on. Each ASCII character requires 7 bits of storage, which fits comfortably into a computer byte, which is 8 bits long.
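
If you'd like to see those numbers for yourself, here is a minimal sketch in Python (my choice for illustration; it has nothing to do with how Pineapple News itself is written). The built-in ord() and chr() functions convert between characters and their code numbers.

```python
# Look up the ASCII code number for a few characters.
for ch in ["A", "a", " "]:
    print(repr(ch), "->", ord(ch))

# chr() goes the other way: from a number back to a character.
word = "".join(chr(n) for n in [72, 105])
print(word)

# Every ASCII code is below 128, so it fits in 7 bits and the
# eighth bit of each byte is left unused.
ascii_bytes = "Hello".encode("ascii")
all_seven_bit = all(b < 128 for b in ascii_bytes)
print(all_seven_bit)
```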

ASCII has become the single most entrenched standard in all of computer-dom. Its wide use is why you are able to transfer text files from one computer architecture to another without trouble, from an Amiga to a DEC minicomputer, say. Alas, ASCII just isn’t up to the big job of displaying all the world’s many and varied languages. Since each character is stored in 7 bits, there can be only 128 unique values, which is suitable for displaying simple text in U.S. English, but nothing else. To solve this problem, various “character sets” were created to allow the display of more information — accented characters, monetary unit symbols like the pound and the Euro, and so on — needed for other languages. The easiest and most obvious first step is to expand the character width to 8 bits, which still fits into a single computer byte, but doubles the number of displayable characters.
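
The arithmetic is easy to check. The sketch below, again in Python and purely illustrative, counts the values available in 7 and 8 bits and shows what happens when accented text meets plain ASCII. The helper name fits_in_ascii is my own invention.

```python
seven_bit_values = 2 ** 7   # 128 possible values in 7 bits
eight_bit_values = 2 ** 8   # 256 possible values in 8 bits

def fits_in_ascii(text):
    """Return True if every character in text has an ASCII code (0-127)."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

plain = fits_in_ascii("resume")      # unaccented U.S. English
accented = fits_in_ascii("résumé")   # needs characters ASCII lacks

# An 8-bit extension such as ISO-8859-1 stores the pound sign
# in a single byte whose value is above 127.
pound_byte = "£".encode("iso-8859-1")
```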

There is a wide variety of character sets that are 8-bit extensions of ASCII, optimized for particular languages or cultures. The ISO-8859 family, ISO-8859-1 through ISO-8859-16, is the most popular bunch. But this scheme leaves much to be desired. That’s a lot of character sets to keep up with, for one thing. A computer program designed to deal with multiple languages must keep track of all of them, and have some way of knowing which character set was used to create a given document. Some languages, Japanese for instance, have far too many unique characters to fit in a character set that is simply an 8-bit extension of ASCII. This has led to bizarre character sets like Shift-JIS, which is almost as complicated as a programming language. Wouldn’t it be better if we could use just one character set to display all languages?
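
To see why a program has to know which character set a document was written in, consider this small Python sketch. It decodes the very same byte using three different 8-bit character sets and gets three different characters. The choice of byte and of character sets is just mine, for illustration.

```python
# One byte with the high bit set, so it lies outside plain ASCII.
mystery_byte = b"\xe4"

# Decode the same byte under three different 8-bit character sets.
decoded = {
    charset: mystery_byte.decode(charset)
    for charset in ["iso-8859-1", "iso-8859-5", "iso-8859-7"]
}
for charset, char in decoded.items():
    print(charset, "->", char)

# Japanese does not fit in one byte per character at all:
# Shift-JIS spends two bytes on each of these three kanji.
shift_jis_length = len("日本語".encode("shift_jis"))
```

Run it and you get a Western European letter, a Cyrillic letter, and a Greek letter from identical input, which is exactly the ambiguity described above.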

Enter Unicode. The Unicode standard has room for more than a million characters and is capable of representing almost every written language ever known. Alas, a single Unicode character won’t fit into a single 8-bit byte, and ASCII and the concept of the 8-bit character are far too entrenched for Unicode to take over the world and become the dominant standard any time soon. This has led to compromise character sets like UTF-8, which represents each Unicode character in one to four bytes, and provides pretty good backward compatibility with ASCII.
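
Here is a rough illustration of that variable-width idea, sketched in Python. The sample characters are my own picks; they happen to need one, two, three, and four bytes respectively. The last two lines show the backward compatibility: text that is pure ASCII comes out byte-for-byte identical in UTF-8.

```python
samples = ["A", "é", "€", "😀"]

# Count how many bytes UTF-8 spends on each sample character.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, length in byte_lengths.items():
    print(repr(ch), "takes", length, "byte(s)")

# Pure ASCII text is unchanged when encoded as UTF-8.
same_bytes = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(same_bytes)
```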

Neither Unicode nor its kissin’ cousin UTF-8 has taken over the world. On USENET, Unicode is far less prevalent than ISO-8859-1, in fact. A program that tries to be a good international citizen must take all of this into account.

Character sets, USENET, and Pineapple News

This is the way things should work, ideally: Every USENET message ever posted would have a header line indicating what character set was used to create it, and Pineapple News would recognize that character set and decode it properly into something displayable on your particular computer. If you believe that is what will actually happen, I’ve got a bridge I’d like to sell you.

Many things can and do go wrong. Some messages do not have a header line saying what character set was used to create them. Some messages have such a line, but it indicates a character set that Mac OS X is not capable of decoding. Even worse, sometimes a message will indicate a character set that was not the one used to create it.
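
The three failure modes can be reproduced in a few lines of Python. This is only a sketch: it uses Python's standard email and codecs modules as stand-ins for the decoding that Mac OS X does, and the function names declared_charset and is_supported, along with the sample messages, are made up for the occasion.

```python
import codecs
from email import message_from_string

def declared_charset(raw_message):
    """Return the charset named in the Content-Type header, or None."""
    msg = message_from_string(raw_message)
    return msg.get_content_charset()

def is_supported(charset):
    """Return True if this Python installation can decode charset."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

tagged = "Content-Type: text/plain; charset=ISO-8859-1\n\nHello\n"
untagged = "Subject: no charset here\n\nHello\n"

# Problem 1: no header line says which character set was used.
missing = declared_charset(untagged)

# Problem 2: the header names a character set that cannot be decoded.
unsupported = not is_supported("x-made-up-charset")

# Problem 3: the label is wrong. This text was written as UTF-8
# but is decoded as ISO-8859-1, so the accented letter turns to junk.
mislabeled = "café".encode("utf-8").decode("iso-8859-1")
print(mislabeled)
```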

Another thing that can go wrong is that the program might correctly decode a given message, but the font you are using doesn’t have the proper “glyphs” to display all the characters. A “glyph” is a representation of how a given character should look onscreen. A single glyph describes the arrangement of pixels necessary to display a lowercase letter ‘e’, for instance. If your font is missing the glyph for a particular character, then that character might be displayed as a diamond shape with a question mark inside. Fortunately, this almost never happens on Mac OS X, because it performs a clever trick I’m going to call “font substitution,” for lack of a better term. If the current font in use doesn’t have a glyph needed to display a particular character, Mac OS X will go searching through other installed fonts until it finds one that’s appropriate, then substitutes the proper glyph from the “donor font.” This is so effective that, despite all the rigorous testing I’ve done while working on my newsreader, I’ve never once seen the dreaded diamond-with-question-mark glyph. The only reason I’m aware of the missing character glyph at all is because of a botched download attempt: one day my web browser tried to display a binary file as if it were text.
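
For the curious, the idea behind font substitution can be modeled in a few lines of Python. This is a toy: I am assuming each font is nothing more than a set of characters it can draw, and the font names and the pick_font function are invented. What Mac OS X really does is surely more involved.

```python
def pick_font(char, preferred, installed):
    """Return the name of the first font with a glyph for char, or None.

    The preferred font is tried first, then every other installed font.
    """
    others = [name for name in installed if name != preferred]
    for name in [preferred] + others:
        if char in installed[name]:
            return name
    return None

# Two hypothetical fonts, each described by the characters it covers.
fonts = {
    "LatinOnly": set("abcdefghijklmnopqrstuvwxyz"),
    "GreekDonor": set("αβγδε"),
}

own_glyph = pick_font("e", "LatinOnly", fonts)     # preferred font has it
borrowed = pick_font("δ", "LatinOnly", fonts)      # taken from a donor font
nobody_has_it = pick_font("日", "LatinOnly", fonts)  # no font has the glyph
```

Only in the last case, where no installed font has the glyph, would the missing-character symbol have to be shown.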

Installing languages in Mac OS X

Mac OS X requires that you install support for a given language before you are able to make use of it. To see what languages are installed on your computer, start up System Preferences, then select the “International” preference icon. The first tab lists the languages installed on your computer, and the order of preference for each. Language support includes input methods, which allow you to type characters in that language, encoding for related character sets, and fonts that contain the glyphs needed to render the characters in that language.

Character sets currently supported

To see the full list of supported charsets, start at the Pineapple News Application menu, then pick “Preferences...” Go to the tab labeled “Message Creation.” You should see a pop-up menu button that allows you to select the character set that will be used for messages you write. Click on the menu button to open it, and you’ll see a menu with all character sets supported by the program on your computer. If you don’t see a character set you would like to use, you may need to install additional languages, as described earlier. If you believe you have the proper languages installed, then it may be that Pineapple News does not yet support that character set. Send me an e-mail and I might be able to fix it for you. If Mac OS X has support for your character set, then only a small change in my code will make Pineapple News support it as well.

Setting a default character set for a newsgroup

Unfortunately, it is all too common for people to use newsreaders that do not properly tag their messages with the character set that was used to create them. The users in a particular newsgroup tend to all use the same character set, so it makes sense to assign a default character set for the group. The default character set will be used for any message in that group for which the proper character set cannot be determined.
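
The fallback logic amounts to a short chain of choices. Here it is sketched in Python, which is not the actual Pineapple News code. The names usable and pick_charset are hypothetical, and the last-resort value of ISO-8859-1 is purely my assumption for the sketch.

```python
import codecs

def usable(charset):
    """Return True if charset is named and can actually be decoded."""
    if not charset:
        return False
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        return False

def pick_charset(declared, group_default, last_resort="iso-8859-1"):
    """Prefer the message's own tag, then the newsgroup default."""
    for candidate in (declared, group_default):
        if usable(candidate):
            return candidate
    return last_resort  # assumed fallback, for illustration only

# An untagged message in a group whose default is KOI8-R.
raw = "Привет".encode("koi8-r")
text = raw.decode(pick_charset(None, "koi8-r"))
print(text)
```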

To set a default character set for a newsgroup, click on its icon in the storage view. Right-click or control-click to bring up the group’s context menu. From that menu, select “Newsgroup Prefs.” The newsgroup preferences window has a pop-up menu that allows you to assign a default charset for the group.

Setting a character set for a single message

You may occasionally encounter a “rogue” message that is not tagged with a charset value and is also not encoded according to the newsgroup default. Therefore, Pineapple News makes it possible to assign a character set for each individual message.
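
If you are wondering how anyone would figure out the right character set for a rogue message, one approach is simply to try a few and look at the results. The Python sketch below does that. The function name preview_decodings and the list of candidates are my own, and this is not a feature of Pineapple News.

```python
def preview_decodings(raw_bytes, candidates):
    """Decode raw_bytes under each candidate charset for comparison.

    Bytes that are invalid in a given charset become U+FFFD, the
    replacement character, instead of raising an error.
    """
    previews = {}
    for charset in candidates:
        previews[charset] = raw_bytes.decode(charset, errors="replace")
    return previews

# A German greeting that was really written in ISO-8859-1.
rogue = "Grüße".encode("iso-8859-1")
previews = preview_decodings(rogue, ["utf-8", "iso-8859-1", "koi8-r"])
for charset, text in previews.items():
    print(charset, "->", text)
```

Only one of the three previews reads as sensible German, and that is the character set to assign to the message.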

In the program’s main window, select a message in the headers view. From the Message menu, select “Character Set...” A small window will appear that allows you to assign a new character set for the message.

Selecting a message font

Pineapple News allows you to select the font used to display messages in the main window. Click on the message view, right-click or control-click to display its context menu, select “Font,” then “Show Fonts.” If you have a particular language in mind, I have to assume you know which fonts have the right characters for it, because I’m not going to be much help there. The program will remember the font you selected and apply it the next time the program starts. You can also right-click on the headers displayed directly above the message view and select a font for that area of the window.

Selecting a character set for messages you create

Pineapple News lets you select the character set that will be used to create new messages. While you are typing the message, the text is stored in Unicode. The character set you specify doesn’t get used until the message is saved to disk, when the translation from Unicode to your chosen character set takes place.
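
The save-time translation looks roughly like this in Python. It is a sketch of the general idea, not the program's real code, and the names encode_for_saving and can_save_as are hypothetical. It also shows the catch: not every character set can hold every character you might type.

```python
def encode_for_saving(text, charset):
    """Translate in-memory Unicode text into bytes in the chosen charset."""
    return text.encode(charset)

def can_save_as(text, charset):
    """Return True if every character in text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

draft = "Price: 10 €"

utf8_ok = can_save_as(draft, "utf-8")         # UTF-8 can hold any character
latin1_ok = can_save_as(draft, "iso-8859-1")  # ISO-8859-1 has no Euro sign

# ISO-8859-15 does have a Euro sign, stored as a single byte.
saved = encode_for_saving(draft, "iso-8859-15")
```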

To set your preferred character set, select “Preferences...” from the Pineapple News application menu, then select the tab labeled “Message Creation.” That tab contains a pop-up menu button that allows you to select your preferred character set. The default is UTF-8. If you have a burning desire to use a non-standard character set, then here are some of the more popular ones, and why you might choose them.
