Kindle Newspapers Have Text-Encoding Problems
by tino, Wednesday February 06th 2008, 12:39
Filed under: Technology

Or at least the Washington Post does. Or at least today’s Washington Post. In a story about how Hispanics mostly voted for Hillary, and not Obama, yesterday, the Post has a quote from one Cecilia Muñoz, from the National Council of La Raza:

[Image: the name ‘Muñoz’ as displayed on the Kindle, with a ‘¿’ in place of the ‘ñ’]

Text-encodings are the bane of my life, so I have a tiny bit of sympathy for whoever produces this thing. Only a very tiny bit, though, because after all I am paying for this (or would be if I were not in the free-trial period, anyway).

The problem arises from the fact that the modern digital computer, and the Internet, and E-Mail and most of this stuff are all American inventions. A lot of people in the U.S. speak Spanish these days, but in the computer-science departments where they design this kind of thing, everyone speaks English. And English is almost unique among European languages in that it does not use diacritic marks.

In a way, this is unfortunate, because there are a lot of sounds in English that do not map well to the Latin alphabet. But that’s the way it is, and it’s made life easier for printers for hundreds of years. You can express the entire universe of thought in English with just fifty-two characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz

Continental Europe is not as efficient, and thus they require all of those, and also:

áàâäéèêëîïíìóòöôüûúùÿçñ

as well as many others. When computers capable of dealing with text were first built, the cost of things was such that they didn’t even use lowercase letters. If your character set is limited to 26 characters and a few punctuation symbols, you can fit a single character into five bits — which is important when every bit of memory costs a few bucks.

For a long time, the most common encoding for text was called ASCII. ASCII characters are normally stored in eight bits of memory each (though, as we’ll see, only seven of those bits are actually used). Some of these are:

Binary    Character
01000001  A
01000010  B
01000011  C

and so on. Lowercase letters are the same, except the third bit is 1 instead of 0:

Binary    Character
01100001  a
01100010  b
01100011  c
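These bit patterns are easy to check for yourself. Here is a minimal Python sketch; the helper name `to_bits` is made up for illustration:

```python
def to_bits(ch: str) -> str:
    """Return the character code of a single character as 8 binary digits."""
    return format(ord(ch), "08b")

# Print each uppercase letter next to its lowercase twin.
for upper in "ABC":
    lower = upper.lower()
    print(upper, to_bits(upper), lower, to_bits(lower))
```

Each pair of rows differs in exactly one position, the bit worth 32.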

So pressing the shift key on an old terminal or teletype just flipped that third bit. To convert an ASCII string from uppercase to lowercase or vice-versa you don’t have to worry about what the characters actually are; you just have to set the third bit appropriately. This is particularly important when you are doing a search. To the computer, the strings ‘TiNoToPiA’ and ‘Tinotopia’ are entirely different. To do a search that’ll find either one, you just look for that string of bits, while ignoring the state of the third bit of every character. Thus the computer, which is just a collection of electricity, can see that ‘B’ and ‘b’ represent the same thing as easily as you can.
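The search trick can be sketched in a few lines of Python. This is only an illustration: the names `CASE_BIT`, `fold` and `same_ignoring_case` are invented here, and the sketch assumes plain ASCII text. It masks the case bit (the one worth 32) only for letters, because some non-letter pairs such as ‘@’ and ‘`’ also differ by that bit without being the same character.

```python
CASE_BIT = 0b00100000  # value 32: the one bit that differs between 'A' and 'a'

def fold(ch: str) -> int:
    """Clear the case bit of an ASCII letter; leave every other character alone."""
    code = ord(ch)
    if ch.isascii() and ch.isalpha():
        return code & ~CASE_BIT
    return code

def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings while ignoring the case bit of each letter."""
    return len(a) == len(b) and all(fold(x) == fold(y) for x, y in zip(a, b))

print(same_ignoring_case("TiNoToPiA", "Tinotopia"))
print(same_ignoring_case("Tinotopia", "Tinotopic"))
```

The first comparison succeeds and the second fails, and no table of letters was needed for either.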

You’ll also note that because A comes before B, and because Z comes before a, you can sort a list by the binary values (which is fast), and get an alphabetized list with all the capitalized stuff on top. Most computers these days go to great lengths to ignore capitalization when sorting, like this view from the Mac Finder:

[Screenshot: a file listing in the Mac Finder, sorted without regard to capitalization]

The very same thing viewed in the terminal shows the clever idea from 1963 still at work inside the modern computer:

[Screenshot: the same listing in the terminal, with the capitalized names sorted first]
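The two orderings are easy to reproduce. A small Python sketch follows; the file names are invented for the example and are not the ones in the screenshots:

```python
names = ["apple", "Zebra", "banana", "Apple"]

# Plain sort: compares character codes, so every capital sorts before any lowercase.
by_code = sorted(names)

# Case-insensitive sort: compare lowercased copies, keep the original spelling.
ignoring_case = sorted(names, key=str.lower)

print(by_code)
print(ignoring_case)
```

The first list puts ‘Zebra’ ahead of ‘apple’, the way the terminal does; the second interleaves the names the way a Finder-style listing does.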

Classically, ASCII only used seven bits. In the character examples above, you will note that the first bit is always 0, because these characters all fit into seven-bit ASCII. If you’re reading this on a computer, and you’re using a standard American keyboard, look down at it. Every character printed on the top of all of the 47 keys in the main part of the keyboard fits into seven-bit ASCII. Since each key can generate two characters depending on whether the shift key is pressed or not, that’s 94 characters.

The space bar generates another character, and all of the letter keys and a few others ([, \, ], ^, _, ?, and @) can generate another ASCII character as well: these are called control characters, most of which are not visible to you. You can see a Tab character, but you can’t see an ‘End of Transmission’ or ‘Bell’. The ‘Bell’ character (^G) used to make a bell on the terminal ring whenever it was ‘displayed’ (these days, in most situations, it’ll make the computer beep).
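The link between a key and its control character is one more bit trick. Under the usual ASCII convention, holding Control flips the bit worth 64. A minimal Python sketch, with the function name `ctrl` made up for illustration (real terminals and operating systems may treat some of these keys differently):

```python
def ctrl(key: str) -> int:
    """Code of the control character written ^key, by the ASCII convention."""
    return ord(key.upper()) ^ 0x40  # flip the bit worth 64

print(ctrl("G"))  # Bell
print(ctrl("I"))  # Tab
print(ctrl("D"))  # End of Transmission
print(ctrl("@"))  # NUL
print(ctrl("?"))  # Delete
```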

Anyway, all of this fits into 7 bits, which can store 2^7, or 128, possible values. This is all you need to express things in English.
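The arithmetic can be tallied in Python. This sketch assumes the standard ASCII layout: 33 control codes (0 through 31, plus 127), the space at 32, and 94 visible characters from 33 through 126.

```python
codes = range(2 ** 7)  # every value that fits in seven bits

control = [c for c in codes if c < 32 or c == 127]
space = [c for c in codes if c == 32]
visible = [c for c in codes if 33 <= c <= 126]

print(len(visible), len(space), len(control))
print(len(visible) + len(space) + len(control) == 2 ** 7)
```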

Obviously, this will cause problems if you are trying to write in Spanish, or French, or German, or any of a whole bunch of other languages that require diacritics. Most European languages are written with the Latin alphabet, but all of them but English (and, with the exception of a few umlauts on imported words, Dutch) require diacritics.

Missing diacritics can in some cases completely alter the meaning of a word. In French, for instance, pâté means, well, pâté, as in pâté de foie gras. Pâte, on the other hand (without the acute accent on the e), is pronounced differently (‘pot’, more or less), and means ‘pasta’ or ‘dough’.

So diacritics are important. In comes ISO-8859-1 to the rescue. This is another character set, but the first 128 characters are exactly the same as ASCII. In ISO-8859-1, the first bit (always 0 for ASCII characters) can also be 1, which means that the number of possible values is doubled. The additional 128 characters are used for things like all those vowels with their jaunty continental hats, the ß the Germans use for ss in now bafflingly specific situations, the ¿ used in Spanish to warn you that a question is coming so you’d better pay attention, etc., etc.

Among these characters is the humble ñ, which has the binary value 11110001, or 241 in decimal. ‘¿’ is 191, so there’s no possibility of a conflict there.
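Python’s built-in ISO-8859-1 codec makes these values easy to check. A minimal sketch, with the helper name `latin1_byte` invented for the example:

```python
def latin1_byte(ch: str) -> int:
    """Return the single byte that ISO-8859-1 uses for one character."""
    return ch.encode("iso-8859-1")[0]

for ch in "ñ¿":
    value = latin1_byte(ch)
    print(ch, value, format(value, "08b"))
```

Both values have a 1 in the first of their eight bits, which is exactly what puts them outside 7-bit ASCII.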

To avoid as much breakage as possible, most (all?) modern text-encoding schemes take ISO-8859-1 (which, remember, incorporates the old ASCII) as their first 256 characters. So if you take an ASCII string, or an ISO-8859-1 string, and you just bash it into any other encoding, the same shapes should be displayed.
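For Unicode, at least, the claim holds at the level of character numbers, and that can be checked. One caveat the sketch also shows: the bytes stored on disk are a separate matter. UTF-8, for instance, matches ASCII byte for byte but spends two bytes on ñ, so reading raw bytes in the wrong encoding can still go wrong.

```python
# Every ISO-8859-1 byte value maps to the Unicode character with the same number.
all_match = all(bytes([i]).decode("iso-8859-1") == chr(i) for i in range(256))
print(all_match)

# The stored bytes can still differ from one encoding to another.
latin1_bytes = "ñ".encode("iso-8859-1")
utf8_bytes = "ñ".encode("utf-8")
print(latin1_bytes, utf8_bytes)

# Plain ASCII text comes out identical either way.
print("Munoz".encode("ascii") == "Munoz".encode("utf-8"))
```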

So I conclude (possibly erroneously) that I’m seeing a ¿ not because there’s a 10111111 in the Kindle Washington Post somewhere, but because there’s a 11110001 (ñ), and either the Kindle Washington Post is saying ‘Hello, I’m a 7-bit ASCII string’, or the Post is saying ‘Hello, I’m not telling what text-encoding I use’ and the Kindle is defaulting to 7-bit ASCII for some unfathomable reason.

Further, if I’m correct, the Kindle uses ¿ to indicate an unknown character, which is another bad idea because of course ¿ is a perfectly valid character. There’s a character just for saying ‘I can’t display this character’, and it’s this: � (U+FFFD, the Unicode Replacement Character). Or you might see this: 󠄀 (that’s actually ‘Variation Selector 17’, but I’m pretty sure that it’ll be displayed as ‘I can’t display this character’ on pretty much anything).
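The guess is easy to simulate. The Python sketch below takes the name as ISO-8859-1 bytes and decodes it as 7-bit ASCII. Python marks the byte it can’t handle with U+FFFD; the last step, swapping in a ¿, is purely a stand-in for what the Kindle might be doing and is not based on anything known about its software.

```python
raw = b"Mu\xf1oz"  # 'Muñoz' as ISO-8859-1 bytes; 0xF1 is 241

# Decoded with the right encoding, the name comes out intact.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as 7-bit ASCII, the byte 0xF1 is invalid and becomes U+FFFD.
as_ascii = raw.decode("ascii", errors="replace")

# Hypothetical device behavior: show '¿' wherever U+FFFD would appear.
kindle_like = as_ascii.replace("\ufffd", "¿")

print(as_latin1)
print(as_ascii)
print(kindle_like)
```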

So: text-encoding gremlins 0, Kindle Washington Post: 0.



  • Kindle Newspapers Suck
    by tino, Tuesday February 05th 2008, 18:03
    Filed under: Media, Technology

    Or, The Washington Post sucks on the Kindle. At least this morning’s version. Amazon offers a two-week free trial of newspaper subscriptions on the Kindle, so this morning I poked and prodded, and wound up with the Post on there. And it’s terrible.

    I don’t know why I’m surprised. I’ve been complaining for years that online newspapers suck because they almost totally fail to take advantage of one of the newspapers’ best skills — selling stories by placement. And I’ve been pointing out that the Kindle is good for one thing only — reading single, long pieces of text. And still I’m shocked at how bad the Kindle version of the Post is.

    This morning’s paper Washington Post has eight different sizes of headline on the front page. The front page of the Post’s website right now has three.

    The Kindle version has one headline size.

    What’s more, the Post’s liberal use of label heads means that a lot of the Kindle headlines are almost totally useless. A label head is a headline that is not, by any stretch of the imagination, a complete sentence. In traditional newspaper headlines, you leave out articles, forms of be, etc., etc. and wind up with something that’s extremely pithy but that still tells the story that it sits atop. ‘Elvis Dead’ would be a good example, or, to use an example from this morning’s Post, ‘Bush’s Budget Projects Deficits’.

    Label heads, on the other hand, are just that: labels. They don’t tell a story, and they don’t have even an implied, invisible verb. From this morning’s Post, we get:

    • Two Races, One Big Day
    • A Rich Market For Russian Icons
    • In China, Pulled by Opposing Tides

    Those last two are iffy: you could say that they mean ‘[There Is] A Rich Market For Russian Icons’ and ‘[People] In China [are] Pulled by Opposing Tides’, but both of those would be terrible headlines.

    The Post uses two-deck headlines a lot, though: a label head on top and a quite prolix (for a headline) thing underneath, usually set in italics. Whoever they have writing headlines at the Post is doing a pretty good job — not as good as at the New York Times, which generally has excellent headlines, but pretty good nevertheless — but those headlines can’t be repurposed for other media without being rewritten completely.

    For the Kindle version, of course, they don’t rewrite them. Except for a very few stories from the front page, they don’t include the subheads. Here’s the front of the Washington Post as seen on the Kindle:

    [Image: the front of the Washington Post as seen on the Kindle]

    Here’s the same thing as seen on paper:

    [Image: the front page of the paper Washington Post]

    I’ve gone to the trouble of photographing the entire A section of today’s Post (Virginia Boonies Edition), and the whole of the Kindle article list for the same section. There are a number of outright differences — stories which are present in one version that are totally missing from the other. Some of this might be explained by the fact that the version of the Post that you get out here in the hinterland is put to bed at about 10 p.m. the night before.

