Or at least the Washington Post does. Or at least today’s Washington Post. In a story about how Hispanics mostly voted for Hillary, and not Obama, yesterday, the Post has a quote from one Cecilia Muñoz, from the National Council of La Raza:

Text-encodings are the bane of my life, so I have a tiny bit of sympathy for whoever produces this thing. Only a very tiny bit, though, because after all I am paying for this (or would be if I were not in the free-trial period, anyway).
The problem arises from the fact that the modern digital computer, and the Internet, and E-Mail and most of this stuff are all American inventions. A lot of people in the U.S. speak Spanish these days, but in the computer-science departments where they design this kind of thing, everyone speaks English. And English is almost unique among European languages in that it does not use diacritic marks.
In a way, this is unfortunate, because there are a lot of sounds in English that do not map well to the Latin alphabet. But that’s the way it is, and it’s made life easier for printers for hundreds of years. You can express the entire universe of thought in English with just fifty-two characters:
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Continental Europe is not as efficient, and thus they require all of those, and also:
as well as many others. When computers capable of dealing with text were first built, the cost of things was such that they didn’t even use lowercase letters. If your character set is limited to 26 characters and a few punctuation symbols, you can fit a single character into five bits — which is important when every bit of memory costs a few bucks.
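The five-bit figure is just arithmetic, and a small Python sketch makes it concrete. The function name `bits_needed` is my own invention for illustration, not anything from an old machine's manual.

```python
def bits_needed(symbol_count: int) -> int:
    """Smallest number of bits that gives each of symbol_count symbols its own code."""
    if symbol_count < 1:
        raise ValueError("need at least one symbol")
    # n bits can tell 2**n symbols apart; (count - 1).bit_length() is the smallest such n.
    return (symbol_count - 1).bit_length()


print(bits_needed(26))   # letters only
print(bits_needed(32))   # letters plus a few punctuation symbols
print(bits_needed(128))  # a 128-character set
```

Five bits give 32 codes, so 26 letters leave room for six punctuation symbols before a sixth bit is needed.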
For a long time, the most common encoding for text was called ASCII. ASCII characters are normally stored one to a byte, which means that eight bits of memory are used to store every character. Some of these are:
| Binary | Character |
| 01000001 | A |
| 01000010 | B |
| 01000011 | C |
and so on. Lowercase letters are the same, except the third bit is 1 instead of 0:
| Binary | Character |
| 01100001 | a |
| 01100010 | b |
| 01100011 | c |
So pressing the shift key on an old terminal or teletype just caused that third bit to be cleared to 0. To convert an ASCII string from uppercase to lowercase or vice versa you don’t have to worry about which letters they actually are; you just have to set the third bit appropriately. This is particularly important when you are doing a search. To the computer, the strings ‘TiNoToPiA’ and ‘Tinotopia’ are entirely different. To do a search that’ll find either one, you just look for that string of bits, while ignoring the state of the third bit of every character. Thus the computer, which is just a collection of electricity, can see that ‘B’ and ‘b’ represent the same thing as easily as you can.
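Here is a rough Python sketch of that bit trick. The single bit that differs between the two tables has the value 0x20 (32). The names `CASE_BIT`, `fold_case` and `same_ignoring_case` are mine, made up for illustration. I only touch the bit on bytes that are actually letters, because blindly ignoring it would also make ‘@’ match ‘`’.

```python
CASE_BIT = 0b00100000  # 0x20, the one bit that differs between 'A' and 'a'


def fold_case(data: bytes) -> bytes:
    """Turn the case bit on for ASCII capitals so 'A' and 'a' become identical."""
    return bytes((b | CASE_BIT) if 0x41 <= b <= 0x5A else b for b in data)


def same_ignoring_case(a: bytes, b: bytes) -> bool:
    """Compare two ASCII byte strings while ignoring the case bit on letters."""
    return fold_case(a) == fold_case(b)


print(format(ord("A"), "08b"), format(ord("a"), "08b"))
print(chr(ord("B") ^ CASE_BIT))  # flipping the bit turns 'B' into 'b'
print(same_ignoring_case(b"TiNoToPiA", b"Tinotopia"))
```

This only works for plain ASCII letters; accented characters need real case tables.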
You’ll also note that because A comes before B, and because Z comes before a, you can sort a list by the binary values (which is fast), and get an alphabetized list with all the capitalized stuff on top. Most computers these days go to great lengths to ignore capitalization when sorting, like this view from the Mac Finder:
The very same thing viewed in the terminal shows the clever idea from 1963 still at work inside the modern computer:
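The two orderings are easy to reproduce in Python, which compares strings by character value unless told otherwise. The file names below are invented for the example.

```python
names = ["apple.txt", "Zebra.txt", "banana.txt", "Apple.txt"]

# Plain sort: compares character values, so every capital sorts before every lowercase letter.
by_value = sorted(names)

# Case-insensitive sort: compares lowercased copies, the way a file browser usually does.
by_letter = sorted(names, key=str.lower)

print(by_value)
print(by_letter)
```

The first list puts `Apple.txt` and `Zebra.txt` on top; the second interleaves them alphabetically.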

Classically, ASCII only used seven bits. In the character examples above, you will note that the first bit is always 0, because these characters all fit into seven-bit ASCII. If you’re reading this on a computer, and you’re using a standard American keyboard, look down at it. Every character printed on the top of all of the 47 keys in the main part of the keyboard fits into seven-bit ASCII. Since each key can generate two characters depending on whether the shift key is pressed or not, that’s 94 characters.
The space bar generates another character, and all of the letter keys and a few others ([, ], ^, _, ?, and @) can generate another ASCII character as well: these are called control characters, most of which are not visible to you. You can see a Tab character, but you can’t see an ‘End of Transmission’ or a ‘Bell’. The ‘Bell’ character (^G) used to make a bell on the terminal ring whenever it was ‘displayed’ (these days, in most situations, it’ll make the computer beep).
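The conventional caret notation (^G and friends) is another one-bit trick: the control character is the key’s uppercase code with the 0x40 bit flipped. A minimal sketch, assuming that convention; `control_char` is a name I made up, and real terminals have their own quirks.

```python
def control_char(key: str) -> int:
    """ASCII code conventionally written as ^key, e.g. ^G for Bell."""
    # Flipping bit 0x40 maps '@', 'A'..'Z', '[', ']', '^', '_' down to 0..31, and '?' up to 127.
    return ord(key.upper()) ^ 0x40


print(control_char("G"))  # Bell
print(control_char("I"))  # Tab
print(control_char("D"))  # End of Transmission
print(control_char("?"))  # Delete
```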
Anyway, all of this fits into 7 bits, which can store 2^7, or 128, possible values. This is all you need to express things in English.
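You can check the bookkeeping with Python, which knows which of the 128 values are visible characters. This is just a counting sketch, and the variable names are mine.

```python
total = 2 ** 7

# The characters Python considers printable among the 128 seven-bit values.
printable = [chr(code) for code in range(total) if chr(code).isprintable()]
control_count = total - len(printable)

print(total)           # every value seven bits can hold
print(len(printable))  # the 94 keyboard characters plus the space
print(control_count)   # the invisible control characters
```

That comes to 95 printable characters and 33 control characters, which accounts for all 128.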
Obviously, this will cause problems if you are trying to write in Spanish, or French, or German, or any of a whole bunch of other languages that require diacritics. Most European languages are written with the Latin alphabet, but all of them but English (and, with the exception of a few umlauts on imported words, Dutch) require diacritics.
Missing diacritics can in some cases completely alter the meaning of a word. In French, for instance, pâté means, well, pâté, as in pâté de foie gras. Pâte, on the other hand — without the acute accent on the e — is pronounced differently (’pot’, more or less), and means ‘pasta’ or ‘dough’.
So diacritics are important. In comes ISO-8859-1 to the rescue. This is another character set, but the first 128 characters are exactly the same as ASCII. In ISO-8859-1, the first bit (0 for all ASCII characters) can also be 1, which means that the number of possible values is doubled. The additional 128 characters are used for things like all those vowels with their jaunty continental hats, the ß the Germans use for ss in now bafflingly specific situations, the ¿ used in Spanish to warn you that a question is coming so you’d better pay attention, etc., etc.
Among these characters is the humble ñ, which has the binary value 11110001 (241 in decimal). ‘¿’ is 191, so there’s no possibility that there’s a conflict there.
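These values can be checked with Python’s built-in `latin-1` codec, which is its name for ISO-8859-1. The variable names are mine.

```python
n_tilde = "ñ".encode("latin-1")[0]
inverted_question = "¿".encode("latin-1")[0]

# Show each byte in decimal and as eight binary digits.
print(n_tilde, format(n_tilde, "08b"))
print(inverted_question, format(inverted_question, "08b"))

# Both have the first bit set, so neither fits in seven-bit ASCII.
print(n_tilde >= 128 and inverted_question >= 128)
```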
To avoid as much breakage as possible, most (all?) modern text-encoding schemes take ISO-8859-1 (which, remember, incorporates the old ASCII) as their first 256 characters. So if you take an ASCII string, or an ISO-8859-1 string, and you just bash it into any other encoding, the same shapes should be displayed.
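Here is a small check of that compatibility claim against Unicode, using Python’s standard codecs. There is one wrinkle: Unicode’s character numbers match ISO-8859-1 for the whole range, but the UTF-8 byte encoding only matches for the ASCII half. The variable names are mine.

```python
all_bytes = bytes(range(256))

# Decoding ISO-8859-1 gives the Unicode character with the same number as each byte.
numbers_match = all(ord(ch) == b for ch, b in zip(all_bytes.decode("latin-1"), all_bytes))

# Plain ASCII bytes read the same under all three encodings.
ascii_bytes = b"Tinotopia"
ascii_agrees = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("latin-1")
    == ascii_bytes.decode("utf-8")
)

# Above 127 the bytes themselves differ: one byte in ISO-8859-1, two in UTF-8.
latin1_form = "ñ".encode("latin-1")
utf8_form = "ñ".encode("utf-8")

print(numbers_match, ascii_agrees)
print(latin1_form, utf8_form)
```

So an ASCII file survives being relabeled as UTF-8, while an ISO-8859-1 file with an ñ in it has to be converted, not just relabeled.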
So I conclude (possibly erroneously) that I’m seeing a ¿ not because there’s a 10111111 in the Kindle Washington Post somewhere, but because there’s a 11110001 (ñ), and that either the Kindle Washington Post is saying ‘Hello, I’m a 7-bit ASCII string’, or the Post is saying ‘Hello, I’m not telling what text-encoding I use’ and the Kindle is defaulting to 7-bit ASCII for some unfathomable reason.
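To be clear, I have no idea what the Kindle’s software actually does. The following Python sketch only shows that my guess would produce what I’m seeing. The function `kindle_like_decode` and its `placeholder` parameter are entirely hypothetical.

```python
def kindle_like_decode(raw: bytes, placeholder: str = "¿") -> str:
    """Hypothetical reader: treat input as 7-bit ASCII and show a placeholder for anything else."""
    return "".join(chr(b) if b < 0x80 else placeholder for b in raw)


# The name as it would be stored in ISO-8859-1, with ñ as the single byte 0xF1.
raw_name = "Cecilia Muñoz".encode("latin-1")

print(kindle_like_decode(raw_name))
```

Under those assumptions the ñ comes out as ¿, while every plain ASCII letter around it is untouched.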
Further, if I’m correct, the Kindle uses ¿ to indicate an unknown character, which is another bad idea because of course ¿ is a perfectly valid character. There’s a character just for saying ‘I can’t display this character’, and it’s this: � (U+FFFD, the Unicode Replacement Character). Or you might see this: 󠄀 (that’s actually ‘Variation Selector 17’, but I’m pretty sure that it’ll be displayed as ‘I can’t display this character’ on pretty much anything).
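For comparison, here is what a decoder that plays by the rules does with the same byte. Python’s standard `errors="replace"` option substitutes U+FFFD, and the `unicodedata` module gives the official names of both characters mentioned above. The variable names are mine.

```python
import unicodedata

raw_name = "Muñoz".encode("latin-1")

# Decoding as strict ASCII would raise an error; "replace" marks the bad byte with U+FFFD.
decoded = raw_name.decode("ascii", errors="replace")

print(decoded)
print(unicodedata.name("\ufffd"))
print(unicodedata.name("\U000e0100"))
```

The placeholder is unambiguous this way: nobody can mistake U+FFFD for the start of a Spanish question.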
So: text-encoding gremlins 0, Kindle Washington Post: 0.





