CODING.INF

Japanese Codes

This is just a few explanatory words about Japanese coding. It started life as something I posted on the sci.lang.japan newsgroup.

First, the main codes for Japanese characters are laid down in the JIS X 0208-1997 standard. It has the kana, 6,355 kanji, a lot of symbols, and the more-or-less complete Latin (known as JIS-ASCII), Greek and Cyrillic alphabets. (There are two more standards: JIS X 0212 and JIS X 0213, which add even more kanji, etc.)

Let me give an example. The very first kanji in JIS X 0208 is 亜. In the standard it is defined by a pair of decimal numbers which place it in the 94 x 94 matrix used in that standard. This is known as the "kuten" code of the kanji.

Kuten: 16-01

The raw "JIS" code is formed by adding 32 (0x20) to (each of) the kuten pair. (In JIS the pair is usually stated in hexadecimal coding.) This moves the code up into the "printable ASCII" range where it won't upset software by pretending to be escapes, line-feeds, etc.

(Raw) JIS: 3021 (hex) I.e. 0! (ASCII)
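The kuten-to-JIS step is simple enough to sketch in a few lines of Python. The function name kuten_to_jis is just something I made up for illustration; it is not part of any standard or library.

```python
def kuten_to_jis(ku: int, ten: int) -> bytes:
    """Turn a kuten pair (row, cell; each 1..94) into the two raw JIS bytes."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in the range 1..94")
    # Adding 0x20 lifts both values into the printable ASCII range 0x21..0x7E.
    return bytes([ku + 0x20, ten + 0x20])


jis = kuten_to_jis(16, 1)
print(jis.hex(), jis.decode("ascii"))  # 3021 0!
```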

The EUC coding is formed by turning on bit 8 (the MSB) of each byte of the raw JIS code.

EUC: B0A1
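As a rough sketch in Python (jis_to_euc is a hypothetical helper name of mine), setting the top bit of each raw JIS byte gives the EUC bytes, and Python's own "euc_jp" codec can be used to check the result:

```python
def jis_to_euc(raw_jis: bytes) -> bytes:
    """Set the most significant bit of every raw JIS byte."""
    return bytes(b | 0x80 for b in raw_jis)


euc = jis_to_euc(bytes.fromhex("3021"))
print(euc.hex())             # b0a1
print(euc.decode("euc_jp"))  # 亜
```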

Shift-JIS is a bit hard to explain in words. Suffice it to say that the 14 bits of the raw JIS code are put through a transformation to make a pair of bytes. The MSB of the first is always set, and the second always lies in or above the printable ASCII range.

Shift-JIS: 889f
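For anyone who does want to see the transformation spelled out, here is a sketch in Python of the usual way it is done. The name jis_to_sjis is mine, and the sketch only covers the 94 x 94 JIS X 0208 range; it makes no attempt at vendor extensions.

```python
def jis_to_sjis(raw_jis: bytes) -> bytes:
    """Convert one two-byte raw JIS code (each byte 0x21..0x7E) to Shift-JIS."""
    j1, j2 = raw_jis
    # Two JIS rows share one Shift-JIS lead byte.
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40            # jump over 0xA0..0xDF, kept for half-width kana
    if j1 % 2:                # odd row: lower half of the trail-byte range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1           # 0x7F (DEL) is never used as a trail byte
    else:                     # even row: upper half of the trail-byte range
        s2 = j2 + 0x7E
    return bytes([s1, s2])


print(jis_to_sjis(bytes.fromhex("3021")).hex())  # 889f
```

The result can be compared with what Python's built-in "shift_jis" codec produces for the same character.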

[Did I hear you ask why Shift-JIS is so messy when EUC is simple and elegant? Well, Shift-JIS *was* invented by Microsoft... 8-)} Seriously though, a few more words about Shift-JIS are at the end of this document.]

And for the sake of completeness, the JIS/ISO-2022-JP coding wraps the "raw" JIS in escape sequences: one to shift into JIS X 0208 and one to shift back to ASCII afterwards:

ISO-2022-JP: ^[ $ B 0 ! ^[ ( B

Note that as many raw JIS codes as you like can be in the wrapper, although the RFC for Japanese in email limits it to 72 (I think).
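The wrapping can be sketched in Python as follows. The helper name wrap_iso2022jp and the two constants are my own labels; the escape sequences themselves are ESC $ B (switch to JIS X 0208) and ESC ( B (switch back to ASCII).

```python
ESC = b"\x1b"
KANJI_IN = ESC + b"$B"   # designate JIS X 0208
ASCII_IN = ESC + b"(B"   # return to ASCII


def wrap_iso2022jp(raw_jis: bytes) -> bytes:
    """Wrap any number of two-byte raw JIS codes in the shift-in/shift-out escapes."""
    return KANJI_IN + raw_jis + ASCII_IN


wrapped = wrap_iso2022jp(b"0!")
print(wrapped)  # b'\x1b$B0!\x1b(B'
```

Python's "iso2022_jp" codec produces the same bytes for 亜, which makes a handy cross-check.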

I'll have a short class test on this tomorrow to see how much you have remembered. 8-)}

There *are* some differences between JIS coding and ISO-2022-JP which I have glossed over. There is also the Unix X-Windows CTEXT_JA (Compound Text) coding, which is similar to JIS/ISO-2022-JP. You only need to worry about it if you are deep into X.

UNICODE AND UTF-8

The JIS standard described above has been around since the 1970s, and was the first (national) standard that included Kanji. Other standards were developed for Chinese and Korean, so we had a case where the one character appeared in different incompatible standards. After an attempt was made in the 1980s to create a common "East Asian" character set standard, a fresh start was made in 1988 to produce a single character set standard covering all of the world's writing systems. Initially a consortium of computer companies (Unicode) and the ISO started doing this independently, but they soon merged their efforts to produce the first ISO 10646/Unicode standard in the early 1990s.

The creation of a single kanji/hanzi/hanja set covering Japanese, Chinese and Korean use of "Chinese" characters is referred to as the "Han Unification", and has been a major element of the Unicode development.

At its most basic, Unicode is a 16-bit code (actually it is longer now, but most of the codes fit into 16 bits). The 亜 kanji has a Unicode value of (hex) 4e9c. All of the kana and many of the kanji have these 16-bit codes (the rarer kanji have values beyond 16 bits and need up to 32 bits). While the raw 16-bit Unicode values can be used in files, they often cause problems because one or other of the two bytes might be confused with a "/" or a new-line, etc. For this reason, most recording of Unicode characters in files uses a coding called "Unicode Transformation Format 8-bit" (UTF-8). This format leaves ASCII characters as single bytes, and converts every other Unicode value into a sequence of two to four bytes with bit 8 of each byte turned on, so they cannot be confused with ordinary ASCII characters. The UTF-8 coding for 亜 is (hex) e4ba9c.
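To show where e4ba9c comes from, here is a minimal sketch of the UTF-8 bit-packing in Python. The function utf8_encode is my own illustration (Python's str.encode does the real job), and it skips the validity checks a proper encoder would make.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (no surrogate or range checks)."""
    if cp < 0x80:                      # ASCII stays a single byte
        return bytes([cp])
    if cp < 0x800:                     # 11 bits -> 2 bytes
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 16 bits -> 3 bytes (kana, most kanji)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # beyond 16 bits -> 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x4E9C).hex())  # e4ba9c
```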

EMAIL AND NEWSGROUPS

In general, Japanese text included in emails and news messages should be in the JIS/ISO-2022-JP codings. Originally there were two reasons for this:

the old email standards (developed by and for Americans) prohibited 8-bit characters in emails;
a lot of communications software only passed 7-bit characters. (Some even removed escape characters.)

In addition, some other software, such as window handlers, mail readers, etc. which had not been written for 8-bit text, was known to crash when subjected to 8-bit codes.

Many of these reasons no longer apply, especially the communication paths, which are almost universally 8-bit clear, but enough people still run old software that the receiver-friendly thing is to use JIS/ISO-2022-JP codings.

WWW PAGES

Fortunately, WWW pages are in better shape than mail and news. You are free to code Japanese in WWW pages in EUC, Shift-JIS or ISO-2022-JP; just don't mix them in the page.

Browsers can be set to EUC/Shift-JIS, or to "autodetect" the codes. ISO-2022-JP can always be detected accurately, but telling the other two apart can be chancy, as the ranges overlap. Some people put characters in unique ranges in a comment at the front of the page to make sure the detection stays on the rails.
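The overlap is easy to demonstrate. In this Python sketch, plausible_codings is a hypothetical helper of mine that simply tries the two 8-bit codecs and reports which ones accept the bytes; real autodetectors are rather more cunning than this.

```python
def plausible_codings(data: bytes) -> list:
    """Return the codings under which the bytes decode without error."""
    if b"\x1b$" in data:               # an escape sequence gives ISO-2022-JP away
        return ["iso2022_jp"]
    found = []
    for codec in ("euc_jp", "shift_jis"):
        try:
            data.decode(codec)
            found.append(codec)
        except UnicodeDecodeError:
            pass
    return found


euc = "亜".encode("euc_jp")            # bytes B0 A1
print(plausible_codings(euc))          # ['euc_jp', 'shift_jis']
print(euc.decode("shift_jis"))         # ｰ｡  (two half-width kana, not 亜)
print(plausible_codings("亜".encode("shift_jis")))  # ['shift_jis']
```

The same two bytes are a perfectly legal kanji in EUC and a perfectly legal pair of half-width kana in Shift-JIS, which is why a short page can fool the detector.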

IMNSHO, it is very important that WWW pages state their coding in a way that browsers can detect. The traditional way to head up a page with Japanese in it is to have a "META HTTP" directive at the front. Here is what I have as default on the WWWJDIC pages:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=EUC-JP">

This both tells the browser what the codes are in the page, and more importantly, tells it how to code Japanese text in the input fields in a form before sending it to a server.

Using the HTML "meta" line for this purpose is now deprecated; instead W3C wants us to state the coding in the MIME header that precedes the HTML text. So in the WWWJDIC pages I begin with:

Content-type: text/html; charset=euc-jp

That's fine for CGI programs, but if you have static HTML pages, the MIME header is added by the server, and many have the charset set to "ISO-8859-1", which is the 8-bit code for Western European languages. Many sites are reluctant to change this default, as Apache advises it be set that way, so don't be too surprised if browsers can't detect you have Japanese in your pages. We still have to convince software developers that there are languages other than American.
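For the CGI case, the idea can be sketched in Python like this. It is only an illustration, not the WWWJDIC code: cgi_response is a name I have invented, and the point is simply that the charset named in the header and the coding of the bytes that follow must agree.

```python
import sys


def cgi_response(html: str, charset: str = "euc-jp") -> bytes:
    """Build a CGI response whose header names the coding used for the body."""
    header = f"Content-Type: text/html; charset={charset}\r\n\r\n"
    # The header is plain ASCII; the body is encoded in the declared charset.
    return header.encode("ascii") + html.encode(charset)


if __name__ == "__main__":
    sys.stdout.buffer.write(cgi_response("<html><body>亜</body></html>\n"))
```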

WHY SHIFT-JIS?

Why is there a "shift" in Shift-JIS? Well the reason actually goes back to an earlier standard now called JIS X 0201 (JIS201 for short), which is sort-of the extended Japanese version of ASCII. JIS201 is an 8-bit code, and thus has 256 possible values. The first 128 are pretty much ASCII, except that the "\" is replaced by the Yen symbol. As this standard, or to be more precise its predecessor, was one of the first to have Japanese characters in it, it was highly desirable to have at least a set of kana available for the early computers in Japan. As the full set of kana, when all the nigori/maru diacritical marks are added, overtaxes the space available in the remaining 128 codes, it was decided to use just the basic katakana set, and have the diacritic marks as separate characters. This led to the so-called "half-width kana" (hankaku kana). I remember when I was first living in Japan in 1981/2, my electricity invoices were all in this rather clumsy form. Long-distance train tickets were too.
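To see what "diacritic marks as separate characters" means in practice, here is a small Python sketch. It leans on Python's "shift_jis" codec and on Unicode NFKC normalisation to fold the half-width pair into the ordinary full-width kana; both choices are just my way of illustrating the point.

```python
import unicodedata

half = "\uff76\uff9e"   # half-width KA followed by the separate nigori mark
full = unicodedata.normalize("NFKC", half)   # folds the pair into one kana

print(half.encode("shift_jis").hex())         # b6de (one byte per character)
print(full, full.encode("shift_jis").hex())   # ガ 834b (one two-byte code)
```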

Well, hankaku kana is from the button-boots era of Japanese computing, but Microsoft, in their Infinite Wisdom, decided that it was terribly necessary, once the full kanji set became available as a two-byte code, to enable files and text for/from legacy systems to co-exist with the newer form. And thus arose "Shift-JIS", where the codes in the JIS X 0208 standard are shifted aside to make room for the old hankaku kana codes from JIS201.

In my biased, after-the-event view, this was unnecessary, and has caused a much bigger problem for the information processing industry in Japan than would have been the case had the EUC approach been adopted universally. It is not that Shift-JIS is that complicated a code: after all, who apart from idiots like me ever tries to decode these things on paper? It is because a large proportion of the two-byte code-space was wasted supporting an obsolete code. This has severely limited the range of kanji and other special characters that can fit in two bytes. In theory a two-byte sequence with the MSB on the first can code 32,000+ characters. Shift-JIS is limited to about 10,000, thanks to its support for JIS201 kana. Moreover, it has helped keep JIS201 half-width kana alive and well. Just look at a cash-register docket next time you are in Japan.
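The arithmetic behind those figures can be checked with a few lines of Python. I am assuming the lead-byte ranges 0x81-0x9F and 0xE0-0xFC and the trail-byte ranges 0x40-0x7E and 0x80-0xFC, as used in the common Microsoft variant; the variable names are mine.

```python
# Any first byte with the MSB set, followed by any second byte.
theoretical = 128 * 256

# Shift-JIS lead bytes skip 0xA0..0xDF, where the half-width kana live.
lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
# Trail bytes skip 0x7F (DEL).
trail_bytes = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))

sjis_two_byte = len(lead_bytes) * len(trail_bytes)

print(theoretical)                        # 32768
print(len(lead_bytes), len(trail_bytes))  # 60 188
print(sjis_two_byte)                      # 11280
```

Restricting the lead bytes to those actually needed for JIS X 0208 (up to 0xEF) gives 47 x 188 = 8,836, which is exactly the 94 x 94 matrix again.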

Anyway, that's the history.

Jim Breen
School of Computer Science & Software Engineering
Monash University
6 June 1999
29 October 2002
