Nov 02, 2021 20:00:00

An overseas engineer gives a serious explanation of Japanese 'mojibake' (garbled characters), showing a command of Japanese that rivals a native speaker's

by Whooym

The Japanese word 'mojibake', which refers to garbled characters that are not displayed properly and become unreadable, is said to be understood even among overseas engineers. Paul O'Leary McCann, who works on natural language processing (NLP) in Tokyo, has explained the different types of mojibake.

A Field Guide to Japanese Mojibake
https://www.dampfkraft.com/mojibake-field-guide.html

According to Mr. McCann, mojibake occurs when a document is opened with a character encoding different from the one it was created with. Garbled text is a meaningless string of characters and cannot be read, but different patterns appear depending on which encodings were involved, so once you get used to it, you can apparently guess which encodings were used.
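The mechanism can be sketched in a few lines of Python. This is a minimal illustration, not code from Mr. McCann's guide: the helper name `mojibake`, the Japanese sample sentence, and the list of encoding pairs are all assumptions made for the example.

```python
def mojibake(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one codec, then decode the same bytes with another."""
    raw = text.encode(written_as)
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return raw.decode(read_as, errors="replace")


# Assumed sample: "I am a cat." written in Japanese.
sample = "吾輩は猫である。"

# Hypothetical writer/reader pairs to compare.
pairs = [("utf-8", "shift_jis"), ("shift_jis", "utf-8"), ("euc_jp", "shift_jis")]
results = {pair: mojibake(sample, *pair) for pair in pairs}
```

Printing the values of `results` side by side shows that each pair garbles the same sentence in a visibly different way.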

◆ UTF-8
UTF-8 is the most common character encoding on the Internet and has become widespread in Japan in recent years. Opening sample text created in UTF-8 as Shift JIS gives the following result.

First, let's take the following as the sample text.

I am a cat. I have no name yet.
If you make a mistake in the encoding settings, the characters will be garbled.
The height of Tokyo Tower is 333m.

Encoding the above in UTF-8 and then opening it as Shift JIS produces garbled text.
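One way to reproduce this kind of output is the Python sketch below. It assumes a Japanese sample sentence ("I am a cat. I have no name yet.") and counts the kanji that Shift JIS produces from the two-byte prefixes shared by UTF-8 kana. The variable names are hypothetical.

```python
from collections import Counter

# Assumed sample: Japanese text that contains a lot of kana.
sample = "吾輩は猫である。名前はまだ無い。"

# Write as UTF-8, read as Shift JIS; undecodable bytes become U+FFFD.
garbled = sample.encode("utf-8").decode("shift_jis", errors="replace")

# In UTF-8, hiragana and katakana all start with the bytes E3 81, E3 82
# or E3 83. Shift JIS can read each of those prefixes as a single kanji.
prefix_kanji = {
    prefix.decode("shift_jis")
    for prefix in (b"\xe3\x81", b"\xe3\x82", b"\xe3\x83")
}

# How often each of those kanji shows up in the garbled text.
counts = Counter(ch for ch in garbled if ch in prefix_kanji)
```

Because almost every kana contributes one of these three prefixes, the same few kanji recur throughout the garbled text.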

Regarding this garbled text, Mr. McCann said, 'You can see that certain characters appear frequently, and there is an interesting article on Daily Portal Z that explores what they mean. The word "ungen" refers to a colored striped pattern used in textiles.'

In the image below, the edging of the tatami mat on which Yoshimitsu Ashikaga sits is 'ungen-beri' (ungen edging). 'This pattern used to signify high rank, but today you can see it on the hina dolls displayed at Hinamatsuri,' explains McCann.

In addition, Mr. McCann used this knowledge to work out that a garbled sign appearing in the anime 'Urasekai Picnic', broadcast in 2021, was UTF-8 text displayed as Shift JIS.

After seeing a lot of mojibake over the years I realized that different encoding pairs have different visual textures because of the particular kinds of garbage that come out. This scene from Urasekai Picnic has UTF8 rendered as SJIS, a very common combination. pic.twitter.com/798nEE8G1I
— Paul O'Leary McCann (@polm23) October 31, 2021
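Identifying an encoding pair by eye can be loosely imitated in code by trying to undo each possible pair. The sketch below is only an illustration under stated assumptions: the function name `undo_candidates`, the candidate list, and the kana-only sample are hypothetical, and this is not how Mr. McCann identified the sign.

```python
CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def undo_candidates(garbled: str) -> dict:
    """Try every (written_as, read_as) pair and keep those that reverse cleanly."""
    found = {}
    for written_as in CANDIDATES:
        for read_as in CANDIDATES:
            if written_as == read_as:
                continue
            try:
                # Recover the bytes the reader saw, then decode them the way
                # the writer intended.
                recovered = garbled.encode(read_as).decode(written_as)
            except UnicodeError:
                continue  # this pair does not reverse cleanly
            found[(written_as, read_as)] = recovered
    return found


# Hypothetical sample ("summer fireworks"), chosen so that no bytes are lost.
original = "なつのはなび"
garbled = original.encode("utf-8").decode("shift_jis")
candidates = undo_candidates(garbled)
```

More than one pair may reverse without an error, so a person (or some further heuristic) still has to pick the candidate that reads as Japanese. If the reading program already replaced undecodable bytes, the original cannot be recovered this way.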

◆ Shift JIS
Shift JIS is the character encoding that used to be most common on Japanese websites. Most sites have switched to UTF-8 in recent years, but Shift JIS is said to still be used for email on older feature phones, the so-called 'garakei'.

Opening the above sample text created in Shift JIS as UTF-8 gives the following result.

Mr. McCann said, 'A feature often seen in mojibake involving Shift JIS is the appearance of rare variants of common characters. Especially noticeable is 髙, a variant of 高 (taka, "high") in which the strokes are connected. It is called "hashigodaka" and is often used in surnames. "Tatsusaki" (﨑), in which the 大 ("large") component of 崎 (saki) is replaced with 立 ("standing"), is a similar case, but it seems to be less common than hashigodaka.'

◆ EUC-JP

EUC-JP, which was developed for UNIX, shares its underlying standard with ISO-2022-JP, described later, but its encoding method is simpler. It was used in much the same way as Shift JIS, but it was never as popular.
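The relationship between the two encodings can be sketched in Python. This is a minimal illustration assuming a short hypothetical sample made only of JIS X 0208 characters; the variable names are made up for the example.

```python
text = "東京"  # hypothetical sample (Tokyo)

iso = text.encode("iso2022_jp")
euc = text.encode("euc_jp")

# ISO-2022-JP wraps the 7-bit JIS X 0208 bytes in escape sequences:
# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
to_kanji, to_ascii = b"\x1b$B", b"\x1b(B"
payload = iso[len(to_kanji):-len(to_ascii)]

# EUC-JP needs no escape sequences here: it sets the high bit of each
# JIS X 0208 byte instead.
euc_from_iso = bytes(b | 0x80 for b in payload)
```

For this sample, `euc_from_iso` equals `euc`, which is one way to see that both encodings carry the same underlying character set while EUC-JP avoids mode switching.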

When EUC-JP text is displayed as UTF-8, it looks like this.

Opening it as Shift JIS does not display it properly either.

Mr. McCann commented, 'The interesting thing about the mojibake produced by opening EUC-JP as Shift JIS is that half-width katakana appear frequently. This is because Shift JIS represents half-width katakana in a single byte.'
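The effect can be checked with a small Python sketch. It assumes a hypothetical kana-only sample, and `halfwidth_katakana_ratio` is a helper name invented for this example.

```python
def halfwidth_katakana_ratio(text: str) -> float:
    """Fraction of characters in U+FF61..U+FF9F (half-width katakana and punctuation)."""
    if not text:
        return 0.0
    hits = sum(1 for ch in text if "\uff61" <= ch <= "\uff9f")
    return hits / len(text)


sample = "なつのはなび"  # hypothetical kana-only sample ("summer fireworks")

# EUC-JP stores each of these kana as two bytes in the range 0xA1..0xFE.
raw = sample.encode("euc_jp")

# Shift JIS reads every byte in 0xA1..0xDF as one half-width character.
garbled = raw.decode("shift_jis", errors="replace")
ratio = halfwidth_katakana_ratio(garbled)
```

With this sample every byte falls in the single-byte range, so each original character turns into two half-width characters. Text containing kanji would typically give a lower ratio.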

◆ ISO-2022-JP

ISO-2022-JP is rarely used for anything other than email, but there are still occasional opportunities to encounter it. However, because it has many variants, text that one system can read may be unreadable on another.

Opening the sample text created in ISO-2022-JP as UTF-8, Shift JIS, or EUC-JP gives the following result.

According to Mr. McCann, the mojibake produced by opening ISO-2022-JP text as UTF-8, Shift JIS, or EUC-JP looks the same in all three cases, because none of these encodings interprets ISO-2022-JP's escape sequences.
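This can be checked with a short Python sketch. The Japanese sample ("The height of Tokyo Tower is 333m.") and the variable names are assumptions made for the example.

```python
# Assumed Japanese sample sentence.
sample = "東京タワーの高さは333mです。"

raw = sample.encode("iso2022_jp")

# ISO-2022-JP is a 7-bit encoding: every byte is below 0x80.
is_seven_bit = all(b < 0x80 for b in raw)

# None of these decoders understands the escape sequences, so each one
# just reads the bytes as ASCII characters.
readings = {
    codec: raw.decode(codec)
    for codec in ("utf-8", "shift_jis", "euc_jp")
}
distinct = set(readings.values())
```

Here `distinct` contains a single string, so all three readers show the same garbage. Python's `shift_jis` codec treats bytes below 0x80 as ASCII; an implementation that maps 0x5C to a yen sign could differ slightly.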


Nov 02, 2021 20:00:00 in Note, Posted by log1l_ks