'Tokenizer' shows at a glance how chat AI such as ChatGPT recognizes sentences as tokens
Various AIs, including ChatGPT developed by OpenAI, can now hold human-level conversations. When an AI reads and writes text, it handles it in units called 'tokens'. OpenAI publishes a tool called 'Tokenizer' on its site that shows at a glance how ordinary sentences are broken down into tokens.
The Tokenizer screen looks like this. You need to enter some text, so I first clicked 'Show example' to load a sample.
For the English sample, 252 characters came to 64 tokens. At the bottom, the text is color-coded token by token.
Click 'TOKEN IDS' to see the numeric ID of each token. The numbers mean nothing to a human, but GPT processes text as this list of numbers.
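To make 'a list of numbers' concrete, here is a toy sketch in Python. The vocabulary, its IDs, and the greedy longest-match rule are all made up for illustration. GPT's real tokenizer uses byte pair encoding with a far larger vocabulary, so the actual splits and IDs will differ.

```python
# Toy vocabulary: every entry and ID here is hypothetical, not taken from GPT.
TOY_VOCAB = {
    "token": 0,
    "izer": 1,
    " is": 2,
    " fun": 3,
    "t": 4, "o": 5, "k": 6, "e": 7, "n": 8, "i": 9, "z": 10, "r": 11,
    " ": 12, "s": 13, "f": 14, "u": 15,
}


def toy_encode(text: str) -> list[int]:
    """Greedy longest-match encoding against TOY_VOCAB (a stand-in, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining piece first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def toy_decode(ids: list[int]) -> str:
    """Map IDs back to their text pieces and join them."""
    id_to_piece = {i: piece for piece, i in TOY_VOCAB.items()}
    return "".join(id_to_piece[i] for i in ids)


ids = toy_encode("tokenizer is fun")
print(ids)              # [0, 1, 2, 3]
print(toy_decode(ids))  # tokenizer is fun
```

Note that the leading space belongs to the token that follows it (" is", " fun"), which mirrors how the colored groups in Tokenizer tend to include the space before a word.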
I also typed in Japanese. Here 49 characters came to 50 tokens, far more per character than in English. In particular, a single kanji uses two to three tokens, though considering how much information one kanji carries, that seems reasonable.
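One likely reason for the gap is that the tokenizer used by GPT-2 and GPT-3 is a byte-level BPE: it works on UTF-8 bytes rather than on characters. The short Python check below shows how many bytes each kind of character occupies. The sample characters and strings are my own picks for illustration, not the text typed into Tokenizer.

```python
# Sample characters chosen for illustration (not the article's input text).
samples = ["a", "7", "あ", "漢", "字"]


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


for ch in samples:
    print(f"{ch!r}: {utf8_len(ch)} UTF-8 byte(s)")

english = "token"
japanese = "トークン"
print(utf8_len(english), "bytes for", len(english), "characters")
print(utf8_len(japanese), "bytes for", len(japanese), "characters")
```

A kanji occupies three bytes, so when the learned merges do not cover it completely, it ends up spread over two or three tokens.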
I also experimented with a run of digits. Up to 3 digits are 1 token...
...while 4 digits are split into 2 tokens.
It seems that character sequences that are not words are grouped into tokens of 2 to 3 characters each.
GPT-3 and Codex also handle whitespace differently. In GPT-3, 10 spaces are 10 tokens.
On the other hand, Codex treats any number of spaces as one token.
Since use of ChatGPT through the API is billed by the number of tokens, knowing which kinds of input inflate the token count should help keep costs down. The following article gives an easy-to-follow explanation of how tokens are processed inside GPT-3.
Experts explain what kind of processing OpenAI-developed text generation AI 'GPT-3' is doing
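Returning to the point about pricing, the character and token counts measured above can be turned into a rough cost estimate. This is only a back-of-the-envelope sketch in Python: the price constant is a hypothetical placeholder, the function names are my own, and the ratios come from one short sample per language, so real text will vary.

```python
# Ratios measured in the experiments above.
TOKENS_PER_CHAR = {
    "english": 64 / 252,   # 64 tokens for 252 characters
    "japanese": 50 / 49,   # 50 tokens for 49 characters
}

# Hypothetical price, for illustration only; check the current price list.
PRICE_PER_1K_TOKENS_USD = 0.002


def estimate_tokens(n_chars: int, language: str) -> int:
    """Rough token estimate from a character count, rounded to the nearest integer."""
    return round(n_chars * TOKENS_PER_CHAR[language])


def estimate_cost_usd(n_tokens: int) -> float:
    """Cost of n_tokens at the assumed per-1,000-token price."""
    return n_tokens / 1000 * PRICE_PER_1K_TOKENS_USD


for language in TOKENS_PER_CHAR:
    tokens = estimate_tokens(1000, language)
    print(f"{language}: 1000 chars ~ {tokens} tokens, "
          f"~ ${estimate_cost_usd(tokens):.6f}")
```

With these ratios, the same number of characters costs roughly four times as many tokens in Japanese as in English.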