AI that breaks through CAPTCHAs with greater accuracy than humans
On 4chan, the largest image board in the English-speaking world by traffic, users must pass a CAPTCHA to prove they are human before posting a message. Full-stack software developer Blackjack has published a write-up of his project to break this CAPTCHA.
Breaking the 4Chan CAPTCHA | nullpt.rs
https://www.nullpt.rs/breaking-the-4chan-captcha
First, Blackjack wrote a script to scrape hundreds of CAPTCHA images from 4chan to get data to train his model.
In the process, he learned that the more frequently requests are made, the harder the CAPTCHA becomes, as a measure against abuse. Below is an example of a more difficult CAPTCHA image.
The training dataset must also contain the correct solution for each real CAPTCHA.
Blackjack tried solving the CAPTCHAs himself, asking a trusted friend, and using an outsourced service that solves CAPTCHA images, but the accuracy was too low, so he gave up on this approach.
Having given up on solving them manually, Blackjack turned to synthetic data. 4chan CAPTCHAs are made up of background noise and a set of characters. Below is an example of removing the characters from a CAPTCHA image so that only the background noise remains.
According to Blackjack, it was easy to remove just the 'larger contours' that make up the characters in a CAPTCHA image, leaving only the noise.
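The idea of stripping out the large shapes and keeping the noise can be sketched in a few lines. This is only an illustration, not Blackjack's script: it swaps contour detection for a plain connected-component flood fill on a tiny binary grid, and the function name `remove_large_components` and the `min_size` threshold are assumptions made up for the example.

```python
from collections import deque


def remove_large_components(grid, min_size):
    """Return a copy of a binary grid (1 = ink) in which every 4-connected
    component with at least `min_size` pixels has been erased."""
    height, width = len(grid), len(grid[0])
    result = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill to collect one connected component of ink pixels.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Large components are assumed to be characters; small ones are noise.
            if len(component) >= min_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


# Toy image: an 8-pixel ring (standing in for a character) plus four noise specks.
image = [
    [1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 0],
]
noise_only = remove_large_components(image, min_size=4)
for row in noise_only:
    print("".join("#" if pixel else "." for pixel in row))
```

On a real CAPTCHA the threshold would have to be tuned by eye, since noise strokes that touch a character end up in the same component as that character.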
Next, he used a script to extract the characters, tagged them manually with VoTT, an open-source tool from Microsoft, and collected 50 to 150 images for each character.
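With noise-only backgrounds and labeled character crops in hand, a synthetic CAPTCHA can be assembled by pasting characters onto a background, and the label is known for free. The sketch below shows one minimal way to do that. The article does not describe the actual compositing step, so the binary-grid image format, the `glyphs` dictionary layout, the per-character slots, and the function name `compose_captcha` are all assumptions for illustration.

```python
import random


def compose_captcha(background, glyphs, text, rng):
    """Paste one glyph per character of `text` onto a copy of `background`.

    background: 2D list of 0/1 pixels containing noise only.
    glyphs: dict mapping a character to a list of 2D 0/1 crops (hypothetical layout).
    Returns (image, text), so every generated image comes with its label.
    """
    image = [row[:] for row in background]
    height, width = len(image), len(image[0])
    slot = width // len(text)  # horizontal space reserved for each character
    for i, ch in enumerate(text):
        glyph = rng.choice(glyphs[ch])  # pick one of the collected crops
        gh, gw = len(glyph), len(glyph[0])
        # Jitter the position inside the slot so samples differ from each other.
        x0 = i * slot + rng.randint(0, slot - gw)
        y0 = rng.randint(0, height - gh)
        for dy in range(gh):
            for dx in range(gw):
                if glyph[dy][dx]:
                    image[y0 + dy][x0 + dx] = 1
    return image, text


# Tiny 3x3 stand-ins for the hand-tagged character crops.
glyphs = {
    "A": [[[0, 1, 0], [1, 1, 1], [1, 0, 1]]],
    "K": [[[1, 0, 1], [1, 1, 0], [1, 0, 1]]],
    "X": [[[1, 0, 1], [0, 1, 0], [1, 0, 1]]],
}
background = [[0] * 12 for _ in range(6)]
background[0][0] = 1   # two specks of "noise"
background[5][11] = 1

rng = random.Random(0)
sample_image, sample_label = compose_captcha(background, glyphs, "AKX", rng)
for row in sample_image:
    print("".join("#" if pixel else "." for pixel in row))
print(sample_label)
```

A realistic generator would also need to imitate whatever distortion the real CAPTCHA applies, which this sketch leaves out.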
In the process, Blackjack noticed that the CAPTCHA image only contained a limited number of characters.
Below is a sample of the CAPTCHA characters. The absence of the number '3' and the letter 'B' is thought to be a way of avoiding ambiguous characters.
The final training dataset included about 500 hand-created CAPTCHA images and about 50,000 synthetically generated images. Blackjack built a model with an LSTM-CNN architecture, combining a convolutional neural network (CNN) of three convolutional layers with two long short-term memory (LSTM) layers, and trained it on this dataset.
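The article gives only the layer counts, so the following is a rough illustration of how a convolutional stack can hand a sequence to LSTM layers, not a reconstruction of the actual model. It traces shapes only, and the 80x300 input size, the filter counts of 32, 64 and 128, and the 2x2 pooling after each convolution are all hypothetical values chosen for the example.

```python
def conv_pool_shape(height, width, filters, pool=2):
    """Shape after a 'same'-padded convolution followed by pool x pool max pooling."""
    return height // pool, width // pool, filters


def cnn_to_sequence(height, width, filter_counts):
    """Trace an image through the conv stack and return (timesteps, features).

    Each remaining image column becomes one LSTM timestep, and the column's
    height x channels values are flattened into that timestep's feature vector.
    """
    channels = 1  # grayscale input (assumption)
    for filters in filter_counts:
        height, width, channels = conv_pool_shape(height, width, filters)
    return width, height * channels


# Three convolutional layers, as in the article; all sizes are assumed.
timesteps, features = cnn_to_sequence(80, 300, [32, 64, 128])
print(timesteps, features)  # prints: 37 1280
```

Read left to right, the 37 columns give the LSTM layers an ordered sequence to work through, which suits text that is itself laid out left to right.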
The AI model developed in this way was able to solve 4chan CAPTCHAs with an accuracy of over 90% -- higher than the 80% achieved by the aforementioned commercial service that uses human workers to solve CAPTCHAs.
'This project was a lot of fun,' said Blackjack. 'There were some challenges to overcome, but I learned a lot about machine learning and computer vision along the way. Of course, there are areas to improve, but I'm pleased with the results so far, as I've achieved what I initially aimed for.'
in Software, Posted by log1l_ks