An attempt to break through CAPTCHA in just 15 minutes using machine learning


By Becky Stern

CAPTCHA is used to prevent unauthorized logins on the net, and some people find it a little troublesome. Now someone has used machine learning, a remarkable artificial intelligence technology, to train a system on a CAPTCHA in just 15 minutes so that it recognizes the characters automatically and breaks through.

How to break a CAPTCHA system in 15 minutes with Machine Learning
https://medium.com/@ageitgey/how-to-break-a-captcha-system-in-15-minutes-with-machine-learning-dbebb035a710

This attempt was made by Adam Geitgey, who is well versed in machine learning technology. Geitgey developed an application to solve the CAPTCHAs of "Really Simple CAPTCHA", the top result when searching for "captcha" in the WordPress plugin directory.


He chose this plugin because it has been installed more than one million times and its source code is public. It is an old plugin, and even its author has stated that it can no longer be called secure and that other measures should be taken, so it is unclear how useful breaking it is at this point. The attempt was made to gauge the potential of machine learning.

A CAPTCHA generated by Really Simple CAPTCHA shows four characters in different fonts, which the user reads and types in with the keyboard. It can be called a very common CAPTCHA mechanism.


The source code of Really Simple CAPTCHA is public. According to it, four characters are displayed in a random font, and the two letters "O" and "I" are not used, to avoid confusing users.
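The rule described above can be illustrated with a minimal Python sketch. This is not the plugin's code: the alphabet below is an assumption that only removes "O" and "I" from the uppercase letters and digits, and the plugin's real character list may differ. The name `random_captcha_text` is hypothetical.

```python
import random
import string

# Assumed alphabet: uppercase letters and digits with "O" and "I" removed.
# The plugin's actual character list may exclude more characters than this.
ALPHABET = [c for c in string.ascii_uppercase + string.digits if c not in "OI"]


def random_captcha_text(length=4, rng=random):
    """Return a random string of `length` characters drawn from ALPHABET."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(random_captcha_text())
```

Leaving out look-alike characters keeps the puzzle readable for people, but it also shrinks the number of classes a recognizer has to tell apart.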


Geitgey prepared the following environment for this challenge.

· Python 3
The programming language Python 3 was chosen because its libraries for machine learning and computer vision are excellent.

· OpenCV
OpenCV, a popular framework for computer vision and image processing, is used to analyze the CAPTCHA images. It also helps that a Python API is provided.

· Keras
A deep learning framework written in Python. It makes it possible to define, train, and run a neural network with minimal coding.

· TensorFlow
A deep learning framework provided by Google. The code is written with Keras, but the actual neural network computation is delegated to TensorFlow.

Geitgey's aim was to analyze an image generated by the CAPTCHA and automatically identify the characters it contains.


Geitgey used Really Simple CAPTCHA to create 10,000 CAPTCHA images for training. The correct character string is said to have been output together with each image so that the training could be verified. The time taken up to this point was 5 minutes.
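One simple way to keep the correct string together with each image is to use it as the file name. The sketch below assumes that convention (for example a file called `2A2X.png`). The article does not say how the answers were actually stored, so the naming scheme and both function names are hypothetical.

```python
from pathlib import Path


def label_from_filename(path):
    """Return the CAPTCHA text encoded in a file name such as '2A2X.png'."""
    return Path(path).stem


def load_labeled_paths(folder):
    """Pair every PNG file in `folder` with the label taken from its name."""
    return [(p, label_from_filename(p)) for p in sorted(Path(folder).glob("*.png"))]
```

With this layout no separate answer file is needed, and a prediction can be checked simply by comparing it with the file name.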


Next, the CAPTCHA images are fed to a convolutional neural network, which is trained to raise its rate of correct answers.


What was done here was to break each image down into its four characters and have the network recognize them separately. The intention is to simplify the target and so improve both training efficiency and accuracy.


In this challenge, splitting and analyzing the images seems to have been comparatively easy because the CAPTCHA is known to always output four characters. The "findContours()" function implemented in OpenCV is used to distinguish the character regions from the background and cut out one character at a time.


Before that, the image is converted to two tones (pure black and white) so that the character parts and the background are completely distinguished.


In this way, the four characters could be separated automatically.


But there is a problem here. In some images generated by Really Simple CAPTCHA, two characters overlap each other, and it turned out that they cannot be separated properly.


If segmentation is forced on such an image, it ends up being recognized as three characters.


So Geitgey created a rule that judges a detected region to contain two characters when it is much wider than it is tall. Such a region is then cut at its horizontal center into left and right halves to separate the overlapping characters.
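The width-based rule can be sketched in plain Python. The function name `split_wide_boxes` and the threshold of 1.25 are assumptions made for illustration, since this article does not give the exact ratio used.

```python
def split_wide_boxes(boxes, max_aspect=1.25):
    """Split any (x, y, w, h) box whose width/height ratio exceeds `max_aspect`.

    A box that is too wide is assumed to hold two overlapping characters and
    is cut at its horizontal center into a left half and a right half.
    """
    result = []
    for x, y, w, h in boxes:
        if w / h > max_aspect:
            half = w // 2
            result.append((x, y, half, h))
            # The right half takes the remaining width, so odd widths are not lost.
            result.append((x + half, y, w - half, h))
        else:
            result.append((x, y, w, h))
    return result


if __name__ == "__main__":
    print(split_wide_boxes([(0, 0, 10, 16), (12, 0, 30, 16)]))
```

The cut is deliberately crude: it does not look at the pixels at all, so a sliver of the neighboring character can end up in each half.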


When overlapping characters are cut out this way, part of the neighboring character protrudes into some of the crops, but the characters are still cut out accurately enough to be handled as single characters.


Next, the network is trained to recognize the character contained in each image cut out in this way. Because distinguishing characters in such simple images is a relatively low hurdle, Geitgey built a simple convolutional neural network architecture with two convolutional layers that extract features and two fully connected layers that identify the character from those features.


Because Keras is used for the actual coding, the network can be written as simple code with only a few lines.
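A network with two convolutional layers and two fully connected layers might be sketched in Keras as follows. This is not a reproduction of Geitgey's code: the 20x20 grayscale input size, the 32 output classes, the filter counts, the hidden layer size, and the pooling layers are all assumptions chosen for illustration.

```python
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input, MaxPooling2D
from tensorflow.keras.models import Sequential

NUM_CLASSES = 32  # assumption: number of distinct characters to recognize
IMAGE_SIZE = 20   # assumption: each character crop is resized to 20x20 pixels


def build_model():
    """Build a small CNN: two convolutional layers, then two dense layers."""
    model = Sequential([
        Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 1)),
        # First convolutional layer, followed by pooling (pooling is an assumption).
        Conv2D(20, (5, 5), padding="same", activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        # Second convolutional layer, again followed by pooling.
        Conv2D(50, (5, 5), padding="same", activation="relu"),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        # Two fully connected layers; the last outputs one probability per class.
        Dense(500, activation="relu"),
        Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


if __name__ == "__main__":
    build_model().summary()
```

Training would then be a call such as `model.fit(images, labels, epochs=10)`, where `images` and `labels` are hypothetical arrays of character crops and one-hot encoded answers.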


The actual training is likewise run with a short piece of code.


Training made 10 passes over all the images, and as a result it achieved a correct answer rate of almost 100%. The whole process reportedly took only about 15 minutes from the start of the work.


The recognition result is output to the command line.


Geitgey has published the code created for this challenge, and it can be downloaded as a ZIP file. CAPTCHAs of this kind are being used less and less, and the recognition rate is likely to drop sharply for CAPTCHAs with complicated backgrounds, so it is hard to imagine this code being exploited as it is. Still, it is interesting in the sense that you can experience the power of machine learning for yourself.

in Software, Posted by darkhorse_log