Machine learning shown to be able to identify individuals from anonymous source code


by Brian Klug

Source code has to follow the predefined rules of a programming language, so it may seem difficult to identify the author of anonymously published code. In fact, a programmer's individual habits show up clearly in code, and researchers have found that machine learning can identify individuals from code samples.

DEF CON® 26 Hacking Conference Speakers
https://www.defcon.org/html/defcon-26/dc-26-speakers.html#Greenstadt

Machine Learning Can Identify the Authors of Anonymous Code | WIRED
https://www.wired.com/story/machine-learning-identify-anonymous-code/

Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, an assistant professor of computer science at George Washington University, announced research showing that code written in a programming language is not completely anonymous and that its author can be identified using machine learning.

The two researchers analyzed code samples with machine learning algorithms and extracted every feature they could, such as word choice, code length and the way the code is organized. They then kept only the features that are useful for telling individuals apart, narrowing the list down to the ones worth examining when attributing code to an author. Unlike writers of ordinary prose, programmers are constrained by the rules of the language, but features that identify an individual can still be extracted from their code.
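The two steps described above, extracting many features and then keeping only the ones that separate authors, can be sketched in a few lines of Python. This is a toy illustration and not the researchers' actual method: the five features, the variance-ratio score used to rank them, the nearest-average classifier and the sample snippets are all assumptions made for the example.

```python
import re
from collections import defaultdict
from statistics import mean, pvariance


def extract_features(code):
    """Turn one code sample into a few simple style measurements (illustrative only)."""
    lines = code.splitlines() or [""]
    names = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    return {
        "avg_line_length": mean(len(line) for line in lines),
        "avg_name_length": mean(len(n) for n in names) if names else 0.0,
        "tab_indent_ratio": sum(line.startswith("\t") for line in lines) / len(lines),
        "blank_line_ratio": sum(not line.strip() for line in lines) / len(lines),
        "underscore_name_ratio": sum("_" in n for n in names) / len(names) if names else 0.0,
    }


def group_by_author(samples):
    """samples is a list of (author, feature dict) pairs."""
    grouped = defaultdict(list)
    for author, feats in samples:
        grouped[author].append(feats)
    return grouped


def select_features(samples, k):
    """Keep the k features that vary most between authors relative to within one author."""
    grouped = group_by_author(samples)
    scores = {}
    for name in samples[0][1]:
        author_means = [mean(f[name] for f in fs) for fs in grouped.values()]
        within = mean(pvariance([f[name] for f in fs]) for fs in grouped.values())
        scores[name] = pvariance(author_means) / (within + 1e-9)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def identify(code, samples, selected):
    """Attribute code to the author whose average selected features are closest."""
    feats = extract_features(code)
    grouped = group_by_author(samples)
    # Scale each feature by its overall spread so no single feature dominates.
    scale = {n: (pvariance([f[n] for _, f in samples]) ** 0.5) or 1.0 for n in selected}

    def distance(author):
        return sum(
            ((feats[n] - mean(f[n] for f in grouped[author])) / scale[n]) ** 2
            for n in selected
        )

    return min(grouped, key=distance)


# Hypothetical training snippets: "alice" indents with tabs and uses camelCase,
# "bob" indents with spaces, leaves blank lines and uses snake_case.
TRAINING = [
    ("alice", "int solveCase(int n) {\n\tint total = 0;\n\treturn total;\n}\n"),
    ("alice", "int readInput() {\n\tint x = 1;\n\treturn x;\n}\n"),
    ("bob", "int solve_case(int n)\n{\n    int running_total = 0;\n\n    return running_total;\n}\n"),
    ("bob", "int read_input()\n{\n    int input_value = 1;\n\n    return input_value;\n}\n"),
]
SAMPLES = [(author, extract_features(code)) for author, code in TRAINING]
SELECTED = select_features(SAMPLES, k=2)

UNKNOWN = "int countItems(int k) {\n\tint count = 0;\n\treturn count;\n}\n"
print(SELECTED, identify(UNKNOWN, SAMPLES, SELECTED))
```

Even in this toy version the selection step matters: features such as average line length change from one program to the next, while indentation and naming habits stay stable for each author, so those are the ones that end up being kept.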

The code sample does not need to be very long, either. According to a 2017 paper published by Greenstadt et al., even a short snippet of code published on GitHub can be enough to tell one developer from another. Caliskan added that authors can be identified even from code that has already been compiled into binary, the machine code represented by 0s and 1s.


by Christiaan Colen

Caliskan and her team had their algorithm identify code written by 100 developers, using code submitted to Google Code Jam, a programming contest run by Google. They report that it identified individuals with 96% accuracy. Even when the pool of candidate developers was expanded to 600, it still identified individuals with 83% accuracy.
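To put these figures in perspective, it helps to compare them with blind guessing. The short calculation below assumes every candidate developer is equally likely to be the author, which the article does not state, so the baseline is only a rough reference point. The function name chance_accuracy is made up for this example.

```python
def chance_accuracy(num_authors):
    """Accuracy of guessing at random, assuming all authors are equally likely."""
    return 1.0 / num_authors


# Reported accuracies from the article: 96% with 100 authors, 83% with 600.
for authors, reported in [(100, 0.96), (600, 0.83)]:
    baseline = chance_accuracy(authors)
    print(
        f"{authors} authors: chance {baseline:.2%}, reported {reported:.0%}, "
        f"about {reported / baseline:.0f} times better than chance"
    )
```

Under that assumption, random guessing would be right about 1% of the time with 100 candidates and under 0.2% of the time with 600, so the drop from 96% to 83% still leaves the method far ahead of chance.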

Greenstadt and Caliskan say that AI which identifies individuals from code will be useful in cases such as judging whether programming students have plagiarized someone else's code, or identifying the developers of malware. It could also expose the person behind a cyber crime that was carried out while posing as an unrelated third party.

On the other hand, the privacy of programmers who contribute anonymously to open source projects, or who publish code anonymously, may be threatened. Greenstadt said people need to understand that, generally speaking, it is very hard to hide a code author's identity 100%. Tools that make individuals indistinguishable from their code may be developed in the future, but for the time being there is a risk that individuals will be identified from anonymously published code.


by Penn State

Greenstadt and her colleagues also found that experienced programmers are easier to identify than beginners. Beginners tend to copy parts of their code from programming tutorial sites, so individual traits rarely show through, whereas the more experienced a programmer becomes, the more their code differs from everyone else's. The two researchers also found that identification is more accurate when the code sample was written to solve a complicated problem than when it was written to solve a simple one.

In a preliminary study by Greenstadt et al., code written by Canadian programmers could be distinguished from code written by Chinese programmers with more than 90% accuracy, which suggests that code reveals more information than expected. At the time of writing, identifying individuals from code is not close to 100% accurate in the way fingerprint identification is, but accuracy is expected to improve further.


by Katy Levinson

in Software, Posted by log1h_ik