How to spot projects that buy “fake stars” on GitHub to inflate their credibility
GitHub lets users star repositories, and repositories with many stars are often regarded as 'popular' and 'highly reliable.' Some repositories exploit this by purchasing fake stars to inflate their apparent popularity and credibility. Dagster, a startup that provides data orchestration services, has developed a method for checking whether a repository's stars are legitimate or fraudulent and published it on its blog.
Tracking the Fake GitHub Star Black Market with Dagster, dbt and BigQuery | Dagster Blog
'Stars' on GitHub work much like 'likes' on Facebook and similar services, and a high star count brings benefits such as an advantage when raising funds.
While observing several projects, the Dagster team noticed star counts suddenly jumping by several hundred right after a repository was created, or just before a new release or big announcement. When they checked the accounts behind these suspicious stars, they found that the accounts had been created on the same date, as shown in the image below.
In order to research how to detect fake stars, the Dagster team created a dummy repository and actually purchased stars from the two services below.
・Baddhi Shop
This service sells not only GitHub stars but also a variety of other online metrics. GitHub stars cost $64 (about 9,600 yen) per 1,000. The Dagster team purchased 500 stars, which were delivered over the course of a week. One month later, 75% of them had disappeared.
・GitHub24
This is a premium service that charges 0.85 euros (approximately 136 yen) per star. The Dagster team ordered 100 stars, which were delivered within 48 hours, and all of them were still in place a month later.
If you look at the history graph of the number of stars below, you can see that the number of stars jumps at the time of purchase.
Accounts that give fake stars on GitHub fall into two types: 'fake accounts that make no attempt to hide that they are spam accounts and can be identified at a glance' and 'sophisticated fake accounts whose activity appears genuine.' The Dagster team decided to build two spam detectors to handle both.
◆Identifying obvious fake accounts
Analyzing the accounts with the GitHub API, the Dagster team found that the obviously fake ones follow a consistent pattern:
・Created in 2022 or later
・1 or fewer followers
・Following 1 or fewer people
・No public Gist
・4 or fewer public repositories
・Email, hireable status, bio, blog, and X (formerly Twitter) username are all empty
・The date the star was given, the account creation date, and the account update date are all the same
These patterns make it possible to identify suspicious accounts using only data available from the GitHub API.
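As a rough illustration, the checklist above can be written as a single predicate. This is a minimal sketch, not Dagster's actual code. It assumes the profile is a dictionary shaped like the user object returned by the GitHub REST API and that the star timestamp has been fetched separately. The function name `looks_like_obvious_fake` is hypothetical, and the follower and following thresholds are read here as 'at most 1'.

```python
from datetime import date, datetime

# Profile fields the checklist expects to be empty on obvious fake accounts.
EMPTY_FIELDS = ("email", "hireable", "bio", "blog", "twitter_username")


def _day(timestamp: str) -> date:
    """Convert an ISO 8601 timestamp such as '2023-03-01T08:00:00Z' to a date."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00")).date()


def looks_like_obvious_fake(user: dict, starred_at: str) -> bool:
    """Return True only when the profile matches every item in the checklist."""
    created = _day(user["created_at"])
    return (
        created.year >= 2022
        and user["followers"] <= 1   # assumption: 'at most 1'
        and user["following"] <= 1   # assumption: 'at most 1'
        and user["public_gists"] == 0
        and user["public_repos"] <= 4
        and all(not user.get(field) for field in EMPTY_FIELDS)
        # Star date, creation date, and update date fall on the same day.
        and created == _day(user["updated_at"]) == _day(starred_at)
    )
```

Requiring every condition at once keeps false positives low, at the cost of missing any fake account that has filled in even one profile field.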
◆Identifying sophisticated fake accounts
The other group of fake GitHub accounts was much harder to identify. In addition to a human-like activity history, each account had a profile photo and a bio, making it difficult to tell them apart from real users even when you knew the account had been purchased.
Ultimately, the Dagster team used a technique called 'unsupervised clustering.' This technique groups users based on data such as the days on which their activity occurred. A real person's activity is spread across many different days, whereas fake users, driven by control scripts or similar mechanisms, tend to be active on the same dates as one another and therefore form a large cluster of fake accounts.
For example, the figure below plots the users who starred the dummy repository by 'number of activities that occurred on the same day as other users' and 'total number of repositories interacted with.' Red dots indicate known fake accounts, and yellow dots indicate suspected fake accounts.
By contrast, the figure below plots the users who have starred the Dagster repository. The yellow dot at the bottom left is a false positive.
The figure below shows the analysis of a repository suspected of having a mix of real and fake stars. A cluster of fake accounts has formed.
Additionally, the Dagster team found that groups of fake accounts tend to interact with the same specific repositories, which they used to improve the reliability of the analysis. In the end, fake accounts were identified using the following steps.
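The two plotted quantities can be approximated from a flat list of activity records. The sketch below is a simplified reading of those two axes, not the clustering itself. The `(user, repo, day)` record layout and the function name `activity_features` are assumptions made for illustration.

```python
from collections import defaultdict


def activity_features(events):
    """events: iterable of (user, repo, day) tuples, one per recorded activity.

    Returns {user: (shared_day_count, repo_count)}, where shared_day_count is
    the number of the user's activities on days when another user was also active.
    """
    events = list(events)

    # First pass: which users were active on each day.
    users_by_day = defaultdict(set)
    for user, _repo, day in events:
        users_by_day[day].add(user)

    # Second pass: per-user counts for the two plot axes.
    shared_days = defaultdict(int)
    repos = defaultdict(set)
    for user, repo, day in events:
        repos[user].add(repo)
        if len(users_by_day[day]) > 1:
            shared_days[user] += 1

    return {user: (shared_days[user], len(repos[user])) for user in repos}
```

Accounts driven by the same script would be expected to show a high shared-day count relative to the number of repositories they touch, which is what would make them group together on such a plot.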
1: Get user list
Get a list of all users who have starred the repository you are analyzing.
2: Identify “suspicious user groups”
Find the repositories that users on the list have starred in common. A group of users who all star the same large set of repositories is highly suspicious, but in some fields even genuine accounts may star the same set of repositories, which is why step 3 is needed.
3: Filter by activity level
Finally, analyze each user's activity. For fake accounts, which have little activity overall, most of that activity was in the repositories found in step 2, with no other legitimate activity. This is how the fake accounts were identified.
When tested against the known fake stars on the dummy repository, the method reportedly detected fake accounts with 98% precision and 85% recall, although it was very computationally expensive. The results of checking several repositories for fake stars with these two methods are as follows. Adding unsupervised clustering made it possible to detect fake accounts that simple heuristics alone could not.
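The three steps above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the team's published pipeline. The input format (a dictionary mapping each user to the list of repositories they interacted with), the function name `flag_suspected_fakes`, and both thresholds are hypothetical.

```python
from collections import Counter


def flag_suspected_fakes(target, activity_by_user,
                         min_shared_users=3, max_other_activity=1):
    """activity_by_user: {user: [repo, ...]}, one list entry per activity."""
    # Step 1: the users under analysis are those with activity on the target.
    stargazers = [u for u, repos in activity_by_user.items() if target in repos]

    # Step 2: repositories that many of these users have in common.
    shared = Counter()
    for user in stargazers:
        for repo in set(activity_by_user[user]) - {target}:
            shared[repo] += 1
    suspicious_repos = {r for r, n in shared.items() if n >= min_shared_users}

    # Step 3: flag users whose activity is almost entirely inside that set.
    flagged = set()
    for user in stargazers:
        repos = activity_by_user[user]
        inside = sum(1 for r in repos if r in suspicious_repos)
        other = sum(1 for r in repos
                    if r != target and r not in suspicious_repos)
        if inside > 0 and other <= max_other_activity:
            flagged.add(user)
    return flagged
```

The step 3 filter is what separates a genuine developer who happens to star a popular set of repositories from an account that does nothing else.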
Repository | Total stars | Fake stars (simple heuristics: obvious fakes, low recall) | Fake star % (simple heuristics) | Fake star % since 2022 (simple heuristics + unsupervised clustering: obvious and sophisticated fakes) |
---|---|---|---|---|
okcash | 759 | 1 | 0.13% | 97% |
Simple-GPU | 787 | 159 | 20% | 87% |
Notifio | 841 | 97 | 12% | 76% |
Mage.ai | 3,629 | 533 | 15% | 30% |
Apache Airflow | 29,435 | 17 | 0.06% | 1.6% |
Plumber | 3,002 | 6 | 0.2% | 1.5% |
Dagster | 6,538 | 8 | 0.12% | 1.5% |
Flyte | 3,154 | 1 | 0.03% | 1.1% |
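The two detection-quality figures quoted before the table are ratios measured against the accounts known to have been purchased. The sketch below shows how such ratios are computed. It reads the 98% figure as precision, which is an assumption about the source's wording, and the function name and toy account names are hypothetical rather than data from the study.

```python
def precision_recall(flagged: set, known_fakes: set) -> tuple:
    """Precision: share of flagged accounts that are really fake.
    Recall: share of the known fake accounts that were flagged."""
    true_positives = len(flagged & known_fakes)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known_fakes) if known_fakes else 0.0
    return precision, recall


# Toy example with made-up account names.
flagged = {"acct_a", "acct_b", "acct_c", "acct_d"}
known_fakes = {"acct_a", "acct_b", "acct_c", "acct_e", "acct_f"}
print(precision_recall(flagged, known_fakes))  # (0.75, 0.6)
```

High precision with lower recall means that almost every flagged account is fake, while some fake accounts still go undetected.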
The code used for this analysis is publicly available on GitHub, so check it out if you are interested.
in Software, Web Service, Posted by log1d_ts