Amazon reportedly uses a gray-area method of scraping GitHub to train AI models



To improve the quality of AI models, large amounts of high-quality data are needed. Amazon has reportedly been collecting coding data for AI development from GitHub, a software development platform that has been owned by Microsoft since 2018.

Amazon Has A Secret Way To Scrape Microsoft's GitHub And Feed Its AI Model - Dataconomy
https://dataconomy.com/2024/06/14/amazon-has-a-secret-way-to-scrape-microsofts-github-and-feed-its-ai-model/



According to an internal Amazon memo obtained by Business Insider, a New York-based business and technology news site, the group working on artificial general intelligence (AGI) at Amazon says it needs 'quantitative and qualitative metadata from GitHub' to train its AI. However, GitHub limits data scraping and processes only 5,000 requests per hour per account. GitHub had more than 150 million public repositories as of the end of 2023, so collecting the data within those limits would take years.
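To get a feel for the scale involved, the figures above can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: it assumes a single request per repository (real metadata collection would likely need more), and the function names and the four-week target are hypothetical values chosen for the example, not figures from the memo.

```python
import math

# Figures quoted in the article.
PUBLIC_REPOS = 150_000_000   # public repositories as of the end of 2023
REQUESTS_PER_HOUR = 5_000    # per-account request limit


def hours_to_collect(total_repos, requests_per_hour=REQUESTS_PER_HOUR,
                     requests_per_repo=1, accounts=1):
    """Hours needed to touch every repository at the given rate limit.

    requests_per_repo=1 is an optimistic assumption for illustration.
    """
    total_requests = total_repos * requests_per_repo
    return total_requests / (requests_per_hour * accounts)


def accounts_for_deadline(total_repos, target_hours,
                          requests_per_hour=REQUESTS_PER_HOUR,
                          requests_per_repo=1):
    """Smallest number of rate-limited accounts that meets a deadline."""
    total_requests = total_repos * requests_per_repo
    return math.ceil(total_requests / (requests_per_hour * target_hours))


single_account_hours = hours_to_collect(PUBLIC_REPOS)
single_account_years = single_account_hours / (24 * 365)

# Hypothetical target of four weeks, chosen only to illustrate "weeks".
four_weeks_in_hours = 4 * 7 * 24
accounts = accounts_for_deadline(PUBLIC_REPOS, four_weeks_in_hours)

print(f"One account: {single_account_hours:,.0f} hours "
      f"(about {single_account_years:.1f} years)")
print(f"Accounts needed to finish in four weeks: {accounts}")
```

Under these assumptions a single account needs 30,000 hours, or roughly 3.4 years of uninterrupted requests, which is consistent with the "years" estimate in the report. The estimate grows in proportion to the number of requests needed per repository.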



Amazon is reportedly shortening the data collection process from years to weeks by having employees create multiple GitHub accounts. According to Dataconomy, a data technology news site, Amazon's efforts do not constitute theft in a legal sense, but they may raise ethical concerns about data privacy and the appropriate use of platform resources. The internal memo obtained by Business Insider contains detailed instructions on how employees should create and manage accounts so that data collection across multiple accounts complies with legal and security guidelines.

As for why Amazon is scraping GitHub, Dataconomy said, 'Amazon doesn't just need a ton of code. GitHub data contains valuable details about how projects change over time, who contributes, and how developers collaborate. This metadata is essential for AI models to learn patterns, improve accuracy, and develop better ways to solve problems.'



While Amazon claims that its multi-account scraping approach was approved by its legal and security teams, Amazon's actions could be called into question if GitHub or the affected users themselves perceive it as a violation.

This issue has also been a hot topic on the social news site Hacker News, where some believe that Microsoft may tighten the terms of service to stop its rival Amazon from collecting data from its subsidiary GitHub, but is unlikely to go as far as legal action. Others have pointed out that Amazon's actions violate GitHub's terms of service, which state that 'API keys cannot be shared to circumvent restrictions and there is only one free account per person or organization.' However, this provision concerns free accounts, and the details of how Amazon is collecting the data had not been made clear at the time of writing, so it remains unclear whether a violation has occurred. Meanwhile, many people have expressed anger from the perspective of individual users, saying, 'GitHub is publishing code for other users, not for large companies.'

in Software, Web Service, Posted by log1e_dh