Oct 23, 2024 23:00:00

Google has a database of about 25 million scanned books, including out-of-print books, that are lying dormant and unread.

Google once worked on a project to scan almost every book ever published and make out-of-copyright books accessible with one click. However, the project was blocked by legal barriers, and the database of about 25 million scanned books now sits dormant, read by no one, as the monthly magazine The Atlantic explains.

Torching the Modern-Day Library of Alexandria - The Atlantic

https://www.theatlantic.com/technology/archive/2017/04/the-tragedy-of-google-books/523320/

Google co-founder Larry Page has been interested in digitizing and making books accessible since the company was founded. The student project that led to Google Search was conceived as part of a technology initiative to create a 'single, integrated, universal digital library.'

In 2002, when Google was getting off the ground, Page approached the University of Michigan, which was then a world leader in digitally scanning books. He proposed a contract in which Google would borrow and scan books from the library and provide the digital data to universities and libraries. By 2004, Google had started scanning books and had signed contracts not only with the University of Michigan, but also with Harvard University, Stanford University, Oxford University, the New York Public Library, and dozens of other library systems.

The books taken from the libraries were loaded onto trucks and transported to Google's scanning center, where they were placed on carts like those found in libraries and bookstores and handed over to human operators. The book scanning device built by Google was equipped with four cameras that photographed the pages of the book and a LIDAR range finder that measured the curvature of the paper. The operator turned the pages one by one by hand and pressed a foot pedal to capture each page. The device was said to be able to scan books at a rate of 1,000 pages per hour.

Google used software to solve many of the problems that had made scanning so slow, for example by developing algorithms to correct for curved pages. At its peak, the project employed around 50 software engineers, who developed optical character recognition software to convert raw image data into text, routines to process images, systems to convert page numbers and footnotes into data, and algorithms to rank books by relevance.

Google spent several years scanning about 25 million books at an estimated cost of $400 million. Google did not intend to make the full text of the books public, but rather to create a full-text search service for books (Google Books), and therefore believed that the creation of the service was protected by fair use.

However, when authors and publishers learned that a huge number of books had been borrowed from libraries and scanned by Google without their knowledge, they moved to stop the initiative. In 2005, the Authors Guild, an industry group, filed a class action lawsuit, which was joined by publisher groups, and a legal battle between Google and the publishing industry over digital copyright began.

It's not uncommon for the tech industry to clash with established industries over content distribution, and sometimes these cases end up as win-wins. One example is a lawsuit over records and radio play, which led to a system where rights holders were paid licensing fees every time music was sold or played, creating a new revenue stream for musicians.

In fact, the authors and publishers who sued Google found a mutually beneficial compromise after a few years: the development of a new market for 'selling digital data of out-of-print books that are no longer available in stores.' For a long time, out-of-print books had been dead property that brought no new profits to publishers and authors, but if Google's large-scale digitization made it possible to sell that data, it would become a new source of revenue for them. Google would also benefit from selling digital books on its platform.

'We realized we had an opportunity to do something special for readers and scholars in this country,' said Richard Sarnoff, president of the Association of American Publishers at the time. 'We realized we could shine a light on the industry's out-of-print books and accelerate both discovery and consumption.'

Once the goal became 'making it possible to sell out-of-print books digitally using Google's digital scan data,' winning the lawsuit against Google Books mattered less to the publishing industry. Rather, a system like Google Books that displays part of a book could help readers discover out-of-print books that cannot be found in bookstores, which could lead to increased digital sales.

In addition, one of the obstacles to selling out-of-print books digitally is the high cost of checking which books are available for digital sale and who currently holds the copyright for old books. However, because the lawsuit between the Authors Guild and Google was a class action, its outcome could theoretically be legally binding on almost all authors and publishers of books in American libraries. In other words, by reaching a good compromise between the publishing industry and Google through the class action, it would be possible to sidestep these problems and realize the digital sale of out-of-print books.

The interests of the publishing industry and Google coincided, and in 2008 a settlement was submitted in which Google would pay the publishing industry a total of $125 million (approximately 13.5 billion yen at the time) in damages and legal costs in exchange for a license for Google Books, with rights holders receiving a 63% share of future revenue from digital sales through Google Books. It was also stipulated that out-of-print books would be packaged as 'institutional subscription databases' and sold to universities.

Had it succeeded, the settlement, which University of California, Berkeley law professor Pamela Samuelson called 'probably the most adventurous class-action settlement ever attempted,' could have created new revenue streams for both Google and the publishing industry.

However, this drew opposition from university libraries and Google's competitors. University library officials and researchers argued that Google's monopoly on digital book licenses would lead to price gouging, as has happened in the academic journal market. Meanwhile, Google's competitors, Microsoft and Amazon, were concerned that Google would gain monopoly power in search engines and digital book sales.

The Department of Justice, which investigated the proposed settlement in response to these objections, took a negative view of it, stating that 'Google's competitors would have to go through the unlikely process of mass-scanning books, being sued in a class action, and attempting to settle in order to obtain similar rights.' As a result, the settlement was rejected in 2011, and the class action lawsuit ended in Google's victory in 2016, with Google Books' activities found to be fair use. And so the future in which large numbers of out-of-print books would be sold digitally never came to fruition.

Dan Clancy, who worked on the settlement as a Google engineering leader, said the settlement was likely rejected because of opposition from librarians and academic authors. 'Without the active involvement of libraries, Bob Darnton (then Harvard University librarian), Samuelson and others, the Department of Justice would never have gotten involved,' Clancy told The Atlantic.

Ironically, many of the people opposed to the settlement believed that digital sales of out-of-print books would be possible without the 'class action settlement' process. Even Samuelson, who opposed the settlement, wrote, 'It would be a tragedy not to pursue this vision now that it is so clearly possible.'

However, more than 10 years after the settlement was rejected, as of the time of writing, there has been no progress in lobbying Congress to allow digital sales of out-of-print books. The Atlantic noted, 'It certainly seems unlikely that anyone would use political capital to try to change the book licensing system,' and 'It's no coincidence that the class action lawsuit against Google is perhaps the only forum for this kind of reform. Google was the only company with the initiative and the funds to make it happen.'

Although Google won the class action lawsuit, it has largely stopped scanning old books, and the database of 25 million scanned books is lying dormant somewhere at Google. The database is said to be 50 to 60 petabytes in size, but it is visible only to a handful of engineers who are responsible for keeping it locked up.


Oct 23, 2024 23:00:00 in Note, Posted by log1h_ik