How is 'Internet Archive' that records and stores all information on the Internet managed?



" Internet archive " is a non-profit organization based in San Francisco, California, which operates an archive browsing service of web pages called " wayback machines ". Technology and business webmasters " The Hustle " report on how such an Internet archive is managed and how to save a massive website that continues to increase day by day.

Inside Wayback Machine, the internet's time capsule
https://thehustle.co/inside-wayback-machine-internet-archive

Buildings of the Internet archive have a staircase like an old palace, shaped like a temple or church rising over Corinthian pillars. At first glance it seems that it is not related to digital at all, and the history of a large number of web pages collected by the Internet archive so far is preserved.

The blue light blinks in the room where the computer servers inside the building line up, and the server packs the history over the Internet over 22 years. Tens of billions of web pages, personal tweets, latest articles, videos and net memes are included in the data the internet archive collects. The world of the Internet is so vast, there are as many as 1.8 billion web pages at the time of writing the article, and that amount has doubled every two to five years.

Meanwhile, the average web page's posting period is about 100 days, and many pages are something that will be forgotten in as little as five minutes after posting. It is a mission of the Internet archive to save many web pages that will disappear from the world if someone does not archive.


by Beatrice Murch

It is Mr. Brewster Kale , an American businessman who operates an Internet archive. Mr. Kale, who studied computers at the Massachusetts Institute of Technology , invented a WAIS system for text information retrieval system, sold $ 15 million (about 1.7 billion yen) in 1995 and built his wealth.

After that, since 1996 Mr. Kale personally starts "Back up the Internet" work. This project, called the Internet Archive, is likened to the Alexandria Library, which is said to have accumulated the world's most books.

The Internet archive aims to be "accessible to every knowledge from anywhere", and for six years Kale personally collected 10 billion web pages. Among them were a wide variety of web pages from the Geocities community of free website offering service to movie review site.



As of 2018, the wayback machine has a history of 338 billion web pages. The collection of Internet archives is not only web pages, but also includes software, etc. in addition to media data such as books, records, images in movies. The total data volume is about 40 PB ( petabyte : approx. 40 million GB) and it seems that 63% of the data that can be searched on the wayback machine is 63%.

The amount of 40 petabytes is too large so it may not come with pins but this is a little less than all the letters that human beings living on earth have written from the invention of the letters to the present day Thing. In addition, it is estimated that about 28 TB ( terabytes ) of texts collected in the Library of the American Congress , the largest library in the United States , is about, but this is not even 0.1% of the total data amount of the Internet archive.



Internet archive keeps 7000 bots every week crawling on the Internet and collects copies of a large number of web pages. These copies called "snapshots" saves the state of web pages at a certain frequency and keeps the history of web pages at a specific point in the archive.

For example, for CNN 's webpage , an American news broadcasting station, it is possible to search over 20,000 snapshots for 18 years on a wayback machine. Every week, 500 million new web pages are stored in the Internet archive, 20 million Wikipedia pages, 20 million tweets on Twitter, and 100 million news stories are newly saved every week.

All of this enormous work is done non-profitly, and it is said that everything is covered by donation, such as technology development, software development, and operating costs to machines operating servers and bots. In addition, Internet-archive not only collects and stores data but also tries to solve ethical problems concerning Internet history.



It seems that the Internet is simply calculated and growing at a rate of 70 TB per second, and it is impossible to cover everything even with an Internet archive that holds a large scale server. Also, Internet archives will not access private data such as e-mail and cloud data.

Mark Graham , the administrator of the wayback machine, said: "We back up a large number of web pages, but we can not back them up, the priority of which web page to back up is" It is judged on the basis of what is important when thinking about what the Internet is "and" whether it is useful "."

The Internet-archive team has decided the extent to which Bot preserves the website on the web to a great extent. A particular bot is crawling the 700 "Most visited site" crowning, and that site includes YouTube · Wikipedia · Reddit · Twitter etc. Mr. Graham said, "From an archive perspective, interesting things are governments, NGOs and news sites around the world," the Internet archive cooperates with some 600 experts and partners We are backing up web pages based on our own principles.



The web page search service of the wayback machine is an important tool for the era when "fake news" that makes people false with fake information spread easily. Even if the fake news were spread and corrected to the correct information and the "past is being modified", even if it is saved in the Internet archive, the lie of spreading will be revealed.

The internet archive is closely watching the movements of President Trump who won the US presidential election in November 2016. After winning Mr. Trump, he said, "Copy of the data collected by the Internet archive, We will set it up in not Canada. "

Internet archive plans to build a new server in Canada to prepare for the strengthening of Internet censorship by the next President of Trump - GIGAZINE



The building of the Internet archive which is the architecture like a church, the interior is also made like a church. In the seats set up to the stage, the dolls of employees who have worked in the Internet archive so far are aligned with the slurry. These dolls are supposed to be built as an Internet archive for over 3 years and will continue to be left forever.

It seems that 6 servers are protected and installed in the direction of people's eyes, and each server costs about 60,000 dollars (about 6.6 million yen). It is composed of 10 computers and 36 36 TB drives are loaded, and from the blog post 20 years ago, the history of the web such as TED's talk is packed in. Even if the web page itself disappears, the data stored in the Internet archive is said to remain semi-permanently in the future.

in Web Service, Posted by log1h_ik