What method was developed to scientifically investigate "How many videos are stored on YouTube and how many people are watching them?"



It is no exaggeration to say that most Internet users have used the video distribution platform YouTube at least once, and it is the largest social media service. Ethan Zuckerman, associate professor of public policy, communication, and information studies at the University of Massachusetts Amherst, explains the research method he developed to scientifically measure the scale of YouTube.

How Big is YouTube? - Ethan Zuckerman

https://ethanzuckerman.com/2023/12/22/how-big-is-youtube/

Much current social media research focuses on detecting fake news, misinformation, and hate speech. That kind of research is not especially difficult: you search social media for specific keywords and count the posts and impressions that turn up. However, Associate Professor Zuckerman sees a problem in the tendency to report only the absolute number, the numerator, without establishing the total, the denominator, and he calls this the 'denominator problem.'

For example, in August 2020, the research organization Avaaz released a report on misinformation related to the new coronavirus, which stated that such misinformation was viewed 3.8 billion times in one year. 3.8 billion views is a very large number, but the report does not say how many views all posts received from all users, so it is unclear how large 3.8 billion is as a share of the whole. In fact, considering that Facebook's 3 billion users each generate tens to hundreds of views a day, 3.8 billion total views can be interpreted as a very small number.
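The denominator argument can be made concrete with a little arithmetic. The following is a rough sketch only: the 3.8 billion views and 3 billion users come from the text above, while the choice of 10 and 100 views per user per day simply stands in for 'tens to hundreds' and is an assumption.

```python
# Figures quoted in the article.
MISINFO_VIEWS_PER_YEAR = 3_800_000_000
FACEBOOK_USERS = 3_000_000_000


def share_of_all_views(views_per_user_per_day: int) -> float:
    """Fraction of a year's total views that the misinformation views represent."""
    total_views_per_year = FACEBOOK_USERS * views_per_user_per_day * 365
    return MISINFO_VIEWS_PER_YEAR / total_views_per_year


# Assumed daily view counts standing in for "tens to hundreds".
for daily_views in (10, 100):
    share = share_of_all_views(daily_views)
    print(f"{daily_views} views per user per day -> {share:.4%} of all views")
```

Under these assumptions the 3.8 billion views come to a few hundredths of a percent of all views at most, which is the sense in which the number can be read as small.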



Reddit and Twitter (now X) were social media services for which the full denominator data used to be accessible. However, since both have cut off public access and started charging for their APIs, it has become almost impossible for researchers to study Reddit and Twitter on a denominator basis, Associate Professor Zuckerman says.

Therefore, Associate Professor Zuckerman focused on YouTube, which is probably more widely used by Internet users than Reddit or Twitter. According to a survey by the Pew Research Center, 93% of teenagers use YouTube, compared with 63% for TikTok and 60% for Snapchat, so YouTube is the most commonly used of the three.

However, although YouTube has several APIs, none of them offers a way to draw a random sample of videos. Previous YouTube research has either examined a hand-selected list of videos or followed the recommended videos starting from a single specified video. Such methods are of course adequate for some investigations, but they cannot produce a sample that represents all YouTube videos. Associate Professor Zuckerman points out that it is impossible to estimate the overall size of YouTube without a way to sample at random.



Therefore, Associate Professor Zuckerman consulted Jason Baumgartner, the operator of Pushshift.io, a site that stores and provides all past posts on Reddit. Using an undocumented YouTube API called the Innertube API, Baumgartner built a system that guesses random URLs and checks whether a video exists.

A YouTube video URL has the form 'https://www.youtube.com/watch?v=○○○○', where the ○○○○ part is an 11-character string made up of uppercase and lowercase letters, digits, '_', and '-'. A rough estimate of the number of possible strings is 18.4 quintillion, so no matter how many videos are stored on YouTube, these strings are unlikely to run out. Assuming that YouTube stores 1 billion videos, the probability that a randomly chosen URL points to a real video is 1 in 18.4 billion.
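The arithmetic behind the '1 in 18.4 billion' figure can be checked in a few lines. This is a sketch under one assumption that the article does not state: that a video ID encodes a 64-bit value, giving 2^64 (about 1.84 × 10^19) possible IDs. That is the size consistent with the quoted odds, whereas 11 unrestricted characters from a 64-character alphabet would give 64^11 = 2^66, four times as many.

```python
# Assumption: an ID encodes a 64-bit value, so the space is 2**64.
ID_SPACE = 2 ** 64

# For comparison: 11 fully free characters from a 64-character alphabet.
UNRESTRICTED_SPACE = 64 ** 11

# Hypothetical video count used in the article's example.
ASSUMED_VIDEOS = 1_000_000_000


def one_in(videos: int, space: int = ID_SPACE) -> float:
    """Return N such that a random ID is valid with probability 1 in N."""
    return space / videos


print(f"ID space (2**64):        {ID_SPACE:,}")
print(f"Unrestricted (64**11):   {UNRESTRICTED_SPACE:,}")
print(f"Odds with 1e9 videos:    1 in {one_in(ASSUMED_VIDEOS):,.0f}")
```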

Associate Professor Zuckerman and Baumgartner named this method of "randomly generating a string and checking whether a video exists" 'drunk dialing,' because it resembles a drunk person dialing whatever number comes to mind to see whether anyone answers. Baumgartner made drunk dialing 32,000 times more efficient, and further improved the video extraction rate by restricting the strings that are checked, which reduces the number of attempts. In addition, by running a large number of scripts, they established a way to extract more than 10,000 random YouTube videos in a few months.

The site 'TubeStats', which summarizes the results of extracting 24,964 videos as a sample with this method and estimating the overall size of YouTube from them, is published below.

TubeStats
https://tubestats.org/
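How random guessing turns into a size estimate can be illustrated with a toy simulation. The sketch below makes no network requests and does not reproduce Baumgartner's system or the Innertube API: it builds a small imaginary platform with 4-character IDs (an assumption chosen so that hits are frequent enough to simulate), 'dials' random IDs, and scales the hit rate by the size of the ID space. All names in it are hypothetical.

```python
import random
import string

# 64-character alphabet described in the article: letters, digits, '-' and '_'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"
ID_LENGTH = 4  # toy length (real IDs have 11 characters)
SPACE = len(ALPHABET) ** ID_LENGTH  # number of possible toy IDs


def random_id(rng: random.Random) -> str:
    """Generate one uniformly random ID."""
    return "".join(rng.choice(ALPHABET) for _ in range(ID_LENGTH))


def make_toy_platform(rng: random.Random, n_videos: int) -> set:
    """Create an imaginary platform holding n_videos distinct random IDs."""
    videos = set()
    while len(videos) < n_videos:
        videos.add(random_id(rng))
    return videos


def drunk_dial(exists, rng: random.Random, attempts: int):
    """Guess random IDs; return (hits, estimated total number of videos)."""
    hits = sum(1 for _ in range(attempts) if exists(random_id(rng)))
    # The hit rate estimates the fraction of the ID space that is occupied.
    return hits, hits / attempts * SPACE


rng = random.Random(2023)  # fixed seed so the run is repeatable
platform = make_toy_platform(rng, 50_000)
hits, estimate = drunk_dial(platform.__contains__, rng, 200_000)
print(f"hits: {hits}, estimated videos: {estimate:,.0f} (true: 50,000)")
```

The estimate is only as good as the number of hits, which is why the real ID space, where a hit is vastly rarer, needed the efficiency improvements and months of running scripts described above.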



TubeStats estimates that there were 13,325,821,970 videos on YouTube as of 2023. TubeStats also presents a bar graph of the estimated number of videos stored on YouTube for each year from 2006 to 2023.



As for the number of views per video, the most common range is '17 to 32 views', at 10.880% of videos, which shows that most videos never pass the 1,000-view mark.
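A range such as '17 to 32' suggests that view counts are grouped into buckets whose upper bounds are powers of two. That is an inference from the label, not something the article or TubeStats states, so the sketch below is only one plausible way to assign a view count to such a bucket; the function name is hypothetical.

```python
def view_bucket(views: int) -> str:
    """Label the power-of-two bucket (low to high) that contains `views`.

    Assumed scheme: 0, 1, 2, 3 to 4, 5 to 8, 9 to 16, 17 to 32, ...
    """
    if views < 0:
        raise ValueError("views must be non-negative")
    if views == 0:
        return "0"
    # Smallest power of two that is >= views.
    high = 1 << (views - 1).bit_length()
    low = high // 2 + 1
    return str(high) if low == high else f"{low} to {high}"


for count in (0, 1, 17, 32, 33, 1000):
    print(count, "->", view_bucket(count))
```

Doubling the bucket width at each step keeps a heavily skewed distribution readable, since a handful of videos have enormous view counts while most have very few.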



By language, English accounts for 31.844% of all videos. Japanese ranks 7th with 3.178% of the total. TubeStats updates its sample-based estimates once a month.



Associate Professor Zuckerman states: 'Perhaps most importantly, we intend to maintain TubeStats for as long as possible. YouTube may object to the existence of this data or to the methods we used to create it. However, I believe that all media platforms should be publishing high-level data like this on a regular basis. Platforms like YouTube are some of the most important parts of the world, and we need much more information about what's out there, who's creating this content, and who it's reaching.'

in Web Service, Science, Posted by log1i_yk