How to perform 'web scraping' to automatically acquire website information with JavaScript



Web scraping, which automatically acquires website information, saves time by letting a program collect the information, and also helps prevent human error because the same operation is performed accurately every time. Engineer Pavel Prokudin explains how to do such web scraping with JavaScript using sample code.

Web scraping with JS | Analog Forest
https://qoob.cc/web-scraping/

A typical web scraping toolchain uses Python 3 as the programming language, the Requests library to fetch HTML, and Beautiful Soup to parse it. However, Prokudin points out that this toolchain has not changed in years and carries a high cost for JavaScript engineers, and he says he wants to provide documentation for anyone who wants to do web scraping with JavaScript instead.

◆ Check the data in advance
Prokudin advises that before scraping you should first check whether web scraping is actually necessary. Modern web applications often generate pages dynamically from structured data rather than writing the data directly into the HTML, so it is often possible to retrieve the data without scraping at all, for example from an embedded JSON block or a public API.
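As a minimal sketch of this check (not from Prokudin's code): many pages embed machine-readable JSON-LD, which can be read directly from the fetched HTML without parsing the page structure. The HTML string below is a placeholder standing in for a fetched page.

```javascript
// Check whether a page already ships its data as embedded JSON-LD,
// in which case no real scraping is needed.
const html = `<script type="application/ld+json">
{"@type": "Person", "name": "Lionel Messi"}
</script>`;

// Pull out the JSON block and parse it directly.
const match = html.match(/<script type="application\/ld\+json">([\s\S]*?)<\/script>/);
const data = match ? JSON.parse(match[1]) : null;
console.log(data.name); // 'Lionel Messi'
```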



◆ Get data
If you do need web scraping, first fetch the HTML itself from the target website. Node.js's built-in HTTPS standard module can be used as-is, but Prokudin recommends node-fetch, which supports Promise-based asynchronous processing. The code below gets the HTML of the Lionel Messi page from 'Transfermarkt', a website that provides soccer information such as match results.
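A minimal fetching sketch looks like the following. Prokudin uses the node-fetch package; Node.js 18+ ships a compatible global fetch, which is used here so the sketch needs no install. The Transfermarkt URL in the usage comment is illustrative, not taken from the tutorial.

```javascript
// Fetch the raw HTML of a page as a single string.
async function getHtml(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text(); // the page body as an HTML string
}

// Usage (requires network access; URL is illustrative):
// getHtml('https://www.transfermarkt.com/lionel-messi/...')
//   .then((html) => console.log(html.slice(0, 100)));
```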



◆ Analyze the data
Tools for parsing the acquired HTML include cheerio and jsdom; Prokudin uses jsdom in his sample code. First, create a Document Object Model (DOM) so that the acquired HTML can be handled programmatically.



Create a NodeList object by specifying the CSS selector of the part of the created DOM that contains the information you want. Then use the Array.from() method to convert the NodeList into an Array object. Now you can handle the information you want as an easy-to-use array.



◆ Process data
The acquired array still contains unnecessary information, so it is processed until only the needed data remains. Prokudin's sample code first measures the length of each row in the array to examine the shape of the data; the rows obtained by this sample code come in four lengths: 1, 5, 14, and 15.
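Measuring the row lengths can be sketched like this (the rows below are placeholders; in Prokudin's real data the lengths observed are 1, 5, 14, and 15):

```javascript
// Examine the shape of the scraped data by measuring each row's length.
const rows = [
  ['20/21'],                             // lone season-header row
  ['La Liga', '35', '30', '9', '2967'],  // per-competition totals row
];
const lengths = rows.map((row) => row.length);
console.log(lengths); // [1, 5]
```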



Processing is then added according to row length. Rows with a length of 15 are split into the first five values and the sixth and subsequent values.
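One reading of this split, sketched with placeholder values rather than Prokudin's actual data:

```javascript
// Split a 15-element row into its first five values and the rest.
const row = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o'];
const head = row.slice(0, 5); // first five values
const tail = row.slice(5);    // sixth value onward
console.log(head.length, tail.length); // 5 10
```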



In addition, Prokudin's code excludes rows with a length of 1.
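Excluding single-value rows is a plain filter; the rows here are again placeholders:

```javascript
// Drop rows that contain only a single value.
const rows = [['20/21'], ['La Liga', '35', '30', '9', '2967']];
const filtered = rows.filter((row) => row.length !== 1);
console.log(filtered.length); // 1
```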



Finally, if you map the values in the array, the data processing is complete.
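The final mapping step might look like the following; the field names and values are illustrative, not taken from Prokudin's actual code:

```javascript
// Map each cleaned row of strings into a structured object,
// converting numeric fields along the way.
const rows = [['La Liga', '35', '30']];
const stats = rows.map(([competition, matches, goals]) => ({
  competition,
  matches: Number(matches),
  goals: Number(goals),
}));
console.log(stats[0].goals); // 30
```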



◆ Save data
All you have to do is save the processed data. Now you can web scrape without using Python.



All the sample code Prokudin used in the tutorial is available on CodeSandbox.


in Software, Posted by darkhorse_log