Web Scraping With Any Headless Browser: A Puppeteer Tutorial


Online data mining for research has evolved significantly, especially with the emergence of innovative and adaptive web scraping techniques that automate what was once manual data collection.

You can perform simple data extraction tasks with a Hypertext Transfer Protocol (HTTP) client or a regular web browser. However, an HTTP client cannot render JavaScript, so it falls short on dynamic websites. Fortunately, headless browsers were purposely designed and developed to scrape dynamic web pages.

Throughout this article, you will learn how to extract data online using Puppeteer with any compatible headless browser. In short, this article serves as a complete Puppeteer tutorial on headless data mining. If you want an even more in-depth Puppeteer tutorial, the Oxylabs website has an article just for you.

Technical terms explained

The following subsections explain the technical terms you will need for the rest of this tutorial.

i. Web scraping

Web scraping is a structured way of collecting web data, usually performed in an automated manner. Amateurs and professionals alike also know it as web harvesting or web data extraction.

As one of the most frequently used data collection techniques today, web scraping appears in market research and news monitoring, among other applications.

ii. Headless web browser

Mainstream browsers such as Chrome ship with a graphical user interface (GUI), also called the “head” in this context, that makes the software faster and more user-friendly to operate. However, other browser variants are designed and developed with web scraping in mind. Take the headless web browser, for example.

A headless browser has no GUI; you control it through a command line interface (CLI) or over a network connection instead. Headless mode runs on servers without a dedicated display and still executes page scripts, such as those written in JavaScript.

In some browsers, headless mode also lets you run large-scale web application tests or navigate from one web page to another without human intervention.
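For a quick illustration of the CLI side, assuming Chrome or Chromium is installed (the binary may be named google-chrome, chromium, or chrome depending on your system), you can drive headless Chrome directly from the terminal:

# Print the rendered DOM of a page without opening a window
google-chrome --headless --dump-dom https://example.com

# Or take a screenshot from the command line
google-chrome --headless --screenshot=example.png https://example.com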

iii. Puppeteer

Puppeteer is a software library with a high-level application programming interface (API) that primarily controls headless browsers through the “devtools” (web development tools) protocol. It runs on the JavaScript-based runtime environment Node.js, or simply Node.

Apart from automated web application testing, professionals and hobbyists also use Puppeteer for web scraping due to its speed and efficiency.
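As a minimal sketch of that high-level API (example.com is just a placeholder page):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch(); // headless by default
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title()); // prints the page title
  await browser.close();
})();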

iv. Node.js

Node.js is an open-source JavaScript runtime system that runs JS code outside of a web browser and offers back-end support.

It allows developers to use the JavaScript programming language to build command-line tools and run server-side scripts that generate dynamic web page content.
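As a tiny illustration (hello.js is a hypothetical file name), the following script runs from the terminal with no browser involved:

// hello.js (run with: node hello.js Ada)
const name = process.argv[2] || 'world';
console.log(`Hello, ${name}!`);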

Advantages of scraping with a headless browser via Puppeteer

Scraping dynamic websites using a headless browser via Puppeteer gives you a reasonable number of benefits. These benefits include the following:

i. Faster data extraction

Use a headless browser with Puppeteer and you will fetch web pages noticeably faster than with a full (non-headless) browser, since no time is spent rendering a GUI. Puppeteer’s default headless, non-graphics mode is the main factor behind this performance.
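If you want to compare the two modes yourself, Puppeteer’s launch() accepts a headless option; a minimal sketch:

const puppeteer = require('puppeteer');

(async () => {
  // Default launch: headless, no window, faster
  const headless = await puppeteer.launch();
  await headless.close();

  // For comparison: a full browser with a visible window
  const headful = await puppeteer.launch({ headless: false });
  await headful.close();
})();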

ii. Accelerated test automation

The combination of a headless browser and the Puppeteer library also allows for enhanced test automation. Not only can you automate one or more UI tests, but you can also automate form submissions and keyboard input.
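For example, here is a hedged sketch of automated form input against a hypothetical page (the URL and the #search and #submit selectors are placeholders, not part of the original tutorial):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/form'); // placeholder URL
  await page.type('#search', 'headless browsers'); // simulates keyboard input
  await page.click('#submit'); // submits the form
  await browser.close();
})();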

iii. Better performance diagnostics

A headless browser powered by Puppeteer lets you capture a timeline trace of your website; the resulting log helps diagnose performance issues.
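Puppeteer exposes this through its tracing API; a minimal sketch (trace.json is an arbitrary output name, viewable in the Chrome DevTools Performance panel):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.tracing.start({ path: 'trace.json', screenshots: true });
  await page.goto('https://example.com');
  await page.tracing.stop(); // writes the timeline trace to trace.json
  await browser.close();
})();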

Headless Chrome and Puppeteer Setup Guide

The next part of this Puppeteer tutorial focuses on installing and configuring Headless Chrome, then Puppeteer. Since Node.js is a prerequisite for this tutorial, we strongly recommend that you visit the official Node.js website for its complete, separate installation guide.

Step 1 – Configuring Headless Chrome and Puppeteer

  • Install Puppeteer via the npm command below; the installation also downloads a recent, stable headless browser build, so wait a few minutes for the setup to complete.

npm i puppeteer --save
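To confirm the installation succeeded (an optional check, not part of the original steps), list the installed package and its version:

npm ls puppeteer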

Step 2 – Setting up your project

  • Navigate to your project directory, create a new file there (this tutorial saves it as screenshot.js), and open it with your favorite code editor.
  • In your script, import Puppeteer and read the target URL (Uniform Resource Locator), or web address, from the command line arguments.

const puppeteer = require('puppeteer');
const url = process.argv[2];

if (!url) {
  throw "Please provide a URL as the first argument";
}

  • Define an asynchronous function that takes the screenshot, as shown below, and call it.

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
}

run();

  • Make sure the final code is the same as shown below.

const puppeteer = require('puppeteer');
const url = process.argv[2];

if (!url) {
  throw "Please provide a URL as the first argument";
}

async function run() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await page.screenshot({ path: 'screenshot.png' });
  await browser.close();
}

run();

  • Finally, navigate to your project’s root directory and run the following command to take a test screenshot.

node screenshot.js https://github.com
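Once the screenshot works, extending the same script into an actual scraper is a small step. Here is a hedged sketch that prints the page title and every link URL; the page.evaluate usage is illustrative and not part of the original tutorial:

const puppeteer = require('puppeteer');
const url = process.argv[2];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  // Run code inside the page and return the results to Node
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('a'), a => a.href)
  );
  console.log(await page.title());
  console.log(links);
  await browser.close();
})();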

Conclusion

It takes patience and time to get used to headless scraping with Puppeteer, especially in the absence of a GUI and with frequent tool interaction through the command line. Once you get used to it, however, your web data collection routine will improve considerably.
