How to scrape that web page with Node.js and puppeteer

If you’re like me, sometimes you want to scrape a web page so bad. You probably want some data in a readable format, or just need a way to re-crunch that data for other purposes.

I solemnly swear that I am up to no good.

I’ve found my optimal setup after many tries with Guzzle, BeautifulSoup, etc… Here it is:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

What does it mean? It means you can run a Chrome instance and put it at your service. Cool, isn’t it?

Let’s see how to do it.

Setup

Yes, the usual setup. Fire up your terminal, create a folder for your project and run npm init in the folder.

Once you’re set up, you’ll have a package.json file in the folder. We’re good to go. Now run npm i -S puppeteer to install Puppeteer.

A little warning: Puppeteer will download a full version of Chromium into your node_modules folder.

Don’t worry: since version 1.7.0, Google has published the puppeteer-core package, a version of Puppeteer that doesn’t download Chromium by default.

So, if you’re willing to try it, just run npm i -S puppeteer-core.

puppeteer-core is intended to be a lightweight version of puppeteer for launching an existing browser installation or for connecting to a remote one.
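With puppeteer-core you just point it at a browser you already have. A quick sketch (the executablePath below is only a placeholder for wherever Chrome lives on your system):

```javascript
const puppeteer = require('puppeteer-core');

(async () => {
    // puppeteer-core doesn't ship Chromium, so we point it
    // at a Chrome we already have (placeholder path, adjust it)
    const browser = await puppeteer.launch({
        executablePath: '/usr/bin/google-chrome'
    });
    const page = await browser.newPage();
    console.log(await browser.version());
    await browser.close();
})();
```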

Ok, we’re good to go now.

Your first scraper

Touch an index.js file in the project folder and paste this code in it.
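Something along these lines will do (a minimal sketch: the selectors, the viewport size, the jQuery CDN URL and the user-agent string are just examples, so adapt them to the page you’re actually scraping):

```javascript
const puppeteer = require('puppeteer');
const url = 'https://coding.napolux.com';

puppeteer.launch({ headless: true }).then(async (browser) => {
    const page = await browser.newPage();
    await page.setViewport({ width: 375, height: 812 });
    await page.setUserAgent('Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1');
    await page.goto(url, { waitUntil: 'networkidle2' });
    // the homepage is ready once body.blog is in the DOM
    await page.waitForSelector('body.blog');
    await page.addScriptTag({ url: 'https://code.jquery.com/jquery-3.4.1.min.js' });

    const posts = await page.evaluate(() => {
        const data = [];
        // NOTE: these selectors are just an example, adjust them
        // to the markup of the page you're scraping
        $('article').each(function () {
            const title = $(this).find('h2 a').text();
            const link = $(this).find('h2 a').attr('href');
            data.push({
                title: title,
                url: link
            });
        });

        // hand the scraped data back to Node
        return data;
    });

    // we're done with Chromium
    await browser.close();

    // print out what we scraped
    posts.forEach((post) => {
        console.log(`Post: ${post.title} URL: ${post.url}`);
    });
});
```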

That’s all you need to set up a web scraper. You can also find it in my repo https://github.com/napolux/puppy.

Let’s dig into the code a bit.

For the sake of our example we’ll just grab all the post titles and URLs from my blog homepage. To add a nice touch we’ll change our user-agent in order to look like a good old iPhone while browsing the webpage we’re scraping.

And because we’re lazy, we’ll inject jQuery into the page in order to use its wonderful CSS selectors.

So… Let’s go line by line:

  • Lines 1-2: we require Puppeteer and configure the website we’re going to scrape
  • Line 4: we launch Puppeteer. Please remember we’re in the kingdom of Lord Asynchronous, so everything is a Promise, is async, or has to wait for something else 😉 As you can see, the config is self-explanatory: we’re telling the script to run Chromium headless (no UI).
  • Lines 5-10: the browser is up; we create a new page, set the viewport size to a mobile screen, set a fake user-agent and open the webpage we want to scrape. In order to be sure that the page is loaded, we wait for the selector body.blog to be there.
  • Line 11: as I said, we inject jQuery into the page
  • Lines 13-28: here is where the magic happens: we evaluate our page and run some jQuery code in order to extract the data we need. Nothing fancy, if you ask me.
  • Lines 31-37: we’re done: we close the browser and print out our data.

Run node index.js from the project folder and you should end up with something like…

Post: Blah 1? URL: https://test.co/blah1/
Post: Blah 2? URL: https://test.co/blah2/
Post: Blah 3? URL: https://test.co/blah3/
// if we consider
// test.co our
// test domain

Recap

So, welcome to the world of web scraping. It was easier than expected, right? Just remember that web scraping is a controversial matter: please scrape only websites you’re authorized to scrape.

No. As the owner of https://coding.napolux.com, I don’t authorize you.

I’ll leave scraping AJAX-based webpages to you 😉

If you want to share a comment or report an issue with this post, please send me an email at napolux@gmail.com.
