If you’re like me, sometimes you badly want to scrape a web page. You probably want some data in a readable format, or just need a way to re-crunch that data for other purposes.
I solemnly swear that I am up to no good.
I’ve found my optimal setup after many tries with Guzzle, BeautifulSoup, etc… Here it is:
- Puppeteer: check https://github.com/GoogleChrome/puppeteer
- A little Raspberry Pi where my scripts can run all day long.
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
What does it mean? It means you can run a Chrome instance and put it at your service. Cool, isn’t it?
Let’s see how to do it.
Yes, the usual setup. Fire up your terminal, create a folder for your project and run `npm init` in the folder.
Once you’re set up you’ll have a `package.json` file. We’re good to go. Now run `npm i -S puppeteer` to install Puppeteer.
A little warning. Puppeteer will download a full version of Chromium in your
Don’t worry: since version 1.7.0 Google publishes the `puppeteer-core` package, a version of Puppeteer that doesn’t download Chromium by default.
So, if you’re willing to try it, just run `npm i -S puppeteer-core`.
`puppeteer-core` is intended to be a lightweight version of Puppeteer for launching an existing browser installation or for connecting to a remote one.
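Since `puppeteer-core` ships without a browser, you have to tell it where yours lives. Here is a minimal sketch of how that could look, using the `executablePath` launch option. The path and the `CHROMIUM_PATH` environment variable are my assumptions, not something from the Puppeteer docs: adjust them to your own machine.

```javascript
// Sketch: launch an already-installed browser with puppeteer-core.
// ASSUMPTION: CHROMIUM_PATH is a made-up environment variable, and the
// fallback path is only an example; point it at your own Chrome/Chromium.
const CHROMIUM_PATH = process.env.CHROMIUM_PATH || '/usr/bin/chromium-browser';

// Build the options object that gets passed to puppeteer.launch().
function launchOptions(executablePath = CHROMIUM_PATH) {
  return { headless: true, executablePath };
}

// Start the existing browser and resolve with the Browser instance.
async function openBrowser() {
  // Required lazily, so launchOptions() still works where
  // puppeteer-core is not installed.
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(launchOptions());
}
```

To connect to a remote browser instead of launching a local one, `puppeteer.connect({ browserWSEndpoint })` takes the WebSocket URL of the running instance.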
Ok, we’re good to go now.
Your first scraper
Create an `index.js` file in the project folder and paste this code in it.
That’s all you need to set up a web scraper. You can also find it in my repo https://github.com/napolux/puppy.
Let’s dig a bit in the code
For the sake of our example we’ll just grab all the post titles and URLs from my blog homepage. To add a nice touch we’ll change our user-agent in order to look like a good old iPhone while browsing the webpage we’re scraping.
And because we’re lazy, we’ll inject jQuery into the page in order to use its wonderful CSS selectors.
So… Let’s go line by line:
- Lines 1-2: we require Puppeteer and configure the website we’re going to scrape.
- Line 4: we launch Puppeteer. Please remember we’re in the kingdom of Lord Asynchronous, so everything is a Promise, is async, or has to wait for something else 😉 As you can see, the conf is self-explanatory: we’re telling the script to run Chromium headless (no UI).
- Lines 5-10: the browser is up, so we create a new page, set the viewport size to a mobile screen, set a fake user-agent and open the webpage we want to scrape. To be sure that the page is loaded, we wait for the selector `body.blog` to be there.
- Line 11: as I said, we inject jQuery into the page.
- Lines 13-28: here is where the magic happens. We evaluate our page and run some jQuery code to extract the data we need. Nothing fancy, if you ask me.
- Lines 31-37: we’re done. We close the browser and print out our data.
Run `node index.js` from the project folder and you should end up with something like…
Post: Blah 1? URL: https://test.co/blah1/
Post: Blah 2? URL: https://test.co/blah2/
Post: Blah 3? URL: https://test.co/blah3/
// if we consider test.co our test domain
So, welcome to the world of web scraping. It was easier than expected, right? Just remember that web scraping is a controversial matter: please scrape only websites you’re authorized to scrape.
No. As the owner of https://coding.napolux.com I don’t authorize you
I’ll leave it to you to figure out how to scrape AJAX-based webpages 😉