Experiments with web scraping

A few weeks ago I came across web scraping. I found the subject intriguing, with numerous real-life applications.

I researched a bit, and there are plenty of libraries and frameworks one can use to quickly build an app that scrapes data off websites.

The ones I tried and experimented with a little are:

Nokogiri

A Ruby-based framework, quick and easy to learn. And more importantly, it just works.

In a short time, I was able to put together a quick script that scraped data off Craigslist and spat it out to a CSV file. It could even take screenshots, which can be great for testing teams.

As I dived deeper into scraping data off dynamic websites, I started to hit roadblocks. Nokogiri works great if you're scraping information off static websites, but nowadays most websites are responsive, and dynamic content is generated after the page loads. This is where I started looking for other options, and I turned to JavaScript frameworks.

Nightmare.js & Horseman.js

I came across these two great frameworks that let you quickly scrape information without much coding. For example, a standard bit of pseudocode would look like:
* Hey horseman.js
* Open a page
* Open a link
* Get information off an element
* Take a screenshot
* Close the page

Pretty straightforward.
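In real code, that maps almost one-to-one onto Horseman's chained API. Here is a minimal sketch of the same flow with node-horseman; the URL and selectors are placeholders, not a real site:

    var Horseman = require('node-horseman');
    var horseman = new Horseman();

    horseman
      .open('https://example.com')  // open a page
      .click('a.detail-link')       // open a link (placeholder selector)
      .waitForNextPage()            // wait for the navigation to finish
      .text('h1')                   // get information off an element
      .log()                        // print the result of the previous step
      .screenshot('page.png')       // take a screenshot
      .close();                     // close the page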

While I was working with these two, I came across another library called PhantomJS. Apparently, both Nightmare.js and Horseman.js are built on top of PhantomJS, so I decided: why not just work directly with PhantomJS? I would need to write a few more lines of code, but in return I would get greater flexibility in performing operations. And I was right.

PhantomJS

So I started working with PhantomJS. And it's a killer. A bit on how it works: PhantomJS is basically a headless web browser, one that runs in a terminal. And it can do everything that a regular web browser can do. Neat.
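To make that concrete, here is about the smallest useful PhantomJS script, run from a terminal with phantomjs script.js; the URL is a placeholder:

    var page = require('webpage').create();

    page.open('https://example.com', function (status) {
      if (status === 'success') {
        // evaluate() runs inside the page, like typing in a browser console
        var title = page.evaluate(function () {
          return document.title;
        });
        console.log('Page title: ' + title);
        page.render('page.png'); // save a screenshot of the rendered page
      }
      phantom.exit();
    });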

What I wanted PhantomJS to do was scrape information about products and product prices off an e-commerce site, something like bigbasket.com. And since this site uses dynamically generated content, I had chosen the right framework. My pseudocode for the project:
* Get the links of the products whose names and prices I wanted to scrape
* Save these links somewhere persistent (I saved them in a CSV file)
* Go to each link and save the product name, price, and unit of measurement (again to a CSV)

With that, I started to work on it. The final code is on GitHub.
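The heart of it is a loop like the sketch below. To be clear, this is a sketch rather than the final code: the file names (links.csv, products.csv), the three-second render delay, and the CSS selectors (.product-name, .price, .uom) are all assumptions standing in for the real site's details.

    // Sketch: visit each saved link and append name, price and unit to a CSV.
    // File names, delay and selectors are placeholder assumptions.
    var fs = require('fs');
    var page = require('webpage').create();

    var links = fs.read('links.csv').trim().split('\n');
    var i = 0;

    function scrapeNext() {
      if (i >= links.length) {
        phantom.exit();
        return;
      }
      page.open(links[i], function () {
        // Give the page time to render its dynamically generated content
        setTimeout(function () {
          var row = page.evaluate(function () {
            var pick = function (sel) {
              var el = document.querySelector(sel);
              return el ? el.textContent.trim() : '';
            };
            return [pick('.product-name'), pick('.price'), pick('.uom')].join(',');
          });
          fs.write('products.csv', row + '\n', 'a'); // append one row per product
          i += 1;
          scrapeNext();
        }, 3000);
      });
    }

    scrapeNext();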
