Experiments with web scraping

A few weeks ago I came across web scraping. I found the subject intriguing, with numerous real-life applications.

I researched a bit, and there are a lot of libraries and frameworks one can use to quickly make an app that scrapes data off websites.

The ones I tried and experimented with a little are:

Nokogiri

A Ruby-based library, quick and easy to learn. And more importantly, it just works.

In a short time, I was able to write a quick script that could scrape data off Craigslist and spit it out to a CSV file, and even take screenshots. This can be great for testing teams.
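Here's roughly what such a script looks like. This is a minimal sketch, not my actual Craigslist script: the URL, the CSS selectors and the output columns are placeholder assumptions.

```ruby
require 'nokogiri'
require 'open-uri'
require 'csv'

# Placeholder URL and selectors -- swap in the real listing page
# and its actual markup.
url = 'https://example.org/listings'
doc = Nokogiri::HTML(URI.open(url))

CSV.open('listings.csv', 'w') do |csv|
  csv << ['title', 'price']
  doc.css('.listing').each do |row|
    title = row.at_css('.title')&.text&.strip
    price = row.at_css('.price')&.text&.strip
    csv << [title, price]
  end
end
```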

As I dived deeper into scraping data off dynamic websites, I started to hit roadblocks. Nokogiri works great if you're scraping information off static websites, but nowadays most websites are responsive, and dynamic content is generated after the page loads. This is where I started looking for other options, and I turned to JavaScript frameworks.

Nightmare.js & Horseman.js

I came across these two great frameworks, which let you quickly scrape information without much coding. For example, standard pseudocode would look like:
<<Hey horseman.js
<<Open a page
<<Open a link
<<Get information of an element
<<Take a screenshot
<<Close the page

Pretty straightforward.
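In actual Horseman code, that pseudocode maps over almost line for line. A minimal sketch, with a placeholder URL and selectors:

```javascript
var Horseman = require('node-horseman');
var horseman = new Horseman();

horseman
  .open('https://example.org')   // open a page
  .click('a.first-result')       // open a link
  .waitForNextPage()             // let the new page load
  .text('h1.title')              // get information of an element
  .log()                         // print it to the console
  .screenshot('page.png')        // take a screenshot
  .close();                      // close the page
```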

While I was working with these two, I came across another library called PhantomJS. Apparently, both Nightmare.js and Horseman.js are based on PhantomJS, so I decided: why not just work directly with PhantomJS? I'd need to write a few more lines of code, but it would give me greater flexibility in performing operations. And I was right.

PhantomJS

So I started working with PhantomJS. And it's a killer. A bit on how it works: PhantomJS is basically a headless web browser, one that can be run from a terminal, and it can do everything that a regular web browser can do. Neat.
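The "hello world" of PhantomJS gives a feel for this. A minimal sketch that loads a page, reads its title and takes a screenshot (the URL is a placeholder):

```javascript
// run with: phantomjs hello.js
var page = require('webpage').create();

page.open('https://example.org', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
  }

  // Run code inside the page, just like a browser console would.
  var title = page.evaluate(function () {
    return document.title;
  });
  console.log('Title: ' + title);

  page.render('page.png'); // screenshot, exactly what a browser would show
  phantom.exit();
});
```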

What I wanted PhantomJS to do was scrape information about products and product prices off an e-commerce site, something like bigbasket.com. And since this site uses dynamically generated content, I had chosen the correct framework. I made pseudocode for the project:
* Get the links of the products whose names and prices I wanted to scrape
* Save these links in a database (I saved them in a CSV file)
* Go to each link and save the product name, price and unit of measurement (again, I saved these in a CSV)

and started to work on it. The final code is on GitHub; a stripped-down sketch of the per-product step is below.
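For that third step, the script has to wait for the dynamically generated content before reading it. A minimal sketch, assuming hypothetical selectors and a fixed wait; bigbasket.com's real markup will differ:

```javascript
// run with: phantomjs product.js
var page = require('webpage').create();
var url = 'https://example.org/product/123'; // placeholder product link

page.open(url, function (status) {
  if (status !== 'success') {
    console.log('Failed to load ' + url);
    phantom.exit(1);
  }

  // Crude but effective: give the dynamic content time to render.
  setTimeout(function () {
    var product = page.evaluate(function () {
      // Hypothetical selectors -- inspect the real page to find them.
      return {
        name:  document.querySelector('.product-name').textContent.trim(),
        price: document.querySelector('.product-price').textContent.trim(),
        unit:  document.querySelector('.product-unit').textContent.trim()
      };
    });
    // One CSV row: product, price, unit of measurement.
    console.log([product.name, product.price, product.unit].join(','));
    phantom.exit();
  }, 3000);
});
```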
