A few weeks ago I came across web scraping, and found the subject intriguing, with numerous real-life applications. I researched a bit, and there are a lot of libraries and frameworks one can use to quickly build an app that scrapes data off websites.
The ones I tried and experimented with a little are:
Nokogiri: Ruby-based framework
Quick and easy to learn. And more importantly, it just works.
In a short time, I was able to write a quick script that scraped data off Craigslist and spat it out to a CSV file, and even took screenshots. This can be great for testing teams.
Nightmare.js & Horseman.js
I came across these two great frameworks, which let you quickly scrape information without much coding. For example, a standard pseudo code flow would look like:
Open a page
Open a link
Get information of an element
Take a screenshot
Close the page
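With Nightmare.js, the steps above map almost one-to-one onto its chained API. A minimal sketch of that flow (the URL and selectors here are placeholders, not from a real site; the screenshot is taken just before reading the element):

```javascript
// Sketch of the pseudo code above using Nightmare.js's chained API.
// Requires `npm install nightmare`; URL and selectors are placeholders.
const Nightmare = require('nightmare');
const nightmare = Nightmare({ show: false });

nightmare
  .goto('https://example.com')       // open a page
  .click('a.first-result')           // open a link
  .wait('h1')                        // wait for the element to appear
  .screenshot('page.png')            // take a screenshot
  .evaluate(() => {
    // get information of an element
    return document.querySelector('h1').textContent;
  })
  .end()                             // close the page
  .then((text) => console.log(text))
  .catch((err) => console.error(err));
```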
While I was working with these two, I came across another library called phantom.js. Apparently, both nightmare.js and horseman.js are based on phantom.js. So I decided: why not just work directly with phantom.js? I’d need to write a few more lines of code, but it would give me greater flexibility in performing operations. And I was right.
So I started working with phantom.js. And it’s a killer. A bit on how it works: phantom.js is basically a headless web browser, one that can be run from a terminal. And it can do everything that a web browser can do. Neat.
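To give a feel for it, here is a minimal PhantomJS script, run with the `phantomjs` binary rather than node (the URL is a placeholder):

```javascript
// Minimal PhantomJS script -- run with `phantomjs script.js`, not node.
// The URL here is a placeholder.
var page = require('webpage').create();

page.open('https://example.com', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
    return;
  }
  // Run code inside the page, exactly as a browser console would.
  var title = page.evaluate(function () {
    return document.title;
  });
  console.log('Title: ' + title);
  page.render('screenshot.png'); // it really can do what a browser does
  phantom.exit();
});
```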
What I wanted phantom.js to do was scrape information about products and product prices off an e-commerce site, something like bigbasket.com. And since this site uses dynamically generated content, I had chosen the right framework. I made a pseudo code plan for the project:
* Get the links of the products whose names and prices I wanted to scrape
* Save these links in a database (I saved them in a CSV file)
* Go to each link and save the product name, price, and unit of measurement (again, to a CSV file)
and started working on it. The final code is on GitHub.
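For the save-to-CSV steps in the plan above, here is a dependency-free sketch of the kind of helper I mean. The field names and quoting rules are my own illustration, not the code from the repo:

```javascript
// Hypothetical post-processing step: turn scraped product records into CSV.
// Field names (name, price, unit) mirror the pseudo code above.

// Quote a field if it contains commas, quotes, or newlines;
// double any embedded quotes (standard CSV escaping).
function toCsvField(value) {
  var s = String(value);
  if (/[",\n]/.test(s)) {
    return '"' + s.replace(/"/g, '""') + '"';
  }
  return s;
}

// Build a CSV string with a header row from an array of product records.
function toCsv(products) {
  var header = ['name', 'price', 'unit'].join(',');
  var rows = products.map(function (p) {
    return [p.name, p.price, p.unit].map(toCsvField).join(',');
  });
  return [header].concat(rows).join('\n');
}
```

The result can then be written out with `fs.writeFileSync`, one file per scraping run.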