Experiments with web scraping

A few weeks ago I came across web scraping and found the subject intriguing, with numerous real-life applications.

I researched a bit and found plenty of applications and frameworks for quickly building an app that scrapes data off websites.

The ones I tried and experimented with a little are:


How to set up a LAMP server on AWS EC2

I’m hoping there are more people like me who set up a LAMP server on an AWS Linux EC2 instance. I’ve done it multiple times and realized it was time to write a script to automate the setup.

I've shared the script on GitHub. The steps to follow are detailed below.

Quick website backup

To quickly back up your public folder, use the command below:

tar -zcvf /home/protected/public-$(date +%Y%m%d).tar.gz /home/public/

This compresses the files in /home/public and places the archive, stamped with today's date, in the /home/protected folder.

If you’re using Amazon EC2, use this instead:

tar -zcvf /var/www/-$(date +%Y%m%d).tar.gz /var/www/html

This compresses the files in /var/www/html and places the archive in the /var/www/ folder.
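
Both commands follow the same pattern, so they can be wrapped in a small shell function. This is a minimal sketch; backup_dir is a hypothetical helper name, and it assumes the destination folder already exists and is writable:

```shell
#!/bin/sh
# backup_dir is a hypothetical helper, not part of the original commands.
# Usage: backup_dir SOURCE_DIR DEST_DIR
backup_dir() {
    src=$1
    dest=$2
    # Date stamp such as 20240131, produced by command substitution.
    stamp=$(date +%Y%m%d)
    # Name the archive after the last path component of the source.
    archive="$dest/$(basename "$src")-$stamp.tar.gz"
    # -C switches to the parent folder so the archive stores relative paths.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Print the archive path so the caller can capture it.
    echo "$archive"
}
```

For example, backup_dir /home/public /home/protected would write /home/protected/public-YYYYMMDD.tar.gz, with YYYYMMDD being today's date.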


How to enable gzip on an Amazon EC2 instance

I am still tweaking small things on the Amazon EC2 server that hosts my site. One thing I did not do immediately was enable gzip compression of the site data served to browsers. With gzip enabled, the server compresses the files before they are pushed across all those tubes that make up the internet, and the browser decompresses them on the other side.


How to gzip static pages

Gzipping static pages reduces the total amount of bandwidth you use and reduces page load times by serving smaller files.
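
One way to produce the smaller files ahead of time is to write a .gz copy next to each static file. The following is a minimal sketch; precompress is a hypothetical helper name, and the choice of .html, .css and .js as the file types worth compressing is an assumption:

```shell
#!/bin/sh
# precompress is a hypothetical helper: it writes FILE.gz beside each
# matching FILE and leaves the original in place.
# Usage: precompress SITE_ROOT
precompress() {
    root=$1
    find "$root" -type f \( -name '*.html' -o -name '*.css' -o -name '*.js' \) |
    while IFS= read -r f; do
        # -9 asks for the best compression; -c writes to stdout so the
        # original file is not replaced.
        gzip -9 -c "$f" > "$f.gz"
    done
}
```

Running precompress /home/public again after editing a page refreshes its .gz copy. Existing .gz files are not matched by the patterns, so they are never compressed twice.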

Make your site serve gzipped versions if available

To make your site serve the gzipped version if it is available, add the following to a file called .htaccess in /home/public:
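
Rules of this kind usually take roughly the following shape. This is a sketch, assuming Apache with mod_rewrite and mod_mime available and a .gz copy stored beside each original file; treat it as an illustration under those assumptions rather than an exact configuration for your host:

```apache
# Assumption: mod_rewrite and mod_mime are enabled for this site.
RewriteEngine On

# Only rewrite when the browser says it accepts gzip...
RewriteCond %{HTTP:Accept-Encoding} gzip
# ...and a precompressed copy of the requested file exists.
RewriteCond %{REQUEST_FILENAME}.gz -f
RewriteRule ^(.*)$ $1.gz [L]

# Label .gz responses as gzip-encoded so the browser decompresses them.
AddEncoding gzip .gz
```

If the server configuration also maps .gz to its own MIME type, the Content-Type of these responses may need to be corrected separately.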


