Build a scraper with PHP

There are a lot of ways to get data from websites; most of the time it comes through an API. When one isn’t available, you can build a scraper to pull data right off the website through its HTML.

When can you use a scraper?

Anytime you have access to HTML, you can use a scraper to parse it and gather the data you are looking for. The main issue with scrapers is that you are relying on the structure of the page being the same every time you send your script off to gather data. This is why I’m wary of using scrapers as a long-term solution for data gathering.

Is PHP the only programming language to use with scrapers?

No, you can use pretty much any programming language, as long as it can pull in a web page’s content and parse the HTML that is returned. One thing that is extremely helpful is a prebuilt HTML DOM parser. To find out whether your language has one, search for “(programming language) html parser” and you will find a ton of different options. I will be using http://sourceforge.net/projects/simplehtmldom/. A DOM parser lets you easily search through the returned HTML and find specific tags and the data contained within them. Now that we have a basic understanding of the tools we will be using, let’s get started!
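To make the idea of a DOM parser concrete, here is a minimal sketch that uses PHP’s built-in DOMDocument class rather than the Simple HTML DOM library used in the rest of this post, so it runs without downloading anything. The sample markup and the $links variable are made up for illustration.

```php
<?php
// Sample markup standing in for a downloaded page (made up for this example).
$html = '<html><body><h1>Demo</h1>'
      . '<a href="/first">First link</a>'
      . '<a href="/second">Second link</a>'
      . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

// DOMDocument ships with PHP's DOM extension.
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Collect the text and href of every <a> tag in the document.
$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = [
        'text' => trim($a->textContent),
        'href' => $a->getAttribute('href'),
    ];
}

foreach ($links as $link) {
    echo $link['text'] . ' -> ' . $link['href'] . "\n";
}
```

Whatever parser you pick, the workflow is the same: load the HTML, ask for the tags you care about, and read their text or attributes.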

Start Scraping!

Download this sample from GitHub: https://github.com/barrettbreshears/PHPTwitterScraper

Well, actually, let’s hold off on that. Before I write any code, I always like to get an idea of what the HTML structure looks like, and whether there is a pattern or a unique class or id I can use to grab the data from the website. I will be using my Twitter page for this demo, https://twitter.com/BarrettBreshear, and I will be grabbing all the tweets and when they were tweeted. Also, if you guys would like to follow me on Twitter, I would really appreciate it! Anyway, when I inspect the page’s HTML, I see that all the tweets are nested inside an <ol>, each tweet sits inside an <li>, and all the tweet data is inside a div with the class tweet!
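As a rough illustration of how that nesting turns into a query, here is a sketch against made-up markup that mirrors the ol > li > div.tweet structure described above. It uses PHP’s built-in DOMDocument and DOMXPath instead of Simple HTML DOM, and it throws if nothing matches, since an empty result usually means the page layout changed. The sample HTML and variable names are assumptions for this example only.

```php
<?php
// Made-up markup that mirrors the nesting: <ol> > <li> > div with class "tweet".
$html = '<ol>'
      . '<li><div class="tweet original"><p class="tweet-text">Hello world</p></div></li>'
      . '<li><div class="promo">Not a tweet</div></li>'
      . '<li><div class="tweet"><p class="tweet-text">Second tweet</p></div></li>'
      . '</ol>';

// Collect parser warnings quietly; scraped HTML is often not strictly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match divs whose class list contains the word "tweet", nested inside ol > li.
$query = '//ol/li/div[contains(concat(" ", normalize-space(@class), " "), " tweet ")]';
$tweetDivs = $xpath->query($query);

// No matches most likely means the page structure changed, so fail loudly.
if ($tweetDivs === false || $tweetDivs->length === 0) {
    throw new RuntimeException('No tweets found; the page structure may have changed');
}

// Pull the visible text out of each matching div.
$texts = [];
foreach ($tweetDivs as $div) {
    $texts[] = trim($div->textContent);
}

print_r($texts);
```

Checking for an empty result is cheap insurance: a scraper that silently returns nothing is much harder to debug than one that stops with a clear message.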


Now that we know the basic structure of the page, let’s get started scraping!

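<?php
// Load the parser; assumes simple_html_dom.php sits next to index.php
require_once 'simple_html_dom.php';

// Fetch and parse the page, then grab every div with the class "tweet"
$html = file_get_html('https://twitter.com/BarrettBreshear');
$tweets = $html->find('div.tweet');

// loop through each tweet div and pull out the timestamp and text
foreach ($tweets as $tweet) {
	$tweetText = $tweet->find('p.tweet-text', 0)->plaintext;
	$tweetDate = $tweet->find('a.tweet-timestamp', 0);
	echo $tweetText . " : " . $tweetDate->title . "\n";
}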

 
That’s it! As you can see, there really isn’t much to it. To run this code, go to your project’s directory in a terminal and type

php index.php

or go to the URL in a browser (I’m not going to tell you how to live your life) and you will start to see the glorious data we were looking for.

PHP Simple HTML DOM Parser is a pretty cool project and has lots of features that can really help you scrape the specific data you are looking for. For the complete manual, take a look at http://simplehtmldom.sourceforge.net/manual.htm

One more thing before you go: throttle your scrapes. You usually aren’t scraping your own websites, so keep this in mind and slow your program down to a considerate rate.
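
One simple way to throttle is to pause between requests. The sketch below is only an illustration: fetchThrottled is a hypothetical helper (not part of Simple HTML DOM), the two-second default is an arbitrary starting point, and the fetcher is passed in as a callable so you can swap in whatever actually downloads the page.

```php
<?php
/**
 * Fetch each URL in turn, pausing between requests.
 *
 * $fetch is any callable that takes a URL and returns the page body.
 * Returns an array of page bodies keyed by URL.
 */
function fetchThrottled(array $urls, callable $fetch, float $delaySeconds = 2.0): array
{
    $pages = [];
    $urls = array_values($urls);
    $last = count($urls) - 1;

    foreach ($urls as $i => $url) {
        $pages[$url] = $fetch($url);

        // Pause before the next request, but not after the final one.
        if ($i < $last) {
            usleep((int) round($delaySeconds * 1000000));
        }
    }

    return $pages;
}

// Example usage (the URLs are placeholders; file_get_contents needs allow_url_fopen enabled):
// $pages = fetchThrottled(
//     ['https://example.com/page1', 'https://example.com/page2'],
//     'file_get_contents',
//     2.0
// );
```

A fixed delay is the bluntest tool available, but it is usually enough for small jobs; if the site tells you how often you may request pages, follow that instead.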

Thanks for reading and I hope you guys enjoyed this basic example. Hopefully it will lead you to scraping glory!