Easy page scraping with Zend\Dom (from Zend Framework 2)


The other day I wanted to pull some information from the sussex.academia.edu site: specifically, a list of tags for each of the faculty members. That sounds relatively easy until you consider that the initial page only contains a list of links to the various schools/departments people have listed. Under each of those pages there are different fieldsets for different types of people (and I was only interested in the faculty fieldset), each person may or may not have tags, and even then some of those tags may be hidden behind some JavaScript, so that you have to click to view them all. When you consider all of that, you would be forgiven for thinking that it’s actually quite a daunting task!

Let me assure you, though, that by using Zend\Dom from the Zend Framework 2 library it’s actually a really simple task. In fact, I did it in around 20 lines of code.

So let’s start by looking at the code and then break it down a little more.

The first thing you’ll notice is that, OK, it’s not just 20 lines. I’ve added a little caching function to be kind to the academia.edu servers while I test out new selectors and so on, so the actual scraping only starts at line 31. But that’s getting ahead of ourselves, so let’s break it down.

First you’ll want to include your autoloader, however you’re doing it (with Composer, setting it up yourself, or whatever), and import the required namespaced classes. Here I’ve also imported Zend\Debug alongside Zend\Dom, just for easy output.

Simple caching

This just caches the result of the file_get_contents fetch so that it can be used again, and it also ensures that the contents are stored in UTF-8. On subsequent calls with the same URL it’ll return the cached result rather than connect to the remote website to get the contents. But you know that – I don’t really need to teach you what caching is all about!
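For reference, a helper along those lines could look something like the sketch below. Treat it as an illustration rather than the exact function from the listing: the name fetchCached, the md5-based file names, and the cache directory are all placeholders I've picked for the example, and the encoding detection is deliberately simple.

```php
<?php
/**
 * Fetch a URL once, store the body on disk as UTF-8, and serve the stored
 * copy on later calls. fetchCached() and $cacheDir are hypothetical names.
 */
function fetchCached(string $url, ?string $cacheDir = null): string
{
    $cacheDir = $cacheDir ?? sys_get_temp_dir() . '/scrape-cache';
    if (!is_dir($cacheDir) && !mkdir($cacheDir, 0777, true) && !is_dir($cacheDir)) {
        throw new RuntimeException("Cannot create cache directory: $cacheDir");
    }

    // One cache file per URL.
    $cacheFile = $cacheDir . '/' . md5($url) . '.html';
    if (is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }

    $contents = file_get_contents($url);
    if ($contents === false) {
        throw new RuntimeException("Could not fetch $url");
    }

    // Normalise to UTF-8 (requires the mbstring extension).
    $encoding = mb_detect_encoding($contents, ['UTF-8', 'ISO-8859-1'], true);
    if ($encoding !== false && $encoding !== 'UTF-8') {
        $contents = mb_convert_encoding($contents, 'UTF-8', $encoding);
    }

    file_put_contents($cacheFile, $contents);

    return $contents;
}
```

Because file_get_contents accepts local paths as well as URLs, you can try the helper against a saved copy of a page before pointing it at the live site.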

Get each page

We start off by getting the initial page and instantiating Zend\Dom\Query with the page contents. The power of this class is really in its simplicity: you basically just pass it a CSS selector. If you’ve ever used jQuery or Zepto or something like that, you’ll be very familiar with how handy and powerful selectors can be for getting down the DOM tree to the elements you want. Zend\Dom does the same thing, only in PHP.

So the code above uses the query parser to get all the anchor tags that are in unordered list elements that are contained within a div with the id of department_list. The execute method returns a NodeList which implements both Iterator and ArrayAccess, so you can loop through the results very easily. And from each anchor we can get the href attribute, which gives us the page that contains all of the people and their tags.
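To make that concrete, here is a small sketch of the same query run against an inline snippet instead of the live page. It assumes zend-dom and zend-debug are installed through Composer and that the autoloader lives in vendor/ next to the script; the HTML and the department URLs are made up for the example.

```php
<?php
// Assumption: Composer autoloader next to this script.
require __DIR__ . '/vendor/autoload.php';

use Zend\Debug\Debug;
use Zend\Dom\Query;

// Made-up stand-in for the landing page, so no request is needed.
$html = <<<HTML
<html><body>
<div id="department_list">
  <ul>
    <li><a href="/Departments/Accounting">Accounting</a></li>
    <li><a href="/Departments/American_Studies">American Studies</a></li>
  </ul>
</div>
<a href="/ignored">Not in the department list</a>
</body></html>
HTML;

$dom = new Query($html);

// Anchors inside a <ul> inside <div id="department_list">.
$links = $dom->execute('div#department_list ul a');

$hrefs = [];
foreach ($links as $link) {
    // Each result is a DOMElement, so the usual DOM methods apply.
    $hrefs[] = $link->getAttribute('href');
}

Debug::dump($hrefs);
```

The anchor outside the div is there to show that the selector really does scope the match to the department list.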

With each page found, we load up the contents and create another Query object. From there we can execute a query for the h1 tag, which contains the title text of the page we’re currently processing, and keep that for later.

If you’ve ever used selectors in jQuery you may be aware that you can look for selectors within the scope of another selector – basically you can think of that as doing a sub-query on a chunk of HTML within your page. I found that the easiest way to do this with Zend\Dom was to convert the node found by my previous query into SimpleXML, output that as XML, and pass it right back to a new instance of the Query class. All of that takes place on lines 36 and 38, but as I had the SimpleXML object I also used it to make sure the fieldset my previous execute had found had the legend of ‘faculty’.
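The conversion step on its own can be sketched with nothing but PHP built-ins. In the snippet below a plain DOMDocument lookup stands in for the node a Zend\Dom query would hand you (those results are DOMElement objects too), and the HTML, including the second fieldset's legend, is made up for the example.

```php
<?php
// Made-up fragment standing in for a department page.
$html = <<<HTML
<html><body>
<fieldset><legend>Faculty</legend><ul><li><a href="/a">Person A</a></li></ul></fieldset>
<fieldset><legend>Graduate Students</legend><ul><li><a href="/b">Person B</a></li></ul></fieldset>
</body></html>
HTML;

// Real pages are rarely valid markup, so keep libxml warnings quiet.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$fragments = [];
foreach ($doc->getElementsByTagName('fieldset') as $node) {
    // DOMElement -> SimpleXMLElement, which gives easy access to <legend>.
    $xml = simplexml_import_dom($node);

    // Only keep the fieldset whose legend reads "faculty".
    if (strtolower(trim((string) $xml->legend)) !== 'faculty') {
        continue;
    }

    // Serialise just this fieldset back to markup.
    $fragments[] = $xml->asXML();
}

// Each fragment can then be handed to a fresh query object, e.g.
// $sub = new \Zend\Dom\Query($fragments[0]);
```

Any selector run on that new query object only sees the faculty fieldset, which is what makes it behave like a scoped sub-query.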

All of this should be pretty familiar by now… It’s just looping through the results of a selector, doing a sub-query with the SimpleXML trick, and then adding the results to an array.

Results

So starting with a page that only contains links to other pages you’d end up with an array that looks like this:

[code lang="php"]
array (size=62)
  'Accounting' =>
    array (size=1)
      'Metodio Moniz' =>
        array (size=5)
          0 => string 'Accounting' (length=10)
          1 => string 'Dendi' (length=5)
          2 => string 'Education' (length=9)
          3 => string 'Philosophy' (length=10)
          4 => string 'History' (length=7)
  'American Studies' =>
    array (size=2)
      'Russell Dent' =>
        array (size=54)
          0 => string 'Translation Studies' (length=19)
          1 => string 'American Literature' (length=19)
          2 => string 'Poetry' (length=6)
          3 => string 'French Literature' (length=17)
          // and so on…
[/code]

And all of that with just a handful of lines of code. Marvelous stuff!
