Scraping content with PHP as if it were jQuery

December 8, 2013

Building a spider or a bot requires some knowledge of regular expressions: you need to know and use preg_match or preg_match_all to selectively find tags and extract information from the HTML source. Sometimes, while looking at the HTML code of a page, I have thought: "If only I could use jQuery to get it! It would be so easy!"

This happens because some pages contain nested items that regular expressions cannot easily recognize: there is no clear, stable anchor you can use to locate the information. So you need to use the DOM (Document Object Model), that is to say the document structure.
To do that, you have to load the target page with a function like simplexml_load_string and then navigate through the resulting objects to find the information you need. This is not easy. If you are also a front-end developer you probably know jQuery: basically, it is a library that helps programmers handle the differences between browsers and lets you develop in JavaScript much faster.
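To make the nesting problem concrete, here is a minimal sketch on a made-up page fragment. The markup and variable names are hypothetical, and the DOM half uses PHP's bundled DOMDocument and DOMXPath classes instead of simplexml_load_string, because the latter expects well-formed XML, which real-world HTML often is not.

```php
<?php
// Hypothetical markup: each post is a div that contains another, nested div.
$page = '<div class="post"><div class="meta">Dec 8</div><p>First</p></div>'
      . '<div class="post"><div class="meta">Dec 9</div><p>Second</p></div>';

// Regex attempt: the lazy match stops at the first closing </div>,
// which belongs to the nested "meta" div, so the <p> is never captured.
preg_match_all('#<div class="post">(.*?)</div>#s', $page, $m);
$regexPosts = $m[1];

// DOM attempt: parse the markup, then ask for the <p> inside each post.
libxml_use_internal_errors(true); // tolerate imperfect real-world HTML
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$domPosts = [];
foreach ($xpath->query('//div[@class="post"]/p') as $p) {
    $domPosts[] = $p->textContent;
}

print_r($regexPosts); // fragments cut off at the nested div
print_r($domPosts);   // the text of each post's <p>
```

The regex version returns only the opening of the nested div for each post, while the DOM version reaches the paragraph regardless of what else is nested around it. The price is the verbose navigation code that a jQuery-style selector would avoid.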

The first thing you learn with jQuery is its simple way of finding HTML tags, selecting them by class or by other attributes.
So you need something like jQuery for PHP: a PHP jQuery library!

I'm not going to teach you jQuery, but I am going to tell you about the Simple HTML DOM PHP class, which you can find on SourceForge. It is the PHP jQuery library I was searching for: a class that lets you build scrapers with methods for navigating the DOM like the ones used in jQuery. Using this class, I built in just ten minutes the mini-widget in the right sidebar, which embeds an animated GIF and its description from the_coding_love, a very funny Tumblr that publishes animated GIFs about coding every day. This is all the code, and it could be done better:

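// Load the Simple HTML DOM library (the file must sit next to this script).
include("simple_html_dom.php");

// Fetch and parse a random post; file_get_html() returns false on failure.
$html = file_get_html('http://thecodinglove.com/random');
$src = $link = $text = "";
if ($html) {
	// The post title is a link inside the h3: keep its text and its URL.
	foreach ($html->find('div.post div.centre h3 a') as $a) {
		$text = $a->innertext;
		$link = $a->href;
	}
	// The animated GIF is the image inside the post body.
	foreach ($html->find('div.post div.bodytype img') as $img) {
		$src = $img->src;
	}
	$html->clear(); // free the memory used by the DOM tree
}
echo $src."\n";
echo $link."\n";
echo $text."\n";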

I know the code above is not very well written and could be better, but I wrote it at 1:00 am.

You can see the result (with some css) in the right sidebar (or below if you are on mobile).

Here is a complete list of the methods, functions and properties in the latest version, 1.5:

Helper functions

Name | Description
object str_get_html ( string $content ) | Creates a DOM object from a string.
object file_get_html ( string $filename ) | Creates a DOM object from a file or a URL.

DOM methods & properties

Name | Description
void __construct ( [string $filename] ) | Constructor. Setting the filename parameter automatically loads the contents, either text or file/URL.
string plaintext | Returns the contents extracted from HTML.
void clear () | Cleans up memory.
void load ( string $content ) | Loads contents from a string.
string save ( [string $filename] ) | Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file.
void load_file ( string $filename ) | Loads contents from a file or a URL.
void set_callback ( string $function_name ) | Sets a callback function.
mixed find ( string $selector [, int $index] ) | Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.

Element methods & properties

Name | Description
string [attribute] | Reads or writes the element's attribute value.
string tag | Reads or writes the tag name of the element.
string outertext | Reads or writes the outer HTML text of the element.
string innertext | Reads or writes the inner HTML text of the element.
string plaintext | Reads or writes the plain text of the element.
mixed find ( string $selector [, int $index] ) | Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
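The element properties above can be tried offline on a small string. This is a minimal sketch, assuming simple_html_dom.php (version 1.5) sits in the same directory as the script; the HTML snippet and the variable names are made up for illustration.

```php
<?php
// Assumption: the library file is next to this script.
require_once 'simple_html_dom.php';

// Made-up snippet standing in for a downloaded page.
$html = str_get_html(
    '<div class="post"><h3><a href="/post/1">First <b>post</b></a></h3></div>'
);

// With an index, find() returns a single element instead of an array.
$a = $html->find('div.post h3 a', 0);

$href  = $a->href;       // an attribute, read like a property
$inner = $a->innertext;  // inner HTML, nested tags included
$plain = $a->plaintext;  // text only, tags stripped

// Properties are writable too: change an attribute and dump the tree back.
$a->href = '/post/2';
$out = $html->save();

echo $href, "\n", $inner, "\n", $plain, "\n", $out, "\n";
```

Passing 0 as the second argument to find() is handy when you expect exactly one match, because it saves the foreach loop used in the widget code.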

DOM traversing

Name | Description
mixed $e->children ( [int $index] ) | Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent () | Returns the parent of the element.
element $e->first_child () | Returns the first child of the element, or null if not found.
element $e->last_child () | Returns the last child of the element, or null if not found.
element $e->next_sibling () | Returns the next sibling of the element, or null if not found.
element $e->prev_sibling () | Returns the previous sibling of the element, or null if not found.
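The traversal methods can be sketched in the same way. Again this assumes simple_html_dom.php is in the same directory; the list markup is hypothetical.

```php
<?php
// Assumption: the library file is next to this script.
require_once 'simple_html_dom.php';

// Made-up menu used only to walk the tree.
$html = str_get_html('<ul id="menu"><li>Home</li><li>Blog</li><li>About</li></ul>');
$ul = $html->find('#menu', 0);

$count  = count($ul->children());            // all child elements, as an array
$second = $ul->children(1)->plaintext;       // Nth child, counting from zero
$first  = $ul->first_child();
$next   = $first->next_sibling()->plaintext; // the element right after the first <li>
$parent = $first->parent()->tag;             // back up to the <ul>
$none   = $ul->last_child()->next_sibling(); // null: nothing follows the last child

echo $count, " ", $second, " ", $next, " ", $parent, "\n";
var_dump($none);
```

Because the sibling and child methods return null when there is nothing to return, it is worth checking the result before chaining another call on pages whose structure you do not control.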

Camel naming conventions

You can also call the methods using the W3C standard camel-case naming conventions.

Method | Mapping
array $e->getAllAttributes () | array $e->attr
string $e->getAttribute ( $name ) | string $e->attribute
void $e->setAttribute ( $name, $value ) | void $e->attribute = $value
bool $e->hasAttribute ( $name ) | bool isset($e->attribute)
void $e->removeAttribute ( $name ) | void $e->attribute = null
element $e->getElementById ( $id ) | mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [, $index] ) | mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ( $name ) | mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] ) | mixed $e->find ( $name [, int $index] )
element $e->parentNode () | element $e->parent ()
mixed $e->childNodes ( [$index] ) | mixed $e->children ( [int $index] )
element $e->firstChild () | element $e->first_child ()
element $e->lastChild () | element $e->last_child ()
element $e->nextSibling () | element $e->next_sibling ()
element $e->previousSibling () | element $e->prev_sibling ()
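A short sketch of a few of these aliases side by side with their mapped forms, assuming once more that simple_html_dom.php is in the same directory. The markup is invented for the example.

```php
<?php
// Assumption: the library file is next to this script.
require_once 'simple_html_dom.php';

// Made-up snippet with one container and one image.
$html = str_get_html('<div id="box"><img src="a.gif" alt="cat"></div>');
$img = $html->find('img', 0);

// Reading: the camel-case method and the property return the same value.
$viaMethod   = $img->getAttribute('src');
$viaProperty = $img->src;

// Writing: setAttribute() does the same job as assigning to the property.
$img->setAttribute('alt', 'dog');
$alt = $img->alt;

// Checking: hasAttribute() corresponds to isset() on the property.
$hasSrc = $img->hasAttribute('src');

// Lookup and traversal aliases.
$box       = $html->getElementById('box'); // like $html->find('#box', 0)
$parentTag = $img->parentNode()->tag;      // like $img->parent()->tag
$childTag  = $box->firstChild()->tag;      // like $box->first_child()->tag

echo $viaMethod, " ", $alt, " ", $parentTag, " ", $childTag, "\n";
```

Which style to use is a matter of taste: the camel-case names will look familiar if you come from JavaScript, while the property syntax is shorter.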

Author

PHP expert. WordPress plugin and theme developer. Father, Maker, Arduino and ESP8266 enthusiast.
