Building a spider or a bot requires some knowledge of regular expressions: you need to know and use preg_match or preg_match_all to selectively find tags and extract information from the HTML source. Sometimes, while looking at the HTML code of a page, I’ve caught myself thinking “If I could only use jQuery to get it! It would be so easy!”.
This happens because some nested items are not easily recognizable with regular expressions: there is no clear, stable anchor point to match against when locating the information. So you need to use the DOM (Document Object Model), that is to say the document structure.
To do that, you have to load the target page with a function like simplexml_load_string and then navigate through the resulting objects to find the information you need. This is not easy. If you’re also a front-end developer you probably know what jQuery is: basically, it’s a framework that helps programmers handle the differences between browsers and lets you develop anything in JavaScript incredibly fast.
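To make the regex-versus-DOM point above concrete, here is a minimal sketch. The markup, the class names and the variable names are all made up for illustration, and the snippet is wrapped in a `<root>` element so that simplexml_load_string accepts it (real pages are rarely well-formed XML, which is part of why this route is painful).

```php
<?php
// Hypothetical markup: the wanted link is nested in a "post" block,
// and a look-alike link sits in an "ad" block.
$source = '<root>'
        . '<div class="post"><div class="centre"><h3><a href="/p/1">First</a></h3></div></div>'
        . '<div class="ad"><h3><a href="/ad">Sponsored</a></h3></div>'
        . '</root>';

// Regex approach: matches every h3 link, with no notion of where it is nested.
preg_match_all('#<h3><a href="([^"]+)">([^<]+)</a></h3>#', $source, $matches);
$regexLinks = $matches[1];

// DOM approach: the query follows the document structure, not the raw text.
$xml = simplexml_load_string($source);
$domLinks = [];
foreach ($xml->xpath('//div[@class="post"]//h3/a') as $a) {
    $domLinks[] = (string) $a['href'];
}

print_r($regexLinks); // both links
print_r($domLinks);   // only the link inside the "post" block
```

The regular expression has no way to tell the two headings apart, while the structural query can simply say "inside the post block".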
The first thing you learn with jQuery is its simple system for finding and getting HTML tags, selecting them by class or by other attributes.
So what you need is something like jQuery for PHP: a PHP jQuery lib!
I’m not going to teach you jQuery, but I’m going to tell you about the Simple HTML DOM PHP class, which you can find here on SourceForge. It is the PHP jQuery lib I was searching for: a class that lets you build scrapers with methods for navigating the DOM much like the ones in jQuery. Using this class, I was able to build in just ten minutes the mini-widget in the right sidebar, which embeds an animated GIF and its description from the very funny Tumblr the_coding_love, which publishes funny animated GIFs about coding every day. The code is only this:
    include("simple_html_dom.php");

    // Fetch a random post and parse it into a DOM object.
    $html = file_get_html('http://thecodinglove.com/random');
    $src = $link = $text = "";

    // The post title is a link inside the h3 heading.
    foreach ($html->find('div.post div.centre h3') as $e) {
        foreach ($e->find("a") as $a) {
            $text = $a->innertext;
            $link = $a->href;
        }
    }

    // The animated GIF is the image inside the post body.
    foreach ($html->find('div.post div.bodytype') as $e) {
        foreach ($e->find("img") as $a) {
            $src = $a->src;
        }
    }

    echo $src."\n";
    echo $link."\n";
    echo $text."\n";
I know the above code is not so well written and could be better, but I wrote it at 1:00 am.
You can see the result (with some CSS) in the right sidebar (or below if you are on mobile).
Here is a complete list of the methods, functions and properties in the latest version, 1.5:
Helper functions
Name | Description |
---|---|
object str_get_html ( string $content ) | Creates a DOM object from a string. |
object file_get_html ( string $filename ) | Creates a DOM object from a file or a URL. |
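As a quick illustration of the two helpers, here is a minimal sketch. It assumes simple_html_dom.php sits next to the script, and the markup is invented so the example runs without network access.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include("simple_html_dom.php");

// Hypothetical markup, parsed from a string.
$html = str_get_html('<div class="post"><h3><a href="/p/1">First post</a></h3></div>');

// With an index, find() returns a single element (or null if nothing matches).
$a = $html->find('div.post h3 a', 0);
if ($a !== null) {
    echo $a->href . "\n";
    echo $a->innertext . "\n";
}

// file_get_html() builds the same kind of object from a file path or a URL:
// $html = file_get_html('http://thecodinglove.com/random');
```

Parsing from a string is handy when you already fetched the page yourself, for example with cURL, and only need the parsing part.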
DOM methods & properties
Name | Description |
---|---|
void __construct ( [string $filename] ) | Constructor. Setting the filename parameter automatically loads the contents, either text or file/URL. |
string plaintext | Returns the contents extracted from HTML. |
void clear () | Cleans up memory. |
void load ( string $content ) | Loads contents from a string. |
string save ( [string $filename] ) | Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file. |
void load_file ( string $filename ) | Loads contents from a file or a URL. |
void set_callback ( string $function_name ) | Sets a callback function. |
mixed find ( string $selector [, int $index] ) | Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects. |
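Here is a small sketch of the DOM object methods listed above, used in their object-oriented form. The markup and variable names are invented, and it again assumes simple_html_dom.php is next to the script.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include("simple_html_dom.php");

$dom = new simple_html_dom();
$dom->load('<ul><li>one</li><li>two</li><li>three</li></ul>');

// Without an index, find() returns an array of elements.
$count = count($dom->find('li'));

// With an index (zero-based), it returns a single element.
$secondText = $dom->find('li', 1)->innertext;

// save() dumps the tree back to a string.
$markup = $dom->save();

// Free the memory once done; elements should not be used after this call.
$dom->clear();
unset($dom);

echo $count . "\n";
echo $secondText . "\n";
```

Copying the values you need into plain variables before calling clear() keeps the rest of the script independent from the DOM object.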
Element methods & properties
Name | Description |
---|---|
string [attribute] | Read or write the element’s attribute value. |
string tag | Read or write the tag name of the element. |
string outertext | Read or write the outer HTML text of the element. |
string innertext | Read or write the inner HTML text of the element. |
string plaintext | Read or write the plain text of the element. |
mixed find ( string $selector [, int $index] ) | Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects. |
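The element properties can be both read and written, which is easier to see in code. This is only a sketch on invented markup, assuming simple_html_dom.php is next to the script.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include("simple_html_dom.php");

$html = str_get_html('<div id="box"><a class="more" href="/old">Read <b>more</b></a></div>');
$a = $html->find('a.more', 0);

// Reading: attributes are exposed as properties named after the attribute.
$href  = $a->href;       // attribute value
$tag   = $a->tag;        // tag name
$inner = $a->innertext;  // inner HTML, markup included
$plain = $a->plaintext;  // text only, tags stripped

// Writing: assign to the same properties to change the document.
$a->href = '/new';
$a->innertext = 'Continue';

// The changes show up when the tree is dumped back to a string.
$result = $html->save();
echo $result . "\n";
```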
DOM traversing
Name | Description |
---|---|
mixed $e->children ( [int $index] ) | Returns the Nth child object if index is set, otherwise returns an array of children. |
element $e->parent () | Returns the parent of the element. |
element $e->first_child () | Returns the first child of the element, or null if not found. |
element $e->last_child () | Returns the last child of the element, or null if not found. |
element $e->next_sibling () | Returns the next sibling of the element, or null if not found. |
element $e->prev_sibling () | Returns the previous sibling of the element, or null if not found. |
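The traversal methods can be tried out on a tiny list. The markup below is invented for illustration, and the sketch assumes simple_html_dom.php is next to the script.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include("simple_html_dom.php");

$html = str_get_html('<ul id="menu"><li>home</li><li>blog</li><li>about</li></ul>');
$ul = $html->find('#menu', 0);

$first  = $ul->first_child();      // <li>home</li>
$last   = $ul->last_child();       // <li>about</li>
$middle = $first->next_sibling();  // <li>blog</li>

echo $first->innertext . "\n";
echo $middle->innertext . "\n";
echo $last->innertext . "\n";

// Walking back and up from the middle item.
echo $middle->prev_sibling()->innertext . "\n"; // same element as $first
echo $middle->parent()->id . "\n";              // id attribute of the <ul>

// children() without an index returns all child elements,
// with an index it returns just that one.
echo count($ul->children()) . "\n";
echo $ul->children(1)->innertext . "\n";
```

Since these methods return null when there is nothing to return, real scraping code should check the result before chaining another call.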
Camel naming conventions
You can also call the methods using the W3C standard camel-case naming convention.
Method | Mapping |
---|---|
array $e->getAllAttributes () | array $e->attr |
string $e->getAttribute ( $name ) | string $e->attribute |
void $e->setAttribute ( $name, $value ) | void $e->attribute = $value |
bool $e->hasAttribute ( $name ) | bool isset($e->attribute) |
void $e->removeAttribute ( $name ) | void $e->attribute = null |
element $e->getElementById ( $id ) | mixed $e->find ( "#$id", 0 ) |
mixed $e->getElementsById ( $id [,$index] ) | mixed $e->find ( "#$id" [, int $index] ) |
element $e->getElementByTagName ( $name ) | mixed $e->find ( $name, 0 ) |
mixed $e->getElementsByTagName ( $name [, $index] ) | mixed $e->find ( $name [, int $index] ) |
element $e->parentNode () | element $e->parent () |
mixed $e->childNodes ( [$index] ) | mixed $e->children ( [int $index] ) |
element $e->firstChild () | element $e->first_child () |
element $e->lastChild () | element $e->last_child () |
element $e->nextSibling () | element $e->next_sibling () |
element $e->previousSibling () | element $e->prev_sibling () |
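If you come from JavaScript, the camel-case aliases may feel more natural. Here is a short sketch of a few of them on invented markup, again assuming simple_html_dom.php is next to the script.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
include("simple_html_dom.php");

$html = str_get_html('<div id="card"><img src="a.gif" alt="cat"><p>hello</p></div>');

// Same as $html->find('img', 0).
$img = $html->getElementByTagName('img');

$src = $img->getAttribute('src');          // same as $img->src
$hasTitle = $img->hasAttribute('title');   // same as isset($img->title)
$img->setAttribute('alt', 'dog');          // same as $img->alt = 'dog'
$alt = $img->getAttribute('alt');

// Same as $html->find('#card', 0), then last_child() and parent().
$card = $html->getElementById('card');
$p = $card->lastChild();
$parentId = $p->parentNode()->getAttribute('id');

echo $src . "\n";
echo $alt . "\n";
echo $p->innertext . "\n";
echo $parentId . "\n";
```

Both naming styles call the same code underneath, so you can mix them freely, although sticking to one keeps a scraper easier to read.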