PHP Web page to text function

January 16, 2010

I found this nice little bot on www.php.net, on the preg_replace page; thanks to the author who posted the script there.
The bot returns the text content of a URL, so it can be used to pull the text from a site and pick out relevant words to search for.
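
As a rough illustration of the "relevant words" idea, here is a minimal sketch that counts the most frequent words in a piece of plain text. The helper name top_words and its parameters are hypothetical (they are not part of the original script), and it only handles plain ASCII words; a real keyword extractor would also need a stop-word list.

```php
<?php
// Hypothetical helper, not part of the original script:
// returns the most frequent words in a block of plain text.
function top_words(string $text, int $limit = 5, int $minLength = 4): array {
	// Split the lowercased text into words (ASCII letters, ' and -)
	$words = str_word_count(strtolower($text), 1);

	// Drop very short words, which are rarely useful as keywords
	$words = array_filter($words, function ($w) use ($minLength) {
		return strlen($w) >= $minLength;
	});

	// Count occurrences and sort from most to least frequent
	$counts = array_count_values($words);
	arsort($counts);

	return array_slice($counts, 0, $limit, true);
}

$sample = "Live music tonight: rock music, indie music and more rock.";
print_r(top_words($sample, 2));   // music => 3, rock => 2
```

The same call could be fed the string returned by a page-to-text function instead of the sample sentence.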

The function is nice because it uses cURL and shows a few useful tricks:

it disguises itself with a specific user agent, so it looks like a browser rather than a spider
it follows redirects (so you can call the function with a tiny URL and get the text of the real page!)
it uses powerful regular expressions to remove HTML tags as well as JavaScript and styles (whose contents remain if you simply call strip_tags)
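
The last point can be checked with a small stand-alone sketch. It is not the script itself: the $html sample is made up for illustration, and only the script and style patterns are used. It compares a plain strip_tags call with stripping those blocks first.

```php
<?php
// Made-up sample page with a style block and a script block
$html = '<html><head><style>p { color: red; }</style>'
      . '<script>var tracker = 1;</script></head>'
      . '<body><p>Hello world</p></body></html>';

// strip_tags() removes the tags but keeps the text inside <style> and <script>
$naive = strip_tags($html);

// Remove the script and style blocks first, then strip the remaining tags
$cleaned = preg_replace(
	array('@<script[^>]*?>.*?</script>@si', '@<style[^>]*?>.*?</style>@si'),
	'',
	$html
);
$better = strip_tags($cleaned);

var_dump(strpos($naive, 'color: red') !== false);  // bool(true): the CSS leaked into the text
var_dump($better);                                 // string(11) "Hello world"
```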

The webpage2txt function below will be included in the next version of the Mini Bots PHP Class.

function webpage2txt($url) {
	$user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";

	$ch = curl_init();    // initialize curl handle
	curl_setopt($ch, CURLOPT_URL, $url); // set the URL to fetch
	curl_setopt($ch, CURLOPT_FAILONERROR, 1);              // Fail on errors
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);    // allow redirects
	curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
	// No CURLOPT_PORT: cURL takes the port from the URL (forcing 80 breaks https pages and redirects)
	curl_setopt($ch, CURLOPT_TIMEOUT, 15); // times out after 15s

	curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);

	$document = curl_exec($ch);
	if ($document === false) {
		return '';   // request failed or timed out: nothing to convert
	}

	$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
		'@<style[^>]*?>.*?</style>@si',     // Strip style blocks (the U flag would make .*? greedy)
		'@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
		'@<![\s\S]*?--[ \t\n\r]*>@',        // Strip multi-line comments including CDATA
		'/\s{2,}/',                         // Collapse runs of whitespace
	);

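	// Strip the markup first and decode entities afterwards, so escaped text such as &lt;b&gt; is not mistaken for a tag
	$text = html_entity_decode(preg_replace($search, "\n", $document));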

	// Remove any leading or trailing whitespace left over after stripping
	$text = preg_replace('/^\s+|\s+$/', '', trim($text));

	return $text;
}

echo webpage2txt("http://www.rockit.it");
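
To see where a tiny URL actually ended up after the redirects, cURL can report the final address. The following is only a sketch under assumptions: the helper name fetch_with_info and the shape of its return array are hypothetical, and it is not part of the Mini Bots class.

```php
<?php
// Hypothetical helper: returns the page body together with the final URL
// and the HTTP status code reported by cURL after following redirects.
function fetch_with_info(string $url): array {
	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
	curl_setopt($ch, CURLOPT_MAXREDIRS, 5);          // but not forever
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
	curl_setopt($ch, CURLOPT_TIMEOUT, 15);           // give up after 15 seconds

	$body = curl_exec($ch);

	return array(
		'body'   => $body === false ? '' : $body,
		'error'  => $body === false ? curl_error($ch) : '',
		'url'    => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),  // address after redirects
		'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
	);
}

// Run from the command line with a URL as the first argument
if (PHP_SAPI === 'cli' && isset($argv[1])) {
	$result = fetch_with_info($argv[1]);
	echo $result['status'] . ' ' . $result['url'] . "\n";
}
```

Capping the redirects with CURLOPT_MAXREDIRS is a design choice of this sketch; the original function follows redirects without an explicit limit.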

Author

Giulio Pons

PHP expert. WordPress plugin and theme developer. Father, maker, Arduino and ESP8266 enthusiast.

Comments on “PHP Web page to text function”

There are 2 thoughts

Canyoun said:

August 3, 2011 at 5:50 pm

hello friend

Thanks for this one


