PHP Classes
Icontem

File: README.txt


  Search   All class groups All class groups   Latest entries Latest entries   Top 10 charts Top 10 charts   Newsletter Newsletter   Blog Blog   Forums Forums   Help FAQ Help FAQ  
  Login   Register  
Recommend this page to a friend! ReTweet ReTweet Stumble It! Stumble It! Bookmark in del.icio.us Bookmark in del.icio.us
  Classes of Andy Pieters  >  Robots_txt  >  README.txt  
File: README.txt
Role: Documentation
Content type: text/plain
Description: Usage Examples
Class: Robots_txt
Test if a URL may be crawled looking at robots.txt
 

Contents

Class file image Download
		Robots exclusion standard is considered propper netiquette, so any kind of script that exhibits
		crawling-like behavior is expected to abide by it.

		The intended use of this class is to feed it a url before you intend to visit it. The class will
		automatically attempt to read the robots.txt file and will return a boolean value to indicate if
		you are allowed to visit this url.

		Maximum Crawl-delays and request-rates maxed-out at 60seconds.

		The class will block until the detected crawl-delay (or request-rate) allows visiting the url.

		For instance, if Crawl-delay is set to 3, the Robots_txt::urlAllowed() method will block for 3
		seconds when called a second time. An internal clock is kept with the last visited time, so if
		the delay is already expired, the method will not block.

		Example usage

		foreach($arrUrlsToVisit as $strUrlToVisit) {

			if(Robots_txt::urlAllowed($strUrlToVisit,$strUserAgent)) {

				#visit url, do processing. . . 
			}
		}

		The simple example above will ensure you abide by the wishes of the site owners.

		Note: an unofficial non-standard extension exists, that limits the times that crawlers
			  are allowed to visit a site. I choose to ignore this extension because I feel it
			  is unreasonable.

		Note: You are only *required* to specify your userAgent the first time you call the urlAllowed method, and
			  only the first value is ever used.
			  
Example Usage
	var_dump(Robots_txt::urlAllowed('http://slashdot.org/','Slurp'));
	var_dump(Robots_txt::urlAllowed('http://slashdot.org/test','Slurp'));

 
  Advertise on this site Advertise on this site   Site map Site map   Statistics Statistics   Site tips Site tips   Privacy policy Privacy policy   Contact Contact  

For more information send a message to :
info at phpclasses dot org.
Copyright (c) Icontem 1999-2009 PHP Classes - PHP Class Scripts
  PHP Book Reviews - Reviews of books and other products