Innovation award
 Nominee: 1x |
This class checks whether a page may be crawled by consulting the robots.txt file of its site.
It takes the URL of a page and retrieves the robots.txt file of the same site.
The class parses the robots.txt file and looks up the rules defined in it to see whether the site allows crawling the intended page.
The class also stores the time at which each page is crawled, so that the next time a page of the same site is requested it can check that the crawl delay and request rate limits set by the site are being honored.
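
The lookup described above can be sketched in a few lines. The class itself is written in PHP and its methods are not listed on this page, so this sketch uses Python's standard `urllib.robotparser` module instead of the class's own interface. The names `robots_url` and `may_crawl`, the sample rules, and the `ExampleBot` user agent are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a network request.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def robots_url(page_url):
    """Build the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def may_crawl(robots_txt, user_agent, page_url):
    """Parse robots.txt text and report whether user_agent may fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


if __name__ == "__main__":
    page = "http://example.com/private/report.html"
    print(robots_url(page))
    print(may_crawl(SAMPLE_ROBOTS_TXT, "ExampleBot", page))
```

A real crawler would download the text from the address returned by `robots_url` before parsing it; the sample text keeps the sketch runnable offline.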
| Name: | Robots_txt |
| Base name: | robots_txt |
| Description: | Test if a URL may be crawled looking at robots.txt |
| Version: | 1.1 |
| PHP version: | 5.0 |
| License: | GNU General Public License (GPL) |
| All time users: | 1243 users |
| All time rank: | 2696 |
| Week users: | 3 users |
| Week rank: | 893 |
 January 2008
Number 8 |
robots.txt is a file that sites place in the Web root of their domain to tell search engine crawlers, and Web robots in general, which pages should not be crawled.
This class can parse the robots.txt file of a domain to determine whether a given page may be crawled.
It is useful for implementing a friendly crawler that respects the wishes of site owners who do not want certain pages crawled by Web robot programs.
Manuel Lemos |
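
A friendly crawler also has to space out its requests to the same site. The following is a minimal sketch of that bookkeeping in Python, not the PHP class's actual implementation: it remembers when each host was last visited and reports how long to wait before the next request. The `PoliteScheduler` class, its method names, and the injectable `clock` parameter are all hypothetical, and the delay value is assumed to come from a `Crawl-delay` line in the site's robots.txt.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Track the last visit per host so a crawl delay can be honored."""

    def __init__(self, clock=time.monotonic):
        # clock is injectable so the behavior can be checked without sleeping.
        self._clock = clock
        self._last_visit = {}

    def seconds_to_wait(self, url, crawl_delay):
        """Return how many seconds remain before url's host may be hit again."""
        host = urlsplit(url).netloc
        last = self._last_visit.get(host)
        if last is None:
            return 0.0  # host not visited yet, no wait needed
        elapsed = self._clock() - last
        return max(0.0, crawl_delay - elapsed)

    def record_visit(self, url):
        """Store the current time as the last visit to url's host."""
        self._last_visit[urlsplit(url).netloc] = self._clock()


if __name__ == "__main__":
    scheduler = PoliteScheduler()
    first = "http://example.com/a.html"
    second = "http://example.com/b.html"
    scheduler.record_visit(first)
    # Same host, so the wait is close to the full assumed delay of 5 seconds.
    print(scheduler.seconds_to_wait(second, crawl_delay=5))
```

Keying the record on the host rather than the full URL matches the idea that the delay applies between any two pages of the same site.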
| Ratings | Utility | Consistency | Documentation | Examples | Tests | Videos | Overall | Rank |
| All time: | Not sure (50.0%) | Sufficient (62.5%) | Not sure (50.0%) | - | - | - | Insufficient (36.2%) | 2152 |
| Month: | Not yet rated by the users |
Applications that use this class |
No application links were specified for this class.
