
Link Searcher: Crawl Web pages to search for given text

Last updated: 2015-01-12
Ratings: not yet rated by the users
Unique user downloads: 1,875 total, 2 this week
Download rankings: 1,985 all time, 570 this week (up)
Version: link_searcher 1.0
License: GNU General Publi...
PHP version: 4.0
Categories: HTTP, Searching

Description

This class can be used to crawl Web pages to search for given text in it.

It retrieves a given Web page and searches for links contained in it.

The new links that are found are added to a queue to be crawled later and so implement recursive searching up to a given depth limit.

The class looks for pages with text that match a given regular expression.
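
The matching step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the package's PHP code; the function name `page_matches` and the case-sensitive default are assumptions made for illustration.

```python
import re


def page_matches(html, pattern, ignore_case=False):
    """Return True if the regular expression occurs anywhere in the page.

    Hypothetical helper: the name and the ignore_case option are
    illustrative and are not taken from the package.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, html, flags) is not None
```

A plain string such as `Iran` is itself a valid regular expression, so the same call covers both simple text searches and real patterns.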

Innovation Award
PHP Programming Innovation award nominee, June 2008, Number 3
Prize: One book of choice by Apress

When you need to provide a search engine for a site, it is usually better to have a crawler program retrieve the site contents and index them in a database, so they can be searched more easily.

However, when you need to search a site that you do not control or that was not indexed by a search engine, you need to crawl and search on demand.

This class provides an on-demand solution to crawl and search text in the pages of a given site.

Manuel Lemos
Name: Nadir Latif (available for paid consulting)
Classes: 14 packages
Country: Pakistan
Age: 32
All time rank: 841 in Pakistan
Week rank: 49 (up), 2 in Pakistan (down)
Innovation award: nominee 9x, winner 1x

Details provided by the author  
Made by: Nadir Latif (nadir.latif@yahoo.com)

Dependencies: None.

This script is a web crawler that lets users search for text inside web pages using regular expressions. The crawler starts from a given page and performs a breadth-first search of the links it finds there, down to a depth that the user specifies. The user can also optionally restrict the crawl to links with a given text, e.g. links called "Next". For example, a search performed on a web site can return many pages of results; a user who wants to know which of those pages contain a certain text can use this application instead of manually clicking the "Next" link on each page. The script can easily be extended to process pages in other ways.

1) Usage:

- Copy the files to a directory served by a web server and open index.php in a browser. Enter the following:

   - In the Enter URL field, enter the URL of the page from which the search should begin (e.g. http://search.yahoo.com/search?q=iran).
   - In the Enter Regex field, enter the regular expression to match. It can simply be a text string that should occur in a page (e.g. Iran).
   - In the Enter Link to Search field, enter the text of the hyperlink that should be followed to reach the next page. This is the text between the <a> and </a> tags (e.g. Next >).
   - In the Enter Search Depth field, enter the depth to which the search should be carried out.
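
To make the "Enter Link to Search" field concrete, here is a minimal Python sketch of picking out links by their visible text. It is not the package's PHP code: the names `LinkCollector` and `extract_links` are hypothetical, and comparing the link text by exact equality is an assumption, since the readme does not say how the text is compared.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs from <a href="..."> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url, link_text=None):
    """Return link URLs; if link_text is given, keep only links with that text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [url for url, text in parser.links
            if link_text is None or text == link_text]
```

Because the parser decodes character references, a link written in the page source as `Next &gt;` is compared as `Next >`, which is the form a user would type into the field.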

2) What does this script do?

The script first retrieves the specified page and parses out all hyperlinks on it. Links whose text matches the text specified by the user (in the "Enter Link to Search" text box) are placed in a FIFO queue; if no link text is specified, all links are placed in the queue. The first link in the queue is then retrieved: the page is downloaded and its content is matched against the regular expression entered by the user. If there is a match, a link to that page is displayed in the browser. All links in the downloaded page (or only those that match the link text) are then placed in the queue. This process is repeated until the specified depth is reached.
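
The procedure described above can be sketched as a breadth-first crawl over a FIFO queue. This is a minimal Python illustration under stated assumptions, not the package's PHP implementation: the function `crawl`, the injected `fetch` callable, the simple `HREF_RE` link pattern, the `seen` set that skips already-queued URLs, and the choice to test the start page itself against the regular expression are all assumptions made for the example.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplified pattern for <a href="...">text</a>; real HTML needs a proper parser.
HREF_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def crawl(start_url, pattern, max_depth, fetch, link_text=None):
    """Breadth-first search for pages whose content matches `pattern`.

    fetch(url) must return the page HTML as a string, or None on failure.
    The start page has depth 0; links are followed down to max_depth.
    """
    regex = re.compile(pattern)
    queue = deque([(start_url, 0)])   # FIFO queue of (url, depth)
    seen = {start_url}                # assumption: never queue a URL twice
    matches = []

    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        if regex.search(html):
            matches.append(url)
        if depth >= max_depth:
            continue                  # depth limit reached: do not expand
        for href, text in HREF_RE.findall(html):
            if link_text is not None and text.strip() != link_text:
                continue              # keep only links with the requested text
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return matches
```

Passing `fetch` as a parameter keeps the sketch independent of any particular HTTP client; for a quick experiment it can be the `get` method of a dictionary that maps URLs to HTML strings.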

3) List of files:

a) index.php (initial file)
b) link_searcher.php (main program file)
c) queue.php (used to store the links in a page)
d) readme.txt (help file)

-Feel free to contact me for any assistance regarding this script.

Files
File               Role     Description
index.php          Example  initial file
link_searcher.php  Class    main program file
queue.php          Class    used to store the links in a page
readme.txt         Doc.     help file
LICENSE.txt        Doc.     Documentation
