Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL
March 30, 2007
yyztech.ca/articles...
While the Web browser has been the primary way of traversing the Internet since the early 1990s, automated browsing via a Web bot or spider opens up exciting new ways of using freely available information.
This book is broken down into four sections: an introduction to Web bots and how they work, a series of example projects, advanced techniques, and a section on designing well-behaved Web bots, including some legal issues.
Since Web bots, by their nature, go out into the wild and interact with computers that are not your own, these "larger issues" are important to note, and the author touches on them several times in the book.
Readers should be aware that the book's code requires the cURL extension, which is not always part of a default PHP install. Those without it might want to explore the Zend Framework's HTTP class or PEAR's equivalent.
The first section explains what Web bots are and what kind of things you might do with them. For a technical book, there is also a fair amount of talk of business applications for Web bots, which is good to see.
Readers will also notice that most of the development is done with Windows in mind. Since the author mentions early on putting older PCs to work running Web bots, this makes sense.
The first section, "Fundamental concepts and techniques", covers the basics. It starts with an introduction in which the author recounts discovering that he did not need a browser to view Web pages and could instead use Telnet, software that hearkens back to the bulletin board systems of the 1980s.
The section then moves on to what Web bots are, ideas for projects, how to download a Web page using PHP and cURL, parsing, form submission and managing large amounts of data.
A note here: apart from the early ones, most of the examples use the cURL extension, whose installation is briefly covered. As well, a lot of the code depends on a set of libraries provided at the book's Web site, which hide a lot of the low-level work.
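To give a feel for the kind of low-level work such libraries hide, here is a minimal sketch of downloading a page with PHP/cURL and pulling the links out of it. It is not taken from the book or its libraries: the function names `fetch_page` and `extract_links`, the user-agent string and the timeout are all assumptions made for illustration.

```php
<?php
// Hypothetical helpers for illustration only; they are not the book's library.

// Download a page and return its body, or null if the request fails.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // The cURL extension is not part of every PHP install, so check first.
    if (!extension_loaded('curl')) {
        throw new RuntimeException('The cURL extension is not enabled.');
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'example-webbot/0.1', // assumed name
    ]);

    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Return the non-empty href values of all <a> tags, in document order.
function extract_links(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world markup is often malformed; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}
```

A call such as `extract_links(fetch_page('http://www.example.com/') ?? '')` would then return the links found on that page, or an empty array if the download failed.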
The second section is about projects. Those demonstrated include price-monitoring Web bots, image-capture Web bots, link-verification Web bots, anonymous browsing, search ranking, aggregation Web bots, FTP bots, NNTP bots, and bots that read and send e-mail.
As the author stated in the introduction, he has been writing Web bots for years, and here you get to see his range of experience.
Part three, "Advanced technical considerations", builds on the projects, covering spiders, procurement Web bots and snipers, Web bots and cryptography, authentication, advanced cookie management, and scheduling Web bots and spiders.
The fourth part, "Larger considerations", is the one that some readers might skip. However, since Web bots and spiders operate "in the wild", it is an important part of the book.
Topics include: designing stealthy Web bots and spiders, writing fault tolerant Web bots, designing Web bot friendly Web sites, killing spiders, and keeping Web bots out of trouble.
The book closes with three appendices: a PHP/cURL reference, HTTP and NNTP status codes, and SMS e-mail addresses for a number of Canadian, U.S., European and Asian carriers.
A quick search of the Internet shows that there are not a lot of recent books written on Web bots, especially on designing them, so this book fills the gap nicely. It is well written, has lots of practical examples and does not ignore the larger ethical and legal issues. Recommended.
While the expression "Web bot" might bring some kind of shady software to mind for some readers, this book explains that Web bots are, when used properly, just another, possibly profitable, way of using the Internet.