Zend Framework Blog: Scrape Screens with zend-dom
The Zend Framework blog has posted another tutorial covering one of the components that make up the framework. In this latest tutorial, Matthew Weier O’Phinney focuses on the zend-dom component and how to use it to scrape content from remote sources.
Even in this day-and-age of readily available APIs and RSS/Atom feeds, many sites offer none of them. How do you get at the data in those cases? Through the ancient internet art of screen scraping.
The problem then becomes: how do you get at the data you need in a pile of HTML soup? You could use regular expressions or any of the various string functions in PHP. All of these are easily subject to error, though, and often require some convoluted code to get at the data of interest.
[…] zend-dom provides CSS selector capabilities for PHP, via the Zend\Dom\Query class. […] While it does not implement the full spectrum of CSS selectors, it does provide enough to generally allow you to get at the information you need within a page.
He gives an example of it in use, showing how to grab a navigation list (the items in a <ul> tag) from the Zend Framework documentation site. He also suggests other uses for the tool, such as testing your application by checking a page's content without having to hard-code specific strings.
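As a rough illustration of the approach (not the code from the tutorial itself), here is a minimal sketch that pulls link text and URLs out of a navigation list with Zend\Dom\Query. It assumes zend-dom has been installed with Composer; the extractNavLinks() helper, the sample markup, and the "ul.nav li a" selector are all hypothetical, and a real page would need a selector matched to its own markup.

```php
<?php
// Assumption: zend-dom is installed, e.g. `composer require zendframework/zend-dom`.
require 'vendor/autoload.php';

use Zend\Dom\Query;

/**
 * Hypothetical helper: return the text and href of every element
 * matched by a CSS selector in the given HTML string.
 */
function extractNavLinks(string $html, string $selector): array
{
    $query = new Query($html);
    $links = [];

    // execute() returns an iterable node list; each match is a DOMElement.
    foreach ($query->execute($selector) as $node) {
        $links[] = [
            'text' => trim($node->textContent),
            'href' => $node->getAttribute('href'),
        ];
    }

    return $links;
}

// Sample markup standing in for a fetched page (made up for this sketch).
$html = <<<'HTML'
<html><body>
<ul class="nav">
  <li><a href="/intro/">Introduction</a></li>
  <li><a href="/query/">Querying HTML and XML Documents</a></li>
</ul>
</body></html>
HTML;

foreach (extractNavLinks($html, 'ul.nav li a') as $link) {
    printf("%s => %s\n", $link['text'], $link['href']);
}
```

To scrape a live page, the $html string could be fetched first (for example with file_get_contents()), and the same helper could back a test assertion that a page contains an expected link.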