Parsing Scraped Web Pages using Goutte
So you've managed to scrape some web pages. That's great, but all it gives you is a bunch of raw HTML, which by itself is not very useful and certainly not what you want as a final result. What you're really after is the structured data contained in those HTML pages. This is where Goutte comes in: Goutte is essentially a thin wrapper which fetches web pages using Guzzle and returns a Symfony DomCrawler instance.
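To give you an idea of what that means in practice, here is the whole pattern in a nutshell (this assumes you've installed Goutte, typically via composer require fabpot/goutte; the URL is just a placeholder):
<?php

use Goutte\Client;

// The core idea: Goutte fetches the page over HTTP (via Guzzle) and hands
// back a Symfony DomCrawler which you can then query for the data you want
$client  = new Client();
$crawler = $client->request('GET', 'https://www.example.com');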
Fetching Web Pages using Guzzle
To reach our goal of extracting structured data from the scraped HTML pages, the first step is some code which fetches a web page and returns a Goutte crawler instance. To keep the sample code as simple as possible, I'll show you code which does sequential (i.e. non-async) page fetching. With that in mind, our page fetching code looks like this:
<?php

use Goutte\Client as GoutteClient;
use GuzzleHttp\Client as GuzzleClient;
use Symfony\Component\DomCrawler\Crawler;

// Small helper class wrapping the fetch logic (the class name is illustrative)
class PageFetcher
{
    public static function fetchPage(string $url, array $headers = []): Crawler
    {
        $goutteClient = new GoutteClient();

        // Forward any custom request headers (e.g. a User-Agent) to the client
        foreach ($headers as $name => $value) {
            $goutteClient->setHeader($name, $value);
        }

        // Tell Goutte to use Guzzle as its underlying HTTP transport
        $guzzleClient = new GuzzleClient();
        $goutteClient->setClient($guzzleClient);

        // Fetch the page; Goutte returns a DomCrawler for the response body
        $domCrawler = $goutteClient->request('GET', $url);

        $status = $goutteClient->getResponse()->getStatus();
        if ($status != 200) {
            throw new \Exception("Return code [$status] received for site [$url]");
        }

        return $domCrawler;
    }
}
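A call to this helper could then look something like the following (the PageFetcher class name, the URL and the headers are just illustrative):
<?php

// Illustrative usage: fetch a page while sending a custom User-Agent header
$crawler = PageFetcher::fetchPage('https://www.example.com/providers', [
    'User-Agent' => 'MyScraper/1.0',
]);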
Parsing Scraped HTML using Goutte/DomCrawler
As you can see, the above code fetches an HTML page and returns a Symfony DomCrawler instance. Using this crawler, you can now parse the fetched HTML with CSS selectors, which is quite convenient because just about everybody knows CSS selectors (and if you don't, don't worry: they're very easy to pick up). In practice, our code to extract specific DOM nodes looks like this:
<?php

use Symfony\Component\DomCrawler\Crawler;

// Small helper class wrapping the parse logic (the class name is illustrative)
class PageParser
{
    public static function parsePage(Crawler $crawler): array
    {
        $result = [];

        // Page title(s): the text of every <h1> on the page
        $result['pageTitles'] = $crawler->filter('h1')->each(function ($node) {
            return trim($node->text());
        });

        // Provider links: the URI of every link in the item list
        $result['providerLinks'] = $crawler->filter('#main > .itemList > a')->each(function ($node) {
            return $node->link()->getUri();
        });

        // Provider names: the link text of those same anchors
        $result['providerNames'] = $crawler->filter('#main > .itemList > a')->each(function ($node) {
            return trim($node->text());
        });

        // Provider images: the URI of every logo image in the item list
        $result['providerImages'] = $crawler->filter('#main > .itemList > .logo')->each(function ($node) {
            return $node->image()->getUri();
        });

        // Provider ratings: a specific cell inside each item's data table
        $result['providerRatings'] = $crawler->filter('#main > .itemList > .dataTable > tr:nth-child(4) > td:nth-child(2)')->each(function ($node) {
            return trim($node->text());
        });

        return $result;
    }
}
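Putting both pieces together, an end-to-end scrape-and-parse run might look roughly like this (class names and URL are again illustrative):
<?php

// Illustrative end-to-end run: fetch the page, then extract the structured data
$crawler = PageFetcher::fetchPage('https://www.example.com/providers');
$data    = PageParser::parsePage($crawler);

// Every entry is an array, even when the selector matched only one node
echo $data['pageTitles'][0] . PHP_EOL;
print_r($data['providerNames']);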
Things to keep in mind
- As you can see, the above code extracts specific HTML nodes using CSS selectors.
- If you need more advanced matching than CSS selectors offer, you can use XPath expressions instead; see the short sketch after this list and the DomCrawler documentation.
- With the approach shown above, the result of each filter is stored in an array, even if it only contains one item. The reason is that the DomCrawler cannot know beforehand how many nodes a given selector will match, so storing all results in arrays keeps things consistent.
- As the above code shows, there are several methods for accessing a node's content, depending on whether the node contains text, a link, an image, etc. Consult the DomCrawler docs for more info.
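As a quick illustration of the XPath route, the sketch below does roughly the same as the providerLinks filter above, but via filterXPath() (the selector is only an assumption about the page structure):
<?php

// Roughly equivalent to the '#main > .itemList > a' CSS filter above,
// expressed as an XPath query instead (selector is illustrative)
$providerLinks = $crawler->filterXPath('//*[@id="main"]/*[contains(@class, "itemList")]/a')->each(function ($node) {
    return $node->link()->getUri();
});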
Goutte is a nice wrapper which makes web scraping a little easier than wrangling all the libraries it leverages yourself. The most important component for data extraction is the DomCrawler, which is a very flexible and powerful package. If you want to do more than the most trivial scraping, you're well advised to read the DomCrawler documentation (preferably more than once) so you know what tricks it is capable of. Or just start building stuff and consult the manual when you get stuck (which is what I usually do) ...