Parallel/Async Web Scraping using Guzzle
Web scraping is both useful and cool: it allows you to programmatically acquire content from pretty much any public web page. Whether you're building a content aggregator, a comparison platform, a specialized search engine or anything else that requires access to a large volume of public data, web scraping is a great way to get the data you need.
Sequential Scraping
The problem is that the naive approach - sequential scraping - is slow. You request a page, wait for the reply and then move on to the next page. If we assume this takes 1 second per page on average, we can scrape 3,600 pages per hour, or 86,400 pages per day. While 86k/day may not seem bad, it's also not very good: for most subject areas it's easy to find orders of magnitude more pages than our anemic 86k/day limit can cover.
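To make the comparison concrete, the sequential approach looks something like the sketch below. This is purely illustrative and not code from the project described later; the fetchPagesSequential name and the use of Guzzle's blocking get() call are my own assumptions here.
<?php
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Exception\TransferException;

// Hypothetical sketch of sequential scraping: one request at a time,
// each blocking for the full round-trip of its page.
function fetchPagesSequential(array $urls): array
{
    $client = new GuzzleClient();
    $result = [];
    foreach ($urls as $index => $url) {
        try {
            // Blocks until this page's response has fully arrived.
            $result[$index] = (string) $client->get($url)->getBody();
        } catch (TransferException $ex) {
            // Record failures as null so the caller can skip them.
            $result[$index] = null;
        }
    }
    return $result;
}
The total runtime here is simply the sum of all individual request times, which is exactly what the async approach below avoids.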
This is no reason to despair, because (of course) there are ways around it: you could run multiple instances of your scraper, or scale out further across multiple machines, but that complicates things quite a bit. There is, however, an easier way: async scraping.
Async/Parallel Scraping
When using async scraping, you issue multiple page fetch requests at the same time and then process the returned data as it arrives. Since most of the time spent during scraping is spent waiting for the remote server's response, it makes perfect sense to issue multiple requests simultaneously and handle the results as they become available. This works really well; the only difference from sequential scraping is that the results will not come back in the same order in which the requests were issued. That is why the full example further below uses the $index variable to store each response at its appropriate offset in the $result array (the same technique is used for failures, which are handled in the rejected callback).
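Before the full example, here is what a single asynchronous request with a Guzzle promise looks like. This is a minimal sketch of my own; the URL is a placeholder.
<?php
use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Exception\TransferException;
use Psr\Http\Message\ResponseInterface;

$client = new GuzzleClient();

// getAsync() returns immediately with a promise instead of blocking
// until the response arrives.
$promise = $client->getAsync('https://example.com')->then(
    // Runs once the response has arrived.
    function (ResponseInterface $response) {
        print strlen((string) $response->getBody()) . " bytes fetched\n";
    },
    // Runs if the request fails.
    function (TransferException $ex) {
        print 'Request failed: ' . $ex->getMessage() . "\n";
    }
);

// wait() blocks until the request completes and the callbacks have run.
$promise->wait();
The interesting part is combining many such promises with a concurrency limit, which is what the example below does with EachPromise.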
If you've done a reasonable amount of JavaScript development, you're in luck: you're already familiar with async processing and promises, and the code below will look relatively familiar. If you've only done PHP and have never seen this type of code before, your first reaction might be that it looks rather confusing. Don't worry: once you understand the basic flow, it's actually quite easy to apply it to your scraping code. A basic example of this technique is shown below; we use Guzzle rather than cURL because Guzzle ships with promise support out of the box.
<?php

use GuzzleHttp\Client as GuzzleClient;
use GuzzleHttp\Exception\TransferException;
use GuzzleHttp\Promise\EachPromise;
use Psr\Http\Message\ResponseInterface;

/**
 * Fetch all $urls concurrently, with at most $maxConcurrency requests
 * in flight at any time. Returns an array of page bodies indexed by the
 * position of the corresponding URL in $urls (null on failure).
 */
function fetchPagesAsync(array $urls, int $maxConcurrency = 10, bool $debug = false): array
{
    $result = [];
    $client = new GuzzleClient();

    // Lazily yield one promise per URL; EachPromise pulls from this
    // generator as slots free up, so no more than $maxConcurrency
    // requests are open at once.
    $promises = (function () use ($urls, $client) {
        foreach ($urls as $url) {
            yield $client->getAsync($url);
        }
    })();

    (new EachPromise($promises, [
        'concurrency' => $maxConcurrency,
        'fulfilled' => function (ResponseInterface $response, int $index) use (&$result, $urls, $debug) {
            // Responses arrive out of order; $index maps each one back
            // to its original position in $urls.
            $result[$index] = (string) $response->getBody();
            if ($debug) {
                $_url = $urls[$index];
                print "ASYNC Fetch SUCCESS: $index: $_url\n";
            }
        },
        // TransferException covers both HTTP errors and connection
        // failures across Guzzle versions.
        'rejected' => function (TransferException $ex, int $index) use (&$result, $urls, $debug) {
            $result[$index] = null;
            if ($debug) {
                $_url = $urls[$index];
                print "ASYNC Fetch FAILURE: for $_url: " . $ex->getMessage() . ' ' . $ex->getCode() . "\n";
            }
        },
    ]))->promise()->wait();

    return $result;
}
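To give an idea of how this is typically used, here is a short hypothetical usage sketch. The URLs are placeholders, and the parsing step with Symfony's DomCrawler is an assumption about what you might do with the fetched HTML, not part of the function above.
<?php
use Symfony\Component\DomCrawler\Crawler;

// Placeholder URLs purely for illustration.
$urls = [
    'https://example.com/page-1',
    'https://example.com/page-2',
    'https://example.com/page-3',
];

// Fetch up to 10 pages concurrently, with debug output enabled.
$pages = fetchPagesAsync($urls, 10, true);

foreach ($pages as $index => $html) {
    if ($html === null) {
        // The request for this URL failed; skip it.
        continue;
    }
    // Parse the fetched HTML and extract the page title.
    $title = (new Crawler($html))->filterXPath('//title')->text();
    print "{$urls[$index]} => {$title}\n";
}
Because $pages is indexed by the position of each URL in $urls, pairing a result with the URL it came from is trivial even though the responses arrived in an arbitrary order.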
At a previous job, at an employer that shall not be named, this technique let us scrape 700K pages per day, in just a few hours, on a single machine without running into any resource limits or performance issues. So if you have a non-trivial scraping volume to process, the technique described here can be extremely useful: it lets you process far more data with a simple, single-machine setup instead of forcing you to run multiple instances or multiple machines. There is of course a limit to how far this approach will scale, but you might be pleasantly surprised by how far it can take you.