Crawling the web at scale forms the basis of many
important systems: web search engines, smart assistants, generative AI, web archives, and so on. Yet, the research community
has paid little attention to this workload in the last decade. In
this paper, we highlight the need to revisit the notion that web
crawling is a solved problem. Specifically, to discover and fetch
all page resources dependent on JavaScript and modern web
APIs, crawlers today have to employ compute-intensive web
browsers. This significantly inflates the scale of the infrastructure necessary to crawl pages at high throughput.
To make web crawling more efficient without any loss of
fidelity, we present Sprinter, which combines browser-based
and browserless crawling to get the best of both. The key to
Sprinter’s design is our observation that crawling workloads
typically include many pages per site and, unlike in traditional
user-facing page loads, there is significant potential to reuse
client-side computations across pages. Taking advantage of this
property, Sprinter crawls a small, carefully chosen subset of
pages on each site using a browser, and then efficiently identifies
and exploits opportunities to reuse the browser’s computations
on other pages. Sprinter crawled a corpus of 50,000
pages 5x faster than browser-based crawling, while still closely
matching a browser in the set of resources fetched.
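
The reuse idea described above can be illustrated with a minimal sketch. This is not Sprinter's actual implementation; it is a toy model under strong assumptions that we state explicitly. We assume each page on a site is summarized by the list of scripts it statically references, that executing a script in a browser yields a set of resource URLs, and that this set does not depend on the page that includes the script. The names `hybrid_crawl`, `site_pages`, and `run_in_browser` are hypothetical. The page-selection step uses a simple greedy cover heuristic, which is our stand-in for "carefully chosen" and may differ from the selection strategy Sprinter uses.

```python
from typing import Callable, Dict, List, Set, Tuple


def hybrid_crawl(
    site_pages: Dict[str, List[str]],
    run_in_browser: Callable[[str], Set[str]],
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Toy hybrid crawl of one site.

    site_pages maps a page URL to the scripts it references (hypothetical
    input format). run_in_browser stands in for expensive browser execution
    of one script and returns the resources that script fetches.
    Returns (resources per page, pages that were crawled with the browser).
    """
    cache: Dict[str, Set[str]] = {}  # script URL -> resources it fetched
    browser_pages: List[str] = []

    # Step 1: greedily pick pages for the browser until every script on
    # the site has been executed once (assumed selection heuristic).
    uncovered = {s for scripts in site_pages.values() for s in scripts}
    while uncovered:
        best = max(site_pages, key=lambda p: len(uncovered & set(site_pages[p])))
        browser_pages.append(best)
        for script in site_pages[best]:
            if script not in cache:
                cache[script] = run_in_browser(script)
        uncovered -= set(site_pages[best])

    # Step 2: resolve every page's resources from the cache. Pages not in
    # browser_pages are handled without a browser, reusing earlier results.
    resources = {
        page: set().union(*(cache[s] for s in scripts))
        for page, scripts in site_pages.items()
    }
    return resources, browser_pages
```

In this model, a site whose pages share a handful of scripts needs the browser for only a few pages, and the remaining pages are resolved from cached results. A real crawler would additionally have to detect when a script's behavior depends on page-specific state and fall back to the browser in those cases, which this sketch does not attempt.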