We are increasingly relying on the internet, and specifically the World Wide Web (WWW), to
exchange information and access services. Despite its ubiquitous use, there are two key
barriers to accessing information shared on the web: 1) Many web pages suffer from
poor performance, both in end-user loading latency and in crawling throughput as
observed by large-scale web crawlers. 2) Many web pages cease to exist over time, causing a
significant fraction of published information to no longer be available.
My dissertation addresses these issues by employing fine-grained data-flow and control-flow
analysis of web computations, specifically JavaScript execution. Using this analysis, I am
able to extract and modify JavaScript runtime behavior during web page loads, and I leverage
this ability to build a number of web systems. First, I propose a client-side computation
caching system that stores results of JavaScript (JS) execution to reduce compute delays
and improve web page load times. I show that up to 85% of JavaScript runtime can be
skipped by using such a computation cache. Second, I demonstrate that legacy JavaScript
code has untapped potential for parallelization across multiple cores of modern smartphones
to improve page load times. I show that 88% speedup in JS execution can be achieved
by parallelizing execution on 8 cores of a given mobile device. Third, I built Sprinter, a
distributed web crawler that crawls the web at 5 times the rate of traditional browser-based
crawlers while preserving perfect fidelity. Sprinter accomplishes this by carefully selecting a
subset of the pages on each site, crawling that subset with a browser, and caching the
corresponding compute. It then crawls the remaining pages on the site without a
browser, reusing those cached computations. Finally, I built Jawa, a web archival crawler that
reduces the storage overhead of web archives by 41% while eliminating all fidelity issues.
Jawa accomplishes this by exploiting the differences between live and archived pages, and
accurately identifying and patching the sources of non-determinism that impair JavaScript
execution on archived pages.
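To illustrate the kind of patching this involves, consider a minimal sketch (not Jawa's actual implementation): if a crawler logs the values returned by non-deterministic APIs such as Date.now and Math.random during the original page load, the archive can inject a shim into the archived page that replays those logged values, so scripts that derive URLs or layout decisions from them behave exactly as they did on the live page. The `recorded` log below is hypothetical example data.

```javascript
// Hypothetical recorded log from the original crawl (illustrative values).
const recorded = {
  dateNow: [1650000000000, 1650000000123], // Date.now() values seen at crawl time
  random: [0.42, 0.17],                    // Math.random() values seen at crawl time
};

// Build a function that replays logged values in order, then falls back
// to the original API once the log is exhausted.
function makeReplay(values, fallback) {
  let i = 0;
  return () => (i < values.length ? values[i++] : fallback());
}

const origNow = Date.now.bind(Date);
const origRandom = Math.random.bind(Math);
Date.now = makeReplay(recorded.dateNow, origNow);
Math.random = makeReplay(recorded.random, origRandom);

// Any script on the archived page now observes the same sequence of
// values the live page saw, so computations that depend on them
// (e.g., cache-busting query strings) resolve identically.
```

A real archival crawler would also need to cover other non-deterministic sources (e.g., timers and network-dependent state), but the replay-from-log pattern is the same.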