We are increasingly relying on the internet, and specifically the World Wide Web (WWW), to
exchange information and access services. Despite its ubiquitous use, there are two key
barriers to accessing information shared on the web: 1) Many web pages suffer from
poor performance, both in end-user loading latency and in crawling throughput as
observed by large-scale web crawlers. 2) Many web pages cease to exist over time, causing a
significant fraction of published information to no longer be available.
My dissertation addresses these issues by employing fine-grained data-flow and control-flow
analysis of web computations, specifically JavaScript execution. Using this analysis, I am
able to extract and modify JavaScript runtime behavior during web page loads, and I leverage
this ability to build a number of web systems. First, I propose a client-side computation
caching system that stores results of JavaScript (JS) execution to reduce compute delays
and improve web page load times. I show that up to 85% of JavaScript runtime can be
skipped by using such a computation cache. Second, I demonstrate that legacy JavaScript
code has untapped potential for parallelization across multiple cores of modern smartphones
to improve page load times. I show that 88% speedup in JS execution can be achieved
by parallelizing execution on 8 cores of a given mobile device. Third, I built Sprinter, a
distributed web crawler that crawls the web at 5 times the rate of traditional browser-based
crawlers while preserving perfect fidelity. Sprinter accomplishes this by carefully selecting a
subset of the pages on each site, crawling that subset with a browser, and caching the
corresponding compute. It then crawls the remaining pages on the site without a
browser, reusing those cached computations. Finally, I built Jawa, a web archival crawler that
reduces the storage overhead of web archives by 41% while eliminating all fidelity issues.
Jawa accomplishes this by exploiting the differences between live and archived pages, and
accurately identifying and patching the sources of non-determinism that impair JavaScript
execution on archived pages.
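To illustrate the kind of patching this involves, consider a minimal sketch (not Jawa's actual implementation): if a crawler logs the values returned by non-deterministic APIs such as Date.now and Math.random during the original page load, the archive can inject a shim into the archived page that replays those logged values, so scripts that derive URLs or layout decisions from them behave exactly as they did on the live page. The `recorded` log below is hypothetical example data.

```javascript
// Hypothetical recorded log from the original crawl (illustrative values).
const recorded = {
  dateNow: [1650000000000, 1650000000123], // Date.now() values seen at crawl time
  random: [0.42, 0.17],                    // Math.random() values seen at crawl time
};

// Build a function that replays logged values in order, then falls back
// to the original API once the log is exhausted.
function makeReplay(values, fallback) {
  let i = 0;
  return () => (i < values.length ? values[i++] : fallback());
}

const origNow = Date.now.bind(Date);
const origRandom = Math.random.bind(Math);
Date.now = makeReplay(recorded.dateNow, origNow);
Math.random = makeReplay(recorded.random, origRandom);

// Any script on the archived page now observes the same sequence of
// values the live page saw, so computations that depend on them
// (e.g., cache-busting query strings) resolve identically.
```

A real archival crawler would also need to cover other non-deterministic sources (e.g., timers and network-dependent state), but the replay-from-log pattern is the same.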