Jawa: Web Archival in the Era of JavaScript

(Under submission), 2021

Ayush Goel, Jingyuan Zhu, Harsha V. Madhyastha, Ravi Netravali.

Abstract

By repeatedly crawling and saving web pages over time, web archival systems (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the increasing presence of JavaScript on the web. Some archives store petabytes of JavaScript code that does not function, results in erroneous executions, or never gets used. Other archives instead store the end-state of page loads (e.g., screen captures) but, as a result, break post-load interactions implemented in JavaScript.

To address these problems, we present Jawa, a new design for web archives that significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to Jawa’s design are our techniques for (a) efficiently identifying the subset of JavaScript code that needs to be retained when archiving a page, and (b) storing modified JavaScript files in a manner that preserves the benefits of deduplication. On a corpus comprising 350K archived pages, Jawa reduces overall storage needs by 36% and eliminates all failed resource fetches on the median page, when compared to the techniques currently used by the Internet Archive.