WACZ (Web Archive Collection Zipped) is an open specification developed by Webrecorder for packaging web archives in a portable, replayable, tamper-evident format. WACZ is used by the Internet Archive, the Library of Congress, and major eDiscovery and digital-preservation platforms. Storing PageCrawl archives in WACZ means they are interoperable with the wider archival ecosystem.
This article explains what's inside a PageCrawl WACZ, what the embedded signature does, and why we ship additional sidecar proofs alongside.
Inside the WACZ zip
A WACZ file is a zip archive with a defined internal structure:
- archive/data.warc.gz: the WARC (Web ARChive) file containing the captured HTTP responses (HTML, images, scripts, stylesheets, linked PDFs, etc.) in their original byte form.
- pages/pages.jsonl: a list of pages captured, one JSON object per line, with the URL, timestamp, and title.
- datapackage.json: a manifest listing every file inside the archive along with its size, mime type, and SHA-256 hash. This is the canonical integrity manifest.
- datapackage-digest.json: a SHA-256 hash of datapackage.json itself, plus an optional signedData block (WACZ Auth specification).
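The hashes in datapackage.json chain into datapackage-digest.json, which either carries a WACZ Auth signedData block or stores the hash unsigned. Modifying any byte of a captured resource changes its hash so that it no longer matches the manifest, and modifying the manifest breaks the digest. The archive is therefore structurally tamper-evident; the signature is what prevents someone from quietly recomputing the whole chain.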
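The hash chain described above can be checked with a few lines of Python. This is a minimal sketch, not PageCrawl's own tooling: it assumes datapackage.json holds a "resources" array whose entries carry "path" and a "hash" of the form "sha256:<hex>", and that datapackage-digest.json carries a "hash" field in the same form. The function name verify_wacz_manifest is hypothetical.

```python
import hashlib
import json
import zipfile


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_wacz_manifest(wacz) -> list[str]:
    """Return a list of problems; an empty list means every hash matched.

    `wacz` is a file path or a binary file-like object.
    """
    problems = []
    with zipfile.ZipFile(wacz) as zf:
        manifest_bytes = zf.read("datapackage.json")
        manifest = json.loads(manifest_bytes)

        # Assumed layout: {"resources": [{"path": ..., "hash": "sha256:<hex>"}]}
        for resource in manifest.get("resources", []):
            path = resource["path"]
            algo, _, expected = resource["hash"].partition(":")
            if algo != "sha256":
                problems.append(f"{path}: unsupported hash algorithm {algo!r}")
                continue
            try:
                actual = sha256_hex(zf.read(path))
            except KeyError:
                problems.append(f"{path}: listed in manifest but missing from zip")
                continue
            if actual != expected:
                problems.append(f"{path}: hash mismatch")

        # Assumed layout: {"path": "datapackage.json", "hash": "sha256:<hex>"}
        digest = json.loads(zf.read("datapackage-digest.json"))
        if digest["hash"] != "sha256:" + sha256_hex(manifest_bytes):
            problems.append("datapackage.json: does not match datapackage-digest.json")
    return problems
```

Calling verify_wacz_manifest("archive.wacz") returns an empty list for an intact archive. The sketch only checks internal consistency; it does not validate the signedData block or any timestamp.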
The embedded signedData block (WACZ Auth)
When you enable WACZ capture on an Ultimate-plan page, the archive includes an embedded canonical signature following the WACZ Auth specification 0.1.0. The signature lives inside datapackage-digest.json and contains:
- The cryptographic signature of datapackage.json's hash.
- An RFC 3161 timestamp issued by a Trust Service Provider.
- The signing service's domain certificate, proving its identity.
This is the WACZ-spec-compliant way to sign a WACZ archive. When a WACZ-aware tool (such as ReplayWeb.page) opens the archive, it reads the signedData block and renders an integrity badge if the signature validates. The badge tells a reviewer that the archive was signed by the named domain at the indicated time.
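To see what a WACZ-aware tool reads before it renders the badge, the signedData block can be pulled out of the archive directly. The sketch below is illustrative only: the field names "domain" and "created" are assumptions about the block's layout, the function names are hypothetical, and nothing here validates the signature, the timestamp, or the certificate.

```python
import json
import zipfile


def read_signed_data(wacz):
    """Return the signedData object, or None when the archive is unsigned.

    `wacz` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz) as zf:
        digest = json.loads(zf.read("datapackage-digest.json"))
    return digest.get("signedData")


def describe_signed_data(signed):
    """Summarize a signedData object without validating it."""
    if signed is None:
        return "unsigned archive"
    # "domain" and "created" are assumed field names; .get() tolerates absence.
    domain = signed.get("domain", "unknown domain")
    created = signed.get("created", "unknown time")
    return f"signed by {domain} at {created}"
```

For actual validation, rely on a tool that implements the WACZ Auth checks, such as ReplayWeb.page.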
Sidecar proof files
Alongside the WACZ, PageCrawl retains additional proof files that don't fit inside the WACZ Auth spec:
- archive.wacz.ots: OpenTimestamps proof, anchored to the Bitcoin blockchain.
- archive.wacz.digicert.tsr: RFC 3161 timestamp from DigiCert (an Adobe Approved Trust List Trust Service Provider).
- archive.wacz.sectigo.tsr: RFC 3161 timestamp from Sectigo (also an AATL TSP).
- archive.wacz.qtsa.tsr (Custom plans): eIDAS qualified RFC 3161 timestamp from a Qualified Trust Service Provider.
The WACZ Auth spec supports only one embedded signature, so additional providers ship as sidecar files. Sidecars do not violate the WACZ format spec; they live in the same directory and are independent artefacts. Each sidecar is verifiable with public tooling (ots verify for the OpenTimestamps proof; openssl ts -reply -in to inspect an RFC 3161 response and openssl ts -verify to check it against the archive) without unpacking or modifying the WACZ.
This dual approach combines a spec-compliant embedded signature for WACZ-aware tooling with multi-provider redundancy for evidentiary depth.
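The sidecar checks can be scripted. The helper below is a sketch that only builds the command lines for the two public tools; it assumes the OpenTimestamps client (ots) and OpenSSL are installed, and that you have a CA bundle for the relevant timestamp authority. The path ca.pem and the function name verification_command are hypothetical.

```python
def verification_command(wacz_path: str, sidecar_path: str, ca_bundle: str) -> list[str]:
    """Build the argv for verifying one sidecar proof against a WACZ file."""
    if sidecar_path.endswith(".ots"):
        # ots infers the target file by stripping the .ots suffix,
        # so the WACZ must sit next to the proof.
        return ["ots", "verify", sidecar_path]
    if sidecar_path.endswith(".tsr"):
        # openssl hashes the WACZ and compares it with the timestamp token.
        return [
            "openssl", "ts", "-verify",
            "-in", sidecar_path,
            "-data", wacz_path,
            "-CAfile", ca_bundle,
        ]
    raise ValueError(f"unrecognized sidecar type: {sidecar_path}")
```

Each returned list can be passed to subprocess.run(cmd, check=True); a non-zero exit status then raises an exception, which signals that the proof did not verify.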
How to read a WACZ
The simplest way to inspect a WACZ archive is to drag it into ReplayWeb.page. It renders the captured pages as the user originally saw them, including JavaScript-rendered content where applicable, plus the integrity badge from the embedded signature.
If you want to inspect the WACZ outside ReplayWeb.page, treat it as a regular zip archive. Standard zip tools can list and extract its contents. datapackage.json enumerates the captured resources and their hashes; pages/pages.jsonl enumerates the captured URLs.
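As an illustration of treating the WACZ as a regular zip, the following sketch lists the archive's entries and reads the captured pages. It assumes each page record in pages/pages.jsonl has a "url" key, and it skips any line without one (such as a header line). The function name list_pages is hypothetical.

```python
import json
import zipfile


def list_pages(wacz):
    """Return (zip entry names, page records) for a WACZ file.

    `wacz` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz) as zf:
        names = zf.namelist()
        pages = []
        for line in zf.read("pages/pages.jsonl").decode("utf-8").splitlines():
            if not line.strip():
                continue  # ignore blank lines
            record = json.loads(line)
            if "url" in record:  # assumed key; lines without it are skipped
                pages.append(record)
    return names, pages
```

A typical use is to print each record's URL and title, then cross-check the entry names against the files listed in datapackage.json.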
How to download
Each tracked change with an archive shows a download button in the PageCrawl interface. From the archive details panel you can also download the per-provider timestamp proofs and the underlying WARC file for ingestion into other archival systems.
