# What's in a PageCrawl WACZ Archive

Source: PageCrawl.io Help Center
URL: https://pagecrawl.io/help/web-archives/article/wacz-format-explained

---

[Image: Web archiving: timestamped snapshots of web pages preserved over time]

WACZ (Web Archive Collection Zipped) is an open specification developed by Webrecorder for packaging web archives in a portable, replayable, tamper-evident format. WACZ is used by the Internet Archive, the Library of Congress, and major eDiscovery and digital-preservation platforms. Storing PageCrawl archives in WACZ means they are interoperable with the wider archival ecosystem.

This article explains what's inside a PageCrawl WACZ, what the embedded signature does, and why we ship additional sidecar proofs alongside it.

### Inside the WACZ zip

A WACZ file is a zip archive with a defined internal structure:

- `archive/data.warc.gz`, the WARC (Web ARChive) file containing the captured HTTP responses (HTML, images, scripts, stylesheets, linked PDFs, etc.) in their original byte form.
- `pages/pages.jsonl`, a list of pages captured, one JSON object per line, with the URL, timestamp, and title.
- `datapackage.json`, a manifest listing every file inside the archive along with its size, MIME type, and SHA-256 hash. This is the canonical integrity manifest.
- `datapackage-digest.json`, a SHA-256 hash of `datapackage.json` itself, plus an optional `signedData` block (WACZ Auth specification).

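Each resource's SHA-256 hash is recorded in `datapackage.json`, and the hash of `datapackage.json` is in turn recorded in `datapackage-digest.json`, which is either signed by the WACZ Auth `signedData` block or simply stored alongside. Modifying any byte of any captured resource breaks its recorded hash, and updating the manifest to match changes the manifest's own hash, which then no longer matches the digest file. The format is therefore structurally tamper-evident.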
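The hash chain described above can be checked with a few lines of Python. This is a minimal sketch, not PageCrawl's own verifier. It assumes the manifest lists files under a `resources` array with `path` and `hash` keys, that hashes are written as `sha256:<hex>`, and that `datapackage-digest.json` has a top-level `hash` key. The function name `check_wacz_hashes` is made up for illustration. It does not validate the `signedData` signature.

```python
import hashlib
import json
import zipfile


def check_wacz_hashes(wacz):
    """Return a list of problems; an empty list means every hash matched.

    `wacz` is a file path or a binary file-like object.
    Assumed layout: manifest["resources"][i] has "path" and "hash" keys,
    and hashes are written as "sha256:<hex digest>".
    """
    problems = []
    with zipfile.ZipFile(wacz) as zf:
        manifest_bytes = zf.read("datapackage.json")
        manifest = json.loads(manifest_bytes)

        # Step 1: every listed resource must match its recorded hash.
        for resource in manifest.get("resources", []):
            path = resource["path"]
            algorithm, _, expected = resource["hash"].partition(":")
            if algorithm != "sha256":
                problems.append(f"{path}: unsupported hash algorithm {algorithm!r}")
                continue
            try:
                # Reads the whole member into memory; fine for a sketch.
                actual = hashlib.sha256(zf.read(path)).hexdigest()
            except KeyError:
                problems.append(f"{path}: listed in manifest but missing from zip")
                continue
            if actual != expected:
                problems.append(f"{path}: hash mismatch")

        # Step 2: the manifest itself must match the digest file.
        digest_doc = json.loads(zf.read("datapackage-digest.json"))
        _, _, expected_manifest = digest_doc["hash"].partition(":")
        if hashlib.sha256(manifest_bytes).hexdigest() != expected_manifest:
            problems.append("datapackage.json: does not match datapackage-digest.json")
    return problems
```

Calling `check_wacz_hashes("archive.wacz")` on an untouched archive should return an empty list under these assumptions. Any entry in the list names the file whose bytes no longer match.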

### The embedded signedData block (WACZ Auth)

When you enable WACZ capture on an Ultimate-plan page, the archive includes an embedded canonical signature following the [WACZ Auth specification 0.1.0](https://specs.webrecorder.net/wacz-auth/0.1.0/). The signature lives inside `datapackage-digest.json` and contains:

- The cryptographic signature of `datapackage.json`'s hash.
- An RFC 3161 timestamp issued by a Trust Service Provider.
- The signing service's domain certificate, proving its identity.

This is the WACZ-spec-compliant way to sign a WACZ archive. When a WACZ-aware tool (such as ReplayWeb.page) opens the archive, it reads the `signedData` block and renders an integrity badge if the signature validates. The badge tells a reviewer that the archive was signed by the named domain at the indicated time.
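To see whether a given archive carries an embedded signature at all, you can read the `signedData` block directly. The sketch below only reports which fields are present and makes no assumption about their names beyond `signedData` itself. It does not validate the signature, the timestamp, or the certificate, which a WACZ-aware tool does for you. The function name `signed_data_fields` is hypothetical.

```python
import json
import zipfile


def signed_data_fields(wacz):
    """Return the sorted field names of the signedData block, or None if absent.

    `wacz` is a file path or a binary file-like object.
    This inspects the block only; it performs no cryptographic validation.
    """
    with zipfile.ZipFile(wacz) as zf:
        digest_doc = json.loads(zf.read("datapackage-digest.json"))
    signed = digest_doc.get("signedData")
    if signed is None:
        return None
    return sorted(signed)
```

A return value of `None` means the archive relies on the stored digest alone. A list of field names means a signature block is present and can be handed to a validating tool.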

### Sidecar proof files

Alongside the WACZ, PageCrawl retains additional proof files that don't fit inside the WACZ Auth spec:

- `archive.wacz.ots`, an OpenTimestamps proof anchored to the Bitcoin blockchain.
- `archive.wacz.<provider>.tsr`, an RFC 3161 timestamp file. Each captured archive carries one or more provider TSR files from commercial Trust Service Providers. The current providers are listed on the verification page accompanying each archive.
- `archive.wacz.qtsa.tsr` (Custom plans), a qualified electronic timestamp from a QTSP on the EU Trusted List.

The WACZ Auth spec supports only one embedded signature, so additional providers ship as sidecar files. Sidecars do not violate the WACZ format spec; they live in the same directory and are independent artefacts. Each sidecar is verifiable with public tooling (`ots verify`, `openssl ts -verify`) without unpacking or modifying the WACZ.

This dual approach combines a spec-compliant embedded signature for WACZ-aware tooling with multi-provider redundancy for evidentiary depth.
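As a rough illustration of how the sidecars map to public tooling, the sketch below finds the proof files that sit next to a WACZ and builds one verification command for each, without running anything. It assumes the sidecars timestamp the bytes of the `.wacz` file itself, as the file naming suggests. It also assumes you have obtained the provider's CA certificate as a PEM file; the `tsa-ca.pem` default and the function name `sidecar_verify_commands` are placeholders.

```python
from pathlib import Path


def sidecar_verify_commands(wacz_path, ca_file="tsa-ca.pem"):
    """Build (but do not run) one verification command per sidecar proof.

    Assumptions: sidecars live next to the WACZ and are named
    "<wacz name>.<something>"; `ca_file` is a placeholder for the
    timestamp provider's CA certificate in PEM form.
    """
    wacz = Path(wacz_path)
    prefix = wacz.name + "."
    commands = []
    for sidecar in sorted(wacz.parent.iterdir()):
        if not sidecar.name.startswith(prefix):
            continue
        if sidecar.suffix == ".ots":
            # The OpenTimestamps client locates the stamped file by
            # dropping the ".ots" suffix from the proof's file name.
            commands.append(["ots", "verify", str(sidecar)])
        elif sidecar.suffix == ".tsr":
            # RFC 3161 response checked against the WACZ bytes.
            commands.append([
                "openssl", "ts", "-verify",
                "-data", str(wacz),
                "-in", str(sidecar),
                "-CAfile", ca_file,
            ])
    return commands
```

Each returned list can be passed to `subprocess.run` once the tools are installed. Depending on the provider's certificate chain, `openssl ts -verify` may also need intermediate certificates supplied with `-untrusted`.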

### How to read a WACZ

The simplest way to inspect a WACZ archive is to drag it into [ReplayWeb.page](https://replayweb.page). It renders the captured pages as the user originally saw them, including JavaScript-rendered content where applicable, plus the integrity badge from the embedded signature.

If you want to inspect the WACZ outside ReplayWeb.page, treat it as a regular zip archive. Standard zip tools can list and extract its contents. `datapackage.json` enumerates the captured resources and their hashes; `pages/pages.jsonl` enumerates the captured URLs.
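For a scripted look at the same information, Python's standard `zipfile` module is enough. The following is a sketch under two assumptions: page entries in `pages/pages.jsonl` use the keys `url`, `ts`, and `title`, and any line without a `url` key (such as a format header) can be skipped. The function names are illustrative only.

```python
import json
import zipfile


def list_members(wacz):
    """Return the names of all files stored in the WACZ zip."""
    with zipfile.ZipFile(wacz) as zf:
        return zf.namelist()


def list_pages(wacz):
    """Return (url, timestamp, title) tuples from pages/pages.jsonl.

    Assumed keys: "url", "ts", "title". Lines without a "url" key
    are treated as header lines and skipped.
    """
    pages = []
    with zipfile.ZipFile(wacz) as zf:
        with zf.open("pages/pages.jsonl") as fh:
            for raw in fh:
                line = raw.decode("utf-8").strip()
                if not line:
                    continue
                entry = json.loads(line)
                if "url" not in entry:
                    continue
                pages.append((entry["url"], entry.get("ts"), entry.get("title")))
    return pages
```

Both functions accept a file path or an open binary file object, so they work equally well on a downloaded archive and on one held in memory.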

### How to download

Each tracked change with an archive shows a download button in the PageCrawl interface. From the archive details panel you can also download the per-provider timestamp proofs and the underlying WARC file for ingestion into other archival systems.

### Related articles

- [Verifying a PageCrawl Web Archive](/help/web-archives/article/verifying-a-web-archive.md)
- [Sharing archives publicly](/help/web-archives/article/share-archives-publicly.md)
- [Packaging a PageCrawl Audit Trail for a Regulator](/help/web-archives/article/audit-trail-for-regulators.md)

