File: cli.md | Updated: 11/18/2025

WARCIO CLI Tools

The library ships with several command-line tools for working with WARC files.

Index

The warcio index command prints a simple index of the records in a WARC file as newline-delimited JSON (NDJSON), one JSON object per record.

Basic Usage

warcio index ./test/data/example-iana.org-chunked.warc

Specifying Fields

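The WARC header fields to include in the index can be specified with the -f flag as a comma-separated list. Each JSON line lists the fields in the order given. As the output below shows, a field that a record does not have is left out of that record's line.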

warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length

Output:

{"warc-type": "warcinfo", "content-length": "137"}
{"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
{"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}

Including HTTP Headers

HTTP header fields can be included by prefixing them with http:. The special field offset refers to the record offset within the WARC file.

warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri

Output:

{"offset": "0", "content-type": "application/warc-fields"}
{"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
{"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}

Note: This library does not produce CDX or CDXJ format indexes often associated with web archives. To create these indexes, please see the cdxj-indexer tool which extends warcio indexing to provide this functionality.

Check

The warcio check command validates the payload and block digests of WARC records, if possible.

Basic Usage

warcio check path/to/file.warc.gz

An exit value of 1 indicates a failure.
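
The exit value makes the command easy to use from a script. Below is a minimal Python sketch, assuming the warcio executable is on the PATH. The function name `warc_passes_check` and its `command` parameter are hypothetical; the parameter exists only so the sketch can be exercised without warcio installed. The sketch treats any nonzero exit value as "did not pass", which is an assumption beyond the documented value of 1.

```python
import subprocess


def warc_passes_check(path, command=("warcio", "check")):
    """Return True when the check command exits with status 0.

    `command` is an illustrative parameter, not part of warcio.
    """
    result = subprocess.run([*command, path])
    # The documentation states that exit value 1 means failure; this
    # sketch assumes every other nonzero value is a failure too.
    return result.returncode == 0
```

A caller might use it as `if not warc_passes_check("path/to/file.warc.gz"): ...` to stop a pipeline when validation fails.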

Verbose Mode

warcio check -v path/to/file.warc.gz

Verbose mode (-v) prints detailed output for each record in the WARC file.

Recompress

The warcio recompress command recompresses or normalizes WARC (or ARC) files into a record-compressed, gzipped WARC file.

In this format, each WARC record is compressed individually and the compressed records are concatenated. This is the 'canonical' WARC storage format used by Webrecorder and other web archiving institutions, and is usually stored with a .warc.gz extension.
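
To illustrate why per-record compression matters, here is a small Python sketch using only the standard library, not warcio's own implementation. The byte strings are placeholders rather than real WARC records, and `read_member` is a hypothetical helper. It shows that concatenated gzip members still decompress as one stream, while a single record can be decompressed on its own from its starting offset.

```python
import gzip
import zlib

# Placeholder byte strings standing in for serialized WARC records.
records = [b"record one\r\n\r\n", b"record two\r\n\r\n", b"record three\r\n\r\n"]

# Compress each record as its own gzip member and note where each member starts.
offsets = []
blob = b""
for record in records:
    offsets.append(len(blob))
    blob += gzip.compress(record)

# The concatenation is still a valid gzip stream containing every record.
assert gzip.decompress(blob) == b"".join(records)


def read_member(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; any bytes after
    # it are left in decompressor.unused_data.
    return decompressor.decompress(data[offset:])


print(read_member(blob, offsets[1]))
```

A file that is gzipped as a single stream has no such per-record entry points, which is the "improperly compressed" case mentioned among the use cases for this command.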

Use Cases

The recompress command can be used to:

  • Compress an uncompressed WARC
  • Convert any ARC file to a compressed WARC
  • Fix an improperly compressed WARC file (e.g., a WARC compressed entirely instead of by record)

Usage

warcio recompress ./input.arc.gz ./output.warc.gz

Extract

The warcio extract command writes the WARC and HTTP headers, the payload, or both, of a single WARC record to stdout.

Given a WARC filename and an offset, extract will print the (decompressed) record at that offset in the file to stdout.

Basic Usage

Extract the entire record (headers + payload):

warcio extract filename offset

Extract Payload Only

warcio extract --payload filename offset

Extract Headers Only

Output only the WARC + HTTP headers (if any):

warcio extract --headers filename offset
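
The three invocations above differ only in one optional flag, so a script can assemble them from a filename and an offset, for example an offset taken from warcio index output. The helper below is a hypothetical Python sketch, not part of warcio. It only builds the argument list and uses only the flags shown above.

```python
def extract_command(filename, offset, part=None):
    """Build the argument list for `warcio extract` (hypothetical helper).

    part: None for the whole record, or "payload" / "headers" for one part.
    """
    if part not in (None, "payload", "headers"):
        raise ValueError("part must be None, 'payload', or 'headers'")
    args = ["warcio", "extract"]
    if part is not None:
        args.append("--" + part)  # --payload or --headers
    # Offsets may arrive as int or str; the command line needs a string.
    args += [filename, str(offset)]
    return args


print(extract_command("example.warc", 405, part="payload"))
```

The resulting list can be passed to `subprocess.run(..., capture_output=True)` to collect the record bytes from stdout, assuming warcio is installed.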
