File: cli.md | Updated: 11/18/2025
The library ships with several command-line tools for working with WARC files.
The warcio index command prints a simple index of the records in a WARC file as newline-delimited JSON lines (NDJSON).
warcio index ./test/data/example-iana.org-chunked.warc
WARC header fields to include in the index can be specified via the -f flag. Fields appear in each JSON object in the order given, and a field that a record lacks is omitted from that record's output.
warcio index ./test/data/example-iana.org-chunked.warc -f warc-type,warc-target-uri,content-length
Output:
{"warc-type": "warcinfo", "content-length": "137"}
{"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
{"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}
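Because each output line is a standalone JSON object, the index is straightforward to consume from a script. The following is a minimal sketch using only the Python standard library, applied to the sample output above. The helper name `parse_index` is illustrative and not part of warcio.

```python
import json

# Sample index output copied from the example above.
INDEX_OUTPUT = """\
{"warc-type": "warcinfo", "content-length": "137"}
{"warc-type": "response", "warc-target-uri": "http://www.iana.org/", "content-length": "7566"}
{"warc-type": "request", "warc-target-uri": "http://www.iana.org/", "content-length": "76"}
"""


def parse_index(text):
    """Parse NDJSON index output into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_index(INDEX_OUTPUT)

# A field missing from a record (warc-target-uri on the warcinfo record here)
# is simply absent, so use .get() rather than indexing.
responses = [r for r in records if r.get("warc-type") == "response"]

# Values are strings in this sample, so convert before doing arithmetic.
total_bytes = sum(int(r["content-length"]) for r in records)
```

In practice the text would come from the command's stdout rather than a string literal.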
HTTP header fields can be included by prefixing them with http:. The special field offset refers to the record offset within the WARC file.
warcio index ./test/data/example-iana.org-chunked.warc -f offset,content-type,http:content-type,warc-target-uri
Output:
{"offset": "0", "content-type": "application/warc-fields"}
{"offset": "405", "content-type": "application/http;msgtype=response", "http:content-type": "text/html; charset=UTF-8", "warc-target-uri": "http://www.iana.org/"}
{"offset": "8379", "content-type": "application/http;msgtype=request", "warc-target-uri": "http://www.iana.org/"}
Note: This library does not produce CDX or CDXJ format indexes often associated with web archives. To create these indexes, please see the cdxj-indexer tool which extends warcio indexing to provide this functionality.
The warcio check command validates the payload and block digests of WARC records, if possible.
warcio check path/to/file.warc.gz
An exit value of 1 indicates a failure.
warcio check -v path/to/file.warc.gz
Verbose mode (-v) prints detailed output for each record in the WARC file.
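In a script, the exit value is usually the simplest thing to act on. The sketch below assumes the `warcio` executable is on `PATH`. The wrapper `digests_ok` is a hypothetical helper, and treating every nonzero exit value as a failure is a conservative assumption, since the text above only specifies the value 1.

```python
import subprocess


def digests_ok(path, command=("warcio", "check")):
    """Return True if the check command exits with 0 for the given file.

    `command` is a parameter only so the helper can be exercised without
    warcio installed; by default it runs `warcio check <path>`.
    """
    result = subprocess.run([*command, path], capture_output=True, text=True)
    return result.returncode == 0
```

A caller might use it as `if not digests_ok("path/to/file.warc.gz"): ...` and then re-run the check with -v to see which records failed.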
The recompress command re-compresses or normalizes WARC (or ARC) files into a record-compressed, gzipped WARC file.
Each WARC record is compressed individually and concatenated. This is the 'canonical' WARC storage format used by Webrecorder and other web archiving institutions, and is usually stored with a .warc.gz extension.
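The practical benefit of per-record compression is that any single record can be decompressed from its byte offset without reading the rest of the file. The sketch below illustrates the layout with Python's standard `gzip` and `zlib` modules. It uses stand-in byte strings rather than real WARC records, and it is not how warcio itself is implemented.

```python
import gzip
import zlib

# Stand-in byte strings; real WARC records would go here.
records = [b"record one\r\n", b"record two\r\n", b"record three\r\n"]

# Compress each record as its own gzip member and remember where it starts.
offsets = []
blob = b""
for rec in records:
    offsets.append(len(blob))
    blob += gzip.compress(rec)

# The concatenation is still a valid gzip stream: decompressing it
# yields every record back to back.
assert gzip.decompress(blob) == b"".join(records)


def read_member(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect gzip framing.
    decompressor = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; the remaining
    # bytes are left in decompressor.unused_data.
    return decompressor.decompress(data[offset:])


second = read_member(blob, offsets[1])
```

A file that is gzipped as one single stream does not allow this kind of random access, because it has no member boundary at each record.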
For example, the recompress command can be used to convert a gzipped ARC file into a record-compressed WARC file:
warcio recompress ./input.arc.gz ./output.warc.gz
The extract command prints the WARC and HTTP headers, the payload, or both, of a single WARC record to stdout.
Given a WARC filename and an offset, extract will print the (decompressed) record at that offset in the file to stdout.
Extract the entire record (headers + payload):
warcio extract filename offset
Output only the payload:
warcio extract --payload filename offset
Output only the WARC + HTTP headers (if any):
warcio extract --headers filename offset
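The three invocations above differ only in an optional flag, so they are easy to wrap from a script. The following sketch assumes `warcio` is on `PATH` and uses only the --payload and --headers flags shown above. The function names `extract_argv` and `extract` are hypothetical helpers, not part of warcio.

```python
import subprocess


def extract_argv(filename, offset, part=None):
    """Build the argument list for `warcio extract`.

    part: None for the whole record, or "payload" / "headers" for the
    corresponding flag.
    """
    if part not in (None, "payload", "headers"):
        raise ValueError(f"unsupported part: {part!r}")
    argv = ["warcio", "extract"]
    if part is not None:
        argv.append(f"--{part}")
    argv += [filename, str(offset)]
    return argv


def extract(filename, offset, part=None):
    """Run the command and return its stdout as bytes (payloads may be binary)."""
    completed = subprocess.run(
        extract_argv(filename, offset, part), check=True, capture_output=True
    )
    return completed.stdout
```

The offset would typically be taken from the offset field of the index output, for instance 405 for the response record in the earlier example, so `extract("./test/data/example-iana.org-chunked.warc", 405, "headers")` should return that record's headers.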