File: README.md | Updated: 11/18/2025

WARCIO: WARC (and ARC) Streaming Library

Overview

This library provides a fast, standalone way to read and write the WARC format commonly used in web archives. It requires Python 3.7+, and six is its only external dependency.

Key Features:

  • Read and write WARC files compliant with the WARC 1.0 and WARC 1.1 ISO standards
  • Fast, low-level access oriented around a stream of WARC records
  • On-the-fly ARC to WARC record conversion
  • Automatic decompression and de-chunking of HTTP payload content
  • Capture HTTP/S traffic directly to WARC files

This library is a spin-off of the WARC reading and writing component of pywb, a key component of Webrecorder.

Installation

pip install warcio

For optional features:

pip install warcio[all]

Quick Start

Reading WARC Files

Iterate over WARC records and print URLs of response records:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

Writing WARC Files

Capture HTTP/S traffic directly to a WARC file:

from warcio.capture_http import capture_http
import requests  # Must be imported after capture_http

with capture_http('example.warc.gz'):
    requests.get('https://example.com/')

Documentation

License

warcio is licensed under the Apache 2.0 License and is part of the Webrecorder project.

See NOTICE and LICENSE for details.