Writing WARC Records

Starting with version 1.6, warcio introduces a way to capture HTTP/S traffic directly to a WARC file by monkey-patching Python's http.client library.

This approach works well with the popular requests library often used to fetch HTTP/S content.

Note: requests must be imported after the capture_http module.

Quick Start

Fetching the URL https://example.com/ while capturing the response and request into a gzip-compressed WARC file named example.warc.gz can be done with the following four lines:

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

with capture_http('example.warc.gz'):
    requests.get('https://example.com/')

The WARC example.warc.gz will contain two records (the response is written first, then the request).

To write to a default in-memory buffer (BufferWARCWriter), don't specify a filename:

with capture_http() as writer:
    requests.get('https://example.com/')

Additional requests in the capture_http context will be appended to the WARC as expected.

The WARC-IP-Address header will also be added for each record if the IP address is available.

Understanding the Output

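The following example shows the records that capture_http creates. The request to https://google.com/ is redirected to https://www.google.com/, so it produces two request/response pairs: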

from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests

with capture_http() as writer:
    requests.get('http://example.com/')
    requests.get('https://google.com/')

expected = [('http://example.com/', 'response', True),
            ('http://example.com/', 'request', True),
            ('https://google.com/', 'response', True),
            ('https://google.com/', 'request', True),
            ('https://www.google.com/', 'response', True),
            ('https://www.google.com/', 'request', True)
           ]

actual = [
    (record.rec_headers['WARC-Target-URI'],
     record.rec_type,
     'WARC-IP-Address' in record.rec_headers)
    for record in ArchiveIterator(writer.get_stream())
]

assert actual == expected

Customizing WARC Writing

The library provides a simple and extensible interface for writing standards-compliant WARC files.

The library comes with:

WARCWriter - For writing to a single WARC file
BufferWARCWriter - For writing to an in-memory buffer
BaseWARCWriter - Can be extended to support more complex operations

Note: There is no support for writing legacy ARC files.

For more flexibility, such as using a custom WARCWriter class, pass a writer instance to capture_http instead of a filename:

from warcio.capture_http import capture_http
from warcio import WARCWriter
import requests  # requests *must* be imported after capture_http

with open('example.warc.gz', 'wb') as fh:
    warc_writer = WARCWriter(fh)
    with capture_http(warc_writer):
        requests.get('https://example.com/')

WARC/1.1 Support

By default, warcio creates WARC/1.0 records for maximum compatibility with existing tools. To create WARC/1.1 records, pass warc_version='1.1':

with capture_http('example.warc.gz', warc_version='1.1'):
    requests.get('https://example.com/')

Or when using WARCWriter directly:

WARCWriter(fh, warc_version='1.1')

Version Differences

The main difference is that the WARC-Date timestamp header will be written with microsecond precision in WARC 1.1, while WARC 1.0 only supports second precision.

WARC 1.0:

WARC/1.0
...
WARC-Date: 2018-12-26T10:11:12Z

WARC 1.1:

WARC/1.1
...
WARC-Date: 2018-12-26T10:11:12.456789Z
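
To make the precision difference concrete, here is a small standard-library sketch that formats the same timestamp both ways. The helper name warc_date is hypothetical and this is not warcio's own implementation; it only reproduces the two formats shown above.

```python
from datetime import datetime, timezone


def warc_date(dt, warc_version='1.0'):
    """Format a datetime as a WARC-Date value (hypothetical helper)."""
    dt = dt.astimezone(timezone.utc)
    if warc_version == '1.1':
        # WARC/1.1: keep the microseconds
        return dt.strftime('%Y-%m-%dT%H:%M:%S.%fZ')
    # WARC/1.0: second precision only
    return dt.strftime('%Y-%m-%dT%H:%M:%SZ')


stamp = datetime(2018, 12, 26, 10, 11, 12, 456789, tzinfo=timezone.utc)
date_1_0 = warc_date(stamp)          # '2018-12-26T10:11:12Z'
date_1_1 = warc_date(stamp, '1.1')   # '2018-12-26T10:11:12.456789Z'
```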

Filtering HTTP Capture

When capturing HTTP traffic, you can provide a custom filter function that decides whether a particular request and response record pair is written to the WARC file or skipped.

The filter function is called with the request and response records before they are written, and can be used to:

Substitute a different record (for example, a revisit instead of a response)
Skip writing altogether by returning None, None

def filter_records(request, response, request_recorder):
    # Return None, None to indicate records should be skipped
    if response.http_headers.get_statuscode() != '200':
        return None, None

    # The response record can be replaced with a revisit record
    elif check_for_dedup():
        response = create_revisit_record(...)

    return request, response

with capture_http('example.warc.gz', filter_records):
    requests.get('https://example.com/')
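
As a further sketch, a complete filter needs no helper functions if it only inspects the response headers. The function below keeps a request/response pair only when the response is a 200 with an HTML content type. The name keep_html_ok and the HTML-only policy are assumptions chosen for illustration.

```python
def keep_html_ok(request, response, request_recorder):
    """Keep only 200 responses whose Content-Type mentions HTML."""
    headers = response.http_headers

    # Skip the pair if there are no parsed HTTP headers or the status is not 200
    if headers is None or headers.get_statuscode() != '200':
        return None, None

    # Skip the pair if the content type is missing or is not HTML
    content_type = headers.get_header('Content-Type') or ''
    if 'html' not in content_type.lower():
        return None, None

    # Otherwise write both records unchanged
    return request, response
```

It is passed to capture_http in the same way as filter_records above, for example capture_http('example.warc.gz', keep_html_ok).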

Please refer to test/test_capture_http.py for additional examples of capturing requests traffic to WARC.

Manual/Advanced WARC Writing

Before version 1.6, this was the primary method for fetching a URL and then writing to a WARC. This process is more verbose but provides full control of WARC creation and avoids monkey-patching.

The following example loads http://example.com/, creates a WARC response record, and writes it, gzip compressed, to example.warc.gz. The block and payload digests are computed automatically.

from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders
import requests

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)

    resp = requests.get('http://example.com/',
                        headers={'Accept-Encoding': 'identity'},
                        stream=True)

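    # Get raw headers from urllib3
    headers_list = resp.raw.headers.items()

    # The status line is hard-coded for simplicity; resp.raw.status and
    # resp.raw.reason hold the actual values returned by the server
    http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')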

    record = writer.create_warc_record('http://example.com/', 'response',
                                        payload=resp.raw,
                                        http_headers=http_headers)

    writer.write_record(record)

Additional Capabilities

The library also supports:

Creating warcinfo and revisit records
Writing response and request records together
Writing custom WARC records
Reading a full WARC record from a stream

Please refer to warcwriter.py and test/test_writer.py for additional examples.
