File: writing.md | Updated: 11/18/2025
Starting with version 1.6, warcio introduces a way to capture HTTP/S traffic directly to a WARC file by monkey-patching Python's http.client library.
This approach works well with the popular requests library often used to fetch HTTP/S content.
Note: requests must be imported after the capture_http module.
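The import-order requirement can be understood through a general property of Python monkey-patching. The sketch below is not warcio's code: it assumes the patch works by replacing a class in a module, and every name in it (`net`, `Connection`, `RecordingConnection`, `EarlyClient`, `LateClient`) is hypothetical. It shows that a class created before the patch keeps its original base class, while a class created afterwards picks up the patched one.

```python
import types

# Hypothetical stand-in for a library module such as http.client.
net = types.SimpleNamespace()


class Connection:
    def send(self, data):
        return 'sent'


net.Connection = Connection


# A client class defined BEFORE the patch (like importing requests too early).
class EarlyClient(net.Connection):
    pass


# The patch: a recording subclass replaces the module-level class.
class RecordingConnection(Connection):
    def send(self, data):
        return 'recorded+' + super().send(data)


net.Connection = RecordingConnection


# A client class defined AFTER the patch sees the recording class.
class LateClient(net.Connection):
    pass


early_result = EarlyClient().send(b'x')  # bound to the original class
late_result = LateClient().send(b'x')    # goes through the recording class
```

Under this assumption, a client library imported too early would behave like `EarlyClient` and its traffic would not be recorded.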
Fetching the URL https://example.com/ while capturing the response and request into a gzip-compressed WARC file named example.warc.gz can be done with the following four lines:
from warcio.capture_http import capture_http
import requests # requests must be imported after capture_http
with capture_http('example.warc.gz'):
    requests.get('https://example.com/')
The WARC example.warc.gz will contain two records (the response is written first, then the request).
To write to a default in-memory buffer (BufferWARCWriter), don't specify a filename:
with capture_http() as writer:
    requests.get('https://example.com/')
Additional requests in the capture_http context will be appended to the WARC as expected.
The WARC-IP-Address header will also be added for each record if the IP address is available.
The following example demonstrates the resulting records created with capture_http:
from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests
with capture_http() as writer:
    requests.get('http://example.com/')
    requests.get('https://google.com/')
expected = [('http://example.com/', 'response', True),
            ('http://example.com/', 'request', True),
            ('https://google.com/', 'response', True),
            ('https://google.com/', 'request', True),
            ('https://www.google.com/', 'response', True),
            ('https://www.google.com/', 'request', True)
           ]
actual = [
    (record.rec_headers['WARC-Target-URI'],
     record.rec_type,
     'WARC-IP-Address' in record.rec_headers)
    for record in ArchiveIterator(writer.get_stream())
]
assert actual == expected
The library provides a simple and extensible interface for writing standards-compliant WARC files.
The library comes with:
- WARCWriter - For writing to a single WARC file
- BufferWARCWriter - For writing to an in-memory buffer
- BaseWARCWriter - Can be extended to support more complex operations

Note: There is no support for writing legacy ARC files.
For more flexibility, such as using a custom WARCWriter class, pass a writer instance to capture_http instead of a filename:
from warcio.capture_http import capture_http
from warcio import WARCWriter
import requests # requests *must* be imported after capture_http
with open('example.warc.gz', 'wb') as fh:
    warc_writer = WARCWriter(fh)
    with capture_http(warc_writer):
        requests.get('https://example.com/')
By default, warcio creates WARC 1.0 records for maximum compatibility with existing tools. To create WARC 1.1 records, pass warc_version='1.1':
with capture_http('example.warc.gz', warc_version='1.1'):
    requests.get('https://example.com/')
Or when using WARCWriter directly:
WARCWriter(fh, warc_version='1.1')
The main difference is that the WARC-Date timestamp header will be written with microsecond precision in WARC 1.1, while WARC 1.0 only supports second precision.
WARC 1.0:
WARC/1.0
...
WARC-Date: 2018-12-26T10:11:12Z
WARC 1.1:
WARC/1.1
...
WARC-Date: 2018-12-26T10:11:12.456789Z
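The difference in precision can be reproduced with the standard library alone. The helper below, `format_warc_date`, is a hypothetical name used only for this sketch and is not part of warcio; it formats the same UTC timestamp at second and at microsecond precision.

```python
from datetime import datetime, timezone


def format_warc_date(dt, use_micros=False):
    """Format an aware datetime as a UTC timestamp (hypothetical helper)."""
    dt = dt.astimezone(timezone.utc)
    if use_micros:
        # Microsecond precision, as in the WARC 1.1 example above
        return dt.strftime('%Y-%m-%dT%H:%M:%S.%fZ')
    # Second precision, as in the WARC 1.0 example above
    return dt.strftime('%Y-%m-%dT%H:%M:%SZ')


ts = datetime(2018, 12, 26, 10, 11, 12, 456789, tzinfo=timezone.utc)
date_1_0 = format_warc_date(ts)
date_1_1 = format_warc_date(ts, use_micros=True)
```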
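When capturing HTTP traffic with capture_http, you can provide a custom filter function that decides whether a particular request and response record pair should be written to the WARC file or skipped.
The filter function is called with the request and response records before they are written. It can skip both records by returning None, None, or substitute a different record, for example a revisit record in place of the response: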
def filter_records(request, response, request_recorder):
    # Return None, None to indicate records should be skipped
    if response.http_headers.get_statuscode() != '200':
        return None, None
    # The response record can be replaced with a revisit record
    elif check_for_dedup():
        response = create_revisit_record(...)
    return request, response
with capture_http('example.warc.gz', filter_records):
    requests.get('https://example.com/')
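Because a filter is an ordinary function, its logic can be checked without any network traffic. The sketch below is a hypothetical filter, `keep_html_only`, that keeps only HTML responses. It assumes the response record's http_headers object offers a get_header() method, and it uses two stub classes (`StubHeaders`, `StubRecord`) invented here as stand-ins for real records.

```python
class StubHeaders:
    """Hypothetical stand-in for a record's http_headers object."""

    def __init__(self, headers):
        self._headers = {k.lower(): v for k, v in headers.items()}

    def get_header(self, name, default=None):
        return self._headers.get(name.lower(), default)


class StubRecord:
    """Hypothetical stand-in for a WARC record with HTTP headers."""

    def __init__(self, headers):
        self.http_headers = StubHeaders(headers)


def keep_html_only(request, response, request_recorder):
    # Skip the request/response pair unless the response is HTML
    content_type = response.http_headers.get_header('Content-Type') or ''
    if not content_type.lower().startswith('text/html'):
        return None, None
    return request, response


# Exercise the filter with stub records instead of live traffic.
req = object()
html = StubRecord({'Content-Type': 'text/html; charset=utf-8'})
image = StubRecord({'Content-Type': 'image/png'})
kept = keep_html_only(req, html, None)
skipped = keep_html_only(req, image, None)
```

A filter like this would be passed to capture_http in the same position as filter_records above.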
Please refer to test/test_capture_http.py for additional examples of capturing requests traffic to WARC.
Before version 1.6, creating records manually was the primary method for fetching a URL and then writing it to a WARC. This process is more verbose, but it provides full control over WARC creation and avoids monkey-patching.
The following example loads http://example.com/, creates a WARC response record, and writes it, gzip compressed, to example.warc.gz. The block and payload digests are computed automatically.
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders
import requests
with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    resp = requests.get('http://example.com/',
                        headers={'Accept-Encoding': 'identity'},
                        stream=True)
    # Get raw headers from urllib3
    headers_list = resp.raw.headers.items()
    http_headers = StatusAndHeaders('200 OK', headers_list, protocol='HTTP/1.0')
    record = writer.create_warc_record('http://example.com/', 'response',
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
The library also includes additional semantics for:
- warcinfo and revisit records
- response and request records together

Please refer to warcwriter.py and test/test_writer.py for additional examples.