Reading WARC Records

A key feature of warcio is the ability to iterate over a stream of WARC records using the ArchiveIterator.

Features

Reading WARC 1.0, WARC 1.1, or ARC streams
On-the-fly ARC to WARC record conversion
Automatic decompression and de-chunking of HTTP payload content stored in WARC/ARC files

Basic Usage

The following example prints the URL for each WARC response record:

from warcio.archiveiterator import ArchiveIterator

with open('path/to/file.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))

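The stream object can be any file-like object opened in binary mode, such as a file on disk or a remote network stream. The ArchiveIterator reads the WARC content in a single pass, so read whatever you need from a record before advancing to the next one.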
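
Because the iterator makes a single pass, one common pattern is to collect what is needed from each record while it is current. The following is a minimal sketch that tallies record types. The helper names `count_record_types` and `count_record_types_in_file` are hypothetical and not part of warcio; the first relies only on the `rec_type` attribute, so it should accept an ArchiveIterator or any other iterable of records.

```python
from collections import Counter


def count_record_types(records):
    """Tally rec_type values across an iterable of records in one pass."""
    counts = Counter()
    for record in records:
        counts[record.rec_type] += 1
    return counts


def count_record_types_in_file(path):
    """Hypothetical convenience wrapper: open a WARC/ARC file and tally it."""
    # Imported here so the helper above stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, 'rb') as stream:
        return count_record_types(ArchiveIterator(stream))
```

Calling `count_record_types_in_file('path/to/file.warc.gz')` would return a Counter keyed by record type, such as 'response' or 'request'.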

ArcWarcRecord Object

The record is represented by an ArcWarcRecord object which contains:

format - The archive format (ARC or WARC)
rec_type - The record type
rec_headers - The record headers
raw_stream - Raw stream for reading the payload
http_headers - HTTP headers (if any)
content_type - Content type
length - Content length

class ArcWarcRecord(object):
    def __init__(self, *args):
        (self.format, self.rec_type, self.rec_headers, self.raw_stream,
         self.http_headers, self.content_type, self.length) = args
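
As an illustration of the attributes listed above, the sketch below gathers them into a plain dict, which can be handy for logging or debugging. `describe_record` is a hypothetical helper, not a warcio function. It assumes `rec_headers` offers `get_header()`, as in the examples on this page, and that `http_headers` may be None for records without HTTP headers.

```python
def describe_record(record):
    """Collect a record's basic attributes into a plain dict."""
    return {
        'format': record.format,
        'type': record.rec_type,
        # WARC-Target-URI is absent on some record types, giving None here.
        'uri': record.rec_headers.get_header('WARC-Target-URI'),
        'content_type': record.content_type,
        'length': record.length,
        'has_http_headers': record.http_headers is not None,
    }
```

The raw_stream is deliberately left out of the summary, since reading it consumes the payload.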

Reading WARC Content

The raw_stream can be used to read the rest of the payload directly, as stored. The ArcWarcRecord.content_stream() method instead returns a stream that automatically decompresses and de-chunks the HTTP payload if it is compressed and/or uses chunked transfer encoding.
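
For large payloads it may be preferable to read the decoded stream in chunks rather than all at once. The sketch below hashes and measures the decoded payload that way. `payload_sha1` and its `chunk_size` parameter are hypothetical names chosen for illustration, and the only assumption about the record is that `content_stream()` returns an object with a `read(n)` method.

```python
import hashlib


def payload_sha1(record, chunk_size=8192):
    """Return (hex SHA-1 digest, byte count) of the decoded payload."""
    digest = hashlib.sha1()
    total = 0
    stream = record.content_stream()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty read signals the end of the payload.
            break
        digest.update(chunk)
        total += len(chunk)
    return digest.hexdigest(), total
```

Note that this digest covers the decompressed, de-chunked bytes, so it may not match a digest header recorded in the archive for the payload as stored.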

ARC Files

The library supports reading (but not writing) ARC files. ARC is a legacy format, but it is supported consistently with WARC: the ArchiveIterator can iterate over both ARC and WARC files and emits ArcWarcRecord objects for each. The arc2warc option converts ARC records to WARC records on the fly, so they can be accessed through the same API.

Special WARCIterator and ARCIterator subclasses of ArchiveIterator are also available to read only WARC or only ARC files.
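
To illustrate why arc2warc is convenient, the sketch below looks up a record's URL regardless of format. It rests on two assumptions that should be checked against the installed warcio version: that `record.format` is 'arc' for unconverted ARC records, and that native ARC headers expose the URL under the name 'uri'. `target_uri` is a hypothetical helper.

```python
def target_uri(record):
    """Return the target URI for a record from either format."""
    if record.format == 'arc':
        # Assumption: native ARC headers expose the URL under the name 'uri'.
        return record.rec_headers.get_header('uri')
    # WARC records, and ARC records converted with arc2warc=True.
    return record.rec_headers.get_header('WARC-Target-URI')
```

With arc2warc=True the first branch should never be taken, which is the point of the option: calling code only needs the WARC header names.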

WARC and ARC Streaming

The following example streams a WARC file and an ARC file over HTTP using requests, printing the warcinfo record (or ARC header) and any response records (or all ARC records) that contain HTML:

import requests
from warcio.archiveiterator import ArchiveIterator

def print_records(url):
    resp = requests.get(url, stream=True)

    for record in ArchiveIterator(resp.raw, arc2warc=True):
        if record.rec_type == 'warcinfo':
            print(record.raw_stream.read())

        elif record.rec_type == 'response':
            if (record.http_headers.get_header('Content-Type') or '').startswith('text/html'):
                print(record.rec_headers.get_header('WARC-Target-URI'))
                print(record.content_stream().read())
                print('')

# WARC
print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.warc.gz')

# ARC with arc2warc
print_records('https://archive.org/download/ExampleArcAndWarcFiles/IAH-20080430204825-00000-blackbook.arc.gz')

Working with Different Record Types

Response Records

Response records contain the HTTP response from a web server:

for record in ArchiveIterator(stream):
    if record.rec_type == 'response':
        uri = record.rec_headers.get_header('WARC-Target-URI')
        content = record.content_stream().read()
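
Response records are often filtered by media type. The sketch below yields the URL and decoded body of HTML responses, comparing only the media type so that values such as 'text/html; charset=utf-8' also match. `iter_html_responses` is a hypothetical helper; it assumes `http_headers` can be None for records without HTTP headers and skips those.

```python
def iter_html_responses(records):
    """Yield (uri, body_bytes) for response records whose Content-Type is HTML."""
    for record in records:
        if record.rec_type != 'response' or record.http_headers is None:
            continue
        content_type = record.http_headers.get_header('Content-Type') or ''
        # Compare only the media type, ignoring parameters such as charset.
        media_type = content_type.split(';', 1)[0].strip().lower()
        if media_type == 'text/html':
            uri = record.rec_headers.get_header('WARC-Target-URI')
            # Read the body now: the stream is only valid for the current record.
            yield uri, record.content_stream().read()
```

It can be driven directly by an iterator, for example `for uri, body in iter_html_responses(ArchiveIterator(stream)): ...`.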

Request Records

Request records contain the HTTP request sent to a web server:

for record in ArchiveIterator(stream):
    if record.rec_type == 'request':
        uri = record.rec_headers.get_header('WARC-Target-URI')
        request_data = record.raw_stream.read()

Warcinfo Records

Warcinfo records contain metadata about the WARC file:

for record in ArchiveIterator(stream):
    if record.rec_type == 'warcinfo':
        metadata = record.raw_stream.read()
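
The metadata read above is raw bytes. Assuming the warcinfo payload uses the usual "name: value" line layout (the application/warc-fields content type), it could be turned into a dict along these lines. `parse_warc_fields` is a hypothetical helper, and the parsing is deliberately simplified: repeated names overwrite earlier ones and continuation lines are not handled.

```python
def parse_warc_fields(payload):
    """Parse 'name: value' lines from a warcinfo payload into a dict."""
    fields = {}
    for line in payload.decode('utf-8', errors='replace').splitlines():
        # Split on the first colon only, so values may contain colons (URLs).
        name, sep, value = line.partition(':')
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields
```

For example, `parse_warc_fields(record.raw_stream.read())` could be called inside the loop above in place of the plain read.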
