Architecture Overview

wicket is a high-performance, streaming text extractor for Wikipedia dumps. It processes multi-gigabyte XML dumps, plain or bzip2-compressed, by combining streaming I/O with batch-based parallelism.

High-Level Data Flow

Input (.xml / .xml.bz2)
  |
  v
DumpReader (streaming XML parse + namespace filter)
  |  yields Article { id, title, namespace, text }
  v
Batch (1000 articles)
  |
  v
rayon par_iter (parallel processing)
  |  clean_wikitext(text) -> plain text
  |  format_page(id, title, url_base, text, format) -> formatted string
  v
OutputSplitter (sequential write, file rotation)
  |
  v
Output files (AA/wiki_00, AA/wiki_01, ...)
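
The flow above can be approximated in a few dozen lines. The following is a minimal sketch, not wicket's actual code: the `Article` fields mirror the diagram, but `clean_wikitext` and `format_page` are simplified placeholders (the real `format_page` also takes `url_base` and `format`), the `run` and `process_batch` functions are hypothetical, and `std::thread::scope` stands in for rayon's `par_iter` so that the example compiles without external crates.

```rust
use std::thread;

/// Hypothetical stand-in for the `Article` yielded by `DumpReader`.
#[derive(Debug, Clone)]
pub struct Article {
    pub id: u64,
    pub title: String,
    pub namespace: i32,
    pub text: String,
}

/// Placeholder cleaner (assumption): only strips `[[` and `]]` link brackets.
fn clean_wikitext(text: &str) -> String {
    text.replace("[[", "").replace("]]", "")
}

/// Placeholder formatter (assumption): emits a simple doc-style wrapper.
fn format_page(id: u64, title: &str, text: &str) -> String {
    format!("<doc id=\"{id}\" title=\"{title}\">\n{text}\n</doc>\n")
}

/// Clean and format one batch on several threads, preserving input order.
fn process_batch(batch: &[Article], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Ceiling division so every article lands in exactly one chunk.
    let chunk_len = ((batch.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = batch
            .chunks(chunk_len)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|a| format_page(a.id, &a.title, &clean_wikitext(&a.text)))
                        .collect::<Vec<String>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect::<Vec<String>>()
    })
}

/// Drain a streaming source batch by batch; `write` is called sequentially.
pub fn run<I, W>(articles: I, batch_size: usize, workers: usize, mut write: W)
where
    I: IntoIterator<Item = Article>,
    W: FnMut(&str),
{
    let batch_size = batch_size.max(1);
    let mut batch = Vec::with_capacity(batch_size);
    for article in articles {
        batch.push(article);
        if batch.len() == batch_size {
            for page in process_batch(&batch, workers) {
                write(&page);
            }
            batch.clear();
        }
    }
    // Flush the final, possibly partial, batch.
    if !batch.is_empty() {
        for page in process_batch(&batch, workers) {
            write(&page);
        }
    }
}
```

Calling `run(reader, 1000, 8, |page| ...)` with any iterator of `Article` values mirrors the diagram: memory use is bounded by one batch, the CPU-bound work is spread across threads, and the writer sees pages in their original order.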

Design Principles

Streaming processing – XML is parsed as a stream; articles are read one at a time, and at most one batch (1000 articles) is buffered in memory
Batch parallelism – CPU-bound wikitext cleaning is parallelized via rayon while I/O remains sequential
Structured output – pages are written in doc or JSON format into an organized directory structure (AA/wiki_00, AA/wiki_01, ...)
Fail-soft – malformed pages are logged and skipped rather than causing the entire process to abort
Library-first – core functionality lives in the wicket library crate; the CLI is a thin wrapper
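
The fail-soft principle can be illustrated with a small sketch. Everything here is hypothetical and only demonstrates the log-and-skip pattern: `RawPage`, `PageError`, `validate`, and `extract_all` are illustrative names, not part of wicket's API, and the real reader may detect malformed pages in other ways.

```rust
use std::fmt;

/// Hypothetical raw page as it might come out of an XML parser.
pub struct RawPage {
    pub id: Option<u64>,
    pub title: Option<String>,
    pub text: Option<String>,
}

/// Simplified article (assumption: the namespace field is omitted for brevity).
#[derive(Debug, PartialEq)]
pub struct Article {
    pub id: u64,
    pub title: String,
    pub text: String,
}

/// Reasons a page is considered malformed in this sketch.
#[derive(Debug, PartialEq)]
pub enum PageError {
    MissingId,
    MissingTitle,
    MissingText,
}

impl fmt::Display for PageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let msg = match self {
            PageError::MissingId => "missing <id>",
            PageError::MissingTitle => "missing <title>",
            PageError::MissingText => "missing <text>",
        };
        f.write_str(msg)
    }
}

/// Turn a raw page into an `Article`, or report what is missing.
fn validate(raw: RawPage) -> Result<Article, PageError> {
    Ok(Article {
        id: raw.id.ok_or(PageError::MissingId)?,
        title: raw.title.ok_or(PageError::MissingTitle)?,
        text: raw.text.ok_or(PageError::MissingText)?,
    })
}

/// Keep every well-formed page; log and count the rest instead of aborting.
pub fn extract_all<I>(pages: I) -> (Vec<Article>, usize)
where
    I: IntoIterator<Item = RawPage>,
{
    let mut articles = Vec::new();
    let mut skipped = 0;
    for raw in pages {
        match validate(raw) {
            Ok(article) => articles.push(article),
            Err(err) => {
                eprintln!("skipping malformed page: {err}");
                skipped += 1;
            }
        }
    }
    (articles, skipped)
}
```

Returning the skip count alongside the results lets a caller report how many pages were dropped without treating any single bad page as fatal.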
