Introduction

wicket is a high-performance tool that extracts plain text from Wikipedia XML dump files, offering fast processing through parallel execution and efficient streaming.

Key Features

  • Streaming XML parsing – handles multi-gigabyte dumps without loading them into memory
  • Parallel text extraction – uses multiple CPU cores via rayon
  • Automatic bzip2 decompression – transparently handles .xml.bz2 dump files
  • Dual output formats – both doc format and JSON format
  • File splitting – configurable maximum size per output file
  • Namespace filtering – extract only specific page types (main articles, talk pages, etc.)

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}

Current Version

wicket v0.1.0 – Rust Edition 2024, minimum Rust version 1.85.

Getting Started

Welcome to wicket! This section will help you get up and running quickly.

wicket extracts plain text from Wikipedia XML dump files. It reads MediaWiki XML dumps (optionally bzip2-compressed), removes wiki markup, and writes clean text in doc or JSON format.

Next Steps

  • Installation – install wicket from source or crates.io
  • Quick Start – extract text from a Wikipedia dump in minutes

Installation

Prerequisites

  • Rust 1.85 or later (stable channel) from rust-lang.org
  • Cargo (Rust’s package manager, included with Rust)

Installing the CLI Tool

From crates.io

cargo install wicket-cli

From Source

git clone https://github.com/mosuka/wicket.git
cd wicket
cargo build --release

The binary will be available at ./target/release/wicket.

Verify the installation:

./target/release/wicket --help

Using as a Library

Add wicket to your project’s Cargo.toml:

[dependencies]
wicket = "0.1.0"

Supported Platforms

wicket is tested on the following platforms:

OS        Architecture
Linux     x86_64, aarch64
macOS     x86_64 (Intel), aarch64 (Apple Silicon)
Windows   x86_64, aarch64

Quick Start

Obtaining a Wikipedia Dump

Download a Wikipedia dump from https://dumps.wikimedia.org/. For testing, the Simple English Wikipedia dump is recommended due to its small size:

wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

CLI Quick Start

Basic Extraction

Extract plain text from a Wikipedia dump:

wicket simplewiki-latest-pages-articles.xml.bz2 -o output/

This reads the dump, extracts plain text from all main namespace articles, and writes the output to the output/ directory in doc format, splitting files at 1 MB.

JSON Output

wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json

Write to stdout

wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50

Library Quick Start

Here is a minimal Rust program that opens a dump and processes articles:

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("simplewiki-latest-pages-articles.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader.take(5) {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);
    }

    Ok(())
}

Architecture Overview

wicket is designed as a high-performance, streaming Wikipedia dump text extractor. It processes multi-gigabyte XML dumps by combining streaming I/O with batch-based parallelism.

High-Level Data Flow

Input (.xml / .xml.bz2)
  |
  v
DumpReader (streaming XML parse + namespace filter)
  |  yields Article { id, title, namespace, text }
  v
Batch (1000 articles)
  |
  v
rayon par_iter (parallel processing)
  |  clean_wikitext(text) -> plain text
  |  format_page(id, title, url_base, text, format) -> formatted string
  v
OutputSplitter (sequential write, file rotation)
  |
  v
Output files (AA/wiki_00, AA/wiki_01, ...)

Design Principles

  • Streaming processing – XML is parsed as a stream; only one article is in memory at a time
  • Batch parallelism – CPU-bound wikitext cleaning is parallelized via rayon while I/O remains sequential
  • Structured output – doc format and JSON format with organized directory structure
  • Fail-soft – malformed pages are logged and skipped rather than causing the entire process to abort
  • Library-first – core functionality lives in the wicket library crate; the CLI is a thin wrapper

Workspace Structure

wicket is organized as a Cargo workspace with two crates and supporting directories.

Directory Layout

wicket/
├── Cargo.toml              # Workspace manifest
├── Cargo.lock              # Dependency lock file
├── LICENSE                 # MIT OR Apache-2.0
├── README.md               # Project overview
├── wicket/                # Core library crate
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs          # Module declarations and re-exports
│       ├── dump.rs         # XML dump streaming parser
│       ├── cleaner.rs      # Wikitext to plain text conversion
│       ├── extractor.rs    # Output formatting (doc/JSON)
│       ├── output.rs       # File splitting and rotation
│       └── error.rs        # Error types
├── wicket-cli/            # CLI binary crate
│   ├── Cargo.toml
│   └── src/
│       └── main.rs         # CLI entry point
├── docs/                   # mdBook documentation (this book)
│   ├── book.toml
│   ├── src/
│   └── ja/                 # Japanese documentation
│       ├── book.toml
│       └── src/
└── .github/
    └── workflows/          # CI/CD pipelines
        ├── regression.yml  # Test on push/PR
        ├── release.yml     # Release builds and publishing
        ├── periodic.yml    # Weekly stability tests
        └── deploy-docs.yml # Documentation deployment

Crate Details

wicket (Core Library)

The core library provides streaming XML parsing, wikitext cleaning, output formatting, and file splitting.

Dependency         Version   Purpose
quick-xml          0.39      Streaming XML parsing
parse-wiki-text-2  0.2       Wikitext AST parsing
regex              1.12      Fallback wikitext cleaning
bzip2              0.6       Bzip2 compression/decompression
serde              1.0       Serialization framework
serde_json         1.0       JSON output formatting
rayon              1.11      Data parallelism (used by CLI)
thiserror          2.0       Error type derivation
log                0.4       Logging facade

wicket-cli (CLI Binary)

The CLI provides a command-line interface to wicket’s functionality.

Dependency   Version   Purpose
clap         4.5       Command-line argument parsing
rayon        1.11      Parallel batch processing
bzip2        0.6       Compressed output support
env_logger   0.11      Logging output
anyhow       1.0       Error handling in binary
wicket       0.1       Core library (workspace member)

Workspace Configuration

The workspace uses Cargo resolver version 3 (Rust Edition 2024):

[workspace]
resolver = "3"
members = ["wicket", "wicket-cli"]

[workspace.package]
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"

Shared dependencies are defined at the workspace level in [workspace.dependencies] and referenced by each crate with { workspace = true }.

Module Design

The wicket library crate is organized into five modules, each with a clear responsibility.

Module Overview

Module     Primary Types                  Purpose
dump       Article, DumpReader            Streaming XML dump parsing
cleaner    clean_wikitext()               Wikitext to plain text conversion
extractor  OutputFormat, format_page()    Output formatting (doc/JSON)
output     OutputConfig, OutputSplitter   File splitting and rotation
error      Error                          Error type definitions

Module Details

dump – XML Dump Reader

Streaming XML parser built on quick-xml. Reads MediaWiki XML dump files and yields Article structs.

  • Article – A single Wikipedia page with id (u64), title (String), namespace (i32), and text (String)
  • DumpReader<R: BufRead> – Iterator that streams articles from an XML source with namespace filtering
  • open_dump(path, namespaces) – Opens a dump file with automatic .bz2 detection using MultiBzDecoder

The reader parses <siteinfo><base> to extract the wiki’s URL base, which is exposed via url_base().

cleaner – Wikitext Cleaner

Converts MediaWiki markup into plain text using a staged approach:

  1. AST-based cleaning – Uses parse_wiki_text_2 to build an AST and walks text nodes
  2. Regex fallback – When AST parsing fails, falls back to regex-based cleanup
  3. Post-processing – Removes remnants such as orphaned template braces that survive the first two stages

Key function: clean_wikitext(wikitext: &str) -> String

extractor – Output Formatter

Formats extracted articles into the output representation.

  • OutputFormat – Enum with Doc and Json variants
  • format_page(id, title, url, text, format) – Formats a single article
  • make_url(url_base, title) – Constructs a Wikipedia article URL
  • parse_file_size(spec) – Parses size specifications like 1M, 500K, 1G

output – File Splitter

Manages writing extracted articles to split output files following a two-letter directory naming convention.

  • OutputConfig – Configuration for output path, max file size, and compression
  • OutputSplitter – Manages file rotation with AA/wiki_00 naming (100 files per directory, directories AA through ZZ)

Supports stdout output (path = "-"), bzip2 compression, and configurable file size limits.

error – Error Types

Defines the Error enum using thiserror:

  • Io – I/O errors
  • XmlReader – XML parsing errors from quick-xml
  • JsonSerialization – JSON serialization errors
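For illustration, the three variants above could be modeled by hand roughly as follows. This is a hypothetical sketch, not wicket's actual definition: the real enum is derived with thiserror and wraps the quick-xml and serde_json error types directly, whereas this sketch simplifies those two to strings.

```rust
use std::fmt;

// Hypothetical hand-written stand-in for the thiserror-derived enum.
#[derive(Debug)]
pub enum Error {
    Io(std::io::Error),
    XmlReader(String),         // simplified; the real variant wraps quick_xml::Error
    JsonSerialization(String), // simplified; the real variant wraps serde_json::Error
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "I/O error: {}", e),
            Error::XmlReader(msg) => write!(f, "XML error: {}", msg),
            Error::JsonSerialization(msg) => write!(f, "JSON error: {}", msg),
        }
    }
}

// From<io::Error> lets `?` convert I/O failures automatically,
// which is what #[from] on the thiserror variant provides.
impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Error::Io(e)
    }
}

fn main() {
    let err: Error = std::io::Error::new(std::io::ErrorKind::NotFound, "missing dump").into();
    println!("{}", err);
}
```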

Public Exports

The library’s lib.rs re-exports key types for convenience:

pub use cleaner::clean_wikitext;
pub use dump::{open_dump, Article, DumpReader};
pub use error::Error;
pub use extractor::{format_page, make_url, parse_file_size, OutputFormat};
pub use output::{OutputConfig, OutputSplitter};

Data Flow

This page describes how data flows through wicket from input to output.

Processing Pipeline

Input (.xml / .xml.bz2)
  |
  v
DumpReader (streaming XML parse + namespace filter)
  |  yields Article { id, title, namespace, text }
  v
Batch (1000 articles)
  |
  v
rayon par_iter (parallel processing)
  |  clean_wikitext(text) -> plain text
  |  format_page(id, title, url_base, text, format) -> formatted string
  v
OutputSplitter (sequential write, file rotation)
  |
  v
Output files (AA/wiki_00, AA/wiki_01, ...)

Stage Details

1. XML Dump Reading

DumpReader uses quick-xml to parse the MediaWiki XML dump as a stream. For .xml.bz2 files, the stream is automatically wrapped with MultiBzDecoder for transparent decompression.

The reader extracts:

  • Page ID (<id> inside <page>, not inside <revision>)
  • Title (<title>)
  • Namespace (<ns>)
  • Wikitext body (<text>)
  • URL base from <siteinfo><base> (extracted once at startup)

Pages with namespaces not in the filter list are skipped at the iterator level.
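Conceptually, that filter amounts to a membership check against the allow-list. A minimal standalone sketch (with Article simplified to the fields involved, not wicket's real struct):

```rust
// Simplified stand-in for wicket's Article: only the fields the filter touches.
struct Article {
    namespace: i32,
    title: String,
}

// Keep only pages whose namespace is in the allow-list,
// as DumpReader does internally at the iterator level.
fn filter_namespaces(articles: Vec<Article>, namespaces: &[i32]) -> Vec<Article> {
    articles
        .into_iter()
        .filter(|a| namespaces.contains(&a.namespace))
        .collect()
}

fn main() {
    let pages = vec![
        Article { namespace: 0, title: "April".into() },
        Article { namespace: 1, title: "Talk:April".into() },
    ];
    // Namespace 0 = main articles, so the talk page is dropped.
    let main_only = filter_namespaces(pages, &[0]);
    println!("{} page(s) kept: {}", main_only.len(), main_only[0].title);
}
```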

2. Batch Collection

Articles are collected into batches of 1000 from the DumpReader iterator. This batch size balances parallelization overhead against memory usage.

3. Parallel Processing

Each batch is processed with rayon::par_iter(), which distributes work across CPU cores:

  1. clean_wikitext(text) – Converts wikitext markup to plain text. This is the most CPU-intensive step.
  2. format_page(id, title, url, text, format) – Formats the clean text into doc or JSON format.

Results are collected in order (rayon preserves element ordering with par_iter).

4. Sequential Output

Formatted strings are written sequentially to the OutputSplitter, which:

  • Creates subdirectories (AA, AB, …, ZZ) as needed
  • Rotates to a new file after reaching the configured size limit
  • Applies bzip2 compression when enabled
  • Outputs to stdout when the path is "-"

Parallelization Strategy

wicket uses a batch-based parallelization approach rather than a pipeline with channels:

  1. The main thread reads articles from the DumpReader in batches of 1000
  2. Each batch is processed in parallel using rayon::par_iter()
  3. Results are written sequentially to maintain deterministic output ordering
  4. This repeats until all articles are processed

This approach is simple, maintains output ordering, and effectively parallelizes the CPU-bound cleaning step while keeping the I/O-bound reading and writing sequential.
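The loop structure can be sketched as follows. This is an illustrative skeleton, not the CLI's actual code: `clean` is a placeholder for clean_wikitext, and the inner `map` is sequential here where the real implementation uses rayon's `par_iter()` (which preserves ordering in the same way).

```rust
// Placeholder for the CPU-bound cleaning step (clean_wikitext in wicket).
fn clean(text: &str) -> String {
    text.trim().to_string()
}

// Read in batches, process each batch, write results in order.
fn process_in_batches(articles: Vec<String>, batch_size: usize) -> Vec<String> {
    let mut out = Vec::with_capacity(articles.len());
    let mut iter = articles.into_iter();
    loop {
        // 1. Collect the next batch (up to batch_size articles).
        let batch: Vec<String> = iter.by_ref().take(batch_size).collect();
        if batch.is_empty() {
            break;
        }
        // 2. Process the batch; rayon's par_iter would parallelize this map.
        let cleaned: Vec<String> = batch.iter().map(|t| clean(t)).collect();
        // 3. Append results sequentially, preserving input order.
        out.extend(cleaned);
    }
    out
}

fn main() {
    let input: Vec<String> = (0..2500).map(|i| format!(" article {} ", i)).collect();
    let output = process_in_batches(input, 1000);
    println!("processed {} articles; first = {:?}", output.len(), output[0]);
}
```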

Library API Overview

The wicket crate provides a Rust API for extracting plain text from Wikipedia XML dump files.

Installation

[dependencies]
wicket = "0.1.0"

Module Map

Module             Primary Types                             Purpose
wicket::dump       Article, DumpReader, open_dump()          Streaming XML dump parsing
wicket::cleaner    clean_wikitext()                          Wikitext to plain text conversion
wicket::extractor  OutputFormat, format_page(), make_url()   Output formatting
wicket::output     OutputConfig, OutputSplitter              File splitting and rotation
wicket::error      Error                                     Error type definitions

Quick Example

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        print!("{}", output);
    }

    Ok(())
}

API Documentation

Full API documentation is available on docs.rs/wicket.

dump

The dump module provides streaming XML parsing of MediaWiki dump files.

Types

Article

A single Wikipedia page extracted from the dump.

Field      Type     Description
id         u64      Page ID
title      String   Page title
namespace  i32      Namespace ID (0 = main articles)
text       String   Raw wikitext content

DumpReader<R: BufRead>

An iterator that streams Article values from an XML dump source.

  • Implements Iterator<Item = Result<Article, Error>>
  • Filters articles by namespace at the iterator level
  • Exposes url_base() to retrieve the wiki’s base URL from <siteinfo>

Functions

open_dump(path: &Path, namespaces: &[i32]) -> Result<DumpReader<...>>

Opens a Wikipedia XML dump file for reading.

  • Automatically detects .bz2 extension and applies MultiBzDecoder
  • Parses <siteinfo> to extract the URL base
  • Configures namespace filtering

Usage

use wicket::open_dump;

let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
println!("URL base: {}", reader.url_base());

for result in reader {
    let article = result?;
    println!("[{}] {}", article.id, article.title);
}

cleaner

The cleaner module converts MediaWiki wikitext markup into plain text.

Functions

clean_wikitext(wikitext: &str) -> String

Converts raw wikitext into clean plain text by removing all MediaWiki markup.

The cleaning process uses a three-stage approach:

  1. AST-based cleaning – Uses parse_wiki_text_2 to parse the wikitext into an AST and extracts text content from relevant nodes
  2. Regex fallback – When AST parsing fails or for markup not handled by the AST, applies regex-based pattern removal
  3. Post-processing – Removes markup remnants that survive the first two stages, such as orphaned template braces (}}), template parameter lines, and HTML comment fragments

The parser is configured with both English and Japanese Wikipedia namespaces, so it correctly handles dumps from either language edition without requiring any configuration changes.

Handled Markup

The cleaner handles the following MediaWiki markup elements:

  • Bold/Italic – '''bold''' and ''italic''
  • Internal links – [[Article]] and [[Article|display text]]
  • External links – [https://example.com text]
  • Templates – {{template|...}}
  • HTML tags – <ref>, <nowiki>, <gallery>, etc.
  • Categories – [[Category:...]] and [[カテゴリ:...]]
  • Files – [[File:...]], [[Image:...]], and [[ファイル:...]]
  • Tables – Wikitext table markup
  • Comments – <!-- comments -->
  • Magic words – __TOC__, __NOTOC__, etc.
  • Redirects – #REDIRECT and #転送
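To make one of these rules concrete, here is a minimal standalone sketch of stripping the bold/italic quote markup. This is only an illustration: the real cleaner works on a parse_wiki_text_2 AST with a regex fallback, and a plain string replace like this is safe only for this one construct.

```rust
// Illustrative only: strip ''' (bold) and '' (italic) quote markup.
// Replacing ''' before '' matters, since '' is a prefix of '''.
fn strip_bold_italic(wikitext: &str) -> String {
    wikitext.replace("'''", "").replace("''", "")
}

fn main() {
    let s = strip_bold_italic("'''April''' is the ''fourth'' month.");
    println!("{}", s);
}
```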

Usage

use wicket::clean_wikitext;

let wikitext = "'''April''' is the [[month|fourth month]] of the year.";
let text = clean_wikitext(wikitext);
assert_eq!(text, "April is the fourth month of the year.");

extractor

The extractor module formats extracted articles into the final output representation.

Types

OutputFormat

An enum specifying the output format.

Variant  Description
Doc      Doc format with XML-like tags
Json     JSON Lines format (one JSON object per article)

Functions

format_page(id: u64, title: &str, url: &str, text: &str, format: OutputFormat) -> String

Formats a single article into the specified output format.

Doc format output:

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON format output:

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}

make_url(url_base: &str, title: &str) -> String

Constructs a full Wikipedia article URL from the URL base and title. Spaces in the title are replaced with underscores.
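A plausible sketch of that construction follows. It assumes url_base points at an article page such as .../wiki/Main_Page (as parsed from <siteinfo><base>), so the last path segment is replaced with the underscored title; wicket's actual implementation may differ in details such as percent-encoding.

```rust
// Sketch: keep the base URL up to and including the last '/',
// then append the title with spaces replaced by underscores.
fn make_url(url_base: &str, title: &str) -> String {
    let dir = match url_base.rfind('/') {
        Some(i) => &url_base[..i + 1],
        None => url_base,
    };
    format!("{}{}", dir, title.replace(' ', "_"))
}

fn main() {
    let url = make_url("https://en.wikipedia.org/wiki/Main_Page", "April Fools' Day");
    println!("{}", url);
}
```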

parse_file_size(spec: &str) -> Result<u64, Error>

Parses a human-readable file size specification into bytes.

Input    Result
"1M"     1,048,576
"500K"   512,000
"1G"     1,073,741,824
"0"      0 (one article per file)

output

The output module manages writing extracted articles to split output files following a two-letter directory naming convention.

Types

OutputConfig

Configuration for the output splitter.

Field          Type     Description
path           PathBuf  Output directory path, or "-" for stdout
max_file_size  u64      Maximum bytes per output file
compress       bool     Whether to compress output with bzip2

OutputSplitter

Manages file rotation and writing. Creates subdirectories and files as needed.

Directory Naming Convention

Output files are organized using the following naming convention:

output/
  AA/
    wiki_00
    wiki_01
    ...
    wiki_99
  AB/
    wiki_00
    ...
  • Each directory holds up to 100 files
  • Directory names follow the pattern AA, AB, …, AZ, BA, …, ZZ
  • When compress is enabled, files are named wiki_00.bz2, etc.
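The mapping from a running file index to a directory and file name can be sketched as below. This is an illustration of the scheme described above, not OutputSplitter's actual code.

```rust
// Map a running file index to (directory, file name):
// 100 files per directory, directories AA, AB, ..., AZ, BA, ..., ZZ.
fn output_path(file_index: u64) -> (String, String) {
    let dir_index = file_index / 100; // 0 => AA, 1 => AB, ..., 675 => ZZ
    let first = (b'A' + (dir_index / 26) as u8) as char;
    let second = (b'A' + (dir_index % 26) as u8) as char;
    let dir = format!("{}{}", first, second);
    let file = format!("wiki_{:02}", file_index % 100);
    (dir, file)
}

fn main() {
    for i in [0u64, 1, 99, 100, 101] {
        let (dir, file) = output_path(i);
        println!("index {:>3} -> {}/{}", i, dir, file);
    }
}
```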

Special Modes

  • stdout mode – When path is "-", all output is written to stdout without splitting
  • Zero size – When max_file_size is 0, each article is written to its own file

error

The error module defines the error types used throughout the wicket library.

Error Type

The Error enum is derived using thiserror and covers all error conditions:

Variant            Source             Description
Io                 std::io::Error     File I/O errors
XmlReader          quick_xml::Error   XML parsing errors
JsonSerialization  serde_json::Error  JSON serialization errors

Result Type

The library provides a Result type alias:

pub type Result<T> = std::result::Result<T, Error>;

All public functions in the library return this Result type.

CLI Reference Overview

The wicket CLI extracts plain text from Wikipedia XML dump files.

Usage

wicket [OPTIONS] <INPUT>

Quick Reference

Option           Description                                          Default
<INPUT>          Input Wikipedia XML dump file (.xml or .xml.bz2)     (required)
-o, --output     Output directory, or - for stdout                    text
-b, --bytes      Maximum bytes per output file (e.g., 1M, 500K, 1G)   1M
-c, --compress   Compress output files using bzip2                    false
--json           Write output in JSON format                          false
--processes      Number of parallel workers                           CPU count
-q, --quiet      Suppress progress output on stderr                   false
--namespaces     Comma-separated namespace IDs to extract             0

Detailed Documentation

  • Options – detailed description of all CLI options
  • Examples – common usage patterns and examples

CLI Options

Input

wicket <INPUT>

The input file is a positional argument. It must be a Wikipedia XML dump file, either uncompressed (.xml) or bzip2-compressed (.xml.bz2). Compression is detected automatically by file extension.

Output Directory

wicket dump.xml.bz2 -o output/
wicket dump.xml.bz2 -o -

-o, --output <PATH> – Specifies the output directory. Defaults to text.

  • When set to a directory path, output files are created using the two-letter directory naming convention (AA/wiki_00, etc.)
  • When set to -, all output is written to stdout without file splitting

File Size

wicket dump.xml.bz2 -b 500K
wicket dump.xml.bz2 -b 1M
wicket dump.xml.bz2 -b 1G
wicket dump.xml.bz2 -b 0

-b, --bytes <SIZE> – Maximum bytes per output file. Defaults to 1M.

Supported suffixes: K (kilobytes), M (megabytes), G (gigabytes). When set to 0, each article is written to its own file.

Compression

wicket dump.xml.bz2 -c

-c, --compress – Compress output files using bzip2. Output files will have a .bz2 extension.

JSON Output

wicket dump.xml.bz2 --json

--json – Write output in JSON Lines format (one JSON object per line) instead of the default doc format.

Parallel Workers

wicket dump.xml.bz2 --processes 8

--processes <N> – Number of parallel workers for text cleaning. Defaults to the number of CPU cores.

Quiet Mode

wicket dump.xml.bz2 -q

-q, --quiet – Suppress progress output on stderr. Useful when piping output to another command.

Namespace Filtering

wicket dump.xml.bz2 --namespaces 0
wicket dump.xml.bz2 --namespaces 0,1,2

--namespaces <IDS> – Comma-separated list of namespace IDs to extract. Defaults to 0 (main articles only).

Common namespace IDs:

ID   Namespace
0    Main (articles)
1    Talk
2    User
3    User talk
4    Wikipedia
6    File
10   Template
14   Category

CLI Examples

Basic Extraction

Extract text from a Wikipedia dump into the default text/ directory:

wicket simplewiki-latest-pages-articles.xml.bz2

Custom Output Directory

wicket dump.xml.bz2 -o output/

Write to stdout

Pipe output directly to another command:

wicket dump.xml.bz2 -o - -q | wc -l

JSON Output with Compression

wicket dump.xml.bz2 -o output/ --json -c

Extract Talk Pages

Extract namespace 1 (talk pages) with 8 workers:

wicket dump.xml.bz2 -o output/ --namespaces 1 --processes 8

Multiple Namespaces

Extract main articles and user pages:

wicket dump.xml.bz2 -o output/ --namespaces 0,2

Small Output Files

Split output into 500 KB files:

wicket dump.xml.bz2 -o output/ -b 500K

One Article per File

wicket dump.xml.bz2 -o output/ -b 0

Output Directory Structure

After extraction, the output directory looks like:

output/
  AA/
    wiki_00
    wiki_01
    ...
    wiki_99
  AB/
    wiki_00
    ...

With --compress:

output/
  AA/
    wiki_00.bz2
    wiki_01.bz2
    ...

License

wicket is dual-licensed: you may use it under either the MIT License or the Apache License 2.0, at your option.

MIT License

MIT License

Copyright (c) 2025 Minoru OSUKA

Apache License 2.0

Apache License, Version 2.0

Copyright (c) 2025 Minoru OSUKA

Full License Text

The complete license text is available in the LICENSE file in the repository.