Introduction

wicket is a high-performance tool that extracts plain text from Wikipedia XML dump files, offering fast processing through parallel execution and efficient streaming.

Key Features

  • Streaming XML parsing – handles multi-gigabyte dumps without loading them into memory
  • Parallel text extraction – uses multiple CPU cores via rayon
  • Automatic bzip2 decompression – transparently handles .xml.bz2 dump files
  • Dual output formats – both doc format and JSON format
  • File splitting – configurable maximum size per output file
  • Namespace filtering – extract only specific page types (main articles, talk pages, etc.)

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}

Current Version

wicket v0.1.0 – Rust Edition 2024, minimum Rust version 1.85.

Getting Started

Welcome to wicket! This section will help you get up and running quickly.

wicket extracts plain text from Wikipedia XML dump files. It reads MediaWiki XML dumps (optionally bzip2-compressed), removes wiki markup, and writes clean text in doc or JSON format.

Next Steps

  • Installation – install wicket from source or crates.io
  • Quick Start – extract text from a Wikipedia dump in minutes

Installation

Prerequisites

  • Rust 1.85 or later (stable channel) from rust-lang.org
  • Cargo (Rust’s package manager, included with Rust)

Installing the CLI Tool

From crates.io

cargo install wicket-cli

From Source

git clone https://github.com/mosuka/wicket.git
cd wicket
cargo build --release

The binary will be available at ./target/release/wicket.

Verify the installation:

./target/release/wicket --help

Using as a Library

Add wicket to your project’s Cargo.toml:

[dependencies]
wicket = "0.1.0"

Supported Platforms

wicket is tested on the following platforms:

OS        Architecture
Linux     x86_64, aarch64
macOS     x86_64 (Intel), aarch64 (Apple Silicon)
Windows   x86_64, aarch64

Quick Start

Obtaining a Wikipedia Dump

Download a Wikipedia dump from https://dumps.wikimedia.org/. For testing, the Simple English Wikipedia dump is recommended due to its small size:

wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

CLI Quick Start

Basic Extraction

Extract plain text from a Wikipedia dump:

wicket simplewiki-latest-pages-articles.xml.bz2 -o output/

This reads the dump, extracts plain text from all main namespace articles, and writes the output to the output/ directory in doc format, splitting files at 1 MB.

JSON Output

wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json

Write to stdout

wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50

Library Quick Start

Here is a minimal Rust program that opens a dump and processes articles:

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("simplewiki-latest-pages-articles.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader.take(5) {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);
    }

    Ok(())
}

Architecture Overview

wicket is designed as a high-performance, streaming Wikipedia dump text extractor. It processes multi-gigabyte XML dumps by combining streaming I/O with batch-based parallelism.

High-Level Data Flow

Input (.xml / .xml.bz2)
  |
  v
DumpReader (streaming XML parse + namespace filter)
  |  yields Article { id, title, namespace, text }
  v
Batch (1000 articles)
  |
  v
rayon par_iter (parallel processing)
  |  clean_wikitext(text) -> plain text
  |  format_page(id, title, url_base, text, format) -> formatted string
  v
OutputSplitter (sequential write, file rotation)
  |
  v
Output files (AA/wiki_00, AA/wiki_01, ...)

Design Principles

  • Streaming processing – XML is parsed as a stream; only one article is in memory at a time
  • Batch parallelism – CPU-bound wikitext cleaning is parallelized via rayon while I/O remains sequential
  • Structured output – doc format and JSON format with organized directory structure
  • Fail-soft – malformed pages are logged and skipped rather than causing the entire process to abort
  • Library-first – core functionality lives in the wicket library crate; the CLI is a thin wrapper

Workspace Structure

wicket is organized as a Cargo workspace with two crates and supporting directories.

Directory Layout

wicket/
├── Cargo.toml              # Workspace manifest
├── Cargo.lock              # Dependency lock file
├── LICENSE                 # MIT OR Apache-2.0
├── README.md               # Project overview
├── wicket/                # Core library crate
│   ├── Cargo.toml
│   └── src/
│       ├── lib.rs          # Module declarations and re-exports
│       ├── dump.rs         # XML dump streaming parser
│       ├── cleaner.rs      # Wikitext to plain text conversion
│       ├── extractor.rs    # Output formatting (doc/JSON)
│       ├── output.rs       # File splitting and rotation
│       └── error.rs        # Error types
├── wicket-cli/            # CLI binary crate
│   ├── Cargo.toml
│   └── src/
│       └── main.rs         # CLI entry point
├── docs/                   # mdBook documentation (this book)
│   ├── book.toml
│   ├── src/
│   └── ja/                 # Japanese documentation
│       ├── book.toml
│       └── src/
└── .github/
    └── workflows/          # CI/CD pipelines
        ├── regression.yml  # Test on push/PR
        ├── release.yml     # Release builds and publishing
        ├── periodic.yml    # Weekly stability tests
        └── deploy-docs.yml # Documentation deployment

Crate Details

wicket (Core Library)

The core library provides streaming XML parsing, wikitext cleaning, output formatting, and file splitting.

Dependency         Version   Purpose
quick-xml          0.39      Streaming XML parsing
parse-wiki-text-2  0.2       Wikitext AST parsing
regex              1.12      Fallback wikitext cleaning
bzip2              0.6       Bzip2 compression/decompression
serde              1.0       Serialization framework
serde_json         1.0       JSON output formatting
rayon              1.11      Data parallelism (used by CLI)
thiserror          2.0       Error type derivation
log                0.4       Logging facade

wicket-cli (CLI Binary)

The CLI provides a command-line interface to wicket’s functionality.

Dependency   Version   Purpose
clap         4.5       Command-line argument parsing
rayon        1.11      Parallel batch processing
bzip2        0.6       Compressed output support
env_logger   0.11      Logging output
anyhow       1.0       Error handling in binary
wicket       0.1       Core library (workspace member)

Workspace Configuration

The workspace uses Cargo resolver version 3 (Rust Edition 2024):

[workspace]
resolver = "3"
members = ["wicket", "wicket-cli"]

[workspace.package]
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"

Shared dependencies are defined at the workspace level in [workspace.dependencies] and referenced by each crate with { workspace = true }.

Module Design

The wicket library crate is organized into five modules, each with a clear responsibility.

Module Overview

Module     Primary Types                  Purpose
dump       Article, DumpReader            Streaming XML dump parsing
cleaner    clean_wikitext()               Wikitext to plain text conversion
extractor  OutputFormat, format_page()    Output formatting (doc/JSON)
output     OutputConfig, OutputSplitter   File splitting and rotation
error      Error                          Error type definitions

Module Details

dump – XML Dump Reader

Streaming XML parser built on quick-xml. Reads MediaWiki XML dump files and yields Article structs.

  • Article – A single Wikipedia page with id (u64), title (String), namespace (i32), and text (String)
  • DumpReader<R: BufRead> – Iterator that streams articles from an XML source with namespace filtering
  • open_dump(path, namespaces) – Opens a dump file with automatic .bz2 detection using MultiBzDecoder

The reader parses <siteinfo><base> to extract the wiki’s URL base, which is exposed via url_base().

cleaner – Wikitext Cleaner

Converts MediaWiki markup into plain text using a staged approach:

  1. AST-based cleaning – Uses parse_wiki_text_2 to build an AST and walks text nodes
  2. Regex fallback – When AST parsing fails, falls back to regex-based cleanup
  3. Post-processing – Removes remnants such as orphaned template braces that survive the first two stages

Key function: clean_wikitext(wikitext: &str) -> String

extractor – Output Formatter

Formats extracted articles into the output representation.

  • OutputFormat – Enum with Doc and Json variants
  • format_page(id, title, url, text, format) – Formats a single article
  • make_url(url_base, title) – Constructs a Wikipedia article URL
  • parse_file_size(spec) – Parses size specifications like 1M, 500K, 1G

output – File Splitter

Manages writing extracted articles to split output files following a two-letter directory naming convention.

  • OutputConfig – Configuration for output path, max file size, and compression
  • OutputSplitter – Manages file rotation with AA/wiki_00 naming (100 files per directory, directories AA through ZZ)

Supports stdout output (path = "-"), bzip2 compression, and configurable file size limits.

error – Error Types

Defines the Error enum using thiserror:

  • Io – I/O errors
  • XmlReader – XML parsing errors from quick-xml
  • JsonSerialization – JSON serialization errors
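For illustration, the three variants above could be modeled by hand roughly as follows. This is a hypothetical sketch, not wicket's actual definition: the real enum is derived with thiserror and wraps the quick-xml and serde_json error types directly, whereas this sketch simplifies those two to strings.

```rust
use std::fmt;

// Hypothetical hand-written stand-in for the thiserror-derived enum.
#[derive(Debug)]
pub enum Error {
    Io(std::io::Error),
    XmlReader(String),         // simplified; the real variant wraps quick_xml::Error
    JsonSerialization(String), // simplified; the real variant wraps serde_json::Error
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::Io(e) => write!(f, "I/O error: {}", e),
            Error::XmlReader(msg) => write!(f, "XML error: {}", msg),
            Error::JsonSerialization(msg) => write!(f, "JSON error: {}", msg),
        }
    }
}

// From<io::Error> lets `?` convert I/O failures automatically,
// which is what #[from] on the thiserror variant provides.
impl From<std::io::Error> for Error {
    fn from(e: std::io::Error) -> Self {
        Error::Io(e)
    }
}

fn main() {
    let err: Error = std::io::Error::new(std::io::ErrorKind::NotFound, "missing dump").into();
    println!("{}", err);
}
```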

Public Exports

The library’s lib.rs re-exports key types for convenience:

pub use cleaner::clean_wikitext;
pub use dump::{open_dump, Article, DumpReader};
pub use error::Error;
pub use extractor::{format_page, make_url, parse_file_size, OutputFormat};
pub use output::{OutputConfig, OutputSplitter};

Data Flow

This page describes how data flows through wicket from input to output.

Processing Pipeline

Input (.xml / .xml.bz2)
  |
  v
DumpReader (streaming XML parse + namespace filter)
  |  yields Article { id, title, namespace, text }
  v
Batch (1000 articles)
  |
  v
rayon par_iter (parallel processing)
  |  clean_wikitext(text) -> plain text
  |  format_page(id, title, url_base, text, format) -> formatted string
  v
OutputSplitter (sequential write, file rotation)
  |
  v
Output files (AA/wiki_00, AA/wiki_01, ...)

Stage Details

1. XML Dump Reading

DumpReader uses quick-xml to parse the MediaWiki XML dump as a stream. For .xml.bz2 files, the stream is automatically wrapped with MultiBzDecoder for transparent decompression.

The reader extracts:

  • Page ID (<id> inside <page>, not inside <revision>)
  • Title (<title>)
  • Namespace (<ns>)
  • Wikitext body (<text>)
  • URL base from <siteinfo><base> (extracted once at startup)

Pages with namespaces not in the filter list are skipped at the iterator level.
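Conceptually, that filter amounts to a membership check against the allow-list. A minimal standalone sketch (with Article simplified to the fields involved, not wicket's real struct):

```rust
// Simplified stand-in for wicket's Article: only the fields the filter touches.
struct Article {
    namespace: i32,
    title: String,
}

// Keep only pages whose namespace is in the allow-list,
// as DumpReader does internally at the iterator level.
fn filter_namespaces(articles: Vec<Article>, namespaces: &[i32]) -> Vec<Article> {
    articles
        .into_iter()
        .filter(|a| namespaces.contains(&a.namespace))
        .collect()
}

fn main() {
    let pages = vec![
        Article { namespace: 0, title: "April".into() },
        Article { namespace: 1, title: "Talk:April".into() },
    ];
    // Namespace 0 = main articles, so the talk page is dropped.
    let main_only = filter_namespaces(pages, &[0]);
    println!("{} page(s) kept: {}", main_only.len(), main_only[0].title);
}
```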

2. Batch Collection

Articles are collected into batches of 1000 from the DumpReader iterator. This batch size balances parallelization overhead against memory usage.

3. Parallel Processing

Each batch is processed with rayon::par_iter(), which distributes work across CPU cores:

  1. clean_wikitext(text) – Converts wikitext markup to plain text. This is the most CPU-intensive step.
  2. format_page(id, title, url, text, format) – Formats the clean text into doc or JSON format.

Results are collected in order (rayon preserves element ordering with par_iter).

4. Sequential Output

Formatted strings are written sequentially to the OutputSplitter, which:

  • Creates subdirectories (AA, AB, …, ZZ) as needed
  • Rotates to a new file after reaching the configured size limit
  • Applies bzip2 compression when enabled
  • Outputs to stdout when the path is "-"

Parallelization Strategy

wicket uses a batch-based parallelization approach rather than a pipeline with channels:

  1. The main thread reads articles from the DumpReader in batches of 1000
  2. Each batch is processed in parallel using rayon::par_iter()
  3. Results are written sequentially to maintain deterministic output ordering
  4. This repeats until all articles are processed

This approach is simple, maintains output ordering, and effectively parallelizes the CPU-bound cleaning step while keeping the I/O-bound reading and writing sequential.
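The loop structure can be sketched as follows. This is an illustrative skeleton, not the CLI's actual code: `clean` is a placeholder for clean_wikitext, and the inner `map` is sequential here where the real implementation uses rayon's `par_iter()` (which preserves ordering in the same way).

```rust
// Placeholder for the CPU-bound cleaning step (clean_wikitext in wicket).
fn clean(text: &str) -> String {
    text.trim().to_string()
}

// Read in batches, process each batch, write results in order.
fn process_in_batches(articles: Vec<String>, batch_size: usize) -> Vec<String> {
    let mut out = Vec::with_capacity(articles.len());
    let mut iter = articles.into_iter();
    loop {
        // 1. Collect the next batch (up to batch_size articles).
        let batch: Vec<String> = iter.by_ref().take(batch_size).collect();
        if batch.is_empty() {
            break;
        }
        // 2. Process the batch; rayon's par_iter would parallelize this map.
        let cleaned: Vec<String> = batch.iter().map(|t| clean(t)).collect();
        // 3. Append results sequentially, preserving input order.
        out.extend(cleaned);
    }
    out
}

fn main() {
    let input: Vec<String> = (0..2500).map(|i| format!(" article {} ", i)).collect();
    let output = process_in_batches(input, 1000);
    println!("processed {} articles; first = {:?}", output.len(), output[0]);
}
```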

Library API Overview

The wicket crate provides a Rust API for extracting plain text from Wikipedia XML dump files.

Installation

[dependencies]
wicket = "0.1.0"

Module Map

Module             Primary Types                             Purpose
wicket::dump       Article, DumpReader, open_dump()          Streaming XML dump parsing
wicket::cleaner    clean_wikitext()                          Wikitext to plain text conversion
wicket::extractor  OutputFormat, format_page(), make_url()   Output formatting
wicket::output     OutputConfig, OutputSplitter              File splitting and rotation
wicket::error      Error                                     Error type definitions

Quick Example

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        print!("{}", output);
    }

    Ok(())
}

API Documentation

Full API documentation is available on docs.rs/wicket.

dump

The dump module provides streaming XML parsing of MediaWiki dump files.

Types

Article

A single Wikipedia page extracted from the dump.

Field      Type     Description
id         u64      Page ID
title      String   Page title
namespace  i32      Namespace ID (0 = main articles)
text       String   Raw wikitext content

DumpReader<R: BufRead>

An iterator that streams Article values from an XML dump source.

  • Implements Iterator<Item = Result<Article, Error>>
  • Filters articles by namespace at the iterator level
  • Exposes url_base() to retrieve the wiki’s base URL from <siteinfo>

Functions

open_dump(path: &Path, namespaces: &[i32]) -> Result<DumpReader<...>>

Opens a Wikipedia XML dump file for reading.

  • Automatically detects .bz2 extension and applies MultiBzDecoder
  • Parses <siteinfo> to extract the URL base
  • Configures namespace filtering

Usage

use wicket::open_dump;

let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
println!("URL base: {}", reader.url_base());

for result in reader {
    let article = result?;
    println!("[{}] {}", article.id, article.title);
}

cleaner

The cleaner module converts MediaWiki wikitext markup into plain text.

Functions

clean_wikitext(wikitext: &str) -> String

Converts raw wikitext into clean plain text by removing all MediaWiki markup.

The cleaning process uses a three-stage approach:

  1. AST-based cleaning – Uses parse_wiki_text_2 to parse the wikitext into an AST and extracts text content from relevant nodes
  2. Regex fallback – When AST parsing fails or for markup not handled by the AST, applies regex-based pattern removal
  3. Post-processing – Removes markup remnants that survive the first two stages, such as orphaned template braces (}}), template parameter lines, and HTML comment fragments

The parser is configured with both English and Japanese Wikipedia namespaces, so it correctly handles dumps from either language edition without requiring any configuration changes.

Handled Markup

The cleaner handles the following MediaWiki markup elements:

  • Bold/Italic – '''bold''' and ''italic''
  • Internal links – [[Article]] and [[Article|display text]]
  • External links – [https://example.com text]
  • Templates – {{template|...}}
  • HTML tags – <ref>, <nowiki>, <gallery>, etc.
  • Categories – [[Category:...]] and [[カテゴリ:...]]
  • Files – [[File:...]], [[Image:...]], and [[ファイル:...]]
  • Tables – Wikitext table markup
  • Comments – <!-- comments -->
  • Magic words – __TOC__, __NOTOC__, etc.
  • Redirects – #REDIRECT and #転送
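To make one of these rules concrete, here is a minimal standalone sketch of stripping the bold/italic quote markup. This is only an illustration: the real cleaner works on a parse_wiki_text_2 AST with a regex fallback, and a plain string replace like this is safe only for this one construct.

```rust
// Illustrative only: strip ''' (bold) and '' (italic) quote markup.
// Replacing ''' before '' matters, since '' is a prefix of '''.
fn strip_bold_italic(wikitext: &str) -> String {
    wikitext.replace("'''", "").replace("''", "")
}

fn main() {
    let s = strip_bold_italic("'''April''' is the ''fourth'' month.");
    println!("{}", s);
}
```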

Usage

use wicket::clean_wikitext;

let wikitext = "'''April''' is the [[month|fourth month]] of the year.";
let text = clean_wikitext(wikitext);
assert_eq!(text, "April is the fourth month of the year.");

extractor

The extractor module formats extracted articles into the final output representation.

Types

OutputFormat

An enum specifying the output format.

Variant  Description
Doc      Doc format with XML-like tags
Json     JSON Lines format (one JSON object per article)

Functions

format_page(id: u64, title: &str, url: &str, text: &str, format: OutputFormat) -> String

Formats a single article into the specified output format.

Doc format output:

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON format output:

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}

make_url(url_base: &str, title: &str) -> String

Constructs a full Wikipedia article URL from the URL base and title. Spaces in the title are replaced with underscores.
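A plausible sketch of that construction follows. It assumes url_base points at an article page such as .../wiki/Main_Page (as parsed from <siteinfo><base>), so the last path segment is replaced with the underscored title; wicket's actual implementation may differ in details such as percent-encoding.

```rust
// Sketch: keep the base URL up to and including the last '/',
// then append the title with spaces replaced by underscores.
fn make_url(url_base: &str, title: &str) -> String {
    let dir = match url_base.rfind('/') {
        Some(i) => &url_base[..i + 1],
        None => url_base,
    };
    format!("{}{}", dir, title.replace(' ', "_"))
}

fn main() {
    let url = make_url("https://en.wikipedia.org/wiki/Main_Page", "April Fools' Day");
    println!("{}", url);
}
```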

parse_file_size(spec: &str) -> Result<u64, Error>

Parses a human-readable file size specification into bytes.

Input    Result
"1M"     1,048,576
"500K"   512,000
"1G"     1,073,741,824
"0"      0 (one article per file)

output

The output module manages writing extracted articles to split output files following a two-letter directory naming convention.

Types

OutputConfig

Configuration for the output splitter.

Field          Type     Description
path           PathBuf  Output directory path, or "-" for stdout
max_file_size  u64      Maximum bytes per output file
compress       bool     Whether to compress output with bzip2

OutputSplitter

Manages file rotation and writing. Creates subdirectories and files as needed.

Directory Naming Convention

Output files are organized using the following naming convention:

output/
  AA/
    wiki_00
    wiki_01
    ...
    wiki_99
  AB/
    wiki_00
    ...
  • Each directory holds up to 100 files
  • Directory names follow the pattern AA, AB, …, AZ, BA, …, ZZ
  • When compress is enabled, files are named wiki_00.bz2, etc.
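The mapping from a running file index to a directory and file name can be sketched as below. This is an illustration of the scheme described above, not OutputSplitter's actual code.

```rust
// Map a running file index to (directory, file name):
// 100 files per directory, directories AA, AB, ..., AZ, BA, ..., ZZ.
fn output_path(file_index: u64) -> (String, String) {
    let dir_index = file_index / 100; // 0 => AA, 1 => AB, ..., 675 => ZZ
    let first = (b'A' + (dir_index / 26) as u8) as char;
    let second = (b'A' + (dir_index % 26) as u8) as char;
    let dir = format!("{}{}", first, second);
    let file = format!("wiki_{:02}", file_index % 100);
    (dir, file)
}

fn main() {
    for i in [0u64, 1, 99, 100, 101] {
        let (dir, file) = output_path(i);
        println!("index {:>3} -> {}/{}", i, dir, file);
    }
}
```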

Special Modes

  • stdout mode – When path is "-", all output is written to stdout without splitting
  • Zero size – When max_file_size is 0, each article is written to its own file

error

The error module defines the error types used throughout the wicket library.

Error Type

The Error enum is derived using thiserror and covers all error conditions:

Variant            Source             Description
Io                 std::io::Error     File I/O errors
XmlReader          quick_xml::Error   XML parsing errors
JsonSerialization  serde_json::Error  JSON serialization errors

Result Type

The library provides a Result type alias:

pub type Result<T> = std::result::Result<T, Error>;

All public functions in the library return this Result type.

CLI Reference Overview

The wicket CLI extracts plain text from Wikipedia XML dump files.

Usage

wicket [OPTIONS] <INPUT>

Quick Reference

Option           Description                                          Default
<INPUT>          Input Wikipedia XML dump file (.xml or .xml.bz2)     (required)
-o, --output     Output directory, or - for stdout                    text
-b, --bytes      Maximum bytes per output file (e.g., 1M, 500K, 1G)   1M
-c, --compress   Compress output files using bzip2                    false
--json           Write output in JSON format                          false
--processes      Number of parallel workers                           CPU count
-q, --quiet      Suppress progress output on stderr                   false
--namespaces     Comma-separated namespace IDs to extract             0

Detailed Documentation

  • Options – detailed description of all CLI options
  • Examples – common usage patterns and examples

CLI Options

Input

wicket <INPUT>

The input file is a positional argument. It must be a Wikipedia XML dump file, either uncompressed (.xml) or bzip2-compressed (.xml.bz2). Compression is detected automatically by file extension.

Output Directory

wicket dump.xml.bz2 -o output/
wicket dump.xml.bz2 -o -

-o, --output <PATH> – Specifies the output directory. Defaults to text.

  • When set to a directory path, output files are created using the two-letter directory naming convention (AA/wiki_00, etc.)
  • When set to -, all output is written to stdout without file splitting

File Size

wicket dump.xml.bz2 -b 500K
wicket dump.xml.bz2 -b 1M
wicket dump.xml.bz2 -b 1G
wicket dump.xml.bz2 -b 0

-b, --bytes <SIZE> – Maximum bytes per output file. Defaults to 1M.

Supported suffixes: K (kilobytes), M (megabytes), G (gigabytes). When set to 0, each article is written to its own file.

Compression

wicket dump.xml.bz2 -c

-c, --compress – Compress output files using bzip2. Output files will have a .bz2 extension.

JSON Output

wicket dump.xml.bz2 --json

--json – Write output in JSON Lines format (one JSON object per line) instead of the default doc format.

Parallel Workers

wicket dump.xml.bz2 --processes 8

--processes <N> – Number of parallel workers for text cleaning. Defaults to the number of CPU cores.

Quiet Mode

wicket dump.xml.bz2 -q

-q, --quiet – Suppress progress output on stderr. Useful when piping output to another command.

Namespace Filtering

wicket dump.xml.bz2 --namespaces 0
wicket dump.xml.bz2 --namespaces 0,1,2

--namespaces <IDS> – Comma-separated list of namespace IDs to extract. Defaults to 0 (main articles only).

Common namespace IDs:

ID   Namespace
0    Main (articles)
1    Talk
2    User
3    User talk
4    Wikipedia
6    File
10   Template
14   Category

CLI Examples

Basic Extraction

Extract text from a Wikipedia dump into the default text/ directory:

wicket simplewiki-latest-pages-articles.xml.bz2

Custom Output Directory

wicket dump.xml.bz2 -o output/

Write to stdout

Pipe output directly to another command:

wicket dump.xml.bz2 -o - -q | wc -l

JSON Output with Compression

wicket dump.xml.bz2 -o output/ --json -c

Extract Talk Pages

Extract namespace 1 (talk pages) with 8 workers:

wicket dump.xml.bz2 -o output/ --namespaces 1 --processes 8

Multiple Namespaces

Extract main articles and user pages:

wicket dump.xml.bz2 -o output/ --namespaces 0,2

Small Output Files

Split output into 500 KB files:

wicket dump.xml.bz2 -o output/ -b 500K

One Article per File

wicket dump.xml.bz2 -o output/ -b 0

Output Directory Structure

After extraction, the output directory looks like:

output/
  AA/
    wiki_00
    wiki_01
    ...
    wiki_99
  AB/
    wiki_00
    ...

With --compress:

output/
  AA/
    wiki_00.bz2
    wiki_01.bz2
    ...

License

wicket is dual-licensed: you may use it under either the MIT License or the Apache License 2.0, at your option.

MIT License

MIT License

Copyright (c) 2025 Minoru OSUKA

Apache License 2.0

Apache License, Version 2.0

Copyright (c) 2025 Minoru OSUKA

Full License Text

The complete license text is available in the LICENSE file in the repository.