Introduction
wicket is a high-performance tool that extracts plain text from Wikipedia XML dump files, combining streaming I/O with parallel execution to process multi-gigabyte dumps quickly.
Key Features
- Streaming XML parsing – handles multi-gigabyte dumps without loading them into memory
- Parallel text extraction – uses multiple CPU cores via rayon
- Automatic bzip2 decompression – transparently handles .xml.bz2 dump files
- Dual output formats – both doc format and JSON format
- File splitting – configurable maximum size per output file
- Namespace filtering – extract only specific page types (main articles, talk pages, etc.)
Output Formats
Doc Format (default)
<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>
JSON Format
{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}
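Both records can be reproduced with plain string formatting. The sketch below illustrates the documented output shapes only; it is not wicket's actual formatter, and it skips the JSON string escaping a real implementation needs. Note that id is emitted as a JSON string, matching the example above.

```rust
// Build the doc-format record for one article.
fn doc_record(id: u64, url: &str, title: &str, text: &str) -> String {
    format!("<doc id=\"{id}\" url=\"{url}\" title=\"{title}\">\n{text}\n</doc>")
}

// Build the JSON Lines record for one article.
// (Real JSON output must escape quotes and backslashes; omitted for brevity.)
fn json_record(id: u64, url: &str, title: &str, text: &str) -> String {
    format!("{{\"id\":\"{id}\",\"url\":\"{url}\",\"title\":\"{title}\",\"text\":\"{text}\"}}")
}

fn main() {
    let text = "April is the fourth month of the year...";
    let url = "https://en.wikipedia.org/wiki/April";
    println!("{}", doc_record(1, url, "April", text));
    println!("{}", json_record(1, url, "April", text));
}
```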
Current Version
wicket v0.1.0 – Rust Edition 2024, minimum Rust version 1.85.
Getting Started
Welcome to wicket! This section will help you get up and running quickly.
wicket extracts plain text from Wikipedia XML dump files. It reads MediaWiki XML dumps (optionally bzip2-compressed), removes wiki markup, and writes clean text in doc or JSON format.
Next Steps
- Installation – install wicket from source or crates.io
- Quick Start – extract text from a Wikipedia dump in minutes
Installation
Prerequisites
- Rust 1.85 or later (stable channel) from rust-lang.org
- Cargo (Rust’s package manager, included with Rust)
Installing the CLI Tool
From crates.io
cargo install wicket-cli
From Source
git clone https://github.com/mosuka/wicket.git
cd wicket
cargo build --release
The binary will be available at ./target/release/wicket.
Verify the installation:
./target/release/wicket --help
Using as a Library
Add wicket to your project’s Cargo.toml:
[dependencies]
wicket = "0.1.0"
Supported Platforms
wicket is tested on the following platforms:
| OS | Architecture |
|---|---|
| Linux | x86_64, aarch64 |
| macOS | x86_64 (Intel), aarch64 (Apple Silicon) |
| Windows | x86_64, aarch64 |
Quick Start
Obtaining a Wikipedia Dump
Download a Wikipedia dump from https://dumps.wikimedia.org/. For testing, the Simple English Wikipedia dump is recommended due to its small size:
wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2
CLI Quick Start
Basic Extraction
Extract plain text from a Wikipedia dump:
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/
This reads the dump, extracts plain text from all main namespace articles, and writes the output to the output/ directory in doc format, splitting files at 1 MB.
JSON Output
wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json
Write to stdout
wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50
Library Quick Start
Here is a minimal Rust program that opens a dump and processes articles:
use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("simplewiki-latest-pages-articles.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();
    for result in reader.take(5) {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);
    }
    Ok(())
}
What’s Next
- CLI Reference – learn all CLI options
- Architecture – understand how wicket works internally
Architecture Overview
wicket is designed as a high-performance, streaming Wikipedia dump text extractor. It processes multi-gigabyte XML dumps by combining streaming I/O with batch-based parallelism.
High-Level Data Flow
Input (.xml / .xml.bz2)
|
v
DumpReader (streaming XML parse + namespace filter)
| yields Article { id, title, namespace, text }
v
Batch (1000 articles)
|
v
rayon par_iter (parallel processing)
| clean_wikitext(text) -> plain text
| format_page(id, title, url_base, text, format) -> formatted string
v
OutputSplitter (sequential write, file rotation)
|
v
Output files (AA/wiki_00, AA/wiki_01, ...)
Design Principles
- Streaming processing – XML is parsed as a stream; only one article is in memory at a time
- Batch parallelism – CPU-bound wikitext cleaning is parallelized via rayon while I/O remains sequential
- Structured output – doc format and JSON format with organized directory structure
- Fail-soft – malformed pages are logged and skipped rather than causing the entire process to abort
- Library-first – core functionality lives in the wicket library crate; the CLI is a thin wrapper
Workspace Structure
wicket is organized as a Cargo workspace with two crates and supporting directories.
Directory Layout
wicket/
├── Cargo.toml # Workspace manifest
├── Cargo.lock # Dependency lock file
├── LICENSE # MIT OR Apache-2.0
├── README.md # Project overview
├── wicket/ # Core library crate
│ ├── Cargo.toml
│ └── src/
│ ├── lib.rs # Module declarations and re-exports
│ ├── dump.rs # XML dump streaming parser
│ ├── cleaner.rs # Wikitext to plain text conversion
│ ├── extractor.rs # Output formatting (doc/JSON)
│ ├── output.rs # File splitting and rotation
│ └── error.rs # Error types
├── wicket-cli/ # CLI binary crate
│ ├── Cargo.toml
│ └── src/
│ └── main.rs # CLI entry point
├── docs/ # mdBook documentation (this book)
│ ├── book.toml
│ ├── src/
│ └── ja/ # Japanese documentation
│ ├── book.toml
│ └── src/
└── .github/
└── workflows/ # CI/CD pipelines
├── regression.yml # Test on push/PR
├── release.yml # Release builds and publishing
├── periodic.yml # Weekly stability tests
└── deploy-docs.yml # Documentation deployment
Crate Details
wicket (Core Library)
The core library provides streaming XML parsing, wikitext cleaning, output formatting, and file splitting.
| Dependency | Version | Purpose |
|---|---|---|
| quick-xml | 0.39 | Streaming XML parsing |
| parse-wiki-text-2 | 0.2 | Wikitext AST parsing |
| regex | 1.12 | Fallback wikitext cleaning |
| bzip2 | 0.6 | Bzip2 compression/decompression |
| serde | 1.0 | Serialization framework |
| serde_json | 1.0 | JSON output formatting |
| rayon | 1.11 | Data parallelism (used by CLI) |
| thiserror | 2.0 | Error type derivation |
| log | 0.4 | Logging facade |
wicket-cli (CLI Binary)
The CLI provides a command-line interface to wicket’s functionality.
| Dependency | Version | Purpose |
|---|---|---|
| clap | 4.5 | Command-line argument parsing |
| rayon | 1.11 | Parallel batch processing |
| bzip2 | 0.6 | Compressed output support |
| env_logger | 0.11 | Logging output |
| anyhow | 1.0 | Error handling in binary |
| wicket | 0.1 | Core library (workspace member) |
Workspace Configuration
The workspace uses Cargo resolver version 3 (Rust Edition 2024):
[workspace]
resolver = "3"
members = ["wicket", "wicket-cli"]
[workspace.package]
version = "0.1.0"
edition = "2024"
license = "MIT OR Apache-2.0"
Shared dependencies are defined at the workspace level in [workspace.dependencies] and referenced by each crate with { workspace = true }.
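As an illustration, a dependency shared by both crates might be declared once at the workspace level and inherited by each member. The fragment below is hypothetical (using serde as the example); the actual manifests live in the repository:

```toml
# Workspace root Cargo.toml (fragment)
[workspace.dependencies]
serde = "1.0"

# Member crate Cargo.toml (fragment): inherit version from the workspace
[dependencies]
serde = { workspace = true }
```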
Module Design
The wicket library crate is organized into five modules, each with a clear responsibility.
Module Overview
| Module | Primary Types | Purpose |
|---|---|---|
| dump | Article, DumpReader | Streaming XML dump parsing |
| cleaner | clean_wikitext() | Wikitext to plain text conversion |
| extractor | OutputFormat, format_page() | Output formatting (doc/JSON) |
| output | OutputConfig, OutputSplitter | File splitting and rotation |
| error | Error | Error type definitions |
Module Details
dump – XML Dump Reader
Streaming XML parser built on quick-xml. Reads MediaWiki XML dump files and yields Article structs.
- Article – A single Wikipedia page with id (u64), title (String), namespace (i32), and text (String)
- DumpReader<R: BufRead> – Iterator that streams articles from an XML source with namespace filtering
- open_dump(path, namespaces) – Opens a dump file with automatic .bz2 detection using MultiBzDecoder
The reader parses <siteinfo><base> to extract the wiki’s URL base, which is exposed via url_base().
cleaner – Wikitext Cleaner
Converts MediaWiki markup into plain text using a staged approach:
- AST-based cleaning – Uses parse_wiki_text_2 to build an AST and walks text nodes
- Regex fallback – When AST parsing fails, falls back to regex-based cleanup
Key function: clean_wikitext(wikitext: &str) -> String
extractor – Output Formatter
Formats extracted articles into the output representation.
- OutputFormat – Enum with Doc and Json variants
- format_page(id, title, url, text, format) – Formats a single article
- make_url(url_base, title) – Constructs a Wikipedia article URL
- parse_file_size(spec) – Parses size specifications like 1M, 500K, 1G
output – File Splitter
Manages writing extracted articles to split output files following a two-letter directory naming convention.
- OutputConfig – Configuration for output path, max file size, and compression
- OutputSplitter – Manages file rotation with AA/wiki_00 naming (100 files per directory, directories AA through ZZ)
Supports stdout output (path = "-"), bzip2 compression, and configurable file size limits.
error – Error Types
Defines the Error enum using thiserror:
- Io – I/O errors
- XmlReader – XML parsing errors from quick-xml
- JsonSerialization – JSON serialization errors
Public Exports
The library’s lib.rs re-exports key types for convenience:
pub use cleaner::clean_wikitext;
pub use dump::{open_dump, Article, DumpReader};
pub use error::Error;
pub use extractor::{format_page, make_url, parse_file_size, OutputFormat};
pub use output::{OutputConfig, OutputSplitter};
Data Flow
This page describes how data flows through wicket from input to output.
Processing Pipeline
Input (.xml / .xml.bz2)
|
v
DumpReader (streaming XML parse + namespace filter)
| yields Article { id, title, namespace, text }
v
Batch (1000 articles)
|
v
rayon par_iter (parallel processing)
| clean_wikitext(text) -> plain text
| format_page(id, title, url_base, text, format) -> formatted string
v
OutputSplitter (sequential write, file rotation)
|
v
Output files (AA/wiki_00, AA/wiki_01, ...)
Stage Details
1. XML Dump Reading
DumpReader uses quick-xml to parse the MediaWiki XML dump as a stream. For .xml.bz2 files, the stream is automatically wrapped with MultiBzDecoder for transparent decompression.
The reader extracts:
- Page ID (<id> inside <page>, not inside <revision>)
- Title (<title>)
- Namespace (<ns>)
- Wikitext body (<text>)
- URL base from <siteinfo><base> (extracted once at startup)
Pages with namespaces not in the filter list are skipped at the iterator level.
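The iterator-level skip amounts to a membership test on the parsed namespace. The sketch below illustrates the idea only; Article here is a stand-in struct, not wicket's actual type, and the real filter runs while streaming rather than over a collected Vec:

```rust
// Stand-in for wicket's Article; only the fields relevant to filtering.
struct Article {
    namespace: i32,
    title: String,
}

// Keep only articles whose namespace is in the allow-list,
// mirroring the skip-at-the-iterator-level behaviour described above.
fn filter_namespaces(articles: Vec<Article>, namespaces: &[i32]) -> Vec<Article> {
    articles
        .into_iter()
        .filter(|a| namespaces.contains(&a.namespace))
        .collect()
}

fn main() {
    let articles = vec![
        Article { namespace: 0, title: "April".to_string() },
        Article { namespace: 1, title: "Talk:April".to_string() },
    ];
    // Namespace 0 only: the talk page is skipped.
    let kept = filter_namespaces(articles, &[0]);
    assert_eq!(kept.len(), 1);
    assert_eq!(kept[0].title, "April");
    println!("kept {} article(s)", kept.len());
}
```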
2. Batch Collection
Articles are collected into batches of 1000 from the DumpReader iterator. This batch size balances parallelization overhead against memory usage.
3. Parallel Processing
Each batch is processed with rayon::par_iter(), which distributes work across CPU cores:
- clean_wikitext(text) – Converts wikitext markup to plain text. This is the most CPU-intensive step.
- format_page(id, title, url, text, format) – Formats the clean text into doc or JSON format.
Results are collected in order (rayon preserves element ordering with par_iter).
4. Sequential Output
Formatted strings are written sequentially to the OutputSplitter, which:
- Creates subdirectories (AA, AB, …, ZZ) as needed
- Rotates to a new file after reaching the configured size limit
- Applies bzip2 compression when enabled
- Outputs to stdout when the path is "-"
Parallelization Strategy
wicket uses a batch-based parallelization approach rather than a pipeline with channels:
- The main thread reads articles from the DumpReader in batches of 1000
- Each batch is processed in parallel using rayon::par_iter()
- Results are written sequentially to maintain deterministic output ordering
- This repeats until all articles are processed
This approach is simple, maintains output ordering, and effectively parallelizes the CPU-bound cleaning step while keeping the I/O-bound reading and writing sequential.
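The batch loop can be sketched with standard-library threads. wicket itself uses rayon's par_iter; the sketch below is an illustrative stand-in that shows the two properties that matter here, splitting a batch across workers and reassembling results in input order. fake_clean is a hypothetical stand-in for clean_wikitext:

```rust
use std::thread;

// Stand-in for clean_wikitext: just strips bold markers.
fn fake_clean(text: &str) -> String {
    text.replace("'''", "")
}

// Process one batch in parallel while preserving input order:
// split into chunks, clean each chunk on its own thread, then
// join the handles in spawn order so the output is deterministic.
fn process_batch(batch: Vec<String>, workers: usize) -> Vec<String> {
    let chunk_size = ((batch.len() + workers - 1) / workers).max(1);
    let chunks: Vec<&[String]> = batch.chunks(chunk_size).collect();
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .into_iter()
            .map(|chunk| s.spawn(move || chunk.iter().map(|t| fake_clean(t)).collect::<Vec<_>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let batch: Vec<String> = (0..10).map(|i| format!("'''page {i}'''")).collect();
    let cleaned = process_batch(batch, 4);
    assert_eq!(cleaned[0], "page 0");
    assert_eq!(cleaned[9], "page 9");
    println!("{}", cleaned.join("\n"));
}
```

The same shape scales to wicket's 1000-article batches: parallelism is confined to the CPU-bound middle stage, while reads and writes stay sequential.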
Library API Overview
The wicket crate provides a Rust API for extracting plain text from Wikipedia XML dump files.
Installation
[dependencies]
wicket = "0.1.0"
Module Map
| Module | Primary Types | Purpose |
|---|---|---|
| wicket::dump | Article, DumpReader, open_dump() | Streaming XML dump parsing |
| wicket::cleaner | clean_wikitext() | Wikitext to plain text conversion |
| wicket::extractor | OutputFormat, format_page(), make_url() | Output formatting |
| wicket::output | OutputConfig, OutputSplitter | File splitting and rotation |
| wicket::error | Error | Error type definitions |
Quick Example
use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();
    for result in reader {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        print!("{}", output);
    }
    Ok(())
}
API Documentation
Full API documentation is available on docs.rs/wicket.
dump
The dump module provides streaming XML parsing of MediaWiki dump files.
Types
Article
A single Wikipedia page extracted from the dump.
| Field | Type | Description |
|---|---|---|
| id | u64 | Page ID |
| title | String | Page title |
| namespace | i32 | Namespace ID (0 = main articles) |
| text | String | Raw wikitext content |
DumpReader<R: BufRead>
An iterator that streams Article values from an XML dump source.
- Implements Iterator<Item = Result<Article, Error>>
- Filters articles by namespace at the iterator level
- Exposes url_base() to retrieve the wiki’s base URL from <siteinfo>
Functions
open_dump(path: &Path, namespaces: &[i32]) -> Result<DumpReader<...>>
Opens a Wikipedia XML dump file for reading.
- Automatically detects the .bz2 extension and applies MultiBzDecoder
- Parses <siteinfo> to extract the URL base
- Configures namespace filtering
Usage
use wicket::open_dump;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    println!("URL base: {}", reader.url_base());
    for result in reader {
        let article = result?;
        println!("[{}] {}", article.id, article.title);
    }
    Ok(())
}
cleaner
The cleaner module converts MediaWiki wikitext markup into plain text.
Functions
clean_wikitext(wikitext: &str) -> String
Converts raw wikitext into clean plain text by removing all MediaWiki markup.
The cleaning process uses a three-stage approach:
- AST-based cleaning – Uses parse_wiki_text_2 to parse the wikitext into an AST and extracts text content from relevant nodes
- Regex fallback – When AST parsing fails or for markup not handled by the AST, applies regex-based pattern removal
- Post-processing – Removes markup remnants that survive the first two stages, such as orphaned template braces (}}), template parameter lines, and HTML comment fragments
The parser is configured with both English and Japanese Wikipedia namespaces, so it correctly handles dumps from either language edition without requiring any configuration changes.
Handled Markup
The cleaner handles the following MediaWiki markup elements:
- Bold/Italic – '''bold''' and ''italic''
- Internal links – [[Article]] and [[Article|display text]]
- External links – [https://example.com text]
- Templates – {{template|...}}
- HTML tags – <ref>, <nowiki>, <gallery>, etc.
- Categories – [[Category:...]] and [[カテゴリ:...]]
- Files – [[File:...]], [[Image:...]], and [[ファイル:...]]
- Tables – Wikitext table markup
- Comments – <!-- comments -->
- Magic words – __TOC__, __NOTOC__, etc.
- Redirects – #REDIRECT and #転送
Usage
use wicket::clean_wikitext;

fn main() {
    let wikitext = "'''April''' is the [[month|fourth month]] of the year.";
    let text = clean_wikitext(wikitext);
    assert_eq!(text, "April is the fourth month of the year.");
}
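To make the fallback idea concrete, here is a deliberately naive, standard-library-only sketch that handles just bold markers and internal links. It is not wicket's cleaner (which covers templates, tables, references, and much more), only an illustration of the kind of pattern removal the fallback stage performs:

```rust
// Naive markup stripper: removes ''' / '' and resolves [[target|display]]
// links to their display text. Illustrative only; real wikitext needs far
// more than this (templates, tables, refs, nesting).
fn strip_basic_markup(wikitext: &str) -> String {
    let mut out = wikitext.replace("'''", "").replace("''", "");
    while let Some(start) = out.find("[[") {
        let Some(end_rel) = out[start..].find("]]") else { break };
        let end = start + end_rel;
        let inner = &out[start + 2..end];
        // [[a|b]] keeps "b"; [[a]] keeps "a".
        let kept = inner.rsplit('|').next().unwrap_or(inner).to_string();
        out.replace_range(start..end + 2, &kept);
    }
    out
}

fn main() {
    let cleaned = strip_basic_markup("'''April''' is the [[month|fourth month]] of the year.");
    assert_eq!(cleaned, "April is the fourth month of the year.");
    println!("{cleaned}");
}
```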
extractor
The extractor module formats extracted articles into the final output representation.
Types
OutputFormat
An enum specifying the output format.
| Variant | Description |
|---|---|
| Doc | Doc format with XML-like tags |
| Json | JSON Lines format (one JSON object per article) |
Functions
format_page(id: u64, title: &str, url: &str, text: &str, format: OutputFormat) -> String
Formats a single article into the specified output format.
Doc format output:
<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>
JSON format output:
{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}
make_url(url_base: &str, title: &str) -> String
Constructs a full Wikipedia article URL from the URL base and title. Spaces in the title are replaced with underscores.
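The documented contract, joining the base and the underscored title, can be sketched as follows. This is an illustrative reimplementation, not wicket's code, and it assumes the base has already been reduced to the article-URL prefix (e.g. https://en.wikipedia.org/wiki):

```rust
// Sketch of make_url's documented behaviour: append the title to the
// base URL, replacing spaces with underscores.
fn make_url(url_base: &str, title: &str) -> String {
    format!("{}/{}", url_base.trim_end_matches('/'), title.replace(' ', "_"))
}

fn main() {
    let url = make_url("https://en.wikipedia.org/wiki", "April Fools Day");
    assert_eq!(url, "https://en.wikipedia.org/wiki/April_Fools_Day");
    println!("{url}");
}
```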
parse_file_size(spec: &str) -> Result<u64, Error>
Parses a human-readable file size specification into bytes.
| Input | Result |
|---|---|
| "1M" | 1,048,576 |
| "500K" | 512,000 |
| "1G" | 1,073,741,824 |
| "0" | 0 (one article per file) |
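The table above implies 1024-based multipliers. A minimal reimplementation of that contract might look like this (illustrative only; error handling is reduced to Option, whereas the real function returns Result<u64, Error>):

```rust
// Sketch of the documented parse_file_size contract: a decimal number
// with an optional K/M/G suffix, using 1024-based multipliers.
fn parse_file_size(spec: &str) -> Option<u64> {
    let spec = spec.trim();
    let (digits, multiplier) = match spec.chars().last()? {
        'K' | 'k' => (&spec[..spec.len() - 1], 1024u64),
        'M' | 'm' => (&spec[..spec.len() - 1], 1024 * 1024),
        'G' | 'g' => (&spec[..spec.len() - 1], 1024 * 1024 * 1024),
        _ => (spec, 1),
    };
    digits.parse::<u64>().ok().map(|n| n * multiplier)
}

fn main() {
    assert_eq!(parse_file_size("1M"), Some(1_048_576));
    assert_eq!(parse_file_size("500K"), Some(512_000));
    assert_eq!(parse_file_size("1G"), Some(1_073_741_824));
    assert_eq!(parse_file_size("0"), Some(0));
    println!("1M = {} bytes", parse_file_size("1M").unwrap());
}
```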
output
The output module manages writing extracted articles to split output files following a two-letter directory naming convention.
Types
OutputConfig
Configuration for the output splitter.
| Field | Type | Description |
|---|---|---|
| path | PathBuf | Output directory path, or "-" for stdout |
| max_file_size | u64 | Maximum bytes per output file |
| compress | bool | Whether to compress output with bzip2 |
OutputSplitter
Manages file rotation and writing. Creates subdirectories and files as needed.
Directory Naming Convention
Output files are organized using the following naming convention:
output/
AA/
wiki_00
wiki_01
...
wiki_99
AB/
wiki_00
...
- Each directory holds up to 100 files
- Directory names follow the pattern AA, AB, …, AZ, BA, …, ZZ
- When compress is enabled, files are named wiki_00.bz2, etc.
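The naming scheme above amounts to base-26 arithmetic on the file index: file n lives in directory n / 100, rendered as two letters. A sketch (illustrative; wicket's actual splitter also tracks file sizes for rotation):

```rust
// Map a running file index to its AA/wiki_NN path component:
// 100 files per directory, directory index rendered base-26 as two letters.
fn output_name(file_index: u32) -> String {
    let dir = file_index / 100;
    let first = (b'A' + (dir / 26) as u8) as char;
    let second = (b'A' + (dir % 26) as u8) as char;
    format!("{}{}/wiki_{:02}", first, second, file_index % 100)
}

fn main() {
    assert_eq!(output_name(0), "AA/wiki_00");
    assert_eq!(output_name(99), "AA/wiki_99");
    assert_eq!(output_name(100), "AB/wiki_00");
    assert_eq!(output_name(2600), "BA/wiki_00");
    println!("{}", output_name(101));
}
```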
Special Modes
- stdout mode – When path is "-", all output is written to stdout without splitting
- Zero size – When max_file_size is 0, each article is written to its own file
error
The error module defines the error types used throughout the wicket library.
Error Type
The Error enum is derived using thiserror and covers all error conditions:
| Variant | Source | Description |
|---|---|---|
| Io | std::io::Error | File I/O errors |
| XmlReader | quick_xml::Error | XML parsing errors |
| JsonSerialization | serde_json::Error | JSON serialization errors |
Result Type
The library provides a Result type alias:
pub type Result<T> = std::result::Result<T, Error>;
All public functions in the library return this Result type.
CLI Reference Overview
The wicket CLI extracts plain text from Wikipedia XML dump files.
Usage
wicket [OPTIONS] <INPUT>
Quick Reference
| Option | Description | Default |
|---|---|---|
| <INPUT> | Input Wikipedia XML dump file (.xml or .xml.bz2) | (required) |
| -o, --output | Output directory, or - for stdout | text |
| -b, --bytes | Maximum bytes per output file (e.g., 1M, 500K, 1G) | 1M |
| -c, --compress | Compress output files using bzip2 | false |
| --json | Write output in JSON format | false |
| --processes | Number of parallel workers | CPU count |
| -q, --quiet | Suppress progress output on stderr | false |
| --namespaces | Comma-separated namespace IDs to extract | 0 |
Detailed Documentation
CLI Options
Input
wicket <INPUT>
The input file is a positional argument. It must be a Wikipedia XML dump file, either uncompressed (.xml) or bzip2-compressed (.xml.bz2). Compression is detected automatically by file extension.
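The extension check can be sketched with the standard library (illustrative only; wicket's actual detection lives in open_dump):

```rust
use std::path::Path;

// Treat the input as bzip2-compressed when the file name ends in .bz2,
// otherwise as raw XML, mirroring the auto-detection rule described above.
fn is_bz2(path: &Path) -> bool {
    path.extension().and_then(|e| e.to_str()) == Some("bz2")
}

fn main() {
    assert!(is_bz2(Path::new("simplewiki-latest-pages-articles.xml.bz2")));
    assert!(!is_bz2(Path::new("dump.xml")));
    println!("detection ok");
}
```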
Output Directory
wicket dump.xml.bz2 -o output/
wicket dump.xml.bz2 -o -
-o, --output <PATH> – Specifies the output directory. Defaults to text.
- When set to a directory path, output files are created using the two-letter directory naming convention (AA/wiki_00, etc.)
- When set to -, all output is written to stdout without file splitting
File Size
wicket dump.xml.bz2 -b 500K
wicket dump.xml.bz2 -b 1M
wicket dump.xml.bz2 -b 1G
wicket dump.xml.bz2 -b 0
-b, --bytes <SIZE> – Maximum bytes per output file. Defaults to 1M.
Supported suffixes: K, M, and G, all multiples of 1024 (so 1M = 1,048,576 bytes). When set to 0, each article is written to its own file.
Compression
wicket dump.xml.bz2 -c
-c, --compress – Compress output files using bzip2. Output files will have a .bz2 extension.
JSON Output
wicket dump.xml.bz2 --json
--json – Write output in JSON Lines format (one JSON object per line) instead of the default doc format.
Parallel Workers
wicket dump.xml.bz2 --processes 8
--processes <N> – Number of parallel workers for text cleaning. Defaults to the number of CPU cores.
Quiet Mode
wicket dump.xml.bz2 -q
-q, --quiet – Suppress progress output on stderr. Useful when piping output to another command.
Namespace Filtering
wicket dump.xml.bz2 --namespaces 0
wicket dump.xml.bz2 --namespaces 0,1,2
--namespaces <IDS> – Comma-separated list of namespace IDs to extract. Defaults to 0 (main articles only).
Common namespace IDs:
| ID | Namespace |
|---|---|
| 0 | Main (articles) |
| 1 | Talk |
| 2 | User |
| 3 | User talk |
| 4 | Wikipedia |
| 6 | File |
| 10 | Template |
| 14 | Category |
CLI Examples
Basic Extraction
Extract text from a Wikipedia dump into the default text/ directory:
wicket simplewiki-latest-pages-articles.xml.bz2
Custom Output Directory
wicket dump.xml.bz2 -o output/
Write to stdout
Pipe output directly to another command:
wicket dump.xml.bz2 -o - -q | wc -l
JSON Output with Compression
wicket dump.xml.bz2 -o output/ --json -c
Extract Talk Pages
Extract namespace 1 (talk pages) with 8 workers:
wicket dump.xml.bz2 -o output/ --namespaces 1 --processes 8
Multiple Namespaces
Extract main articles and user pages:
wicket dump.xml.bz2 -o output/ --namespaces 0,2
Small Output Files
Split output into 500 KB files:
wicket dump.xml.bz2 -o output/ -b 500K
One Article per File
wicket dump.xml.bz2 -o output/ -b 0
Output Directory Structure
After extraction, the output directory looks like:
output/
AA/
wiki_00
wiki_01
...
wiki_99
AB/
wiki_00
...
With --compress:
output/
AA/
wiki_00.bz2
wiki_01.bz2
...
License
wicket is dual-licensed: you may use it under either the MIT License or the Apache License 2.0, at your option.
MIT License
MIT License
Copyright (c) 2025 Minoru OSUKA
Apache License 2.0
Apache License, Version 2.0
Copyright (c) 2025 Minoru OSUKA
Full License Text
The complete license text is available in the LICENSE file in the repository.