Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Quick Start

Obtaining a Wikipedia Dump

Download a Wikipedia dump from https://dumps.wikimedia.org/. For testing, the Simple English Wikipedia dump is recommended due to its small size:

wget https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2

CLI Quick Start

Basic Extraction

Extract plain text from a Wikipedia dump:

wicket simplewiki-latest-pages-articles.xml.bz2 -o output/

This reads the dump, extracts plain text from all main namespace articles, and writes the output to the output/ directory in doc format, splitting files at 1 MB.

JSON Output

wicket simplewiki-latest-pages-articles.xml.bz2 -o output/ --json

Write to stdout

wicket simplewiki-latest-pages-articles.xml.bz2 -o - -q | head -50

Library Quick Start

Here is a minimal Rust program that opens a dump and processes articles:

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("simplewiki-latest-pages-articles.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader.take(5) {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        println!("{}", output);
    }

    Ok(())
}

What’s Next