Library API Overview

The wicket crate provides a Rust API for extracting plain text from Wikipedia XML dump files.

Installation

[dependencies]
wicket = "0.1.0"

Module Map

| Module | Primary Types | Purpose |
|---|---|---|
| wicket::dump | Article, DumpReader, open_dump() | Streaming XML dump parsing |
| wicket::cleaner | clean_wikitext() | Wikitext to plain text conversion |
| wicket::extractor | OutputFormat, format_page(), make_url() | Output formatting |
| wicket::output | OutputConfig, OutputSplitter | File splitting and rotation |
| wicket::error | Error | Error type definitions |

Quick Example

use wicket::{open_dump, clean_wikitext, format_page, make_url, OutputFormat};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = open_dump("dump.xml.bz2".as_ref(), &[0])?;
    let url_base = reader.url_base().to_string();

    for result in reader {
        let article = result?;
        let text = clean_wikitext(&article.text);
        let url = make_url(&url_base, &article.title);
        let output = format_page(
            article.id, &article.title, &url, &text, OutputFormat::Doc,
        );
        print!("{}", output);
    }

    Ok(())
}
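
The loop above stops at the first failed record, because `result?` returns the error from `main`. If skipping bad records is preferable, the same iterator-of-`Result` shape can be consumed leniently. The sketch below is self-contained and does not depend on wicket: the `Article` struct is a stand-in whose fields mirror those used in the Quick Example (the `u64` type for `id` is an assumption), and `for_each_lenient` is a hypothetical helper, not part of the crate. Whether `DumpReader` can keep yielding records after an error is not stated here, so treat this as a pattern rather than documented behavior.

```rust
use std::fmt::Display;

// Stand-in for the article type. Field names follow the Quick Example;
// the field types are assumed and may differ from wicket's actual Article.
#[derive(Debug, Clone, PartialEq)]
struct Article {
    id: u64,
    title: String,
    text: String,
}

/// Calls `on_article` for every successful item and returns the number of
/// items that failed, instead of aborting on the first error.
fn for_each_lenient<I, E, F>(results: I, mut on_article: F) -> usize
where
    I: IntoIterator<Item = Result<Article, E>>,
    E: Display,
    F: FnMut(Article),
{
    let mut failed = 0;
    for result in results {
        match result {
            Ok(article) => on_article(article),
            Err(err) => {
                // Report the problem on stderr and move on to the next record.
                eprintln!("skipping record: {err}");
                failed += 1;
            }
        }
    }
    failed
}

fn main() {
    // A small in-memory stream standing in for a dump reader.
    let stream: Vec<Result<Article, String>> = vec![
        Ok(Article { id: 1, title: "Rust".into(), text: "first".into() }),
        Err("malformed page".to_string()),
        Ok(Article { id: 2, title: "XML".into(), text: "second".into() }),
    ];

    let failed = for_each_lenient(stream, |article| {
        println!("{} {} ({} bytes)", article.id, article.title, article.text.len());
    });
    eprintln!("skipped {failed} record(s)");
}
```

Passing a closure rather than collecting into a `Vec` keeps the processing streaming, which matters for inputs the size of a full Wikipedia dump.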

API Documentation

Full API documentation is available on docs.rs/wicket.