Introduction

wicket extracts plain text from Wikipedia XML dump files. It streams the dump instead of loading it into memory and runs text extraction in parallel across CPU cores.

Key Features

Streaming XML parsing – handles multi-gigabyte dumps without loading them into memory
Parallel text extraction – uses multiple CPU cores via rayon
Automatic bzip2 decompression – transparently handles .xml.bz2 dump files
Two output formats – doc format (default) and JSON
File splitting – configurable maximum size per output file
Namespace filtering – extract only specific page types (main articles, talk pages, etc.)
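
To make the last two features concrete, here is a minimal sketch of namespace filtering followed by size-based splitting. It is not wicket's implementation: the `Page` struct and the `filter_namespaces` and `split_by_size` functions are hypothetical names chosen for illustration, and the choice to keep an oversized document whole is an assumption. The namespace numbers follow MediaWiki's convention, where 0 is the main article namespace and 1 is Talk.

```rust
/// Hypothetical page record (not wicket's actual type).
/// `ns` is the MediaWiki namespace number: 0 = main articles, 1 = Talk.
struct Page {
    ns: i32,
    title: String,
}

/// Keep only pages whose namespace number appears in `allowed`.
fn filter_namespaces(pages: Vec<Page>, allowed: &[i32]) -> Vec<Page> {
    pages
        .into_iter()
        .filter(|p| allowed.contains(&p.ns))
        .collect()
}

/// Group rendered documents into chunks of at most `max_bytes` each.
/// Assumption: a document larger than `max_bytes` gets a chunk of its
/// own instead of being cut in the middle.
fn split_by_size(docs: &[String], max_bytes: usize) -> Vec<Vec<&str>> {
    let mut chunks: Vec<Vec<&str>> = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    let mut current_size = 0usize;

    for doc in docs {
        // Start a new chunk if adding this document would exceed the limit.
        if !current.is_empty() && current_size + doc.len() > max_bytes {
            chunks.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current.push(doc.as_str());
        current_size += doc.len();
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Each inner vector would correspond to one output file. A real tool would more likely write to disk as it goes and rotate files when the limit is reached, so the full set of documents never has to sit in memory.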

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>
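
The sample above could be produced by something like the following sketch. The `render_doc` and `escape_attr` functions are hypothetical and not part of wicket's API; escaping the attribute values is an assumption made here so that titles containing quotes or ampersands would not break the tag.

```rust
/// Escape characters that would break a double-quoted attribute value.
/// Assumption: the source does not say whether wicket escapes attributes.
fn escape_attr(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Render one page in doc format: an opening tag carrying the metadata,
/// the plain text on its own lines, then the closing tag.
fn render_doc(id: u64, url: &str, title: &str, text: &str) -> String {
    format!(
        "<doc id=\"{}\" url=\"{}\" title=\"{}\">\n{}\n</doc>\n",
        id,
        escape_attr(url),
        escape_attr(title),
        text
    )
}
```

Calling `render_doc(1, "https://en.wikipedia.org/wiki/April", "April", "April is the fourth month of the year...")` yields the three lines shown in the sample.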

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}

Current Version

wicket v0.1.0 – Rust Edition 2024, minimum Rust version 1.85.
