Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Introduction

wicket is a high-performance tool that extracts plain text from Wikipedia XML dump files, offering fast processing through parallel execution and efficient streaming.

Key Features

  • Streaming XML parsing – handles multi-gigabyte dumps without loading them into memory
  • Parallel text extraction – uses multiple CPU cores via rayon
  • Automatic bzip2 decompression – transparently handles .xml.bz2 dump files
  • Dual output formats – both doc format and JSON format
  • File splitting – configurable maximum size per output file
  • Namespace filtering – extract only specific page types (main articles, talk pages, etc.)

Output Formats

Doc Format (default)

<doc id="1" url="https://en.wikipedia.org/wiki/April" title="April">
April is the fourth month of the year...
</doc>

JSON Format

{"id":"1","url":"https://en.wikipedia.org/wiki/April","title":"April","text":"April is the fourth month of the year..."}

Current Version

wicket v0.1.0 – Rust Edition 2024, minimum Rust version 1.85.