Training Guide

This guide explains how to train your own word segmentation and part-of-speech (POS) tagging models with Litsea.

Both workflows use Universal Dependencies (UD) Treebanks as their data source.

Word Segmentation (AdaBoost)

Download a UD Treebank and prepare the corpus: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt
Extract features from the corpus
Train the model with AdaBoost
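
The three steps above can be chained in one script. The following is a minimal sketch rather than the project's official workflow. The download and corpus commands are taken from the first step; the `litsea extract` and `litsea train` invocations are assumptions modeled on the POS commands in this guide with `--pos` and `--num-epochs` dropped, and `features.txt`, `model.txt`, and the placeholder `.conllu` path are hypothetical names. The script only prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Either echo a command (dry run) or execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

segmentation_pipeline() {
  local conllu_file
  if [ "$DRY_RUN" = "1" ]; then
    # Hypothetical path, used only so the dry run has something to display.
    conllu_file="/tmp/PLACEHOLDER.conllu"
    run bash scripts/download_udtreebank.sh -l ja -o /tmp
  else
    # The download script prints the path of the downloaded .conllu file.
    conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
  fi

  # Step 1: convert the treebank into a segmentation corpus.
  run bash scripts/corpus_udtreebank.sh "$conllu_file" corpus.txt

  # Steps 2 and 3. ASSUMPTION: these mirror the POS commands without --pos;
  # check `litsea --help` for the exact options before relying on them.
  run litsea extract -l japanese corpus.txt features.txt
  run litsea train features.txt model.txt
}

segmentation_pipeline
```

Running the script as is lists the four commands in order, which makes it easy to review the file names before starting a real training run.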

Part-of-Speech Tagging (Averaged Perceptron)

Download a UD Treebank and prepare a POS-tagged corpus: conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp) && bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
Extract POS features from that corpus: litsea extract --pos -l japanese pos_corpus.txt features.txt
Train the POS model: litsea train --pos --num-epochs 10 features.txt model.txt
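
Because each step consumes the file written by the previous one, it can help to fail fast when an intermediate file is missing or empty. The sketch below wraps the three POS commands in a function with such a check between steps. The `require_nonempty` helper and the `train_pos_model` function are hypothetical names introduced here, and the sketch assumes the extract step should read the `pos_corpus.txt` file produced by the first step.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fail with a clear message if an expected file is missing or empty.
require_nonempty() {
  local path="$1"
  if [ ! -s "$path" ]; then
    echo "error: $path is missing or empty" >&2
    return 1
  fi
}

train_pos_model() {
  local conllu_file

  # Step 1: download the treebank and build the POS-tagged corpus.
  conllu_file=$(bash scripts/download_udtreebank.sh -l ja -o /tmp)
  bash scripts/corpus_udtreebank.sh -p "$conllu_file" pos_corpus.txt
  require_nonempty pos_corpus.txt

  # Step 2: extract POS features (ASSUMPTION: input is pos_corpus.txt).
  litsea extract --pos -l japanese pos_corpus.txt features.txt
  require_nonempty features.txt

  # Step 3: train the averaged perceptron model for 10 epochs.
  litsea train --pos --num-epochs 10 features.txt model.txt
}

# Run the pipeline only when explicitly requested: ./train_pos.sh run
if [ "${1:-}" = "run" ]; then
  train_pos_model
fi
```

Saved under a name such as `train_pos.sh` (also hypothetical) and invoked with the `run` argument from the repository root, the script stops at the first step that produces no output instead of passing an empty file to the next command.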

Other Topics

Model evaluation – assessing model quality
Model retraining – fine-tuning an existing model