Byte Pair Encoding (BPE) Usage

Introduction

Traditionally, Byte Pair Encoding (BPE) is a text compression algorithm. It has since been adapted into a popular subword segmentation method that addresses the rare-word and out-of-vocabulary (OOV) problem [1]. Because it keeps the vocabulary small, it also reduces computational cost, and it is now widely used in NLP tasks such as machine translation and, in the closely related WordPiece form, in models such as BERT.

In this post, I want to show you the basic usage of the subword-nmt package.

Recently, I found a user-friendly package called fastBPE that offers the same functionality as subword-nmt. It can be downloaded from: https://github.com/glample/fastBPE

Install

  • Install from pip
pip install subword-nmt
  • Install from source

First, download the source from https://github.com/rsennrich/subword-nmt, unzip it, change into the main directory, and run:

python setup.py install
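
After either install route, a quick check can confirm that the package is visible. The snippet below is a minimal sketch: the helper name subword_nmt_status is my own (hypothetical), and it only assumes that the package installs a Python module named subword_nmt and a command named subword-nmt, as used throughout this post.

```python
import importlib.util
import shutil


def subword_nmt_status():
    """Report whether the Python module and the command-line tool are visible."""
    return {
        # True if `import subword_nmt` would find the module.
        "module_importable": importlib.util.find_spec("subword_nmt") is not None,
        # True if the `subword-nmt` command is on the current PATH.
        "cli_on_path": shutil.which("subword-nmt") is not None,
    }


print(subword_nmt_status())
```

If either value is False, the install probably went into a different Python environment than the one currently active.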

learn-bpe

  • Function

    Learn the subword merge operations (the BPE codes) with the byte pair encoding algorithm.

  • Usage

    As an example, we use subword-nmt/tests/data/corpus.en to show the usage of subword-nmt.

    • Command

      subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

subword-nmt learn-bpe < data/corpus.en > output/learn.ape
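
To make clear what learn-bpe computes, here is a toy version of the merge-learning loop described in [1]. It is only a sketch for illustration, not the subword-nmt implementation: the function name learn_bpe, the tiny corpus, and the tie-breaking rule (the lexicographically largest pair wins) are my own assumptions.

```python
from collections import Counter

EOW = "</w>"  # end-of-word marker, kept as its own symbol in this toy version


def learn_bpe(lines, num_operations):
    """Return the learned merges as a list of symbol pairs, in learning order."""
    # Represent every word as a tuple of characters plus the end-of-word marker.
    word_freqs = Counter()
    for line in lines:
        for word in line.split():
            word_freqs[tuple(word) + (EOW,)] += 1

    merges = []
    for _ in range(num_operations):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_freqs = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break
        # Most frequent pair; ties go to the lexicographically largest pair.
        best = max(pair_freqs, key=lambda p: (pair_freqs[p], p))
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the best pair.
        new_word_freqs = Counter()
        for symbols, freq in word_freqs.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_word_freqs[tuple(merged)] += freq
        word_freqs = new_word_freqs
    return merges


corpus = [
    "low low low low low",
    "lower lower",
    "newest newest newest newest newest newest",
    "widest widest widest",
]
merges = learn_bpe(corpus, 3)
for left, right in merges:
    print(left, right)
```

Each printed line is one merge operation. The number of operations plays the role of the -s {num_operations} option in the command above.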

get-vocab

  • Function

    Build a vocabulary from the train_file

  • Command

    subword-nmt get-vocab --input {train_file} --vocab-file {vocab_file}

subword-nmt get-vocab --input data/corpus.en --vocab-file output/en.vocab
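
Conceptually, building the vocabulary amounts to counting whitespace-separated tokens. The sketch below illustrates that idea under the assumption that the vocabulary is a list of tokens with their frequencies; the function name get_vocab and the two sample sentences are made up for the example, and the real tool may order or format its output differently.

```python
from collections import Counter


def get_vocab(lines):
    """Count whitespace-separated tokens; return (token, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by descending count, then alphabetically, so the order is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


vocab = get_vocab(["the cat sat", "the cat ran"])
for token, count in vocab:
    print(token, count)
```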

segment_char_ngrams

This has been deprecated in the current version.

learn_joint_bpe_and_vocab

For convenience, you can accomplish learn-bpe and get-vocab in a single command:

subword-nmt learn-joint-bpe-and-vocab --input data/corpus.en --output output/joint_learn.ape --write-vocabulary output/joint.vocab

Note that joint.vocab is the union vocabulary of words and subwords, so it differs from the vocabulary produced by the command “subword-nmt get-vocab”.
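
To see why the two vocabularies differ, compare token counts on raw text with token counts on text that has already been segmented into subwords. The example below is a sketch with hand-written data: the segmented lines are hypothetical (not real tool output), and "@@" is the marker that subword-nmt attaches to non-final pieces of a word.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


raw_lines = ["lower lowest", "lower"]
# Hypothetical segmentation of the same two lines, written by hand.
segmented_lines = ["low@@ er low@@ est", "low@@ er"]

word_vocab = count_tokens(raw_lines)
subword_vocab = count_tokens(segmented_lines)

print(dict(word_vocab))
print(dict(subword_vocab))
```

The word-level vocabulary contains only whole words, while the second one contains pieces such as low@@ that never appear as tokens in the raw text.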

apply_bpe

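Finally, we need to re-segment the train_file with the learned merge operations, which is done with apply-bpe. The command below passes only the codes file (-c). To restrict the segmentation to the newest vocabulary, namely joint.vocab, apply-bpe also accepts a --vocabulary option.

The command is: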

subword-nmt apply-bpe -c output/joint_learn.ape < data/corpus.en > output_apply.bpe
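
What apply-bpe does to each word can be sketched as replaying the learned merges in order. The code below is a simplified illustration, not the subword-nmt implementation: the function names, the hand-written merge list, and the handling of the end-of-word marker as a separate symbol are assumptions made for this example.

```python
EOW = "</w>"  # end-of-word marker, kept as its own symbol in this toy version


def apply_bpe(word, merges, separator="@@"):
    """Segment one word by replaying the merges in the order they were learned."""
    symbols = list(word) + [EOW]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Remove the end-of-word marker, whether it stands alone or was merged in.
    if symbols[-1] == EOW:
        symbols = symbols[:-1]
    else:
        symbols[-1] = symbols[-1][: -len(EOW)]
    # Every piece except the last one gets the separator appended.
    return " ".join([s + separator for s in symbols[:-1]] + [symbols[-1]])


def segment_line(line, merges):
    """Apply BPE to every whitespace-separated word in a line."""
    return " ".join(apply_bpe(word, merges) for word in line.split())


# Hand-written merges for illustration only.
merges = [("t", EOW), ("s", "t" + EOW), ("e", "st" + EOW), ("l", "o"), ("lo", "w")]
print(segment_line("lowest lower", merges))  # prints: low@@ est low@@ e@@ r
```

Joining the pieces back together and removing every "@@ " restores the original text, which is how the segmentation is undone after translation.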

Reference

[1] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural Machine Translation of Rare Words with Subword Units. ACL 2016.