
SentencePiece and Byte Pair Encoding

Introduction to Tokenization

To process text in neural network models, we first need to encode the text as numbers, i.e. token ids (which then index into the embedding vectors we've been using in the previous assignments), since tensor operations act on numbers. Then, if the output of the network is words, we need to decode the predicted token ids back into text.
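As a minimal sketch of this round trip (with a made-up toy vocabulary, not the assignment's actual tokenizer), encoding maps tokens to ids and decoding maps ids back to tokens:

    # Toy vocabulary: token -> id, plus the reverse mapping for decoding.
    vocab = {"<unk>": 0, "tokens": 1, "are": 2, "tricky": 3, ".": 4}
    inv_vocab = {i: tok for tok, i in vocab.items()}

    def encode(tokens):
        # Unknown tokens fall back to the <unk> id.
        return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

    def decode(ids):
        return [inv_vocab[i] for i in ids]

    ids = encode(["tokens", "are", "tricky", "."])   # [1, 2, 3, 4]
    print(decode(ids))                               # ['tokens', 'are', 'tricky', '.']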

To encode text, the first decision to make is the level of granularity at which to consider the text, because features will ultimately be created from these tokens. Many different experiments have been carried out using words, morphological units, phonemic units, and characters. For example,

  • Tokens are tricky. (raw text)
  • Tokens are tricky . (words)
  • Token s _ are _ trick _ y . (morphemes)
  • t oʊ k ə n z _ ɑː _ ˈt r ɪ k i. (phonemes, for STT)
  • T o k e n s _ a r e _ t r i c k y . (character)
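Two of the granularities above are easy to reproduce in code. Here is a rough sketch (a naive whitespace split and a character split; real tokenizers handle punctuation and Unicode far more carefully):

    text = "Tokens are tricky."

    # Word-level: naive whitespace split (punctuation stays attached to the last word).
    words = text.split()          # ['Tokens', 'are', 'tricky.']

    # Character-level: every character, including spaces, becomes its own token.
    chars = list(text)            # ['T', 'o', 'k', 'e', 'n', 's', ' ', ...]

    print(words)
    print(chars)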

But how we identify these units, such as words, is largely determined by the language they come from. For example, many European languages use a space to separate words, while some Asian languages have no spaces between words. Compare English and Mandarin.

  • Tokens are tricky. (original sentence)
  • 令牌很棘手 (Mandarin)
  • Lìng pái hěn jí shǒu (pinyin)
  • 令牌 很 棘手 (Mandarin with spaces)
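To see why whitespace splitting does not carry over, here is a quick sketch: a naive str.split() separates the English sentence into words but returns the Mandarin sentence as a single chunk, because it contains no spaces to split on.

    english = "Tokens are tricky."
    mandarin = "令牌很棘手"

    print(english.split())    # ['Tokens', 'are', 'tricky.']
    print(mandarin.split())   # ['令牌很棘手'] -- no spaces, so no word boundaries are found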

So the ability to tokenize, i.e. to split text into meaningful fundamental units, is not always straightforward.

There are also practical questions about how large our vocabulary of words, vocab_size, should be, weighing memory limitations against coverage. A compromise needs to be made between the finest-grained models that work on characters, which can be memory efficient, and more computationally efficient subword units such as n-grams or larger units.

In SentencePiece, Unicode characters are grouped together using either a unigram language model (used in this week's assignment) or BPE, byte-pair encoding. We will discuss BPE, since BERT and many of its variants use a modified version of BPE, and its pseudocode is easy to implement and understand... hopefully!
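As a preview of that pseudocode, here is a minimal sketch of the classic BPE merge loop in the style of Sennrich et al. (this is not SentencePiece's actual implementation, and the toy vocabulary below is just for illustration): words are written as space-separated symbols with an end-of-word marker, and on each iteration the most frequent adjacent symbol pair is merged into a new symbol.

    import collections
    import re

    def get_stats(vocab):
        """Count frequencies of adjacent symbol pairs across the vocabulary."""
        pairs = collections.Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[symbols[i], symbols[i + 1]] += freq
        return pairs

    def merge_vocab(pair, vocab):
        """Merge every occurrence of the given symbol pair into a single symbol."""
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus counts; </w> marks the end of a word.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    for _ in range(10):                      # the number of merges is a hyperparameter
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)     # most frequent adjacent pair
        vocab = merge_vocab(best, vocab)
        print(best)

Each printed pair becomes a new entry in the subword vocabulary, so frequent substrings like "est" emerge after a few merges.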

SentencePiece Preprocessing

NFKC Normalization