1. Tokenizing

Reading time: 3 minutes

Tokenizing

Tokenizing is the process of breaking down data, such as text, into smaller, manageable pieces called tokens. Each token is then assigned a unique numerical identifier (ID). This is a fundamental step in preparing text for processing by machine learning models, especially in natural language processing (NLP).

tip

The goal of this initial phase is very simple: Divide the input in tokens (ids) in some way that makes sense.

How Tokenizing Works

  1. Splitting the Text:
    • Basic Tokenizer: A simple tokenizer might split text into individual words and punctuation marks, removing spaces.
      • Example:
        Text: "Hello, world!"
        Tokens: ["Hello", ",", "world", "!"]
  2. Creating a Vocabulary:
    • To convert tokens into numerical IDs, a vocabulary is created. This vocabulary lists all unique tokens (words and symbols) and assigns each a specific ID.
    • Special Tokens: These are special symbols added to the vocabulary to handle various scenarios:
      • [BOS] (Beginning of Sequence): Indicates the start of a text.
      • [EOS] (End of Sequence): Indicates the end of a text.
      • [PAD] (Padding): Used to make all sequences in a batch the same length.
      • [UNK] (Unknown): Represents tokens that are not in the vocabulary.
    • Example:
      If "Hello" is assigned ID 64, "," is 455, "world" is 78, and "!" is 467, then:
      "Hello, world!"[64, 455, 78, 467]
    • Handling Unknown Words:
      If a word like "Bye" isn't in the vocabulary, it is replaced with [UNK].
      "Bye, world!"["[UNK]", ",", "world", "!"][987, 455, 78, 467]
      (Assuming [UNK] has ID 987)

Advanced Tokenizing Methods

While the basic tokenizer works well for simple texts, it has limitations, especially with large vocabularies and handling new or rare words. Advanced tokenizing methods address these issues by breaking text into smaller subunits or optimizing the tokenization process.

  1. Byte Pair Encoding (BPE):
    • Purpose: Reduces the size of the vocabulary and handles rare or unknown words by breaking them down into frequently occurring byte pairs.
    • How It Works:
      • Starts with individual characters as tokens.
      • Iteratively merges the most frequent pairs of tokens into a single token.
      • Continues until no more frequent pairs can be merged.
    • Benefits:
      • Eliminates the need for an [UNK] token since all words can be represented by combining existing subword tokens.
      • More efficient and flexible vocabulary.
    • Example:
      "playing" might be tokenized as ["play", "ing"] if "play" and "ing" are frequent subwords.
  2. WordPiece:
    • Used By: Models like BERT.
    • Purpose: Similar to BPE, it breaks words into subword units to handle unknown words and reduce vocabulary size.
    • How It Works:
      • Begins with a base vocabulary of individual characters.
      • Iteratively adds the most frequent subword that maximizes the likelihood of the training data.
      • Uses a probabilistic model to decide which subwords to merge.
    • Benefits:
      • Balances between having a manageable vocabulary size and effectively representing words.
      • Efficiently handles rare and compound words.
    • Example:
      "unhappiness" might be tokenized as ["un", "happiness"] or ["un", "happy", "ness"] depending on the vocabulary.
  3. Unigram Language Model:
    • Used By: Models like SentencePiece.
    • Purpose: Uses a probabilistic model to determine the most likely set of subword tokens.
    • How It Works:
      • Starts with a large set of potential tokens.
      • Iteratively removes tokens that least improve the model's probability of the training data.
      • Finalizes a vocabulary where each word is represented by the most probable subword units.
    • Benefits:
      • Flexible and can model language more naturally.
      • Often results in more efficient and compact tokenizations.
    • Example:
      "internationalization" might be tokenized into smaller, meaningful subwords like ["international", "ization"].

Code Example

Let's understand this better from a code example from https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb:

python
# Download a text to pre-train the model
import urllib.request
url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Tokenize the code using GPT2 tokenizer version
import tiktoken
token_ids = tiktoken.get_encoding("gpt2").encode(txt, allowed_special={"[EOS]"}) # Allow the user of the tag "[EOS]"

# Print first 50 tokens
print(token_ids[:50])
#[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]

References