Language Models

Language models are mathematical structures that map a sequence of tokens to a probability. The probability assigned to a sequence depends on how often its tokens are observed together, in that order. This mapping between a sequence of tokens and its associated probability is learned from vast quantities of textual data using a suitable learning rule.

We define a sequence of tokens, $W = (w_1, w_2, \ldots, w_n)$, and the joint probability distribution that the language model learns from empirical data,

$$P(W) = P(w_1, w_2, \ldots, w_n).$$

The language model is also a generative model, as it models the joint probability distribution of the given data. We can decompose the above distribution using the chain rule of probability,

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1}).$$

Consider, for example, a three-token sequence $W = (w_1, w_2, w_3)$; the probability assigned by the language model to this sequence will be computed as

$$P(w_1, w_2, w_3) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2).$$

As we can see, the probability associated with each token is conditioned on the tokens that occur earlier in the sequence. This hints that language models are contextually rich models: they consider the preceding parts of the sequence when assigning a probability to a token.
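
To make the chain-rule factorisation concrete, here is a minimal sketch in Python. The conditional probability table and its values are hypothetical, hand-filled numbers used only to show how the per-token conditionals multiply into a sequence probability; they do not come from a trained model.

```python
# Toy table of P(token | history), keyed by (history tuple, token).
# The numbers are made up purely for illustration.
TOY_CONDITIONALS = {
    ((), "the"): 0.4,
    (("the",), "cat"): 0.3,
    (("the", "cat"), "sat"): 0.5,
}

def sequence_probability(tokens):
    """Multiply P(w_i | w_1 ... w_{i-1}) over the sequence (chain rule)."""
    prob = 1.0
    for i, token in enumerate(tokens):
        history = tuple(tokens[:i])
        prob *= TOY_CONDITIONALS.get((history, token), 0.0)
    return prob

print(sequence_probability(["the", "cat", "sat"]))  # 0.4 * 0.3 * 0.5 = 0.06
```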

N-grams Model

Instead of considering the entire previous sequence to compute the probability used in the language model above, we consider only the past $N - 1$ tokens from the sequence,

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}).$$

With $N = 2$, we obtain the bigram model, where the probability of a token depends only on its previous token,

$$P(w_1, w_2, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}).$$

The assumption that the probability of a token depends only on the previous token in the sequence is called a Markov assumption.
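
As a sketch of how a bigram model might be estimated from data, the snippet below counts unigrams and bigrams in a toy corpus and computes the relative-frequency estimate $P(w_i \mid w_{i-1}) = C(w_{i-1}, w_i) / C(w_{i-1})$. The tiny corpus and function names are illustrative assumptions, not a reference implementation.

```python
from collections import Counter

# Toy corpus (hypothetical); a real model would be estimated from a large text corpus.
corpus = "the cat sat on the mat the cat slept".split()

unigram_counts = Counter(corpus)                   # C(w)
bigram_counts = Counter(zip(corpus, corpus[1:]))   # C(w_{i-1}, w_i)

def bigram_prob(prev, word):
    """Relative-frequency estimate P(word | prev) = C(prev, word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 2 / 3: "the" occurs 3 times, followed by "cat" twice
```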

Improving the N-grams Model: Using Log Probabilities

As observed in the expression of the language model, the desired probability is a product of many individual probabilities, each of which may be very small. Multiplying many small numbers together makes the product progressively smaller until it ‘underflows’ the computer’s floating-point representation: the product becomes so small that it is stored as zero even though it is not.

Using logarithms solves the problem, converting a huge product expression into a summation over individual log probabilities,

$$\log P(w_1, w_2, \ldots, w_n) \approx \sum_{i=1}^{n} \log P(w_i \mid w_{i-1}).$$
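
Here is a minimal sketch of the same bigram scoring done in log space, rebuilding the toy counts from the sketch above; summing `math.log` values avoids the underflow that a long raw product would cause.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()  # toy corpus (hypothetical)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def log_sequence_probability(tokens):
    """Sum log P(w_i | w_{i-1}) instead of multiplying the raw probabilities,
    so long sequences do not underflow to zero."""
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_counts[(prev, word)] / unigram_counts[prev]
        total += math.log(p)  # assumes p > 0; smoothing (next section) handles zeros
    return total

print(log_sequence_probability("the cat sat".split()))  # log(2/3) + log(1/2) ≈ -1.10
```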

Improving the N-grams Model: Using Smoothing Techniques

Just as we avoided vanishingly small products by introducing log probabilities, we also need to avoid zeroing out the entire product when even one of the individual probabilities is zero. Suppose the probability of a token is calculated as its relative frequency in the corpus,

$$P(w_i) = \frac{C(w_i)}{N},$$

where $C(w_i)$ is the count of token $w_i$ in the corpus and $N$ is the total number of tokens in the corpus. We can add $1$ to the numerator and the vocabulary size $V$ to the denominator, a process known as Laplace smoothing,

$$P_{\text{Laplace}}(w_i) = \frac{C(w_i) + 1}{N + V}.$$
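
A small sketch of add-one (Laplace) smoothing for the unigram estimate above; the corpus is a hypothetical toy example, and $V$ is taken to be the number of distinct token types seen in it.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()  # toy corpus (hypothetical)
counts = Counter(corpus)
N = len(corpus)   # total number of tokens in the corpus
V = len(counts)   # vocabulary size: number of distinct token types

def laplace_unigram_prob(word):
    """Add-one smoothed estimate (C(w) + 1) / (N + V); unseen tokens
    receive a small non-zero probability instead of zero."""
    return (counts[word] + 1) / (N + V)

print(laplace_unigram_prob("the"))  # (3 + 1) / (9 + 6)
print(laplace_unigram_prob("dog"))  # (0 + 1) / (9 + 6) -- unseen, but not zero
```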

Evaluating Language Models: Perplexity

To evaluate how well a language model performs on a test dataset (unobserved data), we can compute the model's perplexity on that data. For a test sequence $W = (w_1, w_2, \ldots, w_n)$, it is defined as

$$\text{PP}(W) = P(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1, w_2, \ldots, w_n)}}.$$

The lower the perplexity on the test sequence, the better the model. Perplexity also represents the uncertainty in a sample being drawn from a given distribution. Thus, a lower perplexity suggests that the test sequence is more likely under the probability distribution constructed by the language model.
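
As a rough sketch, perplexity can be computed in log space from the smoothed unigram estimate used above; the training corpus, test sequence, and function names are illustrative assumptions.

```python
import math
from collections import Counter

train = "the cat sat on the mat the cat slept".split()  # toy training corpus (hypothetical)
counts = Counter(train)
N, V = len(train), len(counts)

def laplace_unigram_prob(word):
    # add-one smoothed unigram estimate, as in the previous sketch
    return (counts[word] + 1) / (N + V)

def perplexity(test_tokens):
    """PP(W) = P(w_1 ... w_n)^(-1/n), evaluated in log space for stability."""
    n = len(test_tokens)
    log_prob = sum(math.log(laplace_unigram_prob(w)) for w in test_tokens)
    return math.exp(-log_prob / n)

print(perplexity("the cat sat on the mat".split()))  # lower perplexity = better fit
```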

