# 8.4. Recurrent Neural Networks¶

In Section 8.3 we introduced \(n\)-gram models, where the conditional probability of word \(x_t\) at time step \(t\) only depends on the \(n-1\) previous words. If we want to incorporate the possible effect of words earlier than time step \(t-(n-1)\) on \(x_t\), we need to increase \(n\). However, the number of model parameters would also increase exponentially with it, as we need to store \(|\mathcal{V}|^n\) numbers for a vocabulary set \(\mathcal{V}\). Hence, rather than modeling \(P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})\) it is preferable to use a latent variable model:

where \(h_{t-1}\) is a *hidden state* (also known as a hidden
variable) that stores the sequence information up to time step
\(t-1\). In general, the hidden state at any time step \(t\)
could be computed based on both the current input \(x_{t}\) and the
previous hidden state \(h_{t-1}\):

For a sufficiently powerful function \(f\) in (8.4.2), the latent variable model is not an approximation. After all, \(h_t\) may simply store all the data it has observed so far. However, it could potentially make both computation and storage expensive.

Recall that we have discussed hidden layers with hidden units in
Section 4. It is noteworthy that hidden layers and
hidden states refer to two very different concepts. Hidden layers are,
as explained, layers that are hidden from view on the path from input to
output. Hidden states are technically speaking *inputs* to whatever we
do at a given step, and they can only be computed by looking at data at
previous time steps.

*Recurrent neural networks* (RNNs) are neural networks with hidden
states. Before introducing the RNN model, we first revisit the MLP model
introduced in Section 4.1.

## 8.4.3. RNN-based Character-Level Language Models¶

Recall that for language modeling in Section 8.3, we
aim to predict the next token based on the current and past tokens, thus
we shift the original sequence by one token as the labels. Bengio et
al. first proposed to use a neural network for language modeling
[Bengio et al., 2003]. In the following we
illustrate how RNNs can be used to build a language model. Let the
minibatch size be one, and the sequence of the text be “machine”. To
simplify training in subsequent sections, we tokenize text into
characters rather than words and consider a *character-level language
model*. Fig. 8.4.2 demonstrates how to predict the next
character based on the current and previous characters via an RNN for
character-level language modeling.

During the training process, we run a softmax operation on the output from the output layer for each time step, and then use the cross-entropy loss to compute the error between the model output and the label. Due to the recurrent computation of the hidden state in the hidden layer, the output of time step 3 in Fig. 8.4.2, \(\mathbf{O}_3\), is determined by the text sequence “m”, “a”, and “c”. Since the next character of the sequence in the training data is “h”, the loss of time step 3 will depend on the probability distribution of the next character generated based on the feature sequence “m”, “a”, “c” and the label “h” of this time step.

In practice, each token is represented by a \(d\)-dimensional vector, and we use a batch size \(n>1\). Therefore, the input \(\mathbf X_t\) at time step \(t\) will be a \(n\times d\) matrix, which is identical to what we discussed in Section 8.4.2.

## 8.4.4. Perplexity¶

Last, let us discuss about how to measure the language model quality, which will be used to evaluate our RNN-based models in the subsequent sections. One way is to check how surprising the text is. A good language model is able to predict with high-accuracy tokens that what we will see next. Consider the following continuations of the phrase “It is raining”, as proposed by different language models:

“It is raining outside”

“It is raining banana tree”

“It is raining piouw;kcj pwepoiut”

In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite accurately reflect which word follows semantically (“in San Francisco” and “in winter” would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse by producing a nonsensical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Last, example 3 indicates a poorly trained model that does not fit data properly.

We might measure the quality of the model by computing the likelihood of
the sequence. Unfortunately this is a number that is hard to understand
and difficult to compare. After all, shorter sequences are much more
likely to occur than the longer ones, hence evaluating the model on
Tolstoy’s magnum opus *War and Peace* will inevitably produce a much
smaller likelihood than, say, on Saint-Exupery’s novella *The Little
Prince*. What is missing is the equivalent of an average.

Information theory comes handy here. We have defined entropy, surprisal, and cross-entropy when we introduced the softmax regression (Section 3.4.7) and more of information theory is discussed in the online appendix on information theory. If we want to compress text, we can ask about predicting the next token given the current set of tokens. A better language model should allow us to predict the next token more accurately. Thus, it should allow us to spend fewer bits in compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the \(n\) tokens of a sequence:

where \(P\) is given by a language model and \(x_t\) is the
actual token observed at time step \(t\) from the sequence. This
makes the performance on documents of different lengths comparable. For
historical reasons, scientists in natural language processing prefer to
use a quantity called *perplexity*. In a nutshell, it is the exponential
of (8.4.7):

Perplexity can be best understood as the harmonic mean of the number of real choices that we have when deciding which token to pick next. Let us look at a number of cases:

In the best case scenario, the model always perfectly estimates the probability of the label token as 1. In this case the perplexity of the model is 1.

In the worst case scenario, the model always predicts the probability of the label token as 0. In this situation, the perplexity is positive infinity.

At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful model must beat.

In the following sections, we will implement RNNs for character-level language models and use perplexity to evaluate such models.

## 8.4.5. Summary¶

A neural network that uses recurrent computation for hidden states is called a recurrent neural network (RNN).

The hidden state of an RNN can capture historical information of the sequence up to the current time step.

The number of RNN model parameters does not grow as the number of time steps increases.

We can create character-level language models using an RNN.

We can use perplexity to evaluate the quality of language models.

## 8.4.6. Exercises¶

If we use an RNN to predict the next character in a text sequence, what is the required dimension for any output?

Why can RNNs express the conditional probability of a token at some time step based on all the previous tokens in the text sequence?

What happens to the gradient if you backpropagate through a long sequence?

What are some of the problems associated with the language model described in this section?