Overview

In the first blog post of this series, we saw how to represent words or tokens using Word2Vec. However, Word2Vec gives each word a single fixed vector, and we also want representations that capture the context and meaning of a word within different sentences.

The same word can mean something different from one sentence to the next. How can we capture that? Architectures like RNNs and LSTMs address this by carrying information about previous words forward in hidden state vectors. However, they have limitations: vanishing gradients make long-range dependencies hard to learn, and they are computationally inefficient because the hidden states of a sequence must be computed one after another.
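To make the sequential bottleneck concrete, here is a minimal sketch of a vanilla RNN forward pass in NumPy. The function name `rnn_hidden_states`, the weight names, the dimensions, and the random inputs are all assumptions chosen for illustration; this is not a trained model or any particular library's API. The point is only that each step needs the hidden state from the step before it.

```python
import numpy as np

def rnn_hidden_states(inputs, W_xh, W_hh, b_h):
    """Compute vanilla RNN hidden states one time step at a time."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    states = []
    for x_t in inputs:
        # h depends on the previous h, so this loop cannot be
        # parallelized across time steps.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 6, 4, 3
inputs = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_hidden, d_in))
W_hh = rng.normal(size=(d_hidden, d_hidden))
b_h = np.zeros(d_hidden)

states = rnn_hidden_states(inputs, W_xh, W_hh, b_h)
print(states.shape)  # (6, 3)
```

Information about the first token reaches the last hidden state only after passing through every intermediate step, which is where both the inefficiency and the fading of early information come from.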

More generally, attention mechanisms are used to overcome the problem of models struggling to remember what they have already seen, such as earlier parts of a sentence. Attention introduces a direct connection between the current prediction and the relevant past information.

Self-attention goes a step further by allowing the model to relate each token to all other tokens in the sequence at once, moving away from strictly sequential processing of text.

Given a sentence such as “A cute teddy bear is reading”, computing the representation of the phrase “teddy bear” means looking at all other tokens in the sequence simultaneously, through direct connections. The representation of “teddy bear” therefore depends on the full surrounding context.

The key concept behind self-attention is the set of Query (Q), Key (K), and Value (V) matrices.
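As a rough preview, the sketch below shows one common way Q, K, and V are used: scaled dot-product self-attention, as in the Transformer architecture. The post has not yet defined the exact formulation, so treat this as an assumption. The embeddings and the projection matrices `W_q`, `W_k`, `W_v` are random placeholders rather than learned values, and the sizes `d_model` and `d_k` are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X."""
    Q = X @ W_q  # what each token is looking for
    K = X @ W_k  # what each token offers for matching
    V = X @ W_v  # the content each token contributes
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scored against every token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

tokens = ["A", "cute", "teddy", "bear", "is", "reading"]

# Random stand-ins for token embeddings and learned projections.
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(len(tokens), d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)   # (6, 4)
print(weights.shape)  # (6, 6)
```

Each row of `weights` says how much one token attends to every token in the sentence, and all rows are computed in a single matrix product rather than step by step.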

more to come!