Self-Attention Mechanism - Part 1

 Understanding the Self-Attention Mechanism


How can we create the query, key, and value matrices? To create these, we introduce three new weight matrices called WQ, WK, WV. We create the query Q, key K, and value V matrices by multiplying the input matrix X by WQ, WK, and WV, respectively.
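Concretely, this amounts to three matrix multiplications. The following is a minimal NumPy sketch of that step; the embedding dimension of 512 is an assumption (chosen to match the original Transformer and not stated in this section), while the vector dimension of 64 is the value used later in this section:

```python
import numpy as np

# Assumptions (not from the text): embedding dimension d_model = 512.
# The query/key/value dimension d_k = 64 is stated later in this section.
d_model, d_k = 512, 64
seq_len = 3  # the sentence 'I am good' has three words

rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # input matrix X: [3 x 512]

# Randomly initialised weight matrices; their optimal values are learned during training.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # query matrix Q: [3 x 64]
K = X @ W_K   # key matrix K:   [3 x 64]
V = X @ W_V   # value matrix V: [3 x 64]
```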

The weights WQ, WK, WV are randomly initialised, and their optimal values will be learned during training. As we learn the optimal weights, we will obtain more accurate query, key, and value matrices. 


From the preceding figure, we can understand the following:





  • The first row in the query, key, and value matrices—q1, k1, and v1—holds the query, key, and value vectors of the word ‘I'.
  • The second row in the query, key, and value matrices—q2, k2, and v2—holds the query, key, and value vectors of the word ‘am'.
  • The third row in the query, key, and value matrices—q3, k3, and v3—holds the query, key, and value vectors of the word ‘good' (these rows are picked out in the short continuation of the sketch after this list).
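In terms of the sketch above, these per-word vectors are simply the rows of Q, K, and V:

```python
# Continuing the sketch above: each row holds one word's vectors.
q1, k1, v1 = Q[0], K[0], V[0]   # 'I'
q2, k2, v2 = Q[1], K[1], V[1]   # 'am'
q3, k3, v3 = Q[2], K[2], V[2]   # 'good'
```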

Dimensions of the query, key, and value matrices

Note that the dimensionality of the query, key, and value vectors is 64. Thus, the dimension of our query, key, and value matrices is:

[sentence length × 64]


Since we have three words in the sentence, the dimensions of the query, key, and value matrices are:

[3×64]
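Continuing the sketch above (with the assumed [3 × 512] input matrix), we can confirm these shapes directly:

```python
# Continuing the sketch above: each matrix is [sentence length x 64] = [3 x 64].
assert Q.shape == K.shape == V.shape == (3, 64)
print(Q.shape)   # (3, 64)
```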


Still, the ultimate question remains: why are we computing this? What is the use of the query, key, and value matrices? How are they going to help us? This is exactly what we will discuss in detail in the next section.

How the self-attention mechanism works

We learned how to compute the query Q, key K, and value V matrices, and we also learned that they are obtained from the input matrix X. Now, let's see how the query, key, and value matrices are used in the self-attention mechanism.

Overview of all the steps so far

We learned that in order to compute a representation of a word, the self-attention mechanism relates the word to all the words in the given sentence. Consider the sentence 'I am good'. To compute the representation of the word 'I', we relate the word 'I' to all the words in the sentence, as shown in the following figure: 



But why do we need to do this? Understanding how a word is related to all the words in the sentence helps us learn a better representation. Now, let's learn how the self-attention mechanism relates a word to all the words in the sentence using the query, key, and value matrices. The self-attention mechanism includes four steps. Let's take a look at them one by one.
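As a brief preview of the first of those steps (a sketch continuing from above, not the full walkthrough), relating the word 'I' to every word in the sentence boils down to comparing its query vector q1 with each key vector:

```python
# Preview sketch, continuing from above: compare the query of 'I' with every key.
scores_for_I = K @ q1        # [q1.k1, q1.k2, q1.k3] -> one score per word
print(scores_for_I.shape)    # (3,)
```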
