Understanding the Self-Attention Mechanism
How do we create the query, key, and value matrices? To create them, we introduce three new weight matrices, WQ, WK, and WV. We obtain the query matrix Q, key matrix K, and value matrix V by multiplying the input matrix X by WQ, WK, and WV, respectively.
The weight matrices WQ, WK, and WV are randomly initialised, and their optimal values are learned during training. As the weights improve, we obtain more accurate query, key, and value matrices.
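To make this concrete, here is a minimal NumPy sketch (not code from the text; the tiny dimensions and random values are placeholders) showing how Q, K, and V are obtained by multiplying X with the three weight matrices:

```python
import numpy as np

# Toy dimensions for the sentence 'I am good': 3 words,
# an assumed embedding size of 4, and a q/k/v size of 3.
np.random.seed(0)

X = np.random.randn(3, 4)     # input matrix: one embedding row per word

# Weight matrices are randomly initialised; in a real model
# their values are learned during training.
W_Q = np.random.randn(4, 3)
W_K = np.random.randn(4, 3)
W_V = np.random.randn(4, 3)

Q = X @ W_Q    # query matrix: one query vector per word
K = X @ W_K    # key matrix: one key vector per word
V = X @ W_V    # value matrix: one value vector per word

print(Q.shape, K.shape, V.shape)   # (3, 3) each
```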
From the preceding figure, we can understand the following:
- The first row in the query, key, and value matrices—q1, k1, and v1—holds the query, key, and value vectors of the word 'I'.
- The second row in the query, key, and value matrices—q2, k2, and v2—holds the query, key, and value vectors of the word 'am'.
- The third row in the query, key, and value matrices—q3, k3, and v3—holds the query, key, and value vectors of the word 'good'.
Dimensions of the query, key, and value matrices
Note that the dimensionality of the query, key, and value vectors is 64. Thus, the dimensions of our query, key, and value matrices are:

[sentence length × 64]

Since we have three words in the sentence, the dimensions of the query, key, and value matrices are:

[3 × 64]
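As a quick shape check, the sketch below assumes an embedding size of 512 (the value used in the original Transformer; it is not stated in this excerpt) and projects the three word embeddings to 64-dimensional query vectors:

```python
import numpy as np

# 3 words, assumed embedding size 512, q/k/v size 64.
X = np.random.randn(3, 512)       # [sentence length x embedding size]
W_Q = np.random.randn(512, 64)    # projects each embedding to 64 dimensions

Q = X @ W_Q
print(Q.shape)                    # (3, 64) -> [3 x 64], as stated above
```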
But the question remains: why are we computing these? What is the use of the query, key, and value matrices? How do they help us? This is exactly what we will discuss in detail in the next section.
How the self-attention mechanism works
We learned how to compute the query Q, key K, and value V matrices, and we also learned that they are obtained from the input matrix X. Now, let's see how the query, key, and value matrices are used in the self-attention mechanism.
Overview of all the steps so far
We learned that in order to compute a representation of a word, the self-attention mechanism relates the word to all the words in the given sentence. Consider the sentence 'I am good'. To compute the representation of the word 'I', we relate the word 'I' to all the words in the sentence, as shown in the following figure:
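As a preview of what the following steps compute, here is a minimal sketch (assuming the standard dot-product formulation, with random placeholder vectors) of how the query vector of 'I' is compared against the key vectors of every word in the sentence:

```python
import numpy as np

np.random.seed(0)
Q = np.random.randn(3, 64)   # query vectors for 'I', 'am', 'good'
K = np.random.randn(3, 64)   # key vectors for 'I', 'am', 'good'

q1 = Q[0]                    # query vector of the word 'I'
scores = K @ q1              # one score per word: how strongly 'I' relates to each word
print(scores)                # [score with 'I', score with 'am', score with 'good']
```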