Self-Attention Mechanism - Step 1

Step 1: Calculating the similarity score

The first step in the self-attention mechanism is to compute the dot product between the query matrix, Q, and the transpose of the key matrix, K^T.

The following shows the result of the dot product between the query matrix Q and K^T.
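As a minimal sketch, the computation can be reproduced in NumPy with small arbitrary matrices (the values below are made up for illustration, one row per word of the sentence 'I am good'; they are not taken from a trained model):

```python
import numpy as np

# Arbitrary example query and key matrices for "I am good"
# (one row per word; illustrative values only)
Q = np.array([[3, 1, 0],   # q1 -> 'I'
              [0, 3, 1],   # q2 -> 'am'
              [1, 0, 3]])  # q3 -> 'good'
K = np.array([[3, 0, 1],   # k1 -> 'I'
              [1, 3, 0],   # k2 -> 'am'
              [0, 1, 3]])  # k3 -> 'good'

# Similarity scores: the dot product of Q with the transpose of K
scores = Q @ K.T
print(scores)
# Row i holds the dot products of query qi with keys k1, k2, k3
```

Each entry scores[i][j] is the dot product qi.kj, so the full matrix holds every query-key similarity at once.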




The similarity score for 'I'

Let's look into the first row of the Q.K^T matrix, as shown in the following figure. We can observe that we are computing the dot product between the query vector q1(I) and all the key vectors—k1(I), k2(am), and k3(good). Computing the dot product between two vectors tells us how similar they are.

Therefore, computing the dot product between the query vector (q1) and the key vectors (k1, k2, k3) tells us how similar the query vector q1 is to the key vectors k1(I), k2(am), and k3(good). By looking at the first row of the Q.K^T matrix, we can understand that the word 'I' is more related to itself than to the words 'am' and 'good', since the dot product value is higher for q1.k1 than for q1.k2 and q1.k3:



Note:
The values used in the input matrices are arbitrary and are used here just to give us a better understanding.
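The first row can also be computed entry by entry. Here is a small sketch with hypothetical vectors (arbitrary values, chosen only so that q1 is closest to k1):

```python
import numpy as np

# Hypothetical query vector for 'I' and key vectors for each word
q1 = np.array([3, 1, 0])  # query for 'I'
k1 = np.array([3, 0, 1])  # key for 'I'
k2 = np.array([1, 3, 0])  # key for 'am'
k3 = np.array([0, 1, 3])  # key for 'good'

# Each dot product is one entry in the first row of Q.K^T
sim_I    = np.dot(q1, k1)  # similarity of 'I' with 'I'
sim_am   = np.dot(q1, k2)  # similarity of 'I' with 'am'
sim_good = np.dot(q1, k3)  # similarity of 'I' with 'good'
print(sim_I, sim_am, sim_good)
```

With these values, q1.k1 comes out largest, matching the observation that 'I' is most related to itself.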


The similarity score for 'am'

Now, let's look into the second row of the Q.K^T matrix. We can observe that we are computing the dot product between the query vector q2(am) and all the key vectors—k1(I), k2(am), and k3(good). This tells us how similar q2(am) is to the key vectors k1(I), k2(am), and k3(good).

By looking at the second row of the Q.K^T matrix, we can understand that the word 'am' is more related to itself than the words 'I' and 'good' since the dot product value is higher for q2.k2 compared to q2.k1 and q2.k3:


The similarity score for 'good'

Similarly, let's look into the third row of the Q.K^T matrix. As shown in the following figure, we can observe that we are computing the dot product between the query vector q3(good) and all the key vectors—k1(I), k2(am), and k3(good). This tells us how similar the query vector q3(good) is to all the key vectors k1(I), k2(am), and k3(good).

By looking at the third row of the Q.K^T matrix, we can understand that the word 'good' is more related to itself than the words 'I' and 'am' in the sentence since the dot product value is higher for q3.k3 compared to q3.k1 and q3.k2:


Thus, we can say that computing the dot product between the query matrix, Q, and the transpose of the key matrix, K^T, essentially gives us the similarity scores, which help us understand how similar each word in the sentence is to all the other words.
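This row-wise reading can be checked programmatically. Using a hypothetical score matrix (arbitrary illustrative values in which each diagonal entry dominates its row), the most similar key for every query is the word itself:

```python
import numpy as np

# Hypothetical Q.K^T similarity scores for "I am good"
# (rows: queries q1..q3; columns: keys k1..k3; illustrative values only)
scores = np.array([[9, 6, 1],
                   [1, 9, 6],
                   [6, 1, 9]])

words = ['I', 'am', 'good']
for i, word in enumerate(words):
    # argmax over a row picks the key with the highest similarity
    most_similar = words[int(scores[i].argmax())]
    print(f"'{word}' is most similar to '{most_similar}'")
```

With these values, each word's highest score lies on the diagonal, which is exactly the pattern described for the three rows above.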
