How KEY Knows
KEY and Pretraining in TRANSFORMERS… 🗝
QUERY: I am the vector representing the word of interest.
KEY: I am the vector that, matched against QUERY, tells how much attention should be paid to each word.
VALUE: I am the vector of each word that gets weighted once the QUERY–KEY match has supplied the contextual scores.
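To make the three roles concrete, here is a minimal NumPy sketch. Everything in it (the names W_Q, W_K, W_V, the toy dimensions, the random initialization) is an illustrative assumption; in a real transformer these projection matrices are learned, as we will see below.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4            # toy sizes; real models use e.g. 512 and 64
x = rng.normal(size=d_model)   # embedding of one word, e.g. 'it'

# These projection matrices are random here only for illustration;
# in a real transformer they are learned during pretraining.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q, k, v = x @ W_Q, x @ W_K, x @ W_V   # the word's QUERY, KEY, and VALUE
print(q.shape, k.shape, v.shape)      # (4,) (4,) (4,)
```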
So, how does KEY know how much attention should be paid to each word?
If we have a sentence like ‘The tiger comes from the south, and it is the direction of divinity’, what does the word ‘it’ in the sentence refer to?
Clearly, it refers to the direction ‘south’ because the tiger is not a direction! It’s an animal of the genus Panthera (Panthera tigris). 🐯
When QUERY is the word ‘it’ and KEY is ‘tiger’, the dot product of QUERY and KEY has a low value. However, when QUERY is ‘it’ and KEY is ‘south’, the dot product has a high value. 🤠
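Here is a minimal NumPy sketch of this scaled dot-product attention. The vectors are random toys, so the printed weights are only illustrative; a trained model is what would actually put the high weight on ‘south’.

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["tiger", "south", "it"]
d_k = 4

K = rng.normal(size=(3, d_k))   # one KEY vector per word (random toys here)
V = rng.normal(size=(3, d_k))   # one VALUE vector per word
q_it = rng.normal(size=d_k)     # the QUERY vector for 'it'

scores = K @ q_it / np.sqrt(d_k)                  # dot product of 'it' with every KEY
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> attention weights
output = weights @ V                              # weighted sum of the VALUEs

for word, w in zip(words, weights):
    print(f"attention('it' -> {word}) = {w:.2f}")
# With trained vectors, the weight on 'south' would be high, on 'tiger' low.
```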
But how? How does KEY score the words correctly? How did KEY acquire this information, something akin to human-level understanding? The answer: during Pretraining! 😲
Pretraining is when the transformer learns to determine the meaning of a word from its CONTEXT! Over a large body of text (a corpus), the model learns how words are arranged, how they are associated with each other, the underlying patterns, and so on. 😍
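A minimal PyTorch sketch of that idea: the KEY projection W_K is just a trainable parameter, and a pretraining loss backpropagates into it. The MSE loss and random target below are stand-ins purely for illustration; real pretraining uses objectives like masked-token or next-token prediction.

```python
import torch

torch.manual_seed(0)
d_model, d_k = 8, 4

# The KEY projection is just an ordinary trainable parameter.
W_K = torch.randn(d_model, d_k, requires_grad=True)

x = torch.randn(3, d_model)   # embeddings for 'tiger', 'south', 'it'
target = torch.randn(3, d_k)  # stand-in training signal, for illustration only

optimizer = torch.optim.SGD([W_K], lr=0.1)
loss = torch.nn.functional.mse_loss(x @ W_K, target)  # stand-in for a real
loss.backward()                                       # pretraining objective
optimizer.step()   # gradient descent updates W_K -- this is how KEY "learns"
```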
Yes, KEY information is learned during pretraining. 🤩 Stay connected and keep learning… ☘️