Talk:Attention (machine learning)
This article is rated Start-class on Wikipedia's content assessment scale.
Confusing line "X is the input matrix of word embeddings, size 4 x 300. x is the word vector for "that". "
After "4x300" it immediately says "x is the word for 'that'." That's super-confusing, because one might think that the second x refers to the x between 4 and 300. There are three different uses of x in the sentence. Someone familiar with the field will be able to understand it, but Wikipedia is meant to be as clear as possible. ThinkerFeeler (talk) 00:20, 30 July 2023 (UTC)
typo in "asterix"?
In the following extract: "The asterix within parenthesis "(*)" denotes the softmax", shouldn't the word be "asterisk", not "asterix"? :-) Jrob kiwi (talk) 16:35, 23 August 2023 (UTC)
Does RNN mean "recursive neural network" or "recurrent neural network"?
In this article, is RNN supposed to mean "recursive neural network" or "recurrent neural network", or maybe sometimes one and sometimes the other? Once we figure this out, let's replace all occurrences with the correct three words, so that it is immediately clear even to novices. —Quantling (talk | contribs) 16:14, 24 October 2023 (UTC)
- I'm pretty sure it is "recurrent". I am going to go ahead and edit. If I have it wrong, please accept my apologies ... and fix my edit. —Quantling (talk | contribs) 16:23, 24 October 2023 (UTC)
hard vs soft weights
The intro mentions hard and soft weights, which I haven't heard before in this context. Can someone provide a citation showing that this terminology is actually in use? DMH43 (talk) 15:15, 26 December 2023 (UTC)
'word' should be replaced with something more generic
The article frequently uses the word "word" when talking about attention. For example the opening paragraph states: "It calculates "soft" weights for each word, more precisely for its embedding, in the context window.". However, attention is a concept that is independent of input type - it can and has been applied to words, pixel values, quantities, etc. I believe it would be clearer to replace the use of "word" in reference to the inputs that attention is applied to, with something more generic such as "input element" or "token". 180.150.65.6 (talk) 14:31, 5 March 2024 (UTC)
Where are the matrices coming from?
The article does not explain where the Q, K and V matrices come from or how the corresponding networks are trained. 108.53.169.6 (talk) 02:38, 4 August 2024 (UTC)
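- For reference in this thread, here is a minimal sketch of the usual answer, assuming the common Transformer-style formulation: Q, K and V are linear projections of the same input matrix X, and the projection weights are ordinary parameters learned by backpropagation together with the rest of the network. The names W_q, W_k, W_v and d_head, and the random initial values, are illustrative assumptions and are not taken from the article; only the 4 x 300 input size follows the article's example.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def random_matrix(rows, cols, rng):
    """Small random values, standing in for data or untrained weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# 4 tokens with 300-dimensional embeddings, as in the article's example.
# d_head is an assumed projection size chosen only for illustration.
n_tokens, d_model, d_head = 4, 300, 8

X = random_matrix(n_tokens, d_model, rng)  # stand-in for the input embeddings

# Assumed learned parameters: randomly initialised here, and in a real model
# updated by backpropagation like any other weight matrix.
W_q = random_matrix(d_model, d_head, rng)
W_k = random_matrix(d_model, d_head, rng)
W_v = random_matrix(d_model, d_head, rng)

# Queries, keys and values are three different projections of the same X.
Q = matmul(X, W_q)
K = matmul(X, W_k)
V = matmul(X, W_v)

print(len(Q), len(Q[0]))
```

Under these assumptions each of Q, K and V has one row per input token, so the article could state that the three matrices differ only in which learned projection is applied to X.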
Article dispute resolution
@Ffid tham you have been repeatedly reverting all article edits to a very specific version of the article. However, that version of the article is disorganized and hard to read. Consider for example:
> The attention network was designed to identify high correlations patterns amongst words in a given sentence, assuming that it has learned word correlation patterns from the training data. This correlation is captured as neuronal weights learned during training with backpropagation.
This uses awkward phrasing like "neuronal weights learned". It also says "attention network", but the attention mechanism is not a network. It is a module that can go into different kinds of neural networks.
> The diagram shows the Attention forward pass calculating correlations
This diagram is hard to understand, especially up there as the first image showing the mathematical operations all together. To have good style, the article should start simple and build the attention mechanism piece-by-piece. Specifically, the section on seq2seq was written to build the attention mechanism piece-by-piece.
After that section, then that picture can be displayed as a big summary (although I believe better pictures are available).
Furthermore, the "Encoder-decoder with attention" diagram is deeply confusing. I don't know what it shows, and I suspect neither would the readers. I have worked on the Transformer page a great deal, so I would know what an encoder-decoder mechanism is, but this diagram has defeated me. There are better diagrams out there that I can put in, from seq2seq:
Please justify your choice of that very specific version of the article, despite all the problems I have pointed out. See WP:DISPUTE for guidelines on dispute resolution.