Talk:Transformer (deep learning architecture)

Transformer (deep learning architecture) was nominated as an Engineering and technology good article, but it did not meet the good article criteria at the time (August 12, 2024, reviewed version). There are suggestions on the review page for improving the article. If you can improve it, please do; it may then be renominated.

Linguistics Mid‑importance

This article is within the scope of WikiProject Linguistics, a collaborative effort to improve the coverage of linguistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mid	This article has been rated as Mid-importance on the project's importance scale.

Computing Mid‑importance

This article is within the scope of WikiProject Computing, a collaborative effort to improve the coverage of computers, computing, and information technology on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mid	This article has been rated as Mid-importance on the project's importance scale.

Artificial Intelligence

This article is within the scope of WikiProject Artificial Intelligence, a collaborative effort to improve the coverage of Artificial intelligence on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

Text and/or other creative content from this version of Large language model was copied or moved into Transformer (machine learning model) with this edit on 2 August 2023. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists.

Wiki Education Foundation-supported course assignment

This article was the subject of a Wiki Education Foundation-supported course assignment, between 5 September 2019 and 10 December 2019. Further details are available on the course page. Student editor(s): Iliao2345.

Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 04:23, 18 January 2022 (UTC)

Suggestions for the "Background" section

The first sentence mentions "attention mechanisms" without explaining what they are. Unfortunately, no article by that name exists, and a reader looking at the RNN, LSTM, and GRU pages will find no mention of them. I think this paragraph needs to be explicit about *which* specific models introduced attention mechanisms, with adequate citation. --Ninepoints (talk) 19:25, 21 July 2020 (UTC)

For what it's worth, there's this now:

Attention (machine learning)

– AndyFielding (talk) 11:14, 18 April 2024 (UTC)

Feedback from Logan Paterson on Isaac Liao's article

Logkailp (talk) 14:41, 22 October 2019 (UTC)

Praise:
- The article does a very good job of laying the groundwork of what Transformers are and giving details on their inner workings.
- It doesn't repeat things too often.
- It links to other articles for applications of transformers instead of unnecessarily writing them out all over again.

Changes suggested:
- I would put a little more background information in the background portion, as I came into the essay knowing nothing about transformers or the way that RNNs or CNNs work, and therefore couldn't grasp the information as well as I could have had I known some background information in the beginning.
- Might want to separate the training section from the Architecture section, as they seem to be slightly different topics that could be more distinguished from one another.
- Add a little more information in the section on CNNs.

Most important improvement:
- More background information, as I put above. This may just be a problem with my background knowledge, but since the article is meant to be written for "everyone", you may want to add more to give the reader a groundwork of the topic.

Applicable to mine:
- I really like your layout of the article and how the article builds from background information to explaining the workings of the topic and how each individual part of a transformer functions, and then to the overall uses and applications of transformers.
- Smoothly transitioned from topic to topic within each subsection.

Logkailp (talk) 14:41, 22 October 2019 (UTC) Logan Paterson

"Autoregressive" link points to wrong page

Someone linked the "Autoregressive" part of "Autoregressive Convolutional Neural Network" to "Autoencoder". Yes, they both start with "Auto", but this is clearly wrong. I'd fix it, but Wiki has rules these days where you can't fix a mistake unless you log in and then specify why you made a change, sign it, and have some understanding of how the "rules for editing" work? — Preceding unsigned comment added by 65.158.32.123 (talk) 14:05, 13 January 2020 (UTC)

I've made that change now, thanks. --aricooperdavis (talk) 22:14, 20 January 2020 (UTC)

Diagrams and simple explanations

Perhaps this is a stupid question, but what do people think of adding diagrams to the article? Also what do people think of adding dummies are us explanations? Daniel.Cardenas (talk) 18:32, 18 October 2020 (UTC)

Yes, diagrams are a good idea. However, one must ensure that they aren't misleading because then they do more harm than good. I don't know what "dummies are us explanations" means. ImTheIP (talk) 19:00, 18 October 2020 (UTC)

AlphaFold, transformers, and attention mechanisms

Given the recent "milestone scientific breakthrough" being hailed for AlphaFold for its results in the protein structure prediction problem at CASP 14, and also their use in computer vision ([1], [2]; also Image GPT), I think it would be useful if we could try to present what they are trying to do in a more general framing perspective, wider and more general than their use in NLP.

(AlphaFold 2 is believed to use two transformer networks as the key core of its design).

In AlphaFold#Algorithm I've written that the transformers

"effect a mathematical transformation of [the elements of two feature-vs-feature matrices].
These transformations have the effect of bringing relevant data together and filtering out irrelevant data for these two relationships, in a context-dependent way (the "attention mechanism"), that can itself be learnt from training data."

I'd be grateful for input as to whether I've got this more or less right?

Transformers therefore seem to be maybe doing a similar job to bottleneck networks, autoencoders, latent variable extractors, and other forms of nonlinear input transformation and dimensional reduction techniques -- but there's obviously more to it than that. It might be useful to identify the similarities and differences.

(added): cf Transformers as Variational Autoencoders, found on github

Finally, it's clear that we could use an article on attention (machine learning), aka attention networks, aka attention mechanisms. Some of the following, found by Google, look like they may be relevant, but it would be good to get at least a stub created by someone who knows a bit about it.

Attention and Memory in Deep Learning
Lilian Weng, Attention? Attention!
Attention mechanism, FloydHub
Buomsoo Kim, Attention mechanism
Prodip Hore, Sayan Chatterjee A Comprehensive Guide to Attention Mechanism in Deep Learning for Everyone
also Giuliano Giacaglia, How Transformers Work, which puts attention etc in context.

Pinging @Iliao2345, Toiziz, The Anome, and ImTheIP: as recent editors here, in case you can help. Jheald (talk) 15:06, 2 December 2020 (UTC)

I agree with everything you say. Please incorporate this into the article. And yes, we should have an article on attention (machine learning), aka attention networks, aka attention mechanisms. I'll create a stub for it now. -- The Anome (talk) 09:11, 3 December 2020 (UTC)

Any idea on how to find reliable sources in this area? Most of my knowledge in the area comes from github, random blog posts, and YouTube and those sources don't count. Would ArXiv do? ImTheIP (talk) 09:25, 3 December 2020 (UTC)

@ImTheIP: Well, we're not under WP:MEDRS, or Israel/West Bank restrictions, so sourcing can be a little more permissive. Obviously, the usual hierarchy applies, with major textbooks, and reviews and survey articles and tour-de-horizon commentary pieces from the leading journals in the field near the top of the tree, and other sources falling somewhere below that. A key criterion is always: does the source have a reputation for knowing what they're talking about? (Also: how mainstream, or introductory, is what they're saying? They maybe get more latitude reviewing the foundations of the field, vs playing up their latest project.) My understanding is that ML is a field that very much talks to itself through preprints and conference papers, so arXiv papers should certainly have their place. I also think there is a place for more informal pieces like blogs or videos, which can give more accessible treatments that can be useful to readers. Videos from authoritative sources can certainly be worth adding as External links. With luck, most of this area shouldn't be controversial, so IMO it's a question of finding the balance of references that are most useful to readers. And of course, we're a wiki: so there's always a lot to be said for going with what we've got, establishing a framework or a structure for the topic, then ever-incrementally finding what we can add to the topic. People can always retire old references and ELs, if they have sources that are better.

Incidentally, the paper from Google Research on transformers in computer vision that I linked above (An image is worth 16X16 words: transformers for image recognition at scale) looks very helpful, (and also the [3] tutorial based on it). One nice thing about vision examples is that they can be so visual -- I love the pictures showing the examples of attention.

I've also seen a reference to this paper as being of interest, in applying the transformer model to molecular-biological domains with 3d symmetries.

Nice quote too, from the start of that Google paper, on Transformers vs CNNs: "Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data. However, the picture changes if the models are trained on larger datasets (14M-300M images). We find that large scale training trumps inductive bias."

-- if I'm reading that right, it's saying that with enough data, transformers can learn the symmetries and adjacencies of 1D, 2D, and 3D spaces, even when they have not been hard-coded in.

I don't want to be editing before I feel I've got a proper grasp and perspective of the subject, so I'd really appreciate if the shape of it could be laid down by those who do. But it does look very interesting! Jheald (talk) 16:28, 3 December 2020 (UTC)

The name "Transformer"

It would be great to have an explanation for the name "Transformer" included in the article, if one exists, or otherwise a clarification that the name is arbitrary. — Preceding unsigned comment added by AVM2019 (talk • contribs) 20:57, 5 December 2020 (UTC)

Vanilla Transformer Code: Incomplete

The "Pseudocode" section may be doing more to confuse than help because many of the terms are undefined (copy it to Python to see what I mean). So here is what I suggest:

Temporarily remove it.
Update the code to include relevant imports in PyTorch or TensorFlow, or make custom definitions so that all terms are well defined in the code.
Post it again. — Preceding unsigned comment added by 103.118.46.204 (talk • contribs)

This was PSEUDO Code not CODE. Why not just leave it. If one is able to program, one will find the right layers in pytorch or tensorflow.... — Preceding unsigned comment added by Nico Hambauer (talk • contribs)

@Nico Hambauer and 103.118.46.204: I would argue that it is not pseudo-code, but rather an incomplete Python implementation. Pseudocode, by its very definition, should not be as language-specific as this code snippet. Python operations such as "embedding()", "multi_head_attention", etc., should not appear in pseudocode; rather, the pseudocode should be readable by programmers in any language, whether or not they are familiar with the operation of these specific Python operations. WikiDan61^ChatMe!_ReadMe!! 13:53, 27 July 2021 (UTC)

I agree with WikiDan61. Stuff like "multi_head_attention(x, x, x, None)" is completely unreadable for those not already familiar with Python and the framework this is written in. intforce (talk) 14:06, 27 July 2021 (UTC)

Ok maybe then I am wrong and the one that just stupidly used all the frameworks and now is used to it without noting differences anymore, but I was kind of sad to see it go as it was kind of helpful even if one had to look up all the implementations if not used to the libs. Thanks for the note! Will revert my change then :) — Preceding unsigned comment added by Nico Hambauer (talk • contribs)

@Nico Hambauer: The pseudocode can be made useful if the functioning of the framework functions can be explained, rather than just assuming that the reader knows what they do. The best way to do this would be to include in the pseudocode a declaration of the function with a pseudocode description of its operation. Then the function can be invoked within the pseudocode, since the reader will now have the knowledge required to understand it. WikiDan61^ChatMe!_ReadMe!! 14:55, 27 July 2021 (UTC)
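
For illustration only, here is a minimal sketch of what "declare the function, then invoke it" could look like, written in plain Python with no framework so that every term is defined on the page. It is not the code that was removed from the article. The helper names `matvec` and `single_head`, and the extra `heads` argument (one triple of projection matrices per head), are assumptions made for this example, and the final output projection used in real models is left out.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def embedding(token_ids, table):
    """Look up one vector per token id; `table` holds one row per token."""
    return [table[t] for t in token_ids]

def single_head(queries, keys, values, mask=None):
    """Scaled dot-product attention; mask[i][j] is true if query i may see key j."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        if mask is not None:
            scores = [s if mask[i][j] else -math.inf for j, s in enumerate(scores)]
        peak = max(scores)  # subtract the max before exponentiating, for stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Weighted average of the value vectors, weights given by the softmax.
        outputs.append([sum(e / total * v[d] for e, v in zip(exps, values))
                        for d in range(len(values[0]))])
    return outputs

def multi_head_attention(query, key, value, mask, heads):
    """Run one attention per head on projected inputs and concatenate the results.

    `heads` is a list of (Wq, Wk, Wv) matrix triples (hypothetical parameter).
    """
    per_head = []
    for wq, wk, wv in heads:
        q = [matvec(wq, x) for x in query]
        k = [matvec(wk, x) for x in key]
        v = [matvec(wv, x) for x in value]
        per_head.append(single_head(q, k, v, mask))
    # Concatenate the head outputs position by position.
    return [sum((h[i] for h in per_head), []) for i in range(len(query))]

identity = [[1.0, 0.0], [0.0, 1.0]]
table = [[1.0, 0.0], [0.0, 1.0]]
x = embedding([0, 1, 0], table)
out = multi_head_attention(x, x, x, None, heads=[(identity,) * 3, (identity,) * 3])
```

With definitions like these in front of the reader, a call such as `multi_head_attention(x, x, x, None, heads)` reads as "self-attention with no mask": the same sequence supplies the queries, keys, and values.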

There is a readable XLNet publication ...

... at arxiv: https://arxiv.org/abs/1906.08237

Is there a compelling reason to cite it at OCLC, rather than in the place where people will be able to read it? 222.154.128.36 (talk) 09:14, 2 April 2022 (UTC)

Suggestion to increase the "Importance" to "Mid" or "High"

With recent progress within AI, transformers are entering more conversations with non-experts. Also, this topic is relevant to a growing number of fields outside of linguistics. Cscangarella (talk) 04:34, 10 April 2023 (UTC) Cscangarella

This is already oversimplified. It should never devolve even further into an article for people who can't even understand the current form. That would make it useless to the only people whom knowledge of Transformers could possibly serve. It needs to become more technical, not less. Someone lacking WP:COMPETENCE might be similarly offended by the articles on specific topics in Pure Mathematics. "Linguistics"... 76.188.120.7 (talk) 18:27, 12 April 2023 (UTC)

NPOV history

Please don't be like Schmidhuber.

Especially nefarious is retroactively naming "linear Transformer" to the 1993 model without explaining it is a retroactive naming, or just quoting old passages where "attention" is used metaphorically as if it is a direct originator of attention mechanism.

I think the fast weight controller is not a hushed-up origin of modern Transformers, but rather an attempt to apply high-order neural networks, or pi-sigma networks (1991), to the problem of processing sequential data. It failed to gain traction, and plain LSTM dominated until 2014, when seq2seq introduced the attention mechanism to LSTM, and 2017, when the attention mechanism was purified into the Transformer. pony in a strange land (talk) 01:06, 24 April 2023 (UTC)

relies too much on primary refs?

I think the notice on relying too much on primary references is not correct. The article has nearly 90 references. The primary reference here would be the 2017 paper ("Attention Is All You Need") and possibly some work leading up to that paper. However, most papers are after that, by different authors. Those are academic references, but not primary to the transformer architecture. Bquast (talk) 15:12, 13 May 2023 (UTC)

I suggest removing the notice. Maybe an inline notice about needing more non-academic sources could be added lower down. Bquast (talk) 15:13, 13 May 2023 (UTC)

Did Jürgen Schmidhuber invent Transformers?

Conflicting edits have added/removed statements such as: "In 1992, the first kind of Transformer was published by Jürgen Schmidhuber under the name 'fast weight controller.'"

Schmidhuber has been involved in multiple controversies over what he terms credit assignment^[1]. He holds a minority but not fringe view regarding the proper attribution of ideas in the field of AI.^[2]^[3]^[4]

The paper "Attention is All You Need"^[5] by Vaswani et al describes the Transformer as follows: "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention."

The paper Learning to Control Fast-Weight Memories^[6] by Jürgen Schmidhuber describes the Fast Weight Controller as: "This paper describes an alternative class of gradient-based systems consisting of two feedforward nets that learn to deal with temporal sequences using fast weights: the first net learns to produce context dependent weight changes for the second net whose weights may vary very quickly."

There is not an immediate resemblance between the two methods: Transformers are a sequence-to-sequence model using self-attention, and Fast-Weight Controllers sound more like a predecessor to Hypernetworks^[7] ("an approach of using one network...to generate the weights for another network") or Memory Networks^[8].

But, in the years after the Transformer gained popularity, several modified and altered systems based on the Transformer were proposed. One such system was the Linear Transformer^[9] by Katharopoulos et al. which "[expresses] the self-attention as a linear dot-product of kernel feature maps... We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks."

The Linear Transformer is not the same as the Transformer, but in the paper Linear Transformers Are Secretly Fast Weight Programmers^[10] Schmidhuber proves that it is mathematically equivalent to the Fast-Weight Controller, apart from its normalization scheme.
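
To make the claimed equivalence concrete, here is a small sketch under stated assumptions, not a reproduction of either paper's code. It covers only the causal, unnormalized case, consistent with the "apart from its normalization scheme" caveat, and it uses the elu(x) + 1 feature map of Katharopoulos et al. The function names are made up for this example, and the 1992 "slow" network that generates the updates is not modelled: the keys and values are simply given.

```python
import math

def phi(vector):
    """Feature map elu(x) + 1, which keeps every feature positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_attention(queries, keys, values):
    """Attention view: out_i = sum over j <= i of (phi(q_i) . phi(k_j)) * v_j."""
    outputs = []
    for i, q in enumerate(queries):
        out = [0.0] * len(values[0])
        for j in range(i + 1):
            weight = dot(phi(q), phi(keys[j]))
            out = [o + weight * v for o, v in zip(out, values[j])]
        outputs.append(out)
    return outputs

def fast_weight_memory(queries, keys, values):
    """Fast-weight view: W += v outer phi(k), then out = W phi(q)."""
    d_v, d_k = len(values[0]), len(keys[0])
    fast = [[0.0] * d_k for _ in range(d_v)]  # the rapidly changing weight matrix
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        for r in range(d_v):
            for c in range(d_k):
                fast[r][c] += v[r] * fk[c]  # write step: rank-one update
        fq = phi(q)
        outputs.append([dot(row, fq) for row in fast])  # read step
    return outputs

queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[1.0, 0.0], [-0.5, 2.0], [0.3, 0.3]]
values = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 2.0, -2.0]]
a = linear_attention(queries, keys, values)
b = fast_weight_memory(queries, keys, values)
```

The two functions return the same numbers up to floating-point rounding, because W phi(q_i) expands to the same sum over earlier positions. Softmax attention, as in Vaswani's Transformer, does not factor this way, which is why the equivalence stops at the linear variant.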

To cover Jürgen Schmidhuber's contributions without violating either WP:NPOV or WP:UNDUE, I propose that the article should make clear the following:

Schmidhuber invented the Fast Weight Controller
The FWC was mathematically almost identical to Katharopoulos' Linear Transformer, but not to Vaswani's Transformer
The FWC did not have the language-processing capabilities of a modern Transformer
The FWC is a notable historical contribution to the line of research that produced the Transformer (along with other forms of recurrent neural networks in the 80s, 90s, and 2000s.)

Lwneal (talk) 18:06, 12 August 2023 (UTC)

"Decoder only" is ill defined

Description of decoder block lists the original three sub-layers (a self-attention mechanism, an attention mechanism over the encodings, and a feed-forward neural network) but later in the Terminology section "decoder only" is defined as autoregressive encoder, autoregressive decoder.

The words "decoder only" imply the lack of an encoder, yet nothing in the article addresses how this "autoregressive encoding" happens without an encoder, or what shape the decoder block takes, since "an attention mechanism over the encodings" is confusing when the source of the encodings is not given in this case. 101.178.0.181 (talk) 01:15, 4 September 2023 (UTC)
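
For what it's worth, in the usual usage as I understand it, a decoder-only block simply drops the "attention over the encodings" sub-layer and keeps the causally masked self-attention and the feed-forward network, so the prompt and the generated tokens pass through the same stack. The sketch below shows only the masking part. The function name is an assumption, and there are no learned projections: queries, keys, and values are all the raw inputs.

```python
import math

def causal_self_attention(x):
    """Self-attention in which position i sees only positions 0..i."""
    scale = math.sqrt(len(x[0]))
    outputs = []
    for i, q in enumerate(x):
        visible = x[: i + 1]  # causal mask: later positions are hidden
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visible]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[d] for e, v in zip(exps, visible))
                        for d in range(len(q))])
    return outputs

prompt = [[1.0, 0.0], [0.0, 1.0]]
longer = prompt + [[5.0, 5.0]]  # the same prompt with one more token appended
short_out = causal_self_attention(prompt)
long_out = causal_self_attention(longer)
```

Appending a token leaves the outputs at earlier positions unchanged (`long_out[:2]` equals `short_out`). That property is what lets such a model generate one token at a time without a separate encoder.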

references 33 and 35 seem unhelpful

Why is there a need to cite a paper from the arXiv, published a year after the paper which made a scientific leap?
What is the point of the Ithaca example? Ladypine (talk) 09:02, 31 October 2023 (UTC)

Loss function

Seemingly not covered in the article: when creating a transformer, what is the loss function to be minimized? I see in the article that once trained, a transformer can be used with a post-processing layer (or layers) to be trained, which enable a specific task such as classification. I understand a loss function for the transformer-plus-classification task, but what is the loss function used on the raw transformer before a specific task is chosen to be appended?

Or putting it another way, I can't be the only person who is looking for mention of a loss function. I would very much appreciate a sentence along the lines of one of these:

The loss function is, in effect, ....
In lieu of a loss function, ....

Thanks — Quantling (talk | contribs) 20:44, 1 November 2023 (UTC)

You have to attach a task head later, and the task head uses some loss function suitable for solving your task. Biggerj1 (talk) 21:58, 6 March 2024 (UTC)
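
Adding to the answer above, as a hedged sketch rather than a description of any particular model: before any task head is attached, language-model-style transformers are commonly pretrained with a cross-entropy loss on token prediction, and the original 2017 model was trained with cross-entropy on the next target token in translation. The function names and the toy logits below are assumptions for illustration only.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[target]

def next_token_loss(logits_per_position, token_ids):
    """Average loss where the output at position t is scored against token t+1."""
    targets = token_ids[1:]
    losses = [cross_entropy(l, t) for l, t in zip(logits_per_position, targets)]
    return sum(losses) / len(losses)

# A model that knows nothing assigns equal scores to all 4 vocabulary entries.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 2
loss = next_token_loss(uniform, [3, 1, 2])
```

For the uniform case the loss equals log(4), the cost of guessing among four tokens. Training lowers it by moving probability onto the tokens that actually come next.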

Wiki Education assignment: Research Process and Methodology - FA23 - Sect 202 - Thu

This article was the subject of a Wiki Education Foundation-supported course assignment, between 6 September 2023 and 14 December 2023. Further details are available on the course page. Student editor(s): HELLOEXTRACREDIT (article contribs).

— Assignment last updated by HELLOEXTRACREDIT (talk) 20:51, 11 November 2023 (UTC)

Wiki Education assignment: Linguistics in the Digital Age

This article was the subject of a Wiki Education Foundation-supported course assignment, between 21 August 2023 and 11 December 2023. Further details are available on the course page. Student editor(s): Gh0828 (article contribs).

— Assignment last updated by Fedfed2 (talk) 00:54, 9 December 2023 (UTC)

Transformers transform what?

I came to this article to learn what a "Transformer" is or does. After reading it twice, I still haven't determined much of anything about why it would be called a "transformer" or what place in an A.I. system it fits. According to Wikipedia tradition, and probably the MOS, the answer should have been in the first few sentences. Instead, I have dug through a word salad of gobblydagoop and have only faint impressions of the underlying technology involved but no clear, top-level understanding of what it does. —EncMstr (talk) 22:08, 29 March 2024 (UTC)

The name isn't of much importance to be honest. Researchers like naming things any which way. 80.2.247.44 (talk) 20:19, 6 July 2024 (UTC)

Timeline too long

The timeline is currently 90% just some highlights of language modeling, 1990-2018. Also, it gives undue focus to Schmidhuber. Despite what he says, attention mechanisms had been studied in vision models since the late 1980s; see Attention (machine learning) § History for references.

A good rule of thumb is to assume Schmidhuber is wrong about history. pony in a strange land (talk) 09:03, 6 August 2024 (UTC)

GA Review

Unsuccessful. Phlsph7 (talk) 08:14, 12 August 2024 (UTC)

The following discussion is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

This review is transcluded from Talk:Transformer (deep learning architecture)/GA1. The edit link for this section can be used to add comments to the review.

Nominator: Cosmia Nebula (talk · contribs) 19:04, 9 August 2024 (UTC)

Reviewer: Phlsph7 (talk · contribs) 08:14, 12 August 2024 (UTC)

Hello Cosmia Nebula and thanks for all your improvements to this article. However, despite the improvements, the article fails criterion 2b since there are too many unreferenced paragraphs. Examples are the paragraphs starting with "For many years, sequence modelling ", "As the Transformer architecture natively processes", and "A positional encoding is a fixed-size vector". According to criterion 2b, these passages require inline citations "no later than the end of the paragraph".

The article cites many papers from arXiv. They are usually considered self-published sources, making them unreliable, see WP:ARXIV. Maybe some of them are also published in reliable journals, in which case you could cite these versions instead. You would probably have to replace the rest with other sources.

I suggest that you add all the missing references and replace the arXiv papers before a renomination.

A few other observations

WP:EARWIG detects no copyvios
"Linear transformers were first developed as an improvement over previous architectures for machine translation, but has found many applications since then.": there is a problem with the clause starting with "but"; should it be "..., but many additional applications have been found for them since then"?
"An well-cited early example was": replace "An well-cited" with "A well-cited" or maybe with "An often-cited".
"One key innovation was use of an attention mechanism": add "the" before "use".
"by removing its recurrence to processes all tokens in parallel": should this be "to process" instead of "to processes"?

Phlsph7 (talk) 08:14, 12 August 2024 (UTC)

The discussion above is closed. Please do not modify it. Subsequent comments should be made on the appropriate discussion page. No further edits should be made to this discussion.

[1] https://people.idsia.ch/~juergen/deep-learning-history.html

[2] https://www.nytimes.com/2016/11/27/technology/artificial-intelligence-pioneer-jurgen-schmidhuber-overlooked.html

[3] https://www.youtube.com/watch?v=HGYYEUSm-0Q&t=3770s

[4] https://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html

[5] https://arxiv.org/pdf/1706.03762.pdf

[6] https://mediatum.ub.tum.de/doc/814768/document.pdf

[7] https://arxiv.org/pdf/1609.09106.pdf

[8] https://arxiv.org/pdf/1410.3916

[9] https://linear-transformers.com/

[10] https://arxiv.org/pdf/2102.11174.pdf
