Talk:Needleman–Wunsch algorithm
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||||||||||||||||||
|
Error in formula
editI think there's an error in the formula for the iteration of the algorithm: it currently reads:
Fij = max(Fi − 1,j − 1 + S(Ai − 1,Bj − 1),Fi,j − 1 + d,Fi − 1,j + d)
I think it should read
Fij = max(Fi − 1,j − 1 + S(Ai,Bj),Fi,j − 1 + d,Fi − 1,j + d)
(the suffices in S(A,B) need changing) as i'm new to dynamic programming, i'm not sure about this.
Any ideas anyone?
The section following this, starting with 'Basis':
As the algorithm progresses, the Fij will be assigned to be the optimal score for the alignment of the first i characters in A and the first j characters in B. The principle of optimality is then applied as follows.
Does this sequence begin at i=0? or i=1?
==> There were numerous inconsistencies in the article regarding indexing (zero based or one based)? I made everything zero based. The pseudo code is now self-consistent and consistent with the linked Java code. The above quote is now correct: "Fij represents optimal score for the alignment of the first i ...", where i and j can be zero. -- Kipton Barros, 4/18/2010.
"To compute which alignment actually gives this score, you can start from the bottom left cell" - should this not be right?
Huge error with the initialization concept?==
editIf you google search Needleman-Wunsch algorithm you will find one of the top matches is this: http://www.ludwig.edu.au/course/lectures2005/Likic.pdf
This shows that the initialization is based on something other than zeroes. In my solution I implemented this using the (Java) code:
// Initialize the 0,0 element F[0][0] = 0; // The top and left sides need to be assigned as if they were dashes for (int i = 1; i <= A.length(); i++) { F[i][0] = d * i; } for (int j = 1; j <= B.length(); j++) { F[0][j] = d * j; }
(Sorry about probably not following Wiki etiquette, I am a complete newbie here) —The preceding unsigned comment was added by Ravi Luthra (talk • contribs) .
Errors with the algorithm as it is presented here:
edit1) In the matrix calculation, the loop iterates from 1 to length(A) and length(B) where the sequences start from 0. Therefore, S(Ai,Bj) should be S(Ai-1, Bj-1) since S(Ai,Bj) would return an error.
==> I have made the suggested change here (and elsewhere) which corresponds to zero based indexing for the sequences. The pseudo-code should now be self consistent. -- Kipton Barros, 4/18/2010
====> This is still not fixed, it is true that S(Ai,Bj) should be changed to S(Ai-1, Bj-1), as well as changing Ai and Bi to Ai-1 and Bi-1 respectively in the whole pseudo-code for it to work properly (I tried it in C with this fix and it now works fine). The whole problem is that the F matrix goes from [0][0] to [lengthA][lengthB], while the sequences go from [0] to [length-1]. So the pseudo code should be changed to this:
for i=0 to length(A) F(i,0) ← d*i for j=0 to length(B) F(0,j) ← d*j for i=1 to length(A) for j=1 to length(B) { Match ← F(i-1,j-1) + S(Ai-1, Bj-1) Delete ← F(i-1, j) + d Insert ← F(i, j-1) + d F(i,j) ← max(Match, Insert, Delete) }
AlignmentA ← "" AlignmentB ← "" i ← length(A) j ← length(B) while (i > 0 or j > 0) { if (i > 0 and j > 0 and F(i,j) == F(i-1,j-1) + S(Ai-1, Bj-1)) { AlignmentA ← Ai-1 + AlignmentA AlignmentB ← Bj-1 + AlignmentB i ← i - 1 j ← j - 1 } else if (i > 0 and F(i,j) == F(i-1,j) + d) { AlignmentA ← Ai-1 + AlignmentA AlignmentB ← "-" + AlignmentB i ← i - 1 } else (j > 0 and F(i,j) == F(i,j-1) + d) { AlignmentA ← "-" + AlignmentA AlignmentB ← Bj-1 + AlignmentB j ← j - 1 } }
192.55.54.40 (talk) 23:19, 6 August 2013 (UTC) Vittorio Capra.
-- IMHO, this algorithm uses the matrix with length(B) + 1 rows and length(A) + 1 columns; a cells in first row and column contain an initial gap penalty; the example of matrix is not from this algorithm —Preceding unsigned comment added by 212.96.160.35 (talk) 20:01, 3 December 2009 (UTC)
2) As in the previous post ('huge error...'), the initialization needs to change according to the "Genomic Perl" book.
3) The code in which the alignments are calculated also gives wrong results. I am using the code in "Genomic Perl" book and it works fine.
Defining d and edit distance
editNote that everyone so far has failed to define the variable d in all their formulas. What is this? Also, I just backtracked up an edit distance matrix and the alignments generated definitely seemed fine to me (inserting "-" when you don't move diagonally). Is this formula a generalization of that? It seems like it. Should this be noted and explained in this page or the Levenshtein page? —Preceding unsigned comment added by 65.88.178.10 (talk) 18:17, 17 September 2007 (UTC)
d appears to be the gap penalty...
Ok, following the link posted above gives an excellent explanation. If you're confused by the wiki page, don't feel bad, it's not very clear. As for the Levenshtein relationship, the matrix you generate in that algorithim is equivalent to F where the (cost of insertion) = (cost of deletion) = (gap penalty), and S(i,j) = the substitution cost.
Error ?
editI made an implementation of this algorithm in C. It seems there is some misstake : (well, I am not completly sure of me, but I did the following change in my source code to make it works correctly :
{
i <- length(B) - 1 j <- length(A) - 1
}
instead of
{
i <- length(A) - 1 j <- length(B) - 1
}
According we consider F(x,y) (x number of line, y number of column). Which has been the choice in my implementation, and seems to be the choice in this algorithm.
Thanks to confirm it
Valeuf (talk) 17:51, 17 May 2009 (UTC)
I found the same error in my code. The algorithm seems to begin assuming array indices start at 0 and then later on is assumes indices start at 1. For an algorithm in C you also need to alter the while loops so that they exit when i, j <= 0, not when i, j < 0. This still has yet to be fixed. —Preceding unsigned comment added by 18.42.1.137 (talk) 00:10, 20 April 2011 (UTC)
Illustration ?
editThis article really needs a visual illustration of what a typical F matrix would contain. Can anyone provide ?
Adding a link
editThe following link gives a clear explanation of Needleman-Wunsch algorithm, and can be included in the external links.
http://technology66.blogspot.com/
--122.169.92.175 (talk) 21:11, 12 September 2008 (UTC)
full URL: http://technology66.blogspot.de/2008/08/sequence-alignment-techniques.html --Hubalu (talk) 08:58, 16 July 2013 (UTC)
Wat is S?
editIn the pseudo code, what does S mean? It's never mentioned in the text above
Match ← F(i-1,j-1) + S(Ai, Bj)
I suggest a short explaination.
Code optimization?
editWhy is the F(0,0) initialized two times?
for i=0 to length(A) F(i,0) ← d*i for j=0 to length(B) F(0,j) ← d*j
Either initialize the F(0,0) in a separate statement or adjust the start index of the second loop:
F(0,0) ← 0 for i=1 to length(A) F(i,0) ← d*i for j=1 to length(B) F(0,j) ← d*j
for i=0 to length(A) F(i,0) ← d*i for j=1 to length(B) F(0,j) ← d*j
Indexing problem
editI think there's an issue with the indexes in the verbal description of the algorithm. There are i rows, one for each element of sequence B. We then index i over A rather than B. Likewise there are j columns based on the number of elements in A and then we index j over B. In a square matrix, this should result in the transpose of the desired match matrix and in a non-square matrix this will result in an index going out of bounds. — Preceding unsigned comment added by 107.191.32.203 (talk) 14:48, 14 August 2014 (UTC)
The ″U″ in DNA sequence
editHi! I've just noticed a mistake in one of the example sequences. The last base in the first sequence is ″U″ (GCATGCU). As it is DNA it should only contain A, T, C and G, shouldn't it?--Zlir'a (talk) 09:52, 30 December 2014 (UTC)
Yes. It should be a T not U. RNA is coded with AUCG DNA is coded with ATCG. Though in some contexts U and T can be thought of as equivalent I'd still consider that wrong.
I think it would be better to replace it with a 'G'. Using a T would result in an error in the matrix, as the second T in GATT is penalized in the matrix. Using a C or a G would be consistent. — Preceding unsigned comment added by 157.193.240.3 (talk) 16:11, 22 June 2018 (UTC)
Keyboard layout proximity ?
edithttps://wiki.melissadata.com/?title=Matchcode_Optimization%3ANeedleman-Wunsch mentions: "Needleman-Wunsch - A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far two characters are on a standard keyboard layout. For example: A to S is given a mistake weight of 0.4, while A to D is a 0.6 and A to P is a 1.0" but the Needleman-Wunsch Wikipedia article does not mention keyboard layout proximity. Should the article mention keyboard layout proximity ? --Jean-Marc Liotier (talk) 09:53, 7 September 2016 (UTC)
- The general algorithm allows for S as the (usually look-up table) comparing input data items. Keyboard distance is one metric that could be used to generate such table. In the case of protein sequences, there are biological statistical reasons related to how likely mutations are to occur. Gah4 (talk) 01:45, 20 June 2021 (UTC)
- In biology, this is described by the Point accepted mutation, and more usually through the BLOSUM matrix. Gah4 (talk) 01:47, 20 June 2021 (UTC)
External links modified (February 2018)
editHello fellow Wikipedians,
I have just modified one external link on Needleman–Wunsch algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
- Added archive https://web.archive.org/web/20140729133238/http://www.science.marshall.edu/murraye/Clearer%20Matrix%20slide%20show.pdf to http://www.science.marshall.edu/murraye/Clearer%20Matrix%20slide%20show.pdf
- Added
{{dead link}}
tag to http://svitsrv25.epfl.ch/R-doc/library/Biostrings/html/00Index.html
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}}
(last update: 5 June 2024).
- If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
- If you found an error with any archives or the URLs themselves, you can fix them with this tool.
Cheers.—InternetArchiveBot (Report bug) 09:29, 15 February 2018 (UTC)
diff
editNote that this algorithm was later used for the Unix diff command. It used to be mentioned on the man page for diff, but it seems not anymore. It might be useful to say that, as that is where more people will actually use it. Gah4 (talk) 09:02, 18 June 2021 (UTC)
Mistake/Typo/Redundancy in Basic Scoring Section
editIn the "Basic Scoring" section, two alternate example scoring schemes are discussed... but they are in fact identical. I think the edit "15:19, 5 April 2023 106.205.216.93 →Basic scoring schemes" introduced a mistake. Previous to this edit, the schemes were different. 136.38.95.121 (talk) 18:08, 10 May 2024 (UTC)