Talk:Needleman–Wunsch algorithm

Latest comment: 6 months ago by 136.38.95.121 in topic Mistake/Typo/Redundancy in Basic Scoring Section

Error in formula

edit

I think there's an error in the formula for the iteration of the algorithm: it currently reads:

Fij = max(Fi − 1,j − 1 + S(Ai − 1,Bj − 1),Fi,j − 1 + d,Fi − 1,j + d)

I think it should read

Fij = max(Fi − 1,j − 1 + S(Ai,Bj),Fi,j − 1 + d,Fi − 1,j + d)

(the suffices in S(A,B) need changing) as i'm new to dynamic programming, i'm not sure about this.

Any ideas anyone?

The section following this, starting with 'Basis':

As the algorithm progresses, the Fij will be assigned to be the optimal score for the alignment of the first i characters in A and the first j characters in B. The principle of optimality is then applied as follows.

Does this sequence begin at i=0? or i=1?

==> There were numerous inconsistencies in the article regarding indexing (zero based or one based)? I made everything zero based. The pseudo code is now self-consistent and consistent with the linked Java code. The above quote is now correct: "Fij represents optimal score for the alignment of the first i ...", where i and j can be zero. -- Kipton Barros, 4/18/2010.

"To compute which alignment actually gives this score, you can start from the bottom left cell" - should this not be right?

Huge error with the initialization concept?==

edit

If you google search Needleman-Wunsch algorithm you will find one of the top matches is this: http://www.ludwig.edu.au/course/lectures2005/Likic.pdf

This shows that the initialization is based on something other than zeroes. In my solution I implemented this using the (Java) code:

// Initialize the 0,0 element
F[0][0] = 0;

// The top and left sides need to be assigned as if they were dashes
for (int i = 1; i <= A.length(); i++)
{
  F[i][0] = d * i;
}
for (int j = 1; j <= B.length(); j++)
{
  F[0][j] = d * j;
}

(Sorry about probably not following Wiki etiquette, I am a complete newbie here) —The preceding unsigned comment was added by Ravi Luthra (talkcontribs) .

Errors with the algorithm as it is presented here:

edit

1) In the matrix calculation, the loop iterates from 1 to length(A) and length(B) where the sequences start from 0. Therefore, S(Ai,Bj) should be S(Ai-1, Bj-1) since S(Ai,Bj) would return an error.

==> I have made the suggested change here (and elsewhere) which corresponds to zero based indexing for the sequences. The pseudo-code should now be self consistent. -- Kipton Barros, 4/18/2010

====> This is still not fixed, it is true that S(Ai,Bj) should be changed to S(Ai-1, Bj-1), as well as changing Ai and Bi to Ai-1 and Bi-1 respectively in the whole pseudo-code for it to work properly (I tried it in C with this fix and it now works fine). The whole problem is that the F matrix goes from [0][0] to [lengthA][lengthB], while the sequences go from [0] to [length-1]. So the pseudo code should be changed to this:

for i=0 to length(A)
  F(i,0) ← d*i
for j=0 to length(B)
  F(0,j) ← d*j
for i=1 to length(A)
  for j=1 to length(B)
  {
    Match ← F(i-1,j-1) + S(Ai-1, Bj-1)
    Delete ← F(i-1, j) + d
    Insert ← F(i, j-1) + d
    F(i,j) ← max(Match, Insert, Delete)
  }
AlignmentA ← ""
AlignmentB ← ""
i ← length(A)
j ← length(B)
while (i > 0 or j > 0)
{
  if (i > 0 and j > 0 and F(i,j) == F(i-1,j-1) + S(Ai-1, Bj-1))
  {
    AlignmentA ← Ai-1 + AlignmentA
    AlignmentB ← Bj-1 + AlignmentB
    i ← i - 1
    j ← j - 1
  }
  else if (i > 0 and F(i,j) == F(i-1,j) + d)
  {
    AlignmentA ← Ai-1 + AlignmentA
    AlignmentB ← "-" + AlignmentB
    i ← i - 1
  }
  else (j > 0 and F(i,j) == F(i,j-1) + d)
  {
    AlignmentA ← "-" + AlignmentA
    AlignmentB ← Bj-1 + AlignmentB
    j ← j - 1
  }
}

192.55.54.40 (talk) 23:19, 6 August 2013 (UTC) Vittorio Capra.Reply

-- IMHO, this algorithm uses the matrix with length(B) + 1 rows and length(A) + 1 columns; a cells in first row and column contain an initial gap penalty; the example of matrix is not from this algorithm —Preceding unsigned comment added by 212.96.160.35 (talk) 20:01, 3 December 2009 (UTC)Reply

2) As in the previous post ('huge error...'), the initialization needs to change according to the "Genomic Perl" book.

3) The code in which the alignments are calculated also gives wrong results. I am using the code in "Genomic Perl" book and it works fine.

Defining d and edit distance

edit

Note that everyone so far has failed to define the variable d in all their formulas. What is this? Also, I just backtracked up an edit distance matrix and the alignments generated definitely seemed fine to me (inserting "-" when you don't move diagonally). Is this formula a generalization of that? It seems like it. Should this be noted and explained in this page or the Levenshtein page? —Preceding unsigned comment added by 65.88.178.10 (talk) 18:17, 17 September 2007 (UTC)Reply

d appears to be the gap penalty...

Ok, following the link posted above gives an excellent explanation. If you're confused by the wiki page, don't feel bad, it's not very clear. As for the Levenshtein relationship, the matrix you generate in that algorithim is equivalent to F where the (cost of insertion) = (cost of deletion) = (gap penalty), and S(i,j) = the substitution cost.

Error ?

edit

I made an implementation of this algorithm in C. It seems there is some misstake : (well, I am not completly sure of me, but I did the following change in my source code to make it works correctly :

{

 i <- length(B) - 1
 j <- length(A) - 1

}

instead of

{

 i <- length(A) - 1
 j <- length(B) - 1

}

According we consider F(x,y) (x number of line, y number of column). Which has been the choice in my implementation, and seems to be the choice in this algorithm.

Thanks to confirm it

Valeuf (talk) 17:51, 17 May 2009 (UTC)Reply

I found the same error in my code. The algorithm seems to begin assuming array indices start at 0 and then later on is assumes indices start at 1. For an algorithm in C you also need to alter the while loops so that they exit when i, j <= 0, not when i, j < 0. This still has yet to be fixed. —Preceding unsigned comment added by 18.42.1.137 (talk) 00:10, 20 April 2011 (UTC)Reply

Illustration ?

edit

This article really needs a visual illustration of what a typical F matrix would contain. Can anyone provide ?

edit

The following link gives a clear explanation of Needleman-Wunsch algorithm, and can be included in the external links.

http://technology66.blogspot.com/

--122.169.92.175 (talk) 21:11, 12 September 2008 (UTC)Reply

full URL: http://technology66.blogspot.de/2008/08/sequence-alignment-techniques.html --Hubalu (talk) 08:58, 16 July 2013 (UTC)Reply

Wat is S?

edit

In the pseudo code, what does S mean? It's never mentioned in the text above

Match ← F(i-1,j-1) + S(Ai, Bj)

I suggest a short explaination.

Mcemperor (talk) 16:27, 30 November 2011 (UTC)Reply

Code optimization?

edit

Why is the F(0,0) initialized two times?

for i=0 to length(A)
  F(i,0) ← d*i
for j=0 to length(B)
  F(0,j) ← d*j

Either initialize the F(0,0) in a separate statement or adjust the start index of the second loop:

F(0,0) ← 0
for i=1 to length(A)
  F(i,0) ← d*i
for j=1 to length(B)
  F(0,j) ← d*j
for i=0 to length(A)
  F(i,0) ← d*i
for j=1 to length(B)
  F(0,j) ← d*j

--Hubalu (talk) 09:00, 16 July 2013 (UTC)Reply

Indexing problem

edit

I think there's an issue with the indexes in the verbal description of the algorithm. There are i rows, one for each element of sequence B. We then index i over A rather than B. Likewise there are j columns based on the number of elements in A and then we index j over B. In a square matrix, this should result in the transpose of the desired match matrix and in a non-square matrix this will result in an index going out of bounds. — Preceding unsigned comment added by 107.191.32.203 (talk) 14:48, 14 August 2014 (UTC)Reply

The ″U″ in DNA sequence

edit

Hi! I've just noticed a mistake in one of the example sequences. The last base in the first sequence is ″U″ (GCATGCU). As it is DNA it should only contain A, T, C and G, shouldn't it?--Zlir'a (talk) 09:52, 30 December 2014 (UTC)Reply


Yes. It should be a T not U. RNA is coded with AUCG DNA is coded with ATCG. Though in some contexts U and T can be thought of as equivalent I'd still consider that wrong.

I think it would be better to replace it with a 'G'. Using a T would result in an error in the matrix, as the second T in GATT is penalized in the matrix. Using a C or a G would be consistent. — Preceding unsigned comment added by 157.193.240.3 (talk) 16:11, 22 June 2018 (UTC)Reply

Keyboard layout proximity ?

edit

https://wiki.melissadata.com/?title=Matchcode_Optimization%3ANeedleman-Wunsch mentions: "Needleman-Wunsch - A variation of the Levenshtein algorithm. Levenshtein and Needleman-Wunsch are identical except that character mistakes are given different weights depending on how far two characters are on a standard keyboard layout. For example: A to S is given a mistake weight of 0.4, while A to D is a 0.6 and A to P is a 1.0" but the Needleman-Wunsch Wikipedia article does not mention keyboard layout proximity. Should the article mention keyboard layout proximity ? --Jean-Marc Liotier (talk) 09:53, 7 September 2016 (UTC)Reply

The general algorithm allows for S as the (usually look-up table) comparing input data items. Keyboard distance is one metric that could be used to generate such table. In the case of protein sequences, there are biological statistical reasons related to how likely mutations are to occur. Gah4 (talk) 01:45, 20 June 2021 (UTC)Reply
In biology, this is described by the Point accepted mutation, and more usually through the BLOSUM matrix. Gah4 (talk) 01:47, 20 June 2021 (UTC)Reply
edit

Hello fellow Wikipedians,

I have just modified one external link on Needleman–Wunsch algorithm. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—InternetArchiveBot (Report bug) 09:29, 15 February 2018 (UTC)Reply

diff

edit

Note that this algorithm was later used for the Unix diff command. It used to be mentioned on the man page for diff, but it seems not anymore. It might be useful to say that, as that is where more people will actually use it. Gah4 (talk) 09:02, 18 June 2021 (UTC)Reply

Mistake/Typo/Redundancy in Basic Scoring Section

edit

In the "Basic Scoring" section, two alternate example scoring schemes are discussed... but they are in fact identical. I think the edit "15:19, 5 April 2023‎ 106.205.216.93 →‎Basic scoring schemes" introduced a mistake. Previous to this edit, the schemes were different. 136.38.95.121 (talk) 18:08, 10 May 2024 (UTC)Reply