Talk:Longest common subsequence
This article is rated Start-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||
|
|
|
Reduce strings to hashes
edit.. should mention that this method is not recommended for professional use cases. I don't want to be a passenger of a plane which runs software merged some times in the past from different branches and contains old code due to a hash collision... cheers 82.83.133.92 (talk) 20:18, 3 July 2015 (UTC)
- In order to prevent errors like this, a string searching algorithm would usually compare the substrings after a matching hash was found. The Rabin–Karp string search algorithm works in a similar way, using a rolling hash function to find substrings that are possibly identical. Jarble (talk) 01:29, 6 March 2024 (UTC)
Cleanup 2005-08-07
editSince I am activally using and contributing to this page (I am the one who posted the diff algorithm) to create a generic history tool I think Zer000 is right it lay out sucks.... for that reason I will undertake writting a new one once I start I will post a link to the draft page here for comment. (unsigned by Sarek1024, 2006-05-14)
- (I forgot to save this edit to the talk page earlier). I took the liberty of rewriting the page. Feel free to revert or change. I moved the old talk to Talk:Longest_common_subsequence_problem/Archive. -- Nils Grimsmo 05:54, 8 August 2006 (UTC)
- The quality control effort really screwed the page up.... first of not everyone knows python so it is better to offer the algorithm in traditional psedo-code... second back tracking is *NOT* needed so why include it (the rivest algorithm may not be the most efficent but it is the simplest to understand)... getting rid of the LCS matrix makes stuff more confusing and harder to debug for someone implementing it (I started editing the page when I first started playing with LCS about 6 months ago and found it to be an ideal reference [short on theory but long on practicle stuff])... also the diff algorithm should also be included since it illustrates a good real life example of using the LCS for something useful. (unsigned by Sarek1024, 2006-08-08)
- I will not be offended if you revert it. Some views:
- * I might agree to the argument about Python, as there is some syntax noise there (initialising the C matrix, for instance.)
- * The algorithms from CLRS01 does use backtracking. The "UP"/"LEFT" table? The algorithm PRINT-LCS from CLRS01 reads this, and backtracks. See also the section "Improving your code" on page 355.
- * PRINT-LCS from CLRS01 prints a random LCS if there are more than one. This could be a better approach from an educational view, though. But then it should be stated clearly in the article that you shold modify your approach if you want all the LCS.
- * I feel that the four step explanation in CLRS01 is too long for an encyclopedia. Theorem 15.1 Step 1 is enough for people to understand the problem. It might be better to write it as in CLRS01 (as it was here previously) though.
- I am sorry for the brutal cleanup. If more people feel like a revert, just go ahead.
- Nils Grimsmo 05:54, 8 August 2006 (UTC)
- I have now changed it a bit. Maybe you like it better.
- * Finding a single LCS is the default problem again.
- * The code is split into one function which finds a single LCS, and one which finds all. The former should be very easy to understand.
- * I have gotten rid of some odd Python specific syntax (initializing the
C
array). What is left is the syntax of the for-loops (range
) and some stuff inbt_all
. - * Added a diff print function.
- Nils Grimsmo 14:51, 8 August 2006 (UTC)
- I wrote a crude Python to pseudo-code translator. Do you people like this pseudo-code better than the Python variant then? Nils Grimsmo 07:38, 9 August 2006 (UTC)
- While I'm not too keen on the Python code, it's not horrible. I think that while Nils' LCS2 version may be more language-agnostic, it may be harder to read due to its additional verbosity. As it stands, the Python code is at least trim and concise. RickOsborne 14:54, 2 October 2006 (UTC)
- I tried to make an example and print the C matrix combined with the choices you make (which are the choices you have to redo when you backtrack). I am no HTML wiz, so I had a little trouble getting the output compact. I have uploaded a dump of it. Does anybody have an idea of how to make the source code more compact? -- Nils Grimsmo 18:05, 13 August 2006 (UTC)
Number of LCSs
editDoes anybody know a tight bound on the maximal number of longest common subsequences? (As in less than , ). If you take for example the strings XYXYXY...XY and YXYXYX...YX, there are LCSs, right? For XYZXYZ...XYZ and ZYXZYX...ZYX there are . We have that is maximal for . Is a bound? -- Nils Grimsmo 17:56, 13 August 2006 (UTC)
- This was just plain wrong. The strings XYZXYZ and ZYXZYX, have 10 LCSs, of length 3. And . But pretty please, doesn't anybody have a bound somewhere in a paper? -- Nils Grimsmo 07:42, 16 August 2006 (UTC)
- I believe the following will help: http://arxiv.org/abs/cs.DM/0301030 . This too: http://www.research.att.com/~njas/sequences/A094837 . Zerotalk 10:48, 16 August 2006 (UTC)
- Thanks a bunch! -- Nils Grimsmo 09:29, 17 August 2006 (UTC)
LCP?
editHi folks, I dunno if it's just an artifact of an earlier version or an artifact of my wayward brain, but is "LCP" a typo for LCS in the Python code and there abouts? Also, this Python doesn't run as is, right? The matrix C isn't initialized. Isn't the LCS algorithm understood to take two sequences as its input? --babbage 09:00, 27 October 2006 (UTC)
- You are right about the typo. Thanks for noticing! I added a comment in the code about the initialisation of the C matrix. The reason I did not do create the matrix inside in the function is that this is really ugly in python:
C = [[0] * (m+1) for i in xrange(n+1)]
- But maybe it should have been put there anyway. Klem fra Nils Grimsmo 07:18, 28 October 2006 (UTC)
Notation
editCurrently the notation is not consistent between the definition of the recurrence and the pseudo-code (my fault). Which should be used of X1..m and X[1..m]
? Or maybe something else? Or does it not matter that two different notations are used? Klem fra Nils Grimsmo 14:29, 10 December 2006 (UTC)
- And BTW, this should be probably be uniform across all similar string problems/algorithms which use DP solutions, such as edit distance, longest common substring, shortest common supersequence, etc.. Klem fra Nils Grimsmo 21:26, 11 December 2006 (UTC)
Relation to other problems
editFor two strings and , the Levenshtein distance (edit distance) is
I don't think this is true. For example, if we take the words RHYTHM and ALGORITHM, then so the Levenshtein distance should become , but it should be 6.
- You are right. Good thing you noticed! Klem fra Nils Grimsmo 15:03, 21 December 2006 (UTC)
Solution for two sequences
editI think the following equation is wrong.
The correct one should be
Isn't it? —The preceding unsigned comment was added by 140.115.213.121 (talk) 04:59, 2 February 2007 (UTC).
Sorry, the original one is correct. I misunderstood the equation. —140.115.213.121.
Needs an example
editI think the article could really do with an example demonstrating the LCS between two strings. I'm not sure I understand it well enough to add one, though... Tjwood 10:18, 5 March 2007 (UTC)
- I added an example now. Does it make things clearer? Klem fra Nils Grimsmo 08:25, 6 March 2007 (UTC)
- I strongly agree that there must be examples. I've tried explaining the LCS problem and solution to several people. Showing examples really makes it click in their minds. A custom page to automatically generate examples for any pair of strings was made. It generates the LCS Length chart and resulting LCS answer but I'm having trouble with people removing the link from the external links of the article. Does anyone know why people are removing this? Here is the address: LCS Page (Joshi1983 (talk) 04:29, 22 February 2010 (UTC))
A point about memory space
editThe worked example is very good, nice and visual. I notice it describes the use of a traceback system to save storing a big table of subsequences. In the languages I program in, the subsequences and even sets of subsequences would be objects, so in the untweaked version of the scheme (without the traceback) the individual cells of the table would be single pointers only, many of them to the same objects. (For every cell that inherits its set of longest sequences only from the left or only from above, no new object would need to be created at all. That is, any of the mismatch cells where the cell above or left also have different-lengthed longest subsequences to pass on. I think that would be the majority of cells when you get into comparing large and contrasting sequences. In the case of a mismatch cell which inherits from left and above, no new subsequence objects are generated, only two lists of them combined.) So does this discount the value of the traceback approach, and should it be mentioned?
Is the Identity Integers reasoning wrong?
editThe only advantage I can see identity integers having over hash functions is that they don't have collisions, at the cost of taking longer to calculate. The article currently makes it sound like they have more advantages.. eg:
- In source code, there will often be lines that are exactly the same. This could include an empty line between two functions, or a trailing brace, or closing comment, and so on. The algorithm does not need to distinguish between specific instances of these similar lines. That is, one blank line is the same as any other blank line. Therefore, the previous optimization can be taken one step further by reducing each line to an identity integer.
Yet any hash function will yield the same hash value for two lines that are the same, else it's not a hash function. So the above point seems fairly moot..
- More importantly, each item is reduced from a string to a unique integer, meaning that each comparison is now an integer comparison.
As above.
Identity integers also lose this property:
- Additionally, the randomized nature of hashes and checksums would guarantee that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.
- An "identity integer" is nothing more than a hash sequence that is the same size as a machine integer (and can therefore be stored as an integer). The advantages of both are the same, and they both have the same unstated disadvantage: The block of octets that are hashed together become the editing unit, instead of bytes. This means that if you replace lines with hashes, your diff tool ends up being line-oriented, and thus unsuitable for use on binary files. I'll be removing the mention of "identity integers" tonight, because it's redundant. 71.72.235.91 (talk) 03:13, 19 May 2009 (UTC)
- "randomized nature" leads me to believe they're considering a hash stored as a sequence of bytes to be compared one-by-one, rather than a sequence of integers which can be compared more efficiently. What it's saying is that without a hash you typically need to look through many characters to test equality. With a "randomized nature" hash the comparison is faster, typically just one or two bytes/characters of the hash need to be examined, or perhaps just one integer if the hash is encoded as a sequence of integers. It's not really guaranteed to short-circuit though since there may be collisions and for certainty you should fall back to string comparison if the hashes are equal. With identity integers, "randomized nature" is not required since equality is guaranteed to be determined with a single integer comparison, making it most efficient, although in practice the advantage over a hash using integer comparison may be negligible.
Atomota (talk) 08:12, 22 November 2007 (UTC)
- "randomized nature" leads me to believe they're considering a hash stored as a sequence of bytes to be compared one-by-one, rather than a sequence of integers which can be compared more efficiently. What it's saying is that without a hash you typically need to look through many characters to test equality. With a "randomized nature" hash the comparison is faster, typically just one or two bytes/characters of the hash need to be examined, or perhaps just one integer if the hash is encoded as a sequence of integers. It's not really guaranteed to short-circuit though since there may be collisions and for certainty you should fall back to string comparison if the hashes are equal. With identity integers, "randomized nature" is not required since equality is guaranteed to be determined with a single integer comparison, making it most efficient, although in practice the advantage over a hash using integer comparison may be negligible.
So unless I'm misunderstanding something, should the article be changed? (I don't fully understand how the function works yet, so in the case that I'm wrong I'd rather not change/make a mess of things =P, and also have never worked in perl/c++)
Themania 05:20, 20 April 2007 (UTC)
Also I forgot to mention the largest advantage of hash values - you can calculate them just once for a file, and then compare that file against other files. Very useful!
Themania 05:24, 20 April 2007 (UTC)
In traceback since R is {} , we can omit it in the first if .
Define Variable
editUnder "Complexity" is the following equation:
L is not previously defined. But, more to the point, I've had calculus II, and I can't make sense of the upper limit of that summation. Perhaps it could be put in words, as well, for us mere mortals.--Christopher King (talk) 21:36, 10 January 2009 (UTC)
- I don't understand that formula either: not so much what it means, but why the "naive algorithm" has been so heavily pessimized as to make this the correct time analysis for the algorithm. I replaced it by something simpler and (I hope) easier to understand. —David Eppstein (talk) 23:40, 10 January 2009 (UTC)
Move Code Examples?
editSomeone added the note on the right to the start of the computer programming examples:
Should the code examples and their discussion be moved to that article? That article has no introduction, just code, so the material in this article could serve as an introduction to that article.--Christopher King (talk) 23:23, 2 February 2009 (UTC)
Better Examples
editWhy don't the examples of "longest common sequence" actually have a common sequence?
How about:
MYDOG YOURDOG! —Preceding unsigned comment added by 58.106.110.238 (talk) 08:49, 10 June 2009 (UTC)
Reading out an LCS pseudo code is wrong
editThe pseudo code in the "Reading out an LCS" section doesn't work.
function backTrace(C[0..m,0..n], X[1..m], Y[1..n], i, j) if i = 0 or j = 0 return "" else if X[i-1] = Y[j-1] return backTrace(C, X, Y, i-1, j-1) + X[i-1] else if C[i,j-1] > C[i-1,j] return backTrace(C, X, Y, i, j-1) else return backTrace(C, X, Y, i-1, j)
Notice what happens when you take i and j to be 1,1. You get to the else if clause and compare X[0] to Y[0]. But the function definition shows X and Y to have ranges of 1..m/1..n, so the 0 is out of range.
Mbcook (talk) 18:38, 12 June 2009 (UTC)
- Not really. The C array is defined from 0..m/0..n, with the assumption that row/column 0 are equal to null. The same assumption is made for X and Y, and if you assume X[0] = Y[0] = NULL, then it should work just fine. 8/26/2008
Second property description is incorrect.
editThe statement The LCS of X and Y are the longest sequences contained in LCS(Xm – 1, Y) and LCS(X, Yn – 1) is not correct. The correct statement of the second property is The LCS of X and Y are the longest sequences contained in LCS(Xm – 1, Y) or LCS(X, Yn – 1). The statement that LCS(Xm – 1, Y) and LCS(X, Yn – 1) will both produce an LCS, as the example suggests, is false.
Take for example X = AGT and Y = ATC. LCS(Xm – 1, Y) = A and LCS(X, Yn – 1) = AT. Clearly AT is the longest common subsequence, not A. Thus, LCS(X, Y) = the longest sequences of LCS(Xm – 1, Y) or LCS(X, Yn – 1).
The current example of the second property is at best misleading.
Shannon Pattison (talk) 20:31, 19 August 2009 (UTC)notpattison
LCS function defined
editIs the Equation really right? If in the third Case both Sequences have the same size? What did the max-function returns then? — Preceding unsigned comment added by 85.180.137.191 (talk) 07:51, 28 August 2011 (UTC)
Multiple String (> 2) LCS
editI think mentioning the LCS (aka MLCS or k-LCS) on more than two strings (NP-hard) is missing in this article. Jens (talk) 10:34, 4 April 2012 (UTC)
- I agree with you.. 163.5.23.68 (talk) 17:53, 11 April 2024 (UTC)
Substring vs. subsequence doesn't exist
editThe link in the intro links to http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence but no such header there, and not really any good examples of the difference. 129.241.124.103 (talk) 21:09, 10 July 2013 (UTC)
- The section was removed from the target article. I've replaced the link with a description of the difference. --Joshua Issac (talk) 08:44, 13 August 2013 (UTC)
Why the callenge to add references?
editWhy has someone added a requirement for reference in the code sequence? The whole sequence is a paraphrase (very useful!) of what is explained above, and can be desk checked by any competent programmer. "unsourced material can be challenged and removed". This is not a controversial statement like the "sons of Nixons are involved in the My Lai massacre" which requires the backing of "reliable sources". It is more like "on a clear day, from the bell tower of the Notre Dame you can see the Effel tower". — Preceding unsigned comment added by 80.100.243.19 (talk) 14:20, 13 April 2014 (UTC)
Method of Four Russians
edit"For problems with a bounded alphabet size, the Method of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor"
I don't see how the Method of Four Russians is applicable here to this problem. Method of Four Russians requires the contents of the matrix to come from a bounded alphabet, where as the contents of the matrix here are unbounded integers (even if the two input strings come from a bounded alphabet). — Preceding unsigned comment added by 71.198.24.3 (talk) 06:33, 21 January 2019 (UTC)
- The inputs at the top and left sides of each log-by-log submatrix are two sequences: the input (bounded alphabet) and the deltas of subproblem LCS length from one cell to the next (bounded alphabet {0,+1}). The outputs are the new deltas (again, bounded alphabet {0,+1}). —David Eppstein (talk) 07:43, 21 January 2019 (UTC)
Missing references
editThere is a paper by Ukkonen[1] and a paper by Myers[2] that present efficient algorithms for bounded D = N - L. Maybe the reason they are missing from this page is that they are primarily phrased as being about edit distance where deletion and insertion cost 1 and substitution is not allowed (or costs 2 if you prefer, so slightly different than Levenshtein distance). However, it's clear that those algorithm also apply to computing the LCS, and it's even explicitly stated in at least one of the papers.
The paper by Myers exposes some interesting results:
- Algorithm runs in O(ND) time and O(N+D^2) space.
- Algorithm runs in O(N + D^2) time on "random" inputs (where random means character-character matches outside the LCS occur with constant probability, that is, independent of N or L).
- Algorithm runs in O(N) space if using Hirschberg's scheme.
- Algorithm runs in O(N log N + D^2) time if using suffix trees.
The paper by Ukkonen looks like a subset of what is in Myers paper (point 1) but the result applies to a more general setting (or at least is written in such a way that it looks like the result applies to a more general setting).
The paper by Myers is often cited in the context of file diffing algorithms.
Aureooms (talk) 15:45, 26 May 2021 (UTC)
References
- ^ Esko Ukkonen (1985). "Algorithms for Approximate String Matching"
- ^ Eugene Myers (1986). "An O(ND) Difference Algorithm and Its Variations"
Prefixes X_1,2,...,m ?
editIt is written: "The prefixes of X are ...", which uses an undefined notation. I assume the author meant "... ..." instead, do you agree? — MFH:Talk 01:22, 27 October 2021 (UTC)
- Done I agree, but I guess the sequences should start at 0. I changed the text accorgingly. - Jochen Burghardt (talk) 09:58, 27 October 2021 (UTC)
LCS variants
editI didn't see any of the LCS variants elsewhere on Wikipedia and wondered if this article would be a good place to add a section about them. Three variants I'm aware of are:
- Constrained Longest Common Subsequence (CLCS)
- Doubly-Constrained Longest Common Subsequence (DC-LCS)
- Repetition-Free Longest Common Subsequence (RF-LCS)
This paper has an overview of the three above:
CLCS in particular is described here with an O(n^2 · m^2 · r) solution:
And a more recent CLCS solution in O(n · m · r) is described here:
Thanks!
Delta encoding and data differencing
editThis section doesn't mention data differencing or delta encoding. Isn't the longest common subsequence also related to this method of compression? Jarble (talk) 22:53, 5 March 2024 (UTC)