Talk:Levenshtein distance
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to several WikiProjects.
Time Complexity
What is the time complexity for calculating Levenshtein distance? For example, is it O(n*m) where n and m are the lengths of the two strings being compared? I think this information would be very helpful to include for any implementations described in the article. --Showeropera (talk) 13:32, 14 January 2016 (UTC)
- The Wagner–Fischer algorithm takes O(n*m) in the worst case, which is O(n²) assuming that m = Θ(n). This follows trivially from the structure of the algorithm; count the number of times the loops run, realize that each iteration performs O(1) work, and multiply. QVVERTYVS (hm?) 10:12, 18 January 2016 (UTC)
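To make the counting argument concrete, here is a sketch (Haskell with Data.Array, chosen to match the other examples on this page; an illustration, not code from the article). The table has exactly (m+1)*(n+1) cells and each cell inspects a constant number of neighbours, so the whole fill takes O(m*n) time:

import Data.Array

levenshtein :: String -> String -> Int
levenshtein s t = table ! (m, n)
  where
    m = length s
    n = length t
    sa = listArray (1, m) s
    ta = listArray (1, n) t
    -- (m+1)*(n+1) cells, O(1) work each; laziness fills them in dependency order
    table = array ((0, 0), (m, n)) [((i, j), cell i j) | i <- [0 .. m], j <- [0 .. n]]
    cell 0 j = j   -- first row: j insertions
    cell i 0 = i   -- first column: i deletions
    cell i j = minimum
      [ table ! (i - 1, j) + 1                                       -- deletion
      , table ! (i, j - 1) + 1                                       -- insertion
      , table ! (i - 1, j - 1) + (if sa ! i == ta ! j then 0 else 1) -- substitution or match
      ]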
Incorrect Calculation of 'cost'
"if s[i-1] = t[j-1]" should have been "if s[i] = t[j]", since s and t are declared as "char s[1..m], char t[1..n]", and the loops go from 1 to m and 1 to n. i-1 and j-1 are needed in implementations in other languages where string character indexing starts at 0 (e.g. C). Fixed. --Anon
- Someone at 68.99.153.115 put the "-1"s back in. We should keep an eye on this to make sure they don't do it again. Fixed, again. Adam Butt (talk) 17:01, 26 February 2009 (UTC)
I don't get where the following came from:
int cost = (s[i-1] == t[j-1]) ? 0 : 1;
d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+cost )
...in the pseudo-code, it is suggested that the code is actually...
if( s[i-1] == t[j-1] ) d[i,j] = d[i-1,j-1];
else d[i,j] = min( d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+1 )
I tried the latter and it works, so why do all of the other implementations I've seen use the former code??? -- Lee. — Preceding unsigned comment added by 31.53.116.129 (talk) 17:34, 25 November 2011 (UTC)
- Appeared where? There is no such code currently in the article. -- X7q (talk) 19:34, 25 November 2011 (UTC)
- There isn't any in the actual article, but I've seen this as the implementation everywhere I've looked, as well as within this discussion page. -- Lee. — Preceding unsigned comment added by 86.150.194.119 (talk) 21:42, 25 November 2011 (UTC)
- Both codes work. Essentially the difference between them is only that the latter code incorporates the fact that the edit distance between two strings doesn't change if you drop their common suffixes/prefixes. So you don't need to check all possibilities in this case to compute the minimum; it is always optimal to just drop the common suffix. -- X7q (talk) 22:45, 25 November 2011 (UTC)
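To make the equivalence concrete, here is a sketch of both recurrences (my paraphrase in Haskell, not code from the article; both versions are naive and exponential, so the QuickCheck test deliberately keeps the inputs tiny):

import Test.QuickCheck

-- The "former" code: always branch three ways, with a 0/1 cost on the diagonal.
levA :: String -> String -> Int
levA s [] = length s
levA [] t = length t
levA s@(x:xs) t@(y:ys) = minimum
  [ levA xs t + 1                          -- deletion
  , levA s ys + 1                          -- insertion
  , levA xs ys + (if x == y then 0 else 1) -- substitution or match
  ]

-- The "latter" code: matching head characters are simply dropped.
levB :: String -> String -> Int
levB s [] = length s
levB [] t = length t
levB s@(x:xs) t@(y:ys)
  | x == y    = levB xs ys
  | otherwise = 1 + minimum [levB xs t, levB s ys, levB xs ys]

main :: IO ()
main = quickCheck $
  forAll (resize 6 arbitrary) $ \s ->
  forAll (resize 6 arbitrary) $ \t ->
    levA s t == levB s t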
Incorrect Date
Levenshtein's paper was published in 1966, not 1965. cf. V. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics – Doklady 10, pp. 707-710, 1966. ralian 03:15, 11 April 2007 (UTC)
- The Russian version at http://www.mathnet.ru/links/e8cd05ac34a785493c702e3ed733e49a/dan31411.pdf lists 1965. — Preceding unsigned comment added by 62.225.68.26 (talk) 11:45, 5 June 2018 (UTC)
Mention Competing Utilities
This page should link to competing utilities such as wdiff, since the Levenshtein distance and diff, wdiff, sdiff, etc. all address the same set of problems (i.e., what is the difference between two strings/files).
Inconsistent Pseudocode
Why in the pseudocode do int arrays start at index 0 and char arrays start at index 1?
- Because it simplifies the algorithm. Anything else would require confusing offsets to be inserted for one array or the other. If you read the correctness section you can see why this numbering fits with the invariants. Deco 02:15, 19 Feb 2005 (UTC)
Not really. It's confusing and inconsistent. Anyone just converting the pseudocode into real code (like I was doing earlier) will end up with a program that generates array out-of-bounds errors. It was easy enough to fix when I looked at what it was actually doing, but it is still terribly annoying. The offsets would be less confusing than having arrays start at different indexes.
I agree that the array indexing is confusing. Adding offsets doesn't change the algorithm, it's just another way to show it. How it is at the moment is not especially useful given how most languages lack features as nice as "declare int d[0..lenStr1, 0..lenStr2]"; "declare int d[size1,size2]" is much more common. More importantly, though, you can express code using the first type with the second, but not the other way around, which makes it more practical for pseudocode. Jheriko 05:41, 8 March 2006 (UTC)
- The purpose of pseudocode is not to make the algorithm easily implementable in your-limited-language-of-choice (if it was, we would write algorithms in it, not pseudocode.) It is to give the clearest explanation of the algorithm, and conceptually, this algorithm indexes strings from one and the distance matrix from 0. As stated in the invariants section, this is because D[0][0] means the distance from the zero-length string to the zero-length string (and the algorithm needs this) and string[..0] must be the zero-length string, not a length-1 string. I've just implemented this in my-limited-language-of-choice, and I thought about it this way (without looking at this page -- I came here to check myself), although of course I didn't implement it that way. On a lighter note, this reminded me of XKCD #163 --Taejo|대조 08:15, 7 September 2007 (UTC)
I agree it may make sense for the pseudocode to be indexed this way, but could we at least put a comment to indicate it is the ith or jth character? The array syntax implies it is an index, not the character number. James Van Boxtel (talk) 20:12, 28 May 2010 (UTC)
This pseudocode is garbage and needs to be cleaned up — Preceding unsigned comment added by 24.8.146.185 (talk) 06:53, 26 January 2013 (UTC)
Haskell Example
Complicated Haskell
Why is the Haskell code so complicated? Isn't this enough?
editDistance :: Eq a => [a] -> [a] -> Int
editDistance s [] = length s
editDistance [] t = length t
editDistance (s:ss) (t:ts) = minimum
  [ (if s == t then 0 else 1) + editDistance ss ts
  , 1 + editDistance ss (t:ts)
  , 1 + editDistance (s:ss) ts
  ]
Okay, I replaced the original code below with the shorter code above. Jirka
min3 :: Int -> Int -> Int -> Int
min3 x y z = min x (min y z)

cost :: Char -> Char -> Int
cost a b | a == b    = 0
         | otherwise = 1

sumCosts :: String -> Char -> Int
sumCosts [] a = 0
sumCosts (s:ss) a = cost s a + sumCosts ss a

editDistance :: String -> String -> Int
editDistance [] [] = 0
editDistance s  [] = sumCosts s '-'
editDistance [] t  = sumCosts t '-'
editDistance (s:ss) (t:ts) = min3 (cost s t + editDistance ss ts)
                                  (cost s '-' + editDistance ss (t:ts))
                                  (cost '-' t + editDistance (s:ss) ts)
On memoization
Also... I really don't think Haskell memoizes like this page claims it does. That would either take tons of space or one crafty LRU cache.... Could we get a reference for that, outside of Wikipedia?
--Daren Brantley 04:15, 16 July 2005 (UTC)
- Is that really true? As far as I know Haskell is a lazy language; lazy means that it calls functions by need, which is similar to call by name. The difference is that call by need has a memoization mechanism which prevents multiple evaluation of function applications passed as arguments.
- I can only speak to GHC Haskell; other implementations may work slightly differently. When you call a function twice with the same arguments, e.g. f x + f x, Haskell will evaluate the function at that value twice. Compare that to let v = f x in v + v, where v is created as a reference to a function call. The function is called on the first evaluation of v, and once it is evaluated, subsequent references to it will not need to call the function again. So the first example evaluates f x twice, while the second only evaluates it once. AquitaneHungerForce (talk) 12:27, 29 September 2023 (UTC)
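A small demonstration of the sharing behaviour described above, assuming GHC with no optimisation flags (with -O the compiler is free to merge the two identical calls):

import Debug.Trace (trace)

f :: Int -> Int
f x = trace "f evaluated" (x * 2) -- prints a line each time the body is actually forced

main :: IO ()
main = do
  print (f 3 + f 3)             -- "f evaluated" is printed twice: two separate calls
  print (let v = f 3 in v + v)  -- printed once: v is shared after its first evaluation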
If you want to verify Haskell's memoization features, visit the Dynamic programming article where the claim is also made. I'm sure somebody could give you a reference to the Haskell standard. --132.198.104.164 21:38, 18 July 2005 (UTC)
- On the other hand, I wrote both these statements, so I may just be wrong. Deco 05:14, 20 August 2005 (UTC)
- Dynamic programming does not mention Haskell at all!
I think the current Haskell implementation is wrong. The Levenshtein algorithm should use no more than O(n²) time, and this is exponential. It's possible to write a memoizing implementation, but Haskell doesn't do it automatically. (The Ruby one is wrong, too, and probably some others.)
An O(n²) version:
import Data.Array (array, (!))

distance str1 str2 =
  let (l1, l2) = (length str1, length str2)
      istr1 = (0, undefined) : zip [1..] str1
      istr2 = (0, undefined) : zip [1..] str2
      table = array ((0, 0), (l1, l2))
                [((i, j), value i c1 j c2) | (i, c1) <- istr1, (j, c2) <- istr2]
      value 0 _ j _ = j
      value i _ 0 _ = i
      value i c1 j c2 = minimum [ table ! (i-1, j) + 1
                                , table ! (i, j-1) + 1
                                , table ! (i-1, j-1) + cost ]
        where cost = if c1 == c2 then 0 else 1
  in table ! (l1, l2)
The following version isn't O(n²), because // copies arrays and !! is linear in the element number -- using lazy evaluation as above is the key for solving that:
import Data.Array (array, elems, (!), (//))

distance str1 str2 =
  last $ elems $ foldl update table [ (i, j) | i <- [1 .. length str1]
                                             , j <- [1 .. length str2] ]
  where
    table = initial (length str1, length str2)
    update table (i, j) = table // [((i, j), value)]
      where
        value = minimum [ table ! (i-1, j)   + 1    -- deletion
                        , table ! (i, j-1)   + 1    -- insertion
                        , table ! (i-1, j-1) + cost -- substitution
                        ]
        cost = if str1 !! (i-1) == str2 !! (j-1) then 0 else 1
    initial (b1, b2) = array ((0, 0), (b1, b2))
                         [((i, j), value (i, j)) | i <- [0 .. b1], j <- [0 .. b2]]
      where
        value (0, j) = j
        value (i, 0) = i
        value (i, j) = 0
An O(n) in space, faster, stricter, tail recursive version
Here is an O(n) version, using iteration over an initial row. This is much faster (with GHC; I have not tried others) since
- it is O(n) in space,
- laziness left behind by GHC's strictness analysis is removed at one point using seq,
- it is tail recursive in the outer iterations
distance :: String -> String -> Int
distance s1 s2 = iter s1 s2 [0 .. length s2]
  where
    iter (c:cs) s2 row@(e:es) = iter cs s2 (e' : rest e' c s2 row)
      where e' = e + 1
    iter [] _ row = last row
    iter _ _ _ = error "iter (distance): unexpected arguments"
    rest e c (c2:c2s) (e1:es@(e2:es')) = seq k (k : rest k c c2s es)
      where k = min (e1 + if c == c2 then 0 else 1) $ min (e+1) (e2+1)
    rest _ _ [] _ = []
    rest _ _ _ _ = error "rest (distance): unexpected arguments"
Levenshtein's nationality/ethnicity
I don't see any particular reason to point out his ethnicity/religion in this article. If you want to put it in his article, be my guest -- but please provide documentation of some sort.--SarekOfVulcan 21:22, 3 November 2005 (UTC)
- I believe the point of the edit was to indicate that Levenshtein was not an ethnic Russian, just a Jewish guy who lived in Russia. As you suggest, I think such fine points of Levenshtein's nationality are best reserved for his own article, if anyone can drag up enough info to write it. Deco 00:23, 4 November 2005 (UTC)
- I dragged up enough on the web to stub it.--SarekOfVulcan 00:41, 4 November 2005 (UTC)
- It might help when learning to pronounce the name of the algorithm. -- Mikeblas 08:13, 3 January 2006 (UTC)
Minimality
I can transform "kitten" into "sitting" in two steps:
- kitten
- (delete "kitten")
- sitting (insert "sitting" at end)
Can someone explain why this is not acceptable? Are we talking single-character edits here? --Doradus 17:48, 16 January 2006 (UTC)
- I went ahead and added "of a single character" to the intro sentence. Please feel free to revert if I have this wrong. --Doradus 17:51, 16 January 2006 (UTC)
- I guess it doesn't hurt to be pedantic. Deco 18:20, 16 January 2006 (UTC)
- Heh... Well, if you use "diff" you wind up with a list of edits which are most definitely not single-character edits. Plus, I think I'm a fairly reasonable person, and I didn't realize it was single-character edits until I read the algorithm. --Doradus 23:46, 16 January 2006 (UTC)
- Sorry for the confusion, I hope it's clearer now. :-) Deco 00:15, 17 January 2006 (UTC)
- Try to do this with two strings of 40 billion characters. Or do it with your examples, but 99 billion times in a minute. 92.81.167.231 (talk) 23:48, 31 January 2010 (UTC)
Implementations
I don't believe the implementations are relevant to the article. Some of them are even wrong, and they certainly aren't more illustrative than the pseudocode. Currently there are 18 implementations: 2 for C++, C#, 2 for Lisp, Haskell, Java, Python, Ruby, 2 for Scheme, Prolog, VB.NET, Visual FoxPro, ActionScript, Perl, OCaml and Lua. I vote for removing all of them from the article; if anyone finds these useful, just create a page (like www.99-bottles-of-beer.net) with a sample implementation for each language under the sun, and add a link to it. DanielKO 00:21, 12 July 2006 (UTC)
- This has a way of happening on wikis. This is why I started a separate wiki for source code. Moving them there is not an option though due to license conflicts - GFDL-licensed code just isn't very useful. Wikisource no longer accepts source code. I suggest we eradicate them all. Deco 00:37, 12 July 2006 (UTC)
Certainly. Get rid of all those implementations and just leave the pseudocode. I truly don't see the point of all this. Zyxoas (talk to me - I'll listen) 16:49, 12 July 2006 (UTC)
I just removed them all. Maybe the Implementations section should be rewritten. DanielKO 23:25, 12 July 2006 (UTC)
Potential license compatibility downstream is no reason to delete material, even if the material is source code. That's like saying there should be no programming books at WikiBooks released under the GFDL.
If there's a wider proposal to remove all source code from WikiPedia (and hopefully come up with a sister project to move it to) then I'll accept deleting the implementations. If the quality of implementations are lacking, this can hardly be relevant to their deletion, because even the pseudocode was incorrect at one point.
The implementations provide useful, working examples of the pseudocode for readers. --71.192.61.193 01:54, 13 July 2006 (UTC)
- I didn't say license compatibility was a problem. You totally misinterpreted me. I think source code at Wikipedia is totally okay. I would not have removed all of the implementations, maybe just all but 2 or 3. Deco 02:11, 13 July 2006 (UTC)
- I don't think that it's reasonable to have that many implementations. The page is "unprintable", because there's too much irrelevant stuff on it (who needs a Prolog implementation anyway?). But I agree that we should be consistent, and choose just n (for small values of n) languages to be used on all algorithm articles; everything else should not be in an encyclopedia. So indeed, better let's use this article as an extreme example of how bad the situation may become, and propose a sister project for algorithms. --DanielKO 22:49, 14 July 2006 (UTC)
Well even though it's said here that the implementations were removed, someone must have put them back. I went ahead and removed the Ruby implementation because it didn't work.
I agree that the page is long and unprintable. What if we moved these to separate pages? Something like Levenshtein distance (C++)? Is that poor style, since most parentheticals in an encyclopedia are standardized to things like "album", "film", "computer" and not flavors of programming language? --71.254.12.146 00:55, 17 July 2006 (UTC)
- That's certainly not reasonable; but maybe a Levenshtein distance (implementations). I think we should try asking this in other algorithm articles saturated with implementations and see if together we can create some basic guidelines. I would suggest that if an article has more than 3 implementations, they should be moved from the main article. What do you think? --DanielKO 08:57, 17 July 2006 (UTC)
That seems reasonable. We *should* try to standardize and therefore get the input from other articles and their editors. --69.54.29.23 15:01, 17 July 2006 (UTC)
Didn't there used to be more history on this page? What's with all the implementations? This article is just silly. ONE is enough. The technicalities of the Haskell code are so irrelevant. 65.100.248.229 01:43, 27 July 2006 (UTC)
The number of implementations is a known problem. What do you think of the proposal?
I think the technicalities of the Haskell code are extremely relevant, but the Haskell implementation is disputed.
I don't see any sign of the page's history material removed. --72.92.129.85 03:20, 27 July 2006 (UTC)
I'd like to remove the majority of implementations from this page. Visual FoxPro? Is it really necessary? From a good encyclopedia I would expect pseudocode and a supporting implementation in a well known language such as C. Can anyone cite me a good standard algorithms text that has implementations in so many languages? If no one has any objection I will remove most of the implementations tomorrow. I am also against an article called Levenshtein distance (implementations); if such an article were to exist in an encyclopedia then its purpose would be to describe existing implementations, not to provide new ones. It is my understanding that in an encyclopedia we should be documenting existing entities, not providing new implementations or producing original research. New299 13:09, 18 September 2006 (UTC)
I've cleaned out many of the major offenders that were either obscure or just hideous (long lines, comment headers). It would be nice to have a C version for sentimentality, but C++ usually subsumes programming examples in most textbooks these days. Should we keep the VisualBasic implementation? --71.169.128.40 23:28, 18 September 2006 (UTC)
There was this deletion: "(23:59, 19 September 2006) Simetrical (→Implementations - Wikipedia is not a code repository. It's especially not a code repository for that not-explicitly-GFDLed Python script.)". I think it is OK not to store code here. But as code may be extremely useful for those who want to implement the algorithm the links should be kept. I remember there was a link to code for python and perl. These links should be restored! --148.6.178.137 07:54, 20 September 2006 (UTC)
I don't think the article now stands as a "mere collection of ... source code" (quoted from WP:NOT). It probably used to. The python script should be just a link anyway, regardless of copyright issues. I've done that and removed the hideous VisualBasic source code since no one made a motion to oppose it. We'll see if they do now. Currently, the page prints in under 10 pages on my system. --69.54.29.23 12:41, 20 September 2006 (UTC)
Seems to me that Visual FoxPro is as valid as the next language: that's why I keep restoring it. If we want to drop the article down to one or two reference implementations (C++, probably), I don't have an issue with it. Since the Haskell implementation looks nothing like the others, I'd vote for keeping that one as well, regardless of how widely-used it is. --SarekOfVulcan 21:09, 9 October 2006 (UTC)
- I would allow PL/SQL, Pascal or even COBOL before I had VFP. But none of those are there. Not just because of its market adoption (read "popularity") or lack thereof, but simply because it's not pedagogical or contributing anything except bytes to the article. --71.169.130.172 02:43, 10 October 2006 (UTC)
- COBOL I'd buy for historical interest. :-) I don't see how Pascal is more pedagogical than VFP, though.--SarekOfVulcan 22:21, 10 October 2006 (UTC)
The size of the rendered page is about 44kB. If it gets too large, the VFP implementation should probably be first to go. --71.161.222.65 22:21, 23 October 2006 (UTC)
I think this article should include as many implementations as possible. I came here via a Google search for 'edit distance' (couldn't recall the spelling of Levenshtein) looking for a Ruby implementation. Wikipedia ideally is a collection of all knowledge, regardless of whether the concept manifests itself as prose or code. Clearly the different implementations are conceptually different because of the facilities the language provides, and should all be included.
- That's actually a common misconception. Wikipedia is not intended as a collection of all knowledge. It is intended as an encyclopedia, which is one form of collection of knowledge. For instance, Wikipedia does not include primary source material (for which see Wikisource), even though primary source material is clearly knowledge.
- It is inappropriate to include a large number of implementations here, because it is redundant and distracting, and each additional one is unlikely to provide much marginal benefit. It is better, if we need any at all, to prefer one or two in languages which are both (1) reasonably common, so readers are more likely to have seen them; and (2) easy to read for this algorithm, that is, not requiring a lot of hacks or bookkeeping. --FOo 07:35, 15 November 2006 (UTC)
- According to Wikipedia:Algorithms_on_Wikipedia#Code_samples:
- "...avoid writing sample code unless it contributes significantly to a fundamental understanding of the encyclopedic content"
- See also the rest of the arguments there. I feel that this is an algorithm where the implementations contribute very little to the understanding. The implementations should be moved somewhere under wikibooks:Algorithm_implementation. See wikibooks:Algorithm_implementation/Sorting/Quicksort for an excellent example. Klem fra Nils Grimsmo 08:43, 8 December 2006 (UTC)
Rather than adding more, can we link to an existing page with more solutions? I agree with @Dcoetzee that it was confusing to catch the change of index from 0 to 1. Here's the page I eventually found that helped me solve my problem: wikibooks:Algorithm_Implementation/Strings/Levenshtein_distance -User:SomeAnonymousContributor
Haskell
I removed the Haskell example, for two reasons:
- it was the only instance of Haskell code on Wikipedia outside the Haskell article itself
- it relied on compiler support for memoization, which is "not guaranteed" according to the accompanying text.
If anyone knows of a language, not terribly obscure, in which memoization is guaranteed by the standard, please add an implementation in such a language. I think the clarity of such an implementation would be a great addition to the article.
For more information on my recent edits to example-code articles, and proposal to get rid of a lot of relatively esoteric example code along the lines that other editors have suggested in this section, see Wikipedia talk:WikiProject Programming languages#Category:Articles_with_example_code_proposal_and_call_for_volunteers. --Quuxplusone 01:34, 4 December 2006 (UTC)
Common Lisp version broken
Using the Common Lisp implementation described in the article, I get notably wrong results:
[11]> (levenshtein-distance "cow" "cow") 3
Whoever added this should probably fix it. The Levenshtein distance from any string to itself should be zero. --FOo 01:11, 28 August 2006 (UTC)
Thanks for reporting this. I've deleted it until it can be fixed. It wasn't a very elegant implementation anyway, combined with it having been translated from the Python implementation. I'm sure somebody will be able to contribute something more useful in the future. --71.161.221.31 03:55, 28 August 2006 (UTC)
Dynamic Time Warping
Can someone please explain the difference between DTW and Levenshtein distance?
The algorithms look almost identical. Perhaps a translation of both into C would clarify this?
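Not quite the requested C, but in that spirit, here is a rough Haskell sketch of DTW over numeric sequences (Haskell chosen to match the other examples on this page). The structural difference is in the cell rule: Levenshtein charges 1 per insert/delete step and 0 or 1 on the diagonal, whereas DTW charges a local mismatch cost in every cell and makes the moves themselves free, so one element may align with many and neither sequence can simply be shortened:

import Data.Array

-- Levenshtein cell: minimum [up + 1, left + 1, diag + (0 or 1)]
-- DTW cell:         |x_i - y_j| + minimum [up, left, diag]
dtw :: [Double] -> [Double] -> Double
dtw xs ys = table ! (m, n)
  where
    m = length xs
    n = length ys
    xa = listArray (1, m) xs
    ya = listArray (1, n) ys
    inf = 1 / 0
    table = array ((0, 0), (m, n)) [((i, j), cell i j) | i <- [0 .. m], j <- [0 .. n]]
    cell 0 0 = 0
    cell _ 0 = inf -- unlike Levenshtein, there is no way to "delete"
    cell 0 _ = inf -- your way out of either sequence
    cell i j = abs (xa ! i - ya ! j)
             + minimum [table ! (i - 1, j), table ! (i, j - 1), table ! (i - 1, j - 1)]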
RE: Implementations.
I removed the implementations and replaced them with a link, even though we already had one. It seems they were being ignored anyway. I didn't even look at the source code to see how good the implementations were... I'm not sure I care. I'm not against people sharing their implementations, even multiple implementations per language, especially if one happens to use O(m) space instead of O(mn), or if there is some advantage to a particular implementation over another. If two implementations in the article were compared and contrasted, with benefits and disadvantages for both, that might be encyclopedic and useful for a more in-depth investigation of the calculation of this metric. But all I saw was just code. And I'm not even sure it was good code at that. Root4(one) 03:17, 28 September 2007 (UTC)
Confusing wording?
In the upper and lower bounds section, point 5 seems to be merely re-stating point 1. I have taken the liberty of removing point 5, as it is the more complicated of the two. If this is wrong, feel free to correct me. --Gnack (talk) 05:21, 7 July 2008 (UTC)
Levenshtein automaton as Finite State Machine approach
Could someone please extend the Levenshtein automaton article, as another possible implementation of this string metric? The current stub version is next to nothing. For the particular case when we do not need a transition graph (a memory-consuming n*m sized matrix) but only a scalar result, the FSM approach looks more appealing than the dynamic-programming one. —Preceding unsigned comment added by 91.78.191.197 (talk) 13:17, 28 July 2008 (UTC)
Spell checking or correction?
The second para of the article currently says this "is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers." Shouldn't that be "spell *correctors*"? A spell checker need only check for exact matches, whereas a spell corrector needs to find the closest match(es). Mcswell (talk) 13:41, 13 October 2008 (UTC)
- I think that your remark makes sense, but spell checker IS the commonly used word for such applications, whether or not the intended end is correction or just exact-match lookup. D'Artagnol (talk) 16:51, 31 July 2009 (UTC)
- Don't some of them hold only a limited set of exact words and guess variations (tenses etc) of the words based on rules and stuff? --TiagoTiago (talk) 19:13, 20 October 2011 (UTC)
Bad layout
The layout of this article is a total failure. Rather than slapping the pseudocode in the reader's face (go and figure it out), it should provide an introduction and explain the ideas behind the algorithm first. --Yecril (talk) 13:01, 25 November 2009 (UTC)
- I agree completely. It was written by programmers, who program first and specify later. I have expanded the introduction, which should help a little, but the algorithm section should still be reordered:
- first, present the matrix in which every d[i,j] is the Levenshtein distance up to i and j;
- then, explain how d[i,j] can always be computed from d[i-1,j-1], d[i-1,j], and d[i,j-1];
- conclude that the distance can be computed by flood-filling the matrix from start to end;
- only then, give the algorithm, of which the correctness is now obvious;
- continue with the improvements, starting with the version that only keeps two rows of the matrix in memory at once.
- I'll attempt to do this if I find the time but please beat me to it. Rp (talk) 19:16, 21 January 2010 (UTC)
Citation
There was an article about Levenshtein distance in Dr. Dobb's Journal: see http://drdobbs.com/architecture-and-design/184408746?pgno=2 Maybe someone who knows how to include citations in the article could add this. — Preceding unsigned comment added by 84.128.62.142 (talk) 16:49, 2 July 2011 (UTC)
Metric with only additions and deletions
The section Relationship with other edit distance metrics states that the length of the longest common subsequence gives an edit distance metric given addition and deletion as the allowable edit operations. This seems incorrect. If two strings have lengths x and y with a longest common subsequence of z symbols, then the edit distance should be x + y - 2z. Am I making some silly mistake? Otherwise this should be fixed. Augurar (talk) 02:01, 5 January 2012 (UTC)
- You're right, what should be said is that the work required to do one or the other is the same, or that the metrics can be expressed in terms of the lengths of the strings and their longest common subsequence. Rp (talk) 16:09, 6 January 2012 (UTC)
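A sketch of the corrected relationship (standard textbook LCS recurrence; my illustration, not article text): with insertion and deletion as the only operations, every character of s outside a longest common subsequence must be deleted and every such character of t inserted, giving distance |s| + |t| - 2*|LCS(s, t)|:

import Data.Array

lcsLength :: Eq a => [a] -> [a] -> Int
lcsLength s t = table ! (m, n)
  where
    m = length s
    n = length t
    sa = listArray (1, m) s
    ta = listArray (1, n) t
    table = array ((0, 0), (m, n)) [((i, j), cell i j) | i <- [0 .. m], j <- [0 .. n]]
    cell 0 _ = 0
    cell _ 0 = 0
    cell i j
      | sa ! i == ta ! j = table ! (i - 1, j - 1) + 1
      | otherwise        = max (table ! (i - 1, j)) (table ! (i, j - 1))

-- Insert/delete-only edit distance, via the identity discussed above.
indelDistance :: Eq a => [a] -> [a] -> Int
indelDistance s t = length s + length t - 2 * lcsLength s t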
Proposed merge with Damerau–Levenshtein distance
When the OR is removed from this page, what remains is a rather trivial modification of Levenshtein distance. QVVERTYVS (hm?) 21:21, 25 November 2013 (UTC)
- In favor of merger. What is "OR"? I found the modified D-L (as described in Lowrance and Wagner in the references) to be quite distinct from simple L. distance. I would not want to see those references removed (or the pseudocode which is the only real explanation provided for how the algorithm works at the moment.) --Yoderj (talk) 20:44, 14 July 2014 (UTC)
- Original research. Sorry for the jargon. What I'd like to keep is exactly the well-sourced bits. QVVERTYVS (hm?) 21:49, 14 July 2014 (UTC)
- Oppose. The adjacent transpositions variation makes the two algorithms sufficiently different. Glrx (talk) 18:06, 15 July 2014 (UTC)
- Oppose. It's a different distance, with quite different algorithms because of its difference, and different applications (this one makes more sense in bioinformatics where transpositions are apparently uncommon; the other one makes more sense in spelling checking where transpositions are common). I don't think much of either article is actually original research, just badly sourced. And both articles are long enough that a merged combination of both of them would be both confusing (how would readers be able to tell which distance a particular section refers to) and hard to read (because of the length). Accordingly I have removed the merge tag. (It's over a year old with inadequate support for performing the merge.) —David Eppstein (talk) 01:24, 11 December 2014 (UTC)
Most text is duplicate from Edit distance
Does this variation of edit distance really deserve an article of its own? --Sigmundur (talk) 13:34, 31 December 2014 (UTC)
- The thing is that I started rewriting and generalizing this article's text as the article Edit distance. Levenshtein distance is the most common edit distance, so I think it deserves its own article, but we need to move some content around to put everything in its proper place. QVVERTYVS (hm?) 09:55, 16 March 2015 (UTC)
Inconsistent Start Index
editThe start index of the string is 1 but the start index of the 2D int array is 0. This is an inconsistency that can cause confusion, especially when translated into a programming language. Most programming languages stick with either 1 or 0 but not both. — Preceding unsigned comment added by Blackraven36 (talk • contribs) 18:19, 15 March 2015 (UTC)
Incorrect pseudocode
There is a problem in the pseudocode of section 4.3: the line v1[i] = i+1; causes writing outside the array boundary when s.Length > t.Length+1. — Preceding unsigned comment added by 95.80.245.102 (talk) 12:47, 3 July 2015 (UTC)
Recursive C implementation
I'm not sure why we're presenting a (presumably) fully working C implementation of the naïve recursive algorithm. No serious source presents this algorithm because it's so extremely inefficient. Anyone who wants to try it can derive it from the mathematical definition if they want. QVVERTYVS (hm?) 12:29, 14 October 2015 (UTC)
- I'm happy keeping the slow and stupid algorithm. If nothing else, it serves as a set up for the more efficient versions. Glrx (talk) 16:19, 20 October 2015 (UTC)
The calculation of 'cost' is not branching: the indicator function is only a boolean evaluation expression, so an if ... else statement is redundant and confusing. Please see http://c2.com/cgi/wiki?ReturnBooleanEvaluations in WikiWikiWeb. So the recursive C implementation source code is the following. --- Cedar101 (talk) 02:13, 20 October 2015 (UTC)
// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
// base case: empty strings
if (len_s == 0) return len_t;
if (len_t == 0) return len_s;
// check mismatch of last characters of the strings
bool mismatch = (s[len_s-1] != t[len_t-1]); // The bool type requires C++ or modern C(C99 or above) (#include <stdbool.h>)
int cost = (int)mismatch;
// return minimum of delete char from s, delete char from t, and delete char from both
return minimum(LevenshteinDistance(s, len_s - 1, t, len_t ) + 1,
LevenshteinDistance(s, len_s , t, len_t - 1) + 1,
LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}
- Your reference discusses returning a boolean from a function. That's different from converting a boolean into a number, which is what is going on here. (I still think we should just delete this code, it serves no encyclopedic purpose.) QVVERTYVS (hm?) 12:56, 20 October 2015 (UTC)
- Qwertyus is right about your reference; a cost is not a bool. The typecast to an int should also raise a flag as dirty code; the code relies on a specific representation of boolean values. Furthermore, WP is interested in describing algorithms rather than presenting code that compiles. WP avoids compiler issues such as include files because it does not expect its readers to be familiar with all languages and their nuances. Using constructs such as += or bizarre typecasts just makes the code more difficult for WP's readers. Glrx (talk) 16:17, 20 October 2015 (UTC)
- Discussions like this are just one of the reasons why the WikiProject Computer science Manual of style recommends explaining algorithms like this in pseudocode. The point of an article like this is not to provide cut-and-paste source code, but rather to explain the algorithm in an understandable fashion. RossPatterson (talk) 22:50, 20 October 2015 (UTC)
- RossPatterson rewrote it. Thanks for pointing out that MOS. QVVERTYVS (hm?) 19:52, 21 October 2015 (UTC)
- I reverted the partial programming language style change. Needs discussion; further fracture. Glrx (talk) 00:35, 23 October 2015 (UTC)
- What part do you object to? My version is mostly a decluttered version of the quasi-C code (indentation instead of brackets, fewer declarations), with one-indexing to directly match the equations. QVVERTYVS (hm?) 08:56, 23 October 2015 (UTC)
- Okay. This point is a kind of MOS issue. By the way, in the above code, I think the indicator function deserves more emphasis, per the reference (Return Boolean Evaluations). -- Cedar101 (talk) 01:10, 22 October 2015 (UTC)
// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t) { // 'string' is not in C. It's std::string in C++.
// base case: empty strings
if (len_s == 0) return len_t;
if (len_t == 0) return len_s;
// return minimum(std::min) of delete char from s, delete char from t, and delete char from both
return min({LevenshteinDistance(s, len_s-1, t, len_t ) + 1,
LevenshteinDistance(s, len_s , t, len_t-1) + 1,
LevenshteinDistance(s, len_s-1, t, len_t-1) + mismatch_cost(s, len_s, t, len_t)});
}
// The indicator function: check (and count) mismatch of last characters of the strings
int mismatch_cost(string s, int len_s, string t, int len_t) {
return s[len_s-1] != t[len_t-1];
}
- The implied type cast is just ugly. Glrx (talk) 00:36, 23 October 2015 (UTC)
- The thing that is just ugly is your insistence on using C instead of pseudocode. Why do you think this is a good idea? But if we are programming in C, I would use
int mismatch_cost = (s[len_s-1] != t[len_t-1]) ? 1 : 0
- It's a little redundant but avoids any weird casting. —David Eppstein (talk) 00:45, 23 October 2015 (UTC)
- My opposition is quick style changes.
- You are correct about the conditional expression in C, but C idioms should not be used because they are inaccessible. My comment about the ugly typecast, however, is directed at the failed understanding of the typecast and not how a conditional expression should be coded. Glrx (talk) 01:05, 23 October 2015 (UTC)
- Glrx Would it have been better if my pseudocode had matched more closely the Wagner–Fischer algorithm below it? Do we have consensus that pseudocode is better than C (David Eppstein, Cedar101)? QVVERTYVS (hm?) 10:38, 27 October 2015 (UTC)
- I mean to get to your question, but I just looked at the code and got distracted into reverting the changes by Cedar101. I'm running out of time. Cedar101's group of changes inserts style changes and some poor indirection.
- Generally, there's a WP notion to avoid gratuitous style changes to an article. If there is not a set style, then the first editor sets the style and other editors should follow it. That's what happens with reference formatting. Lots of people have their own preference, and it's crazy for editors to drive by and change the article to their preferred style. Style changes should be discussed and consensus obtained before the change.
- There is a notion that if a style is inconsistent, then a WP:BRD could be done to make the entire article consistent. I don't think BRD is appropriate here; I've reverted changes so I'm asking for discussion.
- There is no policy requiring pseudocode. I'm not opposed to it, but there are some gotchas going on in the string matching code. Strings are being passed, and there are subtle issues about the length of such strings, the 0/1 indexing, and the need for an extra fencepost in the rows.
- I'm not a fan of C/C++/etc., but if it's there I won't kill it. I will remove obscure C idioms (p++; p += 2; (x) ? y : z;) as confusing to the reader, and I will bludgeon dirty typecasts.
- Your syntactic changes to the code were minor, but you also removed informative comments such as "base case: empty strings".
- Another reason for avoiding change is that errors creep in. It's minor, but you dropped a trailing paren.
- Sorry, but I gotta go. Glrx (talk) 01:11, 29 October 2015 (UTC)
Please, prior to entering code into this article, take a look at its history. Many code examples have been introduced, gone through changes, and removed, according to the whims of the editors at the time. I have little doubt that the present examples will go through the same process. On top of that, there is a plethora of other articles on this subject such as edit distance, Jaro-Winkler distance, Damerau-Levenshtein distance, Needleman–Wunsch algorithm, string metric, Longest common subsequence problem, Approximate string matching, etcetera, which explain basically the same thing over and over again, often with their own code examples. It would be nice if we could repeat ourselves less often on this topic - but how to achieve that? Rp (talk) 11:12, 23 October 2015 (UTC)
- The simple solution would be to limit the pages on edit distance to the actual definition of the distances, and explain the algorithm to compute it for a pair of strings at Wagner–Fischer algorithm. On the plus side, that introduces a clear separation of concerns, pointing out that Levenshtein distance is not to be equated with expensive pairwise comparisons (they could, and often should, use Smith–Waterman, Levenshtein automata, etc.). However, I have an inkling that the popularity of this page is largely due to its exposition of Wagner–Fischer rather than the theory... QVVERTYVS (hm?) 12:06, 23 October 2015 (UTC)
Dispute over memory consumed by a naive recursive implementation
The article claims that
"A more efficient method would never repeat the same distance calculation. For example, the Levenshtein distance of all possible prefixes might be stored in an array d[][] where d[i][j] is the distance between the first i characters of string s and the first j characters of string t. The table is easy to construct one row at a time starting with row 0. When the entire table has been built, the desired distance is d[len_s][len_t]. While this technique is significantly faster, it will consume len_s * len_t more memory than the straightforward recursive implementation.”
Does anybody have a reference for the last statement? I think it's not right. Consider the huge stack space occupied by a recursive implementation…
Xrisk (talk) 19:53, 29 October 2015 (UTC)
- Why huge? The recursive implementation makes a lot of recursive calls, but the recursion is shallow (linear in total input length) and requires only constant storage per level of recursion. That said, I don't see why we're talking about the recursive implementation at all. It's a bad algorithm. In sorting, do we spend much time talking about the algorithm that generates all permutations and tests whether each one is sorted until it finds the right one? —David Eppstein (talk) 19:57, 29 October 2015 (UTC)
- I spent way too much time on that problem once...
- But the OP does have a point: brute-force recursion is linear in memory usage, naive Wagner-Fischer quadratic, optimized W-F linear again. The text suggests that brute-force is quadratically better, which is incorrect. QVVERTYVS (hm?) 20:49, 29 October 2015 (UTC)
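One way to see both points at once: the naive call tree is exponential in size but only linear in depth, because every call removes at least one character from s or t. A back-of-the-envelope sketch (mine, not article code) counting the calls for strings of lengths m and n:

-- Every non-base call spawns three recursive calls, so the tree blows up,
-- yet the deepest path has at most m + n levels (one character per level).
calls :: Int -> Int -> Integer
calls 0 _ = 1
calls _ 0 = 1
calls m n = 1 + calls (m - 1) n + calls m (n - 1) + calls (m - 1) (n - 1)

main :: IO ()
main = print (calls 10 10) -- over eight million calls, but never more than 20 levels deep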
Problem with iterative with two matrix rows implementation
editI have implemented this method with unexpected results: the distance obtained is asymmetrical and gives fairly different results than the straightforward implementation. I will read the article and check if the implementation on this article is correct. --Farru ES (talk) 18:38, 4 October 2016 (UTC)
- Mea culpa, the line
- var cost = (s[i] == t[j]) ? 0 : 1;
- should read
- int cost = (s[i-1] == t[j-1]) ? 0 : 1;
- for C++, as the index ranges from 0. --Farru ES (talk) 13:04, 6 October 2016 (UTC)
Error in Two Row algorithm
The current page shows:

function LevenshteinDistance(char s[1..m], char t[1..n]):
    // create two work vectors of integer distances
    declare int v0[n + 1]
    declare int v1[n + 1]
I believe this should be:

function LevenshteinDistance(char s[1..m], char t[1..n]):
    // create two work vectors of integer distances
    declare int v0[n + 1]
    declare int v1[m + 1]
--165.225.81.1 (talk) 17:06, 5 March 2019 (UTC)
- No. The dimensions are correct. Both v0 and v1 represent rows that are n+1 columns wide. Glrx (talk) 19:00, 3 April 2019 (UTC)
Efficiency in the Two Row algorithm
There is an implied loop in the statement
// copy v1 (current row) to v0 (previous row) for next iteration
swap v0 with v1
This loop just slows the algorithm down, and for no purpose. If the arrays were declared with an inner index then a 'swap' would consist of switching the inner index from 0 to 1 or 1 to 0. I'm not going to clutter the existing code on the page, but for coders searching for an algorithm this should be noted. — Preceding unsigned comment added by 165.225.81.1 (talk • contribs) 17:24, 5 March 2019 (UTC)
- There is no "implied loop". The array names are swapped rather than the contents of the arrays. Glrx (talk) 18:58, 3 April 2019 (UTC)
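For comparison, here is a sketch of the two-row scheme in a language with immutable values (Haskell; mine, not the article's pseudocode). Each row is length t + 1 entries wide, the fold feeds each freshly built row into the next iteration, and the "swap" is just a rebinding of names; nothing is copied:

-- 'row' is the previous row; 'scanl' builds the current row left to right.
lev :: Eq a => [a] -> [a] -> Int
lev s t = last (foldl step [0 .. length t] s)
  where
    step row c = scanl compute (head row + 1) (zip3 row (tail row) t)
      where
        compute left (diag, up, c') =
          minimum [left + 1, up + 1, diag + (if c == c' then 0 else 1)]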
Overly Naïve Implementation
The existing "naïve" implementation was in my opinion a good deal more naïve than it had to be. I've gone ahead and replaced the code sample with a slight variation on the algorithm that makes it a good deal faster. Of course the point of the naïve implementation was not to be fast, but to be easy to understand. It is my opinion that the new version is easier to understand than the previous version.
The change to the algorithm is to not branch when looking at identical characters. The previous algorithm would at every step branch into three calls: one for deletion, one for insertion, and one for both replacement and no action, with a cost calculated based on the characters at the read heads. The new version first checks the values at the read heads; if they are the same, it moves both heads forward (equivalent to no action), and only if they are different does it branch into the three possible moves (insertion, deletion, replacement).
This works properly since lev(xs, xt) = lev(s, t) for any common first character x; thus when calculating lev(xs, xt), no branching is needed: simply calculate lev(s, t).
This is faster because it branches far less frequently. Every place this algorithm branches, the old one would too, but there are some places the old one branched that this one does not. If you would like to verify this, simply compare a very long string with itself on both versions. The new version is nearly instant, since it is linear on equal strings, while the old version struggles to handle strings reaching even 100 characters, since it is exponential on equal strings.
This is, in my opinion, easier to understand because it separates the logic for performing steps from the logic for simply moving through the string. The cost calculation in the old version was a little too abstract, and having both no action and replacement occupy the same call was a little confusing.
Aside from the change of algorithm there are a few other changes. The big change is that the code is now in Haskell instead of C; the biggest reason for this is that I am more comfortable writing high-quality example Haskell than I am at doing so in C. If there is a really good case that C should be used instead, then someone confident in C should replace this code with C code that implements the new algorithm sensibly. The other reason is that Haskell is in its nature a recursive language, making recursive algorithms more clearly expressed in Haskell than in C. I've avoided using the more opaque Haskell syntax like pattern guards, instead opting for if then else, to make this as approachable as possible to persons without Haskell knowledge.
The other change is that the algorithm now moves left to right instead of right to left. This is a product of the differences between Haskell lists and C arrays. I don't think this change impacts clarity.
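For readers without the article at hand, the variant described above boils down to the following shape (my paraphrase, not the exact code that was added). The last line illustrates the linear-time behaviour on equal strings:

lev :: Eq a => [a] -> [a] -> Int
lev s [] = length s
lev [] t = length t
lev s@(x:xs) t@(y:ys)
  | x == y    = lev xs ys -- equal heads: advance both, no branching
  | otherwise = 1 + minimum [lev xs t, lev s ys, lev xs ys]

main :: IO ()
main = print (lev (replicate 1000 'a') (replicate 1000 'a')) -- 0, almost instantly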
A few suggestions for improvement of this article
The definition presented in the introduction (the smallest number of insertions, deletions, and substitutions needed to transform one string to another) is perfect as a definition and should be the definition used in this article. (With one improvement: the concept of an insertion should be clarified so that it explicitly includes additions to the beginning or end of a string.)
Then it should be demonstrated that this in fact defines a metric on the space of finite strings from a fixed alphabet: reflexivity, symmetry, and transitivity.
The reason is that this definition is very intuitive, and that's what a definition should be.
Then the "definition" that the article presents in the section titled Definition should instead be presented as a theorem, and that theorem should be proved in the article. 47.44.96.195 (talk) 02:38, 27 October 2020 (UTC)
- I absolutely agree, provided that the exposition is done in language that a non-mathematician can follow. I have a master's degree in computing science and I never had the concept of a metric explained to me in a course, and neither have 99% of readers of this article. Rp (talk) 10:57, 29 October 2020 (UTC)
- I agree that the definition is terrible and confusing. However, I disagree with your assessment of why. I don't think that intuition should form the basis of definitions; it is good to have that present, but the 'definition' should be just that: a clear and precise representation of the concept. I don't see how this could even be phrased as a theorem. Because I found it confusing, I've gone ahead and rewritten the definition. This definition is very similar to the C code I replaced earlier (see the section above) for, among other issues, being confusing in the same ways as this. In the new version the symbolic part is a little larger but requires less machinery (no unexplained indices, no indicator functions, cases are split up to match intuition instead of clumped together to make things shorter). I think this new definition clearly corresponds with intuitive notions while being precise enough to be useful. AquitaneHungerForce (talk) 22:48, 15 November 2020 (UTC)
- Thanks, but I still agree with 47.44.96.195: this is far too implementation-oriented. The definition should be a direct formalization of the 'informal' definition, namely: the Levenshtein distance between two strings is the minimal number of edits required to change one string into the other. Rp (talk) 22:00, 23 November 2020 (UTC)
- Here is a proposal:
- The Levenshtein distance is the function d that assigns a nonnegative integer d(s, t) to each pair of (possibly empty) strings s and t with the following properties:
- for all strings s, d(s, s) = 0
- for all strings s and t, d(s, t) = d(t, s)
- for all strings s and characters c, d(s, sc) = 1 (character insertion or deletion is 1 step)
- for all strings s, t and characters b and c, if b ≠ c, d(sbt, sct) = 1 (character substitution is 1 step)
- for all strings s, t and u, d(s, u) ≤ d(s, t) + d(t, u) (the distance is the length of the shortest path)
- Exactly one function with these properties exists, and it follows straight from its definition that it is a metric.
- By induction on the length of the strings involved, it can be proved that for all strings s and t, d(s, t) equals the following: (what follows here is the existing definition)
- where the tail of some string x is a string of all but the first character of x, and x[n] is the nth character of the string x, starting with character 0.
- The naïve recursive implementation to calculate Levenshtein distance directly follows from this.
- I am in favour of substituting the existing definition with this. Rp (talk) 20:14, 20 December 2020 (UTC)
- I think this is a nice addition, but surely the 3rd point should be
- for all strings s, t and characters c, d(st, sct) = 1 (character insertion or deletion is 1 step)
- AquitaneHungerForce (talk) 14:19, 21 December 2020 (UTC)
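For what it's worth, the five proposed properties (with the corrected third point) can be machine-checked against an ordinary implementation. A QuickCheck sketch, using a compact fold-based lev so the tests stay fast:

import Test.QuickCheck

lev :: String -> String -> Int
lev s t = last (foldl step [0 .. length t] s)
  where
    step row c = scanl compute (head row + 1) (zip3 row (tail row) t)
      where
        compute left (diag, up, c') =
          minimum [left + 1, up + 1, diag + (if c == c' then 0 else 1)]

main :: IO ()
main = do
  quickCheck (\s -> lev s s == 0)                                              -- point 1
  quickCheck (\s t -> lev s t == lev t s)                                      -- point 2
  quickCheck (\s t c -> lev (s ++ t) (s ++ [c] ++ t) == 1)                     -- point 3
  quickCheck (\s t b c -> b /= c ==> lev (s ++ [b] ++ t) (s ++ [c] ++ t) == 1) -- point 4
  quickCheck (\s t u -> lev s u <= lev s t + lev t u)                          -- point 5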
- The last step of the visualization for transforming kitten into sitting seemed wrong to me, but I admit I certainly may be confused, and my apologies in advance. We are transforming kitten into sitting, right? The last operation was shown as deleting 'g', but earlier in the article it looks like we are transforming in that direction. The rows are the 'i's, right? And the columns the 'j's? In any event, if it's meant to be a delete op at the end, maybe there could be a few more identifying clues, perhaps saying something like 'this time let's try transforming sitting into kitten'. Also, the language "substitute 'x' for 'y'" is a bit ambiguous to me: I can read it as saying that y changes into x or that x changes into y. Maybe if we did an arrow? "substitute 'x' -> 'y'"? Benjoldersma (talk) 17:34, 31 January 2021 (UTC)
Add reducer-based implementation?
I wanted to add a reducer-based Clojure implementation, because reducers are a new functional programming concept ("new" as in "not around back in 2007 when the article was started"), and this is an additional way of doing it (i.e. not iterative or recursive). It is less efficient computationally, but also terser, so I feel it merits adding. However, I saw the note not to add implementations and to come to the talk page, so I'll pop it here and see what others think:
(defn levenshtein [a b]
(peek
(reduce
(fn [last this]
(reduce
(fn [row [diagonal above other]]
  (let [update (if (= other this)
                 diagonal
                 (inc (min diagonal above (peek row))))]
    (conj row update)))
[(inc (first last))]
(map vector last (next last) b)))
(map #(identity %2) (cons nil b) (range))
a)))
We should explain that time complexity can be expressed in terms of both n and s (where s is edit distance), and that O(n*s) algorithms exist
See https://stackoverflow.com/a/4057585/1709587 and http://www.berghel.net/publications/asm/asm.php, for instance.
"Where the earlier algorithm of Wagner and Fisher [6] was implemented in time O(n²), Ukkonen's algorithm runs in O(s * n) time, for edit distance s and string length n."
Obviously in the worst case s=n (so if you want to characterise the worst-case time complexity purely in terms of n, it's still O(n²)) but the point is that with the right algorithm, the time complexity approaches linear as the strings being compared approach being identical. This way of characterising time complexity of edit distance calculations is fairly normal - see also how the Myers diff algorithm (for calculating diffs between strings, with allowable edits being only insertions and deletions) is characterised (both typically and in the title of Myers' paper) as an O(nd) algorithm, where d is edit distance. ExplodingCabbage (talk) 09:14, 23 November 2023 (UTC)
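To sketch the idea (a simplification of Ukkonen's approach, not his exact algorithm): if the distance is at most k, an optimal path through the matrix never strays more than k cells from the main diagonal, so only O(n*k) cells matter; guessing k = 1, 2, 4, ... and doubling until the answer fits gives O(n*s) work overall. A lazy Haskell illustration in which cells outside the band are never demanded:

import Data.Array

-- Returns the distance if it is at most k, Nothing otherwise.
boundedLev :: Int -> String -> String -> Maybe Int
boundedLev k s t
  | abs (m - n) > k = Nothing -- the distance is at least the length gap
  | d <= k          = Just d
  | otherwise       = Nothing
  where
    m = length s
    n = length t
    big = m + n + 1 -- effectively infinity here
    sa = listArray (1, m) s
    ta = listArray (1, n) t
    table = array ((0, 0), (m, n)) [((i, j), cell i j) | i <- [0 .. m], j <- [0 .. n]]
    get i j | abs (i - j) > k = big -- outside the band: never force the cell
            | otherwise       = table ! (i, j)
    cell 0 j = j
    cell i 0 = i
    cell i j = minimum
      [ get (i - 1) j + 1
      , get i (j - 1) + 1
      , get (i - 1) (j - 1) + (if sa ! i == ta ! j then 0 else 1)
      ]
    d = get m n

-- Double the band until the real distance fits inside it.
ukkonenStyle :: String -> String -> Int
ukkonenStyle s t = head [d | k <- iterate (* 2) 1, Just d <- [boundedLev k s t]]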
Upper bound
It is currently stated that the upper bound is at most the length of the longer string. It is, however, also stated that if the strings have the same size, the Hamming distance is an upper bound on the Levenshtein distance. This implies that an upper bound for the Levenshtein distance between two strings of different lengths can be obtained by adding the length difference to the Hamming distance between the shorter string and the substring of the longer string that starts at index 0 and has the same length as the shorter string. Should this be explicitly stated instead? Ratsatchel (talk) 17:11, 21 June 2024 (UTC)
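That combined bound does hold: substituting the mismatched positions of the aligned prefix and then inserting or deleting the leftover suffix is one particular edit script, so its cost bounds the minimum from above. A quick property-test sketch (reusing the compact fold-based lev from the sketches above):

import Test.QuickCheck

lev :: String -> String -> Int
lev s t = last (foldl step [0 .. length t] s)
  where
    step row c = scanl compute (head row + 1) (zip3 row (tail row) t)
      where
        compute left (diag, up, c') =
          minimum [left + 1, up + 1, diag + (if c == c' then 0 else 1)]

-- Hamming distance of the aligned prefix, plus the length difference.
prefixHammingBound :: String -> String -> Int
prefixHammingBound s t =
  length (filter id (zipWith (/=) s t)) + abs (length s - length t)

main :: IO ()
main = quickCheck (\s t -> lev s t <= prefixHammingBound s t)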