The longest common subsequence problem[1] (LCS) is a computer science problem to find the longest subsequence common to a set of sequences. This paper will address the special problem of comparing the words or whole lines of 2 text files where the goal is to discover how the second file differs from the first. It is useful in comparing versions of text files to determine what changes were made in subsequent versions. We will call this special case of the problem to have a bias for the first sequence.

Using Bias

edit

The problem can be solved in polynomial time with memory for the two sequences and memory for a binding vector for each sequence. The binding vectors are vectors that remember which element of one sequence is common to the other. A matrix of common sequence lengths is not required as in the classical LCS solution[2].

Although in the classical problem, the LCS is not unique, it is unique in this special case of bias.

Suffixes

edit

Let us define a suffix of a sequence as the sequence with the top cut off. Let X be the sequence {GCAT}, Then there are 4 suffixes, S1, S2, S3, S4 which are subsequences of X where the subscript denotes the number of elements in the subsequence. The possible suffixes of X are

S1 = {T}
S2 = {AT}
S3 = {CAT}
S4 = {GCAT} = X

CS function

edit

The CS function is a function that yields a common subsequence, but not necessarily the longest. It is a guess of how the second sequence is a modification of the first. Given two sequences,   and  , their binding vectors, Xbind[1..m] and Ybind[1..n], and their corresponding suffixes,   and   the CS function is given by

 

LCS function

edit

Now, the longest common subsequence can be found using the CS function and is

 

Example

edit

Given the two sequences,   and   the longest common sequence is  . A Difference Report of the two sequences side by side commenting on the additions, changes and deletions made to the left (bias) sequence, might be

  Version 1    Version 2
  ---------    ---------
               A  Added
  B            B
  C  Deleted
  D            D
  F            E  Changed
  K            K
  L            N  Changed 
  P  Deleted

Now, consider the Versions reversed.   and  . Although the same longest common sequence is found, the additions, changes and deletions are different since the bias has changed.

  Version 1    Version 2
  ---------    ---------
  A  Deleted
  B            B
               C  Added
  D            D
  F            E  Changed
  K            K
  L            N  Changed 
               P  Added


Non-Recursive Code for LCS with Bias

edit
function forwardTrace( X[1..m], Y[1..n]) )
  array Xbind[1..m], Ybind[1..n]
  Xbind, Ybind := Zero Vector
  mtop, ntop := 1
  j := 1
  while j ≤ n
     j, j0 := ntop
     Find some CS(X,Y) scanning the first sequence X
     for i := mtop..m
        j := j0
        while j ≤ n and Y[j] ≠ X[i]
           j := j+1
        end
        if j ≤ n
           Xbind[i] := j
           Ybind[j] := i
           j0 := j+1
        else
           Xbind[i] := 0
        end 
     next i
     i, i0 := mtop
     Scan the second sequence to see if CS(Y,X) = CS(X,Y), i.e., symmetrical
     isSymmetrical := true 
     j = ntop
     while j ≤ n and isSymmetrical
        i := i0
        while i ≤ m and X[i] ≠ Y[j]
           i := i+1
        end
        if i ≤ m
           if Xbind[i] = 0 or ( Ybind[j] > 0 and Xbind(Ybind[j]) ≠ j ) 
              CS(Y,X) ≠ CS(X,Y) at this j so…
              erase all bindings from here down for Ybind
              for j2 := j+1..n
                 if Ybind[j2] > 0
                    Xbind[Ybind[j2]] := 0
                    Ybind[j2] := 0
                 end
              next j2
              Set new bindings
              Ybind[j] := i
              Xbind[Ybind[j]] := i
              mtop := i+1
              ntop := j+1 
              isSymmetrical := false
           end
        end
        j := j+1
     end
  end
return {Xbind, Ybind}
For languages that cannot return sets, include the arrays as parameters by reference. 
edit

This function will print the two sequences side by side commenting on the additions, changes and deletions made to the left (first) sequence, the bias sequence. Appropriate column formatting may be added.

function PrintDiff( X[1..m], Y[1..n], Xbind[1..m], Ybind[1..n] )
  i, j := 1
  while i ≤ m and j ≤ m
     if Xbind[i] = 0 and Ybind[j] > 0
        Print X[i] + " Deleted"
        i := i+1
     else if Xbind[i] > 0 and Ybind[y] = 0
        Print Space(10) + Y[j] + " Added"
        j := j+1
     else if Xbind[i] = 0 and Ybind[j] = 0
        Print X[i] + " " + Y[j] + " Changed"
        i := i+1
        j := j+1
     else
        Print X[i] + " " + Y[j]
        i := i+1
        j := j+1
    end
  end
  if i ≤ m
     for i := i..m
        Print X[i] + " Deleted"
     end
  end
  if j ≤ n
     for j := j..n
        Print Space(10) + Y[j] + " Added"
     end
  end
return 0

References

edit