Dynamic programming - minimum edit distance
Minimum Edit Distance
scince
What is this word?
Well, here are a few options:
sense
seance
science
If we’re a handy-dandy spell checker, how do we decide among them?
We clearly need some sort of way to tell that words like seance
and
science
are closer to scince
than something like sense
.
One pretty natural idea is to see how many edits would be required to go
from scince
to sense
. Or seance
to science
.
Let’s look at sense
first. One possible alignment is this.
s c i n c e
s e _ n s e
The edit distance between these words is the number of mismatched characters plus the number of inserted gaps. Here we failed to line up 3 characters, so the edit distance is 3.
Of course we could make worse matches, for example:
s c i n c e
s e n s e _
Here we managed to create an edit distance of 5 instead of 3. Clearly if we’re going to use edit distance as our measure of similarity, we only want to consider the minimum edit distance between two words.
I hope this gives some good motivation for why we might want an algorithm to compute minimum edit distance, now on to finding it!
Algorithmic Thinking
Let’s consider the general problem of finding the minimum edit distance between two strings,
X
and Y
, of lengths n
and m
.
This problem seems pretty hard. In general there are many possible edit distances we could compute between two strings and a simple algorithm here doesn’t come to mind.
Well a super general strategy in problem solving is to take a complex problem, and break it down into smaller problems. I can think of two main ways to make this problem simpler.
- Reduce the number of characters we consider in string
X
- Reduce the number of characters we consider in string
Y
To start, let’s say we’ll only consider the first i
characeters in string X
and the
first j
characters in string Y
.
We can express this as X[1..i]
and Y[1..j]
, note we’re being inclusive on our bounds here,
X[1..1] == X[1]
, and X[1..0] == ""
.
Base Cases
Is our problem easier yet? Not really, but we can express something new using i
and j
.
We know that if i
is 0, then we’re comparing an empty string to another string with length
j
, so to go from empty string to a string of length j
will take a minimum of j
edits,
where each edit is just inserting another character in Y[1..j]
.
The same argument applies the otherway around, if j
is 0, then i
is our minimum edit distance.
What we’ve defined are base cases in this problem. They’re trivially solvable because the input is insane, but they’re going to be a critical piece of our solution.
To keep track of these base cases we’ll say that some table T
, (indexed by i
and j
) stores
them. T[i,0] = i
and T[0,j] = j
.
Recurrence(Recurrence(Recurrence(……]
Now if we can define a way to go from the complicated problem, towards our base cases each step, then after enough iteration, we’ll reach our base cases and know the solution of the hard problem.
Following? If not hopefully walking through the logic here will help.
Remember that table we defined to keep track of our base cases? Well we can use it to keep track
of all the intermediate solutions to smaller problems. Define T[i,j]
to be the minimum
edit distance between strings X[1..i]
and Y[1..j]
. Clearly, the solution to the overall
problem of a length n
X
string and length m
Y
string is then T[n,m]
.
Okay, so we’ve got to express T[i,j]
as something simpler (since we need to move in the direction
of our base cases), so that means expressing T[i,j]
as a function of elements in the table
with smaller i
and j
values.
Consider these substrings.
s c ?
s e ?
There are three options to fill in the ?
.
- We use the next character from the X string
X[i]
and leave a_
for the other string. - We use the next character from the Y string
Y[j]
and leave a_
for the other string. - We use both
X[1]
andY[1]
.
T[i,j]
will have the value of the best minimum edit distance when trying these three options out.
Number 1 is basically just taking a loss by saying we’ll have to add a character into the Y
string, so the best minimum edit distance it’ll produce is 1 plus the minimum edit distance between
the slightly shorter string X[i-1]
and Y[j]
. We can express this really elegantly as just
T[i,j] = 1 + T[i-1,j]
.
Likewise, if we chose to use the Y[j]
character, T[i,j] = 1 + T[i,j-1]
.
Finally, there’s the option of using both X[i]
and Y[j]
. Now this is the interesting case.
If X[i]
is the same character as Y[j]
, then we haven’t increased the edit distance at all from the minimum
edit distance between X[1..i-1]
and Y[1..j-1]
(the prefixes of the current strings).
If X[i]
is not the same character as Y[j]
, then just like the other cases, we increased the minimum
edit distance by 1.
Now that we’ve expressed all the cases, we can create a general equation for T[i,j]
by just
taking the best result, so
T[i,j] = min{
1 + T[i-1,j],
1 + T[i,j-1],
1 + T[i-1,j-1] if X[i] != Y[j] else T[i-1,j-1],
}
Is That It?
Yes! We finally know how to solve the simplest cases and how to take a complicated problem and reduce it to simpler cases.
At this point we can write an algorithm for computing the minimum edit distance called MinD
.
MinD(X: 1->n, Y: 1->m):
for i = 1->n: T[i,0] = i
for j = 1->m: T[0,j] = j
for i = 1->n
for j = 1->m
T[i,j] = min{
1 + T[i-1,j],
1 + T[i,j-1],
1 + T[i-1,j-1] if X[i] != Y[j] else T[i-1,j-1],
}
return T[n,m]
The critical piece to understand here is that every reference to our table in the loop body is
defined by the time we get around to computing it. Basically, our table gets filled row-by-row,
left-to-right, and we only look at previous row/column pairs. We can also see that this algorithm
has a running time of O(n*m)
as we have nested n
and m
loops.
Let’s walk through computing the minimum edit distance between sense
and scince
, I’ve put
string X
(scince
) along the rows and string Y
(sense
) over the columns, along with indices
i
, and j
respectively.
0 | 1, s | 2, e | 3, n | 4, s | 5, e | |
0 | ||||||
1, s | ||||||
2, c | ||||||
3, i | ||||||
4, n | ||||||
5, c | ||||||
6, e |
Base cases
We can start by filling out our base cases, like so:
0 | 1, s | 2, e | 3, n | 4, s | 5, e | |
0 | 0 | 1 | 2 | 3 | 4 | 5 |
1, s | 1 | |||||
2, c | 2 | |||||
3, i | 3 | |||||
4, n | 4 | |||||
5, c | 5 | |||||
6, e | 6 |
Starting to fill
Now we start at i=1
and j=1
, here we consider the inner loop
of the algorithm and see that since X[i] == Y[j]
, we can use
the minimum edit distance of T[i-1,j-1]
, so we get a value of 0.
Next up, we have the case of i=1
and j=2
. We can see
that X[i] != Y[j]
in this case, so we won’t be able to
reuse the edit distance of T[i-1,j-1]
. Evaluating each
case in the minimum clause, we get the following values:
1 + T[i-1,j] == 2 + 1 == 3
1 + T[i,j-1] == 0 + 1 == 1
1 + T[i-1,j-1] == 1 + 1 == 2
So the minimum edit distance of se
and s
is 1.
0 | 1, s | 2, e | 3, n | 4, s | 5, e | |
0 | 0 | 1 | 2 | 3 | 4 | 5 |
1, s | 1 | 0 | 1 | |||
2, c | 2 | |||||
3, i | 3 | |||||
4, n | 4 | |||||
5, c | 5 | |||||
6, e | 6 |
Result
After following the algorithm, you can see that we’ve solved for every
possible minimum edit distance for substrings X[1..i]
and Y[1..j]
,
until we finally reach the solution we want, X[1..n]
and Y[1..m]
.
0 | 1, s | 2, e | 3, n | 4, s | 5, e | |
0 | 0 | 1 | 2 | 3 | 4 | 5 |
1, s | 1 | 0 | 1 | 2 | 3 | 4 |
2, c | 2 | 1 | 1 | 2 | 3 | 4 |
3, i | 3 | 2 | 2 | 2 | 3 | 4 |
4, n | 4 | 3 | 3 | 2 | 3 | 4 |
5, c | 5 | 4 | 4 | 3 | 3 | 4 |
6, e | 6 | 5 | 4 | 4 | 4 | 3 |
Reflecting
What we just built is a dynamic programming algorithm. Programming here doesn’t really refer to our modern sense of computer programming, but instead is more akin to the word planning.
When a plan is too complicated to work out immediately, then we try to think of a subproblem, that if solved would allow us to figure out the more complicated problem. In our case, that subproblem was figuring out the minimum edit distance of a prefix of the input strings.
This step of finding a subproblem to solve is the piece that requires some experimentation,
trying a specific definition of T
and seeing if it is possible to state a recurrence relation
between T[i,j]
and previously computed T[i,j]
values. If it’s not possible, then you
probably need to revisit your definition of T
.
If you find this sort of stuff cool, you should do some searching on proofs by induction. They’re heavily related to this concept but focus on proving certain properties by proving a base case, then defining a recurrence relationship that moves you closer to the base case.
So there you have it, dynamic programming is just a basic problem solving technique wrapped in some fancy language.