87994.com

学习资料共享网 文档搜索专家

学习资料共享网 文档搜索专家

1515-5 编辑距离

a)Given a)Given two sequences x[1 ‥ m] and y[1 ‥ n] and set of transformationtransformation-operation costs, the edit distance from x to y is the cost of the least expensive operation sequence dynamicthat transforms x to y. Describe a dynamic-programming algorithm that finds the edit distance from x[1 ‥ m] to y[1 ‥ n] and prints an optimal operation sequence. Analyze the running time and space requirements of your algorithm.

First,,set X[i]=x[1..i] 、 Y[j]=y[1..j].The edit distance‘s child problem is actually from X [i] to Y [j] optimal conversion, We will call it "X [i] -> Y [i] problem " 。 Obviously，The original problem is actually X [m] -> Y [n].Set c[i, j] Said that resolution of the X [i] -> Y [j] problem of the optimal cost，Then under settlement X[i]->Y[j] On the last step of the operation，The following recurrence relations can be： If the last step of the operation is COPY，Then there must be x [i] = y [j].The optimal solution to this problem depends on the X [i-1] -> Y [j-1] the optimal solution, so there is c [i, j] = c [i-1, j-1] + cost (COPY); If the last step is to REPLACE, then there must be x [i]! = y [i]. (This assumes that the implementation can not use the same letters replace operation.) optimal solution to this problem also depends on x [i-1] -> Y [j-1] the optimal solution, so there is c [i, j] = c [i-1, j-1] + cost (REPLACE); If the last step is TWIDDLE, then there must be x [i] = y [j-1] and x [i-1] = y [j], and asked i, j> = 2. At this time, the optimal solution of this problem depends on the X [i-2] -> Y [i-2] the optimal solution, so there is c [i, j] = c [i-2, j-2] + cost (TWIDDLE);

If the last step of the operation is DELETE, the operation on the x or y is not restricted. Optimal solution to this problem

depends on the X [i-1] -> Y [j] the optimal solution, so there is c [i, j] = c [i-1, j] + cost (DELETE); If the last step of the operation is INSERT, as there is no limit on the x and y. Optimal solution to this problem depends on the X [i] -> Y [j-1] the optimal solution, so there is c [i, j] = c [i, j-1] + cost (INSERT); If the last step is to KILL, it means that we have completed the X [m] to Y [n] of the conversion, so the current problem must be X [m] -> Y [n] issue. That is, at this time there must be i = m and j = n. Optimal solution to this problem depends on the X [k] -> Y [n] the optimal solution (0 <= k <m), so there is c [m, n] = min (0 <= k <m) {c [k, n] + cost (KILL)}. X [0] and Y [0] represents the empty string, from X [0] switch to Y [j] INSERT operations require j times, so c [0, j] = j * cost (INSERT). Similarly, from X [i] to switch to Y [0] need to DELETE operation i times, then c [i, 0] = i * cost (DELETE). When i = j = 0 时, c [i, j] = c [0, 0] = 0, empty string into the cost of an empty string to 0. For i, j> 0, c [i, j] has the following recurrence formula：

c[i, j]=min{ c[i-1, j-1]+cost(COPY) (x[i]=y[j]), c[i-1, j-1]+cost(REPLACE) (x[i]!=y[j]), c[i-2, j-2]+cost(TWIDDLE) (i, j>=2 and x[i]=y[j-1] and x[i-1]=y[j]), c[i-1, j]+cost(DELETE), c[i, j-1]+cost(INSERT), min(0<=k<m){c[k, n]+cost(KILL)} (i=m and j=n). }

In addition, the set op [i, j] Save to solve X [i] -> Y [j] the issue of the last step. EDIT-DISTANCE(x, y, m, n) for i<-0 to m do c[i, 0]<-i*cost(DELETE) op[i, 0]<-DELETE for j<-0 to n do c[0, j]<-j*cost(INSERT) op[0, j]<-INSERT for i<-1 to m do for j<-1 to n

do c[i, j]<-∞ if x[i]=y[j] then c[i, j]<-c[i-1, j-1]+cost(COPY) op[i, j]<-COPY if x[i]!=y[j] and c[i-1, j-1]+cost(REPLACE)<c[i, j] then c[i, j]<-c[i-1, j-1]+cost(REPLACE) op[i, j]<-REPLACE(y[j]) if i>=2 and j>=2 and x[i]=y[j-1] and x[i-1]=y[j] and c[i-2, j-2]+cost(TWIDDLE)<c[i, j] then c[i, j]<-c[i-2, j-2]+cost(TWIDDLE) op[i, j]<-TWIDDLE if c[i-1, j]+cost(DELETE)<c[i, j] then c[i, j]<-c[i-1, j]+cost(DELETE) op[i, j]<-DELETE if c[i, j-1]+cost(INSERT)<c[i, j] then c[i, j]<-c[i, j-1]+cost(INSERT) op[i, j]<-INSERT(y[j]) for i<-0 to m-1 do if c[i, n]+cost(KILL)<c[m, n] then c[m, n]<-c[i, n]+cost(KILL) op[m, n]<-KILL(i) return c and op

Obviously, this algorithm time and space complexity are Θ (mn). We use the EDIT-DISTANCE return the table to reconstruct the sequence of operations op. OP-SEQUENCE (op, i, j) Reconstruction of X [i] -> Y [j] the optimal solution, the boundary conditions are i = j = 0. The outermost call OP-SEQUENCE (op, m, n)

OP-SEQUENCE(op, i, j) if i=0 and j=0 then return if op[i, j]=COPY or op[i, j]=REPLACE then OP-SEQUENCE(op, i-1, j-1) else if op[i, j]=TWIDDLE then OP-SEQUENCE(op, i-2, j-2) else if op[i, j]=DELETE then OP-SEQUENCE(op, i-1, j) else if op[i, j]=INSERT then OP-SEQUENCE(op, i, j-1) else if op[i, j]=KILL(k) then OP-SEQUENCE(op, k, j) print op[i, j]

The edit-distance problem is a generalization of the problem of aligning two DNA sequences (see, for example, Setubal and Meidanis [272, Section 3.2]). There are several methods for measuring the similarity of two DNA sequences by aligning them. One such method to align two sequences x and y consists of inserting spaces at arbitrary locations in the two sequences (including at either end) so that the resulting sequences x′ and y′ have the same length but do not have a space in the same position (i.e., for no position j are both x′[j] and y′[j] a space.) Then we assign a "score" to each position. Position j receives a score as follows: +1 if x′[j] = y′[j] and neither is a space, -1 if x′[j] ≠ y′[j] and neither is a space, -2 if either x′[j] or y′[j] is a space. The score for the alignment is the sum of the scores of the individual positions. For example, given the sequences x = GATCGGCAT and y = CAATGTGAATC, one alignment is G ATCG GCAT CAAT GTGAATC -*++*+*+-++* A + under a position indicates a score of +1 for that position, a - indicates a score of -1, and a * indicates a score of -2, so that this alignment has a total score of 6 · 1 - 2 · 1 4 · 2 _ -4.

b) Explain how to cast the problem of finding an optimal alignment as an edit distance problem using a subset of the transformation operations copy, replace, delete, insert, twiddle, and kill.

DNA sequence edit distance problem is essentially the problem: cost(COPY) = -1, cost(REPLACE) = +1, cost(DELETE) cost(INSERT) = +2, = +2,

TWIDDLE operation and KILL operation is not allowed.

DNA sequence alignment of the two biggest points to the operation of the A string is converted to B by the string edit distance of the opposite number.

相关文章:

更多相关标签: