% 18mar2006 +chris+
\documentclass[12pt,fleqn]{article}
%let array = True
%include lhs2TeX.fmt
%include lhs2TeX.sty
% a visible space
\def\vissp{\texttt{'{\char'040}'}}
%format ' ' = "\vissp"
\usepackage[a4paper]{geometry}
\usepackage{hyperref}
\setlength{\parindent}{0pt}
\setlength{\parskip}{0.5ex}
\title{Dynamic Programming in Haskell}
\author{Christian Neukirchen\footnote{The author can be reached at
\url{http://chneukirchen.org}.}}
\date{March 2006}
\begin{document}
\maketitle
The purpose of this Literate Haskell program is to implement a
function that does global sequence alignment using Needleman/Wunsch
techniques.\footnote{More about these techniques, graphics helpful
for understanding, and a codeless step-by-step explanation can be
found at
\url{http://www.sbc.su.se/~pjk/molbioinfo2001/dynprog/dynamic.html}.}
The algorithm is based on two steps: first, filling a matrix with the
maximal alignment scores for each element and then tracing a path
connecting the top-left and the bottom-right cell. Note that the
matrix is $O(n\cdot{}m)$ memory-wise and therefore pretty inefficient, you
don't want to use this on bigger sequences.
In Haskell, a good way to implement Dynamic Programming like this is
an array that will memoize a lazy stream of scores per cell. This
allows $O(1)$-lookup of formerly calculated values without losing
referential transparency and (to an extent) lazy evaluation.
% hack, hack
\setlength{\mathindent}{2.5em}
\begin{code}
import Array
\end{code}
|align| is the function that wraps all the functions below and calls
them in correct order. It takes two strings and returns them aligned:
\begin{code}
align :: String -> String -> [String]
align da db = format $ reverse $ traceback lena lenb
where
lena = length da
lenb = length db
\end{code}
% hack, hack
\setlength{\mathindent}{4.5em}
The algorithm is easier to express when the sequences to align are
one-indexed, since the borders of the matrix are used as special
values. An easy way to achieve this is prepending a space:
\begin{code}
a = ' ' : da
b = ' ' : db
\end{code}
|memscore| is the array that contains the actual matrix. It is filled
using a lazy stream of scores for each element.
\begin{code}
memscore = listArray ((0,0), (lena,lenb))
[score x y | x <- [0..lena], y <- [0..lenb]]
\end{code}
The scoring function looks very confusing since Haskell's array access
operator is not very elegant. I'll introduce an infix operator $i$ |@@|
$j$ that corresponds to $M_{i,j}$:
\begin{code}
infix 5 @@
(@@) i j = memscore ! (i, j)
\end{code}
The score $M_{i,j}$ of each element is determined in below code as follows,
the borders of the matrix with $i=0$ and $j=0$ are initialized to zero.
(More complex scoring algorithms could be added easily.)
\[
M_{i,j} = \textrm{maximum of} \left\{%
\begin{array}{l}
M_{i-1,j-1} + S_{i,j} \\
M_{i,j-1} + w \\
M_{i-1,j} + w
\end{array}\right.
\]
The gap penalty $w$ is zero here for reasons of simplicity.
\begin{code}
score 0 _ = 0
score _ 0 = 0
score x y = maximum [( x-1 @@ y-1) + difference x y,
x-1 @@ y ,
x @@ y-1]
\end{code}
$S_{i,j}$ is a mismatch penalty defined here like this:
\[
S_{i,j} = \left\{%
\begin{array}{ll}
0 & \textrm{if the symbols at position $i$ and position $j$ match} \\
1 & \textrm{otherwise}
\end{array}\right.
\]
\begin{code}
where difference x y | a!!x == b!!y = 1
| otherwise = 0
\end{code}
|traceback| now finds the path connecting both corners of the matrix
and collects the appropriate symbols (or spaces for gaps).
\begin{code}
traceback :: Int -> Int -> [(Char, Char)]
traceback 0 0 = []
traceback x y | x == 0 = (' ' , b!!y ) : traceback 0 (y-1)
| y == 0 = (a!!x , ' ' ) : traceback (x-1) 0
| x@@y == x@@y-1 = (' ' , b!!y ) : traceback x (y-1)
| x@@y == x-1@@y = (a!!x , ' ' ) : traceback (x-1) y
| otherwise = (a!!x , b!!y ) : traceback (x-1) (y-1)
\end{code}
The resulting list of tuples like |[('a', 'd'), ('b', 'e'), ('c', 'f')]|
gets converted by |format| into |["abc", "def"]|.
\begin{code}
format l = [map fst l, map snd l]
\end{code}
Finally, a small main program to test the algorithm:
% hack, hack
\setlength{\mathindent}{2.5em}
\begin{code}
dna1 = "GAATTCAGTTA"
dna2 = "GGATCGA"
main = mapM_ putStrLn $ align dna1 dna2
\end{code}
%$
Expected output:
\begin{verbatim*}
G AATTCAGTTA
GGA T C G A
\end{verbatim*}
As you can see, corresponding symbols are aligned, with appropriate
gaps in between. Implementing more complex rules for alignment is
left as an exercise for the reader.
A run where bigger gaps are needed:
\begin{spec}
dna1 = "ATGGCTTCTACC"
dna2 = "TATCAAAAGCCG"
\end{spec}
\begin{verbatim*}
ATGGCTTCTA CC
TAT C AAAAGCCG
\end{verbatim*}
\end{document}