% 18mar2006  +chris+
\documentclass[12pt,fleqn]{article}

%let array = True

%include lhs2TeX.fmt
%include lhs2TeX.sty

% a visible space
\def\vissp{\texttt{'{\char'040}'}}
%format ' ' = "\vissp"

\usepackage[a4paper]{geometry}
\usepackage{hyperref}

\setlength{\parindent}{0pt}
\setlength{\parskip}{0.5ex}

\title{Dynamic Programming in Haskell}
\author{Christian Neukirchen\footnote{The author can be reached at
    \url{http://chneukirchen.org}.}}
\date{March 2006}

\begin{document}

\maketitle

The purpose of this Literate Haskell program is to implement a
function that does global sequence alignment using Needleman/Wunsch
techniques.\footnote{More about these techniques, graphics helpful
  for understanding, and a codeless step-by-step explanation can be
  found at
  \url{http://www.sbc.su.se/~pjk/molbioinfo2001/dynprog/dynamic.html}.}

The algorithm is based on two steps: first, filling a matrix with the
maximal alignment scores for each element and then tracing a path
connecting the top-left and the bottom-right cell.  Note that the
matrix is $O(n\cdot{}m)$ memory-wise and therefore pretty inefficient, you
don't want to use this on bigger sequences.

In Haskell, a good way to implement Dynamic Programming like this is
an array that will memoize a lazy stream of scores per cell.  This
allows $O(1)$-lookup of formerly calculated values without losing
referential transparency and (to an extent) lazy evaluation.

% hack, hack
\setlength{\mathindent}{2.5em}

\begin{code}
import Array
\end{code}

|align| is the function that wraps all the functions below and calls
them in correct order.  It takes two strings and returns them aligned:

\begin{code}
align        :: String -> String -> [String]
align da db  = format $ reverse $ traceback lena lenb
    where
      lena = length da
      lenb = length db
\end{code}

% hack, hack
\setlength{\mathindent}{4.5em}

The algorithm is easier to express when the sequences to align are
one-indexed, since the borders of the matrix are used as special
values.  An easy way to achieve this is prepending a space:

\begin{code}
      a = ' ' : da
      b = ' ' : db
\end{code}

|memscore| is the array that contains the actual matrix.  It is filled
using a lazy stream of scores for each element.

\begin{code}
      memscore = listArray  ((0,0), (lena,lenb))
                            [score x y | x <- [0..lena], y <- [0..lenb]]
\end{code}

The scoring function looks very confusing since Haskell's array access
operator is not very elegant.  I'll introduce an infix operator $i$ |@@|
$j$ that corresponds to $M_{i,j}$:

\begin{code}
      infix 5 @@
      (@@) i j = memscore ! (i, j)
\end{code}

The score $M_{i,j}$ of each element is determined in below code as follows,
the borders of the matrix with $i=0$ and $j=0$ are initialized to zero.
(More complex scoring algorithms could be added easily.)

\[
M_{i,j} = \textrm{maximum of} \left\{%
\begin{array}{l}
M_{i-1,j-1} + S_{i,j} \\
M_{i,j-1} + w \\
M_{i-1,j} + w
\end{array}\right.
\]

The gap penalty $w$ is zero here for reasons of simplicity.

\begin{code}
      score 0 _ = 0
      score _ 0 = 0
      score x y = maximum [(  x-1  @@ y-1) + difference x y,
                              x-1  @@ y  ,
                              x    @@ y-1]
\end{code}

$S_{i,j}$ is a mismatch penalty defined here like this:

\[
S_{i,j} = \left\{%
\begin{array}{ll}
0 & \textrm{if the symbols at position $i$ and position $j$ match} \\
1 & \textrm{otherwise}
\end{array}\right.
\]

\begin{code}
        where difference x y  | a!!x == b!!y  = 1
                              | otherwise     = 0
\end{code}

|traceback| now finds the path connecting both corners of the matrix
and collects the appropriate symbols (or spaces for gaps).

\begin{code}
      traceback :: Int -> Int -> [(Char, Char)]
      traceback 0 0 = []
      traceback x y  | x     == 0       = (' '   , b!!y  ) : traceback  0      (y-1)
                     | y     == 0       = (a!!x  , ' '   ) : traceback  (x-1)  0
                     | x@@y  == x@@y-1  = (' '   , b!!y  ) : traceback  x      (y-1)
                     | x@@y  == x-1@@y  = (a!!x  , ' '   ) : traceback  (x-1)  y
                     | otherwise        = (a!!x  , b!!y  ) : traceback  (x-1)  (y-1)
\end{code}

The resulting list of tuples like |[('a', 'd'), ('b', 'e'), ('c', 'f')]|
gets converted by |format| into |["abc", "def"]|.
 
\begin{code}
      format l = [map fst l, map snd l]
\end{code}

Finally, a small main program to test the algorithm:
 
% hack, hack
\setlength{\mathindent}{2.5em}

\begin{code}
dna1 = "GAATTCAGTTA"
dna2 = "GGATCGA"

main = mapM_ putStrLn $ align dna1 dna2
\end{code}
%$
Expected output:

\begin{verbatim*}
G AATTCAGTTA
GGA T C G  A
\end{verbatim*}

As you can see, corresponding symbols are aligned, with appropriate
gaps in between.  Implementing more complex rules for alignment is
left as an exercise for the reader.

A run where bigger gaps are needed:

\begin{spec}
dna1 = "ATGGCTTCTACC"
dna2 = "TATCAAAAGCCG"
\end{spec}
\begin{verbatim*}
 ATGGCTTCTA    CC 
TAT  C    AAAAGCCG
\end{verbatim*}

\end{document}