Mapping reads to a genome with LAST
===================================

LAST has many adjustable parameters, providing many ways of mapping
reads to a genome.  We cannot tell you which way is best, but here are
some ideas that might be helpful.

1. A reasonable mapping procedure
---------------------------------

This procedure has worked well for us with different kinds of reads
(Illumina and FLX Titanium):

  lastdb -m1111110 -s20G genomedb genome/chr*.fa
  lastal -Q1 genomedb reads.fastq

If your reads are in plain FASTA format (without quality data), then
omit "-Q1".  Beware, however, that this changes the default score
parameters: without -Q they are tuned for long, weak genome/genome
alignments, and with -Q they are tuned for short, strong read/genome
alignments.  To use the latter parameters with FASTA reads, set them
explicitly:

  lastal -r6 -q18 -a21 -b9 -e180 genomedb reads.fa

Quality data improves mapping accuracy, as long as the quality data
itself is not too inaccurate.

2. Dealing with multi-mapping reads
-----------------------------------

Often, one read will align to more than one genome location.  You can
use last-map-probs.py in the scripts directory to help judge which
location the read really maps to.  This script calculates a mapping
probability for each alignment.  For example, if one read aligns to
two locations with identical scores, then the probabilities will be
50:50.
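
As a rough illustration of what such a mapping probability means, here
is a minimal Python sketch.  It assumes that each alignment is
weighted by exp(score / temperature) and that the weights are then
normalized; both the weighting and the "temperature" parameter are
assumptions made for this example, and are not taken from
last-map-probs.py itself.

```python
import math


def mapping_probabilities(scores, temperature=1.0):
    """Turn the alignment scores of one read into probabilities.

    Assumption for this sketch: each alignment is weighted by
    exp(score / temperature), and the weights are normalized to sum to 1.
    """
    best = max(scores)
    # Subtract the best score before exponentiating, to avoid overflow.
    weights = [math.exp((s - best) / temperature) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# One read aligned to two locations with identical scores.
equal = mapping_probabilities([200, 200])

# One read aligned to two locations with different scores.
unequal = mapping_probabilities([200, 194], temperature=3.0)
```

With identical scores the two probabilities are equal, as in the 50:50
example above; with different scores, the higher-scoring location gets
the higher probability.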

3. Very short reads
-------------------

The default score parameters do not align reads shorter than 30 bp,
because the match score is 6 and the score threshold is 180: even a
perfect match needs 180 / 6 = 30 bases to reach the threshold.  To
align shorter reads, reduce the score threshold (lastal -e).

If the score threshold is too low, you will get meaningless, random
alignments.  (You might also need to adjust lastal -d, which defaults
to e*3/5: if it is too low, lastal might be slow.)
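
The arithmetic behind these numbers can be sketched in Python as
follows.  The function names are made up for this example, and the use
of integer division for the -d default is an assumption: the text
above only says that it defaults to e*3/5.

```python
def min_alignable_length(score_threshold=180, match_score=6):
    """Shortest perfectly matching read that reaches the score threshold."""
    # Ceiling division using integers only.
    return -(-score_threshold // match_score)


def default_d(score_threshold):
    """Default for lastal -d, taken as e*3/5 (integer division assumed)."""
    return score_threshold * 3 // 5


shortest_default = min_alignable_length()       # default parameters
shortest_e120 = min_alignable_length(120)       # after lowering -e to 120
d_default = default_d(180)
```

So, for example, lowering the score threshold to 120 would allow
perfectly matching reads of 20 bp or longer to be aligned.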

4. How does this alignment procedure work, and what are its limitations?
------------------------------------------------------------------------

If you want to understand how this alignment procedure works in more
detail, read on.  LAST uses a two-step approach: first find initial
matches, then extend alignments from these matches.  The "initial
matches" are: all matches of any part of a read to the genome, of any
length, where the match occurs at most ten times in the genome.

One consequence is that repetitive reads will not be mapped: if a read
perfectly matches more than ten locations in the genome, it gets
dropped at the first step.
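
The following Python sketch is a toy illustration of this first step.
It is not how LAST is implemented: it uses brute-force string
searching, it looks at one strand only, and for each start position in
the read it reports just the shortest match that occurs at most ten
times.  The function names and the tiny "genome" are made up for this
example.

```python
def count_occurrences(genome, pattern):
    """Count occurrences of pattern in genome, including overlapping ones."""
    count, start = 0, genome.find(pattern)
    while start != -1:
        count += 1
        start = genome.find(pattern, start + 1)
    return count


def initial_matches(read, genome, max_occurrences=10):
    """For each start position in the read, find the shortest match that
    occurs at least once but at most max_occurrences times in the genome.
    Returns a list of (start, length, occurrences) tuples."""
    matches = []
    for start in range(len(read)):
        for end in range(start + 1, len(read) + 1):
            n = count_occurrences(genome, read[start:end])
            if n == 0:
                break  # a longer substring cannot match either
            if n <= max_occurrences:
                matches.append((start, end - start, n))
                break
    return matches


genome = "ACGT" * 20 + "TTGACCA"     # a repetitive toy genome
repetitive_read = "ACGTACGTACGT"     # every part occurs more than 10 times
unique_read = "CGTTTGACC"            # overlaps the unique end of the genome

dropped = initial_matches(repetitive_read, genome)
kept = initial_matches(unique_read, genome)
```

Here the repetitive read yields no initial matches at all, so it would
be dropped, whereas the other read yields initial matches from which
alignments could be extended.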

The main limitation is that this procedure is not guaranteed to find
all alignments with score >= 180.  It is more likely to miss
alignments that have uniformly-spaced mismatches/gaps, because these
leave only short exact matches, which are more likely to occur more
than ten times in the genome.  It is less likely to miss alignments
with mismatches/gaps concentrated at the ends.  We think it does a
good job in practice.

5. Diagnosing repetitive reads
------------------------------

We can detect repetitive reads as follows:

  lastal -j0 -l30 genomedb reads.fastq

Here, -j0 tells lastal to just report counts of initial matches.  In
this case, there is no limit on how often the matches occur: matches
that occur more than ten times in the genome are counted too.  The
-l30 option requests matches of length >= 30 only: this makes it
faster and makes the output smaller.  (Without -l30, it counts all
matches of length >= 1: this is still quite fast.)

6. Guaranteeing to map reads with up to N mismatches
----------------------------------------------------

Suppose we wish to find all matches of length-36 reads to the genome,
allowing up to two mismatches.  A naive approach is to start by
finding all exact matches of length 12, and extend alignments from
these.  This works because any length-36 read with two mismatches is
guaranteed to have an exact match of length 12.  It will be very slow,
however, because there will be many unproductive length-12 matches.
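
The guarantee is a pigeonhole argument: two mismatches split the 34
matching positions into at most three runs, so the longest run has at
least 34 / 3 (rounded up) = 12 positions.  The following Python sketch
checks this both by formula and by brute force; the function names are
made up for this example.

```python
from itertools import combinations


def guaranteed_exact_match(read_len, mismatches):
    """Pigeonhole bound: ceil((read_len - mismatches) / (mismatches + 1))."""
    return -(-(read_len - mismatches) // (mismatches + 1))


def worst_case_longest_run(read_len, mismatches):
    """Try every placement of the mismatches, and return the smallest
    value that the longest run of matches can take."""
    worst = read_len
    for positions in combinations(range(read_len), mismatches):
        longest, prev = 0, -1
        # Append read_len as a sentinel, to close the final run.
        for p in positions + (read_len,):
            longest = max(longest, p - prev - 1)
            prev = p
        worst = min(worst, longest)
    return worst


by_formula = guaranteed_exact_match(36, 2)
by_brute_force = worst_case_longest_run(36, 2)
```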

We can do better by finding matches using a spaced seed, and then
extending alignments.  For example, our reads are guaranteed to have a
match using this spaced seed pattern: 11111011000111110110001111.
Since this seed has 18 matched positions (18 "1"s), we will get far
fewer unproductive matches.  With LAST, we can do this as follows:

  lastdb -m11111011000 mydb genome.fa
  lastal -l26 -m4000000000 -j1 -q0 -d34 -n100 mydb reads.fa

In the lastdb command, the seed pattern gets cyclically repeated, so
we only need to specify the repeating unit of the pattern.  In the
lastal command, we used -l26 to get length-26 initial matches, and
-m4000000000 to accept hugely repeated initial matches.  We also used
-j1 to request gapless alignments, -q0 to set the mismatch cost to 0,
and -d34 to request alignments with score >= 34.  Finally, -n100
limits the output: if one read has many matches, an arbitrary sample
(at least 100) of them will be returned.
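
To make the guarantee concrete, here is a small Python sketch that
expands the repeating unit into the full seed pattern, and then checks
by brute force that every placement of up to two mismatches in a
length-36 read leaves at least one seed match.  It is only an
illustration written for this document: it is not the software that
was used to design the patterns, and the function names are made up.

```python
from itertools import combinations


def expand_seed(unit, length):
    """Cyclically repeat the pattern unit up to the given match length."""
    repeats = length // len(unit) + 1
    return (unit * repeats)[:length]


def seed_is_lossless(seed, read_len, max_mismatches):
    """True if, for every placement of up to max_mismatches mismatches,
    the seed fits somewhere in the read with no mismatch under a '1'."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    offsets = range(read_len - len(seed) + 1)
    for k in range(max_mismatches + 1):
        for mismatches in combinations(range(read_len), k):
            bad = set(mismatches)
            if not any(all(o + i not in bad for i in care) for o in offsets):
                return False
    return True


seed = expand_seed("11111011000", 26)
spaced_ok = seed_is_lossless(seed, 36, 2)
contiguous12_ok = seed_is_lossless("1" * 12, 36, 2)
contiguous18_ok = seed_is_lossless("1" * 18, 36, 2)
```

In this check, the spaced seed and the contiguous length-12 seed both
pass, whereas a contiguous seed with 18 matched positions fails.  So
the spaced seed gets the selectivity of 18 matched positions without
losing the guarantee.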

The following table shows optimal spaced seed patterns for various
read sizes and numbers of mismatches.  Each entry shows the match
length (e.g. 26) and the pattern (e.g. 11111011000).

====  ===========  ================  ==================  ======================
Read  1 mismatch   2 mismatches      3 mismatches        4 mismatches
size
====  ===========  ================  ==================  ======================
16    12 11110     10 1110100         9 11010000          3 1110
17    13 11110     10 1110100        10 11010000          5 1110
18    14 11110     12 1110100        10 11010000          5 1110
19    14 11110     12 1110100        12 11010000          5 1110
20    16 11110     12 1110100        12 11010000         11 1100010000
21    17 11110     15 1110100        12 11010000         12 1100010000
22    18 11110     16 1110100        13 1110100000       12 1100010000
23    19 11110     17 1110100        13 11101001000      12 1100010000
24    19 111110    17 1110100        14 11101001000      12 1100010000
25    20 111110    19 1110100        14 11101001000      16 1100010000
26    21 111110    19 1110100        16 11101001000      16 1100010000
27    22 111110    19 1110100        16 11101001000      15 1110100000000
28    23 111110    22 1110100        16 11101001000      16 1110100000000
29    23 111110    23 1110100        19 11101001000      16 1110100000000
30    25 111110    24 1110100        19 11101001000      18 1110100000000
31    26 111110    24 1110100        19 1110110100000    18 1110100000000
32    27 111110    26 1110100        18 111101011001000  18 111010010000000
33    28 111110    26 1110100        19 111101011001000  18 111010010000000
34    29 111110    22 1111101110010  19 111101011001000  20 111010010000000
35    29 1111110   25 11111011000    21 111101011001000  20 111010010000000
36    30 1111110   26 11111011000    21 111101011001000  20 11110010000001000
37    31 1111110   27 11111011000    23 111101011001000  21 11110010000001000
38    32 1111110   27 11111011000    24 111101011001000  21 11110010000001000
39    33 1111110   29 11111011000    24 111101011001000  21 11110010000001000
40    34 1111110   30 11111011000    24 111101011001000  24 11110010000001000
41    34 1111110   30 11111011000    27 111101011001000  24 11110010000001000
42    36 1111110   30 1111110101100  27 111101011001000  24 1101110100000010000
43    37 1111110   31 1111110101100  27 111101011001000  25 1101110100000010000
44    38 1111110   32 1111110101100  32 1110110100000    25 1101110100000010000
45    39 1111110   32 1111110101100  31 111101011001000  27 1101110100000010000
46    40 1111110   34 1111110101100  32 111101011001000  28 1101001110100000000
47    41 1111110   35 1111101110010  33 111101011001000  28 1101001110100000000
48    41 11111110  35 1111101110010  34 111101011001000  30 1101001110100000000
49    42 11111110  37 1111110101100  34 111101011001000  30 1101001110100000000
50    43 11111110     ?              36 111101011001000  30 1101001110100000000
====  ===========  ================  ==================  ======================

This table was made using software kindly provided by the authors of
these publications:

* G Kucherov, L Noé, M Roytberg (2005) IEEE/ACM Trans Comput Biol
  Bioinform 2:51-61.
* S Burkhardt, J Kärkkäinen (2003) Fundamenta Informaticae 56:51-70.

For longer reads, it becomes harder to determine the optimal patterns.

7. Merging identical read sequences
-----------------------------------

Suppose we have read sequences in a FASTA file called "reads.fa":

  >readA
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >readB
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >readC
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >readD
  GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG

If there are many identical sequences, we can speed up the alignment
by merging them.  The following Unix pipeline merges identical
sequences (assuming each sequence is all on one line):

  grep -v '>' reads.fa | sort | uniq -c | awk '{print ">" NR ":" $1 "\n" $2}'

The output of this command looks as follows:

  >1:3
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >2:1
  GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG

The number after the colon is the count of the read, and the number
before the colon is just a serial number.
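
If a Unix shell is not available, the same merging can be sketched in
Python.  Like the pipeline, this sketch assumes that each sequence is
all on one line; the function name is made up for this example.

```python
from collections import Counter


def merge_identical_reads(fasta_lines):
    """Merge identical sequences, naming each one 'serial:count'."""
    sequences = [line.strip() for line in fasta_lines
                 if line.strip() and not line.startswith(">")]
    counts = Counter(sequences)
    merged = []
    # Sort the distinct sequences, as "sort | uniq -c" does.
    for serial, seq in enumerate(sorted(counts), start=1):
        merged.append(f">{serial}:{counts[seq]}")
        merged.append(seq)
    return merged


reads = [
    ">readA", "GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA",
    ">readB", "GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA",
    ">readC", "GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA",
    ">readD", "GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG",
]
merged = merge_identical_reads(reads)
```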

8. How does lastal use sequence quality data?
---------------------------------------------

The quality scores have no effect on finding initial matches, but they
do affect the alignment scores.  If quality scores are used, the
default alignment scoring scheme is +6 for a high-quality match and
-18 for a high-quality mismatch.  Low-quality matches and mismatches
get scores between these values, as shown in the following table.

======  ========  ========      ======  ========  ========
Solexa  Match     Mismatch      Phred   Match     Mismatch
score   score     score         score   score     score
======  ========  ========      ======  ========  ========
]  29       6       -18         >  29       6       -18
\  28       6       -17         =  28       6       -17
[  27       6       -17         <  27       6       -17
Z  26       6       -17         ;  26       6       -17
Y  25       6       -17         :  25       6       -17
X  24       6       -17         9  24       6       -17
W  23       6       -17         8  23       6       -17
V  22       6       -16         7  22       6       -16
U  21       6       -16         6  21       6       -16
T  20       6       -15         5  20       6       -15
S  19       6       -15         4  19       6       -15
R  18       6       -14         3  18       6       -14
Q  17       6       -14         2  17       6       -14
P  16       6       -13         1  16       6       -13
O  15       6       -13         0  15       6       -12
N  14       6       -12         /  14       6       -12
M  13       6       -11         .  13       6       -11
L  12       6       -10         -  12       6       -10
K  11       6       -10         ,  11       6        -9 
J  10       6        -9         +  10       6        -8 
I   9       5        -8         *   9       5        -7 
H   8       5        -7         )   8       5        -7 
G   7       5        -6         (   7       5        -6 
F   6       5        -6         '   6       5        -5 
E   5       5        -5         &   5       4        -4 
D   4       5        -4         %   4       4        -3 
C   3       4        -3         $   3       3        -2 
B   2       4        -3         #   2       2        -1 
A   1       3        -2         "   1      -1         0  
@   0       3        -2         !   0     -18         1  
?  -1       2        -1
>  -2       2        -1
=  -3       1        -1
<  -4       1         0
;  -5       0         0
   -6      -1         0
   -7      -2         0
   -8      -3         1
   -9      -3         1
  -10      -4         1
  -11      -5         1
  -12      -6         1
  -13      -7         1
  -14      -8         1
  -15      -9         1
  -16     -10         1
  -17     -10         1
  -18     -11         1
  -19     -12         1
  -20     -13         1
  -21     -13         1
  -22     -14         1
  -23     -15         1
  -24     -15         1
  -25     -16         1
  -26     -16         1
  -27     -16         1
  -28     -17         1
  -29     -17         1
  -30     -17         1
  -31     -17         1
  -32     -17         1
  -33     -17         1
  -34     -18         1
======  ========  ========      ======  ========  ========
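
The quality symbols in this table can be decoded with a short Python
sketch.  The ASCII offsets (64 for the Solexa column, 33 for the Phred
column) are inferred from the symbols shown in the table, and the
error probabilities use the standard Phred and Solexa definitions.
This sketch does not reproduce lastal's calculation of the match and
mismatch scores; the names are made up for this example.

```python
SOLEXA_OFFSET = 64  # inferred from the table: "@" means score 0
PHRED_OFFSET = 33   # inferred from the table: "!" means score 0


def decode_quality(symbol, offset):
    """Convert a quality symbol to its numeric quality score."""
    return ord(symbol) - offset


def phred_error_probability(q):
    """Standard Phred definition: q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def solexa_error_probability(q):
    """Standard Solexa definition: q = -10 * log10(p / (1 - p))."""
    return 1 / (1 + 10 ** (q / 10))


top_solexa = decode_quality("]", SOLEXA_OFFSET)
top_phred = decode_quality(">", PHRED_OFFSET)
low_solexa = decode_quality(";", SOLEXA_OFFSET)
```

Under these definitions, a Phred score of 0 means an error probability
of 1, whereas a Solexa score of 0 means an error probability of 0.5.
This is why Solexa scores can be negative.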
