Lab 5: Sequence alignment (with pointers)


This weeks lab is a practical test of algorithms based on “dynamic programming”, applied to one of the fundamental problems in bioinformatics.

Determining Similarity

One of the central problems pursued in bioinformatics is determining similarity between sequences of nucleotides or amino acids (the basic information that makes up our genetic structure). Similarity information is required to perform some typical tasks including:

  • Using a given sequence of amino acids to infer a protein’s shape and function, and
  • Finding all the genes and proteins in a given genome.

Usually, when given a particular sequence, you would search through one of the publicly availiable databases (visit, for example, the NCBI database for more information) and find multiple sequences that have varying degrees of similarity to the sequence in question. It is then up to the researcher to select the ones of most interest. In this lab we will focus on the underlying problem of determining similarity between two nucleotide sequences.

Exercise 5.1 (70%)

For this assignment you will write a program in Snap! which finds the optimal global alignment between two sequences of nucleotides.

Snap link: A project file with the framework for the assignment is available here.

The Global Alignment (Needleman-Wunsch) Algorithm

As described in the lectures, the global alignment algorithm is a dynamic programming algorithm. Given two sequences, at any point in the alignment we have three choices:

  • Align the two nucleotides
  • Introduce a gap in the first sequence
  • Introduce a gap in the second sequence

The three alignment options

So as we align our sequence, at each point the optimal alignment is given by the optimal alignment up to that point, plus whichever of the above choices gives us the greatest alignment score.

To achieve this we will construct a score matrix $F$ with the number of rows equal to the length of the first sequence plus one and the number of columns equal to the length of the second sequence plus one. Each element in this matrix represents the score of the optimal alignment up to that point in each sequence. To fill in the matrix we start at the top left element (inititalised to 0) and move left-right, top-bottom. At each element $F(i,j)$, the elements above ($F(i-1,j)$), to the top-left ($F(i-1,j-1)$) and to the left ($F(i,j-1)$) represent the decision to introduce a gap in the second sequence, align the two nucleotides, or introduce a gap in the first sequence respectively. So to fill in in the element $F(i,j)$ we use the formula

Score function

where $d$ is the penalty for introducing a gap and $s(x,y)$ is the score for aligning the two nucleotides.

This is illustrated in the following figure

How to compute a value in the score matrix

Each element $F(i,j)$ must not only store the score of the optimal alignment, but it must also store a pointer to the cell that it obtained the optimal value from, allowing you to trace the path that corresponds to the optimal alignment when the matrix is completed. Thus, if the finished score matrix is the following

A completed score matrix

the corresponding optimal alignment is

Optimal alignment

The optimal alignment is found by starting at the bottom right element in the matrix, and tracing back the steps taken to get there. This is done by following the pointers (arrows) stored at each element. Depending on the direction of the arrow do the following:

  • Arrow points up: Add the nucleotide for sequence 1 and a gap for sequence 2.
  • Arrow points diagonally: Add the nucleotide from both sequence 1 and sequence 2.
  • Arrow points left: Add a gap for sequence 1 and the nucleotide for sequence 2.

Try to reconstruct the aligned sequences above by following the arrows in the completed score matrix.

Implement the global alignment algorithm

The lab project provides a block which creates the score matrix and handles the visualisation. To initialise a new matrix, select (or type in) two sequences you wish to align and run the following blocks

initialise blocks

You should see the outline of the score matrix appear on the stage. Your task is to implement the global alignment algorithm, described above. Place your solution in the solution block. A basic stub of the block is provided to illustrate how to use the helper functions.

The three inputs to the solution block are the values to use for $s(x,y)$ and $d$ in the algorithm:

  • The first input (good alignment score) is the score for aligning two equal nucleotides, i.e., $s(x,y)$ when $x = y$.
  • The second input (misalignment penalty) is the penalty for aligning two unequal nucleotides, i.e., $-s(x,y)$ when $x \neq y$.
  • The third input (gap penalty) is the penalty for introducing a gap in either sequence, i.e., $-d$.

When you have completed and run your code, you can use the following blocks to display the optimal alignment corresponding to the computed score matrix:

show global alignment

Hints

  • You can use the set element block to set an element of the score matrix to a particular value. Note that when you do this, the pointer for that element is automatically set to null (a value indicating no arrow). It will also update the matrix shown on the stage.
  • To set the pointer for an element in the score matrix you can use the set pointer block. You should set the pointer to a list which contains the indices of the cell you are pointing to in the score matrix. The pointer (if set correctly) will also show up as an arrow between the entries in the matrix shown on the stage.
  • Remember to check the edges of the matrix: If $i = 1$, there is no row at $i - 1$, and likewise when $j = 1$, there is no column at $j - 1$.
  • You can use local variables, say “Max Score” and “Max Score Pointer”, to keep track of the best choice as you evaluate each one in turn. Calculate the score for each of the choices (skipping the ones that do not apply) and update both variables when you find a new best score. Don’t forget that you must initialise “Max Score” to a value that is lower than the score for any of the choices can be.

Testing

The lab project comes with some sample sequences to test your block on, four of length 7 and two shorter of length 5. The example score matrix above is for aligning two of them, GAATTCA and GGATCGA, with default scores. The optimal alignment has one mismatch (in the second position) and one gap in each sequence.

Try comparing all sequences of the same length pair-wise.

You should also try varying the scores. If you keep the gap penalty at 1 and increase the misalignment penalty to any value greater than 2, you should get alignments that have no mismatches (because a mismatch can always be avoided by introducing two gaps). Verify that your implementation does this! On the other hand, if you increase the gap penalty while keeping the misalignment penalty at 1 you should see fewer and fewer gaps.

Exercise 5.2 (30%)

The block you have implemented so far computes the best alignment between two complete sequences. Sometimes, however, the objective is to find similar subsequences in two given sequences. Fortunately, only a few small modifications are needed to compute an optimal local alignment. (This is known as the Smith-Waterman algoritm.)

  • For local alignment, the maximum score function is given by

    The local alignment scoring function

    In other words, an alignment between two subsequences with a negative score is never optimal, because in that case a better alignment can be made between shorter subsequences.

  • When the score matrix is completed and we wish to find an optimal local alignment, we need to start tracing back from the largest score in the matrix (instead of the bottom right corner) and trace back until we reach a score of 0 which is where we stop.

Your task is to implement the local alignment algorithm, in the block called "Solution for Local Alignment" . (Remember that you can use the “duplicate” function to copy your earlier implementation so that you don’t have to rebuild it from scratch.) To find and display the optimal local alignment after you have computed the matrix, use the following blocks:

"find optimal local alignment in (score matrix)" .

Testing

You can use the same test cases as for the global alignment. However, local alignment is more sensitive to the score given to a good aligment (match): The higher it is, relative the the misalignment and gap penalties, the more willing we are to accept a worse alignment of longer subsequences. As an example, aligning the sequences GCATGCT and GGATCGA with good alignment scores of 2 and 1, keeping both penalties at 1, gives subsequences of different lengths.

Assignment submission

To submit your assignment, you must export blocks (as described on the lab 1 page. You should export the blocks that you defined in this lab (“Solution”, “Solution for Local Alignment” and “Show optimal local alignment”), along with any other custom you created that are used by those.

Submit the file with the saved blocks through wattle.

Test before you submit! Always, test your submission file – to make sure everything that should be included is indeed included!

To do so:

  • Start a fresh chrome/chronium session/
  • Load the Lab 5 Snap project.
  • Use the “import … ” function from Snap to load your submission file, containing your exported blocks.

Now run these blocks – does you solution still work?

12 March, 2016
1544 words


Written by
Tags
labs

Quick links

2016 S1 lectures schedule:
  • Monday 1-2pm (RS Chem T)
  • Tuesday 9-10am (ENGN T)
  • Thursday 12-1pm (JD101)

Note: TBD.

Quick-access class pointers: TBC