Computational Biology

Joshua Gregory

1 - Introduction

Data

Biology is big data
10² to 10¹² numbers
Lots of information stored in big databases

Interactions

Protein-protein common
Protein-DNA for regulation
Protein-environment how cell senses the world
Ecosystems

Self-Organise

Protein complexes form machines
Life can organise into cells
Cells can organise into higher organisms
Organisms form populations that act together
Back to ecosystems

Dynamic

Cells chemical machines
Complex auto-regulation
Cells divide
Can fuse sexually
Life evolves
Almost all aspects of life governed by mechanisms

Comp Bio

Comp bio is big data problem
Tools in compsci essential to

Store the data
Make inference about the data

Tools of abstraction apply to biology too

Evolution

Charles Darwin
Natural Selection
Still being understood

Tree of life

Life being related through tree like structure
Molecularly, species share a huge amount in common
Transfer knowledge between species

Inheritance

Not generally understood in Darwin’s time
Modern theory by Gregor Mendel
Finally widely understood it occurred through exchange of discrete units (genes)

Mendel’s Peas Experiment

Inheritance

Experiments explained if:

Traits carried by discrete units (genes)
Each individual carries two copies of a gene (diploidy)
We inherit one copy from each parent
One gene is dominant

These not always true
Rigged experiment? Mechanism remained confused until DNA

DNA

Deoxyribose Nucleic Acid
Crick and Watson
Discovery of double helix

Base Pairs

Double helix held together by 4 types of bases: A, T, G, C
Each base form base pairs A-T or C-G

DNA Replication

What does DNA do?

Part codes for proteins

Gets copied to messenger RNA through transcription
mRNA then gets translated to proteins

A lot does not code for protein

Can code for RNA molecules (ribozymes)
Used for controlling the cell
Can be junk - caused by replication
Or we’re not quite sure…

RNA and DNA

The base U is used in RNA instead of T

Transcription by RNA Polymerase

DNA unwound, mRNA encoded, DNA rewound

Exons and Introns

Exons are regions in DNA that will encode the RNA
Introns are the noncoding regions

mRNA to Protein Translation

Ribosome moves across the mRNA strand
The groups of 3 nucleotides (codons) act as the code for amino acid molecules

Can have multiple codes for the same acid

The amino acids attached together in the right order
Then the strand of acids form the protein

DNA nucleotides -> mRNA nucleotides -> amino acids -> protein

RNA Triplet to Amino Acids

Transcription: DNA -> RNA
Translation: RNA -> Protein

To help remember(?):

TranscRiption - creating RNA, transcribing the DNA to RNA
Translation - translating the language of RNA to the language of proteins

What do proteins do?

Main building block of cells
Form skeleton (lipid layer of membranes)
Act as chemical machines (enzymes)
Act as mechanical machines
Large part of control mechanism
Not everything in cell is protein

There is DNA and RNA
Rna can fold up like proteins and act as catalysts (ribosymes)

Computational challenges

A lot of challenges when trying to understand genome
Need to find coding regions (Exons)
Need to reconstruct DNA sequence from sequencing data
Need to compare different sequences

2 - Cell Regulation

Cells

Complicated machines
Still on-going understanding

Prokaryotes and Eukaryotes Cell Types

Share cell membrane, DNA code, ribosome (machine for producing proteins)

Eukaryotes

Has a true nucleus
Multiple linear chromosomes in the nucleus
Larger and more complex
Formed by fusing different cells
Can be multicellular
e.g. plant and animal cells

Prokaryotes

Do not have a nucleus
Single DNA molecule in nucleoid
Smaller
Usually unicellular
e.g. bacteria cell

Understanding proteins

Hard to understand and sequence directly
Deduce their sequence from DNA or mRNA
Measuring protein concentration difficult:

Field of proteomics
Uses mass spectrometers to count

Protein Folding

Understand their 3D shape
Difficult
Crystallise them and use X-ray diffraction
Predicting structure is non-local problem

(Primary Structure)

The sequence/order of amino acids

Secondary Structure

Main components make up proteins:

Alpha helices
Beta sheets
Various folds and connecting parts

Folding

Protein folding is hard because typically amino acids apart will interact through hydrogen bonding
RNA also folds (more flexible than DNA)
RNA folding easier to solve
Look for matching sub-sequences

RNA Folding

The First Law of Protein Sequence

Similar structures perform similar functions

Learn about function of protein by comparing it with others
Evolution means many proteins are common to huge number of organisms
Converse not necessarily true: dissimilar structures can have similar function

Sequence Alignment

Gene sequences aligned and similarities compared
e.g. mouse gene to shark gene

Finding Aligned Sequences

Compare new sequences with ones already found
Compare with large database
They are usually annotated indicating what someone thinks a protein does in some organism

Cell Regulation

Cells must respond to environment, maintain their environment, divide
Need control mechanism, proteins and DNA help with this

Promoter Region

Basically this is a region in the DNA which helps regulate the levels of proteins in the cell

Regulation by Signal

Regulation in Promoter Region

Many different proteins can bind to the promoter region
They can either excite the expression of the protein (by increasing the probability of RNA-polymerase binding)
Or inhibit the expression of the protein

Binding Sites

TATA Box

Close to the coding regions is a region where are more A-T pairs than usual
Because A-T pairs have only 2 hydrogen bonds, they are easier to break
This helps RNA-polymerase split open the double helix

Gene Regulation Networks

Proteins can regulate their own expression and many other proteins
Protein A can regulate protein B which in turn regulates protein A
In general regulation forms complex circuit

Motifs

To understand what’s going on we can look for patterns or motifs
Spot common motifs by askings what patterns happen more than by chance
Synthetic biology builds biological circuits from these motifs

Dynamics

Gene regulation networks are dynamic
To understand: simulate the network
Requires knowledge of rate at which molecules are being produced

Multicellular

Many Eukaryotes are multicellular
Many cases the cells differentiate
How do they know what role they play?
How do organisms develop so cells do the right thing?
You’ll find out later

3 - Sequencing

What is it?

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA.

Sequencing Revolution

Advances in DNA sequencing
Reading DNA took a while to take off since structure discovery in 1950s
Francis Crick had speculated on how DNA might code for proteins
By removing some number of bases from DNA he was able to show that DNA has a three letter code

Frederick Sanger

British biochemist
1995 discovered protein sequence for Insulin
Showed proteins actually chains

Sequencing DNA

Chemically difficult as bases not distinctive
1965 when first strands of RNA read
Copying DNA and breaking into smaller sequences was effort
Various techniques developed in 1970s
1977 “Sanger Sequencing”

Sanger Sequencing

Use biological mechanisms for copying DNA
Would add Dideoxynucleoties (ddNTP) to a DNA chain

Modified ddNTP

Developed with a missing 3’ hydroxyl group which prevented chain from developing any further
These marked (with radioactive phosphorus or later with fluorescent marker)
DNA replication carried out with normal ddNTPs and a weak concentration of these modified ddNTPs
Initially done in 4 containers, one for each base (in container A only base A ddNTP were modified)
Later different fluorescence were used so everything could be carried out in the same vessel

* means modified ddNTP
When DNA replicating, the chain would stop at these * points
You could identify these points

Fractionation

Each solution added to a layer of gel
Using capillary electrophoresis different length DNA fragments are sorted
From the distance travelled and the colour of the last nucleotide base we can read off the sequence

Shotgun Sequencing

Saves careful preparation

Reconstruction

Looks easy but genome very repetitive
Sanger only allows you to sequence “reads” of around 1000bp (base pairs)
Not clear what long range ordering is
Larger genomes, “scaffold” built where position of some genes knew
Sequence both dies of large fragment so you know approximate relationship

Human Genome Project

Sanger Sequencing helped human genome project
Costs were predicted high
Still don't know enough to provide all benefits

Improving Sanger

Costs a lot
Improve with massive parallelisation
Many techniques used

Sequencing Techniques

Sanger: sequencing-by-synthesis

Using DNA-polymerase a strand of DNA is replicated
It turns out this technique ultimately triumphant

Another approach is sequence-by-ligation (or hybridisation)

Create lots of short sections of DNA (oligonucleotides) that will stick to an unbound stretch of DNA (or RNA)
Use in microarrays

Synthesis vs Ligation

454 Sequencer

First high throughput sequencer was Roche 454 machine
Amplified fragments of DNA attached to a microbead
These are contained in mirco-wells
Sequencing-by-synthesis is carried out
Added one type of base at a time with fluorescent label
Can have difficulty distinguishing regions with repeated base

Water in Oil Emulsion PCR

Illumina

The dominant next generation machines are produced by Illumina (formally Solexa)
Uses technique called bridge amplification to create clusters of identical strands on a 2-D surface (substrate)
Can sequence 100s of million clusters at once
It then uses sequencing-by-synthesis to read the sequence
Also read the other end of the sequence by flipping

Bridge Amplification

Identifying Sequence

? low res :(

Cost Per Human Genome

2007, 2015 jump
Surpassed Moore’s Law

100,000 Genome Project

Currently NHS and Sanger centre sequencing around 70,000 human genome to identify rare genetic diseases
Can now sequence a human genome within a day
As already sequenced human genome, no longer have difficulty reconstructing sequence

Single Molecule DNA Sequencing

Replicating DNA can bring errors and biases
Most high-throughput techniques only read short sections (around 100bp), reconstruction difficult
Current research to create machine that can read long single DNA strand

Zero Mode Waveguide

Uses same fluorescence dNTPs as used in high-throughput processing
Use micro-well smaller than the wavelength of light
Able to detect very small signal
Not high-throughput

Nanopore Technology

Force DNA through tiny pore and read out the sequence
Hand-held machines
Currently accuracy not quite good enough

4 - Sequence Re-assembly

DNA Sequencing

In shotgun genome sequencing, full genome broken into small pieces
Pieces then read by a sequencing machine
Reads short sections (100-1000) bases
The reads are then assembled to construct the full genome

Overlapping the pieces to reconstruct. Need algorithms to do this

Reads and Contigs

We are given set of reads (DNA segments)
We need to reconstruct them into contigs (set of overlapping DNA segments that form a consensus region of DNA)
Problem as (0.1-2%) errors in reads
Shotgun important in speeding up human genome project
What data structure to use?

Digital Trees or Tries

A Trie (pronounced: “try”) or digital tree is a multiway tree data structure often used for storing large sets of words
They are trees with a possible branch for every letter of an alphabet
Their names comes from retrieval
Tries usually compactify the edges in the tree
All words end up with a special letter “$”

Trie example

General Use of Tries

Implementation of sets
Quick insertion, deletion and find operations
Typically considerably quicker than binary trees and hash tables
Good for spell checkers, completion algorithms, longest-prefix matching, hyphenation
Each search finds longest match between the words in the set and the query

Tries for Sequence Analysis

Tries can be used to store DNA or proteins
For reconstruction of DNA tries are very useful for finding the next fitting section
Useful when there are errors, as by searching small sub-tree you can find good match
For DNA you have a four-way (5 way?) tree, which means your tree is deep but it doesn’t waste so much memory

Example table

Disadvantages of Tries

Table-based typically waste large amounts of memory
Often table-based used for first few layers, while lower levels less memory intensive structure
These days memory less of a problem, so table used
Many implementations of tries, each suited to particular task
For DNA, tries not too wasteful (maybe though for proteins)

Suffix Tree

Suffix tree is a trie of all suffixes of a string
E.g. banana

Importance of Suffix Tree

Fast algorithms to compute suffix trees (and the related suffix array) are relatively new data structures
Important for string-based algorithms
Classic application is in finding a match for a query string, Q, in a text, T

String Matching

To find a match of a query string, Q = an, in a text, T = banana, we can first construct the suffix tree of the string T
We then simply look up the query, Q, using the trie

(Image above)

Complexity of Suffix Tree

Using a regular trie for a suffix tree would typically use far too much memory to be useful
However, by using pointers to the original text it is possible to build a suffix tree using O(n) memory where n is the length of the text
Furthermore there is a linear time (O(n)) algorithm to construct the trie
The algorithm is not however trivial to understand

Uses of Suffix Trees

Suffix trees are efficient whenever it is likely that you will do multiple searches
Exact word matching is in itself is a very important application
Suffix trees in combination with dynamic programming can be used to do inexact matching (finding the match with the smallest edit distance)
Suffix trees get used in bioinformatics, advanced machine learning algorithms,...
(Prefix trees also exist as well lol)

New Generation Sequencers

Estimated cost in 2005 of genome sequencing was $10,000,000
To reduce cost, produce new sequencing machines
Read shorter sections of DNA (20-100bp)
Can these be re-assembled?

Repeats

Difficulty of assembly is caused by repeates

How many repeats are there in human genome? (3.2 billions base pairs)
Can read DNA in either directions, meaning effectively 6.4 billion bases
Important question for developing new technologies

Repeats in Human Genome

Repeats Structure

Computing Repeats

A naive program would take n² operations where n = 6.4*10⁹
This is too long
Couldn’t answer this question a few years ago, not because of computer power, but as algorithms not been developed
However using suffix arrays counting repeats was made easy

5 - Sequence Matching

Alignment and Evolution

We want to transfer our knowledge about the function of a protein through knowledge of similar proteins in different organisms
Most proteins found in organism are close to those of near relatives and sometimes not so near relatives

Practical Alignment

Alignment with known genomes is now the first task after sequencing
Done with protein and DNA
Mutation rate sufficiently high that expect no alignment between most species except if there is selective pressure to conserve the structure
For DNA coding proteins there are a number of “synonymous mutations” that leave the protein unchanged - DNA is less conserved than proteins
In proteins many amino acids behave similarly so there is little selection pressure against mutations

Synonymous Mutations

Some amino acids are rarer than others
But some codon sequences are slightly more popular than their synonymouse partners
They may be more robust to mutation
They may be slight biases due to the translation, etc.

Amino Acid Similarity

Most important distinguishing property is whether they are hydrophobic (oily) or hydrophilic (charged)
Other properties such as hydrogen bonds, etc.
Many observations can compute a table giving log-odds of one amino acids being substituted by another
Two commonly used substitution tables (matrices) are PAM and BLOSUM
Likely substitutions are positive, unlikely ones are negative

Blosum50

Each letter represents an amino acid

Gaps

It is not infrequent that we find good matches where there has been an insertion or deletion of some part of the string
We start by considering a linear cost where the cost of a gap is proportional to its length
In biology, a linear cost model is a poor description of reality
Where there is an insertion, then it is often quite long (this is a consequence of the copy mechanism in cells)

Inexact Sequence Matching

Given 2 sequences we want to find the best match between them
We can measure the distance in terms of an edit distance or Levenshtein distance which measure the minimum number of moves to go from one sequence to another
There are two cases that are commonly considered

Global matching: where we match the two complete strings
Local matching: we find the sub-string with the best match

Algorithms for Matching

Need efficient algorithms for matching
Number of possible inexact matches is exponential in the size of the string so we can’t exhaustively try all matches
Fortunately dynamical programming provides a relative efficient algorithm
Dynamic programming is very important to a huge number of sequence-based algorithms

Dynamic Programming

Powerful strategies for solving optimisation problems is dynamic programming

Build set of optimal partial solutions
Increase size of the partial solutions until have full solution
Each step uses the set of optimal partial solutions found in the previous step

(name bad as doesn't actually have much to do with programming)

A Toy Problem

Find minimum cost path from (0,0) to (8,0) though this:
Costs in red
Cost of path is sum of weights on each link

Brute Force

Try every path
Problem with n steps we require n/2 to be diagonally up and n/2 to be diagonally down
The total number of paths is:
With n = 8 there are 70 paths
With n = 100 there are 1.01*10²⁹ paths

Building a solution

Solve problem efficiently using dynamic programming by considering optimal paths of shorter length
Let c(i,j) denote cost of optimal path to node (i,j)
We denote the weights between two points on the lattice by:
We define initial cost c(0,0) = 0

Forward Algorithm

Suppose we know the optimal costs for all the edges in column i
Our task is to find the optimal cost at column i +1
If we consider the sites in the lattice then the optimal cost will be:
(minimum of either the total cost going up, or total cost going down??)
This is the defining equation in dynamic programming
We have to treat boundary sites specially, but this is just book-keeping

Example

Backward Algorithm

Having found optimal costs c(i,j) we can find the optimal path starting from (n,0)
At each step we have a choice of going up or down
We choose the direction which satisfies the constraint
If both direction satisfy the constraint we have more than one optimal path

Time Complexity

In our solution we had to compute the cost c(i,j) at each lattice point
There were (n/2 + 1)² lattice point
It took constant time to compute each cost so the total time to perform the forward algorithm was Θ(n²)
The time complexity of the backward algorithm was Θ(n)
This compares with exp(Θ(n)) for the brute force algorithm

Global Matching

Needleman-Wunsch problem and solution for global matching
Denote costs of a substitution, a→b, by s(a,b)
For proteins we can take this to be our Blosum50 matrix
There are different substitution matrices e.g. Blosum60 used for closely related species
We will assume the cost of an insertion or deletion is -8

Problem Definition

We are given two strings:

a = a1a2 … am
b = b1b2 … bn

We want to find the minimum edit distance measuring how to change a to b where the price of a substitution is s(a,b) and of an insertion or deletion d = -8
To compute this we compute a tableau showing the best possible moves using the dynamic programming formula:
(value at i,j is the max of either substitution (diagonal), left or right substitution or deletion)

Example Table

Then backwards algorithm (basically follow the arrows from the bottom right):

Space Complexity

For two strings of length m and n the time complexity of dynamic programming is the size of the tableau O(m n)
This is massively faster than exhaustive search
Unfortunately in biology we often want to match two very long strings (particularly when looking at DNA)
The space complexity is then more of an issue than time complexity (we don’t have enough memory to store the tableau)
Algorithms exist that use linear space complexity
Can still be too slow for DNA

6 - Local Sequence Matching

Local Alignment

We see global alignment
But nature of evolution is that often section from one part of the genome are accidentally copied to another
Thus, we are often more interested in finding best matching sub-sequences
Problem known as the Smith-Waterman problem, they first posed and gave an algorithm to solve

Dynamic Programming Solution

Surprisingly it's just a small tweak to global assignment problem
Add the option of ignoring the match with cost 0
Gives the formula:
(essentially means there are no negative numbers in the table)

Example Table

Then backward algorithm starting from the highest number in the grid till zero

Difference to Global Assignment

The forward algorithm is same except we insist F(i,j) ≥ 0
We also store the location of the highest F(i,j)
In the backward algorithm we start at the best value and go back until we reach 0
Easy to change global to local assignment

Repeated and Overlapping Matches

So far considered single best match
In same cases (where have repeated domain or motif) we may have one region being repeated many times within a second sequence
We can modify our dynamic programming to allow this
Again we may believe that one sequence should be completely contained in a second, then we can modify our boundary conditions to find this

Gaps

Insertions are usually caused by adding a subsequence from elsewhere
Deletions would result from jumping a number of bases or treating an exon as an intron
Although these are rare events, once it happens then the gap could be quite large
Up to now we have made a cost of a gap proportional to its length, but this seems unrealistic

Dumb Gap Algorithm

If we assign a cost γ(g) to a gap size g then we could use the dynamic programme
Alas for two sequences of length n this is a O(n³) algorithm which is too slow for most practical applications
Better?

Affine Gap

If we choose a gap cost of

γ(g) = d + (g - 1)e

Then there exists a O(n²) algorithm to do matching
An affine transform of a variable x is of the form a x + b (i.e. a straight line) - note that γ(g) is an affine function
Using d = -12 and e = -2 gives a good model
It penalises any gap, with larger gaps being slightly more punished

Affine Gap Algorithm

To efficiently cope with gaps we need to introduce three matrices (rather than just F(i,j))

M(i,j) is the best score up to (i,j) given ai and bj align
Ia(i,j) is the best score given that ai aligns with a gap
Ib(i,j) is the best score given that bj aligns with a gap

Update Equation

Matching with Gaps

Different FSA

The finite automaton (FSA) description provides an elegant visualisation of dynamic programming rules
It makes it much easier to come up with different models that perhaps can better capture the evolutionary process
But comes with cost that for each new state you need to generate an extra matrix - slowing down matching
For affine gaps we can come up with an approximate two state model that nearly always finds the same solution as the three state model

Inexact Matching in Practice

Proteins area few 100 amino acids long as O(mn) is doable
There are 100s of millions of proteins in the main protein database so it can take a while to look for matches against all known proteins
DNA sequences however, are of the order 10⁵ up to 10¹¹
Using dynamic programming on a large genome is not feasible

Approximate Matching

Various techniques have been developed to find approximate matches
Not guaranteed to find the best inexact match
Use heuristic of searching for sections with some exact matches
Once these sections are identified they are extended and combined to find good matches that are polished using Smith-Waterman
These algorithms are a rag-bag of heuristic techniques that are very effective if not very elegant

FASTA

FASTA is an approximate matching algorithm for both proteins and nucleotide sequences
It firstly checks for short (k-mer) exact matches between a query sequence and a target (k=2 for proteins and k=4-6 for DNA)
Local regions identified with high density of k-mers
Local regions extended by looking for inexact matches
Ungapped regions are combined into gapped region
Finally plausible regions are rematched using dynamical programming

BLAST

BLAST (Basic Local Alignment Search Tool) is the most commonly used tool for alignment (mainly superseding FASTA)
After removing low-complexity regions that might give high matches, exact matches (slightly larger than FASTA) are found
BLAST builds up high scoring “words” from the exact matches
These seeds are extended into regions which can then be combined (in BLAST2) to find gapped regions
Again these are polished with Smith-Waterman

Are Matches Real?

By chance we are likely to find some good matches even between random sequences
What we would like to know is the probability of finding a particular match by chance
This is used in algorithms like BLAST to decide which matched regions it should keep
The problem is non-trivial because we are choosing the best match that we found

Probability of a Match

Computing probability of an exact match between two random strings is possible if we know the frequency of occurrence of all the elements
For inexact matching this requires a bit more work
The BLOSUM50 scores are designed to give the (log-)probability of a particular match between amino acid residues
A lot of effort went into creating matching schemes that give probability of a match
But what is the probability of the best of n matches being greater than Z?

Extreme Values

Extreme Value Distribution

The distribution of extreme values (best of k) is known as the extreme value distribution
Similar to the central limit theorem (which states that the sum of random variables will be normally distributed), it turns out that asymptotically (k→∞) the extreme value of random deviates converges to a small number of extreme value distributions
The most common is the Gumbel EVD given by:

Using EVD

Extreme value distributions become more accurate when we do a large number of trials
In matching the number of matches are exponential in the string length so we are in a situation where EVDs are quite accurate
The probability of observing a match with a score at least Z is: