
Self-organising Methodology for the Automated Annotation of Eukaryotic Genomes

The annotation of genomic data represents a significant challenge for researchers in genetics. The amount of information emerging from whole-genome sequencing projects is vast and is increasing at an exponential rate. As a result, the interpretation of this information lags significantly behind the generation of data. One of the key stages in the interpretation of genomic data is determining the positions of genes within the genome. In eukaryotic sequences, the proportion of the sequence that contains protein-coding information is very small (it is estimated that ~98.5% of the human genome is non-coding, as opposed to 11% of the genome of the bacterium E. coli), so locating the coding regions can be challenging. In addition, even within genes, there are significant tracts of DNA, known as introns, which do not code for proteins.

This project aims to develop methods for the automated classification of DNA sequences. While computational methods which perform this function are already widely used, they generally depend on models which have to be trained and calibrated on existing annotated sequences, i.e., "supervised" methodologies. This dependence on existing annotation is a potential weakness, as the annotations may contain erroneous or "dirty" data. In addition, the statistical properties of gene-coding regions may vary between the genomes of different organisms. Finally, supervised methods are not able to detect new classes of data. This capability is particularly important for eukaryotic genome sequences, as these contain a high percentage of so-called "junk" DNA, to which no function has yet been attributed.

Our aim is to apply methods drawn from the field of unsupervised or self-organising algorithms to perform gene classification without the use of prior information. The basic concept is to model the different gene regions with a mixture of Hidden Markov Models (HMMs), trained using a standard clustering algorithm. Briefly, this works as follows:

1) The sequence to be analysed is first divided into segments which are treated as samples from the space of possible subsequences.

2) The candidate HMMs are compared to these subsequences, and each subsequence is assigned to the HMM which returns the highest likelihood score for it.

3) Each HMM is trained using the subsequences which are assigned to it in this fashion. Steps 2 and 3 are repeated until convergence is achieved.
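
The three steps above can be illustrated with a minimal sketch. To keep the example short and self-contained, it substitutes fully observed first-order Markov chains for the HMMs, so it shows only the assign-and-retrain loop, not the project's actual models. All function names, the fixed segment length, the random initial assignment and the pseudocount are assumptions made for illustration; the sketch also assumes the sequence contains only the characters A, C, G and T.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: no ambiguity codes such as N


def split_into_segments(sequence, length):
    """Step 1: cut the sequence into non-overlapping fixed-length segments."""
    return [sequence[i:i + length]
            for i in range(0, len(sequence) - length + 1, length)]


def train_chain(segments, pseudocount=1.0):
    """Step 3: estimate first-order transition probabilities from segments.

    The pseudocount keeps every probability non-zero, so a model with no
    assigned segments falls back to a uniform chain.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seg in segments:
        for prev, cur in zip(seg, seg[1:]):
            counts[prev][cur] += 1.0
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
            for a in ALPHABET}


def log_likelihood(chain, segment):
    """Log-probability of the transitions in a segment under one chain."""
    return sum(math.log(chain[p][c]) for p, c in zip(segment, segment[1:]))


def train_all(segments, labels, n_models):
    """Retrain every model on the segments currently assigned to it."""
    return [train_chain([s for s, lab in zip(segments, labels) if lab == k])
            for k in range(n_models)]


def cluster_segments(sequence, n_models=2, length=50, max_iter=50, seed=0):
    """Alternate hard assignment (step 2) and retraining (step 3)."""
    rng = random.Random(seed)
    segments = split_into_segments(sequence, length)
    labels = [rng.randrange(n_models) for _ in segments]  # random start
    for _ in range(max_iter):
        models = train_all(segments, labels, n_models)
        new_labels = [
            max(range(n_models), key=lambda k: log_likelihood(models[k], s))
            for s in segments
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return segments, labels, train_all(segments, labels, n_models)
```

A call such as `segments, labels, models = cluster_segments(dna, n_models=3)` returns one label per segment. Like other hard-assignment clustering schemes, the result depends on the initial assignment, so in practice several random restarts would probably be needed.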

The procedure described above has been used previously to classify bacterial genome sequences. However, we wish to extend its use to eukaryotic DNA. The starting point will be a small-scale study of its applicability to the Arabidopsis genome, where the initial objective will be to automatically detect non-coding, intron and exon regions. Once the feasibility of this is established, we hope to introduce the use of "resource allocating" architectures which are able to detect novel classes.

Principal Investigators at MUST


