Aligning streamed genome sequences

People

Supervisor

Description

A DNA sequence is the information system that encodes all life[1]. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data[2]. The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as "annotations" that pinpoint, for instance, the linear segments of DNA that correspond to individual genes[3]. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor[4].


The microorganisms that parasitise humans present several computational challenges. Pathogen genome sizes can be very small, making the acquisition of the genome sequence of a single pathogen isolate a trivial and low-cost undertaking[5]. As a consequence, genome sequencing of pathogen isolates is now widely used as a monitoring tool. As the ongoing SARS-CoV-2 pandemic illustrates, this application can produce millions of pathogen genome sequences that need to be streamed to centralised data repositories. Our job is to facilitate the processing, storage and manipulation of this data to accelerate public health decisions.

 

1: The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!

2: The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.

3: A gene is a DNA segment that encodes a molecular machine, e.g., a protein.

4: Thanks, Charles Darwin.

5: Fun fact: some DNA sequencing machines are quite literally the size of a 50-cent piece and powered by the USB port on a laptop.

 

Decades of research have shown that comparing biological sequences provides powerful insights into biological systems. Take the case of SARS-CoV-2 as an example. Identifying changes to the spike protein is crucial, since these may affect the efficacy of the current vaccines.

 

In a computational sense, two core data attributes and one core class of algorithms are fundamental to delivering this capability. The data attributes are:

  • representing individual DNA sequences
  • representing the relationship between DNA sequences as a multiple sequence alignment
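
As a minimal sketch of these two representations, the snippet below stores unaligned sequences as plain strings and an alignment as equal-length strings in which `-` marks a gap. The isolate names, the toy sequences and the helper functions `degap` and `variable_columns` are illustrative assumptions, not part of `cogent3`.

```python
from typing import Dict, List

GAP = "-"

# Unaligned sequences: plain strings over A, C, G, T (toy data, not real genomes).
sequences: Dict[str, str] = {
    "isolate_1": "ACGTTGCA",
    "isolate_2": "ACGTGCA",
    "isolate_3": "ACCTTGCA",
}

# A multiple sequence alignment: every row has the same length and
# the gap character marks positions that are absent from a sequence.
alignment: Dict[str, str] = {
    "isolate_1": "ACGTTGCA",
    "isolate_2": "ACGT-GCA",
    "isolate_3": "ACCTTGCA",
}


def degap(row: str) -> str:
    """Remove gap characters, recovering the unaligned sequence."""
    return row.replace(GAP, "")


def variable_columns(aln: Dict[str, str]) -> List[int]:
    """Return the indices of alignment columns that are not identical across rows."""
    rows = list(aln.values())
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return [i for i, column in enumerate(zip(*rows)) if len(set(column)) > 1]


print(variable_columns(alignment))
```

Columns that differ between rows are the candidate genetic changes; here they correspond to one substitution and one deletion.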

Sequence alignment algorithms are crucial to identifying the genetic changes that have occurred since sequences last shared a common ancestor. These algorithms accommodate changes of one letter to another in a genome and the deletion of parts of the genome. The consequence of the latter is that not all isolate genomes are the same size.
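
To make the idea concrete, here is a minimal sketch of classic global pairwise alignment by dynamic programming (the Needleman-Wunsch approach). It is a textbook technique shown for illustration, not the algorithm this project will develop, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions.

```python
from typing import Tuple


def global_align(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2
) -> Tuple[str, str, int]:
    """Globally align two sequences; return both gapped rows and the score."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # letter kept or changed
                score[i - 1][j] + gap,            # letter of a has no partner in b
                score[i][j - 1] + gap,            # letter of b has no partner in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


aligned_a, aligned_b, best = global_align("ACGTTGCA", "ACGTGCA")
print(aligned_a, aligned_b, best, sep="\n")
```

Under these assumed scores, the toy inputs align with seven matches and one gap, giving a score of 5. The table has `len(a) * len(b)` cells, which is why this approach does not scale to millions of genomes without faster strategies.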

 

As the pandemic has shown, the acquisition of genome sequences is an ongoing enterprise. Making sense of new sequences requires that they be aligned to an ever-growing collection of previously characterised sequences.

 

An alignment of biological sequences can be efficiently represented as a graph. Fast algorithms for constructing the graphs that relate sequences have been developed for the genome assembly problem, but less commonly for the pairwise and multiple sequence alignment problems.
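
One way to picture the graph representation is the simplified sketch below, loosely inspired by partial order graphs. Each distinct (column, letter) pair becomes a node, and each sequence contributes edges between its consecutive non-gap letters, so sequences that agree share nodes. The function name `alignment_to_graph` and the toy data are assumptions for illustration, and the construction is deliberately simpler than the published methods.

```python
from typing import Dict, Set, Tuple

Node = Tuple[int, str]  # (alignment column, letter)


def alignment_to_graph(aln: Dict[str, str]) -> Tuple[Set[Node], Dict[Node, Set[Node]]]:
    """Collapse an alignment into nodes and directed edges between them."""
    nodes: Set[Node] = set()
    edges: Dict[Node, Set[Node]] = {}
    for row in aln.values():
        previous = None
        for column, letter in enumerate(row):
            if letter == "-":
                continue  # gaps add no node; the path simply skips this column
            node = (column, letter)
            nodes.add(node)
            if previous is not None:
                edges.setdefault(previous, set()).add(node)
            previous = node
    return nodes, edges


# Toy alignment (illustrative data only).
toy_alignment = {
    "isolate_1": "ACGTTGCA",
    "isolate_2": "ACGT-GCA",
    "isolate_3": "ACCTTGCA",
}

nodes, edges = alignment_to_graph(toy_alignment)
total_letters = sum(len(row.replace("-", "")) for row in toy_alignment.values())
print(len(nodes), "nodes represent", total_letters, "letters")
```

In this toy case the deletion appears as an edge that bypasses a node, and the substitution appears as a branch that rejoins one column later. The saving grows with the number of near-identical sequences, since shared letters are stored once.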

 

Goals

In this project you will implement a fast graph aligner. A crucial data attribute affecting the utility of alignments of viral genomes is whether the alignment respects the nature of the information encoded within the genome. That is, the mapping from the DNA alphabet to the protein alphabet constrains the set of viable solutions that the aligner can return. The developed algorithm will accommodate these sequence attributes to improve alignment quality and facilitate the objective of detecting functional differences.
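
The sketch below illustrates the kind of constraint meant here, assuming the standard genetic code: DNA is read in non-overlapping triplets (codons), each mapping to one amino acid, so an alignment of a protein-coding region is easier to interpret when gaps cover whole codons. The helper names `translate` and `is_codon_aligned` are hypothetical, and the check is only one simple example of such a constraint, not the project's algorithm.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" denotes a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an ungapped coding sequence, reading from its first letter."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i : i + 3]] for i in range(0, len(dna), 3))


def is_codon_aligned(row: str) -> bool:
    """Check that every codon-sized block is either all gaps or gap-free."""
    if len(row) % 3 != 0:
        return False
    blocks = (row[i : i + 3] for i in range(0, len(row), 3))
    return all(block == "---" or "-" not in block for block in blocks)


print(translate("ATGGCTTAA"))
print(is_codon_aligned("ATG---GCT"), is_codon_aligned("ATGG--CTT"))
```

A gap that splits a codon shifts the reading frame of everything downstream in that row, which is why an aligner that ignores the protein alphabet can produce alignments that obscure functional differences.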

The final result will be made available as a plugin for the genome data science library `cogent3`.

 

Requirements

  • Sophisticated understanding of Python
  • Software design patterns
  • Experience in C / C++ or Rust is desirable

Background Literature

  • Christopher Lee, Catherine Grasso, Mark F. Sharlow. Multiple sequence alignment using partial order graphs, Bioinformatics 18.3 (2002): 452–464.
  • Mikko Rautiainen, Tobias Marschall. GraphAligner: rapid and versatile sequence-to-graph alignment, Genome Biology 21.1 (2020): 1–28.

Gain

You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. Professor Gavin Huttley from the ANU Research School of Biology will co-supervise this project. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.

 

You will contribute to the cogent3 open source project for computational biology[6]. The project is being developed with adherence to industry best-practice software engineering processes. You will be mentored in employing these practices.

 

By contributing to an open source project, your work benefits the large global community of bioinformatics scientists. All contributions will be acknowledged on the project documentation website, and significant contributions will further be acknowledged by co-authorship on academic publications of the project.

 

You will get access to working space in the Robertson Building.

 

6: cogent3 is available on PyPI, where it has ~1k downloads per month. It is the successor to the high-impact `PyCogent` library, which provided the critical foundations for multiple extremely widely used spin-off projects, including `QIIME`, `QIIME2` and `scikit-bio`.

 

Keywords

Software design; Plugin architecture; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open source

Updated:  10 August 2021/Responsible Officer:  Dean, CECS/Page Contact:  CECS Marketing