Algorithms for efficient representation of biological pathogens

People

Supervisor

Description

A DNA sequence is the information system that encodes all life[1]. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data[2]. The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as "annotations" that pinpoint, for instance, the linear segments of DNA that correspond to individual genes[3]. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor[4].

 

The microorganisms that parasitise humans present several computational challenges. Pathogen genomes can be very small, which makes sequencing the genome of a single pathogen isolate a trivial, low-cost undertaking[5]. As a consequence, genome sequencing of pathogen isolates is now widely used as a monitoring tool. As the ongoing SARS-CoV-2 pandemic illustrates, this application can produce millions of pathogen genome sequences that need to be streamed to centralised data repositories. Our job is to facilitate the processing, storage and manipulation of these data to accelerate public health decisions.

 

1: The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!

2: The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.

3: A gene is a DNA segment that encodes a molecular machine, e.g., a protein.

4: Thanks, Charles Darwin.

5: Fun fact: some DNA sequencers are quite literally the size of a 50 cent piece and are powered by the USB port on a laptop.

 

Decades of research have shown that comparing biological sequences provides powerful insights into biological systems. Take SARS-CoV-2 as an example. Identifying changes to the spike protein is crucial since these may affect the efficacy of current vaccines.

 

In a computational sense, two core data representations and one core algorithm class are fundamental to delivering this capability. The two data representations are:

  • representing individual DNA sequences
  • representing the relationship between DNA sequences as a multiple sequence alignment

Sequence alignment algorithms, the core algorithm class, are crucial to identifying the genetic changes that have occurred since sequences last shared a common ancestor. These algorithms accommodate both changes of one letter to another in a genome and the deletion of parts of the genome. A consequence of the latter is that not all isolate genomes are the same size.
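
As a minimal sketch of what an alignment makes possible, the following compares each aligned isolate against a reference, letter by letter. The sequences, the names `ref`, `iso1` and `iso2`, and both helper functions are invented for illustration and are not part of `cogent3`; the `-` gap character marks a deleted position.

```python
GAP = "-"  # conventional gap character marking a deleted position


def describe_changes(reference: str, isolate: str) -> list[tuple[int, str, str]]:
    """Return (alignment position, reference letter, isolate letter)
    for every column where the two aligned sequences differ."""
    if len(reference) != len(isolate):
        raise ValueError("aligned sequences must have the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, isolate))
        if r != s
    ]


def ungapped_length(aligned: str) -> int:
    """Length of the underlying genome once gap characters are removed."""
    return len(aligned) - aligned.count(GAP)


# A toy multiple sequence alignment: every row has the same length.
alignment = {
    "ref": "ACGTACGTAC",
    "iso1": "ACGTACGAAC",  # one letter changed
    "iso2": "ACG---GTAC",  # three letters deleted
}

for name in ("iso1", "iso2"):
    changes = describe_changes(alignment["ref"], alignment[name])
    print(name, ungapped_length(alignment[name]), changes)
```

Because gaps pad every row to the same length, a column index refers to the same ancestral position in every sequence, which is what reduces the comparison to a simple loop.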

 

`cogent3`, a Python genome data science library that we develop, has been used to analyse SARS-CoV-2 sequences and is included in the Biomedical Linux distribution. That work has identified crucial shortcomings of the current implementation.

 

`cogent3` handles alignments of large, annotated sequences using classes built on native Python types. Operations on small numbers of large sequences are efficient because they return views, which avoids copying. The same classes are inefficient for massive numbers of short sequences because of the memory overhead of representing sequence deletions. For that case, `cogent3` has a different class built on NumPy arrays, which have a smaller memory footprint; however, this class does not support annotations and its operations return new instances.
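
The difference between returning a view and returning a new instance can be illustrated with the standard library alone. This is a sketch of the general idea, not of how `cogent3` implements either class: `memoryview` stands in for an operation that returns a view, and `bytes(...)` for one that copies. All variable names are hypothetical.

```python
# Stand-in for one large sequence held in a mutable buffer.
genome = bytearray(b"ACGT" * 1000)

# Slicing a memoryview yields another view onto the same buffer:
# no sequence data is copied, however long the slice is.
view = memoryview(genome)[4:12]

# Slicing the buffer itself allocates a new, independent object.
copy = bytes(genome[4:12])

# Changing the underlying buffer is visible through the view
# but not through the copy.
genome[4] = ord("T")

view_text = view.tobytes().decode()
copy_text = copy.decode()

print(view_text)  # reflects the edit
print(copy_text)  # still the original letters
print(view.obj is genome)  # the view still refers to the original buffer
```

A view is cheap to create for a very long slice, which suits a handful of large sequences. With massive numbers of short sequences, the fixed cost of each wrapping object matters more than the copying it avoids.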

 

Goals

The overarching goal of this project is to modernise the Python genome data science library `cogent3`. The specific goal of this project is to merge the best features of the two different categories of `cogent3` sequence/alignment types.

We will implement a new representation of biological sequence data that is memory efficient and computationally fast. The existing API will be retained when possible. The open source `scipy` library will be leveraged for the core data structures. The new structure will exploit the biological property of descent from a common ancestor for even greater reductions in memory.
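
One way to picture how common ancestry can reduce memory: closely related isolates differ from a reference at only a few positions, so it can be enough to store the reference once plus each isolate's differences. The sketch below is an assumption about the general approach, not the project's design. The class `DiffAlignment` and its methods are hypothetical, and a plain `dict` stands in for whatever `scipy` data structure is eventually chosen.

```python
class DiffAlignment:
    """Store one reference plus per-sequence differences (hypothetical design)."""

    def __init__(self, reference: str) -> None:
        self.reference = reference
        # name -> {alignment position: letter that differs from the reference}
        self._diffs: dict[str, dict[int, str]] = {}

    def add(self, name: str, aligned: str) -> None:
        """Record only the columns where `aligned` differs from the reference."""
        if len(aligned) != len(self.reference):
            raise ValueError("sequence must be aligned to the reference")
        self._diffs[name] = {
            i: c
            for i, (r, c) in enumerate(zip(self.reference, aligned))
            if r != c
        }

    def get(self, name: str) -> str:
        """Rebuild the full aligned sequence from the reference and its differences."""
        letters = list(self.reference)
        for i, c in self._diffs[name].items():
            letters[i] = c
        return "".join(letters)

    def num_stored_letters(self) -> int:
        """Letters actually held: the reference plus every recorded difference."""
        return len(self.reference) + sum(len(d) for d in self._diffs.values())


aln = DiffAlignment("ACGTACGTAC")
aln.add("iso1", "ACGTACGAAC")  # differs at one position
aln.add("iso2", "ACG---GTAC")  # differs at three positions (a deletion)

print(aln.get("iso2"))
print(aln.num_stored_letters())
```

In this toy case, three aligned sequences of ten letters are held as 14 stored letters rather than 30. The saving depends on how similar the sequences are to the reference, so it should be largest for closely related isolates.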

 

This will simultaneously improve compute performance and allow existing code to be removed, significantly reducing the maintenance burden for `cogent3`.

 

Requirements

  • Sophisticated understanding of Python
  • Software design patterns

Gain

You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. Professor Gavin Huttley from ANU Research School of Biology will co-supervise this project. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.

 

You will contribute to the `cogent3` open source project for computational biology[6]. The project is developed in adherence with industry best-practice software engineering processes. You will be mentored in employing these practices.

 

By contributing to an open source project, your work benefits the large global community of bioinformatics scientists. All contributions will be acknowledged on the project documentation website, and significant contributions will further be acknowledged by co-authorship on academic publications arising from the project.

 

You will get access to working space in Robertson Building.

 

6: `cogent3` is available on PyPI, where it has ~1k downloads per month. It is the successor to the high-impact `PyCogent` library, which provided the critical foundations for multiple extremely widely used spinoff projects, including `QIIME`, `QIIME2` and `scikit-bio`.

 

 

Keywords

Software design; Plugin architecture; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open source

Updated: 10 August 2021 / Responsible Officer: Dean, CECS / Page Contact: CECS Marketing