A DNA sequence is the information system that encodes all life [1]. Multiple scientific disciplines (e.g. medicine, public health, genetic epidemiology) rely on genomic data [2]. The value of DNA sequences arises from knowledge of what they encode. This knowledge is presented as "annotations" that pinpoint, for instance, the linear segments of DNA that correspond to individual genes [3]. Other annotation types map relationships between genes from different organisms, exploiting the principle of descent from a common ancestor [4].
1: The exception is that many viruses use RNA instead, thus ending your first lesson in biology – all rules are broken!
2: The genome of an organism is its complete set of genetic material and can be computationally represented as a string of the four letters A, C, G, T.
3: A gene is a DNA segment that encodes a molecular machine, e.g., a protein.
4: Thanks, Charles Darwin.
While the essential characteristics of DNA data are universal, each application domain presents distinctive computational challenges. The genome size and gene content can differ by many orders of magnitude between species. For instance, the genome of
- the virus causing COVID-19 is ~3x10^4 letters long and contains tens of genes
- a fungal pathogen can be ~3x10^7 letters long and contain ~2x10^4 genes
- a human is ~3x10^9 letters long and contains ~2x10^4 genes
The challenges faced by a genomic data science library are that sequence data and the associated metadata are:
- often stored in different locations
- in different formats
- generated by different technologies
No single algorithmic implementation can efficiently support this diversity of challenges. What's required is an architecture that presents a unified API that can be readily modified to employ customised algorithms for individual problems.
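To make the idea of a unified API with swappable implementations concrete, here is a minimal sketch in plain Python. Every name in it (`AnnotationStore`, `register_backend`, `get_backend`, `DictStore`) is hypothetical and chosen for illustration only; none of it is taken from `cogent3`, and a real design would likely delegate discovery to a plugin framework rather than a hand-rolled registry.

```python
from typing import Callable, Dict, List, Protocol, Tuple


class AnnotationStore(Protocol):
    """Hypothetical interface that every storage backend would implement."""

    def add(self, seqid: str, biotype: str, start: int, stop: int) -> None: ...

    def query(self, seqid: str, biotype: str) -> List[Tuple[int, int]]: ...


# Registry mapping a backend name to a factory that builds the backend.
_BACKENDS: Dict[str, Callable[[], AnnotationStore]] = {}


def register_backend(name: str):
    """Decorator that records a backend factory under the given name."""

    def decorator(factory):
        _BACKENDS[name] = factory
        return factory

    return decorator


def get_backend(name: str) -> AnnotationStore:
    """Build the backend registered under name, or fail with a clear error."""
    if name not in _BACKENDS:
        raise ValueError(f"unknown backend {name!r}")
    return _BACKENDS[name]()


@register_backend("memory")
class DictStore:
    """Illustrative backend that keeps annotations in a plain dict."""

    def __init__(self):
        self._records = {}

    def add(self, seqid, biotype, start, stop):
        self._records.setdefault((seqid, biotype), []).append((start, stop))

    def query(self, seqid, biotype):
        # Return coordinates sorted by start position.
        return sorted(self._records.get((seqid, biotype), []))
```

Calling code only ever asks for `get_backend("memory")` and uses `add` and `query`, so a backend tuned for, say, very large genomes could be registered under another name without changing that code.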
The overarching goal is to modernise the Python genome data science library `cogent3`, implementing a plugin architecture to simplify development of customised solutions for each problem domain.
We will target sequence annotation handling as the initial prototype. In `cogent3`, annotations are in-memory functional objects. Annotation handling will be refactored to use an object-relational mapping, with annotations stored in a database. An in-memory SQLite plugin will be developed and distributed with `cogent3`. Completing this project requires that we identify the Python plugin framework [5] best suited for `cogent3`.
5: These include `stevedore`, `pluggy` and `qiime2`.
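As a rough illustration of what database-backed annotations could look like, the sketch below uses Python's standard `sqlite3` module with an in-memory database. The table layout, column names, and the helpers `make_db` and `features_overlapping` are assumptions made for this example, not the schema or API that the project will adopt. Coordinates are assumed to be half-open, i.e. `[start, stop)`.

```python
import sqlite3

# Hypothetical schema: one row per annotated feature.
SCHEMA = """
CREATE TABLE feature (
    seqid   TEXT NOT NULL,
    biotype TEXT NOT NULL,
    name    TEXT NOT NULL,
    start   INTEGER NOT NULL,
    stop    INTEGER NOT NULL
)
"""


def make_db(records):
    """Create an in-memory SQLite database and load the feature records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?)", records)
    conn.commit()
    return conn


def features_overlapping(conn, seqid, start, stop, biotype=None):
    """Return (name, start, stop) for features overlapping [start, stop)."""
    sql = (
        "SELECT name, start, stop FROM feature "
        "WHERE seqid = ? AND start < ? AND stop > ?"
    )
    params = [seqid, stop, start]
    if biotype is not None:
        sql += " AND biotype = ?"
        params.append(biotype)
    sql += " ORDER BY start"
    return conn.execute(sql, params).fetchall()


# Made-up records, purely for demonstration.
records = [
    ("seq1", "gene", "geneA", 0, 100),
    ("seq1", "CDS", "cdsA", 10, 90),
    ("seq1", "gene", "geneB", 150, 300),
    ("seq2", "gene", "geneC", 0, 50),
]
conn = make_db(records)
genes = features_overlapping(conn, "seq1", 80, 160, biotype="gene")
```

Pushing the overlap test into SQL means the query runs inside the database engine rather than by iterating over Python objects. Whether that is faster in practice would need to be measured for each problem domain.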
- Sophisticated understanding of Python
- Software design patterns
- Experience with SQL and NoSQL desirable
You will join a multi-disciplinary team consisting of computer scientists, computational biologists, geneticists and mathematical statisticians. Professor Gavin Huttley from ANU Research School of Biology will co-supervise this project. The project leads have extensive experience in successfully teaching and mentoring students to develop their practical skill set in this multi-disciplinary domain.
You will contribute to the `cogent3` open source project for computational biology [6]. The project is developed in adherence to industry best-practice software engineering processes. You will be mentored in employing these practices.
By contributing to an open source project, your work benefits the large global community of bioinformatics scientists. All contributions will be acknowledged on the project documentation website, and significant contributions will further be acknowledged by co-authorship on academic publications arising from the project.
You will have access to working space in the Robertson Building.
6: `cogent3` is available on PyPI, where it has ~1k downloads per month. It is the successor to the high-impact `PyCogent` library, which provided the critical foundations for several widely used spin-off projects including `QIIME`, `QIIME2` and `scikit-bio`.
Software design; Plugin architecture; Genomics; Bioinformatics; Computational Biology; Biological Viruses; Open Source