Understanding the Language of Life
Cancer. Every year, millions of people have their lives cut short. Bird flu. Over half a billion chickens have died of bird flu since 2023, and the disease is slowly becoming a threat to a variety of ecosystems whose services human lives and livelihoods depend on. AMR. Drug-resistant strains of bacteria and fungi may very well take us back to a time when a simple paper cut could lead to an amputation. All of these diseases and problems are connected by changes in DNA, and we at DeOxy Tech aim to build a world where the complexities of biology are understood and controlled.
Recent advances in generative models for nucleotide sequences have shown promise, but their practical utility remains limited. In this study, we explore DNA as a complex functional representation of evolutionary processes and assess the ability of transformer-based models to capture this complexity. Through experiments with both synthetic and real DNA sequences, we demonstrate that current transformer architectures, particularly auto-regressive models relying on next-token prediction, struggle to learn the underlying biological functions. Our findings suggest that these models face inherent limitations that cannot be overcome with scale, highlighting the need for alternative approaches that incorporate evolutionary constraints and structural information. We propose potential future directions, including the integration of topological methods or a switch of modelling paradigm, to improve the generation of genomic sequences.
Preprint:
bioRxiv
We introduce DVQ (DNA Visualisations and Quick comparisons), an open-source Python library for exploring nucleotide sequences using a variety of methods.
Understanding DNA sequences intuitively is important for a variety of tasks in biology.
DVQ aims to be a comprehensive, one-stop library that makes DNA sequences easy to explain for geneticists, researchers, and practitioners.
For practitioners, the library provides an easy-to-use interface for generating visualisations of their sequences in only a few lines of code.
In this report, we demonstrate several example use cases across different types of sequences as well as visualisations.
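To give a flavour of what a sequence visualisation computes, here is a minimal sketch of a purine/pyrimidine "DNA walk", a generic technique that turns a sequence into a curve that can be plotted and compared. It is not DVQ's API: the function names `dna_walk` and `gc_content` and the example sequences are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def dna_walk(seq: str) -> list[int]:
    """Cumulative purine/pyrimidine walk (illustrative helper, not part of DVQ).

    Each pyrimidine (C, T) steps +1, each purine (A, G) steps -1,
    and any other symbol (e.g. N) contributes 0.
    """
    steps = {"C": 1, "T": 1, "A": -1, "G": -1}
    return list(accumulate(steps.get(base, 0) for base in seq.upper()))


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    # Two made-up sequences, used only to show the calls.
    for name, seq in [("seq_a", "ATGCGTACCT"), ("seq_b", "ATGAGGACAT")]:
        print(name, dna_walk(seq), round(gc_content(seq), 2))
```

Plotting the two walks on the same axes gives a quick visual comparison: sequences with similar composition along their length trace similar curves, while local enrichment in purines or pyrimidines shows up as a sustained downward or upward drift.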
Package repo:
DVQ