Johnny Millar, jonathan.millar@ed.ac.uk
Ian Simpson, ian.simpson@ed.ac.uk
Project Description
Large-scale genome sequence datasets are increasingly common. In addition to nucleotide sequences, these datasets contain an array of metadata and phenotypic measurements. These data may be used alongside genomic foundation models to advance our understanding of biology and to accelerate drug discovery. However, there are currently few ways to improve the contextual understanding of these models and so enable rapid advances in genomics. In other domains of generative AI, vector databases have been employed to address similar issues through techniques such as retrieval-augmented generation.