Keep up with what's happening in Bioinformatics and Machine Learning (^ω^)
Last update: 6/8/2021
By: Huitian (Yolanda) Diao
A reference genome (also known as a reference assembly) is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead, a reference provides a haploid mosaic of different DNA sequences from each donor. There are reference genomes for multiple species of viruses, bacteria, fungi, plants, and animals [1].
Genome = all DNA
Transcriptome = all transcribed RNA
GENCODE uses the UCSC convention of prefixing chromosome names with “chr”, e.g. “chr1” and “chrM”, whereas Ensembl names these “1” and “MT”. At the time of writing (Ensembl 89), a few transcripts differ due to conversion issues. In addition, around 160 genes in the pseudoautosomal regions (PAR) are duplicated in GENCODE but appear only once in Ensembl. These differences affect fewer than 1% of the transcripts. Apart from the gene annotation itself, the links to external databases also differ [2].
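The naming difference matters in practice when combining files from both sources, since "chr1" and "1" will not match. Below is a minimal Python sketch of converting between the two styles. The function names `ucsc_to_ensembl` and `ensembl_to_ucsc` are hypothetical helpers made up for illustration, not part of any library, and the sketch only covers primary chromosomes (1-22, X, Y, and the mitochondrial genome). Unplaced scaffolds and alternate contigs follow different naming rules and are not handled here.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert a UCSC/GENCODE-style chromosome name to Ensembl style.

    Hypothetical helper; assumes a primary chromosome name.
    """
    if name == "chrM":
        # The mitochondrial genome is "chrM" in UCSC and "MT" in Ensembl.
        return "MT"
    if name.startswith("chr"):
        # Drop the "chr" prefix: "chr1" -> "1", "chrX" -> "X".
        return name[3:]
    # Already without a prefix; return unchanged.
    return name


def ensembl_to_ucsc(name: str) -> str:
    """Convert an Ensembl-style chromosome name to UCSC/GENCODE style.

    Hypothetical helper; assumes a primary chromosome name.
    """
    if name == "MT":
        return "chrM"
    if name.startswith("chr"):
        # Already prefixed; return unchanged.
        return name
    # Add the "chr" prefix: "1" -> "chr1", "X" -> "chrX".
    return "chr" + name


if __name__ == "__main__":
    for chrom in ["chr1", "chrX", "chrM"]:
        print(chrom, "->", ucsc_to_ensembl(chrom))
    for chrom in ["1", "X", "MT"]:
        print(chrom, "->", ensembl_to_ucsc(chrom))
```

Converting names only aligns the coordinate labels. It does not resolve the annotation differences described above, such as the PAR genes that are duplicated in GENCODE.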