chromosomes-scaffolds-contigs.md #57

@@ -0,0 +1,23 @@
---
slug: chromosomes-scaffolds-contigs
title: Chromosomes, scaffolds and contigs
description: The relationship between contigs, scaffolds and chromosomes in Ensembl
---

Genome assemblies are hierarchical. The shortest assembly components are contigs, which are contiguous stretches of sequence assembled from overlapping sequencing reads. Contigs are assembled into longer scaffolds, and scaffolds are assembled into chromosomes if there is sufficient mapping information. Many genome assemblies have only been assembled to the scaffold level.

Scaffolds are classified in three ways:

## Placed scaffolds
The scaffold has been placed within a chromosome, and its position and orientation are known.

## Unlocalised scaffolds
Although the chromosome within which the scaffold occurs is known, the scaffold's position or orientation is not.

## Unplaced scaffolds
It is not known which chromosome the scaffold belongs to.
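
As a rough illustration, the three categories can be expressed as a small decision function. This is a minimal sketch: the function name and its parameters are hypothetical and are not part of any Ensembl API.

```python
from typing import Optional


def classify_scaffold(
    chromosome: Optional[str],
    position_known: bool,
    orientation_known: bool,
) -> str:
    """Classify a scaffold as placed, unlocalised or unplaced.

    Hypothetical helper for illustration only.
    """
    # No chromosome assignment at all: the scaffold is unplaced.
    if chromosome is None:
        return "unplaced"
    # Chromosome, position and orientation are all known: placed.
    if position_known and orientation_known:
        return "placed"
    # Chromosome is known, but position or orientation is missing.
    return "unlocalised"
```

For example, `classify_scaffold("1", True, False)` falls into the unlocalised category, because the orientation is missing.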

The relationship between contigs, scaffolds and chromosomes is defined in AGP files. These files describe how assembled sequences (e.g. chromosomes) are compiled from their components (e.g. scaffolds). In Ensembl, we import contig-level DNA sequence into our core databases. We also import the AGP files for contig-to-scaffold, contig-to-chromosome, and scaffold-to-chromosome mappings. This allows us to generate scaffold and chromosome sequence on the fly by stitching the contig sequences together as specified by the AGP files.
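
The stitching step can be sketched in a few lines of Python. This is a simplified illustration, not the Ensembl implementation: the function names are hypothetical, the AGP lines are assumed to be tab-delimited and already ordered by part number, gap lines (component type `N` or `U`) are filled with `N` characters, and any orientation other than `-` is treated as forward.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def stitch_from_agp(agp_lines, contigs):
    """Build object sequences (e.g. scaffolds) from AGP lines.

    agp_lines: iterable of tab-delimited AGP lines, ordered by part number.
    contigs:   dict mapping component ID to its sequence.
    Hypothetical helper for illustration only.
    """
    parts_by_object = {}
    for line in agp_lines:
        # Skip blank lines and header comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        obj, component_type = cols[0], cols[4]
        if component_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(cols[5])
        else:
            # Sequence line: component ID, 1-based begin/end, orientation.
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = contigs[comp_id][beg - 1:end]
            if orient == "-":
                piece = reverse_complement(piece)
        parts_by_object.setdefault(obj, []).append(piece)
    return {name: "".join(parts) for name, parts in parts_by_object.items()}


# Made-up example: two contigs joined by a 10 bp gap.
example_contigs = {"contig1": "ACGTAC", "contig2": "GGGTTT"}
example_agp = [
    "scaffold1\t1\t6\t1\tW\tcontig1\t1\t6\t+",
    "scaffold1\t7\t16\t2\tN\t10\tscaffold\tyes\tpaired-ends",
    "scaffold1\t17\t20\t3\tW\tcontig2\t1\t4\t-",
]
scaffolds = stitch_from_agp(example_agp, example_contigs)
```

Only the contig sequences need to be stored. The scaffold sequence is rebuilt from the mapping each time it is requested, and the same approach applies to scaffold-to-chromosome mappings.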

## Toplevel
For each genome assembly, we define the set of toplevel sequences. These are sequence regions in the genome assembly that are not a component of another sequence region. For example, when a genome is assembled into chromosomes, toplevel sequences will be chromosomes and any unlocalised or unplaced scaffolds. If a genome has only been assembled into scaffolds, then toplevel sequences are the full set of unlocalised and unplaced scaffolds.
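
The definition of toplevel can be illustrated with a short sketch: given every sequence region and the assembled-from relationships between them, the toplevel set is whatever never appears as a component. The function and the region names below are hypothetical and serve only as an example.

```python
def toplevel_regions(all_regions, assembly_mappings):
    """Return the regions that are not a component of any other region.

    all_regions:       iterable of region names.
    assembly_mappings: iterable of (assembled, component) name pairs.
    Hypothetical helper for illustration only.
    """
    components = {component for _, component in assembly_mappings}
    return {region for region in all_regions if region not in components}


# Made-up assembly: one chromosome built from scaffold_a,
# plus scaffold_b, which is not part of any chromosome.
regions = {"chr1", "scaffold_a", "scaffold_b", "contig1", "contig2", "contig3"}
mappings = [
    ("chr1", "scaffold_a"),
    ("scaffold_a", "contig1"),
    ("scaffold_a", "contig2"),
    ("scaffold_b", "contig3"),
]
toplevel = toplevel_regions(regions, mappings)
```

Here `toplevel` contains `chr1` and `scaffold_b`: the chromosome, plus the scaffold that is not part of any chromosome.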