MoPSeq-DB allows users to i) navigate through all the genomic data related to mollusc pathogens, ii) access fully interactive views of genome structures, variants, and phylogenetic trees, and iii) download data in various formats : FASTA, GFF (General Feature Format), VCF (Variant Call Format), Newick phylogenetic tree, PNG.
The user-friendly platform provides a comprehensive overview of genomes and pathogens referenced in the database. Marine bivalve molluscs are infected by various pathogens, MoPSeq-DB visualisation tools are designed to fit each of their genome particularities. Thereby, MoPSeq-DB is suited to be used with viruses, bacteria and eukaryotes genomes.
In this page, MoPSeq-DB design is explain to facilitate exploration to the plateform users.
The sharing aspect of the project is based on the referencement of genomic data related to mollusc pathogens. It functions as a downstream repository for this data collected from public repositories for sharing genomes.
MoPSeq-DB shares four different files related to each referenced pathogen :
Type | Description | Presence |
---|---|---|
FASTA | Text-based format for representing nucleotide sequences using single-letter codes. | Mandatory |
CSV | Delimited text file that uses a comma to separate values of each metadata attribute. | Mandatory |
GFF | General feature format, a file format used to describe genes and other features of DNA sequences. | Optionnal |
VCF | Variant Call Format, a text file used to store sequence nucleotidic variation information. | Optionnal |
A FASTA, GFF, or VCF file can correspond to three different states of the genome assembly they are referring to :
- Complete - The complete genome of the sample.
- Non-redundant - A less complex complete genome, with only one copy of repeated regions (Delmotte-Pelletier et al. Virus Evolution, 2022.). The regions can also be re-organised depending on the pathogen.
- Partial - When the genome is only a fraction of the referenced genome/contig.
For users, data sharing takes place within the Sequence Databases tab, in the navigation bar. When hovering over it, you can select the desired pathogen and see a table with the referenced data.
The main element of the page will be the table referencing the data. In this table, you have several direct functionalities, such as sorting the table based on the attribute of your choice or controlling the number of displayed lines.
The table's direct functionalities include:
The plateform offers dynamic data searching and filtering functionnaly, to be found above the table, in the following toolbar :
- Display - Enables you to select the list of attributes (metadata) that you want to display on the table.
- Filter - Provides different filtering possibilities, based on metadata, to apply to the table. You can combine multiple filters together.
- Download - Allows you to download the different files stored in the database. If a sample has multiple files of the same type, they will all be downloaded and differentiated with the tags "NR" for Non-redundant and "P" for Partial.
The downloaded archive will also include a text file called not_found.txt if any of the requested files are not available. - Search bar - You can use the search bar to match the information you entered with any referenced information. For example, entering "China" will display data referenced from China, while "Crassostrea gigas" will show all data with this host referenced.
Regarding missing values, they can be categorised as follows:
- Unknown - Information is missing.
- TBD - To Be Determined, the information will be filled at a later date.
- NA - Not Applicable, the information field is irrelevant to this particular sample.
On the platform, the 30 following metadata can be display and be used to filter genomes:
Sample name |
Name of the sequence. Clicking on it redirects
to the information page (genome visualisation) of the sample. |
Sequencing technology | Illumina, Oxford Nanopore, PacBio... |
Strain | Genetic variant. | Sequencer | Name of the sequencer used. |
Isolate |
A population of organisms with minimal
genetic mixing. |
Information on sequencing | Paired-end sequencing or not, read size. |
Host species | Species of the pathogen's host. | Number of reads | Number of reads in the sequencing run. |
Collection year | Year the sample was collected. | Sequence coverage | Mean sequence coverage from mapping. |
Collection date | Exact date of the sample collection. | Sequence size (BP) | Number of nucleotides in the sequence. |
Country of origin | Country where the sample is from. | Pathogen charge | Amount of pathogen detected in copies/µl. |
Localisation | Locality of the sample collection. | Assembly | Technique used to assemble the sequence. |
Latitude/Longitude | GPS coordinates of the locality. | Genome type |
Corresponds to the state of the sequence
(more info here). |
Nature of coordinates |
The coordinates can either be "Verified"
(meaning accurate), or "Approximated" on the locality or country general coordinates. |
Structure |
Names of sequence structural regions, separated
by "-". Different types of sequences are indicated with underscores, with the complete sequence listed first, followed by non-redundant and partial sequences. |
Host stage | Stage of development of the host. | Name of submitter |
Name of the person who submitted/owns
the data. |
Isolation source |
Host body parts from where the pathogen
was sequenced. |
Organization |
Organization (laboratory, university...) of
the person who submitted/owns the data. |
Pool/Individual |
Pathogen sample originates from the
sequencing of a "Pool" of hosts or an "Individual". |
Publication DOI | Clicking on it redirects to the article. |
Sample conservation method |
How the sample was conserved before
sequencing. |
Notes | Additionnal information regarding the sample. |
Number of contigs | Number of contigs in the assembly. | Genbank accession number |
GenBank identifier. Clicking on it redirects
to the NCBI page. |
The update of MoPSeq-DB pathogen data is performed by regular crawling through the NCBI/EBI databases to retrieve new genomes of pathogen referenced in MoPSeq-DB. The plateform will also be regularly enriched with new pathogens as sequencing of mollusc pathogens progresses.
You can resquest for the addition of missing data by filling the form available in the Updates tab.
Currently under construction, there will also be the possibility for anyone to submit their own data if submission to GenBank/EBI is not planned. The submitted data will be first checked to ensure that the user metadata corresponds to the fields template of MoPSeq-DB before being reviewed by the maintenance team of the platform.
Data visualisation plays a significant role in interpreting genomic data. MoPSeq-DB adds value to other public sequence repositories by enabling data visualisation through interactive graphics. A set of scripts generate pre-proceded files allowing a dynamic visualisation of genomic information.
The visualisation aspect of the project can be divided into two sub-aspects : the visualisation of each sample's genomic information, and the creation of a phylogenetic tree from all genomes of a same pathogen.
-> Regarding the sample's genomic information, it can be visualised on the individual pages of each sample. To access these pages, users need to click on the value in the Sample Name attribute column in the data table (more information on the data table can be found here).
-> For the pathogen's phylogenetic placement, the visualisation will occur within the Phylogenies tab, in the navigation bar.
Note 2: The page provides the option to download files related to the selected sample, but it only displays the corresponding download options when available (e.g., the download option for VCF files will not be displayed if none exist).
The visualisation of the genomic structure is presented using an interactive graph that represents the different features recorded in the GFF file of the sequence. The various types of features are differentiated using a color code and positioned on different y-axis values to prevent overlap.
Each pathogen may have its own color code and representation characteristics based on what we consider most interesting to represent. For example, the structural regions of Ostreid herpesvirus-1 have different color representations, while the repeated regions have the same representation (see example below). New models of caracteristic representation can easily be defined as new pathogens are add to the platform.
The interactive graph displays labels such as the ORF number or region name when hovering over the represented elements. Additionally, the graph offers several options to the user, such as zooming on the x-axis (nucleotide positions), moving within the graph, selecting an element, and more.
Example of the graph and its functionalities:
- Allow pan movement on the x-axis (active).
- Allow wheel zoom on the x-axis (non-active).
- Allow user selection of element (active).
- Refresh figure.
- Activate labels when hovering (active).
- Exemple of hovering label (here the name of a structural region).
- Exemple of GFF elements (CDS, genes, variations...).
- Due to its importance, the stem-loop region has a fixed label when present in the GFF file.
- Exemple of specific representation, here the two repaeated regions TRL and IRL have the same green color.
Three reference genomes have been utilised for this purpose:
- U17 for Vibrio aesturianus subspecies cardii
- 10_092_7MT1 for Vibrio aesturianus subspecies francensis
- 03008T for Vibrio aesturianus
Additionnal information has been calculated using GC_analysis and SkewIT. These analyses provide insights into the GC content and skewness of the genome, which can be valuable for understanding the characteristics and dynamics of Vibrio aestuarianus. Data is visualised on two scales:
- An overview of the chromosome linked to
- an interactive detailed one allowing to move horizontally on the graph, zoom on and get information of a selected element. Complementary information is displayed:
- GC content
- GC skew along the genome.
Hovering over a label on the graphs will display either the position and frequency of the variation or the position and coverage depth, depending on the graph.
In cases where a sample has more than 10,000 variations, the variations will be filtered based on frequency to limit the number of displayed labels to a maximum of 10,000.
However, if the graphic becomes slow with a lower number of variations, it is recommended for the user to hide the variations, navigate to the desired sequence positions, and then re-enable the display of variations.
For most pathogens, it is possible to observe both inter-individual and intra-individual variations. Inter-individual variations represent the differences between the sample genome and the reference genome, while intra-individual variations represent the variations detected within the pool of reads used to assemble the sample's consensus sequence.
In the provided example graph, variation labels may overlap and be difficult to distinguish due to their close proximity in the sequence. However, users can easily isolate them by using available tools such as zooming and pan movement.
Example of the graph and its functionalities:
- Allow multi-directionnal pan movement (active).
- Allow box zooming (non-active).
- Allow wheel zoom on the x-axis (active).
- Refresh figure.
- Save a picture of the current graph.
- Activate labels when hovering (active).
- Exemples of variation labels : SNP are colored based on nucleotid, INDEL and complex are grey but differentiated by their labels form (see legend).
- Buttons to activate displaying of a type of label (all active).
- Exemple of variation coverage representation, coverage of inter-variations regions can also be displayed if the information is available in the VCF.
On the same page where the sample data is represented, there is a phylogenetic tree (with limited interaction) of the pathogen. It displays the current sample's placement in the phylogeny and highlights its path to the root node.
At the top of the tree, there is a box containing information about tree generation, model selection, and a link to the pathogen's phylogenetic tree page. This page provides more interactions and better visualisation of the tree.
On the pathogen's phylogenetic tree page, which can be accessed through the Phylogenies tab or the link present on the sequence data page or sample information page, you will find a tree built with all available genomes of the pathogen.
Several ineractivity options are available on the toolbar above the table:
- Export - Allows you to download an image of the current tree with the selected or filtered nodes or edges in SVG format, or the tree file itself in NEWICK format.
- Informations - Displays an information box regarding the tree generation and model selection. This option assure the auditability of the calculation of the tree, information displayed are automaticaly saved in a report during tree generation.
- Selection - Provides a wide range of selection options for the tree, such as selecting leaf nodes or internal nodes.
- Search bar - You can use the search bar to search for specific sequences recorded in the tree.
Right below the toolbar, additionnal options are available:
Expand the tree vertical spacing. | Compress the tree vertical spacing. | ||
Expand the tree horizontal spacing. | Compress the tree horizontal spacing. | ||
Sort deepest clades to the bottom. | Sort deepest clades to the top. | ||
Restore the tree to its original order. | Linear | Display the tree in a linear configuration (default). | |
Radial | Display the tree in a radial configuration. | Display the sequences names at the end of leaf edges. | |
Display all the sequences names at the same level. |
Software | Version | Web link |
---|---|---|
Python | 3.9.12 | https://www.python.org |
Django | 3.0.14 | https://www.djangoproject.com |
Django-Tailwind | 3.1.1 | https://django-tailwind.readthedocs.io |
Django-active-link | 0.1.8 | https://pypi.org/project/django-active-link |
Django-filter | 21.1 | https://django-filter.readthedocs.io |
DNA Features Viewer | 3.1.1 | https://github.com/Edinburgh-Genome-Foundry/DnaFeaturesViewer |
Bokeh | 2.4.2 | https://docs.bokeh.org |
Pandas | 1.4.2 | https://pandas.pydata.org |
Bcbio-gff | 0.6.9 | https://openbase.com/python/bcbio-gff |
Cyvcf2 | 0.30.15 | https://github.com/brentp/cyvcf2 |
Typer | 0.4.1 | https://typer.tiangolo.com |
Minimap2 | 2.17 | https://github.com/lh3/minimap2 |
Software | Version | Web link |
---|---|---|
MAFFT | 7.310 | https://mafft.cbrc.jp |
Java | 11.0.15 | https://www.java.com |
jModelTest2 | 2.1.10 | https://github.com/ddarriba/jmodeltest2 |
Phyml | 3.3.3 | https://github.com/stephaneguindon/phyml |
RAxML | 8.2.11 | https://cme.h-its.org/exelixis/web/software/raxml |
phylotree.js | 1.0.0 | https://github.com/veg/phylotree.js |
andi | 0.12 | https://github.com/EvolBioInf/andi |
ape library | 5.7-1 | https://cran.r-project.org/package=ape |
RagTag | 2.1.0 | https://github.com/malonge/RagTag/wiki/scaffold |
GC_analysis | 0.4.5 | https://github.com/tonyyzy/GC_analysis |
SkewIT | 1.0 | https://github.com/jenniferlu717/SkewIT |
Gos | 0.1.1 | https://github.com/gosling-lang/gos |
Go back to the top