MoPSeq

MoPSeq-DB allows users to i) navigate through all the genomic data related to mollusc pathogens, ii) access fully interactive views of genome structures, variants, and phylogenetic trees, and iii) download data in various formats: FASTA, GFF (General Feature Format), VCF (Variant Call Format), Newick phylogenetic tree, and PNG.

The user-friendly platform provides a comprehensive overview of the genomes and pathogens referenced in the database. Because marine bivalve molluscs are infected by a variety of pathogens, the MoPSeq-DB visualisation tools are designed to fit the particularities of each type of genome. MoPSeq-DB is therefore suited to viral, bacterial and eukaryotic genomes.

This page explains the design of MoPSeq-DB to help users explore the platform.

The sharing aspect of the project is based on referencing genomic data related to mollusc pathogens. MoPSeq-DB functions as a downstream repository for data collected from public genome-sharing repositories.

MoPSeq-DB shares four different types of files for each referenced pathogen:

Type	Description	Presence
FASTA	Text-based format for representing nucleotide sequences using single-letter codes.	Mandatory
CSV	Delimited text file that uses a comma to separate values of each metadata attribute.	Mandatory
GFF	General Feature Format, a file format used to describe genes and other features of DNA sequences.	Optional
VCF	Variant Call Format, a text file format used to store nucleotide sequence variation information.	Optional

Each FASTA, GFF, or VCF file corresponds to one of three states of the genome assembly it refers to:

Complete - The complete genome of the sample.
Non-redundant - A less complex complete genome, with only one copy of repeated regions (Delmotte-Pelletier et al. Virus Evolution, 2022.). The regions can also be re-organised depending on the pathogen.
Partial - When the genome is only a fraction of the referenced genome/contig.

Thus, some samples may have several FASTA, GFF or VCF files.

For users, data sharing takes place within the Sequence Databases tab, in the navigation bar. When hovering over it, you can select the desired pathogen and see a table with the referenced data.

1. Page overview

The main element of the page will be the table referencing the data. In this table, you have several direct functionalities, such as sorting the table based on the attribute of your choice or controlling the number of displayed lines.

Note: There is an information box at the top of the page that links to the pre-built interactive phylogenetic tree for the pathogen. See II.3 Phylogenetic tree.

The table's direct functionalities include:

Note 2: On the sample visualisation page, you can find all metadata associated with the sample, along with different visualisation graphs generated to support data exploration and interpretation. These visualisations are explained in II. Interactive visualisation graphs.

The platform offers dynamic data searching and filtering functionality, found above the table in the following toolbar:

Display - Enables you to select the list of attributes (metadata) that you want to display on the table.
Filter - Provides different filtering possibilities, based on metadata, to apply to the table. You can combine multiple filters together.
Download - Allows you to download the different files stored in the database. If a sample has multiple files of the same type, they will all be downloaded and differentiated with the tags "NR" for Non-redundant and "P" for Partial.
The downloaded archive will also include a text file called not_found.txt if any of the requested files are not available.
Search bar - You can use the search bar to match the information you entered with any referenced information. For example, entering "China" will display data referenced from China, while "Crassostrea gigas" will show all data with this host referenced.
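
To reproduce a comparable search offline on a downloaded metadata CSV, a minimal Python sketch might look like the following. It assumes a case-insensitive substring match across every field, which may differ from the matching logic actually used by the platform. The column names and rows are hypothetical.

```python
import csv
import io


def search_rows(rows, query):
    """Return rows where any field contains the query (case-insensitive)."""
    needle = query.strip().lower()
    return [
        row for row in rows
        if any(needle in (value or "").lower() for value in row.values())
    ]


# Hypothetical metadata excerpt; real MoPSeq-DB column names may differ.
SAMPLE_CSV = """sample_name,host_species,country
sample_A,Crassostrea gigas,China
sample_B,Crassostrea gigas,France
sample_C,Pecten maximus,France
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
china = search_rows(rows, "China")
gigas = search_rows(rows, "crassostrea gigas")
```

Here `china` holds the single row referenced from China, and `gigas` holds the two rows whose host is Crassostrea gigas.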

2. Metadata

The metadata fields used in MoPSeq-DB were designed to be as relevant and complete as possible in relation to pathogen analyses and attributes commonly found in databases like NCBI/EBI. MoPSeq-DB gathers up to 51 metadata fields corresponding to sample data, epidemiologic data, sequencing and assembling technical data, etc. You can download the MoPSeq-DB metadata template here.

Regarding missing values, they can be categorised as follows:

Unknown - Information is missing.
TBD - To Be Determined, the information will be filled at a later date.
NA - Not Applicable, the information field is irrelevant to this particular sample.
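
When working with a downloaded metadata file, the three markers can be kept distinct rather than merged into a single "missing" value. The sketch below counts each marker with Python's standard csv module; the field names and rows are hypothetical, and the exact spelling of the markers in the files is assumed to match the list above.

```python
import csv
import io
from collections import Counter

# The three missing-value markers described above.
MISSING_MARKERS = {"Unknown", "TBD", "NA"}


def count_missing(rows):
    """Count, per marker, how many cells hold that marker."""
    counts = Counter()
    for row in rows:
        for value in row.values():
            if value in MISSING_MARKERS:
                counts[value] += 1
    return counts


# Hypothetical excerpt; field names are illustrative only.
SAMPLE_CSV = """sample_name,collection_year,host_stage
sample_A,2016,Unknown
sample_B,TBD,NA
sample_C,NA,Adult
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = count_missing(rows)
```

The csv module keeps "NA" as a literal string. Some table readers (pandas `read_csv`, for instance) convert "NA" to a null value by default, which would blur the difference between "NA" and a truly empty cell.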

On the platform, the following 30 metadata fields can be displayed and used to filter genomes:

Sample name	Name of the sequence. Clicking on it redirects to the information page (genome visualisation) of the sample.	Sequencing technology	Illumina, Oxford Nanopore, PacBio...
Strain	Genetic variant.	Sequencer	Name of the sequencer used.
Isolate	A population of organisms with minimal genetic mixing.	Information on sequencing	Paired-end sequencing or not, read size.
Host species	Species of the pathogen's host.	Number of reads	Number of reads in the sequencing run.
Collection year	Year the sample was collected.	Sequence coverage	Mean sequence coverage from mapping.
Collection date	Exact date of the sample collection.	Sequence size (BP)	Number of nucleotides in the sequence.
Country of origin	Country where the sample is from.	Pathogen charge	Amount of pathogen detected in copies/µl.
Localisation	Locality of the sample collection.	Assembly	Technique used to assemble the sequence.
Latitude/Longitude	GPS coordinates of the locality.	Genome type	Corresponds to the state of the sequence (more info here).
Nature of coordinates	The coordinates can either be "Verified" (meaning accurate), or "Approximated" on the locality or country general coordinates.	Structure	Names of sequence structural regions, separated by "-". Different types of sequences are indicated with underscores, with the complete sequence listed first, followed by non-redundant and partial sequences.
Host stage	Stage of development of the host.	Name of submitter	Name of the person who submitted/owns the data.
Isolation source	Host body parts from where the pathogen was sequenced.	Organization	Organization (laboratory, university...) of the person who submitted/owns the data.
Pool/Individual	Pathogen sample originates from the sequencing of a "Pool" of hosts or an "Individual".	Publication DOI	Clicking on it redirects to the article.
Sample conservation method	How the sample was conserved before sequencing.	Notes	Additional information regarding the sample.
Number of contigs	Number of contigs in the assembly.	Genbank accession number	GenBank identifier. Clicking on it redirects to the NCBI page.

3. Data update

MoPSeq-DB pathogen data is updated by regularly crawling the NCBI/EBI databases to retrieve new genomes of the pathogens referenced in MoPSeq-DB. The platform will also be regularly enriched with new pathogens as the sequencing of mollusc pathogens progresses.
You can request the addition of missing data by filling in the form available in the Updates tab.
A feature currently under construction will also let anyone submit their own data if submission to GenBank/EBI is not planned. Submitted data will first be checked to ensure that the user metadata corresponds to the MoPSeq-DB field template, before being reviewed by the platform's maintenance team.
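
As an illustration of what such a template check could involve, the sketch below compares the column names of a submission with those of a template. It is not the platform's actual validation code; the function name, the field names and the three-field template are hypothetical.

```python
def check_fields(submitted_fields, template_fields):
    """Compare submitted column names with the template.

    Returns (missing, unexpected): template fields absent from the
    submission, and submitted fields the template does not define.
    """
    submitted = set(submitted_fields)
    template = set(template_fields)
    missing = sorted(template - submitted)
    unexpected = sorted(submitted - template)
    return missing, unexpected


# Hypothetical three-field template; the real one has up to 51 fields.
TEMPLATE = ["sample_name", "host_species", "country"]
missing, unexpected = check_fields(
    ["sample_name", "country", "colour"], TEMPLATE
)
```

In this toy case `missing` lists the one template field the submission lacks, and `unexpected` lists the one column the template does not define.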

II. Interactive visualisation graphs

Data visualisation plays a significant role in interpreting genomic data. MoPSeq-DB adds value to other public sequence repositories by enabling data visualisation through interactive graphics. A set of scripts generates pre-processed files that allow dynamic visualisation of genomic information.
The visualisation aspect of the project can be divided into two sub-aspects: the visualisation of each sample's genomic information, and the creation of a phylogenetic tree from all genomes of the same pathogen.
-> Regarding the sample's genomic information, it can be visualised on the individual pages of each sample. To access these pages, users need to click on the value in the Sample Name attribute column in the data table (more information on the data table can be found here).
-> For the pathogen's phylogenetic placement, the visualisation will occur within the Phylogenies tab, in the navigation bar.

Note 1: For the visualisation of genomic structure and variations, each sequence can have several visualisation elements (e.g., one graph for the complete sequence and one graph for the non-redundant sequence).
Note 2: The page provides the option to download files related to the selected sample, but it only displays the corresponding download options when available (e.g., the download option for VCF files will not be displayed if none exist).

1. Genomic structures and annotations

MoPSeq-DB visualisation tools are adapted to manage genomic data from various pathogens beyond mollusc diseases. Therefore, different types of graphs are generated to meet the constraints specific to each pathogen kingdom.

a. Viruses

The visualisation of the genomic structure is presented using an interactive graph that represents the different features recorded in the GFF file of the sequence. The various types of features are differentiated using a color code and positioned on different y-axis values to prevent overlap.
Each pathogen may have its own color code and representation characteristics based on what we consider most interesting to represent. For example, the structural regions of Ostreid herpesvirus-1 have different color representations, while the repeated regions have the same representation (see example below). New representation models can easily be defined as new pathogens are added to the platform.

The interactive graph displays labels such as the ORF number or region name when hovering over the represented elements. Additionally, the graph offers several options to the user, such as zooming on the x-axis (nucleotide positions), moving within the graph, selecting an element, and more.
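
The idea of reading features from a GFF file and placing each feature type on its own y-axis level can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used by MoPSeq-DB: the function names are hypothetical, the GFF excerpt is invented, and levels are simply assigned in order of first appearance.

```python
def parse_gff(text):
    """Parse GFF lines into dicts with type, start, end and attributes."""
    features = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attributes = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        features.append({
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "attributes": attributes,
        })
    return features


def assign_levels(features):
    """Give each feature type its own y level, in order of first appearance."""
    levels = {}
    for feature in features:
        levels.setdefault(feature["type"], len(levels))
    return levels


# Hypothetical GFF3 excerpt, not taken from the database.
GFF_TEXT = (
    "##gff-version 3\n"
    "seq1\texample\tregion\t1\t1000\t.\t+\t.\tID=r1;Name=UL\n"
    "seq1\texample\tgene\t100\t400\t.\t+\t.\tID=g1;Name=ORF1\n"
    "seq1\texample\tCDS\t100\t400\t.\t+\t0\tID=c1;Parent=g1\n"
    "seq1\texample\tgene\t600\t900\t.\t-\t.\tID=g2;Name=ORF2\n"
)

features = parse_gff(GFF_TEXT)
levels = assign_levels(features)
```

Each feature's `Name` attribute could then serve as the hover label, and `levels[feature["type"]]` as its vertical position in the plot.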

Example of the graph and its functionalities:

Allow pan movement on the x-axis (active).
Allow wheel zoom on the x-axis (non-active).
Allow user selection of element (active).
Refresh figure.
Activate labels when hovering (active).
Example of a hovering label (here the name of a structural region).
Example of GFF elements (CDS, genes, variations...).
Due to its importance, the stem-loop region has a fixed label when present in the GFF file.
Example of a specific representation: here the two repeated regions TRL and IRL have the same green color.

b. Bacteria

In MoPSeq-DB, most Vibrio aestuarianus chromosomes have been artificially reconstructed using RagTag to facilitate viewing, as bacterial genome assemblies are more fragmented. Contigs obtained from the assembly have been grouped, ordered, and oriented to form chromosomes based on complete reference genomes.

Three reference genomes have been utilised for this purpose:

U17 for Vibrio aestuarianus subspecies cardii
10_092_7MT1 for Vibrio aestuarianus subspecies francensis
03008T for Vibrio aestuarianus

Additional information has been calculated using GC_analysis and SkewIT. These analyses provide insights into the GC content and skewness of the genome, which can be valuable for understanding the characteristics and dynamics of Vibrio aestuarianus. Data is visualised on two scales:

An overview of the chromosome, linked to
an interactive detailed view that lets you move horizontally along the graph, zoom in, and get information on a selected element.

Complementary information is displayed:

GC content
GC skew along the genome.
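
For readers unfamiliar with these two measures, the sketch below computes them over consecutive windows using their standard definitions: GC content is (G + C) divided by the window length, and GC skew is (G - C) / (G + C). It is only an illustration; GC_analysis and SkewIT apply their own windowing and statistics, and the function name and toy sequence are hypothetical.

```python
def gc_windows(sequence, window):
    """Yield (start, gc_content, gc_skew) for consecutive windows.

    gc_content = (G + C) / window length
    gc_skew    = (G - C) / (G + C), or 0.0 when the window has no G or C
    """
    seq = sequence.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        content = (g + c) / len(chunk)
        skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, content, skew


# Toy sequence, for illustration only.
results = list(gc_windows("GGGCATATCCCG", 4))
```

A positive skew marks a window richer in G than in C, and a negative skew the opposite. The final window is shorter than the others whenever the sequence length is not a multiple of the window size.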

c. Protozoa

Under development

2. Genomic variations

Based on the different VCF files, the frequency of detected variants and their coverage are represented using two interactive graphs. Similar to the genomic structure visualisation (see above), these graphs provide various options to the users. Additionally, there are buttons under the graph that allow users to hide the variation or reference labels.
Hovering over a label on the graphs will display either the position and frequency of the variation or the position and coverage depth, depending on the graph.

In cases where a sample has more than 10,000 variations, the variations will be filtered based on frequency to limit the number of displayed labels to a maximum of 10,000.
However, if the graph becomes slow even with fewer variations, we recommend hiding the variations, navigating to the desired sequence positions, and then re-enabling the display of variations.
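
A frequency-based cap of this kind could be sketched as follows. The text above only states that variations are filtered by frequency; keeping the highest-frequency variations is an assumption made here for illustration, and the function name and dictionary keys are hypothetical.

```python
def cap_variants(variants, limit=10_000):
    """Keep at most `limit` variants, preferring the highest frequencies.

    The result is returned in genomic order (by position).
    """
    if len(variants) <= limit:
        return sorted(variants, key=lambda v: v["pos"])
    kept = sorted(variants, key=lambda v: v["freq"], reverse=True)[:limit]
    return sorted(kept, key=lambda v: v["pos"])


# Hypothetical variants: position on the sequence and allele frequency.
variants = [
    {"pos": 120, "freq": 0.05},
    {"pos": 45, "freq": 0.90},
    {"pos": 300, "freq": 0.40},
    {"pos": 210, "freq": 0.02},
]
top_two = cap_variants(variants, limit=2)
```

With a limit of 2, only the variants at positions 45 and 300 remain, since they have the two highest frequencies.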

For most pathogens, it is possible to observe both inter-individual and intra-individual variations. Inter-individual variations represent the differences between the sample genome and the reference genome, while intra-individual variations represent the variations detected within the pool of reads used to assemble the sample's consensus sequence.

In the provided example graph, variation labels may overlap and be difficult to distinguish due to their close proximity in the sequence. However, users can easily isolate them by using available tools such as zooming and pan movement.

Example of the graph and its functionalities:

Allow multi-directional pan movement (active).
Allow box zooming (non-active).
Allow wheel zoom on the x-axis (active).
Refresh figure.
Save a picture of the current graph.
Activate labels when hovering (active).
Examples of variation labels: SNPs are colored by nucleotide; INDELs and complex variations are grey but differentiated by the shape of their labels (see legend).
Buttons to toggle the display of each type of label (all active).
Example of variation coverage representation; the coverage of regions between variations can also be displayed if the information is available in the VCF.

3. Phylogenetic tree

A phylogenetic tree is generated for each pathogen implemented in the platform. The Phylogenies tab in the navigation bar redirects to an index page where the user can choose to see an interactive graph of the phylogenetic tree built from all genomes of a given pathogen in the database.

Note
On the same page where the sample data is represented, there is a phylogenetic tree (with limited interaction) of the pathogen. It displays the current sample's placement in the phylogeny and highlights its path to the root node.
At the top of the tree, there is a box containing information about tree generation, model selection, and a link to the pathogen's phylogenetic tree page. This page provides more interactions and better visualisation of the tree.

On the pathogen's phylogenetic tree page, which can be accessed through the Phylogenies tab or the link present on the sequence data page or sample information page, you will find a tree built with all available genomes of the pathogen.

Note: You can interact with the tree layout using the options provided above the tree. Additionally, you can interact with the tree by clicking on a node or sequence name, allowing you to hide or collapse subtrees, view the descendants, terminal or internal branches of a selected node, reroot the tree, and more.

Several interactivity options are available on the toolbar above the tree:

Export - Allows you to download an image of the current tree with the selected or filtered nodes or edges in SVG format, or the tree file itself in NEWICK format.
Informations - Displays an information box regarding the tree generation and model selection. This option ensures that the tree calculation is auditable: the information displayed is automatically saved in a report during tree generation.
Selection - Provides a wide range of selection options for the tree, such as selecting leaf nodes or internal nodes.
Search bar - You can use the search bar to search for specific sequences recorded in the tree.
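
Once a tree has been exported in NEWICK format, its sequence names can be listed or searched offline. The sketch below uses a regular expression that handles simple, unquoted labels only; it is a simplification rather than a full Newick parser, and the tree and sample names are made up.

```python
import re

# A leaf label directly follows "(" or "," and ends at ":", ",", ")" or ";".
LEAF_PATTERN = re.compile(r"[(,]\s*([^(),:;]+)")


def leaf_names(newick):
    """Return leaf labels of a Newick string, in order of appearance."""
    return [name.strip() for name in LEAF_PATTERN.findall(newick)]


def search_leaves(newick, query):
    """Return leaf labels containing the query (case-insensitive)."""
    needle = query.lower()
    return [name for name in leaf_names(newick) if needle in name.lower()]


# Hypothetical tree with made-up sample names and branch lengths.
TREE = "((sample_A:0.1,sample_B:0.2)0.95:0.05,other_C:0.3);"
names = leaf_names(TREE)
hits = search_leaves(TREE, "SAMPLE")
```

Internal node labels such as the support value 0.95 follow a closing parenthesis, so they are not reported as leaves.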

Right below the toolbar, additional options are available:

Expand the tree vertical spacing.
Compress the tree vertical spacing.
Expand the tree horizontal spacing.
Compress the tree horizontal spacing.
Sort deepest clades to the bottom.
Sort deepest clades to the top.
Restore the tree to its original order.
Linear - Display the tree in a linear configuration (default).
Radial - Display the tree in a radial configuration.
Display the sequence names at the end of leaf edges.
Display all the sequence names at the same level.

III. Analyses

To go further, MoPSeq-DB aims to increase the accessibility and reproducibility of genomic data analyses. Two additional functionalities are therefore planned for inclusion in the platform.

1. Evolutionary placement

Under construction, this aspect operates with a pipeline based on the Evolutionary Placement Algorithm (EPA) of the RAxML software. Users can submit sequences, and the algorithm will suggest high-likelihood branch placements for them on the existing pathogen tree. The EPA algorithm allows for fast computing by applying maximum likelihood without reconsidering the entire phylogenetic tree.
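
For orientation, a RAxML 8 placement run is typically started with the `-f v` option, a reference tree, and an alignment containing both reference and query sequences. The sketch below only builds such a command line; it does not describe the MoPSeq-DB pipeline itself. The executable name, substitution model and file names are assumptions.

```python
def epa_command(alignment, reference_tree, run_name, model="GTRGAMMA"):
    """Build the argument list for a RAxML EPA placement run.

    alignment      -- alignment holding reference and query sequences
    reference_tree -- Newick tree containing the reference sequences only
    """
    return [
        "raxmlHPC",            # executable name; varies between builds
        "-f", "v",             # EPA mode: place query sequences on the tree
        "-s", alignment,
        "-t", reference_tree,
        "-m", model,           # substitution model (assumed here)
        "-n", run_name,        # suffix used for the output files
    ]


# Hypothetical file names.
command = epa_command("all_sequences.phy", "reference.nwk", "placement")
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where RAxML is installed.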

2. Phylogeography

Under construction, this aspect is intended to integrate open-source modules from Nextstrain (Hadfield et al. Bioinformatics, 2018.) to enable near real-time epidemiology monitoring and phylogenetic visualisation of compatible pathogens registered in the database. This will allow the joint analysis of epidemiologic and genomic data, and thus extend the scope of the platform.

IV. Technical Information

1. Software

Software	Version	Web link
Python	3.9.12	https://www.python.org
Django	3.0.14	https://www.djangoproject.com
Django-Tailwind	3.1.1	https://django-tailwind.readthedocs.io
Django-active-link	0.1.8	https://pypi.org/project/django-active-link
Django-filter	21.1	https://django-filter.readthedocs.io
DNA Features Viewer	3.1.1	https://github.com/Edinburgh-Genome-Foundry/DnaFeaturesViewer
Bokeh	2.4.2	https://docs.bokeh.org
Pandas	1.4.2	https://pandas.pydata.org
Bcbio-gff	0.6.9	https://openbase.com/python/bcbio-gff
Cyvcf2	0.30.15	https://github.com/brentp/cyvcf2
Typer	0.4.1	https://typer.tiangolo.com
Minimap2	2.17	https://github.com/lh3/minimap2

Software	Version	Web link
MAFFT	7.310	https://mafft.cbrc.jp
Java	11.0.15	https://www.java.com
jModelTest2	2.1.10	https://github.com/ddarriba/jmodeltest2
Phyml	3.3.3	https://github.com/stephaneguindon/phyml
RAxML	8.2.11	https://cme.h-its.org/exelixis/web/software/raxml
phylotree.js	1.0.0	https://github.com/veg/phylotree.js
andi	0.12	https://github.com/EvolBioInf/andi
ape library	5.7-1	https://cran.r-project.org/package=ape
RagTag	2.1.0	https://github.com/malonge/RagTag/wiki/scaffold
GC_analysis	0.4.5	https://github.com/tonyyzy/GC_analysis
SkewIT	1.0	https://github.com/jenniferlu717/SkewIT
Gos	0.1.1	https://github.com/gosling-lang/gos

Icons used in MoPSeq-DB are from icones8 website.

2. Code availability

Source code, documentation and a Docker container are available in the MoPSeq-DB GitLab repository.

3. Data availability

The MoPSeq-DB dataset is available for direct access and loading through Sextant Catalog, a CoreTrustSeal-certified data service from Ifremer: doi.org/10.12770/52134702-4bbd-4c63-af34-9a2cde28e0cc.
