Table of Contents
- What are Primary Databases?
- GenBank
- European Molecular Biology Laboratory (EMBL)
- DNA Data Bank of Japan (DDBJ)
- Protein Data Bank (PDB)
- Gene Expression Omnibus (GEO)
- Applications of Primary Databases
The explosion of raw sequence data in modern genomic research requires biological databases that can store and organize it efficiently. These databases serve as repositories for biological data, and their organization makes the stored information easy to retrieve.
What are Primary Databases?
- Primary databases are a type of biological database containing original and unprocessed biological data.
- These databases typically consist of raw sequences, such as nucleotide or protein sequences, or structural information, such as molecular structures.
- Several primary sequence databases are widely used in the field of bioinformatics:
- GenBank at the National Center for Biotechnology Information (NCBI)
- DNA Data Bank of Japan (DDBJ)
- The nucleotide sequence database of the European Molecular Biology Laboratory (EMBL)
- Other examples of primary databases include:
- Protein Data Bank (PDB)
- Gene Expression Omnibus (GEO)
- ArrayExpress
GenBank
- GenBank is a primary biological database managed by the National Center for Biotechnology Information (NCBI).
- It is an annotated collection of publicly available sequences, including information about genes, proteins, and other genetic elements.
- GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), a joint effort between three primary databases: GenBank, DDBJ, and EMBL.
- These organizations exchange sequence data with one another daily, so each database holds the same, current set of records.
- The GenBank flat file format is used to represent sequence data and annotations in the database.
- GenBank accepts mRNA or genomic sequence data with proper source organism information and annotation provided by the submitter.
- However, the database does not accept:
- Noncontiguous sequences
- Primer sequences
- Protein sequences without underlying nucleotide submission
- Mixed genomic and mRNA sequences
- Consensus sequences
- Sequences with lengths of less than 200 nucleotides
- To submit sequences to this database, several tools are available, one web-based and two stand-alone:
- BankIt: A web-based submission tool allowing users to submit gene sequences to the GenBank database. It supports the submission of sets of sequences.
- Sequin: A stand-alone submission tool used for more complex submissions, such as those containing long sequences, multiple annotations, or gapped sequences. Sequin can be downloaded from the FTP site for use on Mac, PC, and UNIX platforms. Each Sequin file should have fewer than 10,000 sequences to ensure maximum performance.
- tbl2asn: A stand-alone command-line tool used for even larger submissions. Like Sequin, tbl2asn can be downloaded from the FTP site. The submitter prepares the input files offline and then runs tbl2asn to generate the submission file.
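As a rough illustration of the flat file format mentioned above, the sketch below reads one record using only the Python standard library. It assumes the usual layout, in which top-level keywords such as LOCUS and ACCESSION start in the first column, the sequence follows an ORIGIN line, and `//` ends the record. The record `TOY0001` is invented for this example, and `parse_genbank_record` and `meets_minimum_length` are hypothetical helper names, not part of any NCBI tool.

```python
# Minimal sketch of reading one GenBank flat file record (standard library only).
# The record below is invented for illustration; it is not a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Invented toy record
            for illustration only.
ACCESSION   TOY0001
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank_record(text):
    """Return the top-level fields plus the sequence from a single record."""
    fields = {}
    sequence_parts = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines carry a position number and spaced blocks of bases.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].strip():             # a keyword starts in the first column
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None and line.strip():
            # Indented lines continue the value of the previous keyword.
            fields[current] += " " + line.strip()
    fields["sequence"] = "".join(sequence_parts)
    return fields


def meets_minimum_length(sequence, minimum=200):
    """Check the 200-nucleotide lower bound listed among the submission rules."""
    return len(sequence) >= minimum


record = parse_genbank_record(SAMPLE_RECORD)
print(record["ACCESSION"], len(record["sequence"]))
print(meets_minimum_length(record["sequence"]))
```

The toy sequence is far shorter than 200 nucleotides, so the length check reports that it would not qualify for submission. For real work, an established parser such as Biopython's `SeqIO` module is a safer choice than a hand-written one.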
European Molecular Biology Laboratory (EMBL)
- The EMBL database, named after the European Molecular Biology Laboratory, is a collection of nucleotide sequence data maintained by the European Bioinformatics Institute (EBI).
- It is part of the International Nucleotide Sequence Database Collaboration (INSDC), along with the GenBank and DDBJ databases.
- EMBL’s main focus is on the storage and distribution of nucleotide and protein sequences, providing tools and resources for researchers to analyze and interpret this data.
- Like other primary databases, EMBL collects and archives data from various sources, including scientific publications and direct submissions from researchers.
- One of the main features of EMBL is its user-friendly interface, allowing researchers to easily search for and retrieve data.
- EMBL offers a range of tools and resources for sequence analysis, including:
- Alignment tools
- Phylogenetic trees
- Protein structure prediction software
- EMBL uses a sequence submission tool called Webin:
- This tool is web-based and can be accessed through EMBL’s website.
- With Webin, researchers can submit single sequences, multiple sequences, or a large number of sequences.
DNA Data Bank of Japan (DDBJ)
- DDBJ (DNA Data Bank of Japan) is a primary database that collects and stores nucleotide sequence data, mainly from Japanese researchers, but it also accepts submissions from researchers in other countries and assigns accession numbers to them.
- DDBJ is a member of the International Nucleotide Sequence Database Collaboration (INSDC) and regularly exchanges collected data with EMBL and GenBank.
- DDBJ's main activities include:
- Collecting and exchanging nucleotide sequence data
- Managing bioinformatics tools for data submission and retrieval
- Developing tools for biological data analysis
- Organizing Bioinformatics Training Courses in Japanese to teach people how to analyze biological data
- DDBJ uses a web-based tool called the Nucleotide Sequence Submission System (NSSS) for sequence submissions.
- The NSSS replaced Sakura, which had been used for sequence submission since 1995, in November 2012.
- For submissions involving very long or numerous sequences, DDBJ recommends using its Mass Submission System (MSS).
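Because the INSDC members exchange records, accession numbers follow a shared format regardless of which database issued them. The sketch below checks only that format, assuming the commonly documented nucleotide patterns: one letter plus five digits, or two letters plus six or eight digits, with an optional version suffix. The function name is hypothetical, and a match says nothing about whether the record actually exists.

```python
import re

# Assumed nucleotide accession shapes: 1 letter + 5 digits, 2 letters + 6 digits,
# or 2 letters + 8 digits, optionally followed by a version such as ".1".
NUCLEOTIDE_ACCESSION = re.compile(
    r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$"
)


def looks_like_nucleotide_accession(identifier):
    """Return True if the string has the shape of an INSDC nucleotide accession."""
    return bool(NUCLEOTIDE_ACCESSION.match(identifier.strip().upper()))


# The identifiers below are used only to exercise the pattern.
for candidate in ["U49845", "U49845.1", "AB123456", "12345", "ABC12345"]:
    print(candidate, looks_like_nucleotide_accession(candidate))
```

A check like this is useful for catching typos before a lookup, but whole-genome shotgun and protein identifiers use other patterns that this sketch deliberately leaves out.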
Protein Data Bank (PDB)
- PDB (Protein Data Bank) is a global database that stores information about the structure of biological macromolecules.
- In the United States it is managed by the Research Collaboratory for Structural Bioinformatics (RCSB), a member of the Worldwide Protein Data Bank (wwPDB), which provides many services to help researchers access and analyze the structural data.
- PDB collects and archives 3D-atomic level structural models of macromolecules obtained through three commonly used experimental techniques:
- X-ray crystallography
- Nuclear magnetic resonance spectroscopy (NMR)
- Electron microscopy (3DEM)
- The database entries are mostly structures of proteins, but also include:
- Nucleic acids
- Carbohydrates
- Theoretical models (historical entries only; purely computational models are no longer accepted into the archive)
- In addition to the structural models, PDB archives:
- Experimental data
- Associated metadata
- Other details about the molecules
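To make the idea of an atomic-level structural model concrete, the sketch below reads atom coordinates from text in the legacy fixed-column PDB format and computes their centroid. It assumes that format's standard column positions for ATOM records; the archive also distributes PDBx/mmCIF files, which need a different parser. All names here are hypothetical, and the two sample atoms use invented coordinates, built by a small helper so that each field lands in the right column.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def format_atom_line(serial, name, residue, chain, residue_number, x, y, z):
    """Build a sample ATOM line with each field in its fixed columns (demo helper)."""
    return (f"ATOM  {serial:>5}  {name:<3} {residue:>3} {chain}"
            f"{residue_number:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line):
    """Slice one ATOM/HETATM line by column position (0-based Python slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        residue=line[17:20].strip(),
        chain=line[21],
        residue_number=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def parse_atoms(text):
    """Collect every coordinate record and ignore all other line types."""
    return [parse_atom_line(line) for line in text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


def centroid(atoms):
    """Average position of the given atoms."""
    n = len(atoms)
    return (sum(a.x for a in atoms) / n,
            sum(a.y for a in atoms) / n,
            sum(a.z for a in atoms) / n)


# Invented coordinates, for illustration only.
sample_text = "\n".join([
    format_atom_line(1, "N", "MET", "A", 1, 1.0, 2.0, 3.0),
    format_atom_line(2, "CA", "MET", "A", 1, 3.0, 4.0, 5.0),
    "END",
])
atoms = parse_atoms(sample_text)
print(atoms[0].name, atoms[1].name, centroid(atoms))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in this format can run together with no space between them.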
Gene Expression Omnibus (GEO)
- GEO (Gene Expression Omnibus) is a public database that stores high-throughput gene expression and functional genomics data.
- It was created in 2000 as a resource for gene expression studies and has since expanded to include other types of data such as genome methylation and chromatin structure.
- The database requires researchers to provide:
- Raw data
- Processed data
- Descriptive metadata
- The original submitter-supplied GEO records are of three types:
- Platform: Describes the array or sequencer used
- Sample: Describes the source and analysis of the sample
- Series: Links related Samples and describes a whole study
- GEO curators further organize these submitter-supplied records into two derived record types:
- DataSet: A curated collection of comparable Samples that share a common set of array elements
- Profile: Consists of expression measurements for a gene across all Samples in a DataSet
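The relationships among these record types can be sketched as a small data model. This is a simplification under stated assumptions: the class and function names are hypothetical, the accessions are invented placeholders that only imitate GEO's GPL, GSM, and GSE prefixes, and the real DataSets are assembled by GEO staff rather than by an automatic rule like the one below.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    accession: str      # GEO Platform accessions use the GPL prefix
    description: str    # the array or sequencer used


@dataclass
class Sample:
    accession: str      # GSM prefix
    platform: str       # accession of the Platform the Sample was measured on
    values: dict        # gene or array element -> expression measurement


@dataclass
class Series:
    accession: str      # GSE prefix
    title: str
    samples: list       # related Sample records that make up one study


def build_dataset(series):
    """Accept the Samples of a Series only if they all share one Platform."""
    platforms = {sample.platform for sample in series.samples}
    if len(platforms) != 1:
        raise ValueError("samples are not comparable: more than one platform")
    return list(series.samples)


def build_profile(gene, dataset):
    """Expression measurements for one gene across all Samples in a DataSet."""
    return {sample.accession: sample.values[gene] for sample in dataset}


# Invented records, for illustration only.
platform = Platform("GPL_demo", "hypothetical expression array")
series = Series("GSE_demo", "toy study", [
    Sample("GSM_a", platform.accession, {"geneA": 5.0, "geneB": 1.2}),
    Sample("GSM_b", platform.accession, {"geneA": 7.5, "geneB": 1.1}),
])
dataset = build_dataset(series)
profile = build_profile("geneA", dataset)
print(profile)
```

The split mirrors the text: Platform, Sample, and Series hold what the submitter supplied, while the DataSet and Profile are views derived from them.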
Applications of Primary Databases
- GenBank and EMBL are primary databases that serve as references for genome analysis and comparison.
- The primary database PDB is used to retrieve and examine experimentally determined structures of proteins and other macromolecules.
- Gene Expression Omnibus (GEO), a primary database, contains transcriptome data useful for analyzing differentially expressed genes and understanding gene expression.
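- KEGG, which provides information on metabolic and signaling pathways in various organisms, is often used alongside these resources, although it is a curated knowledge base and is usually classified as a secondary rather than a primary database.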
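The differential-expression use of GEO data mentioned above can be illustrated with the simplest comparison, a log2 fold change between two groups of samples. This is a minimal sketch with invented expression values and hypothetical gene names; the pseudocount of 1 is an assumption made to avoid dividing by zero, and a real analysis would add normalization and a statistical test.

```python
import math
from statistics import mean


def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 of the ratio of mean expression, treated over control."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))


# Invented expression values: gene -> (control samples, treated samples).
expression = {
    "geneA": ([7, 7, 7], [31, 31, 31]),
    "geneB": ([10, 12, 14], [10, 12, 14]),
}

fold_changes = {gene: log2_fold_change(control, treated)
                for gene, (control, treated) in expression.items()}

# Flag genes whose expression at least doubled or halved (|log2 FC| >= 1).
changed = [gene for gene, value in fold_changes.items() if abs(value) >= 1.0]
print(fold_changes, changed)
```

Working on the log2 scale makes a doubling and a halving symmetric (+1 and -1), which is why the threshold is applied to the absolute value.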