Bioinformatics: Sequence File Formats


In the field of bioinformatics there exist many different file formats that store DNA and protein sequence information. No single sequence format is ideal: different formats are used in different contexts, and files can often be converted from one format to another for easier access or sharing. Below is a list of file formats, each with a link to its specification or description, for anyone wishing to get to know the formats a little better. While many other formats are used by commercial software, this list focuses mainly on open, non-proprietary file formats.

  • GenBank - quite possibly the standard in sequence file formats, the GenBank format is widely used by public databases such as those hosted by NCBI. The GenBank file format is quite flexible and allows annotations, comments, and references to be included within the file. The file is plain text and thus can be read with a text editor. GenBank files often have the file extension '.gb' or '.genbank'.

    Genbank Sample Record

  • EMBL - similar in form to the GenBank format, the EMBL format is used by public databases such as those of the European Molecular Biology Laboratory. Like GenBank, the EMBL format is quite flexible and allows annotations, comments, and references to be included within the file. The file is plain text and thus can be read with a text editor. EMBL files often have the file extension '.embl'.

    EMBL Spec

  • ABI - ABI is a binary file format containing Sanger sequencing sequence and trace data. The format is used by sequencing facilities and requires a special reader to view the trace data and extract the sequence. The file format is difficult to parse given its binary nature and the complexity of the spec.

    ABI Spec (PDF)

  • PDB - the PDB file format stores sequence information and, more importantly, 3-dimensional structure information. This information can be used to visualize the crystal structure of a given molecule (typically a protein). PDB files are plain text, so they can be viewed with a text editor, and often have the file extension '.pdb'.

    PDB File Spec

  • MDL - While it does not technically contain sequence data, the MDL file format is worth including in this list. The MDL mol file describes small molecules, and its spec is quite similar to that of the PDB file format. The file contains information on 2D (and possibly 3D) molecule structure, such as atom type and atom connectivity.

    MDL Mol File

  • BAM/SAM - The BAM and SAM formats contain next-generation sequencing data. BAM is a binary file format, while SAM contains the same information but is text based. These files can be analyzed and viewed with several free software tools, such as the open source command line tool SAMtools and the graphical viewer IGV. Both formats store not only the sequence data for next-generation sequencing reads, but can also store the alignment of those reads to a reference sequence.

    SAMtools spec

  • SFF - The SFF file format specifies a binary file which contains next-generation sequence information. The name stands for Standard Flowgram Format, and the file contains the actual flow information produced by several next-generation DNA sequencers, including Ion Torrent and Roche's '454'.

    Standard Flowgram Format Spec
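
Because SAM is text based, a single alignment record can be read with a few lines of code. The sketch below is a minimal illustration, not a replacement for SAMtools: it assumes a tab-separated, non-header line with the eleven mandatory SAM fields, followed by any optional tags. The names SamRecord, parse_sam_line, and is_unmapped are hypothetical and chosen only for this example, as is the sample read.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The eleven mandatory SAM fields, plus any optional tags."""
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string describing the alignment
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII encoded)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not an '@' header line) of a SAM file."""
    if line.startswith("@"):
        raise ValueError("header lines are not alignment records")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=tuple(fields[11:]),
    )


def is_unmapped(record: SamRecord) -> bool:
    """Bit 0x4 of the flag marks a read that did not align."""
    return bool(record.flag & 0x4)


# A made-up record for illustration only.
example = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(example)
print(record.qname, record.rname, record.pos, record.cigar)
```

The same record stored in a BAM file cannot be read this way, since BAM is compressed binary; a tool such as SAMtools is needed to convert it to SAM text first.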




© 2008-2022 Greg Cope