bio-parsers

A library of parsers for interconverting between genbank, fasta, and (eventually) sbol through Teselagen's intermediary json format

Usage (no npm install needed):

<script type="module">
  import bioParsers from 'https://cdn.skypack.dev/bio-parsers';
</script>

Bio Parsers

About this Repo

This repo contains a set of parsers to convert between datatypes through a generalized JSON format.

CHANGELOG

Exported Functions

Use the following exports to convert to a generalized JSON format:

fastaToJson //handles fasta files (.fa, .fasta)
genbankToJson //handles genbank files (.gb, .gbk)
ab1ToJson //handles .ab1 sequencing read files 
sbolXmlToJson //handles .sbol files
snapgeneToJson //handles snapgene (.dna) files
anyToJson    //this handles any of the above file types based on file extension
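
As a rough illustration of extension-based dispatch, the sketch below maps a file name to the name of the export that would handle it. `parserNameForFile` and its lookup table are hypothetical and built only from the extensions listed above; this is not how anyToJson is implemented internally.

```javascript
// Hypothetical lookup table, built from the extensions listed above.
// Illustration only, not bio-parsers' internal logic.
const PARSER_BY_EXTENSION = new Map([
  ["fa", "fastaToJson"],
  ["fasta", "fastaToJson"],
  ["gb", "genbankToJson"],
  ["gbk", "genbankToJson"],
  ["ab1", "ab1ToJson"],
  ["sbol", "sbolXmlToJson"],
  ["dna", "snapgeneToJson"],
]);

// Returns the name of the export that would handle fileName,
// or undefined when there is no extension or it is not in the table.
function parserNameForFile(fileName) {
  const dot = fileName.lastIndexOf(".");
  if (dot === -1) return undefined;
  const extension = fileName.slice(dot + 1).toLowerCase();
  return PARSER_BY_EXTENSION.get(extension);
}

console.log(parserNameForFile("pBad.ab1")); // ab1ToJson
console.log(parserNameForFile("pCherry.fasta")); // fastaToJson
```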

Use the following exports to convert from a generalized JSON format back to a specific format:

jsonToGenbank
jsonToFasta
jsonToBed

Format Specification

The generalized JSON format looks like:

const generalizedJsonFormat = {
    "size": 60,
    "sequence": "asaasdgasdgasdgasdgasgdasgdasdgasdgasgdagasdgasdfasdfdfasdfa",
    "circular": true,
    "name": "pBbS8c-RFP",
    "description": "",
    "chromatogramData": { //only if parsing in an ab1 file
      "aTrace": [], //same as cTrace but for a
      "tTrace": [], //same as cTrace but for t
      "gTrace": [], //same as cTrace but for g
      "cTrace": [0,0,0,1,3,5,11,24,56,68,54,30,21,3,1,4,1,0,0, ...etc], //heights of the curve, one value per x position (i.e. if cTrace.length === 1000, the largest possible basePos is 1000)
      "basePos": [33, 46, 55, ...etc], //x position of the bases (can be unevenly spaced)
      "baseCalls": ["A", "T", ...etc],
      "qualNums": [], //or undefined if no qualNums are detected on the file
    },
    "features": [
        {
            "name": "anonymous feature",
            "type": "misc_feature",
            "id": "5590c1978979df000a4f02c7", //Must be a unique id. If no id is provided, we'll autogenerate one for you
            "start": 1,
            "end": 3,
            "strand": 1,
            "notes": {},
        },
        {
            "name": "coding region 1",
            "type": "CDS",
            "id": "5590c1d88979df000a4f02f5",
            "start": 12,
            "end": 9,
            "strand": -1,
            "notes": {},
        }
    ],
}
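
The second feature above has a start greater than its end. A minimal sketch of computing feature lengths follows, assuming 0-based inclusive coordinates and assuming that start > end means the feature wraps around the origin of a circular sequence. `featureLength` is a hypothetical helper, not a bio-parsers export.

```javascript
// Hypothetical helper (not a bio-parsers export).
// Assumes 0-based inclusive start/end, and that start > end means the
// feature wraps around the origin of a circular sequence.
function featureLength(feature, sequenceLength) {
  const { start, end } = feature;
  if (start <= end) return end - start + 1;
  // Wrapping feature: bases from start to the last index, plus 0..end.
  return sequenceLength - start + end + 1;
}

const sequenceLength = 60;
console.log(featureLength({ start: 1, end: 3 }, sequenceLength)); // 3
console.log(featureLength({ start: 12, end: 9 }, sequenceLength)); // 58
```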

Usage

install

npm install -S bio-parsers

or

yarn add bio-parsers

or

use it from a script tag:

<script src="https://unpkg.com/bio-parsers/umd/bio-parsers.js"></script>
<script>
      async function main() {
        var jsonOutput = await window.bioParsers.genbankToJson(
          `LOCUS       kc2         108 bp    DNA     linear    01-NOV-2016
COMMENT             teselagen_unique_id: 581929a7bc6d3e00ac7394e8
FEATURES             Location/Qualifiers
     CDS             1..108
                     /label="GFPuv"
     misc_feature    61..108
                     /label="gly_ser_linker"
     bogus_dude      4..60
                     /label="ccmN_sig_pep"
     misc_feature    4..60
                     /label="ccmN_nterm_sig_pep"
                     /pragma="Teselagen_Part"
                     /preferred5PrimeOverhangs=""
                     /preferred3PrimeOverhangs=""
ORIGIN      
        1 atgaaggtct acggcaagga acagtttttg cggatgcgcc agagcatgtt ccccgatcgc
       61 ggtggcagtg gtagcgggag ctcgggtggc tcaggctctg ggg
//`
        );
        console.log('jsonOutput:', jsonOutput);
        var genbankString = window.bioParsers.jsonToGenbank(jsonOutput[0].parsedSequence);
        console.log(genbankString);
      }
      main();
</script>

See the ./umd_demo.html file for a full working example.

jsonToGenbank (same interface as jsonToFasta)

//To go from json to genbank:
import { jsonToGenbank } from "bio-parsers"
//You can pass an optional options object as the second argument. Here are the defaults
const options = {
  isProtein: false, //by default the sequence will be parsed and validated as type DNA (unless U's instead of T's are found). If isProtein=true the sequence will be parsed and validated as a PROTEIN type (seqData.isProtein === true)
  guessIfProtein: false, //if true the parser will attempt to guess if the sequence is of type DNA or type PROTEIN (this will override the isProtein flag)
  guessIfProteinOptions: {
    threshold: 0.90, //fraction of characters that must be DNA letters for the sequence to be considered of type DNA
    dnaLetters: ['G', 'A', 'T', 'C'] //customizable set of letters to use as DNA
  },
  inclusive1BasedStart: false, //by default feature starts are parsed out as 0-based and inclusive
  inclusive1BasedEnd: false //by default feature ends are parsed out as 0-based and inclusive
  // Example:
  // 0123456
  // ATGAGAG
  // --fff--  (the feature covers GAG)
  // 0-based inclusive start:
  // feature.start = 2
  // 1-based inclusive start:
  // feature.start = 3
  // 0-based inclusive end:
  // feature.end = 4
  // 1-based inclusive end:
  // feature.end = 5
} 
const genbankString = jsonToGenbank(generalizedJsonFormat, options)
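
The coordinate example in the comments above can be checked with a few lines of plain JavaScript. The helpers below are hypothetical (not bio-parsers exports) and only demonstrate the arithmetic between 0-based inclusive and 1-based inclusive positions.

```javascript
// Hypothetical helpers, not bio-parsers exports.
// Shift a feature from 0-based inclusive to 1-based inclusive coordinates.
function toOneBasedInclusive(feature) {
  return { ...feature, start: feature.start + 1, end: feature.end + 1 };
}

// Extract the bases covered by a feature given 0-based inclusive coordinates.
function basesUnderFeature(sequence, feature) {
  return sequence.slice(feature.start, feature.end + 1);
}

const sequence = "ATGAGAG";
const feature = { name: "fff", start: 2, end: 4 };

console.log(basesUnderFeature(sequence, feature)); // GAG
console.log(toOneBasedInclusive(feature)); // { name: 'fff', start: 3, end: 5 }
```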

anyToJson (same interface as genbankToJson, fastaToJson, xxxxToJson) (async required)

import { anyToJson } from "bio-parsers"

//note, anyToJson should be called using an await to allow for file parsing to occur (if a file is being passed)
const results = await anyToJson(
  stringOrFile, //if ab1 files are being passed in you should pass files only, otherwise strings or files are fine as inputs
  options //options.fileName (eg "pBad.ab1" or "pCherry.fasta") is important to pass here so that the parser can pick the file type from the file extension
) 

//we always return an array of results because some files may contain multiple sequences
results[0].success //either true or false
results[0].messages //an array of strings giving any warnings or errors generated during the parsing process
results[0].parsedSequence //this will be the generalized json format as specified above :)
//chromatogram data will be here (ab1 only): 
results[0].parsedSequence.chromatogramData 
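
Because the result is always an array, callers usually loop over it. Below is a minimal sketch of separating successful parses from failures; `summarizeResults` is a hypothetical helper and the input is mock data shaped like the fields described above, not real parser output.

```javascript
// Hypothetical helper, not a bio-parsers export.
// Splits a results array into parsed sequences and failure messages.
function summarizeResults(results) {
  const sequences = [];
  const failures = [];
  for (const result of results) {
    if (result.success) {
      sequences.push(result.parsedSequence);
    } else {
      failures.push(result.messages || []);
    }
  }
  return { sequences, failures };
}

// Mock data shaped like the results described above (not real parser output).
const mockResults = [
  { success: true, messages: [], parsedSequence: { name: "seqA", sequence: "atgc" } },
  { success: false, messages: ["made-up error message"] },
];
const summary = summarizeResults(mockResults);
console.log(summary.sequences.map((s) => s.name)); // [ 'seqA' ]
console.log(summary.failures); // [ [ 'made-up error message' ] ]
```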

Options (for anyToJson or xxxxToJson)

//You can pass an optional options object as the second argument. Here are the defaults
const options = {
  fileName: "example.gb", //the filename is used if none is found in the genbank
  isProtein: false, //if you know that it is a protein string being parsed you can pass true here
  parseFastaAsCircular: false, //by default fasta files are parsed as linear sequences. You can change this by setting parseFastaAsCircular=true
  //genbankToJson options only
  inclusive1BasedStart: false, //by default feature starts are parsed out as 0-based and inclusive
  inclusive1BasedEnd: false, //by default feature ends are parsed out as 0-based and inclusive
  acceptParts: true //by default features with a feature.notes.pragma[0] === "Teselagen_Part" are added to the sequenceData.parts array. Setting this to false will keep them as features instead
}
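
The acceptParts rule can be pictured as a simple filter. The sketch below applies the condition stated in the comment above (notes.pragma[0] === "Teselagen_Part") to mock features; the function names are hypothetical and this is an illustration of the rule, not the library's implementation.

```javascript
// Illustration only; not the library's implementation.
// A feature whose notes.pragma[0] is "Teselagen_Part" is treated as a part.
function isTeselagenPart(feature) {
  const pragma = feature.notes && feature.notes.pragma;
  return Array.isArray(pragma) && pragma[0] === "Teselagen_Part";
}

// Splits a feature list into parts and remaining features.
function splitPartsFromFeatures(features) {
  return {
    parts: features.filter(isTeselagenPart),
    features: features.filter((f) => !isTeselagenPart(f)),
  };
}

// Mock features modeled on the GenBank example earlier in this README.
const mockFeatures = [
  { name: "GFPuv", type: "CDS", notes: {} },
  { name: "ccmN_nterm_sig_pep", type: "misc_feature", notes: { pragma: ["Teselagen_Part"] } },
];
const split = splitPartsFromFeatures(mockFeatures);
console.log(split.parts.map((f) => f.name)); // [ 'ccmN_nterm_sig_pep' ]
console.log(split.features.map((f) => f.name)); // [ 'GFPuv' ]
```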

ab1ToJson

import { ab1ToJson } from "bio-parsers"
const results = await ab1ToJson(
  //this can be either a browser file  <input type="file" id="input" multiple onchange="ab1ToJson(this.files[0])">
  // or a node file ab1ToJson(fs.readFileSync(path.join(__dirname, './testData/ab1/example1.ab1')));
  file, 
  options //options.fileName (eg "pBad.ab1") is important to pass here
)

//we always return an array of results because some files may contain multiple sequences
results[0].success //either true or false
results[0].messages //an array of strings giving any warnings or errors generated during the parsing process
results[0].parsedSequence //this will be the generalized json format as specified above :)
//chromatogram data will be here (ab1 only): 
results[0].parsedSequence.chromatogramData 
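
To show how the chromatogramData arrays relate to each other, the sketch below looks up the trace height under each called base. `calledBaseHeights` is a hypothetical helper, the data is made up, and it assumes each basePos value can be used directly as an index into the trace arrays.

```javascript
// Hypothetical helper, not a bio-parsers export.
// Assumes each basePos value can be used directly as an index into the traces.
function calledBaseHeights(chromatogramData) {
  const { aTrace, cTrace, gTrace, tTrace, basePos, baseCalls } = chromatogramData;
  const traceByBase = { A: aTrace, C: cTrace, G: gTrace, T: tTrace };
  return baseCalls.map((base, i) => {
    const trace = traceByBase[base];
    const x = basePos[i];
    // Bases without a matching trace (e.g. "N") get an undefined height.
    return { base, x, height: trace ? trace[x] : undefined };
  });
}

// Tiny made-up chromatogram (real traces have many more points).
const mockChromatogram = {
  aTrace: [0, 5, 9, 4, 0, 0, 0, 0],
  cTrace: [0, 0, 0, 0, 0, 0, 0, 0],
  gTrace: [0, 0, 0, 0, 0, 0, 0, 0],
  tTrace: [0, 0, 0, 0, 2, 8, 3, 0],
  basePos: [2, 5],
  baseCalls: ["A", "T"],
};
console.log(JSON.stringify(calledBaseHeights(mockChromatogram)));
// [{"base":"A","x":2,"height":9},{"base":"T","x":5,"height":8}]
```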

snapgeneToJson (.dna files)

import { snapgeneToJson } from "bio-parsers"
//file can be either a browser file  <input type="file" id="input" multiple onchange="snapgeneToJson(this.files[0])">
// or a node file snapgeneToJson(fs.readFileSync(path.join(__dirname, './path/to/example.dna'))); (placeholder path)
const results = await snapgeneToJson(file,options)

genbankToJson

import { genbankToJson } from "bio-parsers"

const result = genbankToJson(string, options)

console.info(result)
// [
//     {
//         "messages": [
//             "Import Error: Illegal character(s) detected and removed from sequence. Allowed characters are: atgcyrswkmbvdhn",
//             "Invalid feature end:  1384 detected for Homo sapiens and set to 1",
//         ],
//         "success": true,
//         "parsedSequence": {
//             "features": [
//                 {
//                     "notes": {
//                         "organism": [
//                             "Homo sapiens"
//                         ],
//                         "db_xref": [
//                             "taxon:9606"
//                         ],
//                         "chromosome": [
//                             "17"
//                         ],
//                         "map": [
//                             "17q21"
//                         ]
//                     },
//                     "type": "source",
//                     "strand": 1,
//                     "name": "Homo sapiens",
//                     "start": 0,
//                     "end": 1
//                 }
//             ],
//             "name": "NP_003623",
//             "sequence": "gagaggggggttatccccccttcgtcagtcgatcgtaacgtatcagcagcgcgcgagattttctggcgcagtcag",
//             "circular": true,
//             "extraLines": [
//                 "DEFINITION  contactin-associated protein 1 precursor [Homo sapiens].",
//                 "ACCESSION   NP_003623",
//                 "VERSION     NP_003623.1  GI:4505463",
//                 "DBSOURCE    REFSEQ: accession NM_003632.2",
//                 "KEYWORDS    RefSeq."
//             ],
//             "type": "DNA",
//             "size": 925
//         }
//     }
// ]
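
In the output above, every value under notes is an array. A small sketch of reading a single note value follows; `firstNote` is a hypothetical helper, not a bio-parsers export, and the feature object is copied from the example output.

```javascript
// Hypothetical helper, not a bio-parsers export.
// Note values are arrays (see "organism" above), so return the first entry.
function firstNote(feature, key) {
  const values = feature.notes ? feature.notes[key] : undefined;
  return Array.isArray(values) ? values[0] : undefined;
}

// Feature copied from the example output above.
const sourceFeature = {
  notes: { organism: ["Homo sapiens"], db_xref: ["taxon:9606"] },
  type: "source",
  strand: 1,
  name: "Homo sapiens",
  start: 0,
  end: 1,
};
console.log(firstNote(sourceFeature, "db_xref")); // taxon:9606
console.log(firstNote(sourceFeature, "gene")); // undefined
```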

You can see more examples by looking at the tests.

Editing This Repo

All collaborators:

Edit or create a file and update or add any relevant tests. Make sure they pass by running yarn test.

Debug

yarn test-debug

Updating this repo

Teselagen collaborators

Commit and push all changes. Sign into npm using the teselagen npm account (check with npm whoami).

npm version patch|minor|major
npm publish

Outside collaborators

fork and pull request please :)

Thanks/Collaborators