RCGP
DRAFT
Perhaps a toolchain for sequencing the DNA of wildlife.
Hardware
Open source hardware, where feasible.
PCR
PCR, perhaps open source hardware.
- NinjaPCR
- OpenPCR
Sequencer
No available open source hardware?
Software
Relevant Software in Debian. A selection of software that perhaps may be used.
dialign
DIALIGN2 is a command line tool to perform multiple alignment of protein or DNA sequences. It constructs alignments from gap-free pairs of similar segments of the sequences. This scoring scheme for alignments is the basic difference between DIALIGN and other global or local alignment methods. Note that DIALIGN does not employ any kind of gap penalty.
dialign-tx
DIALIGN-TX is a command line tool to perform multiple alignment of protein or DNA sequences. It is a complete reimplementation of the segment-based approach including several new improvements and heuristics that significantly enhance the quality of the output alignments compared to DIALIGN 2.2 and DIALIGN-T. For pairwise alignment, DIALIGN-TX uses a fragment-chaining algorithm that favours chains of low-scoring local alignments over isolated high-scoring fragments. For multiple alignment, DIALIGN-TX uses an improved greedy procedure that is less sensitive to spurious local sequence similarities.
diamond-aligner
DIAMOND is a sequence aligner for protein and translated DNA searches and functions as a drop-in replacement for the NCBI BLAST software tools. It is suitable for protein-protein search as well as DNA-protein search on short reads and longer sequences including contigs and assemblies, with a speedup over BLAST of up to 20,000x.
https://github.com/bbuchfink/diamond
dotter
Dotter is a graphical dot-matrix program for detailed comparison of two sequences.
- Every residue in one sequence is compared to every residue in the other, and a matrix of scores is calculated.
- One sequence is plotted on the x-axis and the other on the y-axis.
- Noise is filtered out so that alignments appear as diagonal lines.
- Pairwise scores are averaged over a sliding window to make the score matrix more intelligible.
- The averaged score matrix forms a three-dimensional landscape, with the two sequences in two dimensions and the height of the peaks in the third. This landscape is projected onto two dimensions using a grey-scale image - the darker grey of a peak, the higher the score is.
- The contrast and threshold of the grey-scale image can be adjusted interactively, without having to recalculate the score matrix.
- An Alignment Tool is provided to examine the sequence alignment that the grey-scale image represents.
- Known high-scoring pairs can be loaded from a GFF file and overlaid onto the plot.
- Gene models can be loaded from GFF and displayed alongside the relevant axis.
- Compare a sequence against itself to find internal repeats.
- Find overlaps between multiple sequences by making a dot-plot of all sequences versus themselves.
- Run Dotter in batch mode to create large, time-consuming dot-plots as a background process.
https://www.sanger.ac.uk/science/tools/seqtools
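The dot-matrix idea described above (compare every residue pair, average over a sliding window, filter noise) can be sketched in a few lines of Python. This is only an illustration of the general technique, not Dotter's implementation: the identity scoring, the `window` and `threshold` parameters, and the function names are assumptions made for the example.

```python
def dot_matrix(seq_x, seq_y, window=3, threshold=0.8):
    """Windowed identity scores in [0, 1]; cells below threshold are zeroed."""
    half = window // 2
    rows, cols = len(seq_y), len(seq_x)
    # Raw residue-versus-residue scores: 1 for identity, 0 otherwise.
    raw = [[1 if seq_y[i] == seq_x[j] else 0 for j in range(cols)]
           for i in range(rows)]
    smoothed = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            total = 0
            count = 0
            # Average along the diagonal through (i, j), clipped at the edges.
            for k in range(-half, half + 1):
                ii, jj = i + k, j + k
                if 0 <= ii < rows and 0 <= jj < cols:
                    total += raw[ii][jj]
                    count += 1
            score = total / count
            # Noise filter: keep only cells at or above the threshold.
            smoothed[i][j] = score if score >= threshold else 0.0
    return smoothed


def render(matrix):
    """Text rendering: '#' for a retained score, '.' for filtered noise."""
    return ["".join("#" if value > 0 else "." for value in row)
            for row in matrix]


if __name__ == "__main__":
    # Comparing a sequence against itself shows the main diagonal.
    for line in render(dot_matrix("ACGT", "ACGT")):
        print(line)
```

A real dot-plot tool would use a substitution matrix for proteins and a grey-scale image instead of two text symbols; the sketch keeps only the windowing and thresholding steps.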
drop-seq-tools
This software provides core computational analysis of Drop-seq data and shows how to transform raw sequence data into an expression measurement for each gene in each individual cell.
https://github.com/broadinstitute/Drop-seq/
fastdnaml
fastDNAml is a program derived from Joseph Felsenstein's version 3.3 DNAML (part of his PHYLIP package). Users should consult the documentation for DNAML before using this program.
fastDNAml is an attempt to solve the same problem as DNAML, but to do so faster and using less memory, so that larger trees and/or more bootstrap replicates become tractable. Much of fastDNAml is merely a recoding of the PHYLIP 3.3 DNAML program from PASCAL to C.
Note that the homepage of this program is not available any more and so this program will probably not see any further updates.
ftp://ftp.bio.indiana.edu/molbio/evolve/fastdnaml/fastDNAml.html
fastlink
Genetic linkage analysis is a statistical technique used to map genes and find the approximate location of disease genes. There was a standard software package for genetic linkage called LINKAGE. FASTLINK is a significantly modified and improved version of the main programs of LINKAGE that runs much faster sequentially, can run in parallel, allows the user to recover gracefully from a computer crash, and provides abundant new documentation. FASTLINK has been used in over 1000 published genetic linkage studies.
This package contains the following programs:
- ilink: GEMINI optimization procedure to find a locally optimal value of the theta vector of recombination fractions
- linkmap: calculates location scores of one locus against a fixed map of other loci
- lodscore: compares likelihoods at locally optimal theta
- mlink: calculates lod scores and risk with two or more loci
- unknown: identifies possible genotypes for unknowns
https://www.ncbi.nlm.nih.gov/CBBResearch/Schaffer/fastlink.html
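To make the terms "lod score" and "theta" concrete, here is a minimal sketch of the textbook two-point case, assuming phase-known meioses that can be counted directly as recombinant or non-recombinant. FASTLINK works on general pedigrees and is far more involved; the function names and the grid search below are illustrative assumptions, not part of the package.

```python
import math


def lod_score(recombinants, meioses, theta):
    """LOD = log10( L(theta) / L(0.5) ) for phase-known meioses."""
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    # log10 likelihood at the tested recombination fraction
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1 - theta))
    # log10 likelihood under no linkage (theta = 0.5)
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


def best_theta(recombinants, meioses):
    """Scan theta = 0.01 .. 0.50 and return the value with the highest LOD."""
    grid = [i / 100 for i in range(1, 51)]
    return max(grid, key=lambda theta: lod_score(recombinants, meioses, theta))


if __name__ == "__main__":
    # 1 recombinant in 10 informative meioses
    theta = best_theta(1, 10)
    print(theta, round(lod_score(1, 10, theta), 3))
```

By convention a LOD score of 3 or more is usually taken as evidence for linkage, so ten meioses alone would not be enough in this example.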
mafft
MAFFT is a multiple sequence alignment program which offers three accuracy-oriented methods:
- L-INS-i (probably most accurate; recommended for <200 sequences; iterative refinement method incorporating local pairwise alignment information),
- G-INS-i (suitable for sequences of similar lengths; recommended for <200 sequences; iterative refinement method incorporating global pairwise alignment information),
- E-INS-i (suitable for sequences containing large unalignable regions; recommended for <200 sequences),

and five speed-oriented methods:
- FFT-NS-i (iterative refinement method; two cycles only),
- FFT-NS-i (iterative refinement method; max. 1000 iterations),
- FFT-NS-2 (fast; progressive method),
- FFT-NS-1 (very fast; recommended for >2000 sequences; progressive method with a rough guide tree),
- NW-NS-PartTree-1 (recommended for ∼50,000 sequences; progressive method with the PartTree algorithm).
https://mafft.cbrc.jp/alignment/software/
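As a rough guide to picking a method, the sketch below maps some of the method names above to the `mafft` command-line options given in the MAFFT manual and chooses one from the sequence count. The selection rule in `choose_method` is an assumption that simply mirrors the "<200" and ">2000" hints in the list; the function names are hypothetical, and `mafft` must be installed for `run_mafft` to work.

```python
import subprocess

# Options as documented in the MAFFT manual for each named strategy.
MAFFT_FLAGS = {
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--genafpair", "--maxiterate", "1000"],
    "FFT-NS-2": ["--retree", "2", "--maxiterate", "0"],
    "FFT-NS-1": ["--retree", "1", "--maxiterate", "0"],
}


def choose_method(n_sequences):
    """Hypothetical rule of thumb based on the size hints in the list above."""
    if n_sequences < 200:
        return "L-INS-i"
    if n_sequences > 2000:
        return "FFT-NS-1"
    return "FFT-NS-2"


def mafft_command(method, fasta_path):
    """Build the argument list for one of the named strategies."""
    return ["mafft", *MAFFT_FLAGS[method], fasta_path]


def run_mafft(method, fasta_path):
    """Run mafft and return the alignment, which it writes to stdout."""
    result = subprocess.run(mafft_command(method, fasta_path),
                            check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    print(" ".join(mafft_command(choose_method(50), "input.fasta")))
```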
malt
MALT, an acronym for MEGAN alignment tool, is a sequence alignment and analysis tool designed for processing high-throughput sequencing data, especially in the context of metagenomics. It is an extension of MEGAN6, the MEtaGenome ANalyzer, and is designed to provide the input for MEGAN6, but can also be used independently of MEGAN6.
The core of the program is a sequence alignment engine that aligns DNA or protein sequences to a DNA or protein reference database in either BLASTN (DNA queries and DNA references), BLASTX (DNA queries and protein references) or BLASTP (protein queries and protein references) mode. The engine uses a banded-alignment algorithm with affine gap scores and BLOSUM substitution matrices (in the case of protein alignments). The program can compute either local alignments (Smith-Waterman) or semi-global alignments (in which reads are aligned end-to-end into reference sequences), the latter being more appropriate for aligning metagenomic reads to references.
https://github.com/danielhuson/malt
mauve-aligner
Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences.
Mauve has been developed with the idea that a multiple genome aligner should require only modest computational resources. It employs algorithmic techniques that scale well in the amount of sequence being aligned. For example, a pair of Y. pestis genomes can be aligned in under a minute, while a group of 9 divergent Enterobacterial genomes can be aligned in a few hours.
Mauve computes and interactively visualizes genome sequence comparisons. Using FastA or GenBank sequence data, Mauve constructs multiple genome alignments that identify large-scale rearrangement, gene gain, gene loss, indels, and nucleotide substitution.
Mauve is developed at the University of Wisconsin.
mcaller
This program is designed to call m6A from nanopore data using the differences between measured and expected currents.
https://github.com/al-mcintyre/mCaller
mipe
MIPE provides a standard flat text file format for the exchange and/or storage of all information associated with PCR experiments. This will:
- allow for exchange of PCR data between researchers/laboratories
- enable traceability of the data
- prevent problems when submitting data to dbSTS or dbSNP
- enable the writing of standard scripts to extract data (e.g. a list of PCR primers, SNP positions or haplotypes for different animals)
Although this tool can be used for data storage, its primary focus should be data exchange. For larger repositories, relational databases are more appropriate for storage of these data. The MIPE format could then be used as a standard format to import into and/or export from these databases.
mira-assembler
The mira genome fragment assembler is a specialised assembler for sequencing projects classified as 'hard' due to a high number of similar repeats. For expressed sequence tag (EST) transcripts, miraEST specialises in reconstructing pristine mRNA transcripts while detecting and classifying single nucleotide polymorphisms (SNPs) occurring in different variations thereof.
The assembler is routinely used for such various tasks as mutation detection in different cell types, similarity analysis of transcripts between organisms, and pristine assembly of sequences from various sources for oligo design in clinical microarray experiments.
The package provides the following executables:
- mira: assembles genome sequences.
- miramem: estimates the memory needed to assemble projects.
- mirabait: a grep-like tool to select reads with kmers up to 256 bases.
- miraconvert: a tool to convert, extract and sometimes recalculate all kinds of data related to sequence assembly files.
https://sourceforge.net/p/mira-assembler/wiki/Home/
mummer
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. For example, MUMmer 3.0 can find all 20-basepair or longer exact matches between a pair of 5-megabase genomes in 13.7 seconds, using 78 MB of memory, on a 2.4 GHz Linux desktop computer. MUMmer can also align incomplete genomes; it handles the 100s or 1000s of contigs from a shotgun sequencing project with ease, and will align them to another set of contigs or a genome using the NUCmer program included with the system. If the species are too divergent for DNA sequence alignment to detect similarity, then the PROmer program can generate alignments based upon the six-frame translations of both input sequences.
https://mummer.sourceforge.net/
murasaki
Murasaki is a scalable and fast, language theory-based homology detection tool across multiple large genomes. It enables whole-genome scale multiple genome global alignments. It supports unlimited length gapped-seed patterns and unique TF-IDF based filtering.
Murasaki is anchor alignment software with the following properties:
- extremely fast (17 CPU hours for whole Human x Mouse genome (with 40 nodes: 52 wall minutes))
- scalable (Arbitrarily parallelizable across multiple nodes using MPI. Even a single node with 16GB of ram can handle over 1Gbp of sequence.)
- unlimited pattern length
- repeat tolerant
- intelligent noise reduction
http://murasaki.dna.bio.keio.ac.jp/wiki/
ncbi-acc-download
This package provides a script to download sequences from GenBank/RefSeq by accession through the NCBI ENTREZ API.
https://github.com/kblin/ncbi-acc-download
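For orientation, the sketch below shows what a download by accession through the NCBI E-utilities `efetch` endpoint looks like. It is not taken from ncbi-acc-download itself: the function names and the placeholder accession `EXAMPLE_ACC` are hypothetical, and the defaults (`nuccore`, FASTA output) are assumptions chosen for the example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Public NCBI E-utilities endpoint for fetching records by identifier.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build the efetch request URL for one accession."""
    query = urlencode({
        "db": db,             # Entrez database, e.g. nuccore or protein
        "id": accession,      # accession to fetch
        "rettype": rettype,   # record format, e.g. fasta or gb
        "retmode": "text",    # plain text rather than XML
    })
    return f"{EFETCH}?{query}"


def download(accession, path):
    """Fetch one record over the network and write it to `path`."""
    with urlopen(efetch_url(accession)) as response, open(path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    # EXAMPLE_ACC is a placeholder, not a real accession.
    print(efetch_url("EXAMPLE_ACC"))
```

NCBI asks clients to limit their request rate, so a script that fetches many accessions should pause between calls.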
ncbi-blast+
The Basic Local Alignment Search Tool (BLAST) is the most widely used sequence similarity tool. There are versions of BLAST that compare protein queries to protein databases, nucleotide queries to nucleotide databases, as well as versions that translate nucleotide queries or databases in all six frames and compare to protein databases or queries. PSI-BLAST produces a position-specific-scoring-matrix (PSSM) starting with a protein query, and then uses that PSSM to perform further searches. It is also possible to compare a protein or nucleotide query to a database of PSSMs. The NCBI supports a BLAST web page at blast.ncbi.nlm.nih.gov as well as a network service.
https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/
patman
Patman searches for short patterns in large DNA databases, allowing for approximate matches. It is optimized for searching for many small patterns at the same time, for example microarray probes.
https://bioinf.eva.mpg.de/patman/
pbdagcon
pbdagcon is a tool that implements DAGCon (Directed Acyclic Graph Consensus) which is a sequence consensus algorithm based on using directed acyclic graphs to encode multiple sequence alignment.
It uses the alignment information from blasr to align sequence reads to a "backbone" sequence. Based on the underlying alignment directed acyclic graph (DAG), it will be able to use the new information from the reads to find the discrepancies between the reads and the "backbone" sequences. A dynamic programming process is then applied to the DAG to find the optimum sequence of bases as the consensus. The new consensus can be used as a new backbone sequence to iteratively improve the consensus quality.
While the code is developed for processing PacBio(TM) raw sequence data, the algorithm can be used for general consensus purpose. Currently, it only takes FASTA input. For shorter read sequences, one might need to adjust the blasr alignment parameters to get the alignment string properly.
The code and the underlying graphical data structure have been used for some algorithm development prototyping including phasing reads and pre-assembly.
https://github.com/PacificBiosciences/pbdagcon
probabel
The ProbABEL package is part of the GenABEL project for analysis of genome-wide data. ProbABEL is used to run GWAS. Using files in filevector/DatABEL format even allows for running GWAS on computers with only a few GB of RAM.
probalign
Probalign uses partition function posterior probability estimates to compute maximum expected accuracy multiple sequence alignments. It performs statistically significantly better than the leading alignment programs Probcons v1.1, MAFFT v5.851, and MUSCLE v3.6 on BAliBASE 3.0, HOMSTRAD, and OXBENCH benchmarks. Probalign improvements are largest on datasets containing N/C terminal extensions and on datasets with long and heterogeneous length sequences. On heterogeneous length datasets containing repeats, Probalign alignment accuracy is 10% and 15% higher than the other three methods when standard deviation of length is at least 300 and 400, respectively.
https://web.njit.edu/~usman/probalign/
pyensembl
The Ensembl genome database is an established reference for genomic sequences and their automated annotation. Having this data locally has advantages for bulk analyses, e.g. for the mapping of reads from RNA-seq against the latest golden path, or against a previous one to compare analyses.
This package provides a reproducible way to install this data and thus simplifies the automation of the respective workflows.
https://github.com/openvax/pyensembl
pyfastx
pyfastx is a lightweight Python C extension that gives users random access to sequences in plain and gzipped FASTA/Q files. This module aims to provide simple APIs for users to extract sequences from FASTA and reads from FASTQ by identifier and index number. pyfastx builds indexes stored in a sqlite3 database file for random access, to avoid consuming an excessive amount of memory. In addition, pyfastx can parse standard (sequence is spread into multiple lines with same length) and nonstandard (sequence is spread into one or more lines with different length) FASTA format.
It features:
- a single file for the Python extension;
- lightweight, memory efficient FASTA/Q file parsing;
- fast random access to sequences from gzipped FASTA/Q file;
- sequences reading from FASTA file line by line;
- N50 and L50 calculation of sequences in FASTA file;
- GC content and nucleotides composition calculation;
- reverse, complement and antisense sequences extraction;
- excellent compatibility: support for parsing nonstandard FASTA file;
- support for FASTQ quality score conversion;
- a command line interface for splitting FASTA/Q file.
This package provides the command line interface.
https://github.com/lmdu/pyfastx.git
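Several of the listed features are simple statistics that are easy to state exactly. The sketch below shows N50/L50, GC content and reverse complement in plain Python on in-memory strings. It deliberately does not use the pyfastx API, and the function names are hypothetical; it only illustrates what the numbers mean.

```python
def n50_l50(lengths):
    """N50: length of the sequence at which the running total of lengths,
    sorted longest first, reaches half the total. L50: how many sequences
    that took."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequences given")


def gc_content(sequence):
    """Fraction of G and C among unambiguous A/C/G/T bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Translation table for complementing bases, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(n50_l50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
    print(gc_content("GGCCAT"))
    print(reverse_complement("AACG"))
```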
python3-sqt
sqt is a collection of command-line tools for working with high-throughput sequencing data. Conceptually not fixed to use any particular language, many sqt subcommands are currently implemented in Python. For them, a Python package is available with functions for reading and writing FASTA/FASTQ files, computing alignments, quality trimming, etc.
The following tools are offered:
- sqt-coverage -- Compute per-reference statistics such as coverage and GC content
- sqt-fastqmod -- FASTQ modifications: shorten, subset, reverse complement, quality trimming.
- sqt-fastastats -- Compute N50, min/max length, GC content etc. of a FASTA file
- sqt-qualityguess -- Guess quality encoding of one or more FASTQ files.
- sqt-globalalign -- Compute a global or semiglobal alignment of two strings.
- sqt-chars -- Count length of the first word given on the command line.
- sqt-sam-cscq -- Add the CS and CQ tags to a SAM file with colorspace reads.
- sqt-fastamutate -- Add substitutions and indels to sequences in a FASTA file.
- sqt-fastaextract -- Efficiently extract one or more regions from an indexed FASTA file.
- sqt-translate -- Replace characters in FASTA files (like the 'tr' command).
- sqt-sam-fixn -- Replace all non-ACGT characters within reads in a SAM file.
- sqt-sam-insertsize -- Mean and standard deviation of paired-end insert sizes.
- sqt-sam-set-op -- Set operations (union, intersection, ...) on SAM/BAM files.
- sqt-bam-eof -- Check for the End-Of-File marker in compressed BAM files.
- sqt-checkfastqpe -- Check whether two FASTQ files contain correctly paired paired-end data.
https://bitbucket.org/marcelm/sqt
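As an illustration of what "global alignment of two strings" computes, here is a minimal Needleman-Wunsch score calculation. It is a sketch of the standard dynamic programming technique, not sqt's own code; the scoring values (`match`, `mismatch`, `gap`) and the function name are assumptions picked for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `a` against all of `b`."""
    # prev[j] holds the best score for a[:i-1] versus b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = cur[j - 1] + gap   # b[j-1] aligned to a gap
            cur.append(max(diagonal, up, left))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACGT"))
    print(global_alignment_score("ACGT", "AGT"))
```

A semiglobal variant would leave gaps at the ends of one sequence unpenalised, which amounts to changing the first row or column and where the final score is read from.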
python3-treetime
TreeTime provides routines for ancestral sequence reconstruction and the maximum likelihood inference of molecular-clock phylogenies, i.e., a tree where all branches are scaled such that the locations of terminal nodes correspond to their sampling times and internal nodes are placed at the most likely time of divergence.
TreeTime aims at striking a compromise between sophisticated probabilistic models of evolution and fast heuristics. It implements GTR models of ancestral inference and branch length optimization, but takes the tree topology as given. To optimize the likelihood of time-scaled phylogenies, TreeTime uses an iterative approach that first infers ancestral sequences given the branch length of the tree, then optimizes the positions of unconstrained nodes on the time axis, and then repeats this cycle. The only topology optimization is the (optional) resolution of polytomies in a way that is most (approximately) consistent with the sampling time constraints on the tree. The package is designed to be used as a stand-alone tool or as a library used in larger phylogenetic analysis workflows.
Features
- ancestral sequence reconstruction (marginal and joint maximum likelihood)
- molecular clock tree inference (marginal and joint maximum likelihood)
- inference of GTR models
- rerooting to obtain best root-to-tip regression
- auto-correlated relaxed molecular clock (with normal prior)
This package provides the Python 3 module.
https://github.com/neherlab/treetime
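The root-to-tip regression mentioned in the feature list can be illustrated with an ordinary least-squares fit of root-to-tip distance against sampling date: the slope estimates the clock rate and the x-intercept estimates the date of the root. The sketch below fits a fixed root only and does not use the TreeTime API; rerooting, as TreeTime does it, would additionally search for the root position that gives the best fit. The function name and the toy numbers are assumptions.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of distance = rate * date + intercept.

    Returns (rate, root_date), where root_date is the date at which the
    fitted line crosses zero distance.
    """
    n = len(dates)
    mean_date = sum(dates) / n
    mean_dist = sum(distances) / n
    sxx = sum((t - mean_date) ** 2 for t in dates)
    sxy = sum((t - mean_date) * (d - mean_dist)
              for t, d in zip(dates, distances))
    rate = sxy / sxx                       # substitutions per site per year
    intercept = mean_dist - rate * mean_date
    root_date = -intercept / rate          # where the line hits distance 0
    return rate, root_date


if __name__ == "__main__":
    # Toy data: three tips sampled five years apart.
    rate, root_date = root_to_tip_regression(
        [2000, 2005, 2010], [0.010, 0.015, 0.020])
    print(rate, root_date)
```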
qcumber
QCPipeline is a tool for quality control. The workflow is as follows:
- Quality control with FastQC
- Trim Reads with Trimmomatic
- Quality control of trimmed reads with FastQC
- Map reads against reference using bowtie2
- Classify reads with Kraken
https://gitlab.com/RKIBioinformaticsPipelines/QCumber
qiime
Microbes surround us, as well as animals, plants and all their parasites, with a strong effect on these organisms and on the environment they live in. Soil quality comes to mind, but so does the effect that bacteria have on each other. Humans influence the absolute and relative abundance of bacteria through antibiotics, food, fertilizers and more, and these changes affect us.
QIIME 2 is a powerful, extensible, and decentralized microbiome analysis package with a focus on data and analysis transparency. QIIME 2 enables researchers to start an analysis with raw DNA sequence data and finish with publication-quality figures and statistical results. Key features:
- Integrated and automatic tracking of data provenance
- Semantic type system
- Plugin system for extending microbiome analysis functionality
- Support for multiple types of user interfaces (e.g. API, command line, graphical)
QIIME 2 is a complete redesign and rewrite of the QIIME 1 microbiome analysis pipeline. QIIME 2 will address many of the limitations of QIIME 1, while retaining the features that make QIIME 1 a powerful and widely-used analysis pipeline.
QIIME 2 currently supports an initial end-to-end microbiome analysis pipeline. New functionality will regularly become available through QIIME 2 plugins. You can view a list of plugins that are currently available on the QIIME 2 plugin availability page. The future plugins page lists plugins that are being developed.
sweed
Biological sequences are available in ever increasing abundance across ever larger populations for ever increasing fractions of the genome. This tool sorts the SNPs for their active or passive contribution to a genetic drift, i.e. to see particular sequences at a higher fraction over time.
https://sco.h-its.org/exelixis/web/software/sweed/
t-coffee
T-Coffee is a multiple sequence alignment package. Given a set of sequences (Proteins or DNA), T-Coffee generates a multiple sequence alignment. Version 2.00 and higher can mix sequences and structures.
T-Coffee allows the combination of a collection of multiple/pairwise, global or local alignments into a single model. It can also estimate the level of consistency of each position within the new alignment with the rest of the alignments. See the pre-print for more information.
T-Coffee has a special mode called M-Coffee that makes it possible to combine the output of many multiple sequence alignment packages. In its published version, it uses MUSCLE, PROBCONS, POA, DiAlign-TS, MAFFT, Clustal W, PCMA and T-Coffee. A special version has been made for Debian, DM-Coffee, that uses only free software by replacing Clustal W with Kalign. Using all 8 methods of M-Coffee can sometimes be a bit heavy. You can use a subset of your favorite methods if you prefer.
https://www.tcoffee.org/Projects/tcoffee/index.html
vmatch
Vmatch is a versatile software tool for efficiently solving large scale sequence matching tasks. It subsumes the software tool REPuter, but is much more general, with a very flexible user interface, and improved space and time requirements.
vsearch
VSEARCH is a versatile 64-bit multithreaded tool for processing metagenomic sequences, including searching, clustering, chimera detection, dereplication, sorting, masking and shuffling.
The aim of this project is to create an alternative to the USEARCH tool developed by Robert C. Edgar (2010). The new tool should:
- have a 64-bit design that handles very large databases and much more than 4GB of memory
- be as accurate or more accurate than usearch
- be as fast or faster than usearch
https://github.com/torognes/vsearch/
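Of the operations listed above, dereplication is the simplest to show: identical sequences are collapsed into one representative with an abundance count. The sketch below is a plain-Python illustration of that idea under simplifying assumptions (full-length exact matches, case-insensitive, one strand only); it is not how VSEARCH is implemented, and the function name is hypothetical.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences and return (sequence, abundance) pairs,
    most abundant first, ties broken alphabetically."""
    counts = Counter(seq.upper() for seq in sequences)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    reads = ["ACGT", "acgt", "GGCC", "ACGT"]
    for sequence, abundance in dereplicate(reads):
        print(sequence, abundance)
```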
Copyright
Upstream sources under their respective copyrights.
License: CC BY-SA 4.0 International and/or GPLv3+ at your discretion.
Copyright © 2023, Jeff Moe.