Usage

Overview

The seqm module contains functions for calculating sequence-related distance and complexity metrics, commonly used in language processing and next-generation sequencing. It has a simple and consistent API that can be used for investigating sequence characteristics:

>>> import seqm
>>> seqm.hamming('ATTATT', 'ATTAGT')
1
>>> seqm.edit('ATTATT', 'ATAGT')
2
>>> seqm.polydict('AAAACCGT')
{'A': 4, 'C': 2, 'G': 1, 'T': 1}
>>> seqm.polylength('AAAACCGT')
4
>>> seqm.entropy('AGGATAAG')
1.40
>>> seqm.gc_percent('AGGATAAG')
0.375
>>> seqm.gc_skew('AGGATAAG')
3.0
>>> seqm.gc_shift('AGGATAAG')
1.67
>>> seqm.dna_weight('AGGATAAG')
3968.59
>>> seqm.rna_weight('AGGATAAG')
4082.59
>>> seqm.aa_weight('AGGATAAG')
700.8
>>> seqm.zipsize('AGGATAAGAGATAGATTT')
22

It also has a seqm.Sequence object for object-based access to these properties:

>>> import seqm
>>> seq = seqm.Sequence('AAAACCGT')
>>> seq.hamming('AAAAGCGT')
1
>>> seq.gc_percent
0.375
>>> seq.revcomplement
ACGGTTTT
>>> seq.dna_weight
3895.59
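
The object-based interface can be pictured as a thin wrapper that stores the sequence string and exposes each metric as a property or method. The sketch below is an illustration only and not seqm's actual implementation; the class name SequenceSketch and its internals are assumptions made for this example.

```python
class SequenceSketch:
    """Hypothetical stand-in for seqm.Sequence (illustration only)."""

    # Pairing table for unambiguous upper-case DNA bases.
    _COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def __init__(self, bases):
        self.bases = bases.upper()

    @property
    def gc_percent(self):
        # Fraction of bases that are G or C.
        return sum(base in "GC" for base in self.bases) / len(self.bases)

    @property
    def revcomplement(self):
        # Complement each base, then reverse the result.
        return self.bases.translate(self._COMPLEMENT)[::-1]

    def hamming(self, other):
        # Count mismatched positions; defined only for equal lengths.
        if len(other) != len(self.bases):
            raise ValueError("sequences must have equal length")
        return sum(a != b for a, b in zip(self.bases, other))


seq = SequenceSketch("AAAACCGT")
print(seq.gc_percent)           # 0.375
print(seq.hamming("AAAAGCGT"))  # 1
```

Properties suit values that depend only on the stored sequence, while comparisons that need a second sequence stay ordinary methods.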

All of the metrics available in the repository are listed below, and can also be found in the API section of the documentation.

List of Available Functions

Sequence Quantification

Function        Metric
polydict()      Length of longest homopolymer for each base in sequence.
polylength()    Length of longest homopolymer in sequence.
entropy()       Shannon entropy for bases in sequence.
gc_percent()    Fraction of GC bases in sequence relative to all bases.
gc_skew()       GC skew for sequence: (#G - #C)/(#G + #C).
gc_shift()      GC shift for sequence: (#A + #T)/(#G + #C).
dna_weight()    Molecular weight for sequence with DNA backbone.
rna_weight()    Molecular weight for sequence with RNA backbone.
aa_weight()     Molecular weight for amino acid sequence.
zipsize()       Compressibility of sequence.
tm()            Melting temperature of sequence.
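
Several of the metrics above follow directly from base counts. The following sketch shows how they might be computed in plain Python. The helper names (shannon_entropy, gc_fraction, at_gc_ratio, longest_runs) are hypothetical and not part of seqm, whose implementation may differ in rounding and edge-case handling.

```python
import math
from collections import Counter
from itertools import groupby


def shannon_entropy(seq):
    # H = -sum(p * log2(p)) over the observed base frequencies.
    counts = Counter(seq)
    total = len(seq)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


def gc_fraction(seq):
    # Share of G and C bases relative to all bases.
    return (seq.count("G") + seq.count("C")) / len(seq)


def at_gc_ratio(seq):
    # (#A + #T) / (#G + #C), the ratio listed for gc_shift.
    return (seq.count("A") + seq.count("T")) / (seq.count("G") + seq.count("C"))


def longest_runs(seq):
    # Longest homopolymer run for each base that occurs in the sequence.
    runs = {}
    for base, group in groupby(seq):
        runs[base] = max(runs.get(base, 0), len(list(group)))
    return runs


print(gc_fraction("AGGATAAG"))            # 0.375
print(round(at_gc_ratio("AGGATAAG"), 2))  # 1.67
print(longest_runs("AAAACCGT"))           # {'A': 4, 'C': 2, 'G': 1, 'T': 1}
```

Note that at_gc_ratio raises ZeroDivisionError for a sequence without G or C; a production version would need to decide how to report that case.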

Domain Conversion

Function          Conversion
revcomplement()   Reverse complement of sequence.
complement()      Complement of sequence.
aa()              Amino acid translation of sequence.
wrap()            Newline-wrap sequence.
likelihood()      Likelihood values from quality scores (presumably Phred).
qscore()          Quality scores from likelihood values (presumably Phred).
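
As an illustration of the first two conversions, complementing a DNA sequence amounts to a character-for-character substitution, and the reverse complement is that result read backwards. This is a minimal sketch that assumes unambiguous A/C/G/T input; the function names are hypothetical and are not seqm's own.

```python
# Watson-Crick pairing for unambiguous DNA bases (upper and lower case).
_PAIRS = str.maketrans("ACGTacgt", "TGCAtgca")


def complement_sketch(seq):
    # Swap each base for its pairing partner, keeping the original order.
    return seq.translate(_PAIRS)


def revcomplement_sketch(seq):
    # Complement the sequence, then reverse it to read the opposite strand.
    return complement_sketch(seq)[::-1]


print(complement_sketch("AAAACCGT"))     # TTTTGGCA
print(revcomplement_sketch("AAAACCGT"))  # ACGGTTTT
```

Characters outside the translation table, such as N, pass through unchanged in this sketch.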

Distance Metrics

Function    Distance Metric
hamming()   Hamming distance between sequences.
edit()      Edit (Levenshtein) distance between sequences.
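
The two metrics differ in what they allow: Hamming distance only counts substitutions between equal-length sequences, while edit distance also permits insertions and deletions. The sketch below uses the textbook definitions with hypothetical function names; it is not taken from seqm's source.

```python
def hamming_sketch(a, b):
    # Number of positions at which two equal-length sequences differ.
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_sketch(a, b):
    # Levenshtein distance via dynamic programming, one row at a time.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


print(hamming_sketch("ATTATT", "ATTAGT"))  # 1
print(edit_sketch("ATTATT", "ATAGT"))      # 2
```

Keeping only two rows of the table holds memory at O(len(b)) while the running time stays O(len(a) * len(b)).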

Utilities

Function            Utility
random_sequence()   Generate random sequence.
wrap()              Newline-wrap sequence.
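
Both utilities are simple to picture in plain Python. The sketch below is an assumption-laden illustration: the function names, the seed parameter, the uniform ACGT alphabet, and the default line width of 60 are all hypothetical and may not match seqm's behavior.

```python
import random


def random_sequence_sketch(length, alphabet="ACGT", seed=None):
    # Draw each base independently and uniformly from the alphabet.
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


def wrap_sketch(seq, bases=60):
    # Break the sequence into lines of at most `bases` characters.
    return "\n".join(seq[i:i + bases] for i in range(0, len(seq), bases))


# Prints three lines of ten bases each.
print(wrap_sketch("AGTAGTAGTAGTATAGTAGTAGTAGAAAAT", bases=10))
```

Passing a fixed seed makes the random sequence reproducible, which is convenient in tests.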

Command-Line Usage

Once seqm is installed, all methods can be accessed via the seqm entry point:

~$ seqm

To run a specific method on a sequence, use:

~$ seqm gc_skew AGTAGTAGTTTAGGTTAGGTAG
8.0

For commands comparing sequences, simply use both sequences as arguments:

~$ seqm edit AGTAGTAGTAGTAT AGTAGTAGTAGAAAAT
3

Finally, to pass options to a method on the command line, use:

~$ seqm wrap --bases=10 AGTAGTAGTAGTATAGTAGTAGTAGAAAAT
AGTAGTAGTA
GTATAGTAGT
AGTAGAAAAT

You can also pipe commands with the CLI tool, where - stands for the piped input:

~$ seqm random --length 10 | seqm wrap --bases 5 -
ATGGA
TATTA
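
The convention of passing - to read from standard input can be sketched with argparse. This is not seqm's actual entry point; the program name wrap-sketch, the main function, and the default of 60 bases are assumptions made for this example.

```python
import argparse
import io
import sys


def wrap(seq, bases):
    # Break the sequence into lines of at most `bases` characters.
    return "\n".join(seq[i:i + bases] for i in range(0, len(seq), bases))


def main(argv=None, stdin=None):
    parser = argparse.ArgumentParser(prog="wrap-sketch")
    parser.add_argument("--bases", type=int, default=60)
    parser.add_argument("sequence", help="sequence, or '-' to read from stdin")
    args = parser.parse_args(argv)

    # A lone '-' conventionally means "read the input from standard input".
    if args.sequence == "-":
        stream = stdin if stdin is not None else sys.stdin
        args.sequence = stream.read().strip()
    return wrap(args.sequence, args.bases)


# Simulate `echo ATGGATATTA | wrap-sketch --bases 5 -` without a real pipe.
print(main(["--bases", "5", "-"], stdin=io.StringIO("ATGGATATTA\n")))
```

Accepting the stream as a parameter keeps the function testable without spawning a subprocess.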