About SPatt

What is SPatt?

SPatt (Statistic for Patterns) is a suite of C++ programs for computing the p-value of pattern occurrence counts in a text. Assuming the text is generated by a Markov model, the p-value of an observation is the probability, under that model, of seeing a count at least as extreme as the one observed. The lower the p-value, the more unlikely the observation. For example, these tools can be used to find patterns with unusual behaviour in DNA or protein sequences.

Typical usage

The DNA motif/pattern GCTGG|CCAGC (meaning GCTGG or CCAGC) occurs 210 times in the complete genome of the bacteriophage lambda (phage_lambda.fasta, 48 kb). How significant is this observation, assuming that the DNA sequence is random with independent, identically and uniformly distributed letters (freq(A)=freq(C)=freq(G)=freq(T)=0.25)?

Here is the command to run ("-S" for the provided sequence, "-p" for the pattern, "-a" for the alphabet descriptor, "-m" for the Markov model order, where "-1" means independent, identically and uniformly distributed letters, and "--over" to compute the over-representation p-value):

spatt -S phage_lambda.fasta -p "GCTGG|CCAGC" -a "ACGT" -m -1 --over

and here is its (truncated) result:

distribution:
P(N=0)=9.698565e-42
P(N=1)=9.130414e-40
P(N=2)=4.298067e-38
P(N=3)=1.348945e-36
P(N=4)=3.175459e-35
[...]
P(N=206)=2.629107e-23
P(N=207)=1.210827e-23
P(N=208)=5.549911e-24
P(N=209)=2.531807e-24
P(N>=210)=2.090885e-24
pattern=GCTGG|CCAGC	Nobs=210	P(N>=Nobs)=2.090885e-24

This result indicates that the observation of "at least 210 occurrences of GCTGG|CCAGC in a random DNA sequence of 48 kb" is highly significant with a p-value of 2.1e-24.
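As a rough sanity check on why 210 occurrences is surprising, the expected count under the uniform model can be worked out by hand. The short Python sketch below is not SPatt output and does not use SPatt. It assumes a sequence length of exactly 48,000 letters (the text above only says "48 kb") and simply multiplies the number of possible start positions by the probability that either word appears at a given position.

```python
# Back-of-the-envelope expected count under the i.i.d. uniform model.
# Assumption: n = 48,000 is a stand-in for the "48 kb" genome length.
n = 48_000
words = ["GCTGG", "CCAGC"]   # the two alternatives of the pattern
p_letter = 0.25              # freq(A)=freq(C)=freq(G)=freq(T)=0.25

# Both words have length 5, so they share the same set of start positions.
positions = n - len(words[0]) + 1

# By linearity of expectation, sum the per-position probabilities of each word.
expected = positions * sum(p_letter ** len(w) for w in words)

print(f"expected occurrences: {expected:.1f}")  # prints "expected occurrences: 93.7"
```

Under these assumptions about 94 occurrences are expected, so an observed count of 210 is more than twice the expectation, which fits the very small over-representation p-value reported above.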

Features

Computing the distribution of a pattern count in random sequences is a challenging and computationally intensive task for which many competing approaches exist. The goal of SPatt is to implement the most relevant ones and make them available in a single easy-to-use package. Here is a list of the features currently implemented in SPatt:

  • arbitrary alphabet (DNA, protein, binary, others);
  • automatic detection of case sensitive alphabets;
  • regex-like syntax allowing for complex patterns;
  • homogeneous Markov model of arbitrary order;
  • exact computations for a single sequence or a set of sequences;
  • Gaussian approximations;
  • overlapping or renewal counting;
  • presence/absence counting when dealing with datasets with several sequences;
  • efficient implementation using optimal Markov chain embedding through deterministic finite automata;
  • output of Scilab source code for the Markov chain embedding parameters (mostly for educational purposes);
  • optional output of dot (Graphviz package) files for representing automata.
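To give a feel for what "Markov chain embedding through deterministic finite automata" means, here is a minimal Python sketch of the idea. It is not SPatt's implementation. The function names (build_automaton, count_distribution) are hypothetical, the automaton is a plain Aho-Corasick construction rather than an optimal (minimal) one, and only independent letters with overlapping counting are handled. The automaton reads the sequence letter by letter, and the computation tracks the joint probability of (automaton state, number of occurrences so far).

```python
from collections import deque

def build_automaton(words, alphabet):
    """Aho-Corasick style DFA. Returns (trans, hits) where trans[s][ch] is the
    next state and hits[s] is the number of words that end on entering s."""
    goto, hits = [{}], [0]
    for w in words:                      # build the trie of all words
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                hits.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        hits[s] += 1
    fail = [0] * len(goto)
    trans = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:                  # transitions out of the root
        t = goto[0].get(ch, 0)
        trans[0][ch] = t
        if t:
            queue.append(t)
    while queue:                         # breadth-first completion of the DFA
        s = queue.popleft()
        hits[s] += hits[fail[s]]         # words ending at a proper suffix
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = trans[fail[s]][ch]
                trans[s][ch] = t
                queue.append(t)
            else:
                trans[s][ch] = trans[fail[s]][ch]
    return trans, hits

def count_distribution(words, alphabet, n, probs=None):
    """Exact P(N=k), k = 0, 1, ..., for the overlapping occurrence count N
    in a random sequence of n independent letters (uniform by default)."""
    if probs is None:
        probs = {ch: 1.0 / len(alphabet) for ch in alphabet}
    trans, hits = build_automaton(words, alphabet)
    dist = {(0, 0): 1.0}                 # (state, count) -> probability
    for _ in range(n):
        nxt = {}
        for (s, k), p in dist.items():
            for ch in alphabet:
                t = trans[s][ch]
                key = (t, k + hits[t])
                nxt[key] = nxt.get(key, 0.0) + p * probs[ch]
        dist = nxt
    out = {}
    for (_, k), p in dist.items():       # marginalise over automaton states
        out[k] = out.get(k, 0.0) + p
    return [out.get(k, 0.0) for k in range(max(out) + 1)]

# Toy usage on a short sequence; pure Python is far too slow for whole genomes.
d = count_distribution(["GCTGG", "CCAGC"], "ACGT", 8)
print(d)
```

The sketch keeps the full (state, count) table in a dictionary for readability. A production tool would need a much more careful representation to handle sequences of genomic length and Markov models of higher order.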

More details

If you need more details on the statistical framework and the methodology implemented in SPatt, please check the Statistics section. If you want to learn how to get SPatt on your system, please check the Download section. For a step-by-step tutorial, check the Tutorial section. Finally, to learn more about the contributors to SPatt and how to cite SPatt, check the Credits section.

Last edited 01/13/2012