Introduction  ::   Integer intervals  ::   Suffix trees  ::   Compressed suffix arrays  ::   Utils  ::   Installation

Introduction

What PostBio is

PostBio is a set of bioinformatics extensions for PostgreSQL. It includes three data types: int_interval, an integer interval used to represent biological sequence features, stree, a suffix tree type to search for maximum unique matches, and fmindex, a compressed suffix array for fast short exact matches. PostBio also provides a set of utilitary routines.

PostBio is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.

Integer intervals

A type for biological sequence features

An int_interval is not much different from a line segment lseg type in PostGIS, being an specialization of it: only integer coordinates are allowed and the line lies in the unidimensional space represented by a sequence, which gives the "interval" representation. As such, integer intervals admit topological operations and can be indexed by GiSTs.

Values of int_interval are specified with

int_interval '( low , high )'

where low and high are integers. Implicit casts from and to text are possible, and an integer i can be casted to (i, i).

The following table describes all int_interval operators, where we use ii1 as int_interval'(l1, h1)' and ii2 as int_interval'(l2, h2)' in the example column. The operators are actually syntatic sugar for the functions in the function column.

Operator Description Function Example
# Size (length) iint_size #ii1 = h1 - l1 + 1
#< Lower endpoint iint_low #<ii1 = l1
#> Upper endpoint iint_high #>ii1 = h1
|-| Distance iint_distance ii1 |-| ii2 = (|l1 - l2| + |h1 - h2| - #ii1 - #ii2) / 2
<-> Buffer iint_buffer ii1 <-> b = int_interval(max(0, l1 - b), h1 + b)
<< Strictly left? iint_left ii1 << ii2 is true iff h1 < l2
&< Left? iint_overleft ii1 &< ii2 is true iff h1 <= h2
&& Overlaps? iint_overlaps ii1 && ii2 is true iff l1 <= h2 and h1 >= l2
&> Right? iint_overright ii1 &> ii2 is true iff l1 >= l2
>> Strictly right? iint_right ii1 >> ii2 is true iff l1 > h2
~= Same? iint_same ii1 ~= ii2 is true iff l1 = l2 and h1 = h2
~ Contains? iint_contains ii1 ~ ii2 is true iff l1 <= l2 and h1 >= h2
@ Contained? iint_contained ii1 @ ii2 is true iff ii2 ~ ii1
+ Join iint_join ii1 + ii2 = int_interval(min(l1, l2), max(h1, h2))
if ii1 and ii2 overlap and NULL otherwise
* Overlap iint_overlap ii1 * ii2 = int_interval(max(l1, l2), min(h1, h2))
if ii1 and ii2 overlap and NULL otherwise
^ Span iint_span ii1 ^ ii2 = int_interval(min(l1, l2), max(h1, h2))

Operators join and overlap are pairwise union and intersection respectively. Union and intersection are defined as aggregate functions iint_union and iint_intersection returning an int_interval. The aggregator version of iint_span is iint_cover.

A function intpair_to_iint that takes integers l and h as arguments and returns int_interval'(l, h)' is provided for convenience. There is also a version of iint_buffer that allows the specification of left and right offsets: iint_buffer(ii, b1, b2), where ii = (l, h), returns (max(0, l - b1), h + b2); if b1 = b2 we have the version in the table.

Examples

A sequence feature corresponds to a triplet (id, orient, ii) where id is a sequence identifier (possibly the primary key for a sequence table), orient is a boolean indicating if the feature is in the same or contrary orientation of the sequence, and ii is the int_interval representing the feature as a subsequence.

Now, as a genomic application, suppose we have two tables of sequence features: genes and probesets, both with sequence identifiers relative to the chromosome sequence table for some species:

CREATE TABLE genes (
  id integer PRIMARY KEY,
  seq_id integer REFERENCES sequences(id),
  orient boolean,
  cds int_interval -- coding region
);
CREATE INDEX genes_id_idx ON genes(seq_id, orient);
CREATE INDEX genes_cds_idx ON genes USING gist(cds);

CREATE TABLE probesets (
  id integer PRIMARY KEY,
  seq_id integer REFERENCES sequences(id),
  orient boolean,
  region int_interval
);
CREATE INDEX probesets_id_idx ON probesets(seq_id, orient);
CREATE INDEX probesets_region_idx ON probesets USING gist(region);

Our first task is to select which probesets are contained within each gene:

SELECT p.seq_id, p.orient, p.region, g.cds FROM probesets p JOIN genes g ON
  g.seq_id=p.seq_id AND g.orient=p.orient -- same sequence and orientation
  AND p.region @ g.cds; -- probeset contained in CDS

As a more substantial application, let's now select probesets that are closest to a gene, within a specific maximum from the coding region (CDS) but not overlapping the CDS. We start with an auxiliary view:

\set offset 500
CREATE VIEW pset_buffer AS
  SELECT g.id AS gid, p.id AS pid, @(p.region |-| g.cds) AS dist
    FROM probesets p JOIN genes g ON
    g.seq_id=p.seq_id AND g.orient=p.orient -- same sequence and orientation
    -- overlapping buffer but not CDS:
    AND p.region && (g.cds <-> :offset) AND NOT p.region && g.cds;

Finally, we use pset_buffer to select probesets that attain minimum distance and report more details:

SELECT g.seq_id, g.orient, p.region, g.cds FROM genes g JOIN
  (SELECT b.gid, b.pid FROM pset_buffer b
    JOIN (SELECT gid, MIN(dist) AS dist FROM pset_buffer GROUP BY gid) AS q ON
      b.gid=q.gid AND b.dist=q.dist) AS j ON g.id=j.gid
    JOIN probesets p ON p.id=j.pid;

Suffix trees

Efficient nucleotide sequence search

A suffix tree for a reference sequence S is a data structure representing suffixes of S in a way to allow fast searching of maximal matches from query sequences. You can get more general information on suffix trees at Wikipedia.

The suffix tree data type, stree, is a serialized version of the suffix tree struct found in MUMmer version 3.22, more specifically in its core suffix tree library libstree implemented by Stefan Kurtz. For convenience, libstree is provided in PostBio but you are strongly advised to refer to the links for more details and references.

Values of stree are specified from a lower-case extended nucleotide sequence, that is,

stree '[a | c | g | t | s | w | r | y | m | k | b | d | h | v | n]+'

Suffix trees can also be casted from and to text. If you decide to store your reference sequences as strees and plan on retrieving many subsequences then stree_subtext(stree, start, len) is more memory efficient than casting to text and running substr(reftext, start, len).

Maximal matches between a stree and a query sequence can be found with stree_maxmatch, which uses a composite type, streematch, to store each result. If only the number of matches are wanted stree_maxmatchcount can be used.

CREATE TYPE streematch AS
  (matchlen integer, refstart integer, querystart integer);

stree_maxmatch(stree, query, uniqueinref, minmatchlen)

SRF that returns a maximal match (of type streematch) per row between the reference sequence in stree and the query sequence. uniqueinref is a boolean for whether the matches in the reference are unique; true is equivalent to specifying "-mumcand" in MUMmer (the default behavior) while false is equivalent to "-maxmatch". minmatchlen is the minimum length of the match, being equivalent to the "-l" input in MUMmer; if minmatchlen = 0 then the query sequence length is taken as the actual minimum length.

stree_maxmatchcount(stree, query, uniqueinref, minmatchlen, anymatch)

Returns the number of maximal matches in stree from query with length at least minmatchlen. The first four arguments have the same interpretation as in maxmatch. anymatch is a boolean for whether maxmatchcount should only check if there is any match returning 0 or 1, or if the actual number of matches should be returned.

Examples

A simple example would be:

# select stree_maxmatch('acgtacgt', 'cgta', false, 2);
 stree_maxmatch 
----------------
 (4,2,1)
 (3,6,1)
(2 rows)

If uniqueinref were true the result would be only the first row. In the next example we have two tables, one with sequences and other with motifs, and we wish to count the number of occurrences of each motif in each sequence:

SELECT s.id, motif, stree_maxmatchcount(seq::stree, motif, false, 0) AS nmatches
  FROM sequences s, motifs;

Compressed suffix arrays

Efficient exact short sequence search

Similarly to a suffix tree, a compressed suffix array aims to provide fast searching of exact matches in a reference sequence from a query sequence.

The compressed suffix array data type in PostBio, fmindex, is actually a FM index based on the Burrows-Wheeler transform of a sequence and due to Ferragina and Manzini. The latest version of this index, FMindexV2, was used to implement fmindex; please refer to the link for more details.

FM indexes can be explicitly created by fmindex_build, or simply casted from a text type. The following functions operate on a FM index data type:

fmindex_extract(fmindex, start, len)

Returns the substring of the sequence indexed by fmindex from position start and length len.

fmindex_length(fmindex)

Returns the length of the sequence indexed by fmindex.

fmindex_count(fmindex, pattern)

Returns the number of occurrences of pattern in the sequence indexed by fmindex.

fmindex_locate(fmindex, pattern)

SRF that returns the locations of pattern in the sequence indexed by fmindex, if any match occurs.

Examples

Suppose that the sequences from MUMmer's tutorial are stored in the following table:

CREATE TABLE pylori (id integer, suffix stree, sarray fmindex);
INSERT INTO pylori SELECT id, stree_build(seq), fmindex_build(seq)
  FROM (SELECT 1 AS id, fasta('H_pylori26695_Eslice.fasta') AS seq) AS q;
INSERT INTO pylori SELECT id, stree_build(seq), fmindex_build(seq)
  FROM (SELECT 2 AS id, fasta('H_pyloriJ99_Eslice.fasta') AS seq) AS q;

We could use the suffix tree to find all occurrences of, say, accgtc,

# select id, stree_maxmatch(suffix, 'accgtc', false, 0) from pylori;
 id |   maxmatch   
----+--------------
  1 | (6,119609,1)
  1 | (6,185413,1)
  2 | (6,85382,1)
(3 rows)

but a FM index is more efficient in this case; let us start with counting the number of occurrences,

# select id, fmindex_count(sarray, 'accgtc') from pylori;
 id | fmindex_count 
----+---------------
  1 |             2
  2 |             1
(2 rows)

and then listing the locations

# select id, fmindex_locate(sarray, 'accgtc') from pylori;
 id | fmindex_locate 
----+----------------
  1 |         185413
  1 |         119609
  2 |          85382
(3 rows)

Finally, let us check the results:

# select id, fmindex_extract(sarray, fmindex_locate(sarray, 'accgtc'), 6) from pylori;
 id | fmindex_extract 
----+-----------------
  1 | accgtc
  1 | accgtc
  2 | accgtc
(3 rows)

Utils

Utilitary routines

For now PostBio offers only one utilitary routine:

revcomp(seq [, complement])

Returns the reverse complement of sequence seq if complement is true (the default value) and the simple reverse otherwise. seq should contain only extended nucleotides, that is, only the same letters from the alphabet defining stree.

Installation

How to obtain and install PostBio

PostBio is distributed as a source package and can be obtained at PgFoundry. After downloading and unpacking, our first task is to compile libstree:

$ cd stree && make && make clean

The source package is then installed like any regular PostgreSQL module, that is, just run:

$ make && sudo make install
$ psql -f postbio.sql <database>

License

Copyright (c) 2008 Luis Carvalho

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.