Introduction
What PostBio isPostBio is a set of bioinformatics extensions for
PostgreSQL. It includes three data
types: int_interval
, an integer
interval used to represent biological sequence features,
stree
, a suffix tree type to search for
maximum unique matches, and
fmindex
, a compressed suffix array for
fast short exact matches.
PostBio also provides a set of utilitary routines.
PostBio is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.
Integer intervals
A type for biological sequence featuresAn int_interval
is not much different from a
line segment lseg
type in PostGIS, being an specialization of it:
only integer coordinates are allowed and the line lies in the unidimensional
space represented by a sequence, which gives the "interval" representation. As
such, integer intervals admit topological operations and can be indexed by
GiSTs.
Values of int_interval
are specified with
int_interval '( low , high )'
where low
and high
are integers. Implicit casts
from and to text
are possible, and an integer i
can
be casted to (i, i)
.
The following table describes all int_interval
operators,
where we use ii1
as int_interval'(l1, h1)'
and
ii2
as int_interval'(l2, h2)'
in the
example column. The operators are actually syntatic sugar for the functions in
the function column.
Operator  Description  Function  Example 
# 
Size (length)  iint_size 
#ii1 = h1  l1 + 1 
#< 
Lower endpoint  iint_low 
#<ii1 = l1 
#> 
Upper endpoint  iint_high 
#>ii1 = h1 
 
Distance  iint_distance 
ii1  ii2 = (l1  l2 + h1  h2  #ii1  #ii2) / 2

<> 
Buffer  iint_buffer 
ii1 <> b = int_interval(max(0, l1  b), h1 + b)

<< 
Strictly left?  iint_left 
ii1 << ii2 is true iff
h1 < l2 
&< 
Left?  iint_overleft 
ii1 &< ii2 is true iff
h1 <= h2 
&& 
Overlaps?  iint_overlaps 
ii1 && ii2 is true iff
l1 <= h2 and h1 >= l2 
&> 
Right?  iint_overright 
ii1 &> ii2 is true iff
l1 >= l2 
>> 
Strictly right?  iint_right 
ii1 >> ii2 is true iff
l1 > h2 
~= 
Same?  iint_same 
ii1 ~= ii2 is true iff
l1 = l2 and h1 = h2 
~ 
Contains?  iint_contains 
ii1 ~ ii2 is true iff
l1 <= l2 and h1 >= h2 
@ 
Contained?  iint_contained 
ii1 @ ii2 is true iff
ii2 ~ ii1 
+ 
Join  iint_join 
ii1 + ii2 = int_interval(min(l1, l2), max(h1, h2)) if ii1 and ii2 overlap and NULL
otherwise

* 
Overlap  iint_overlap 
ii1 * ii2 = int_interval(max(l1, l2), min(h1, h2)) if ii1 and ii2 overlap and NULL
otherwise

^ 
Span  iint_span 
ii1 ^ ii2 = int_interval(min(l1, l2), max(h1, h2))

Operators join and overlap are pairwise union and
intersection respectively. Union and intersection are defined as aggregate
functions iint_union
and iint_intersection
returning
an int_interval
. The aggregator version of iint_span
is iint_cover
.
A function intpair_to_iint
that takes integers l
and h
as arguments and returns int_interval'(l, h)'
is provided for convenience. There is also a version of
iint_buffer
that allows the specification of left and right
offsets: iint_buffer(ii, b1, b2)
, where ii = (l, h)
,
returns (max(0, l  b1), h + b2)
; if b1 = b2
we have
the version in the table.
Examples
A sequence feature corresponds to a triplet
(id, orient, ii)
where id
is a sequence identifier
(possibly the primary key for a sequence table), orient
is a
boolean indicating if the feature is in the same or contrary orientation of
the sequence, and ii
is the int_interval
representing the feature as a subsequence.
Now, as a genomic application, suppose we have two tables of sequence
features: genes
and probesets
, both with sequence
identifiers relative to the chromosome sequence table for some species:
CREATE TABLE genes ( id integer PRIMARY KEY, seq_id integer REFERENCES sequences(id), orient boolean, cds int_interval  coding region ); CREATE INDEX genes_id_idx ON genes(seq_id, orient); CREATE INDEX genes_cds_idx ON genes USING gist(cds); CREATE TABLE probesets ( id integer PRIMARY KEY, seq_id integer REFERENCES sequences(id), orient boolean, region int_interval ); CREATE INDEX probesets_id_idx ON probesets(seq_id, orient); CREATE INDEX probesets_region_idx ON probesets USING gist(region);
Our first task is to select which probesets are contained within each gene:
SELECT p.seq_id, p.orient, p.region, g.cds FROM probesets p JOIN genes g ON g.seq_id=p.seq_id AND g.orient=p.orient  same sequence and orientation AND p.region @ g.cds;  probeset contained in CDS
As a more substantial application, let's now select probesets that are closest to a gene, within a specific maximum from the coding region (CDS) but not overlapping the CDS. We start with an auxiliary view:
\set offset 500 CREATE VIEW pset_buffer AS SELECT g.id AS gid, p.id AS pid, @(p.region  g.cds) AS dist FROM probesets p JOIN genes g ON g.seq_id=p.seq_id AND g.orient=p.orient  same sequence and orientation  overlapping buffer but not CDS: AND p.region && (g.cds <> :offset) AND NOT p.region && g.cds;
Finally, we use pset_buffer
to select probesets that attain
minimum distance and report more details:
SELECT g.seq_id, g.orient, p.region, g.cds FROM genes g JOIN (SELECT b.gid, b.pid FROM pset_buffer b JOIN (SELECT gid, MIN(dist) AS dist FROM pset_buffer GROUP BY gid) AS q ON b.gid=q.gid AND b.dist=q.dist) AS j ON g.id=j.gid JOIN probesets p ON p.id=j.pid;
Suffix trees
Efficient nucleotide sequence searchA suffix tree for a reference sequence S
is a data structure
representing suffixes of S
in a way to allow fast searching of
maximal matches from query sequences. You can get more general
information on suffix trees at
Wikipedia.
The suffix tree data type, stree
, is a
serialized version of the suffix tree struct found in
MUMmer version 3.22, more
specifically in its core suffix tree library libstree implemented by
Stefan Kurtz. For convenience, libstree is provided in PostBio but you
are strongly advised to refer to the links for more details and
references.
Values of stree
are specified from a lowercase
extended nucleotide sequence, that is,
stree '[a  c  g  t  s  w  r  y  m  k  b  d  h  v  n]+'
Suffix trees can also be casted from and to text
. If you
decide to store your reference sequences as stree
s and plan on
retrieving many subsequences then
stree_subtext(stree, start, len)
is more
memory efficient than casting to text
and running
substr(reftext, start, len)
.
Maximal matches between a stree
and a query sequence can be
found with stree_maxmatch
, which uses a composite type,
streematch
, to store each result. If only the number of matches
are wanted stree_maxmatchcount
can be used.
CREATE TYPE streematch AS (matchlen integer, refstart integer, querystart integer);
stree_maxmatch(stree, query, uniqueinref, minmatchlen)
SRF that returns a
maximal match (of type streematch
) per row between the reference
sequence in stree
and the query
sequence.
uniqueinref
is a boolean for whether the matches in the reference
are unique; true
is equivalent to specifying
"mumcand
" in MUMmer (the default behavior) while
false
is equivalent to "maxmatch
".
minmatchlen
is the minimum length of the match, being
equivalent to the "l
" input in MUMmer; if
minmatchlen = 0
then the query sequence length is taken as the
actual minimum length.
stree_maxmatchcount(stree, query, uniqueinref, minmatchlen, anymatch)
Returns the number of maximal matches in stree
from
query
with length at least minmatchlen
. The first
four arguments have the same interpretation as in maxmatch
.
anymatch
is a boolean for whether maxmatchcount
should only check if there is any match returning 0 or 1, or if the actual
number of matches should be returned.
Examples
A simple example would be:
# select stree_maxmatch('acgtacgt', 'cgta', false, 2); stree_maxmatch  (4,2,1) (3,6,1) (2 rows)
If uniqueinref
were true
the result would be only
the first row. In the next example we have two tables, one with sequences and
other with motifs, and we wish to count the number of occurrences of each
motif in each sequence:
SELECT s.id, motif, stree_maxmatchcount(seq::stree, motif, false, 0) AS nmatches FROM sequences s, motifs;
Compressed suffix arrays
Efficient exact short sequence searchSimilarly to a suffix tree, a compressed suffix array aims to provide fast searching of exact matches in a reference sequence from a query sequence.
The compressed suffix array data type in PostBio, fmindex
, is
actually a FM index based on the BurrowsWheeler transform of a
sequence and due to Ferragina and Manzini. The latest version of this index,
FMindexV2,
was used to implement fmindex
; please refer to the link for more
details.
FM indexes can be explicitly created by fmindex_build
, or
simply casted from a text
type. The following functions operate
on a FM index data type:
fmindex_extract(fmindex, start, len)
Returns the substring of the sequence indexed by fmindex
from
position start
and length len
.
fmindex_length(fmindex)
Returns the length of the sequence indexed by fmindex
.
fmindex_count(fmindex, pattern)
Returns the number of occurrences of pattern
in the sequence
indexed by fmindex
.
fmindex_locate(fmindex, pattern)
SRF that returns the
locations of pattern
in the sequence indexed by
fmindex
, if any match occurs.
Examples
Suppose that the sequences from MUMmer's tutorial are stored in the following table:
CREATE TABLE pylori (id integer, suffix stree, sarray fmindex); INSERT INTO pylori SELECT id, stree_build(seq), fmindex_build(seq) FROM (SELECT 1 AS id, fasta('H_pylori26695_Eslice.fasta') AS seq) AS q; INSERT INTO pylori SELECT id, stree_build(seq), fmindex_build(seq) FROM (SELECT 2 AS id, fasta('H_pyloriJ99_Eslice.fasta') AS seq) AS q;
We could use the suffix tree to find all occurrences of, say,
accgtc
,
# select id, stree_maxmatch(suffix, 'accgtc', false, 0) from pylori; id  maxmatch + 1  (6,119609,1) 1  (6,185413,1) 2  (6,85382,1) (3 rows)
but a FM index is more efficient in this case; let us start with counting the number of occurrences,
# select id, fmindex_count(sarray, 'accgtc') from pylori; id  fmindex_count + 1  2 2  1 (2 rows)
and then listing the locations
# select id, fmindex_locate(sarray, 'accgtc') from pylori; id  fmindex_locate + 1  185413 1  119609 2  85382 (3 rows)
Finally, let us check the results:
# select id, fmindex_extract(sarray, fmindex_locate(sarray, 'accgtc'), 6) from pylori; id  fmindex_extract + 1  accgtc 1  accgtc 2  accgtc (3 rows)
Utils
Utilitary routinesFor now PostBio offers only one utilitary routine:
revcomp(seq [, complement])
Returns the reverse complement of sequence seq
if
complement
is true
(the default value) and the
simple reverse otherwise. seq
should contain only extended
nucleotides, that is, only the same letters from the alphabet defining
stree
.
Installation
How to obtain and install PostBioPostBio is distributed as a source package and can be obtained at
PgFoundry.
After downloading and unpacking, our first task is to compile
libstree
:
$ cd stree && make && make clean
The source package is then installed like any regular PostgreSQL module, that is, just run:
$ make && sudo make install $ psql f postbio.sql <database>
License
Copyright (c) 2008 Luis Carvalho
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.