Introduction
What PostBio isPostBio is a set of bioinformatics extensions for
PostgreSQL. It includes three data
types: int_interval, an integer
interval used to represent biological sequence features,
stree, a suffix tree type to search for
maximum unique matches, and
fmindex, a compressed suffix array for
fast short exact matches.
PostBio also provides a set of utilitary routines.
PostBio is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.
Integer intervals
A type for biological sequence featuresAn int_interval is not much different from a
line segment lseg type in PostGIS, being an specialization of it:
only integer coordinates are allowed and the line lies in the unidimensional
space represented by a sequence, which gives the "interval" representation. As
such, integer intervals admit topological operations and can be indexed by
GiSTs.
Values of int_interval are specified with
int_interval '( low , high )'
where low and high are integers. Implicit casts
from and to text are possible, and an integer i can
be casted to (i, i).
The following table describes all int_interval operators,
where we use ii1 as int_interval'(l1, h1)' and
ii2 as int_interval'(l2, h2)' in the
example column. The operators are actually syntatic sugar for the functions in
the function column.
| Operator | Description | Function | Example |
# |
Size (length) | iint_size |
#ii1 = h1 - l1 + 1 |
#< |
Lower endpoint | iint_low |
#<ii1 = l1 |
#> |
Upper endpoint | iint_high |
#>ii1 = h1 |
|-| |
Distance | iint_distance |
ii1 |-| ii2 = (|l1 - l2| + |h1 - h2| - #ii1 - #ii2) / 2
|
<-> |
Buffer | iint_buffer |
ii1 <-> b = int_interval(max(0, l1 - b), h1 + b)
|
<< |
Strictly left? | iint_left |
ii1 << ii2 is true iff
h1 < l2 |
&< |
Left? | iint_overleft |
ii1 &< ii2 is true iff
h1 <= h2 |
&& |
Overlaps? | iint_overlaps |
ii1 && ii2 is true iff
l1 <= h2 and h1 >= l2 |
&> |
Right? | iint_overright |
ii1 &> ii2 is true iff
l1 >= l2 |
>> |
Strictly right? | iint_right |
ii1 >> ii2 is true iff
l1 > h2 |
~= |
Same? | iint_same |
ii1 ~= ii2 is true iff
l1 = l2 and h1 = h2 |
~ |
Contains? | iint_contains |
ii1 ~ ii2 is true iff
l1 <= l2 and h1 >= h2 |
@ |
Contained? | iint_contained |
ii1 @ ii2 is true iff
ii2 ~ ii1 |
+ |
Join | iint_join |
ii1 + ii2 = int_interval(min(l1, l2), max(h1, h2))if ii1 and ii2 overlap and NULL
otherwise
|
* |
Overlap | iint_overlap |
ii1 * ii2 = int_interval(max(l1, l2), min(h1, h2))if ii1 and ii2 overlap and NULL
otherwise
|
^ |
Span | iint_span |
ii1 ^ ii2 = int_interval(min(l1, l2), max(h1, h2))
|
Operators join and overlap are pairwise union and
intersection respectively. Union and intersection are defined as aggregate
functions iint_union and iint_intersection returning
an int_interval. The aggregator version of iint_span
is iint_cover.
A function intpair_to_iint that takes integers l
and h as arguments and returns int_interval'(l, h)'
is provided for convenience. There is also a version of
iint_buffer that allows the specification of left and right
offsets: iint_buffer(ii, b1, b2), where ii = (l, h),
returns (max(0, l - b1), h + b2); if b1 = b2 we have
the version in the table.
Examples
A sequence feature corresponds to a triplet
(id, orient, ii) where id is a sequence identifier
(possibly the primary key for a sequence table), orient is a
boolean indicating if the feature is in the same or contrary orientation of
the sequence, and ii is the int_interval
representing the feature as a subsequence.
Now, as a genomic application, suppose we have two tables of sequence
features: genes and probesets, both with sequence
identifiers relative to the chromosome sequence table for some species:
CREATE TABLE genes ( id integer PRIMARY KEY, seq_id integer REFERENCES sequences(id), orient boolean, cds int_interval -- coding region ); CREATE INDEX genes_id_idx ON genes(seq_id, orient); CREATE INDEX genes_cds_idx ON genes USING gist(cds); CREATE TABLE probesets ( id integer PRIMARY KEY, seq_id integer REFERENCES sequences(id), orient boolean, region int_interval ); CREATE INDEX probesets_id_idx ON probesets(seq_id, orient); CREATE INDEX probesets_region_idx ON probesets USING gist(region);
Our first task is to select which probesets are contained within each gene:
SELECT p.seq_id, p.orient, p.region, g.cds FROM probesets p JOIN genes g ON g.seq_id=p.seq_id AND g.orient=p.orient -- same sequence and orientation AND p.region @ g.cds; -- probeset contained in CDS
As a more substantial application, let's now select probesets that are closest to a gene, within a specific maximum from the coding region (CDS) but not overlapping the CDS. We start with an auxiliary view:
\set offset 500
CREATE VIEW pset_buffer AS
SELECT g.id AS gid, p.id AS pid, @(p.region |-| g.cds) AS dist
FROM probesets p JOIN genes g ON
g.seq_id=p.seq_id AND g.orient=p.orient -- same sequence and orientation
-- overlapping buffer but not CDS:
AND p.region && (g.cds <-> :offset) AND NOT p.region && g.cds;
Finally, we use pset_buffer to select probesets that attain
minimum distance and report more details:
SELECT g.seq_id, g.orient, p.region, g.cds FROM genes g JOIN
(SELECT b.gid, b.pid FROM pset_buffer b
JOIN (SELECT gid, MIN(dist) AS dist FROM pset_buffer GROUP BY gid) AS q ON
b.gid=q.gid AND b.dist=q.dist) AS j ON g.id=j.gid
JOIN probesets p ON p.id=j.pid;
Suffix trees
Efficient nucleotide sequence searchA suffix tree for a reference sequence S is a data structure
representing suffixes of S in a way to allow fast searching of
maximal matches from query sequences. You can get more general
information on suffix trees at
Wikipedia.
The suffix tree data type, stree, is a
serialized version of the suffix tree struct found in
MUMmer version 3.22, more
specifically in its core suffix tree library libstree implemented by
Stefan Kurtz. For convenience, libstree is provided in PostBio but you
are strongly advised to refer to the links for more details and
references.
Values of stree are specified from a lower-case
extended nucleotide sequence, that is,
stree '[a | c | g | t | s | w | r | y | m | k | b | d | h | v | n]+'
Suffix trees can also be casted from and to text. If you
decide to store your reference sequences as strees and plan on
retrieving many subsequences then
stree_subtext(stree, start, len) is more
memory efficient than casting to text and running
substr(reftext, start, len).
Maximal matches between a stree and a query sequence can be
found with stree_maxmatch, which uses a composite type,
streematch, to store each result. If only the number of matches
are wanted stree_maxmatchcount can be used.
CREATE TYPE streematch AS (matchlen integer, refstart integer, querystart integer);
stree_maxmatch(stree, query, uniqueinref, minmatchlen)
SRF that returns a
maximal match (of type streematch) per row between the reference
sequence in stree and the query sequence.
uniqueinref is a boolean for whether the matches in the reference
are unique; true is equivalent to specifying
"-mumcand" in MUMmer (the default behavior) while
false is equivalent to "-maxmatch".
minmatchlen is the minimum length of the match, being
equivalent to the "-l" input in MUMmer; if
minmatchlen = 0 then the query sequence length is taken as the
actual minimum length.
stree_maxmatchcount(stree, query, uniqueinref, minmatchlen, anymatch)
Returns the number of maximal matches in stree from
query with length at least minmatchlen. The first
four arguments have the same interpretation as in maxmatch.
anymatch is a boolean for whether maxmatchcount
should only check if there is any match returning 0 or 1, or if the actual
number of matches should be returned.
Examples
A simple example would be:
# select stree_maxmatch('acgtacgt', 'cgta', false, 2);
stree_maxmatch
----------------
(4,2,1)
(3,6,1)
(2 rows)
If uniqueinref were true the result would be only
the first row. In the next example we have two tables, one with sequences and
other with motifs, and we wish to count the number of occurrences of each
motif in each sequence:
SELECT s.id, motif, stree_maxmatchcount(seq::stree, motif, false, 0) AS nmatches FROM sequences s, motifs;
Compressed suffix arrays
Efficient exact short sequence searchSimilarly to a suffix tree, a compressed suffix array aims to provide fast searching of exact matches in a reference sequence from a query sequence.
The compressed suffix array data type in PostBio, fmindex, is
actually a FM index based on the Burrows-Wheeler transform of a
sequence and due to Ferragina and Manzini. The latest version of this index,
FMindexV2,
was used to implement fmindex; please refer to the link for more
details.
FM indexes can be explicitly created by fmindex_build, or
simply casted from a text type. The following functions operate
on a FM index data type:
fmindex_extract(fmindex, start, len)
Returns the substring of the sequence indexed by fmindex from
position start and length len.
fmindex_length(fmindex)
Returns the length of the sequence indexed by fmindex.
fmindex_count(fmindex, pattern)
Returns the number of occurrences of pattern in the sequence
indexed by fmindex.
fmindex_locate(fmindex, pattern)
SRF that returns the
locations of pattern in the sequence indexed by
fmindex, if any match occurs.
Examples
Suppose that the sequences from MUMmer's tutorial are stored in the following table:
CREATE TABLE pylori (id integer, suffix stree, sarray fmindex);
INSERT INTO pylori SELECT id, stree_build(seq), fmindex_build(seq)
FROM (SELECT 1 AS id, fasta('H_pylori26695_Eslice.fasta') AS seq) AS q;
INSERT INTO pylori SELECT id, stree_build(seq), fmindex_build(seq)
FROM (SELECT 2 AS id, fasta('H_pyloriJ99_Eslice.fasta') AS seq) AS q;
We could use the suffix tree to find all occurrences of, say,
accgtc,
# select id, stree_maxmatch(suffix, 'accgtc', false, 0) from pylori; id | maxmatch ----+-------------- 1 | (6,119609,1) 1 | (6,185413,1) 2 | (6,85382,1) (3 rows)
but a FM index is more efficient in this case; let us start with counting the number of occurrences,
# select id, fmindex_count(sarray, 'accgtc') from pylori; id | fmindex_count ----+--------------- 1 | 2 2 | 1 (2 rows)
and then listing the locations
# select id, fmindex_locate(sarray, 'accgtc') from pylori; id | fmindex_locate ----+---------------- 1 | 185413 1 | 119609 2 | 85382 (3 rows)
Finally, let us check the results:
# select id, fmindex_extract(sarray, fmindex_locate(sarray, 'accgtc'), 6) from pylori; id | fmindex_extract ----+----------------- 1 | accgtc 1 | accgtc 2 | accgtc (3 rows)
Utils
Utilitary routinesFor now PostBio offers only one utilitary routine:
revcomp(seq [, complement])
Returns the reverse complement of sequence seq if
complement is true (the default value) and the
simple reverse otherwise. seq should contain only extended
nucleotides, that is, only the same letters from the alphabet defining
stree.
Installation
How to obtain and install PostBioPostBio is distributed as a source package and can be obtained at
PgFoundry.
After downloading and unpacking, our first task is to compile
libstree:
$ cd stree && make && make clean
The source package is then installed like any regular PostgreSQL module, that is, just run:
$ make && sudo make install $ psql -f postbio.sql <database>
License
Copyright (c) 2008 Luis Carvalho
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.