RUM: improved inverted index for full-text search based on GIN index
RUM is an extension which adds a RUM index to Postgresql.
RUM index is based on GIN that stores additional per-entry information in a posting tree. For example, positional information of lexemes or timestamps. In comparison to GIN it can use this information to make faster index-only scans for:
- Phrase search
- Text search with ranking by text distance operator
- Text
SELECT
s with ordering by some non-indexed additional column e.g. by timestamp.
RUM works best in scenarios when the possible keys are highly repeatable. I.e. all texts are composed of a limited amount of words, so per-lexeme indexing gives significant speed-up in searching texts containing word combinations or phrases.
Main operators for ordering are:
tsvector <=>
tsquery | float4 | Distance between tsvector and tsquery.
value <=>
value | float8 | Distance between two values.
Where value is timestamp, timestamptz, int2, int4, int8, float4, float8, money and oid
Usage
Enable the extension
You can get started with rum by enabling the extension in your Supabase dashboard.
- Go to the Database page in the Dashboard.
- Click on Extensions in the sidebar.
- Search for "rum" and enable the extension.
Syntax
For type: tsvector
To understand the following you may need first to see Official PostgreSQL documentation on text search
rum_tsvector_ops
And we can execute tsvector
selects with ordering by text distance operator:
rum_tsvector_addon_ops
Now we can execute the selects with ordering distance operator on attached column:
For type: anyarray
rum_anyarray_ops
This operator class stores anyarray elements with length of the array. It supports operators &&
, @>
, <@
, =
, %
operators. It also supports ordering by <=>
operator.
Now we can execute the query using index scan:
rum_anyarray_addon_ops
The does the same with anyarray index as rum_tsvector_addon_ops
i.e. allows to order select results using distance
operator by attached column.
Limitations
RUM
has slower build and insert times than GIN
due to:
- It is bigger due to the additional attributes stored in the index.
- It uses generic WAL records.