Hello,
I'm using a GIN index on a text column of a big table. I use it to rank
the rows, but I also need to get the term positions for each document of a
subset of documents, for one or more terms. I suppose these positions are stored
in the index, since to_tsvector shows them as 'lexeme':{positions}.

I've searched and asked on the general PostgreSQL mailing list, and I gather
there is no simple way to get these term positions.

For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN):
'I get lexemes and I get term positions.'
'Did you get the positions ?'

I'd need a function like this:
select id_doc, term_positions(text, 'get') as positions from docs;
 id_doc | positions
--------+-----------
      1 | {2,6}
      2 | {3}

I'd like to add this function to my database, for experimental purposes.
I had a look at the source code but didn't find any code example using the GIN index;
I cannot figure out where the GIN index is read as a tsvector,
or where the '@@' operator gets the matching tsvectors for the terms of the tsquery.

Any help about where to start reading would be very welcome :)

Regards,
Yoann Moreau

  • Kevin Grittner at Nov 3, 2011 at 5:29 pm

    Yoann Moreau wrote:

    > I'd need a function like this :
    > select term_positions(text, 'get') from docs;
    > id_doc | positions
    > --------+-----------
    > 1 | {2,6}
    > 2 | {3}
    >
    > I'd like to add this function in my database, for experimental
    > purpose. I got a look at the source code but didn't find some code
    > example using the GIN index ;
    > I can not figure out where the GIN index is read as a tsvector
    > or where the '@@' operator gets the matching tsvectors for the
    > terms of the tsquery.
    >
    > Any help about where to start reading would be very welcome :)

    I'm not really clear on what you want to read about. Do you need
    help creating your own function on the fly, or with how to access
    the information to write the function?

    If the former, these links might help:

    http://www.postgresql.org/docs/9.1/interactive/extend.html

    http://www.postgresql.org/docs/9.1/interactive/sql-createfunction.html

    If the latter, have you looked at this file?:

    src/backend/utils/adt/tsrank.c

    Or was it something else that I'm missing?

    -Kevin
  • Florian Pflug at Nov 3, 2011 at 6:19 pm

    On Nov3, 2011, at 16:52 , Yoann Moreau wrote:
    > I'm using a GIN index for a text column on a big table. I use it to rank
    > the rows, but I also need to get the term positions for each document of a
    > subset of documents for one or more terms. I suppose these positions are stored
    > in the index as the to_tsvector shows them : 'lexeme':{positions}

    There's a difference between values of type tsvector, and what GIN indices
    on columns or expressions of type tsvector store.

    Values of type tsvector, of course, store weights and positions for each lexeme.

    But GIN indices store only the bare lexemes, without weights and positions. In
    general, GIN indices work by extracting "elements" from the values to be indexed
    and storing these "elements" in a btree, together with pointers to the rows
    containing the indexed values.

    Thus, if you create a functional index on the result of to_tsvector, i.e.
    if you do
    CREATE INDEX gin_idx ON docs USING gin (to_tsvector('english', text))
    (an index expression needs the two-argument form with an explicit configuration;
    'english' is only an example), then the weights and positions aren't stored
    anywhere - they exist only in the transient, in-memory tsvector value that
    to_tsvector returns, not in the on-disk GIN index gin_idx.

    For the positions and weights to be stored, you need to store the result of
    to_tsvector in a column of type tsvector, say text_tsvector, and create the
    index as
    CREATE INDEX gin_idx ON docs USING gin (text_tsvector)

    The GIN index gin_idx still won't store weights and positions, but the column
    text_tsvector will.
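
    A minimal sketch of that setup, assuming the docs table has the id_doc and text
    columns from the example and that the 'english' configuration is appropriate
    (the trigger name and the configuration are assumptions, not something stated
    in this thread):

    ```sql
    -- Assumed schema: docs(id_doc integer, text text).
    -- Store the parsed document, positions included, in its own column.
    ALTER TABLE docs ADD COLUMN text_tsvector tsvector;
    UPDATE docs SET text_tsvector = to_tsvector('english', text);

    -- The index is used to find matching rows; it holds only the lexemes.
    CREATE INDEX gin_idx ON docs USING gin (text_tsvector);

    -- Keep the column in sync with later changes to the text column.
    CREATE TRIGGER docs_text_tsvector_update
        BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW
        EXECUTE PROCEDURE tsvector_update_trigger('text_tsvector', 'pg_catalog.english', 'text');
    ```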

    > For example, for 2 rows of a 'docs' table with a text column 'text' (indexed with GIN) :
    > 'I get lexemes and I get term positions.'
    > 'Did you get the positions ?'
    >
    > I'd need a function like this :
    > select term_positions(text, 'get') from docs;
    > id_doc | positions
    > --------+-----------
    > 1 | {2,6}
    > 2 | {3}

    As I pointed out above, you'll first need to make sure to store the result of
    to_tsvector in a column. Then, what you need seems to be a function that
    takes a tsvector value and returns the contained lexemes as individual rows.

    Postgres doesn't seem to contain such a function currently (don't believe that,
    though - go and recheck the documentation. I don't know all the thousands of built-in
    functions by heart). But it's easy to add one. You could either use PL/pgSQL
    to parse the tsvector's textual representation, or write a C function. If you
    go the PL/pgSQL route, regexp_split_to_table() might come in handy.
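
    One possible shape for the PL/pgSQL route, as a rough sketch only. The name
    term_positions comes from the original question. The function assumes that
    lexemes contain no spaces or quote characters (so that splitting the textual
    form on spaces is safe), and that the term passed in is already a normalized
    lexeme, exactly as it appears in the tsvector.

    ```sql
    CREATE OR REPLACE FUNCTION term_positions(tsv tsvector, term text)
    RETURNS integer[] AS $$
    DECLARE
        -- An entry in the textual form looks like 'get':2,6 so match on 'get':
        prefix text := '''' || term || ''':';
        entry  text;
    BEGIN
        -- Walk the entries one by one; this is linear in the number of lexemes.
        FOR entry IN SELECT regexp_split_to_table(tsv::text, ' ') LOOP
            IF substr(entry, 1, length(prefix)) = prefix THEN
                -- Drop weight labels (A-D), then turn "2,6" into {2,6}.
                RETURN string_to_array(
                           regexp_replace(substr(entry, length(prefix) + 1),
                                          '[A-D]', '', 'g'),
                           ',')::integer[];
            END IF;
        END LOOP;
        -- Lexeme not present, or stored without positions.
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql IMMUTABLE STRICT;
    ```

    It would be called on the stored column rather than on the raw text, for
    example: select id_doc, term_positions(text_tsvector, 'get') from docs;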

    > I'd like to add this function in my database, for experimental purpose.
    > I got a look at the source code but didn't find some code example using the GIN index ;
    > I can not figure out where the GIN index is read as a tsvector
    > or where the '@@' operator gets the matching tsvectors for the terms of the tsquery.

    The basic flow of information is:

    to_tsvector takes a string, parses it, applies various dictionaries according
    to the text search configuration, and finally returns a value of type tsvector.
    See the files named tsvector* for the implementation of that process, and for
    the implementation of the various support functions which work on values of type
    tsvector.

    The GIN index machinery then calls the tsvector's extractValue() function to extract
    the "elements" mentioned above from the tsvector value. That function is called
    gin_extract_tsvector() and lives in tsginidx.c. The extracted "elements" are
    then added to the GIN index's internal btree.

    During query execution, if postgres sees that the operator tsvector @@ tsquery
    is used, and that the left argument is a GIN-indexed column, it will use the
    extractQuery() and consistent() functions to quickly find matching rows by
    scanning the internal btree index. In the case of tsvector and tsquery, the
    implementations of these functions are gin_extract_tsquery() and
    gin_tsquery_consistent(), also found in tsginidx.c.

    I suggest you read http://www.postgresql.org/docs/9.1/interactive/gin.html,
    it explains all of this in (much) more detail.
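
    Seen from the SQL side, the flow above can be exercised roughly like this. The
    sketch assumes a tsvector column named text_tsvector with a GIN index on it;
    the 'english' configuration and the query terms are example values. Whether
    the planner actually chooses the index depends on table size and statistics,
    so EXPLAIN is only a way to check.

    ```sql
    -- Shows whether the @@ condition is answered through the GIN index.
    EXPLAIN
    SELECT id_doc
    FROM docs
    WHERE text_tsvector @@ to_tsquery('english', 'get & position');

    -- Matching can use the index; ts_rank() itself works from the tsvector
    -- value of each matching row, which is where the positions live.
    SELECT id_doc, ts_rank(text_tsvector, q) AS rank
    FROM docs, to_tsquery('english', 'get & position') AS q
    WHERE text_tsvector @@ q
    ORDER BY rank DESC;
    ```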

    best regards,
    Florian Pflug
  • Yoann Moreau at Nov 4, 2011 at 10:15 am

    On 03/11/11 19:19, Florian Pflug wrote:
    > There's a difference between values of type tsvector, and what GIN indices
    > on columns or expressions of type tsvector store.

    I was wondering what the point of storing the tsvector in the table was;
    I now understand. So I should use the GIN index to rank my
    documents, and work on the stored tsvectors for positions.

    > As I pointed out above, you'll first need to make sure to store the result of
    > to_tsvector in a columns. Then, what you need seems to be a functions that
    > takes a tsvector value and returns the contained lexems as individual rows.
    >
    > Postgres doesn't seem to contain such a function currently (don't believe that,
    > though - go and recheck the documentation. I don't know all thousands of built-in
    > functions by heart). But it's easy to add one. You could either use PL/pgSQL
    > to parse the tsvector's textual representation, or write a C function. If you
    > go the PL/pgSQL route, regexp_split_to_table() might come in handy.

    This seems easier to program than what I had in mind, so I'm going
    to do that. But I'm wondering about the size of the database with the GIN index
    plus the tsvector column, and about the cost of parsing the whole
    tsvector of each document I need positions from (as I need them for only a
    very few terms).

    Maybe some external fulltext engine managing lexemes and positions would
    be more efficient for my purpose. I'll try some different things and let
    you know the results.

    Thanks all for your help
    Regards,
    Yoann Moreau
  • Florian Pflug at Nov 4, 2011 at 11:16 am

    On Nov4, 2011, at 11:15 , Yoann Moreau wrote:
    > On 03/11/11 19:19, Florian Pflug wrote:
    > > Postgres doesn't seem to contain such a function currently (don't believe that,
    > > though - go and recheck the documentation. I don't know all thousands of built-in
    > > functions by heart). But it's easy to add one. You could either use PL/pgSQL
    > > to parse the tsvector's textual representation, or write a C function. If you
    > > go the PL/pgSQL route, regexp_split_to_table() might come in handy.
    >
    > This seems easier to program than what I was thinking about, I'm going to do that.
    > But I'm wondering about size of database with the GIN index plus the tsvector column,
    > and performance about parsing the whole tsvectors for each document I need positions
    > from (as I need them for a very few terms).

    AFAICS, the internal storage layout of tsvector should allow you to extract an
    individual lexeme's positions quite efficiently (with time complexity O(log N), where
    N is the number of lexemes in the tsvector). Doing so will require you to implement
    your function in C, though - any solution that works from a tsvector's textual
    representation will obviously have time complexity O(N).

    best regards,
    Florian Pflug
  • Yoann Moreau at Nov 4, 2011 at 2:26 pm

    On 04/11/11 12:15, Florian Pflug wrote:
    > AFAICS, the internal storage layout of tsvector should allow you to extract an
    > individual lexem's positions quite efficiently (with time complexity log(N) where
    > N is the number of lexems in the tsvector). Doing so will require you to implement
    > your function in C though - any solution that works from a tsvector's textual
    > representation will obviously have time complexity N.
    >
    > best regards,
    > Florian Pflug

    I'll do a pl/pgsql function first, I need to test it with other parts of
    the project. But I will look for more efficient algorithms for a C
    function as soon as possible if we still decide to use the postgresql
    fulltext engine.

    Regards,
    Yoann Moreau
  • Tom Lane at Nov 3, 2011 at 7:01 pm

    Yoann Moreau writes:
    > I'm using a GIN index for a text column on a big table. I use it to rank
    > the rows, but I also need to get the term positions for each document of a
    > subset of documents for one or more terms. I suppose these positions are stored
    > in the index as the to_tsvector shows them : 'lexeme':{positions}

    I'm pretty sure that a GIN index on tsvector does *not* store positions
    --- it only knows about the strings. Don't know one way or the other
    about GIST.

    regards, tom lane
  • Alexander Korotkov at Nov 3, 2011 at 7:40 pm

    On Thu, Nov 3, 2011 at 11:01 PM, Tom Lane wrote:

    > Yoann Moreau <yoann.moreau@univ-avignon.fr> writes:
    > > I'm using a GIN index for a text column on a big table. I use it to rank
    > > the rows, but I also need to get the term positions for each document of a
    > > subset of documents for one or more terms. I suppose these positions are stored
    > > in the index as the to_tsvector shows them : 'lexeme':{positions}
    >
    > I'm pretty sure that a GIN index on tsvector does *not* store positions
    > --- it only knows about the strings. Don't know one way or the other
    > about GIST.

    A GiST index doesn't store positions either. See gtsvector_compress. It converts
    the tsvector to an array of crc32 values of the words. If that array is still too
    large, the function converts it to a signature.

    ------
    With best regards,
    Alexander Korotkov.
  • Marcin Mańk at Nov 3, 2011 at 7:34 pm

    On Thu, Nov 3, 2011 at 4:52 PM, Yoann Moreau wrote:
    > I'd need a function like this :
    > select term_positions(text, 'get') from docs;
    > id_doc | positions
    > --------+-----------
    > 1 |     {2,6}
    > 2 |       {3}

    Check this out:
    http://www.postgresql.org/docs/current/static/textsearch-debugging.html
    ts_debug does what you want, and more. Look at its source - it's a
    plain SQL function, so you can build something based on it.
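
    For reference, a call on one of the example sentences might look like the
    sketch below ('english' is an assumed configuration). ts_debug returns one row
    per token in document order, whitespace tokens included, and has no position
    column of its own, so positions would still have to be derived from its output.

    ```sql
    -- alias is the token type, token the raw text, lexemes the normalized
    -- form(s) produced by the dictionaries (empty for stop words).
    SELECT alias, token, lexemes
    FROM ts_debug('english', 'Did you get the positions ?');
    ```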

    Greetings
    Marcin Mańk

Discussion Overview
group: pgsql-hackers @
categories: postgresql
posted: Nov 3, '11 at 3:52p
active: Nov 4, '11 at 2:26p
posts: 9
users: 6
website: postgresql.org...
irc: #postgresql
