Hello.

TSearch2 allows to search a table of tsvectors by a single tsquery.
I need to solve the reverse problem.

*I have a large table of tsquery. I need to find all tsqueries in that table
that match a single document tsvector:
*
CREATE TABLE "test"."test_tsq" (
"id" SERIAL,
"q" TSQUERY NOT NULL,
CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id")
);

insert into test.test_tsq(q)
select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g;

explain analyze
select * from test.test_tsq
where to_tsvector('400000x400000') @@ q

This gets a strange explain analyze:

QUERY PLAN
Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual
time=68.698..181.458 rows=1 loops=1)
Filter: ('''400000x400000'':1'::tsvector @@ q)
Total runtime: 181.484 ms

No matter if I use GIST index on test_tsq.q or not: the explain analyze
result is the same.
So, why "rows=800"? The table contains much more rows...

Search Discussions

  • Oleg Bartunov at Sep 12, 2008 at 4:02 pm

    On Fri, 12 Sep 2008, Dmitry Koterov wrote:

    Hello.

    TSearch2 allows to search a table of tsvectors by a single tsquery.
    I need to solve the reverse problem.

    *I have a large table of tsquery. I need to find all tsqueries in that table
    that match a single document tsvector:
    *
    CREATE TABLE "test"."test_tsq" (
    "id" SERIAL,
    "q" TSQUERY NOT NULL,
    CONSTRAINT "test_tsq_pkey" PRIMARY KEY("id")
    );

    insert into test.test_tsq(q)
    select to_tsquery(g || 'x' || g) from generate_series(100000, 900000) as g;

    explain analyze
    select * from test.test_tsq
    where to_tsvector('400000x400000') @@ q
    why do you need tsvector @@ q ? Much better to use tsquery = tsquery

    test=# explain analyze select * from test_tsq where q = '400000x400000'::tsque>
    QUERY PLAN
    -----------------------------------------------------------------------------------------------------------
    Seq Scan on test_tsq (cost=0.00..16667.01 rows=1 width=38) (actual time=129.208..341.111 rows=1 loops=1)
    Filter: (q = '''400000x400000'''::tsquery)
    Total runtime: 341.134 ms
    (3 rows)

    Time: 341.478 ms

    This gets a strange explain analyze:

    QUERY PLAN
    Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual
    time=68.698..181.458 rows=1 loops=1)
    Filter: ('''400000x400000'':1'::tsvector @@ q)
    Total runtime: 181.484 ms

    No matter if I use GIST index on test_tsq.q or not: the explain analyze
    result is the same.
    So, why "rows=800"? The table contains much more rows...
    '800' is the number of estimated rows, which is not good, since you got only
    1 row.

    Regards,
    Oleg
    _____________________________________________________________
    Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
    Sternberg Astronomical Institute, Moscow University, Russia
    Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
    phone: +007(495)939-16-83, +007(495)939-23-83
  • Dmitry Koterov at Sep 13, 2008 at 7:12 pm

    explain analyze
    select * from test.test_tsq
    where to_tsvector('400000x400000') @@ q
    why do you need tsvector @@ q ? Much better to use tsquery = tsquery

    test=# explain analyze select * from test_tsq where q =
    '400000x400000'::tsque>
    QUERY PLAN

    -----------------------------------------------------------------------------------------------------------
    Seq Scan on test_tsq (cost=0.00..16667.01 rows=1 width=38) (actual
    time=129.208..341.111 rows=1 loops=1)
    Filter: (q = '''400000x400000'''::tsquery)
    Total runtime: 341.134 ms
    (3 rows)
    M-mmm... Seems your understood me incorrectly.

    I have to find NOT queries which are exactly equal to another query, BUT
    queries which MATCH the GIVEN document. '400000x400000' was a sample only,
    in real cases it will be 1-2K document.

    Here is a more realistic sample:

    explain analyze
    select * from test.test_tsq
    where to_tsvector('
    Here is a real document text. It may be long, 1-2K.
    In this sample it contains a lexem "400000x400000", so there is a tsquery
    in test_tsq.q which matches this document. I need to find all such queries
    fast.
    Of course, in real cases the document text is unpredictable.
    ') @@ q


    QUERY PLAN
    Seq Scan on test_tsq (cost=0.00..17477.01 rows=800 width=36) (actual
    time=68.698..181.458 rows=1 loops=1)
    Filter: ('''400000x400000'':1'::tsvector @@ q)
    Total runtime: 181.484 ms
    '800' is the number of estimated rows, which is not good, since you got
    only 1 row.
    Why 800? The table contains 800000 rows, and seqscan is used. Does it scan
    the whole table or not? If yes, possibly there is a bug in explain output?
    (No mater if I create GIST index on test_tsq.q or not, the number of rows is
    still 800, so it seems to me that GIST index is not used at all.)

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-general @
categoriespostgresql
postedSep 12, '08 at 3:35p
activeSep 13, '08 at 7:12p
posts3
users2
websitepostgresql.org
irc#postgresql

2 users in discussion

Dmitry Koterov: 2 posts Oleg Bartunov: 1 post

People

Translate

site design / logo © 2022 Grokbase