Hi,

I am conducting a benchmark to compare KVP table vs. hstore and got
bad hstore performance results when the no. of records is greater than
about 500'000.

CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
-- with index on key
CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
-- with GIST index on obj

Does anyone have experience with that?

Yours, Stefan

Search Discussions

  • Craig Ringer at May 16, 2011 at 12:25 am

    On 14/05/11 18:10, Stefan Keller wrote:
    Hi,

    I am conducting a benchmark to compare KVP table vs. hstore and got
    bad hstore performance results when the no. of records is greater than
    about 500'000.

    CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
    -- with index on key
    CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
    -- with GIST index on obj

    Does anyone have experience with that?
    What are your queries?

    What does EXPLAIN ANALYZE report on those queries?

    Did you ANALYZE your tables after populating them?

    --
    Craig Ringer
  • M1ott at May 17, 2011 at 3:22 pm
    Hi Merlin

    The analyze command gave the following result:

    On the KVP table:
    Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180) (actual time=0.037..0.038 rows=1 loops=1)
    Index Cond: (bench_id = '200000_200000'::text)
    Total runtime: 0.057 ms

    And on the Hstore table:
    Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 width=265) (actual time=145.040..256.173 rows=1 loops=1)
    Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
    -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) (actual time=114.748..114.748 rows=30605 loops=1)
    Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
    Total runtime: 256.211 ms

    For Hstore I'm using a GIST index.

    Table analysis returned no message.

    Michel


    Von: Merlin Moncure mmoncure@gmail.com
    An: Stefan Keller <sfkeller@gmail.com>
    cc: pgsql-performance@postgresql.org
    Datum: 16. Mai 2011 15:47
    Betreff: Re: [PERFORM] KVP table vs. hstore - hstore performance (Was:
    Postgres NoSQL emulation)

    Merlin Moncure

    hstore is not really designed for large-ish sets like that.

    merlin

    2011/5/16 Stefan Keller <sfkeller@gmail.com>:
    Hoi Michel

    Hast du die EXPLAIN ANALYZE information?

    LG, Stefan


    ---------- Forwarded message ----------
    From: Craig Ringer <craig@postnewspapers.com.au>
    Date: 2011/5/16
    Subject: Re: [PERFORM] KVP table vs. hstore - hstore performance (Was:
    Postgres NoSQL emulation)
    To: Stefan Keller <sfkeller@gmail.com>
    Cc: pgsql-performance@postgresql.org

    On 14/05/11 18:10, Stefan Keller wrote:
    Hi,

    I am conducting a benchmark to compare KVP table vs. hstore and got
    bad hstore performance results when the no. of records is greater than
    about 500'000.

    CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
    -- with index on key
    CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
    -- with GIST index on obj

    Does anyone have experience with that?
    What are your queries?

    What does EXPLAIN ANALYZE report on those queries?

    Did you ANALYZE your tables after populating them?

    --
    Craig Ringer
  • Pierre C at May 23, 2011 at 11:00 am

    Hi Merlin

    The analyze command gave the following result:

    On the KVP table:
    Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180)
    (actual time=0.037..0.038 rows=1 loops=1)
    Index Cond: (bench_id = '200000_200000'::text)
    Total runtime: 0.057 ms

    And on the Hstore table:
    Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000
    width=265) (actual time=145.040..256.173 rows=1 loops=1)
    Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
    -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0)
    (actual time=114.748..114.748 rows=30605 loops=1)
    Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
    Total runtime: 256.211 ms

    For Hstore I'm using a GIST index.
    Try to create a btree index on "(bench_hstore->bench_id) WHERE
    (bench_hstore->bench_id) IS NOT NULL".
  • Stefan Keller at May 24, 2011 at 11:45 pm
    Salut Pierre

    You wrote
    Try to create a btree index on "(bench_hstore->bench_id) WHERE
    (bench_hstore->bench_id) IS NOT NULL".
    What do you mean exactly?
    => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
    ??? IS NOT NULL;

    My table's def is:
    CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
    So I'm doing something like:
    CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);

    Stefan


    2011/5/23 Pierre C <lists@peufeu.com>:
    Hi Merlin

    The analyze command gave the following result:

    On the KVP table:
    Index Scan using kvpidx on bench_kvp (cost=0.00..8.53 rows=1 width=180)
    (actual time=0.037..0.038 rows=1 loops=1)
    Index Cond: (bench_id = '200000_200000'::text)
    Total runtime: 0.057 ms

    And on the Hstore table:
    Bitmap Heap Scan on bench_hstore (cost=32.22..3507.54 rows=1000 width=265)
    (actual time=145.040..256.173 rows=1 loops=1)
    Recheck Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
    -> Bitmap Index Scan on hidx (cost=0.00..31.97 rows=1000 width=0) (actual
    time=114.748..114.748 rows=30605 loops=1)
    Index Cond: (bench_hstore @> '"bench_id"=>"200000_200000"'::hstore)
    Total runtime: 256.211 ms

    For Hstore I'm using a GIST index.
    Try to create a btree index on "(bench_hstore->bench_id) WHERE
    (bench_hstore->bench_id) IS NOT NULL".
  • Pierre C at May 25, 2011 at 6:00 pm

    You wrote
    Try to create a btree index on "(bench_hstore->bench_id) WHERE
    (bench_hstore->bench_id) IS NOT NULL".
    What do you mean exactly?
    => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
    ??? IS NOT NULL;

    My table's def is:
    CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
    So I'm doing something like:
    CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);
    Hello ;

    I meant a plain old btree index like this :

    CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE
    (kvps->'yourkeyname') IS NOT NULL;

    The idea is that :

    - The reason to use hstore is to have an arbitrary number of keys and use
    the keys you want, not have a fixed set of columns like in a table
    - Therefore, no hstore key is present in all rows (if it was, you'd make
    it a table column, and maybe index it)
    - You'll probably only want to index some of the keys/values (avoiding to
    index values that contain serialized data or other stuff that never
    appears in a WHERE clause)

    So, for each key that corresponds to a searchable attribute, I'd use a
    conditional index on that key, which only indexes the relevant rows. For
    keys that never appear in a WHERE, no index is needed.

    gist is good if you want the intersecton of a hstore with another one (for
    instance), btree is good if you want simple search or range search.
  • Merlin Moncure at May 25, 2011 at 10:09 pm

    On Wed, May 25, 2011 at 11:59 AM, Pierre C wrote:
    You wrote
    Try to create a btree index on "(bench_hstore->bench_id) WHERE
    (bench_hstore->bench_id) IS NOT NULL".
    What  do you mean exactly?
    => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
    ??? IS NOT NULL;

    My table's def is:
    CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
    So I'm doing something like:
    CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);
    Hello ;

    I meant a plain old btree index like this :

    CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE
    (kvps->'yourkeyname') IS NOT NULL;

    The idea is that :

    - The reason to use hstore is to have an arbitrary number of keys and use
    the keys you want, not have a fixed set of columns like in a table
    - Therefore, no hstore key is present in all rows (if it was, you'd make
    it a table column, and maybe index it)
    - You'll probably only want to index some of the keys/values (avoiding to
    index values that contain serialized data or other stuff that never
    appears in a WHERE clause)

    So, for each key that corresponds to a searchable attribute, I'd use a
    conditional index on that key, which only indexes the relevant rows. For
    keys that never appear in a WHERE, no index is needed.

    gist is good if you want the intersecton of a hstore with another one (for
    instance), btree is good if you want simple search or range search.
    +1 on this approach. it works really well (unless, of course, you
    need 50 indexes...)

    merlin
  • Stefan Keller at May 25, 2011 at 11:03 pm
    Hi all

    Thank you to all who answered: That worked:

    CREATE INDEX planet_osm_point_tags_amenity
    ON planet_osm_point ((tags->'amenity'))
    WHERE (tags->'amenity') IS NOT NULL;

    My problem is, that in fact I don't know which tag to index since I'm
    running a web admin application where users can enter arbitrary
    queries.

    Yours, Stefan

    2011/5/25 Pierre C <lists@peufeu.com>:
    You wrote
    Try to create a btree index on "(bench_hstore->bench_id) WHERE
    (bench_hstore->bench_id) IS NOT NULL".
    What  do you mean exactly?
    => CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps) WHERE
    ??? IS NOT NULL;

    My table's def is:
    CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
    So I'm doing something like:
    CREATE INDEX myhstore_kps_gin_idx ON myhstore USING gin(kvps);
    Hello ;

    I meant a plain old btree index like this :

    CREATE INDEX foo ON myhstore((kvps->'yourkeyname')) WHERE
    (kvps->'yourkeyname') IS NOT NULL;

    The idea is that :

    - The reason to use hstore is to have an arbitrary number of keys and use
    the keys you want, not have a fixed set of columns like in a table
    - Therefore, no hstore key is present in all rows (if it was, you'd make
    it a table column, and maybe index it)
    - You'll probably only want to index some of the keys/values (avoiding to
    index values that contain serialized data or other stuff that never
    appears in a WHERE clause)

    So, for each key that corresponds to a searchable attribute, I'd use a
    conditional index on that key, which only indexes the relevant rows. For
    keys that never appear in a WHERE, no index is needed.

    gist is good if you want the intersecton of a hstore with another one (for
    instance), btree is good if you want simple search or range search.
  • Pierre C at May 26, 2011 at 9:31 am

    My problem is, that in fact I don't know which tag to index since I'm
    running a web admin application where users can enter arbitrary
    queries.
    For a tag cloud, try this :

    - table tags ( tag_id, tag_name )
    - table articles ( article_id )
    - table articles_to_tags( article_id, tag_id )

    now this is the classical approach, which doesn't work so well when you
    want to get an article that has several tags (tag intersection).

    So, materialize the list of tag_ids for each article in an INTEGER[] array
    in the articles table, kept up to date with triggers.

    Create a gist index on that, and use indexed array vs array operators.
  • Robert Haas at May 23, 2011 at 5:19 pm

    On Tue, May 17, 2011 at 11:10 AM, wrote:
    For Hstore I'm using a GIST index.
    I would have thought that GIN would be a better choice for this workload.

    --
    Robert Haas
    EnterpriseDB: http://www.enterprisedb.com
    The Enterprise PostgreSQL Company
  • Merlin Moncure at May 16, 2011 at 1:47 pm

    On Sat, May 14, 2011 at 5:10 AM, Stefan Keller wrote:
    Hi,

    I am conducting a benchmark to compare KVP table vs. hstore and got
    bad hstore performance results when the no. of records is greater than
    about 500'000.

    CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
    -- with index on key
    CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
    -- with GIST index on obj

    Does anyone have experience with that?
    hstore is not really designed for large-ish sets like that.

    merlin
  • Jim Nasby at May 17, 2011 at 7:31 pm

    On May 16, 2011, at 8:47 AM, Merlin Moncure wrote:
    On Sat, May 14, 2011 at 5:10 AM, Stefan Keller wrote:
    Hi,

    I am conducting a benchmark to compare KVP table vs. hstore and got
    bad hstore performance results when the no. of records is greater than
    about 500'000.

    CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
    -- with index on key
    CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
    -- with GIST index on obj

    Does anyone have experience with that?
    hstore is not really designed for large-ish sets like that.
    And KVP is? ;)

    IIRC hstore ends up just storing everything as text, with pointers to know where things start and end. There's no real indexing inside hstore, so basically the only thing it can do is scan the entire hstore.

    That said, I would strongly reconsider using KVP for anything except the most trivial of data sets. It is *extremely* inefficient. Do you really have absolutely no idea what *any* of your keys will be? Even if you need to support a certain amount of non-deterministic stuff, I would put everything you possibly can into real fields and only use KVP or hstore for things that you really didn't anticipate.

    Keep in mind that for every *value*, your overhead is 24 bytes for the heap header, 2+ varlena bytes in the heap, plus the length of the key. In the index you're looking at 6+ bytes of overhead, 1+ byte for varlena, plus the length of the key. The PK will cost you an additional 16-24 bytes, depending on alignment. So that's a *minimum* of ~50 bytes per value, and realistically the overhead will be closer to 65-70 bytes, *per value*. Unless your values are decent-sized strings, the overhead is going to be many times larger than the actual data!
    --
    Jim C. Nasby, Database Architect jim@nasby.net
    512.569.9461 (cell) http://jim.nasby.net
  • Stefan Keller at May 17, 2011 at 10:07 pm
    Hi Jim

    You actually made me think about the schema Michel and I are using:
    And KVP is? ;)
    CREATE TABLE mykvpstore( id bigint PRIMARY KEY )
    CREATE TABLE kvp ( id bigint REFERENCES mykvpstore(id), key text NOT
    NULL, value text, );
    -- with index on key

    And the table with the associative array type (hstore) is:
    CREATE TABLE myhstore ( id bigint PRIMARY KEY, kvps hstore NOT NULL );
    -- with GIST index on obj

    It seems to me that in the mykvpstore-kvp there is also some overhead.

    And yes, we have no clue what keys to anticipate, except for some
    common ones like 'name': The use case is coming from OpenStreetMap
    (http://wiki.openstreetmap.org/wiki/Database_schema ).

    Yours, Stefan


    2011/5/17 Jim Nasby <jim@nasby.net>:
    On May 16, 2011, at 8:47 AM, Merlin Moncure wrote:
    On Sat, May 14, 2011 at 5:10 AM, Stefan Keller wrote:
    Hi,

    I am conducting a benchmark to compare KVP table vs. hstore and got
    bad hstore performance results when the no. of records is greater than
    about 500'000.

    CREATE TABLE kvp ( id SERIAL PRIMARY KEY, key text NOT NULL, value text );
    -- with index on key
    CREATE TABLE myhstore ( id SERIAL PRIMARY KEY, obj hstore NOT NULL );
    -- with GIST index on obj

    Does anyone have experience with that?
    hstore is not really designed for large-ish sets like that.
    And KVP is? ;)

    IIRC hstore ends up just storing everything as text, with pointers to know where things start and end. There's no real indexing inside hstore, so basically the only thing it can do is scan the entire hstore.

    That said, I would strongly reconsider using KVP for anything except the most trivial of data sets. It is *extremely* inefficient. Do you really have absolutely no idea what *any* of your keys will be? Even if you need to support a certain amount of non-deterministic stuff, I would put everything you possibly can into real fields and only use KVP or hstore for things that you really didn't anticipate.

    Keep in mind that for every *value*, your overhead is 24 bytes for the heap header, 2+ varlena bytes in the heap, plus the length of the key. In the index you're looking at 6+ bytes of overhead, 1+ byte for varlena, plus the length of the key. The PK will cost you an additional 16-24 bytes, depending on alignment. So that's a *minimum* of ~50 bytes per value, and realistically the overhead will be closer to 65-70 bytes, *per value*. Unless your values are decent-sized strings, the overhead is going to be many times larger than the actual data!
    --
    Jim C. Nasby, Database Architect                   jim@nasby.net
    512.569.9461 (cell)                         http://jim.nasby.net

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
grouppgsql-performance @
categoriespostgresql
postedMay 14, '11 at 10:10a
activeMay 26, '11 at 9:31a
posts13
users7
websitepostgresql.org
irc#postgresql

People

Translate

site design / logo © 2021 Grokbase