Hi all,

Can one estimate how big the spelling database would be, based
on the total index size and/or the size of the postlist table?
From my own experience, the size of the spelling database is
roughly similar to the postlist table. That's with indexing documents
with the TermGenerator, and stemming enabled for most documents.
Does this sound about right?

Fabrice

  • Richard Boulton at Nov 2, 2007 at 2:50 pm

    Fabrice Colin wrote:
    > Hi all,
    >
    > Can one estimate how big the spelling database would be, based
    > on the total index size and/or the size of the postlist table?
    > From my own experience, the size of the spelling database is
    > roughly similar to the postlist table. That's with indexing documents
    > with the TermGenerator, and stemming enabled for most documents.
    > Does this sound about right?

    In my experience, the spelling table is a lot smaller than the postlist.
    For example, I have one database for which the spelling table is 16Mb
    and the postlist is 576Mb.

    However, for a small database, they could well be similar. I'd expect
    the postlist database to grow roughly with the number of documents in
    the database, and the spelling database to grow more like the number of
    terms. Assuming there are lots of documents sharing the same terms, the
    spelling database should therefore grow a lot more slowly.

    What are your current actual sizes?

    --
    Richard
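
Richard's growth argument can be illustrated with a toy model. The sketch below is not Xapian code: it assumes, purely for illustration, that postlist size tracks the number of (term, document) postings while spelling size tracks the number of distinct terms. The `index_stats` helper and the three-word corpus are hypothetical.

```python
from collections import defaultdict


def index_stats(docs):
    """Return (total postings, distinct terms) for a list of tokenised documents."""
    postings = defaultdict(set)  # term -> set of document ids containing it
    for doc_id, terms in enumerate(docs):
        for term in terms:
            postings[term].add(doc_id)
    total_postings = sum(len(ids) for ids in postings.values())
    return total_postings, len(postings)


# Toy corpus: every document reuses the same three-word vocabulary.
small = [["xapian", "index", "search"]] * 10
large = [["xapian", "index", "search"]] * 1000

print(index_stats(small))  # (30, 3)
print(index_stats(large))  # (3000, 3)
```

Under these assumptions the posting count grows a hundredfold while the vocabulary stays at three terms, which is why a spelling table that is larger than the postlist table looks unusual for anything but a small database.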
  • Fabrice Colin at Nov 2, 2007 at 3:37 pm

    On 11/2/07, Richard Boulton wrote:
    > Fabrice Colin wrote:
    > > Hi all,
    > >
    > > Can one estimate how big the spelling database would be, based
    > > on the total index size and/or the size of the postlist table?
    > > From my own experience, the size of the spelling database is
    > > roughly similar to the postlist table. That's with indexing documents
    > > with the TermGenerator, and stemming enabled for most documents.
    > > Does this sound about right?
    > In my experience, the spelling table is a lot smaller than the postlist.
    > For example, I have one database for which the spelling table is 16Mb
    > and the postlist is 576Mb.
    >
    > However, for a small database, they could well be similar. I'd expect
    > the postlist database to grow roughly with the number of documents in
    > the database, and the spelling database to grow more like the number of
    > terms. Assuming there are lots of documents sharing the same terms, the
    > spelling database should therefore grow a lot more slowly.
    >
    > What are your current actual sizes?

    I have one 192Mb index here with a 51Mb postlist and a 98Mb spelling
    database. Another index is 597Mb big, with a 159Mb postlist.DB and a
    275Mb spelling.DB.
    I also got a report about a 1.3Gb index with a 408Mb postlist.DB and a
    622Mb spelling.DB.

    Should I be worried? ;-)

    Fabrice
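
To compare the two tables directly, a small script can read the file sizes from the database directory. This is a rough sketch that assumes the on-disk layout mentioned in the thread (table files named `postlist.DB` and `spelling.DB`); the helper names `table_sizes` and `spelling_to_postlist_ratio` are made up for illustration.

```python
import os


def table_sizes(db_dir):
    """Map each '*.DB' table file in db_dir to its size in bytes."""
    sizes = {}
    for name in sorted(os.listdir(db_dir)):
        if name.endswith(".DB"):
            sizes[name] = os.path.getsize(os.path.join(db_dir, name))
    return sizes


def spelling_to_postlist_ratio(sizes):
    """Ratio of spelling.DB to postlist.DB, or None if either is missing or empty."""
    postlist = sizes.get("postlist.DB")
    spelling = sizes.get("spelling.DB")
    if not postlist or spelling is None:
        return None
    return spelling / postlist


# The three indexes reported above, in Mb as posted: (postlist, spelling).
reported = [(51, 98), (159, 275), (408, 622)]
ratios = [round(s / p, 2) for p, s in reported]
print(ratios)  # [1.92, 1.73, 1.52]
```

For comparison, Richard's database (16Mb spelling, 576Mb postlist) has a ratio of about 0.03, so the reported indexes are off by well over an order of magnitude.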
  • Olly Betts at Nov 2, 2007 at 6:29 pm

    On Fri, Nov 02, 2007 at 11:37:42PM +0800, Fabrice Colin wrote:
    > I have one 192Mb index here with a 51Mb postlist and a 98Mb spelling
    > database. Another index is 597Mb big, with a 159Mb postlist.DB and a
    > 275Mb spelling.DB.
    > I also got a report about a 1.3Gb index with a 408Mb postlist.DB and a
    > 622Mb spelling.DB.
    >
    > Should I be worried? ;-)

    It's certainly worth investigating.

    1.0.3 fixed a bug which was preventing zlib compression from being used,
    so the spelling table will be smaller if you're using 1.0.3 or later.

    The only other thing which comes to mind is that long terms could be
    bloating up the spelling data. The number of n-grams generated is
    proportional to the term length, and we store the term in a list for
    each n-gram. We do then prefix-compress and then zlib-compress these
    lists of terms but the extra space required for a term is likely to be
    super-linear in the term length.

    Cheers,
    Olly
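
Olly's point about long terms can be made concrete with a simplified model. The sketch below is not Xapian's actual storage format: it assumes trigrams (n=3), stores the full term once under each distinct n-gram, and shows one plausible form of prefix compression on a sorted term list. All function names are hypothetical.

```python
import string


def ngrams(term, n=3):
    """All contiguous n-character substrings of term (n=3 is an assumption)."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def raw_cost(term, n=3):
    """Bytes used if the whole term is stored once under each distinct n-gram."""
    return len(set(ngrams(term, n))) * len(term)


def prefix_compress(sorted_terms):
    """Encode each term as (length of prefix shared with previous term, suffix)."""
    out, prev = [], ""
    for term in sorted_terms:
        shared = 0
        while shared < min(len(prev), len(term)) and prev[shared] == term[shared]:
            shared += 1
        out.append((shared, term[shared:]))
        prev = term
    return out


# Uncompressed cost for terms of increasing length (all characters distinct).
for length in (5, 10, 20):
    print(length, raw_cost(string.ascii_lowercase[:length]))
# 5 15
# 10 80
# 20 360
```

Doubling the term length from 10 to 20 raises the uncompressed cost from 80 to 360 bytes in this model, which is the super-linear behaviour described above. Prefix and zlib compression reduce the absolute numbers, but long terms such as URLs or encoded data would still be the first thing to look for.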
  • Fabrice Colin at Nov 4, 2007 at 6:24 am
    Richard, Olly, thanks for your reply.

    On 11/3/07, Olly Betts wrote:
    > On Fri, Nov 02, 2007 at 11:37:42PM +0800, Fabrice Colin wrote:
    > > I have one 192Mb index here with a 51Mb postlist and a 98Mb spelling
    > > database. Another index is 597Mb big, with a 159Mb postlist.DB and a
    > > 275Mb spelling.DB.
    > > I also got a report about a 1.3Gb index with a 408Mb postlist.DB and a
    > > 622Mb spelling.DB.
    > >
    > > Should I be worried? ;-)
    > It's certainly worth investigating.
    >
    > 1.0.3 fixed a bug which was preventing zlib compression from being used,
    > so the spelling table will be smaller if you're using 1.0.3 or later.

    The figures I gave were with 1.0.4, unless I am mistaken.

    > The only other thing which comes to mind is that long terms could be
    > bloating up the spelling data. The number of n-grams generated is
    > proportional to the term length, and we store the term in a list for
    > each n-gram. We do then prefix-compress and then zlib-compress these
    > lists of terms but the extra space required for a term is likely to be
    > super-linear in the term length.

    Do prefixed terms contribute to the spelling database too? For instance,
    terms like Tmime_type, Uuri and XDIR:/directory/name etc.

    What should I try to diagnose this?

    Fabrice
  • Olly Betts at Nov 4, 2007 at 6:35 am

    On Sun, Nov 04, 2007 at 02:24:52PM +0800, Fabrice Colin wrote:
    > Do prefixed terms contribute to the spelling database too? For instance,
    > terms like Tmime_type, Uuri and XDIR:/directory/name etc.

    Only if you pass them to WritableDatabase::add_spelling().

    And the TermGenerator class only adds spelling entries for unprefixed
    terms (at least at the moment).

    > What should I try to diagnose this?

    If you run xapian-check on just the spelling table in question, it'll
    tell you some statistics about the table:

    xapian-check /path/to/spelling.DB

    It would also be interesting to see how much smaller it gets when run
    through xapian-compact.

    Cheers,
    Olly
  • Fabrice Colin at Nov 4, 2007 at 10:32 am

    On 11/4/07, Olly Betts wrote:
    > On Sun, Nov 04, 2007 at 02:24:52PM +0800, Fabrice Colin wrote:
    > > Do prefixed terms contribute to the spelling database too? For instance,
    > > terms like Tmime_type, Uuri and XDIR:/directory/name etc.
    > Only if you pass them to WritableDatabase::add_spelling().
    >
    > And the TermGenerator class only adds spelling entries for unprefixed
    > terms (at least at the moment).

    Okay. I use the TermGenerator.

    > > What should I try to diagnose this?
    > If you run xapian-check on just the spelling table in question, it'll
    > tell you some statistics about the table:
    >
    > xapian-check /path/to/spelling.DB
    >
    > It would also be interesting to see how much smaller it gets when run
    > through xapian-compact.

    Hmm, xapian-compact complains with:
    postlist ...xapian-compact: DatabaseCorruptError: Bad postlist key

    I ran xapian-check on postlist.DB and got this:

    baseB blocksize=8K items=�5148 lastblock=24529 revision=39 levels=2 root=6
    B-tree checked okay
    Extra bytes after key for first chunk of posting list for term `'

    The last line is printed in a seemingly infinite loop.
    xapian-check on spelling.DB prints this:

    baseA blocksize=8K items=603348 lastblock=30861 revision=39 levels=2 root=437
    B-tree checked okay
    spelling table: Don't know how to check structure

    On a more positive note, it looks like the 1.3Gb index I mentioned previously
    was built with 1.0.3, and that the compression bug was responsible...
    After rebuilding from scratch with 1.0.4, it shrank to 464Mb. The
    spelling and postlist tables are 168Mb and 155Mb respectively.

    Fabrice
  • Olly Betts at Nov 4, 2007 at 5:45 pm

    On Sun, Nov 04, 2007 at 06:32:28PM +0800, Fabrice Colin wrote:
    > Hmm, xapian-compact complains with:
    > postlist ...xapian-compact: DatabaseCorruptError: Bad postlist key
    > I ran xapian-check on postlist.DB and got this:
    > baseB blocksize=8K items=�5148 lastblock=24529 revision=39 levels=2 root=6
    > B-tree checked okay
    > Extra bytes after key for first chunk of posting list for term `'

    There's a (post-1.0.4) fix in SVN for failing to handle user meta-data
    (which is stored in the postlist table).

    Looks like xapian-compact needs updating for user meta-data too - I'll
    take a look.

    > The last line is printed in a seemingly infinite loop.

    That's also fixed in SVN.

    Cheers,
    Olly
  • Fabrice Colin at Nov 8, 2007 at 11:11 am
    Hi all,

    On Nov 5, 2007 1:45 AM, Olly Betts wrote:
    > There's a (post-1.0.4) fix in SVN for failing to handle user meta-data
    > (which is stored in the postlist table).
    >
    > Looks like xapian-compact needs updating for user meta-data too - I'll
    > take a look.
    > > The last line is printed in a seemingly infinite loop.
    > That's also fixed in SVN.

    I tried a source snapshot (SVN 9657) and got the following:

    $ xapian-check db_with_spelling/spelling.DB
    baseA blocksize=8K items=603348 lastblock=30861 revision=39 levels=2 root=437
    B-tree checked okay
    spelling table: Don't know how to check structure

    No errors found

    $ xapian-check db_with_spelling/postlist.DB
    baseB blocksize=8K items=�5148 lastblock=24529 revision=39 levels=2 root=6
    B-tree checked okay
    Extra bytes after key for first chunk of posting list for term `'
    Extra bytes after key for first chunk of posting list for term `'
    postlist table errors found: 2

    $ xapian-compact db_with_spelling db_with_spelling_compact
    postlist ...xapian-compact: DatabaseCorruptError: Bad postlist key

    User meta-data is indeed used in this database.

    Fabrice

Discussion Overview
group: xapian-discuss
categories: xapian
posted: Nov 2, '07 at 2:31p
active: Nov 8, '07 at 11:11a
posts: 9
users: 3
website: xapian.org
irc: #xapian
