scriptindex is using well over 200MB of RAM to build our search index.

Is this normal?

If it is normal, is there anything I can do to reduce the memory usage?
This is quickly going to become a problem from a resources standpoint.

FYI, we are indexing around 20k items, 18k of which are active. The
resulting data file that we build for scriptindex to process ends up at
about 12MB.

The sizes of the Xapian database files are:

position: 147MB
postlist: 88MB
record: 7MB
termlist: 65MB
value: 1.5MB

I can provide the indexer_script configuration file if that would help
as well.

Thanks!
- Jim


  • Olly Betts at Nov 17, 2007 at 12:15 pm

    On Fri, Nov 16, 2007 at 05:13:45PM -0500, Jim Spath wrote:
    > scriptindex is using well over 200MB of RAM to build our search index.
    >
    > Is this normal?

    I've not studied scriptindex's memory usage much, but that does sound
    rather high given the reported size of the data.

    How are you measuring that memory usage?

    And what platform is this? And which Xapian version?

    > I can provide the indexer_script configuration file if that would help
    > as well.

    Please.

    Cheers,
    Olly
  • Jim Spath at Nov 19, 2007 at 6:26 pm

    Olly Betts wrote:
    > On Fri, Nov 16, 2007 at 05:13:45PM -0500, Jim Spath wrote:
    >> scriptindex is using well over 200MB of RAM to build our search index.
    >>
    >> Is this normal?
    >
    > I've not studied scriptindex's memory usage much, but that does sound
    > rather high given the reported size of the data.
    >
    > How are you measuring that memory usage?

    I ran top to monitor the index building process.

     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    9169 schedule  18   0  236m 227m 1504 D   42 44.6   1:27.66 scriptindex
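
For a single peak figure rather than a snapshot from top, something along these lines might help. It is only a sketch, assuming a Unix-like system and Python's standard `resource` module; `peak_child_rss` is a hypothetical helper name, and the scriptindex arguments in the usage comment are placeholders rather than the paths used in this thread.

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run cmd to completion and return the peak resident set size
    recorded for this process's waited-for children.

    Units follow getrusage(2): kilobytes on Linux, bytes on macOS.
    The value is a high-water mark over all children reaped so far,
    so run this from a fresh process for a clean reading.
    """
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__" and len(sys.argv) > 1:
    # Placeholder usage:
    #   python peak_rss.py scriptindex DATABASE indexer_script data.dump
    print(peak_child_rss(sys.argv[1:]))
```

Unlike the RES column in top, which shows the current value at each refresh, this reports the maximum reached over the whole run.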

    > And what platform is this? And which Xapian version?

    Ubuntu LTS 6.06 and Xapian 1.0.4

    >> I can provide the indexer_script configuration file if that would help
    >> as well.
    >
    > Please.

    quiz_id : field=quiz_id unique=Q boolean=Q
    quiz_title : field=title weight=4 index index=XTITLE
    quiz_path : field=path
    tags : weight=3 index index=XTAGS
    questions : weight=2 index index=XQUESTIONS
    answers : weight=1 index index=XANSWERS
    adult : field=adult index boolean=XADULT
    type : field=type boolean=XTYPE
    create_date : value=0
    language_string : field=language_string boolean=L

    Thanks!
    - Jim
  • Jim Spath at Nov 19, 2007 at 10:39 pm

    Jim Spath wrote:
    > quiz_id : field=quiz_id unique=Q boolean=Q
    > quiz_title : field=title weight=4 index index=XTITLE
    > quiz_path : field=path
    > tags : weight=3 index index=XTAGS
    > questions : weight=2 index index=XQUESTIONS
    > answers : weight=1 index index=XANSWERS
    > adult : field=adult index boolean=XADULT
    > type : field=type boolean=XTYPE
    > create_date : value=0
    > language_string : field=language_string boolean=L

    Looking my indexer_script over, I saw some optimizations I could make
    and have lowered the amount of memory scriptindex is using by nearly 100MB:

                 VIRT  RES   SHR
    previously:  236m  227m  1504
    currently:   138m  129m  1508

    My indexer_script now looks like:

    quiz_id : field=quiz_id unique=Q boolean=Q
    quiz_title : field=title weight=4 index=XTITLE
    quiz_path : field=path
    tags : weight=3 index=XTAGS
    questions : weight=2 index=XQUESTIONS
    answers : weight=1 index=XANSWERS
    adult : boolean=XADULT
    type : boolean=XTYPE
    create_date : value=0
    language_string : boolean=L
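
To see exactly which actions were dropped between the two versions of the script, a small comparison like the one below can help. It is a rough sketch: `parse_actions` and `removed_actions` are hypothetical helpers, and the parser only handles the simple `name : action action ...` lines shown in this thread, not the full scriptindex syntax.

```python
OLD = """
quiz_id : field=quiz_id unique=Q boolean=Q
quiz_title : field=title weight=4 index index=XTITLE
quiz_path : field=path
tags : weight=3 index index=XTAGS
questions : weight=2 index index=XQUESTIONS
answers : weight=1 index index=XANSWERS
adult : field=adult index boolean=XADULT
type : field=type boolean=XTYPE
create_date : value=0
language_string : field=language_string boolean=L
"""

NEW = """
quiz_id : field=quiz_id unique=Q boolean=Q
quiz_title : field=title weight=4 index=XTITLE
quiz_path : field=path
tags : weight=3 index=XTAGS
questions : weight=2 index=XQUESTIONS
answers : weight=1 index=XANSWERS
adult : boolean=XADULT
type : boolean=XTYPE
create_date : value=0
language_string : boolean=L
"""


def parse_actions(script):
    """Map each input field name to its list of actions (simplified parser)."""
    fields = {}
    for line in script.strip().splitlines():
        name, _, actions = line.partition(":")
        fields[name.strip()] = actions.split()
    return fields


def removed_actions(old, new):
    """Return {field: [actions present in old but missing from new]}."""
    new_fields = parse_actions(new)
    result = {}
    for name, actions in parse_actions(old).items():
        remaining = list(new_fields.get(name, []))
        dropped = []
        for action in actions:
            if action in remaining:
                remaining.remove(action)
            else:
                dropped.append(action)
        if dropped:
            result[name] = dropped
    return result


if __name__ == "__main__":
    for name, dropped in removed_actions(OLD, NEW).items():
        print(f"{name}: dropped {' '.join(dropped)}")
```

Run on the two scripts above, this shows two kinds of change: the bare `index` action (indexing the text a second time without a prefix) is gone from the text fields, and several `field=` actions (storing a copy of the value in the document data) are gone from the boolean fields.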

    The resulting database files are much smaller now too:

    position: 59M vs 148M
    postlist: 51M vs 89M
    record: 4.4M vs 7.7M
    termlist: 50M vs 67M
    value: 1.3M vs 1.5M
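
Adding up the figures above gives a feel for the overall saving. The snippet below is just arithmetic on the sizes quoted in this post; the helper names are made up for illustration.

```python
# Sizes in MB as reported above: (after, before).
sizes = {
    "position": (59, 148),
    "postlist": (51, 89),
    "record": (4.4, 7.7),
    "termlist": (50, 67),
    "value": (1.3, 1.5),
}


def totals(table):
    """Return (total_before, total_after) across all database files."""
    after = sum(a for a, _ in table.values())
    before = sum(b for _, b in table.values())
    return before, after


def percent_saved(before, after):
    """Percentage reduction going from before to after."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    before, after = totals(sizes)
    print(f"total: {after:.1f}M vs {before:.1f}M "
          f"({percent_saved(before, after):.0f}% smaller)")
    for name, (a, b) in sizes.items():
        print(f"{name}: {percent_saved(b, a):.0f}% smaller")
```

The total drops from about 313M to about 166M, roughly 47% smaller, with the position table accounting for most of the difference.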

    I'm still worried about resource use as the amount of data grows, but I
    guess I'm somewhat better off now.

    Are there some generally accepted "best practices" for indexing large
    datasets?

    - Jim
  • Kevin Duraj at Nov 21, 2007 at 3:50 am
    Dear Jim,

    My scriptindex uses 7.2 GB of memory when indexing 56 million
    documents. Xapian's memory usage while indexing depends on the
    XAPIAN_FLUSH_THRESHOLD environment variable. The default is 10K, mine
    is 1 million. I have switched all memory slots to 2GB memory modules and
    have been throwing the 500MB memory modules in the garbage. If you send me
    a self-addressed envelope with postage I will send you back a couple of
    500MB memory modules. It will be more than double what you need.


    Cheers
    Kevin Duraj
    http://UncensoredWebSearch.com
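
As a rough illustration of the XAPIAN_FLUSH_THRESHOLD setting mentioned above: the variable is read from the environment of the indexing process and counts documents buffered between automatic flushes, so a lower value should trade indexing speed for lower memory use. The sketch below assumes scriptindex is on the PATH; `flush_env` and `run_scriptindex` are hypothetical helper names, and the file names in the usage note are placeholders.

```python
import os
import subprocess


def flush_env(threshold, base=None):
    """Return a copy of the environment with XAPIAN_FLUSH_THRESHOLD set."""
    env = dict(os.environ if base is None else base)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(int(threshold))
    return env


def run_scriptindex(database, script, datafile, threshold):
    """Invoke scriptindex with the given flush threshold.

    All three paths are placeholders supplied by the caller.
    """
    cmd = ["scriptindex", database, script, datafile]
    return subprocess.run(cmd, env=flush_env(threshold), check=True)
```

For example, `run_scriptindex("quiz.db", "indexer_script", "quizzes.dump", 5000)` would index with half the default threshold. Whether that helps in a given setup is something to measure rather than assume.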

  • Jim Spath at Nov 21, 2007 at 1:50 pm

    Kevin Duraj wrote:
    > Dear Jim,
    >
    > My scriptindex uses 7.2 GB of memory when indexing 56 million
    > documents. Xapian's memory usage while indexing depends on the
    > XAPIAN_FLUSH_THRESHOLD environment variable. The default is 10K, mine
    > is 1 million.

    I wasn't aware this variable existed; I'll have to try it out. Thanks!

    > I have switched all memory slots to 2GB memory modules and
    > have been throwing the 500MB memory modules in the garbage. If you send me
    > a self-addressed envelope with postage I will send you back a couple of
    > 500MB memory modules. It will be more than double what you need.

    We're running on a Xen virtual instance so all we need to do is
    upgrade the instance if we want more memory... but I like to optimize
    the process first, then add resources.

    - Jim

Discussion Overview
group: xapian-discuss
categories: xapian
posted: Nov 16, '07 at 10:13p
active: Nov 21, '07 at 1:50p
posts: 6
users: 3
website: xapian.org
irc: #xapian