Hi

About the "warming-up" of Xapian over the first few queries: at which
layer is the data cached?
Xapian / the Xapian bindings / filesystem I/O?

I don't know if that was the right question to ask. Say I have 2 machines:
Machine A) the Xapian write head, writing to local HD (continuous writing
24 hrs) [Python]
Machine B) the read head, network-mounting A's Xapian DB folder [PHP]

If I want a faster search, which machine's amount of RAM matters most?

What happens to the cache when the DB is flushed? Will the cache in memory
be gone, or will it be incrementally added to?

If I use Python to search and warm the cache, does that benefit PHP
searches?

Note that the DB keeps flushing every 10,000 docs (about every 5 mins).
Would search performance be better off if I separated it into 2 DBs and
searched over them like this? Will the cache of db1 stay and keep
benefiting?
db1 < very large
db2 < only today's documents, flushed every 5 mins / 10,000 docs

One more question on Enquire.sort_by_value(): does it use string
comparison only? It's relatively slow compared to sort_by_docid()... (my
values are all numeric timestamps)
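(For reference, sort_by_value() compares the stored value as a byte string. One way to make numeric timestamps sort correctly under a byte-wise comparison, sketched below, is to store them as fixed-width big-endian integers; the bindings also provide xapian.sortable_serialise() for floating-point values. The helper name here is made up for illustration.)

```python
import struct

def timestamp_sort_key(ts):
    """Encode a Unix timestamp as fixed-width big-endian bytes.

    Lexicographic (byte-string) order of the result matches numeric
    order of the input, so a byte-wise value sort ranks correctly.
    Decimal strings do not have this property: b"1000" < b"999".
    """
    return struct.pack(">Q", ts)

# 999 < 1000 numerically, and the encoded keys compare the same way:
assert timestamp_sort_key(999) < timestamp_sort_key(1000)
```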

Ah, one more: when I use set_collapse_key( MD5(title+domain) ) to remove
duplicated titles under a domain, I found it a bit expensive:
30M documents with collapse_key : 2-9+ secs
30M documents without collapse_key: 0.01 - 0.9 secs
(my keys are currently 32-byte strings)
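(One thing that might reduce the per-document comparison cost, though this is an untested guess: a hex MD5 is 32 characters, while the raw digest carries the same information in 16 bytes. A sketch, with hypothetical title/domain inputs:)

```python
import hashlib

# Hypothetical inputs standing in for a real title/domain pair.
title, domain = "Example article", "example.com"

hex_key = hashlib.md5((title + domain).encode("utf-8")).hexdigest()  # 32 chars
raw_key = hashlib.md5((title + domain).encode("utf-8")).digest()     # 16 bytes

# Same information, half the key length to store and compare.
assert len(hex_key) == 32 and len(raw_key) == 16
```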

I will keep testing after tuning the cache part and as the database grows.

Big Thanks
Andrey

  • James Aylett at Nov 23, 2007 at 11:32 pm

    On Fri, Nov 23, 2007 at 02:25:31PM -0800, Andrey wrote:

> About the "warming-up" of Xapian over the first few queries: at which
> layer is the data cached?
> Xapian / the Xapian bindings / filesystem I/O?

    Right now Xapian does (effectively) no explicit caching; it lets the
    operating system cache whatever it likes. This makes it difficult to
    answer most of your questions without knowing exactly what your
    operating system is (and details of how it caches). However in
    general, assuming there is enough core (physical memory) for the
    processes to never go into swap, the remaining memory will be used to
    cache blocks from the filesystem. From now on when I say 'cache' I
    mean 'operating system filesystem cache'.

    [Right now I'll point out that I can't remember any of the deep
    details of how flint btrees are likely to map onto disk blocks, and so
    some of this may need to be elaborated on or corrected by Olly or
    Richard.]

    When a Xapian writer flushes the database to disk, a number of file
    system blocks will change. How many cached blocks in the reader
    operating system become invalidated at that point will depend on
    details of your database and indexing and search profiles; you're best
    off measuring the effects of various changes here.

    If you have a writer using local disk and exporting to a remote reader
    (presumably using NFS), you are using the memory in the writer for two
    distinct things: caching the blocks off disk of the revision being
    used by the reader (so that requests from the reader that aren't in
    the reader's cache already will incur only the network overhead, not a
    hit to disk on the writer as well) and caching the blocks onto disk of
    the revision being assembled by the writer. (It's a little more
    complex than that because of the way revisions work, but hopefully
    that's a helpful view.)

    In very high performance situations, you /may/ get better mileage out
    of having the storage local to the reader, not the writer (throw lots
    of memory at the reader), or in a different box altogether (throw lots
    of memory at both reader and backend storage). However there may also
    be advantages to having the storage local to the writer (see below).

    Note that if your continual indexing process is 'sane' (by which I
    mean it's nowhere near intensive enough to risk getting behind - ie
    it's mostly sleeping, not actually doing work) then the memory in the
    writer isn't so important (but if the writer is also the final storage
    machine, the memory for that is important).
> What happens to the cache when the DB is flushed? Will the cache in
> memory be gone, or will it be incrementally added to?

    That depends on lots of things. Whatever has the storage local to it
    will do a pretty good job of throwing away invalidated cache blocks
    and, where necessary, reloading the freshened blocks from disk. (If
    the writer is on the same operating system instance, those freshened
    blocks are likely to already be in cache because of write-behind, in
    which case: win! Nothing has to hit disk to get them into core,
    assuming you have enough memory.)

    If the reader doesn't have local storage, it will have its own (now
    invalid) blocks cached. A good NFS implementation will deal with this
    fairly efficiently (NFSv4 more so than NFSv3, with the caveat that
    some NFSv4 implementations seem less stable in all sorts of nasty edge
    cases; however when you're pushing stuff that hard you're always going
    to have to do more work, so I'd ignore that for the time being). It'll
    need to go back across the network to freshen the block (assuming it
    needs that block again) or to fetch a new one (if that block is no
    longer used, which is a minor pain as it might not be invalidated if
    it's no longer used but unchanged; you can probably trust your OS to
    do the sensible thing here and just throw it away eventually in favour
    of blocks that are still being used). With luck you'll have enough
    memory on your storage box that the majority of these (ie: the most
    common blocks, ie those blocks needed for the most common searches)
    will be in core, so you won't actually hit disk there.

    (Some NFS implementations allow you to cache on disk, either by an
    extension layer above NFS or built into the file system implementation
    itself. The same kind of thing applies there, except that you might
    get better speed than having to do a network hit, depending on the
    relative speed of network vs local disk, and your disk loading.)

    It would be nice to be able to point a monitoring system at a running
    OS and figure out what's going on in its cache usage. You can get this
    kind of data to an extent on some systems, with the caveats that (a)
    it will take up memory, and so slow things down if you're running
    short on core, and (b) it will take up processor time. However, given
    a bit of time (and perhaps the risk that sometimes your system will
    respond much slower than it should as you work out the right tuning
    parameters), you can do it externally by measuring what you care about
    and tuning to improve that measurement. (This has the added advantage
    that you don't need to know intimately how your OS caches work.)

The big message is: measure it, change it a bit, measure it
again. Empirical data coming out of realistic simulated (or actual
real live) searches and indexing using your code is the only real way
to know that you're improving things.
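That measure/change/measure loop is easy to automate. This sketch times any search callable over repeated runs and reports cold (first run) versus warm (median of the rest) latency; the run_search argument is a placeholder for your own query code, run before and after a flush to see how much benefit the cache retains:

```python
import statistics
import time

def profile_search(run_search, repeats=10):
    """Time a search callable; return (cold, warm_median) in seconds.

    The first run is 'cold' (nothing warmed up yet by this process);
    the median of the remaining runs approximates steady-state latency.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search()
        timings.append(time.perf_counter() - start)
    return timings[0], statistics.median(timings[1:])
```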
> Note that the DB keeps flushing every 10,000 docs (about every 5 mins).
> Would search performance be better off if separated into 2 DBs, searched
> over like this? Will the cache of db1 stay and keep benefiting?
> db1 < very large
> db2 < only today's documents, flushed every 5 mins / 10,000 docs

Possibly, but not necessarily for caching reasons. I *think* (Olly or
Richard should jump in here) that provided your underlying filesystem
block size is the same as the btree block size, you won't see a
huge amount of difference in terms of caching efficiency. You should
get other benefits, particularly around inserting into db2 (because
the btree isn't nearly as big).
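(A note on the mechanics: Xapian can search several databases as if they were one, by adding them to a single Database object before constructing the Enquire. Conceptually the matcher then merges each database's weight-ordered result stream into one combined ranking, which can be sketched like this with plain tuples standing in for match results:)

```python
import heapq

# Each "database" yields (weight, docid) pairs already sorted by
# descending weight, as a matcher's result stream would be.
db1_results = [(0.9, 101), (0.7, 205), (0.3, 310)]
db2_results = [(0.8, 11), (0.2, 42)]

# Merge the two descending streams into one combined ranking without
# re-sorting everything; heapq.merge only ever looks at stream heads.
combined = list(heapq.merge(db1_results, db2_results,
                            key=lambda r: -r[0]))
# → [(0.9, 101), (0.8, 11), (0.7, 205), (0.3, 310), (0.2, 42)]
```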



    Finally, note that there are many other routes you can take. Without
    knowing anything about what scale you're trying to achieve, what your
    budget is, and so on, no one's going to be able to give you a set of
    instructions on how to build the best system for your needs. (And even
    if someone could, they'd probably want to charge you a consulting fee
    for it ;-)

    J

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org
  • Andrey at Nov 24, 2007 at 5:43 pm
Thank you very much, James, for your detailed information.

My current development config is as follows; both machines run CentOS 5:
machine 1: RAID 0, 2x SATA hard disks, 2 GB RAM, 3 GHz PD, Dell 860
machine 2: RAID 5, 3x 15k rpm SAS, 8 GB RAM, 2x 2 GHz Xeon, Dell 2950

Do you suggest any utilities to monitor the OS filesystem caches, so I can
try to monitor and tune them?

    Cheers
    Andrey


  • James Aylett at Nov 25, 2007 at 10:48 am

    On Sat, Nov 24, 2007 at 09:43:00AM -0800, Andrey wrote:

> Do you suggest any utilities to monitor the OS filesystem caches, so I
> can try to monitor and tune them?

    Monitoring is in a bit of a state of flux at the moment. I've used
    Cacti with considerable success in the past, but Zenoss looks
    promising for the future, although it hasn't made it into all (or even
    many) OS distributions yet, and it can apparently be difficult to
    install because of the dependencies. I haven't tried it yet.
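(For quick manual spot-checks between full monitoring setups: on Linux the page cache and buffer sizes are reported in /proc/meminfo. A sketch that pulls out those two figures; the field names assumed here are the standard Linux ones and may differ on other kernels:)

```python
def meminfo_cache_kb(text):
    """Parse /proc/meminfo content; return (Cached, Buffers) in kB."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in ("Cached", "Buffers"):
            fields[name] = int(rest.split()[0])
    return fields.get("Cached", 0), fields.get("Buffers", 0)

# In use on a Linux box:
#   cached, buffers = meminfo_cache_kb(open("/proc/meminfo").read())
```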

    J

    --
    /--------------------------------------------------------------------------\
    James Aylett xapian.org
    james@tartarus.org uncertaintydivision.org

Discussion Overview
group: xapian-discuss
categories: xapian
posted: Nov 23, '07 at 10:25p
active: Nov 25, '07 at 10:48a
posts: 4
users: 2
website: xapian.org
irc: #xapian

2 users in discussion

James Aylett: 2 posts Andrey: 2 posts
