Hello,

I wish to get some feedback on the use of Xapian in a virtual machine hosting plan with 360MB of memory. The following processes will share the 360MB:

0. nginx web server as front (estimated 5MB)

1. custom C++ FastCGI for dynamic requests (estimated 10MB)

2. Xapian writer (1 process and 1 thread)

3. Xapian readers (1 process with n threads for n readers)

4. PostgreSQL (estimated 50MB or lower)


That leaves about 300MB for Xapian and the rest of the Linux OS. The main UI will be a Google style search box.

Questions:

0. How would you configure Xapian for such low memory systems (e.g. how many readers, flush threshold for writer)?

1. Will the file handle limit be a problem for a multithreaded Xapian reader?

2. What are the advantages of multiprocess readers (compared to multithreaded) aside from crash isolation?


Thanks so much!
Marlon


  • Richard Boulton at Jan 21, 2010 at 5:19 pm

    2010/1/21 Marlon Baculio <mbaculio at hotmail.com>:
    > That leaves about 300MB for Xapian and the rest of the Linux OS. The main UI will be a Google style search box.
    > 0. How would you configure Xapian for such low memory systems (e.g. how many readers, flush threshold for writer)?
    Totally depends on the load, and the size of documents. To work out
    the flush threshold, I'd probably do an index run, logging the number
    of documents processed and watching the memory use in "top", and set
    the flush threshold to the number processed when the indexer memory
    use reaches about half the available memory (so space is left for
    the OS disk block cache, and the other processes).
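A rough sketch of the procedure described above, assuming the indexing run was logged as pairs of (documents processed, indexer memory in MB read from "top"). The `Sample` struct and the `pick_flush_threshold` helper are hypothetical names made up for illustration; nothing here calls Xapian itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One line of the (hypothetical) indexing log: how many documents had been
// added so far and the indexer's resident memory as read from "top".
struct Sample {
    std::size_t docs_processed;
    double resident_mb;
};

// Return the document count of the first sample whose memory use reaches
// half of the memory available, or 0 if the run never got that far (in
// which case the log gives no reason to lower the threshold).
std::size_t pick_flush_threshold(const std::vector<Sample>& log,
                                 double available_mb) {
    assert(available_mb > 0.0);
    const double target_mb = available_mb / 2.0;
    for (const Sample& s : log) {
        if (s.resident_mb >= target_mb) {
            return s.docs_processed;
        }
    }
    return 0;
}
```

The resulting number is what you would then hand to the indexer, for example through the XAPIAN_FLUSH_THRESHOLD environment variable. Treat it as a starting point to re-measure, since memory per document varies with document size.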
    > 1. Will file handle limitation be a problem for multithreaded Xapian reader?
    Depends on search load. Each reader keeps about 5 filehandles open,
    so multiply that by the number of concurrent readers you want. If it
    comes close to the per-process fd limit, you've got a problem.
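The multiplication above can be written down as a small budget check. This is only a sketch: the function names are hypothetical, and the number of descriptors reserved for sockets, log files and so on is an assumption you would have to measure for your own process.

```cpp
#include <cassert>

// Estimated number of descriptors a reader pool keeps open.
long reader_fd_demand(long readers, long fds_per_reader) {
    assert(readers >= 0 && fds_per_reader > 0);
    return readers * fds_per_reader;
}

// Largest number of readers that fits under fd_limit once reserved_fds
// descriptors are set aside for everything else in the process
// (client sockets, log files, the PostgreSQL connection, ...).
long max_readers(long fd_limit, long reserved_fds, long fds_per_reader) {
    assert(fds_per_reader > 0);
    if (fd_limit <= reserved_fds) {
        return 0;
    }
    return (fd_limit - reserved_fds) / fds_per_reader;
}
```

With an assumed per-process limit of 1024, 64 descriptors held back for other uses, and about 5 per reader, `max_readers(1024, 64, 5)` gives 192, so the fd limit is unlikely to be the first thing a small pool runs into.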
    > 2. What are advantages of multiprocess readers (compared to multithreaded) aside from crash isolation?
    I can't think of any significant ones off the top of my head. You
    can't access a reader concurrently from multiple threads, so it
    doesn't make much difference to Xapian whether the reader is in a
    separate thread or a separate process.

    You might find it easier to pool connections for reuse if you use
    threads, but a process pool is perfectly feasible in theory too.
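As an illustration of pooling readers for reuse between threads, here is a minimal sketch of a pool that hands each reader to one thread at a time, which respects the rule that a reader must not be used concurrently. `ReaderPool` is a hypothetical class, and `Reader` is a template parameter standing in for whatever handle you pool (in a real program that would presumably be a database object opened for reading).

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Fixed set of reader handles shared between threads. A handle is owned
// exclusively by whoever acquired it until it is released again.
template <typename Reader>
class ReaderPool {
public:
    explicit ReaderPool(std::vector<std::unique_ptr<Reader>> readers)
        : idle_(std::move(readers)) {}

    // Block until a reader is idle, then hand it to the caller.
    std::unique_ptr<Reader> acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !idle_.empty(); });
        std::unique_ptr<Reader> reader = std::move(idle_.back());
        idle_.pop_back();
        return reader;
    }

    // Give a reader back so another thread can use it.
    void release(std::unique_ptr<Reader> reader) {
        assert(reader != nullptr);
        {
            std::lock_guard<std::mutex> lock(mutex_);
            idle_.push_back(std::move(reader));
        }
        available_.notify_one();
    }

    std::size_t idle_count() {
        std::lock_guard<std::mutex> lock(mutex_);
        return idle_.size();
    }

private:
    std::mutex mutex_;
    std::condition_variable available_;
    std::vector<std::unique_ptr<Reader>> idle_;
};
```

A request handler would call `acquire()`, run its search on the returned reader, and `release()` it afterwards. Wrapping that pair in a small RAII guard would be a natural next step so a reader is returned even when the search throws.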

    The readers should have very little memory overhead - they don't cache
    anything, leaving that up to the OS disk cache (so it'll be shared
    between all readers automatically).

    --
    Richard
  • Marlon Baculio at Jan 21, 2010 at 9:10 pm
    Richard, thanks for the info on number of file handles per reader as well as the fact the readers won't cache anything. The OS caching sounds perfect for my needs. Thanks also for the tip on how I might compute the write threshold.

    I'm used to multithreaded programming so I would tend to lean on that. I assume there is no file locking necessary in Xapian's read mode, or if there is, it's thread-safe? (I remember that on Solaris, fcntl-based file locking doesn't work if you have multiple file descriptors to the same file from multiple threads, i.e. close() and lockf() don't mix. I did not encounter the same problem with Linux's flock(), though.)

    Also, to service lots of concurrent searches, the way to go (aside from throwing more hardware at it) is to increase the number of processes/threads in the pool, each with its own reader, as Xapian is not asynchronous. Is that correct?

    (I might be prematurely optimizing here as I'm sure a $19 VPS hosting for 360MB will be a little tight for thousands of users, but I'm trying to get my money's worth :-)


    --marlon
  • Richard Boulton at Jan 21, 2010 at 9:24 pm

    2010/1/21 Marlon Baculio <mbaculio at hotmail.com>:
    > I'm used to multithreaded programming so I would tend to lean on that. I
    > assume there is no file locking necessary in Xapian's read mode, or if
    > there is, it's thread-safe? (I remember for Solaris, fcntl-based file
    > locking doesn't work if you have multiple file descriptors to the same file
    > from multiple threads (i.e. close() and lockf() don't mix). I did not
    > encounter the same problem for Linux' flock() though.)
    There is no file locking in Xapian's read mode at present (though it's
    likely to be added at some point). I think you can rely on us making
    it thread safe if and when we add it, though - or if we can't do that
    efficiently, documenting what guarantees the locking does provide. We
    went to great effort to ensure that the locking for writable databases
    is threadsafe.
    > Also, to service lots of concurrent searches, the way to go (aside from
    > throwing more hardware) is to increase number of processes/threads in the
    > pool with their own reader as Xapian is not asynchronous. Is that correct?
    Xapian is not asynchronous, yes.

    As to the rest of your question: possibly. It depends on how big your
    database is, really. With this amount of memory, it's probably quite
    easy for you to become IO bound, in which case running multiple
    concurrent searches won't get the total work done much faster, and can
    increase the latency for individual queries massively (complex queries
    can end up waiting while lots of other queries do disk accesses). You
    might be best keeping the pool fairly small, and just queuing up
    queries if the pool is all busy.
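One way to picture "a small pool, with queries queued when the pool is busy" is a fixed number of worker threads pulling jobs from a queue. This is a sketch, not a tuned implementation: `SearchQueue` is a hypothetical class, and each job is an opaque callable where a real service would run a search on the worker's own reader.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Runs at most `workers` jobs at once; further jobs wait in FIFO order.
class SearchQueue {
public:
    explicit SearchQueue(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Finishes every queued job, then stops the workers.
    ~SearchQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : threads_) {
            t.join();
        }
    }

    // Queue a job; it starts as soon as a worker is free.
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) {
                    return;  // stopping and nothing left to do
                }
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so workers do not serialize
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

The worker count is the knob to experiment with: on an IO-bound box, a value of 2 or 3 may give better latency than a large pool, but that is something to measure rather than assume.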
    > (I might be prematurely optimizing here as I'm sure a $19 VPS hosting for
    > 360MB will be a little tight for thousands of users, but I'm trying to get
    > my money's worth :-)
    If you run some experiments to see what profile you get, do share your
    findings here; I'm sure others would be interested.

    --
    Richard
  • Olly Betts at Jan 21, 2010 at 10:18 pm

    On Thu, Jan 21, 2010 at 05:19:41PM +0000, Richard Boulton wrote:
    > 2010/1/21 Marlon Baculio <mbaculio at hotmail.com>:
    > > 1. Will file handle limitation be a problem for multithreaded Xapian reader?
    > Depends on search load. Each reader keeps about 5 filehandles open,
    > so multiply that by the number of concurrent readers you want. If it
    > comes close to the per-process fd limit, you've got a problem.
    "About 5" is between 3 and 7 for flint, the default backend in 1.0.

    The tables for values, spelling, synonyms, and positional data are optional
    and created lazily if such data is actually added to the database.

    The per-process fd limit is pretty high on most modern OSes, and can often be
    increased so it's probably not an issue for most people.
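For checking and raising the per-process limit from inside the program, POSIX provides getrlimit() and setrlimit() with RLIMIT_NOFILE. The sketch below assumes a POSIX system such as Linux; `ensure_fd_limit` is a hypothetical helper name. It only raises the soft limit, and only up to the hard limit, since going beyond that normally needs extra privileges.

```cpp
#include <sys/resource.h>

#include <cassert>

// Make sure the soft limit on open files is at least `wanted`, raising it
// if necessary but never past the hard limit. Returns the soft limit in
// effect afterwards, or 0 if the limit could not be read or changed.
rlim_t ensure_fd_limit(rlim_t wanted) {
    assert(wanted > 0);
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return 0;
    }
    if (rl.rlim_cur >= wanted) {
        return rl.rlim_cur;  // already high enough, leave it alone
    }
    rl.rlim_cur = (wanted > rl.rlim_max) ? rl.rlim_max : wanted;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
        return 0;
    }
    return rl.rlim_cur;
}
```

Sizing `wanted` from the worst case of 7 descriptors per reader, plus whatever the rest of the process needs, keeps the estimate on the safe side.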

    Cheers,
    Olly
  • Kevin Duraj at Jan 26, 2010 at 2:50 am
    Please do not add any file locking in Xapian's read mode. Searches on
    a Xapian index would become a lot slower if processes had to
    negotiate among themselves over which one locks the file first. No
    file locking is necessary when using BerkeleyDB in read mode, so
    Xapian should follow that standard. We already have one slow
    search engine, Lucene, on the market. We do not need to make Xapian the
    second Lucene.

    Thanks,
    Kevin Duraj

Discussion Overview
group: xapian-discuss
category: xapian
posted: Jan 21, '10 at 4:31p
active: Jan 26, '10 at 2:50a
posts: 6
users: 5
website: xapian.org
irc: #xapian
