You are full of crap. From your own comments in LUCENE-1458:

"The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us to rely upon the operating system's virtual memory and avoid caching in process memory altogether?

Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size.

For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable."
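For concreteness, the block scheme quoted above could be sketched roughly as follows in Java. This is only an illustration under stated assumptions, not the Lucy/KinoSearch or Lucene format: the class name `BlockTermIndex`, the per-block layout (one status byte, a 2-byte term length, the term bytes, an 8-byte pointer), and the use of `String.compareTo` for term ordering are all hypothetical. Continuation blocks are detected but not handled. The `ByteBuffer` could equally be a `MappedByteBuffer` over the index file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: binary search over fixed-width blocks of a term index.
// Assumed block layout: [status:1][termLen:2][term bytes][pointer:8] ...
class BlockTermIndex {
    static final int BLOCK_SIZE = 1024;

    // Reads the first term of a block using absolute (position-independent) gets.
    static String firstTerm(ByteBuffer index, int block) {
        int pos = block * BLOCK_SIZE;
        if (index.get(pos) != 0) {
            // Status byte set: block starts with a continuation of a long term.
            throw new UnsupportedOperationException("continuation blocks not handled in this sketch");
        }
        int len = index.getShort(pos + 1) & 0xFFFF;
        byte[] bytes = new byte[len];
        for (int i = 0; i < len; i++) {
            bytes[i] = index.get(pos + 3 + i);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reads the pointer paired with the first term of a block.
    static long firstPointer(ByteBuffer index, int block) {
        int pos = block * BLOCK_SIZE;
        int len = index.getShort(pos + 1) & 0xFFFF;
        return index.getLong(pos + 3 + len);
    }

    // Returns the last block whose first term is <= target,
    // or -1 if target sorts before every block's first term.
    static int findBlock(ByteBuffer index, String target) {
        int lo = 0;
        int hi = index.capacity() / BLOCK_SIZE - 1;
        int result = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstTerm(index, mid).compareTo(target) <= 0) {
                result = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return result;
    }

    // Builds a tiny in-memory index with one term/pointer pair per block (demo only).
    static ByteBuffer build(String[] terms, long[] pointers) {
        ByteBuffer buf = ByteBuffer.allocate(terms.length * BLOCK_SIZE);
        for (int i = 0; i < terms.length; i++) {
            byte[] t = terms[i].getBytes(StandardCharsets.UTF_8);
            int pos = i * BLOCK_SIZE;
            buf.put(pos, (byte) 0);                  // status: complete entry
            buf.putShort(pos + 1, (short) t.length); // term length
            for (int j = 0; j < t.length; j++) {
                buf.put(pos + 3 + j, t[j]);
            }
            buf.putLong(pos + 3 + t.length, pointers[i]);
        }
        return buf;
    }
}
```

The search touches only the first entry of O(log n) blocks, so nothing needs to be loaded into process memory up front; the pointer from the located block would then be the starting point for the scan through the primary term dictionary.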

and then you state farther down

"Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

> We could also explore something in-between, eg it'd be nice to
> genericize MultiLevelSkipListWriter so that it could index arbitrary
> files, then we could use that to index the terms dict. You could
> choose to spend dedicated process RAM on the higher levels of the skip
> tree, and then tentatively trust IO cache for the lower levels.

That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths."

The thing I find funny is that many are approaching these issues as if new ground is being broken. These are ALL standard, long-known issues that any database engineer has already worked with, and there are accepted designs given applicable constraints.

This is why I've tried to point folks towards alternative designs that open the door much wider to increased performance/reliability/robustness.

Do what you like. You obviously will. This is the problem with the Lucene managers: the only problems are the ones they see, and the same goes for the solutions. If solutions (or questions) put them outside their comfort zone, they are ignored or dismissed in a tone designed to limit any further questions (especially those that might question their ability and/or understanding).

-----Original Message-----
From: Marvin Humphrey <marvin@rectangular.com>
Sent: Dec 26, 2008 3:53 PM
To: java-dev@lucene.apache.org, Robert Engels <rengels@ix.netcom.com>
Subject: Re: Realtime Search


Three exchanges ago in this thread, you made the incorrect assumption that the
motivation behind using mmap was read speed, and that memory mapping was being
waved around as some sort of magic wand:

Is there something that I am missing? I see lots of references to
using "memory mapped" files to "dramatically" improve performance.

I don't think this is the case at all. At the lowest levels, it is
somewhat more efficient from a CPU standpoint, but with a decent OS
cache the IO performance difference is going to be negligible.

In response, I indicated that the mmap design had been discussed in JIRA, and
pointed you at a particular issue.

There have been substantial discussions about this design in JIRA,
notably LUCENE-1458.

The "dramatic" improvement is WRT opening/reopening an IndexReader.

Apparently, you did not go back to read that JIRA thread, because you
subsequently offered a critique of a purely invented design you assumed we
must have arrived at, and continued to argue with a straw man about read

1. with "fixed" size terms, the additional IO (larger pages) probably
offsets a lot of the random access benefit. This is why "compressed"
disks on a fast machine (CPU) are often faster than "uncompressed" -
more data is read during every IO access.

While my reply did not specifically point back to LUCENE-1458 again, I hoped
that having your foolish assumption exposed would motivate you to go back and
read it, so that you could offer an informed critique of the *actual* design.
I also linked to a specific comment in LUCENE-831 which explained how mmap
applied to sort caches.

Additionally, sort caches would be written at index time in three files, and
memory mapped as laid out in

Apparently you still didn't go back and read up, because you subsequently made
a third incorrect assumption, this time about plans to do away with the term
dictionary index. In response I griped about JIRA again, using slightly
stronger but still intentionally indirect language.

No. That idea was entertained briefly and quickly discarded. There seems
to be an awful lot of irrelevant noise in the current thread arising due
to lack of familiarity with the ongoing discussions in JIRA.

Unfortunately, this must not have worked either, because you have now offered a
fourth message based on incorrect assumptions which would have been remedied by
bringing yourself up to date with the relevant JIRA threads.

> That could very well be, but I was referencing your statement:
>
> "1) Design index formats that can be memory mapped rather than slurped,
> bringing the cost of opening/reopening an IndexReader down to a
> negligible level."
>
> The only reason to do this (or have it happen) is if you perform a binary
> search on the term index.

No. As discussed in LUCENE-1458, LUCENE-1483, the specific link I pointed you
towards in LUCENE-831, the message where I provided you with that link, and
elsewhere in this thread... loading the term dictionary index is important, but
the cost pales in comparison to the cost of loading sort caches.

> Using a 2 file system is going to be WAY slower - I'll bet lunch. It might be
> workable if the files were on a striped drive, or put each file on a different
> drive/controller, but requiring such specially configured hardware is not a
> good idea. In the common case (single drive), you are going to be seeking all
> over the place.

Mike McCandless and I had an extensive debate about the pros and cons of
depending on the OS cache to hold the term dictionary index under LUCENE-1458.
The concerns you express here were fully addressed, and even resolved under an
"agree to disagree" design.

> Also, the mmap is only suitable for 64 bit platforms, since there is no way
> in Java to unmap, you are going to run out of address space as segments are

The discussion of how the mmap design translates from Lucy to Lucene is an
important one, but I despair of having it if we have to rehash all of
LUCENE-1458, LUCENE-831, and possibly LUCENE-1476 and LUCENE-1483 because you
cannot be troubled to bring yourself up to speed before commenting.

You are obviously knowledgeable on the subject of low level memory issues. Me
and Mike McCandless ain't exactly chopped liver, though, and neither are a lot
of other people around here who *are* bothering to keep up with the threads in
JIRA. I request that you show the rest of us more respect. Our time is
valuable, too.

Marvin Humphrey

To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org