On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey wrote:
Interesting... and segment merging just does its own private
concatenation/mapping-around-deletes of the doc/positions?
I think the answer is yes, but I'm not sure I understand the
question completely, since I'm not sure why you'd ask it in this
context.
Segment merging is one place that "legitimately" needs to append the
docs/positions enums of multiple sub readers... but obviously it can
just do this itself (and it must, since it renumbers the docIDs).
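That renumbering-around-deletes can be sketched like this (illustrative names, not the real Lucene merge code): each live doc gets the running doc base plus its position among the segment's surviving docs, while deleted docs simply drop out of the new numbering.

```java
public class DocMapSketch {
    // Returns an oldDocID -> newDocID map for one segment during a merge.
    // docBase is the sum of live doc counts of the segments merged before
    // this one; deleted docs map to -1 (they have no new docID).
    public static int[] docMap(int docBase, boolean[] deleted) {
        int[] map = new int[deleted.length];
        int next = docBase;
        for (int i = 0; i < deleted.length; i++) {
            map[i] = deleted[i] ? -1 : next++;
        }
        return map;
    }
}
```

The point is that the merge owns this mapping itself; no per-segment enum ever needs to know it is being concatenated.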
what's a "flat positions space"?
It's something Google once used. Instead of positions restarting at
0 for each document, they just keep going:
doc 1: "Three Blind Mice" - positions 0, 1, 2
doc 2: "Peter Peter Pumpkin Eater" - positions 3, 4, 5, 6
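A tiny sketch of that numbering (hypothetical helper, just to make the example concrete): each document's positions continue from where the previous document left off.

```java
public class FlatPositionsSketch {
    // docLengths[i] = token count of doc i; returns each doc's positions
    // in a single flat positions space shared across all docs.
    public static int[][] flatten(int[] docLengths) {
        int[][] out = new int[docLengths.length][];
        int offset = 0;
        for (int d = 0; d < docLengths.length; d++) {
            out[d] = new int[docLengths[d]];
            for (int p = 0; p < docLengths[d]; p++) {
                out[d][p] = offset + p;  // continue, don't restart at 0
            }
            offset += docLengths[d];
        }
        return out;
    }
}
```

With `flatten(new int[]{3, 4})` the two docs above come out as {0, 1, 2} and {3, 4, 5, 6}.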
And we don't return "objects or aggregates" with Multi*Enum now...
Yeah, this is different. In KS right now, we use a generic
PostingList, which conveys different information depending on what
class of Posting it contains.
In flex right now the codec is unaware that it's being "consumed" by
a Multi*Enum.
Right, but in KinoSearch's case PostingList had to be aware of that
because the Posting object could be consumed at either the segment
level or the index level -- so it needed a setDocBase(offset) method
which adjusted the doc num in the Posting. It was messy.
The change I made was to eliminate PolyPostingList and
PolyPostingListReader, which made it possible to remove the
setDocBase() method from SegPostingList.
But why didn't you have the Multi*Enums layer add the offset (so that
the codec need not know who's consuming it)? Performance?
It still returns primitives. If instead we returned an int[] for
positions (hmm -- that may be a good reason to make positions an
Attribute, Uwe), I think it would still be OK?
In the flat positions space example, it would be necessary to add an
offset to each of the positions in that array. Each segment would
have a "positions max" analogous to maxDoc(); these would be summed
to obtain the positions offset the same way we add up maxDoc() now
to obtain the doc id offset.
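The bookkeeping described there is just a running sum, the same shape as the docBase computation (positionsMax here is a hypothetical per-segment statistic, not an existing API):

```java
public class PositionBaseSketch {
    // positionsMax[i] = total position count of segment i; returns each
    // segment's offset into the flat positions space, computed exactly
    // the way doc bases are computed from per-segment maxDoc().
    public static int[] positionBases(int[] positionsMax) {
        int[] bases = new int[positionsMax.length];
        int sum = 0;
        for (int i = 0; i < positionsMax.length; i++) {
            bases[i] = sum;
            sum += positionsMax[i];
        }
        return bases;
    }
}
```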
OK, but [so far] we don't have that problem with the flex APIs -- the
codec is not aware that there's a multi enum layer consuming it.
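That arrangement can be sketched as follows (illustrative names, far simpler than the real flex API): the multi layer adds each segment's doc base itself, so the per-segment enum never learns there is a multi layer consuming it.

```java
public class MultiEnumSketch {
    interface DocsEnum { int nextDoc(); }  // -1 means exhausted

    // A per-segment enum over a fixed docID array, standing in for a codec.
    static class ArrayDocsEnum implements DocsEnum {
        private final int[] docs;
        private int upto = 0;
        ArrayDocsEnum(int[] docs) { this.docs = docs; }
        public int nextDoc() { return upto < docs.length ? docs[upto++] : -1; }
    }

    // Concatenates sub enums, applying each segment's docBase on the way out.
    static class MultiDocsEnum implements DocsEnum {
        private final DocsEnum[] subs;
        private final int[] docBases;
        private int i = 0;
        MultiDocsEnum(DocsEnum[] subs, int[] docBases) {
            this.subs = subs;
            this.docBases = docBases;
        }
        public int nextDoc() {
            while (i < subs.length) {
                int doc = subs[i].nextDoc();
                if (doc != -1) {
                    return docBases[i] + doc;  // offset added here, not by the codec
                }
                i++;
            }
            return -1;
        }
    }
}
```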
That example may not be a deal breaker for you, but I'm not willing
to guarantee that Lucy will always return primitives from these
enums, now and forever, one per method call.
But it'd be a major API change down the road to change this, for
Lucy/KS? Ie this example seems not to apply to Lucene, and even for
KS/Lucy seems contrived -- neither Lucene nor KS/Lucy would/could up
and make such a major API change to the enums, once "committed".
Also, this is why we're adding Attribute* to all the postings enums,
with flex -- any codec & consumer can use their own private
attributes. The attrs pass through Multi*Enum.
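A heavily simplified sketch of that pass-through idea (nothing like the real AttributeSource machinery; names are hypothetical): a codec attaches a private attribute to its enum, and the multi layer hands back the current sub-enum's attributes unchanged rather than interpreting them.

```java
import java.util.HashMap;
import java.util.Map;

public class AttrPassThroughSketch {
    // A bag of codec-private attributes keyed by name.
    static class Attributes {
        private final Map<String, Object> map = new HashMap<>();
        void set(String key, Object value) { map.put(key, value); }
        Object get(String key) { return map.get(key); }
    }

    // Stands in for a per-segment enum produced by a codec.
    static class SubEnum {
        final Attributes attrs = new Attributes();
    }

    // The multi layer: it forwards the current sub-enum's attributes
    // verbatim and never looks inside them.
    static class MultiEnum {
        private final SubEnum[] subs;
        private int current = 0;
        MultiEnum(SubEnum[] subs) { this.subs = subs; }
        void advanceToNextSub() { current++; }
        Attributes attributes() { return subs[current].attrs; }
    }
}
```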
Still torn... I think it's convenience vs performance.
But convenience for the posting format plugin developer matters too.
Right, but the existence of Multi*Enums isn't affecting the codec
dev (so far, I think).
Are you confident that a generic aggregator can support all possible
codecs, or will plugin developers be forced to ensure that
aggregation works because you've guaranteed to users like Renaud
that it will?
Well... pretty confident. So far, at least? We have an existence
proof :) The codec API really should not (and, should not have to)
bake in details of who's consuming it.