Hi guys:

I am trying to figure out how to add the ability to create custom segment
files. Hopefully it is possible to create a plugin framework where one can
provide some sort of callback to add to a segment given a doc and provide
some sort of merge logic. This is in light of the flexible indexing effort.

After digging thru the latest trunk code in that area, I see a
Writer/WriterPerThread pattern for different types of segment files, e.g.
Stored data, norms, inverted doc, etc.

Do you think it is a good idea to consolidate them? Are there
intricacies, such as cross dependencies between different types of
writers?

Merge logic seems to be in the SegmentMerger class. To support this, it
seems it would be good to separate the merge logic out per writer type.

I am still trying to understand the code, any help is greatly
appreciated.

Thoughts?

Thanks

-John


  • Michael McCandless at Sep 17, 2009 at 2:43 pm
    I'm actively working on LUCENE-1458, to enable different codecs for
    reading/writing the terms dict and doc/freq/prox/payload postings.
    I'm working now towards getting PforDelta working...

    However, that change doesn't [yet] do anything for norms, stored
    fields or term vectors.

    Can you describe more details about what kinds of customization you're
    looking to do?

    Mike
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-dev-help@lucene.apache.org
  • John Wang at Sep 18, 2009 at 12:14 am
    Sure.

    A simple example:

    Say you have a type of field with fixed length data per doc, e.g. 8 bytes.
    It might be good to store in a segment:
    <numdocs><v1><v2>....<vn>

    so if you have 1000 docs, your seg file is 8k+4 bytes.
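
A minimal sketch of that layout in plain Java, for illustration only. `FixedWidthColumn` and its methods are hypothetical names, not Lucene classes. It assumes a 4-byte doc count header followed by one 8-byte long per doc, in `ByteBuffer`'s default big-endian order.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: <numdocs><v1><v2>...<vn>
final class FixedWidthColumn {
    static final int HEADER_BYTES = 4; // int doc count
    static final int VALUE_BYTES = 8;  // one long per doc

    // Serialize the column; values[i] belongs to doc i of the segment.
    static byte[] write(long[] values) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + VALUE_BYTES * values.length);
        buf.putInt(values.length);
        for (long v : values) {
            buf.putLong(v);
        }
        return buf.array();
    }

    // Read the doc count from the header.
    static int numDocs(byte[] file) {
        return ByteBuffer.wrap(file).getInt(0);
    }

    // Random access: the value for a doc sits at a fixed offset, no parsing needed.
    static long valueAt(byte[] file, int docId) {
        return ByteBuffer.wrap(file).getLong(HEADER_BYTES + VALUE_BYTES * docId);
    }
}
```

With 1000 docs, `write` returns 4 + 8 * 1000 = 8004 bytes, and `valueAt` is a single offset computation per lookup.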

    Merging would be rather trivial as well.

    Doing this right now involves storing the value in a payload, which pays
    the cost of parsing a byte[] into, say, a long per doc.

    I think this problem is orthogonal to 1458.

    There are other usecases, so I thought it might be a good idea to abstract
    it out, since on a high level it is rather similar:

    start
    write per doc
    end
    merge
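
The four steps above could be captured in a small contract like the one below. This is a sketch under assumptions: `PerDocFileWriter` and `LongColumnWriter` are made-up names, no such interface exists in Lucene, and the merge shown simply concatenates segments in order. A real merge would also have to skip deleted docs, which this ignores.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical plugin contract: start, write per doc, end, merge.
interface PerDocFileWriter<S> {
    void start();                 // begin a new segment
    void writeDoc(long value);    // called once per doc, in doc order
    S end();                      // finish the segment and return its data
    S merge(List<S> segments);    // combine finished segments into one
}

// One long per doc; a finished segment is modeled as a long[].
final class LongColumnWriter implements PerDocFileWriter<long[]> {
    private List<Long> buffer;

    public void start() {
        buffer = new ArrayList<>();
    }

    public void writeDoc(long value) {
        buffer.add(value);
    }

    public long[] end() {
        long[] column = new long[buffer.size()];
        for (int i = 0; i < column.length; i++) {
            column[i] = buffer.get(i);
        }
        buffer = null;
        return column;
    }

    // Concatenate in segment order; doc ids shift by the running offset.
    public long[] merge(List<long[]> segments) {
        int total = 0;
        for (long[] s : segments) {
            total += s.length;
        }
        long[] merged = new long[total];
        int offset = 0;
        for (long[] s : segments) {
            System.arraycopy(s, 0, merged, offset, s.length);
            offset += s.length;
        }
        return merged;
    }
}
```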

    Hopefully I am describing it clearly.

    Thanks

    -John

  • Marvin Humphrey at Sep 18, 2009 at 3:53 am

    On Fri, Sep 18, 2009 at 08:14:24AM +0800, John Wang wrote:

    > Say you have a type of field with fixed length data per doc, e.g. 8 bytes.
    > It might be good to store in a segment:
    > <numdocs><v1><v2>....<vn>

    Heh. You've just described this proof of concept class:

    http://www.rectangular.com/kinosearch/docs/devel/KSx/Index/ByteBufDocWriter.html
    http://www.rectangular.com/svn/kinosearch/trunk/perl/lib/KSx/Index/ByteBufDocWriter.pm

    > Hopefully I am describing it clearly.

    Sure, I understand exactly what you mean.

    Marvin Humphrey


  • Michael McCandless at Sep 18, 2009 at 10:01 am

    > Say you have a type of field with fixed length data per doc, e.g.
    > 8 bytes.

    OK this makes sense -- thanks for the example! This sounds like
    getting column-stride-fields before that feature is added to Lucene
    "for real".

    For flushing, you can plug in your own indexing chain to IndexWriter.
    This (customizing what's indexed per-doc and what's written for the
    new segment) is exactly what the pluggable indexing chain is for.
    BUT: this API is still very experimental and package private.

    I suppose, for looser integration we could add a hook that's called in
    IndexWriter giving you a chance to do something at flush.
    Hmm... actually could you use doAfterFlush()?
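
To illustrate the shape of such a hook, here is a self-contained sketch. `ToyWriter` is a stand-in, NOT Lucene's IndexWriter; it only mimics a protected no-op `doAfterFlush()` that is called after each flush. `SidecarWriter` and `flushedColumns` are likewise hypothetical, and the sketch assumes single-threaded indexing so that values arrive in doc order.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a writer with a post-flush hook. NOT Lucene's IndexWriter.
class ToyWriter {
    private int buffered = 0;

    void addDocument() {
        buffered++;
    }

    final void flush() {
        if (buffered == 0) {
            return;            // nothing buffered, so no new segment
        }
        buffered = 0;          // pretend the core segment files were written here
        doAfterFlush();        // give subclasses a chance to write their own data
    }

    // No-op by default; subclasses override it.
    protected void doAfterFlush() {
    }
}

// Buffers one long per doc and emits a column each time a segment is flushed.
class SidecarWriter extends ToyWriter {
    private final List<Long> pending = new ArrayList<>();
    final List<long[]> flushedColumns = new ArrayList<>();

    void addDocument(long value) {
        pending.add(value);
        super.addDocument();
    }

    @Override
    protected void doAfterFlush() {
        long[] column = new long[pending.size()];
        for (int i = 0; i < column.length; i++) {
            column[i] = pending.get(i);
        }
        flushedColumns.add(column); // a real version would write a file instead
        pending.clear();
    }
}
```

Note that the hook takes no arguments, so the subclass has to track its own per-doc data and does not learn which segment was just written.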

    Merging, however, doesn't yet have hooks / pluggability in place to do
    something custom, and I agree it's sorely needed. Patches very
    welcome here!

    This could enable "loose" customization on what's flushed and how it's
    merged, and you'd have to make your own reader external to Lucene.

    LUCENE-1458 is aiming to cover this sort of use case, but in a more
    tightly integrated way. EG the new enumeration API in LUCENE-1458 (to
    replace TermEnum, TermDocs, TermPositions) is based on AttributeSource
    so that you could add your own attribute at the field, term, doc or
    positions level. However I haven't explored this at all yet, and eg
    customizable merging is not done.

    > It [flush] probably doesn't need to be final Mike?

    I agree. Wanna include un-final'ing it in a patch?

    > Is there a wiki or some sort of write up on LUCENE-1458?

    Sorry, not just yet. I agree it's badly needed... it's an enormous set
    of changes at this point. I'll add a wiki page that I'll try to keep
    current as the design iterates.

    Mike
  • John Wang at Sep 18, 2009 at 11:02 am
    Thank you very much Michael for the information!

    -John
  • Earwin Burrfoot at Sep 18, 2009 at 7:55 am
    I bet custom per-segment files could very well be used for per-segment
    userdata/debuginfo we introduced earlier.
    With them it could be stored neatly in a separate file instead of
    being grafted onto current ones.



    --
    Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
    Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
    ICQ: 104465785

  • Jason Rutherglen at Sep 18, 2009 at 1:10 am
    I believe you could override the IW.flush and IW.mergeSuccess
    methods. flush unfortunately doesn't expose the new SegmentInfo,
    however it could be obtained via
    IW.getReader().getSequentialSubReaders (by comparing the before
    and after).

    Adjacent segment files could then be maintained without hacking into
    SegmentMerger.
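
The "compare before and after" step could look roughly like this. It is a sketch only: `SegmentDiff` is a hypothetical helper, segments are represented as plain name strings such as "_0", and how those names are obtained from the sub readers is left out.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: diff two snapshots of segment names.
final class SegmentDiff {

    // Segments present after the operation but not before (newly flushed or merged).
    static Set<String> added(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(after);
        result.removeAll(before);
        return result;
    }

    // Segments present before but gone afterwards (merged away).
    static Set<String> removed(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(before);
        result.removeAll(after);
        return result;
    }
}
```

After a flush, `added` would contain the one new segment that needs an adjacent file. After a merge, `removed` lists the segments whose adjacent files have to be combined into a file for the segment in `added`.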
  • John Wang at Sep 18, 2009 at 2:17 am
    Hi Michael:

    Is there a wiki or some sort of write up on LUCENE-1458? It looks
    extremely cool!

    Re: Jason: isn't flush final?

    -John
  • Jason Rutherglen at Sep 18, 2009 at 4:08 am
    Yes, I guess you could branch the code? It probably doesn't need to
    be final, Mike?

Discussion Overview
group: java-dev
categories: lucene
posted: Sep 17, '09 at 2:00p
active: Sep 18, '09 at 11:02a
posts: 10
users: 5
website: lucene.apache.org
