Here's my problem:

We're indexing books. I need to
a> return books ordered by relevancy
b> for any single book, return the number of hits in each chapter (which, of
course, may be many pages).

1> If I index each page as a document, computing relevance on a book basis
is interesting, but collecting page hits per book is easy.
2> If I index each book as a document, returning the books by relevance is
easy, but aggregating hits per chapter is interesting.

No, creating two indexes is not an option at present, although that would be
the least work for me.....

I can make <2> work if, for a particular field, I can determine what the
last termposition on each page is *at index time*. Oh, we don't want
searches to span pages. Pages are added to the doc with multiple calls like
so....

doc.add("field", first page text);
doc.add("field", second page text);
doc.add("field", third page text);


The only approach I've really managed to come up with so far is to make my
own Analyzer that has the following characteristics...
1> override getPositionIncrementGap for this field and return, say, 100.
This should keep us from spanning pages, and provide a convenient trigger
for me to know we've finished indexing a page (or are starting a new one);
see the sketch after this list.
2> record the last token position and provide a mechanism for me to retrieve
that number. I can then keep a record, in this document, of what offset each
page starts at, and accomplish my aggregation by storing, with the
document, the termpositions of the start (or end) of each page.
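
For 1>, something like this is what I have in mind. A rough sketch against
the Lucene 2.0-era API; PageGapAnalyzer and the hard-coded field name are
mine, not anything that ships with Lucene:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PageGapAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }

    // Called between successive values of the same field; the gap is added
    // to the position of the first token of the next page, so a span
    // shorter than 100 positions can never match across a page boundary.
    public int getPositionIncrementGap(String fieldName) {
        return "field".equals(fieldName) ? 100 : 0;
    }
}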

Note, I'm rolling my own counter for where terms hit. It'll be a degenerate
case of only ANDing things together, so it should be pretty simple even in
the wildcard case.

I'm using the Srnd* classes to do my spans, since they may include wildcards,
and I don't see a way to get a Spans object from those, but it's late in the
day <G>.

Last time I appealed to y'all, you wrote back that it was already done. My
hope is that it's already done again, but I've spent a couple of hours
looking and it isn't obvious to me. What I want is a way to do something
like this....

doc.add("field", first page text);
int pos = XXX.getLastTermPosition("field");
doc.add("field", second page text);
pos = XXX.getLastTermPosition("field");
doc.add("field", third page text);
pos = XXX.getLastTermPosition("field");

But if I understand what's happening, the text doesn't get analyzed until
the doc is added to the index; all the doc.add(field, value) calls are just
set-up work, without any position information really being available yet.
I'd be happy to be wrong about that <G>.
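
If I'm right, the closest workaround I can see is to run the same analyzer
over each page myself, before the document is ever added, and keep my own
running offset. A sketch (countPositions() is my own helper, not a Lucene
API; remember to add getPositionIncrementGap() to the running total between
pages):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class PositionCounter {
    // Returns the number of position increments the analyzer emits for
    // this text, i.e. how far the field's "cursor" advances for one page.
    public static int countPositions(Analyzer a, String field, String text)
            throws IOException {
        TokenStream ts = a.tokenStream(field, new StringReader(text));
        int count = 0;
        for (Token t = ts.next(); t != null; t = ts.next()) {
            count += t.getPositionIncrement();
        }
        ts.close();
        return count;
    }
}

The obvious cost is tokenizing everything twice, once here and once when the
document is actually indexed.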

Thanks
Erick

  • Michael D. Curtin at Oct 18, 2006 at 9:00 pm

    Erick Erickson wrote:

    No, creating two indexes is not an option at present, although that
    would be the least work for me.....

    Could you elaborate on why this approach isn't an option?

    --MDC

  • Erick Erickson at Oct 18, 2006 at 9:50 pm
    Arbitrary restrictions by IT on the space the indexes can take up.

    Actually, I won't categorically say I *can't* make this happen, but in
    order to use this option, I need to be able to present a convincing case.
    And I can't do that until I've exhausted my options/creativity.

    And this way it keeps folks on the list from suggesting it when I've
    already thought of it.

    Erick
    On 10/18/06, Michael D. Curtin wrote:

    Could you elaborate on why this approach isn't an option?

    --MDC

  • Michael D. Curtin at Oct 18, 2006 at 10:13 pm

    Erick Erickson wrote:

    Arbitrary restrictions by IT on the space the indexes can take up.

    Disk space is a LOT cheaper than engineering time. Any manager worth his/her
    salt should be able to evaluate that tradeoff in a millisecond, and any IT
    professional unable to do so should be reprimanded. Maybe your boss can fix
    it. If not, yours is probably not the only such situation in the world ...

    If you can retrieve the pre-index content at search time, maybe this would work:

    1. Create the "real" index in the form that lets you get the top N books by
    relevance, on IT's disks.

    2. Create a temporary index on those books in the form that gives you the
    chapter counts in RAM, search it, then discard it.

    If N is sufficiently small, #2 could be pretty darn fast.
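
    A rough sketch of #2 (Lucene 2.0-ish API; the "chapter"/"text" field
    names and the array arguments are just placeholders for however you
    fetch the pre-index content):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.RAMDirectory;

    public class TempChapterIndex {
        // chapters[i] / texts[i]: chapter label and text of page i of a book.
        public static IndexSearcher build(String[] chapters, String[] texts)
                throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            for (int i = 0; i < texts.length; i++) {
                Document doc = new Document();
                doc.add(new Field("chapter", chapters[i],
                                  Field.Store.YES, Field.Index.UN_TOKENIZED));
                doc.add(new Field("text", texts[i],
                                  Field.Store.NO, Field.Index.TOKENIZED));
                writer.addDocument(doc);
            }
            writer.close();
            // Search this, tally hits per stored "chapter" value, discard.
            return new IndexSearcher(dir);
        }
    }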


    If that wouldn't work, here's another idea. I'm not clear on how your
    solution with getLastTermPosition() would work, but how about just counting
    words in the pages as you document.add() them (instead of relying on
    getLastTermPosition())? It would mean two passes of parsing, but you wouldn't
    have to modify any Lucene code ...

    --MDC

  • Erick Erickson at Oct 19, 2006 at 12:31 am
    I tried the notion of a temporary RAMDirectory already, and the documents
    parse unacceptably slowly, 8-10 seconds. Great minds think alike. Believe
    it or not, I have to deal with a 7,500-page book that details Civil War
    records of Michigan volunteers. The XML form is 24M, probably 16M of text
    exclusive of tags.

    About your second suggestion, I'm trying to figure out how to do
    essentially that. But a word count isn't very straightforward with stop
    words and dirty ASCII (OCR) data. I'm trying to hook that process into
    the tokenizer so the counts have a better chance of being accurate, which
    is the essence of the scheme. I'd far rather get the term offset from the
    same place the indexer will than try a similar-but-not-quite-identical
    algorithm that fails miserably on, say, the 3,000th and subsequent
    pages... I'm sure you've been somewhere similar....

    OK, you've just caused me to think a bit, for which I thank you. I think
    it's actually pretty simple. Just instantiate a class that is a thin
    wrapper around the Lucene analyzer, one that implements the TokenStream
    (or whatever) interface by calling a contained analyzer (has-a). Return
    the token and do any recording I want to, and provide any additional data
    to my process as necessary. I'll have to look at that in the morning.
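
    Something like this, maybe (an untested sketch; RecordingAnalyzer is my
    own name, and the counting only actually happens when IndexWriter pulls
    the TokenStream during addDocument(), so the offsets get read back after
    the document is added, not between the doc.add() calls):

    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class RecordingAnalyzer extends Analyzer {
        private final Analyzer delegate = new StandardAnalyzer();
        private int lastPosition = 0;   // running offset within the field

        public TokenStream tokenStream(String fieldName, Reader reader) {
            final TokenStream real = delegate.tokenStream(fieldName, reader);
            return new TokenStream() {
                public Token next() throws IOException {
                    Token t = real.next();
                    if (t != null) {
                        lastPosition += t.getPositionIncrement();
                    }
                    return t;
                }
                public void close() throws IOException {
                    real.close();
                }
            };
        }

        // IndexWriter calls this between page values of the same field, so
        // it doubles as the page-boundary trigger; snapshot lastPosition
        // here if you want per-page start offsets.
        public int getPositionIncrementGap(String fieldName) {
            lastPosition += 100;
            return 100;
        }

        public int getLastPosition() { return lastPosition; }

        public void reset() { lastPosition = 0; }   // call per document
    }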

    All in all, I'm probably going to make your exact argument about disk space
    being waaaay cheaper than engineering time. That said, exploring this serves
    two purposes; first it lets me back my recommendation with data. Second, and
    longer term, we're using Lucene on more and more products, and exploring the
    nooks and crannies involved in exotic schemes vastly increases my ability to
    quickly triage ways of doing things. The *other* thing my boss is good at is
    being OK with a reasonable amount of time "wasted" in order to increase my
    toolkit. So it isn't as frustrating as it might have appeared by my rather
    off-hand blaming of IT <G>.

    Thanks for the suggestions,
    Erick
    On 10/18/06, Michael D. Curtin wrote:

    If that wouldn't work, here's another idea. I'm not clear on how your
    solution with getLastTermPosition() would work, but how about just
    counting words in the pages as you document.add() them (instead of
    relying on getLastTermPosition())? It would mean two passes of parsing,
    but you wouldn't have to modify any Lucene code ...

    --MDC

  • Erik Hatcher at Oct 19, 2006 at 10:24 am

    On Oct 18, 2006, at 4:50 PM, Erick Erickson wrote:
    We're indexing books. I need to
    a> return books ordered by relevancy
    b> for any single book, return the number of hits in each chapter
    (which, of course, may be many pages).

    I think your application deserves a good look at XTF:

    <http://www.cdlib.org/inside/projects/xtf/>

    Here's an example of its results:

    <http://content.cdlib.org/xtf/view?query=gold&docId=kt6489n9wp&chunk.id=0&toc.depth=1&toc.id=0&brand=oac&x=22&y=9>

    I've been looking into XTF lately. I've seen that they do some
    patches to Lucene, or at least now extend it somehow, and I'm curious
    if anyone on this list is part of the XTF team and could describe the
    changes they've made to Lucene and possibly submit them for
    consideration into the core.

    Erik


  • Erick Erickson at Oct 19, 2006 at 12:41 pm
    Thanks. That's very similar to what we're doing, and I'd love to see some
    technical details too...

    Erick
    On 10/19/06, Erik Hatcher wrote:

    I think your application deserves a good look at XTF:

    <http://www.cdlib.org/inside/projects/xtf/>

    Erik


