Greetings all. I have read many posts concerning similar use cases, but I am
still a little hazy on the best way to achieve what I need to do. Here is
the background:

We have 2 million documents with multiple sections; some sections contain
structured data, some unstructured.

We parse the docs and place the structured data in Oracle, where each
section is a table and one master table relates them all.

We index the unstructured sections with Lucene, where each section is a
document (about 30 million documents in total), with extra fields
including one for the primary key of the master table and some meta
fields to describe the section: type, date, etc.

For a common use case, say we have a table called demographics with a number
field that represents age (overly simplistic but gets the point across).

So say we want all people over the age of 50 who may have visited Panama:

--
We have our Lucene index and we want to search the section text for the word
"panama"

AND

We want to select from the demographics table where age > 50.
--

Now I need to intersect the master table IDs from my Lucene hits with my
table results.

I have a Java stored procedure that runs the Lucene query and creates a
temporary table with a single column, into which I insert the master ID from
each hit of my Lucene query. I can then join it with my structured query
results.
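
As a minimal sketch of the intersection step described above (not the stored procedure itself, which uses a temporary table and a SQL join), the set operation can be modelled in plain Java. The class name `MasterIdIntersection` and the sample IDs are made up for illustration. Because each section is its own Lucene document, several hits can carry the same master ID, so the Lucene side is collected into a set first.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, for illustration only: intersects master IDs taken
// from Lucene hits with master IDs returned by the structured (SQL) query.
final class MasterIdIntersection {

    static Set<Long> intersect(List<Long> idsFromLuceneHits, List<Long> idsFromSql) {
        // A set removes duplicates: several section documents can share one master ID.
        Set<Long> result = new TreeSet<Long>(idsFromLuceneHits);
        // Keep only the IDs that the structured query also returned.
        result.retainAll(new TreeSet<Long>(idsFromSql));
        return result;
    }

    public static void main(String[] args) {
        List<Long> fromLucene = Arrays.asList(7L, 3L, 7L, 12L, 3L); // made-up hit IDs
        List<Long> fromSql = Arrays.asList(3L, 5L, 12L);            // made-up query result
        System.out.println(intersect(fromLucene, fromSql));
    }
}
```

For result sets in the millions, the join inside the database is probably still the better fit; the sketch only shows what that join computes.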

The problem here is obviously the speed of iterating through the hits to
extract the single field that I need.

Notes:
- I must be able to get the full set of results, though I only need the one
ID field
- We originally went with Oracle Text, which was simple but limited and
quite slow for most queries


I have read a little about the HitCollector class and the FieldSelector API,
but I am still not sure how they may help me, or even if they can.

I have also toyed with the idea of using TermDocs, but the queries
may get a little complex with various ORs/ANDs/NOTs, though probably not
spans and so forth.

Any suggestions will be greatly appreciated.

Thanks,

J

--
View this message in context: http://www.nabble.com/retrieve-all-docs-efficiently---just-one-field-tp17766268p17766268.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Karl Wettin at Jun 10, 2008 at 11:56 pm
    On 11 Jun 2008, at 00:35, 1world1love wrote:
    > We have our Lucene index and we want to search the section text for
    > the word "panama"
    >
    > AND
    >
    > We want to select from the demographics table where age > 50.
    > --
    >
    > Now I need to intersect the master table IDs from my Lucene hits and
    > my table results.

    I might be missing something here -- can't you just add the age field
    to the index and include that in your query?


    karl

  • Johannes Christen at Jun 11, 2008 at 7:40 am
    That might be a solution in this case, but I have the same kind of problem in another case.
    We index documents from an NTFS source. One field is the URI of the document.
    After a query has been processed, we perform an access check on the hits to ensure the user has access rights to open each document. With a large result set, it takes a very long time to retrieve the URIs from all the hits, which we need in order to run the access check against the file system.

    Is there a good solution for this?
    I think a fixed document ID in Lucene would help a lot in these cases. The mapping between Lucene documents and other systems (e.g. Oracle) would be much faster.

    Jo

  • Karl Wettin at Jun 11, 2008 at 1:43 pm
    On 11 Jun 2008, at 09:38, Johannes Christen wrote:
    > That might be a solution in this case, but I have the same kind of
    > problem in another case.
    > We index documents from an NTFS source. One field is the URI of the
    > document.
    > After a query has been processed, we perform an access check on the
    > hits to ensure the user has access rights to open the document. If
    > we have a big result set it takes very long to retrieve the URIs
    > from all the hits, which we need to perform the access check against
    > the file system.

    How many users usually have access to any given document? Can't you
    just index them with the document?

    > Any good solution for this?
    > I think a fixed document ID in Lucene would help a lot in these cases.
    > The mapping between Lucene documents and other systems (e.g. Oracle)
    > would be much faster.

    LUCENE-879 is a proof of concept that shows how you can enforce the
    Lucene document numbers if you really want to go that way.


    karl

  • Johannes Christen at Jun 13, 2008 at 2:26 pm
    There might be quite a lot of users in different groups, but more importantly, user access rights might change, and keeping them up to date in the index would be a real challenge.

    But thanks for the LUCENE-879 tip. I will look into that next week.

    Jo


  • 1world1love at Jun 11, 2008 at 1:46 pm

    karl wettin-3 wrote:
    > I might be missing something here -- can't you just add the age field
    > to the index and include that in your query?

    Thanks for the response, Karl.

    I just used the age field as an example. In reality the structured data
    is copious and has complex relationships, so there are dozens of such
    tables to manage it. The unstructured data is actually the simpler
    element of the data model.

    Also, in presenting the data, we must perform a number of aggregations and
    summaries that are fairly straightforward in SQL, but would be quite tedious
    and time-consuming to do in Lucene or programmatically.
  • Erick Erickson at Jun 11, 2008 at 2:06 pm
    <<<I have read a little about the hitcollector class and the fieldselector
    api, but I am still not sure how they may help me or even if they can.>>>

    I infer from this that you're using a Hits object to get your IDs to insert
    in your temporary table. Here's the problem with Hits: it re-executes
    the query every 100 (200?) hits. So you can think of it as

    while (more hits) {
        if ((count % 100) == 0)
            execute the search and throw away the first <count> items
        work with the document
    }

    It can be a major bottleneck to re-execute the query every 100 hits you look
    at. HitCollector avoids this re-execution, and can result in very significant
    speedups when iterating through many documents.
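
To put rough numbers on that description, here is a small plain-Java model (not Lucene code). It assumes, as in the pseudocode above, that a Hits-style iteration re-runs the search every 100 hits, and it counts how many hits get visited overall compared with a single pass. `HitsCostModel`, `DocCollector`, and both method names are hypothetical; in Lucene 2.x the real callback is `HitCollector.collect(int doc, float score)`, passed to `Searcher.search(Query, HitCollector)`.

```java
import java.util.BitSet;

// A plain-Java cost model, not Lucene code. All names here are hypothetical.
final class HitsCostModel {

    /** Stand-in for a per-hit callback (Lucene 2.x: HitCollector.collect(int doc, float score)). */
    interface DocCollector {
        void collect(int doc);
    }

    /** Hits visited when the search is re-run every `page` hits, as in the pseudocode above. */
    static long visitsWithReexecution(int totalHits, int page) {
        long visited = 0;
        for (int count = 0; count < totalHits; count++) {
            if (count % page == 0) {
                // Re-run: walk past the first `count` hits again, then read the next page.
                visited += Math.min(totalHits, count + page);
            }
        }
        return visited;
    }

    /** Hits visited when every hit is handed to a collector exactly once. */
    static long visitsWithCollector(int totalHits, DocCollector collector) {
        long visited = 0;
        for (int doc = 0; doc < totalHits; doc++) {
            collector.collect(doc);
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        final BitSet seen = new BitSet(); // the collector just records document numbers
        System.out.println(visitsWithReexecution(1000, 100));
        System.out.println(visitsWithCollector(1000, new DocCollector() {
            public void collect(int doc) { seen.set(doc); }
        }));
    }
}
```

In this model, 1,000 hits cost 5,500 visits with re-execution and 1,000 visits in a single pass, and the gap widens as the result set grows. The real numbers depend on how Hits actually pages, which the model does not try to reproduce.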

    FieldSelector allows lazy fetching. That is, when you call something
    like IndexReader.document(idx, selector), only the fields you specify
    with the selector are loaded from the document. In your case,
    you would load only the ID you care about and insert that in your temporary
    table. This can also result in very significant savings, especially if you
    only want to load a very small field from a document that has very large
    fields. See a writeup I did for one of my projects on the Lucene Wiki:

    http://wiki.apache.org/lucene-java/FieldSelectorPerformance?highlight=(FieldSelector)
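
The effect of a selector can be modelled in a few lines of plain Java (a sketch under stated assumptions, not the Lucene API). A stored document is represented as a map from field name to value, and only the fields named in the selector are materialised. `SelectiveLoadModel`, `load`, and the field names `masterId` and `body` are invented for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Plain-Java model of selective field loading; all names are hypothetical.
final class SelectiveLoadModel {

    /** Number of characters copied out of "storage" by the most recent load() call. */
    static long charsLoaded;

    /** Copies only the wanted fields out of the stored document and skips the rest. */
    static Map<String, String> load(Map<String, String> stored, Set<String> wanted) {
        charsLoaded = 0;
        Map<String, String> loaded = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> field : stored.entrySet()) {
            if (wanted.contains(field.getKey())) {
                loaded.put(field.getKey(), field.getValue());
                charsLoaded += field.getValue().length();
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        Map<String, String> stored = new LinkedHashMap<String, String>();
        stored.put("masterId", "421337");                // small field we need
        stored.put("body", new String(new char[50000])); // large field we do not need
        Set<String> wanted = new HashSet<String>();
        wanted.add("masterId");
        System.out.println(load(stored, wanted).keySet() + " chars loaded: " + charsLoaded);
    }
}
```

Here only the 6 characters of the ID are copied, while the 50,000-character body is never touched. That is the kind of saving described above when the field you need is small and the others are large.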


    Hope this helps
    Erick


  • 1world1love at Jun 11, 2008 at 2:22 pm
    Thanks Erick. That is what I was assuming, but I couldn't confirm whether it
    was worth going down those paths to achieve what I was hoping for. Your essay
    was very informative about realistic expectations for FieldSelector.

    I actually just finished reading the discussion on deprecating Hits, which
    essentially provides great detail on the summary you provided (link for
    anyone else who comes upon this thread and is curious:
    https://issues.apache.org/jira/browse/LUCENE-1290).

    I am still not quite sure exactly how to utilize the HitCollector API, but
    I will make a first pass at refactoring my code to use both.



Discussion Overview
group: java-user
category: lucene
posted: Jun 10, 2008 at 10:36 pm
active: Jun 13, 2008 at 2:26 pm
posts: 8
users: 4
website: lucene.apache.org
