Querying fragments of a tree structure
Hi,

First, thanks for this great resource, and sorry if I am oversimplifying a few things; I am still rather new to Lucene.

I have been thinking about how to integrate my app with Lucene. It is a CMS-type system that has documents organized in a tree-style layout. A few facts about the system:
- Every node in the system is represented by a unique numeric id in a field called "id"
- There is one defined root node, and an arbitrary number of descendants.
- Each node, on any level, knows its children in a field called "child"
- Each node also knows its parent node in a field called "parent"

I am already indexing all the fields from all the nodes in Lucene, so I can use Lucene to, e.g., get the child node IDs of a node simply by issuing a query like "id:2" and then extracting the multi-valued field "child".

Now, here is what I am trying to solve: I would like to be able to fetch all the nodes that match certain criteria, if they are contained in some fragment of the tree. To visualize:

Root
+-> A
+-> B
+-> C
+->D
+->E
+->F
+->G

I would like to issue a query that gives me all the nodes within "A": a flat list of results that contains B, C, D and E.

Now, since per my definition D is not directly linked to A (it knows its parent C, but not that it is also part of A; only C knows that), I was thinking of adding a new field to every node in my Lucene index that holds the list of IDs tracing back to the root element (in this case, the D node would have C and A in that field, in that order). It strikes me, though, that this may not be the most elegant approach...

The above is only a simplified example; in reality, I have a tree about 10 levels deep, with thousands of nodes, and I frequently need to surface nodes within a certain fragment of that tree.

Is there any best practice that you have run into for mapping this elegantly into Lucene?

Thanks a ton for any pointers,
Emanuel Schleussinger


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Erick Erickson at Mar 21, 2007 at 1:13 pm
    Is it a fair restatement of your problem that you want to generate
    a list of all children of a node? That's what I'm reading.....

    Would it work for you to store the complete ancestry in each node?
    By that I mean (from your example),

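    NOTE: it's no problem in Lucene to store different values for the
    same field in the same document, i.e.
    // writer is assumed to be an open IndexWriter
    Document doc = new Document();
    doc.add(new Field("field", "value1", Field.Store.NO, Field.Index.TOKENIZED));
    doc.add(new Field("field", "value2", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    This is equivalent (if using WhitespaceAnalyzer in this example) to:
    Document doc = new Document();
    doc.add(new Field("field", "value1 value2", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);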

    (There is a subtle difference between the two having to do
    with PositionIncrementGap, but that's probably irrelevant for you
    in this problem).

    So what about just doing that for each ancestor node in your tree? The
    "ancestry" field for documents D and E would then contain "C" and "A".
    Index this field TOKENIZED; it does not necessarily need to be STORED.

    Document C has only "A".

    Now, finding everything below "A" (children, grandchildren and so on)
    reduces to something like
    +ancestry:A
    which you can add to your BooleanClauses if you want to also specify
    other search criteria, or just use by itself if you don't.
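
The ancestry idea above can be modeled in plain Java without any Lucene classes. The sketch below is only an illustration under assumptions: the class and method names are made up, the tree covers only nodes A through E from the example (where F and G hang is not clear from the thread), and a HashMap stands in for the inverted index that Lucene would build from the "ancestry" field.

```java
import java.util.*;

// Plain-Java model of the "ancestry" field. No Lucene classes are used;
// all names here are hypothetical and exist only for illustration.
class AncestryIndexSketch {
    // child -> parent, for the part of the example tree that is unambiguous.
    static final Map<String, String> PARENT = new LinkedHashMap<>();
    static {
        PARENT.put("A", "Root");
        PARENT.put("B", "A");
        PARENT.put("C", "A");
        PARENT.put("D", "C");
        PARENT.put("E", "C");
    }

    // Walk parent pointers upward, nearest ancestor first (D -> [C, A]).
    // As in the thread, the root itself is left out of the field values.
    static List<String> ancestry(String node) {
        List<String> out = new ArrayList<>();
        String p = PARENT.get(node);
        while (p != null && !p.equals("Root")) {
            out.add(p);
            p = PARENT.get(p);
        }
        return out;
    }

    // Inverted view: ancestor value -> nodes whose ancestry field contains it.
    static Map<String, SortedSet<String>> buildIndex() {
        Map<String, SortedSet<String>> index = new HashMap<>();
        for (String node : PARENT.keySet()) {
            for (String ancestor : ancestry(node)) {
                index.computeIfAbsent(ancestor, k -> new TreeSet<String>()).add(node);
            }
        }
        return index;
    }

    // Rough stand-in for the query +ancestry:<node>.
    static SortedSet<String> descendantsOf(String node) {
        return buildIndex().getOrDefault(node, new TreeSet<String>());
    }

    public static void main(String[] args) {
        System.out.println(ancestry("D"));      // prints [C, A]
        System.out.println(descendantsOf("A")); // prints [B, C, D, E]
        System.out.println(descendantsOf("C")); // prints [D, E]
    }
}
```

Because every node carries all of its ancestors, a subtree lookup is a single term match no matter how deep the tree is; the cost is that moving a node means re-indexing everything below it.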


    What follows is my first idea, but I think the above is a better notion.

    Node A stores nothing
    Nodes B and C store "A"
    Nodes D and E store "A$C"
    etc.

    Now, finding all the children of A reduces to doing a WildcardTermEnum
    on "A*" and, for each resulting term, using TermDocs.seek(term) to find
    the corresponding documents.

    Note a couple of things:
    1> Index the ancestry field UN_TOKENIZED. You don't need to store it.
    1a> You could use something like this to form a Lucene Filter if you needed
    to, say, find all the nodes in the tree that were children of a specified
    node AND met certain search criteria.

    2> You could also just search on A*, but be aware that you may have to
    deal with TooManyClauses exceptions. The TermEnum/TermDocs method
    avoids that problem, but may be overkill in your situation.
    2a> Lucene 2.1 allows wildcards in the first position if you do a wildcard
    search, but you need to turn that on by a call which I can't bring up from
    memory.
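
The path-prefix variant (nodes storing "A", "A$C", and so on) can also be modeled in plain Java. This is a sketch under assumptions, not Lucene code: a TreeMap stands in for the sorted term dictionary that a term enumeration walks, all names are hypothetical, and the extra node X under a top-level node "AB" is invented to show one pitfall. A bare "A*" prefix would also match the term "AB", so the sketch accepts a term only if it equals the path or continues with the "$" separator.

```java
import java.util.*;

// Plain-Java model of the path-prefix idea. No Lucene classes are used;
// the TreeMap plays the role of the sorted, untokenized term dictionary.
class PathPrefixSketch {
    // term (full ancestor path) -> nodes that carry that term
    static final TreeMap<String, List<String>> TERMS = new TreeMap<>();
    static {
        add("B", "A");
        add("C", "A");
        add("D", "A$C");
        add("E", "A$C");
        add("X", "AB"); // hypothetical node below an unrelated node named "AB"
    }

    static void add(String node, String path) {
        TERMS.computeIfAbsent(path, k -> new ArrayList<String>()).add(node);
    }

    // nodePath is the node's own full path, e.g. "A" for A or "A$C" for C.
    static List<String> descendants(String nodePath) {
        List<String> out = new ArrayList<>();
        // Seek to the first term >= nodePath, then scan while the prefix matches.
        for (Map.Entry<String, List<String>> e : TERMS.tailMap(nodePath, true).entrySet()) {
            String term = e.getKey();
            if (!term.startsWith(nodePath)) {
                break; // sorted order: no later term can share the prefix
            }
            // Reject terms like "AB" that share the prefix but are another node.
            if (term.equals(nodePath) || term.startsWith(nodePath + "$")) {
                out.addAll(e.getValue());
            }
        }
        Collections.sort(out);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(descendants("A"));   // prints [B, C, D, E]
        System.out.println(descendants("A$C")); // prints [D, E]
    }
}
```

Walking the sorted terms directly, rather than expanding the wildcard into one clause per term, is what sidesteps the clause limit mentioned in note 2.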


    Hope this helps
    Erick
  • Emanuel Schleussinger at Mar 22, 2007 at 10:43 am
    Hi Erick,

    excellent insight, thanks a lot. As you would expect, this method works a treat.

    thanks a lot for your time!
    Emanuel


Discussion Overview
group: java-user @
categories: lucene
posted: Mar 21, '07 at 9:21a
active: Mar 22, '07 at 10:43a
posts: 3
users: 2
website: lucene.apache.org
