FAQ
Hello Lucene Gurus,

I'm new to Lucene, so sorry if this question is basic or naïve.

I have a Document to which I want to add a Field named, say, "foo" that is tokenized, indexed and unstored. I am using the "Field(String name, TokenStream tokenStream)" constructor to create it. The TokenStream may take a fairly long time to return all its tokens.

Now, for querying reasons, I want to add another Field named, say, "bar", that is tokenized and indexed in exactly the same way as "foo". I could just pass it the same TokenStream that I used to create "foo", but since it takes so long to return all its tokens, I was wondering if there is a way to, say, create "bar" as a copy of "foo". I looked through the javadoc but didn't see anything.

Is this possible in Lucene, or do I just have to bite the bullet and build the new Field using the same TokenStream again?

--
Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
www.sungard.com/energy


  • Grant Ingersoll at Jun 27, 2008 at 11:27 am

    On Jun 27, 2008, at 12:01 AM, <[email protected]> wrote:

    > I have a Document to which I want to add a Field named, say, "foo"
    > that is tokenized, indexed and unstored. I am using the
    > "Field(String name, TokenStream tokenStream)" constructor to create
    > it. The TokenStream may take a fairly long time to return all its
    > tokens.

    Can you share some code here? What's the reasoning behind using it
    (not saying it's wrong, just wondering what led you to it)? Are you
    just loading it up from a file, string or something, or do you have
    another reason?

    > Now for querying reasons I want to add another Field named, say,
    > "bar", that is tokenized and indexed in exactly the same way as
    > "foo". I could just pass it the same TokenStream that I used to
    > create "foo" but since it takes so long to return all its tokens, I
    > was wondering if there is a way to, say, create "bar" as a copy of
    > "foo". I looked through the javadoc but didn't see anything.

    By exactly the same, do you really mean exactly the same? What's the
    point of that? What are the "querying reasons"?

    You may want to look at the TeeTokenFilter and the SinkTokenizer, but
    I guess I'd like to know more about what's going on before fully
    recommending anything.
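To make the tee/sink idea concrete, here is a minimal sketch in plain Java rather than against the Lucene classes themselves. It only illustrates the principle: read the expensive token source once, record each token as it goes by, and replay the recorded copy for the second field. The names TokenReplaySketch, RecordingIterator, replay and consume are made up for this illustration and are not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class TokenReplaySketch {

    /** Wraps a slow token source and records each token as it is handed out. */
    static final class RecordingIterator implements Iterator<String> {
        private final Iterator<String> source;
        private final List<String> sink = new ArrayList<>();

        RecordingIterator(Iterator<String> source) {
            this.source = source;
        }

        @Override
        public boolean hasNext() {
            return source.hasNext();
        }

        @Override
        public String next() {
            String token = source.next();
            sink.add(token); // the "tee": remember the token for later replay
            return token;
        }

        /** Replays the recorded tokens without touching the slow source again. */
        Iterator<String> replay() {
            return new ArrayList<>(sink).iterator();
        }
    }

    /** Drains an iterator into a list, standing in for "index this field". */
    static List<String> consume(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            out.add(tokens.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the expensive stream (hypothetical sample values).
        Iterator<String> slow = Arrays.asList("California", "Sacramento", "94203").iterator();
        RecordingIterator tee = new RecordingIterator(slow);

        List<String> foo = consume(tee);          // first field drives the slow source
        List<String> bar = consume(tee.replay()); // second field reads the recorded copy

        System.out.println(foo.equals(bar)); // prints true
    }
}
```

The trade-off is memory: every token of the first field is held until the second field has been built, so this only pays off when producing the tokens costs more than storing them.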


    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]
  • Bill Chesky at Jun 27, 2008 at 2:58 pm
    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details so I was trying to keep it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

    State=California
    City=Sacramento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it simple.)

    Imagine that the above list represents the set of Properties that specify the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.
    2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.

    Document: City=Sacramento
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South". This will return both documents above.

    But if I'm only interested in any 'City' Properties that match the term 'South' I search the index on the "City" field for the term "South". This will only return the 'City=Sacramento' document above because it's the only Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly the same way as the 'keywords' field. Which is why I was wondering if there was a way to just create these fields as copies of the 'keywords' field.

    Here is a code sample where I'm creating the index. We're using Hibernate search to search the indexes, thus the "id" and "_hibernate_class" fields.

    import java.util.List;
    import javax.persistence.Query;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Query q = em.createQuery("select p from Property p");
    List<Property> properties = q.getResultList();

    for (Property p : properties)
    {
        // Indexing property.
        Document doc = new Document();
        doc.add(new Field("id",
                          Integer.toString(p.getId()),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        doc.add(new Field("_hibernate_class",
                          Property.class.getCanonicalName(),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        TokenStream keywordsStream = new PropertyTokenStream(p);
        doc.add(new Field("keywords", keywordsStream));
        // Here is where I would like to add the second field that is a copy
        // of the "keywords" field just created above. Note: the call
        // p.getCharacteristic().getName() is getting the name of the
        // left-hand side of the Property as described above.
        // For now a second stream is built, because a stream is used up
        // as it is read.
        TokenStream namedStream = new PropertyTokenStream(p);
        doc.add(new Field(p.getCharacteristic().getName(), namedStream));
        // Add the document once, after both fields are attached.
        propertyIndexWriter.addDocument(doc);
        keywordsStream.close();
        namedStream.close();
    }

    Hope that clears it up.

    BTW, in case this seems like a strange way to index things, I will also add that we are doing it this way in order to impose a hierarchical structure on Properties. So my example above should really look like this:

    State=California
      City=Sacramento
        ZipCode=94203
          StreetName=South Main
            StreetNumber=1234
              Name=Joe Smith

    Use your imagination to visualize what the tree might look like with millions of people's addresses. Now imagine trying to tokenize the Document corresponding to "State=California". Each path through the tree from root (State) to leaf (Name) represents a set of Properties that is used to index the "keywords" field in the "State=California" document. In other words, it takes a long time to index. This is why I'm looking for a way to just copy one field to another.
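As a rough illustration of why the root document is so expensive, here is a small sketch of the path-based keyword collection described above. It assumes the keywords for a node are the values of its ancestors, the node itself, and everything in its subtree, which is one reading of "each path through the tree from root to leaf". The Node class and keywordsFor method are hypothetical and are not the actual PropertyTokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class PropertyTreeSketch {

    /** Hypothetical tree node: a name/value pair with a parent and children. */
    static final class Node {
        final String name;
        final String value;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name, String value, Node parent) {
            this.name = name;
            this.value = value;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }
    }

    /** Keywords for a node: values of its ancestors, itself, and its whole subtree. */
    static Set<String> keywordsFor(Node node) {
        Set<String> tokens = new LinkedHashSet<>();
        for (Node n = node.parent; n != null; n = n.parent) {
            addTokens(n, tokens);
        }
        collectSubtree(node, tokens);
        return tokens;
    }

    private static void collectSubtree(Node node, Set<String> tokens) {
        addTokens(node, tokens);
        for (Node child : node.children) {
            collectSubtree(child, tokens);
        }
    }

    /** Splits a value such as "South Main" on whitespace into separate terms. */
    private static void addTokens(Node node, Set<String> tokens) {
        tokens.addAll(Arrays.asList(node.value.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        Node state = new Node("State", "California", null);
        Node city = new Node("City", "Sacramento", state);
        Node zip = new Node("ZipCode", "94203", city);
        new Node("StreetName", "South Main", zip);
        new Node("City", "Fresno", state); // sibling branch, not on Sacramento's paths

        System.out.println(keywordsFor(state)); // every value in the tree
        System.out.println(keywordsFor(city));  // excludes the Fresno branch
    }
}
```

Under this reading the root has to visit the entire tree, while a leaf only walks its own ancestor chain, so the cost is concentrated in the few documents near the top.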

    There is a lot more to our design to facilitate this hierarchical structure but this is probably more than you wanted to know. :)

    thanks in advance,


  • Matthew Hall at Jun 27, 2008 at 3:52 pm
    I'm not sure if this is helpful, but I do something VERY similar to this
    in my project.

    So, for the example you are citing I would design my index as follows:

    db_key, data, data_type

    Where the data_type is some sort of value representing the thing that's
    on the left hand side of your property relationship there.

    So, then in order to satisfy your search, the queries become quite simple:

    The search for everything simply searches against the data field in this
    index, whereas the search for a specific data_type + searchterm becomes a
    simple boolean query that has a MUST clause for the data_type value.

    As an even BETTER bonus, this will then mean that all of your searchable
    values will now have relevance to each other at scoring time, which is
    quite useful in the long run.
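A minimal sketch of how that design behaves, using an in-memory list in place of a real index. The field names db_key, data and data_type come from the layout above; the Doc class, the search method and the sample keys and terms are assumptions made up for illustration. Passing null as the type stands for the search without a data_type clause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TypedIndexSketch {

    /** One indexed Property: a key, the searchable terms, and the property name. */
    static final class Doc {
        final String dbKey;
        final Set<String> data;
        final String dataType;

        Doc(String dbKey, String dataType, String... terms) {
            this.dbKey = dbKey;
            this.dataType = dataType;
            this.data = new HashSet<>(Arrays.asList(terms));
        }
    }

    /** Returns db_keys of docs containing the term; a null dataType means "any type". */
    static List<String> search(List<Doc> index, String term, String dataType) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : index) {
            // Required clause on the data field.
            boolean termMatches = doc.data.contains(term);
            // Required clause on the data_type field, applied only when a type is given.
            boolean typeMatches = dataType == null || doc.dataType.equals(dataType);
            if (termMatches && typeMatches) {
                hits.add(doc.dbKey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
            new Doc("1", "State", "California", "Sacramento", "94203", "South", "Main"),
            new Doc("2", "City", "California", "Sacramento", "94203", "South", "Main"));

        System.out.println(search(index, "South", null));   // both documents
        System.out.println(search(index, "South", "City")); // only the City document
    }
}
```

The point of the layout is that the expensive term list is stored once per document in data, and the type restriction is a cheap exact match on a second, tiny field.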

    Hope this helps you out,

    Matt

    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    [email protected]
    (207) 288-6012



  • Bill Chesky at Jun 27, 2008 at 6:42 pm
    Matthew,

    Thanks for the reply. This looks very interesting. If I'm understanding correctly your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

    Document: State=California
    Field: 'db_key'='1395' (primary key into relational table, correct?)
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'State'

    Document: City=Sacramento
    Field: 'db_key'='2405'
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'City'

    Then my query for all Properties would be:

    +data:South

    My query for only 'City' Properties would be:

    +data:South +data_type:City

    Is that right?

    I think that would work. Very nice. Thank you very much!!!!


  • Matthew Hall at Jun 27, 2008 at 7:36 pm
    Yup, you're pretty much there.

    The only part I'm a bit confused about is what you've said in your data
    field there.

    I'm thinking you mean that for the data_type: "State", you would have
    the data entry of "California", right?

    If so, then yup, you are spot on ^^

    We use this technique all the time on our side, and it's helped
    considerably. We then use the db_key to reference into a display time
    cache that holds all of the display information for the underlying
    object that we would ever want to present to the user. This allows our
    search time index to be very concise, and as a result nearly every
    search we hit it with is subsecond, which is a nice place to be ^^
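A minimal sketch of that lookup step, assuming the search returns only db_key values and that the display information lives in an in-memory map keyed by db_key. The DisplayCacheSketch class, the render method and the sample entries are hypothetical; a real cache would likely hold richer objects than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DisplayCacheSketch {

    /** Resolves search hits (db_keys) to display strings; unknown keys are skipped. */
    static List<String> render(List<String> hitKeys, Map<String, String> displayCache) {
        List<String> rows = new ArrayList<>();
        for (String key : hitKeys) {
            String display = displayCache.get(key);
            if (display != null) {
                rows.add(display);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Display data is kept outside the index, keyed by db_key.
        Map<String, String> displayCache = new HashMap<>();
        displayCache.put("1395", "State = California");
        displayCache.put("2405", "City = Sacramento");

        // Pretend the index returned this single hit.
        List<String> hitKeys = Arrays.asList("2405");
        System.out.println(render(hitKeys, displayCache));
    }
}
```

Keeping display data out of the index means the index only carries what is needed to match and score, which is consistent with the small, fast index described above.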

    Matt

    [email protected] wrote:
    Matthew,

    Thanks for the reply. This looks very interesting. If I'm understanding correctly your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

    Document: State=California
    Field: 'db_key'='1395' (primary key into relational table, correct?)
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'State'

    Document: City=Sacramento
    Field: 'db_key'='2405'
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'City'

    Then my query for all Properties would be:

    +data:South

    My query for only 'City' Properties would be:

    +data:South +data_type:City

    Is that right?

    I think that would work. Very nice. Thank you very much!!!!
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 11:49 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    I'm not sure if this is helpful, but I do something VERY similar to this
    in my project.

    So, for the example you are citing I would design my index as follows:

    db_key, data, data_type

    Where the data_type is some sort of value representing the thing that's
    on the left hand side of your property relationship there.

    So, then in order to satisfy your search, the queries become quite simple:

    The search for everything simply searches against the data field in this
    index, whereas the search for a specific data_type + searchterm becomes a
    simple boolean query that has a MUST clause for the data_type value.
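
A minimal plain-Java model of this layout may make the two queries easier to picture. It does not use Lucene at all; the class `PropertyIndexSketch`, the nested `Doc` type and the `search` method are hypothetical names made up for illustration, and token matching is reduced to exact set membership. Passing `null` as the type corresponds to a query like `+data:South`, and passing a type corresponds to `+data:South +data_type:City`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PropertyIndexSketch {
    // One "document" per Property: its key, its searchable tokens,
    // and the name of its left-hand side.
    static final class Doc {
        final String dbKey;
        final String dataType;
        final Set<String> data;

        Doc(String dbKey, String dataType, String... tokens) {
            this.dbKey = dbKey;
            this.dataType = dataType;
            this.data = new HashSet<>(Arrays.asList(tokens));
        }
    }

    // Returns the db_keys of documents whose data contains the term.
    // A non-null dataType acts as a second required (MUST) condition.
    static List<String> search(List<Doc> index, String term, String dataType) {
        List<String> hits = new ArrayList<>();
        for (Doc d : index) {
            boolean termMatches = d.data.contains(term);
            boolean typeMatches = dataType == null || d.dataType.equals(dataType);
            if (termMatches && typeMatches) {
                hits.add(d.dbKey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<>();
        index.add(new Doc("1395", "State", "California", "Sacramento", "94203", "South", "Main"));
        index.add(new Doc("2405", "City", "California", "Sacramento", "94203", "South", "Main"));

        System.out.println(search(index, "South", null));   // [1395, 2405]
        System.out.println(search(index, "South", "City")); // [2405]
    }
}
```

The point of the model is that the expensive token set is attached to each document only once, under a single field name, and the left-hand side becomes a cheap one-term filter rather than a second copy of the tokens.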

    As an even BETTER bonus, this will then mean that all of your searchable
    values will now have relevance to each other at scoring time, which is
    quite useful in the long run.

    Hope this helps you out,

    Matt

    [email protected] wrote:
    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details so I was trying to keep it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

    State=California
    City=Sacramento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it simple.)

    Imagine that the above list represents the set of Properties that specify the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.
    2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.

    Document: City=Sacramento
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South". This will return both documents above.

    But if I'm only interested in any 'City' Properties that match the term 'South', I search the index on the "City" field for the term "South". This will only return the 'City=Sacramento' document above because it's the only Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly the same way as the 'keywords' field. Which is why I was wondering if there was a way to just create these fields as copies of the 'keywords' field.

    Here is a code sample where I'm creating the index. We're using Hibernate search to search the indexes, thus the "id" and "_hibernate_class" fields.

    Query q = em.createQuery("select p from Property p");

    List<Property> properties = q.getResultList();

    for (Property p : properties)
    {
        // Indexing property.
        Document doc = new Document();
        doc.add(new Field("id",
                          Integer.toString(p.getId()),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        doc.add(new Field("_hibernate_class",
                          Property.class.getCanonicalName(),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        TokenStream keywordsStream = new PropertyTokenStream(p);
        doc.add(new Field("keywords", keywordsStream));
        // Here is where I would like to add the second field that is a copy
        // of the "keywords" field just created above. Note: the call
        // p.getCharacteristic().getName() is getting the name of the
        // left-hand side of the Property as described above.
        TokenStream namedStream = new PropertyTokenStream(p);
        doc.add(new Field(p.getCharacteristic().getName(), namedStream));
        // Add the document once, after both fields are attached.
        propertyIndexWriter.addDocument(doc);
        keywordsStream.close();
        namedStream.close();
    }

    Hope that clears it up.
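
The underlying idea of producing the tokens once and replaying them for a second field can be modelled in plain Java. This is only a sketch under stated assumptions: it uses no Lucene classes (in particular it is not the TeeTokenFilter or SinkTokenizer mentioned elsewhere in this thread), and `TokenReplaySketch`, `expensiveTokens` and `drain` are hypothetical names. A counter stands in for the cost of the slow token source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TokenReplaySketch {
    // Counts how many tokens have been pulled from the slow source.
    static int expensiveCalls = 0;

    // Stand-in for a slow token source (hypothetical; not a Lucene TokenStream).
    static Iterator<String> expensiveTokens() {
        final Iterator<String> src =
                Arrays.asList("California", "Sacramento", "94203").iterator();
        return new Iterator<String>() {
            @Override
            public boolean hasNext() {
                return src.hasNext();
            }

            @Override
            public String next() {
                expensiveCalls++;
                return src.next();
            }
        };
    }

    // Pull every token exactly once and keep the results in memory.
    static List<String> drain(Iterator<String> it) {
        List<String> cached = new ArrayList<>();
        while (it.hasNext()) {
            cached.add(it.next());
        }
        return cached;
    }

    public static void main(String[] args) {
        List<String> cached = drain(expensiveTokens());

        // Both "fields" are fed from the cached list, not from the source.
        List<String> keywordsField = new ArrayList<>(cached);
        List<String> cityField = new ArrayList<>(cached);

        System.out.println(keywordsField.equals(cityField)); // true
        System.out.println(expensiveCalls);                  // 3
    }
}
```

The trade-off is memory: the cached list holds every token of the document at once, which matters for a document as large as the "State=California" case described below.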

    BTW, in case this seems like a strange way to index things, I will also add that we are doing it this way in order to impose a hierarchical structure on Properties. So my example above should really look like this:

    State=California
      City=Sacramento
        ZipCode=94203
          StreetName=South Main
            StreetNumber=1234
              Name=Joe Smith

    Use your imagination to visualize what the tree might look like with millions of people's addresses. Now imagine trying to tokenize the Document corresponding to "State=California". Each path thru the tree from root (State) to leaf (Name) represents a set of Properties that is used to index the "keywords" field in the "State=California" document. In other words it takes a long time to index. This is why I'm looking for a way to just copy one field to another.

    There is a lot more to our design to facilitate this hierarchical structure but this is probably more than you wanted to know. :)

    thanks in advance,
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


  • Bill Chesky at Jun 27, 2008 at 9:17 pm
    Hmmm, I think maybe I am missing something. In your design is the 'data' field indexed, i.e. searchable? Or is it an unindexed, stored field?

    I was thinking that both 'data' and 'data_type' were indexed and searchable.

    Maybe the confusion stems from the fact that for the Document corresponding to "State=California", we're not just indexing on the token 'California'. We're indexing on all the tokens from all the Properties in the set of Properties corresponding to a person's address. In my original example this would be: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    For the 'data_type' field I was thinking you were saying we'd index on a single token, namely 'State' (or whatever the left-hand side is).

    Does that make sense?
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


  • Matthew Hall at Jun 30, 2008 at 12:29 pm
    Sorry, didn't get this until this morning.

    Yes, both fields should be indexed and searchable, though the data_type
    one should likely be untokenized.

    Data should be indexed and tokenized with whatever appropriate Analyzer
    works for your data.
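
As a rough illustration of the difference between the two settings, here is a plain-Java sketch that involves no Lucene classes. The names `FieldTermsSketch`, `untokenized` and `tokenized` are hypothetical, and splitting on whitespace is an assumption standing in for whatever Analyzer is actually chosen.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FieldTermsSketch {
    // Untokenized: the whole value is treated as one searchable term.
    static List<String> untokenized(String value) {
        return Collections.singletonList(value);
    }

    // Tokenized: the value is broken into several terms.
    // Whitespace splitting is an assumption made for this sketch.
    static List<String> tokenized(String value) {
        return Arrays.asList(value.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(untokenized("StreetName")); // [StreetName]
        System.out.println(tokenized("South Main"));   // [South, Main]
    }
}
```

Keeping data_type as a single term means a filter on it has to match the stored value exactly, which suits a fixed set of names such as State or City.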

    As for what you're indexing, may I ask why you are doing it like that?

    I would have thought indexing each property separately (a separate doc)
    would have been sufficient for your needs, but if you can explain a bit
    more about your situation perhaps I can be more helpful on this matter?

    Matt

    [email protected] wrote:
    Hmmm, I think maybe I am missing something. In your design is the 'data' field indexed, i.e. searchable? Or is it an unindexed, stored field?

    I was thinking that both 'data' and 'data_type' were indexed and searchable.

    Maybe the confusion stems from the fact that for the Document corresponding to "State=California", we're not just indexing on the token 'California'. We're indexing on all the tokens from all the Properties in the set of Properties corresponding to a person's address. In my original example this would be: California, Sacremento, 94203, South, Main, 1234, Joe and Smith.

    For the 'data_type' field I was thinking you were saying we'd index on a single token, namely 'State' (or whatever the left-hand side is).

    Does that make sense?
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 3:33 PM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Yup, you're pretty much there.

    The only part I'm a bit confused about is what you've said in your data
    field there,

    I'm thinking you mean that for the data_type: "State", you would have
    the data entry of "California", right?

    If so, then yup, you are spot on ^^

    We use this technique all the time on our side, and its helped
    considerably. We then use the db_key to reference into a display time
    cache that holds all of the display information for the underlying
    object that we would ever want to present to the user. This allows our
    search time index to be very concise, and as a result nearly every
    search we hit it with is subsecond, which is a nice place to be ^^

    Matt

    [email protected] wrote:
    Matthew,

    Thanks for the reply. This looks very interesting. If I'm understanding correctly your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

    Document: State=California
    Field: 'db_key'='1395' (primary key into relational table, correct?)
    Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'data_type' indexed by 'State'

    Document: City=Sacremento
    Field: 'db_key'='2405'
    Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'data_type' indexed by 'City'

    Then my query for all Properties would be:

    +data:South

    My query for only 'City' Properties would be:

    +data:South +data_type:City

    Is that right?

    I think that would work. Very nice. Thank you very much!!!!
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 11:49 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    I'm not sure if this is helpful, but I do something VERY similar to this
    in my project.

    So, for the example you are citing I would design my index as follows:

    db_key, data, data_type

    Where the data_type is some sort of value representing the thing that's
    on the left hand side of your property relationship there.

    So, then in order to satisfy your search, the queries become quite simple:

    The search for everything simply searches against the data field in this
    index, wheras the search for a specific data_type + searchterm becomes a
    simple boolean query, that has a MUST clause for the data_type value.

    As an even BETTER bonus, this will then mean that all of your searchable
    values will now have relevance to each other at scoring time, which is
    quite useful in the long run.

    Hope this helps you out,

    Matt

    [email protected] wrote:

    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details so I was trying to keep it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

    State=California
    City=Sacremento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it simple.)

    Imagine that the above list represents the the set of Properties that specify the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacremento, 94203, South, Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.
    2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.

    Document: City=Sacremento
    Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South". This will return both documents above.

    But if I'm only interested in any 'City' Properties that match the term 'South' I search the index on the "City" field for the term "South". This will only return the 'City=Sacremento' document above because it's the only Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly the same way as the 'keywords' field. Which is why I was wondering if there was a way to just create these fields as copies of the 'keywords' field.

    Here is a code sample where I'm creating the index. We're using Hibernate search to search the indexes, thus the "id" and "_hibernate_class" fields.

    Query q = em.createQuery("select p from Property p");

    List<Property> properties = q.getResultList();

    for (Property p : properties)
    {
    // Indexing property.
    Document doc = new Document();
    doc.add(new Field("id",
    Integer.toString(p.getId()),
    Field.Store.YES,
    Field.Index.UN_TOKENIZED));
    doc.add(new Field("_hibernate_class",
    Property.class.getCanonicalName(),
    Field.Store.YES,
    Field.Index.UN_TOKENIZED));
    TokenStream tokenStream = new PropertyTokenStream(p);
    doc.add(new Field("keywords", tokenStream));
    propertyIndexWriter.addDocument(doc);
    tokenStream.close();
    // Here is where I would like to add the second field that is a copy
    // of the "keywords" field just created above. Note: the call
    // p.getCharacteristic().getName() is getting the name of the
    // left-hand side of the Property as described above.
    TokenStream tokenStream = new PropertyTokenStream(p);
    doc.add(new Field(p.getCharacteristic().getName(), tokenStream));
    propertyIndexWriter.addDocument(doc);
    tokenStream.close();
    }

    Hope that clears it up.

    BTW, in case this seems like a strange way to index things, I will also add that we are doing it this way in order to impose a heirarchical structure on Properties. So my example above should really look like this:

    State=California
    City=Sacremento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Use your imagination to visualize what the tree might look like with millions of peoples' addresses. Now imagine trying to tokenize the Document corresponding to "State=California". Each path thru the tree from root (State) to leaf (Name) represents a set of Properties that is used to index the "keywords" field in the "State=California" document. In other words it takes a long time to index. This is why I'm looking for a way to just copy one field to another.

    There is a lot more to our design to facilitate this hierarchical structure but this is probably more than you wanted to know. :)

    thanks in advance,
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Grant Ingersoll
    Sent: Friday, June 27, 2008 7:26 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?


    On Jun 27, 2008, at 12:01 AM, <[email protected]> wrote:


    Hello Lucene Gurus,



    I'm new to Lucene so sorry if this question is basic or naïve.



    I have a Document to which I want to add a Field named, say, "foo"
    that is tokenized, indexed and unstored. I am using the
    "Field(String name, TokenStream tokenStream)" constructor to create
    it. The TokenStream may take a fairly long time to return all its
    tokens.


    Can you share some code here? What's the reasoning behind using it
    (not saying it's wrong, just wondering what led you to it)? Are you
    just loading it up from a file, string or something or do you have
    another reason?




    Now for querying reasons I want to add another Field named, say,
    "bar", that is tokenized and indexed in exactly the same way as
    "foo". I could just pass it the same TokenStream that I used to
    create "foo" but since it takes so long to return all its tokens, I
    was wondering if there is a way to say, create "bar" as a copy of
    "foo". I looked thru the javadoc but didn't see anything.



    By exactly the same, do you really mean exactly the same? What's the
    point of that? What are the "querying reasons"?

    You may want to look at the TeeTokenFilter and the SinkTokenizer, but
    I guess I'd like to know more about what's going on before fully
    recommending anything.
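
    A minimal sketch of the general idea behind that suggestion: walk the expensive token source once, buffer what it produces, and hand each field its own replay of the buffer. It deliberately uses only the Java standard library rather than Lucene classes, so every name in it (TokenReplayDemo, expensiveTokens, drain) is a hypothetical stand-in and not part of the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of "tokenize once, replay for each field".
// None of these names are Lucene API.
public class TokenReplayDemo {

    /** Counts how many times the slow source has been walked. */
    static int expensivePasses = 0;

    /** Stand-in for a slow token source such as the PropertyTokenStream in the thread. */
    static Iterator<String> expensiveTokens() {
        expensivePasses++;
        return Arrays.asList("California", "Sacramento", "94203").iterator();
    }

    /** Drains the source exactly once into a reusable buffer. */
    static List<String> drain(Iterator<String> source) {
        List<String> buffer = new ArrayList<>();
        while (source.hasNext()) {
            buffer.add(source.next());
        }
        return buffer;
    }

    public static void main(String[] args) {
        List<String> cached = drain(expensiveTokens());

        // Each "field" gets a fresh iterator over the same cached tokens,
        // so the slow source is never consulted a second time.
        Iterator<String> keywordsTokens = cached.iterator();
        Iterator<String> cityTokens = cached.iterator();

        while (keywordsTokens.hasNext()) {
            System.out.println("keywords: " + keywordsTokens.next());
        }
        while (cityTokens.hasNext()) {
            System.out.println("City: " + cityTokens.next());
        }
        System.out.println("passes over the slow source: " + expensivePasses);
    }
}
```

    The trade-off is memory: the whole token list for a document is held at once, which may matter if a single document produces a very large number of tokens.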




    Is this possible in Lucene or do I just have to bite the bullet
    and build the new Field using the same TokenStream again?

    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194
    Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy




    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ








    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]







    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    [email protected]
    (207) 288-6012



  • Bill Chesky at Jun 30, 2008 at 1:52 pm
    Matthew,

    It has to do with the fact that we're trying to represent these Property entities hierarchically. We are displaying them in a tree structure, similar to the way Windows Explorer displays the directories and files in your file system. E.g. all the states would be at the root level. If you expanded a particular state you would see all the cities in that state, etc.

    If the user does a search we want to filter or "reduce" the tree. E.g. imagine you search on the term 'Smith'. Since it's a safe bet that there's somebody with the last name of Smith in all fifty states, all fifty states would show up at the root level. On the other hand, suppose there's one guy in the whole country with the last name of 'Fleebleflabble' and he lives in Michigan. If I search on that term I would expect only one state, namely Michigan, to show up at the root level. Each level in the hierarchy is filtered by the specified search terms in this way.

    Searches are not limited to people's names though. We want to reduce the tree by matches on ANY field in the Properties from 'State' to 'Name'. So for example, a search on 'Smith' would return matches for everybody that lived in a city named 'Smith City' or on a street named 'Smith Avenue', etc.

    This doesn't make a lot of sense for people and addresses, I admit. I just used that as an easy-to-follow example. But it does make sense for the data we're storing. And BTW, maybe you can see a few holes in this approach. There's a bit more to it than I've described above. We have had to get a little creative with other documents and fields in order for it to work correctly. I'd be happy to elaborate if anybody is interested. There may be better ways to do it. Like I said, I'm fairly new to Lucene. Was just trying to keep it simple.

    --
    Bill

    -----Original Message-----
    From: Matthew Hall
    Sent: Monday, June 30, 2008 8:26 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Sorry, didn't get this until this morning.

    Yes, both fields should be indexed and searchable, though the data_type
    one should likely be untokenized.

    Data should be indexed and tokenized with whatever appropriate Analyzer
    works for your data.

    As for what you're indexing, may I ask why you are doing it like that?

    I would have thought indexing each property separately (a separate doc)
    would have been sufficient for your needs, but if you can explain a bit
    more about your situation perhaps I can be more helpful on this matter?

    Matt

    [email protected] wrote:
    Hmmm, I think maybe I am missing something. In your design is the 'data' field indexed, i.e. searchable? Or is it an unindexed, stored field?

    I was thinking that both 'data' and 'data_type' were indexed and searchable.

    Maybe the confusion stems from the fact that for the Document corresponding to "State=California", we're not just indexing on the token 'California'. We're indexing on all the tokens from all the Properties in the set of Properties corresponding to a person's address. In my original example this would be: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    For the 'data_type' field I was thinking you were saying we'd index on a single token, namely 'State' (or whatever the left-hand side is).

    Does that make sense?
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 3:33 PM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Yup, you're pretty much there.

    The only part I'm a bit confused about is what you've said in your data
    field there,

    I'm thinking you mean that for the data_type: "State", you would have
    the data entry of "California", right?

    If so, then yup, you are spot on ^^

    We use this technique all the time on our side, and it's helped
    considerably. We then use the db_key to reference into a display time
    cache that holds all of the display information for the underlying
    object that we would ever want to present to the user. This allows our
    search time index to be very concise, and as a result nearly every
    search we hit it with is subsecond, which is a nice place to be ^^

    Matt

    [email protected] wrote:
    Matthew,

    Thanks for the reply. This looks very interesting. If I'm understanding correctly your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

    Document: State=California
    Field: 'db_key'='1395' (primary key into relational table, correct?)
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'State'

    Document: City=Sacramento
    Field: 'db_key'='2405'
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'City'

    Then my query for all Properties would be:

    +data:South

    My query for only 'City' Properties would be:

    +data:South +data_type:City

    Is that right?

    I think that would work. Very nice. Thank you very much!!!!
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 11:49 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    I'm not sure if this is helpful, but I do something VERY similar to this
    in my project.

    So, for the example you are citing I would design my index as follows:

    db_key, data, data_type

    Where the data_type is some sort of value representing the thing that's
    on the left hand side of your property relationship there.

    So, then in order to satisfy your search, the queries become quite simple:

    The search for everything simply searches against the data field in this
    index, whereas the search for a specific data_type + search term becomes a
    simple boolean query that has a MUST clause for the data_type value.

    As an even BETTER bonus, this will then mean that all of your searchable
    values will now have relevance to each other at scoring time, which is
    quite useful in the long run.
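
    The layout described above can be sketched as a small in-memory model: one shared 'data' field per document plus a 'data_type' discriminator, with the type acting as a required filter only when it is supplied. This is an illustration in plain Java under stated assumptions, not Lucene code; the class and method names (DataTypeFilterDemo, Doc, search, sampleIndex) are hypothetical, and the db_key values are the ones used as examples elsewhere in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of the db_key / data / data_type layout.
// It mimics the two query shapes from the thread; it is not Lucene API.
public class DataTypeFilterDemo {

    /** One indexed document: a key, its type, and the tokens it is searchable by. */
    static final class Doc {
        final String dbKey;
        final String dataType;
        final List<String> data;

        Doc(String dbKey, String dataType, List<String> data) {
            this.dbKey = dbKey;
            this.dataType = dataType;
            this.data = data;
        }
    }

    /** Builds the two example documents, both indexed by the same token set. */
    static List<Doc> sampleIndex() {
        List<String> tokens = Arrays.asList(
                "California", "Sacramento", "94203", "South", "Main", "1234", "Joe", "Smith");
        List<Doc> index = new ArrayList<>();
        index.add(new Doc("1395", "State", tokens));
        index.add(new Doc("2405", "City", tokens));
        return index;
    }

    /**
     * Returns the db_keys of documents whose data contains the term.
     * A null dataType means "any type"; otherwise the type must also match,
     * which plays the role of the required data_type clause.
     */
    static List<String> search(List<Doc> index, String term, String dataType) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : index) {
            boolean termMatches = doc.data.contains(term);
            boolean typeMatches = dataType == null || dataType.equals(doc.dataType);
            if (termMatches && typeMatches) {
                hits.add(doc.dbKey);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = sampleIndex();
        System.out.println(search(index, "South", null));   // prints [1395, 2405]
        System.out.println(search(index, "South", "City")); // prints [2405]
    }
}
```

    The point of the design is that the token set is indexed into a single field per document, so nothing has to be tokenized twice to support the type-restricted query.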

    Hope this helps you out,

    Matt

    [email protected] wrote:

    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details so I was trying to keep it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

    State=California
    City=Sacramento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it simple.)

    Imagine that the above list represents the set of Properties that specify the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.
    2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.

    Document: City=Sacramento
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South". This will return both documents above.

    But if I'm only interested in any 'City' Properties that match the term 'South', I search the index on the "City" field for the term "South". This will only return the 'City=Sacramento' document above because it's the only Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly the same way as the 'keywords' field. Which is why I was wondering if there was a way to just create these fields as copies of the 'keywords' field.

  • Matthew Hall at Jun 30, 2008 at 2:00 pm
    Hrm, sorry, then I'm not sure how much more help I'm going to be able to
    be on this one. I have to index things that have a DAG structure
    (treelike), but in order to get that functionality into my search I
    simply flatten out my DAG, so any single term knows all of its children,
    but loses the structure of those children beyond that. This approach
    works for my data, but it doesn't sound like it will for yours.

    So, while I think you can still use the general technique that I showed
    you on this one, I have a feeling you are going to need to customize it
    some for your domain.

    Best of luck, and if there's anything else I can help with let me know.

    Matt

    [email protected] wrote:
    Matthew,

    It has to do with the fact that we're trying to represent these Property entitities hierarchically. We are displaying them in a tree structure, similar to the way Windows Explorer displays directories and files your file system. E.g. all the states would be at the root level. If you expanded a particular state you would see all the cities in that state, etc.

    If the user does a search we want to filter or "reduce" the tree. E.g. imagine you search on the term 'Smith'. Well since it's a safe bet to assume that there's somebody with the last name of Smith in all fifty states, then all fifty states would show up at the root level. On the other hand, suppose there's one guy in the whole country named with the last name of 'Fleebleflabble' and he lives in Michigan. If I search on that term I would expect only one state, namely Michigan to show up at the root level. Each level in the heirarchy is filtered by the search specified terms in this way.

    Searches are not limited to people's names though. We want to reduce the tree by matches on ANY field in the Properties from 'State' to 'Name'. So for example, a seach on 'Smith' would return matches for everybody that lived in a city named 'Smith City' or on a street named 'Smith Avenue', etc.

    This doesn't make a lot of sense for people and addresses, I admit. I just used that as an easy follow example. But it does make sense for the data we're storing. And BTW, maybe you can see a few holes in this approach. There's a bit more to it than I've described above. We have had to get a little creative with other documents and fields in order for it work correctly. I'd be happy to elaborate if anybody is interested. There may be better ways to do it. Like I said I'm fairly new to Lucene. Was just trying to keep it simple.

    --
    Bill

    -----Original Message-----
    From: Matthew Hall
    Sent: Monday, June 30, 2008 8:26 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Sorry, didn't get this until this morning.

    Yes, both fields should be indexed and searchable, though the data_type
    one should likely be untokenized.

    Data should be indexed and tokenized with whatever appropriate Analyzer
    works for your data.

    As for what your indexing, may I ask why you are doing it like that?

    I would have thought indexing each property seperately (a seperate doc)
    would have been sufficient for your needs, but if you can explain a bit
    more about your situation perhaps I can be more helpful on this matter?

    Matt

    [email protected] wrote:
    Hmmm, I think maybe I am missing something. In your design is the 'data' field indexed, i.e. searchable? Or is it an unindexed, stored field?

    I was thinking that both 'data' and 'data_type' were indexed and searchable.

    Maybe the confusion stems from the fact that for the Document corresponding to "State=California", we're not just indexing on the token 'California'. We're indexing on all the tokens from all the Properties in the set of Properties corresponding to a person's address. In my original example this would be: California, Sacremento, 94203, South, Main, 1234, Joe and Smith.

    For the 'data_type' field I was thinking you were saying we'd index on a single token, namely 'State' (or whatever the left-hand side is).

    Does that make sense?
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 3:33 PM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Yup, you're pretty much there.

    The only part I'm a bit confused about is what you've said in your data
    field there,

    I'm thinking you mean that for the data_type: "State", you would have
    the data entry of "California", right?

    If so, then yup, you are spot on ^^

    We use this technique all the time on our side, and its helped
    considerably. We then use the db_key to reference into a display time
    cache that holds all of the display information for the underlying
    object that we would ever want to present to the user. This allows our
    search time index to be very concise, and as a result nearly every
    search we hit it with is subsecond, which is a nice place to be ^^

    Matt

    [email protected] wrote:

    Matthew,

    Thanks for the reply. This looks very interesting. If I'm understanding correctly your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

    Document: State=California
    Field: 'db_key'='1395' (primary key into relational table, correct?)
    Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'data_type' indexed by 'State'

    Document: City=Sacremento
    Field: 'db_key'='2405'
    Field: 'data' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'data_type' indexed by 'City'

    Then my query for all Properties would be:

    +data:South

    My query for only 'City' Properties would be:

    +data:South +data_type:City

    Is that right?

    I think that would work. Very nice. Thank you very much!!!!
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 11:49 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    I'm not sure if this is helpful, but I do something VERY similar to this
    in my project.

    So, for the example you are citing I would design my index as follows:

    db_key, data, data_type

    Where the data_type is some sort of value representing the thing that's
    on the left hand side of your property relationship there.

    So, then in order to satisfy your search, the queries become quite simple:

    The search for everything simply searches against the data field in this
    index, wheras the search for a specific data_type + searchterm becomes a
    simple boolean query, that has a MUST clause for the data_type value.

    As an even BETTER bonus, this will then mean that all of your searchable
    values will now have relevance to each other at scoring time, which is
    quite useful in the long run.

    Hope this helps you out,

    Matt

    [email protected] wrote:


    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details so I was trying to keep it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

    State=California
    City=Sacremento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it simple.)

    Imagine that the above list represents the the set of Properties that specify the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacremento, 94203, South, Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.
    2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.

    Document: City=Sacremento
    Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South". This will return both documents above.

    But if I'm only interested in any 'City' Properties that match the term 'South' I search the index on the "City" field for the term "South". This will only return the 'City=Sacremento' document above because it's the only Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly the same way as the 'keywords' field. Which is why I was wondering if there was a way to just create these fields as copies of the 'keywords' field.

    Here is a code sample where I'm creating the index. We're using Hibernate search to search the indexes, thus the "id" and "_hibernate_class" fields.

    Query q = em.createQuery("select p from Property p");

    List<Property> properties = q.getResultList();

    for (Property p : properties)
    {
    // Indexing property.
    Document doc = new Document();
    doc.add(new Field("id",
    Integer.toString(p.getId()),
    Field.Store.YES,
    Field.Index.UN_TOKENIZED));
    doc.add(new Field("_hibernate_class",
    Property.class.getCanonicalName(),
    Field.Store.YES,
    Field.Index.UN_TOKENIZED));
    TokenStream tokenStream = new PropertyTokenStream(p);
    doc.add(new Field("keywords", tokenStream));
    propertyIndexWriter.addDocument(doc);
    tokenStream.close();
    // Here is where I would like to add the second field that is a copy
    // of the "keywords" field just created above. Note: the call
    // p.getCharacteristic().getName() is getting the name of the
    // left-hand side of the Property as described above.
    TokenStream tokenStream = new PropertyTokenStream(p);
    doc.add(new Field(p.getCharacteristic().getName(), tokenStream));
    propertyIndexWriter.addDocument(doc);
    tokenStream.close();
    }

    Hope that clears it up.

    BTW, in case this seems like a strange way to index things, I will also add that we are doing it this way in order to impose a heirarchical structure on Properties. So my example above should really look like this:

    State=California
    City=Sacremento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Use your imagination to visualize what the tree might look like with millions of peoples' addresses. Now imagine trying to tokenize the Document corresponding to "State=California". Each path thru the tree from root (State) to leaf (Name) represents a set of Properties that is used to index the "keywords" field in the "State=California" document. In other words it takes a long time to index. This is why I'm looking for a way to just copy one field to another.

    There is a lot more to our design to facilitate this hierarchical structure but this is probably more than you wanted to know. :)

    thanks in advance,
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Grant Ingersoll
    Sent: Friday, June 27, 2008 7:26 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?


    On Jun 27, 2008, at 12:01 AM, <[email protected]> <[email protected]
    wrote:



    Hello Lucene Gurus,



    I'm new to Lucene so sorry if this question basic or naïve.



    I have a Document to which I want to add a Field named, say, "foo"
    that is tokenized, indexed and unstored. I am using the
    "Field(String name, TokenStream tokenStream)" constructor to create
    it. The TokenStream may take a fairly long time to return all its
    tokens.



    Can you share some code here? What's the reasoning behind using it
    (not saying it's wrong, just wondering what led you to it)? Are you
    just loading it up from a file, string or something or do you have
    another reason?





    Now for querying reasons I want to add another Field named, say,
    "bar", that is tokenized and indexed in exactly the same way as
    "foo". I could just pass it the same TokenStream that I used to
    create "foo" but since it takes so long to return all its tokens, I
    was wondering if there is a way to say, create "bar" as a copy of
    "foo". I looked thru the javadoc but didn't see anything.




    By exactly the same, do you really mean exactly the same? What's the
    point of that? What are the "querying reasons"?

    You may want to look at the TeeTokenFilter and the SinkTokenizer, but
    I guess I'd like to know more about what's going on before fully
    recommending anything.





    Is this possible in Lucene or do I just have to bite the bullet
    and build the new Field using the same TokenStream again?

    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194
    Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy





    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ








    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]





    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    [email protected]
    (207) 288-6012



  • Bill Chesky at Jun 30, 2008 at 3:09 pm
    Actually, you've been a big help. Your 'data_type' field suggestion I think will work for our app and obviates the need for the Field copy functionality that I was originally asking about. Just having one problem with it still, but I think it has to do with my limited knowledge of how analyzers work. If I can't figure it out, I'll post a question in a different thread.

    Thanks!
    --
    Bill


    -----Original Message-----
    From: Matthew Hall
    Sent: Monday, June 30, 2008 9:57 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Hrm, sorry then I'm not sure how much more help I'm going to be able to
    be on this one. I have to index things that have a DAG structure
    (treelike), but in order to get that functionality into my search I
    simply flatten out my DAG, so any single term knows all of its children,
    but loses the structure of those children beyond that. This approach
    works for my data, but it doesn't sound like it will for yours.

    So, while I think you can still use the general technique that I showed
    you on this one, I have a feeling you are going to need to customize it
    some for your domain.

    Best of luck, and if there's anything else I can help with let me know.

    Matt

    [email protected] wrote:
    Matthew,

    It has to do with the fact that we're trying to represent these Property entities hierarchically. We are displaying them in a tree structure, similar to the way Windows Explorer displays the directories and files in your file system. E.g. all the states would be at the root level. If you expanded a particular state you would see all the cities in that state, etc.

    If the user does a search we want to filter or "reduce" the tree. E.g. imagine you search on the term 'Smith'. Since it's a safe bet that there's somebody with the last name of Smith in all fifty states, all fifty states would show up at the root level. On the other hand, suppose there's one guy in the whole country with the last name of 'Fleebleflabble' and he lives in Michigan. If I search on that term I would expect only one state, namely Michigan, to show up at the root level. Each level in the hierarchy is filtered by the specified search terms in this way.

    Searches are not limited to people's names though. We want to reduce the tree by matches on ANY field in the Properties from 'State' to 'Name'. So for example, a search on 'Smith' would also return matches for everybody that lived in a city named 'Smith City' or on a street named 'Smith Avenue', etc.
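
The tree "reduction" described above can be illustrated with a small in-memory sketch. This is plain Java rather than Lucene code, and it is only a model of the idea under stated assumptions: the class name `TreeReducer`, the method `visibleNodes`, and the representation of each root-to-leaf path as a list of values are all hypothetical. It also identifies a node by its value alone, so two cities with the same name in different states would collapse into one entry.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TreeReducer {
    /**
     * Each path lists the values from root (State) to leaf (Name).
     * Returns the distinct values at the given depth that lie on at least one
     * path containing the search term as a whole word (case-insensitive).
     */
    static Set<String> visibleNodes(List<List<String>> paths, String term, int depth) {
        Set<String> visible = new LinkedHashSet<>();
        for (List<String> path : paths) {
            if (depth < path.size() && pathMatches(path, term)) {
                visible.add(path.get(depth));
            }
        }
        return visible;
    }

    /** True if any value on the path contains the term as a whitespace-separated word. */
    private static boolean pathMatches(List<String> path, String term) {
        for (String value : path) {
            for (String word : value.split("\\s+")) {
                if (word.equalsIgnoreCase(term)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Calling `visibleNodes(paths, "Smith", 0)` would list the states to show at the root level, and the same call with depth 1 would list the cities.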

    This doesn't make a lot of sense for people and addresses, I admit. I just used that as an easy-to-follow example. But it does make sense for the data we're storing. And BTW, maybe you can see a few holes in this approach. There's a bit more to it than I've described above. We have had to get a little creative with other documents and fields in order for it to work correctly. I'd be happy to elaborate if anybody is interested. There may be better ways to do it. Like I said, I'm fairly new to Lucene. Was just trying to keep it simple.

    --
    Bill

    -----Original Message-----
    From: Matthew Hall
    Sent: Monday, June 30, 2008 8:26 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Sorry, didn't get this until this morning.

    Yes, both fields should be indexed and searchable, though the data_type
    one should likely be untokenized.

    Data should be indexed and tokenized with whatever appropriate Analyzer
    works for your data.

    As for what you're indexing, may I ask why you are doing it like that?

    I would have thought indexing each property separately (a separate doc)
    would have been sufficient for your needs, but if you can explain a bit
    more about your situation perhaps I can be more helpful on this matter?

    Matt

    [email protected] wrote:
    Hmmm, I think maybe I am missing something. In your design is the 'data' field indexed, i.e. searchable? Or is it an unindexed, stored field?

    I was thinking that both 'data' and 'data_type' were indexed and searchable.

    Maybe the confusion stems from the fact that for the Document corresponding to "State=California", we're not just indexing on the token 'California'. We're indexing on all the tokens from all the Properties in the set of Properties corresponding to a person's address. In my original example this would be: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    For the 'data_type' field I was thinking you were saying we'd index on a single token, namely 'State' (or whatever the left-hand side is).

    Does that make sense?
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 3:33 PM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    Yup, you're pretty much there.

    The only part I'm a bit confused about is what you've said in your data
    field there,

    I'm thinking you mean that for the data_type: "State", you would have
    the data entry of "California", right?

    If so, then yup, you are spot on ^^

    We use this technique all the time on our side, and it's helped
    considerably. We then use the db_key to reference into a display time
    cache that holds all of the display information for the underlying
    object that we would ever want to present to the user. This allows our
    search time index to be very concise, and as a result nearly every
    search we hit it with is subsecond, which is a nice place to be ^^

    Matt

    [email protected] wrote:

    Matthew,

    Thanks for the reply. This looks very interesting. If I'm understanding correctly your db_key, data and data_type are Fields within the Document, correct? So is this how you envision it?

    Document: State=California
    Field: 'db_key'='1395' (primary key into relational table, correct?)
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'State'

    Document: City=Sacramento
    Field: 'db_key'='2405'
    Field: 'data' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'data_type' indexed by 'City'

    Then my query for all Properties would be:

    +data:South

    My query for only 'City' Properties would be:

    +data:South +data_type:City

    Is that right?

    I think that would work. Very nice. Thank you very much!!!!
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Matthew Hall
    Sent: Friday, June 27, 2008 11:49 AM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    I'm not sure if this is helpful, but I do something VERY similar to this
    in my project.

    So, for the example you are citing I would design my index as follows:

    db_key, data, data_type

    Where the data_type is some sort of value representing the thing that's
    on the left hand side of your property relationship there.

    So, then in order to satisfy your search, the queries become quite simple:

    The search for everything simply searches against the data field in this
    index, whereas the search for a specific data_type + searchterm becomes a
    simple boolean query that has a MUST clause for the data_type value.

    As an even BETTER bonus, this will then mean that all of your searchable
    values will now have relevance to each other at scoring time, which is
    quite useful in the long run.
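
To make the two query shapes concrete, here is a small in-memory model of the db_key / data / data_type layout. It is a sketch in plain Java, not Lucene API: the names `PropertyIndexModel`, `Doc` and `search` are hypothetical, and a `null` type stands in for "no data_type clause".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PropertyIndexModel {
    /** One indexed entry: a key back into the database, the property name, and the searchable tokens. */
    static final class Doc {
        final String dbKey;
        final String dataType;
        final Set<String> data;

        Doc(String dbKey, String dataType, String... dataTokens) {
            this.dbKey = dbKey;
            this.dataType = dataType;
            this.data = new HashSet<>(Arrays.asList(dataTokens));
        }
    }

    /**
     * Returns the db keys of documents whose data contains the term.
     * A null requiredType means "any data_type"; otherwise the type must
     * match as well, which plays the role of the MUST clause.
     */
    static List<String> search(List<Doc> index, String term, String requiredType) {
        List<String> hits = new ArrayList<>();
        for (Doc d : index) {
            boolean termMatches = d.data.contains(term);
            boolean typeMatches = requiredType == null || requiredType.equals(d.dataType);
            if (termMatches && typeMatches) {
                hits.add(d.dbKey);
            }
        }
        return hits;
    }
}
```

In Lucene itself the second case would presumably be a BooleanQuery holding two TermQuery clauses, each added with BooleanClause.Occur.MUST; the model above only mirrors that logic.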

    Hope this helps you out,

    Matt

    [email protected] wrote:


    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard to explain without going into a lot of gory details so I was trying to keep it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the entities we're trying to index is something called a Property. Think of a Property kind of like the java.util.Properties class, i.e. a name/value pair. So some examples of Properties might be:

    State=California
    City=Sacramento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it simple.)

    Imagine that the above list represents the set of Properties that specify the address for a single person, Joe Smith. Each Property in the set will be indexed by the values on the right-hand side of all the other name/value pairs in the set, i.e.: California, Sacramento, 94203, South, Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless of its left-hand side. For this we want to create a field in EVERY Document called "keywords" and index it by the right-hand side values as described above.
    2) retrieve every Property with a given left-hand side that matches the specified search terms. For example, find all the 'City' Properties that match the term 'South'. For this we want to create a field with the name of the left-hand side (e.g. State, City, ZipCode, etc.) but only in those Documents that correspond to a Property with that left-hand side. Again this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacramento', '94203', etc.

    Document: City=Sacramento
    Field: 'keywords' indexed by 'California', 'Sacramento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacramento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I search the index on the "keywords" field for the term "South". This will return both documents above.

    But if I'm only interested in any 'City' Properties that match the term 'South' I search the index on the "City" field for the term "South". This will only return the 'City=Sacramento' document above because it's the only Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly the same way as the 'keywords' field. Which is why I was wondering if there was a way to just create these fields as copies of the 'keywords' field.

    Here is a code sample where I'm creating the index. We're using Hibernate Search to search the indexes, thus the "id" and "_hibernate_class" fields.

    Query q = em.createQuery("select p from Property p");

    List<Property> properties = q.getResultList();

    for (Property p : properties)
    {
        // Indexing property.
        Document doc = new Document();
        doc.add(new Field("id",
                          Integer.toString(p.getId()),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        doc.add(new Field("_hibernate_class",
                          Property.class.getCanonicalName(),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        TokenStream keywordsStream = new PropertyTokenStream(p);
        doc.add(new Field("keywords", keywordsStream));
        // Here is where I would like to add the second field that is a copy
        // of the "keywords" field just created above. Note: the call
        // p.getCharacteristic().getName() is getting the name of the
        // left-hand side of the Property as described above.
        TokenStream propertyStream = new PropertyTokenStream(p);
        doc.add(new Field(p.getCharacteristic().getName(), propertyStream));
        // Add the document once, after both fields are attached, then
        // close both streams.
        propertyIndexWriter.addDocument(doc);
        keywordsStream.close();
        propertyStream.close();
    }

    Hope that clears it up.

    BTW, in case this seems like a strange way to index things, I will also add that we are doing it this way in order to impose a hierarchical structure on Properties. So my example above should really look like this:

    State=California
      City=Sacramento
        ZipCode=94203
          StreetName=South Main
            StreetNumber=1234
              Name=Joe Smith

    Use your imagination to visualize what the tree might look like with millions of people's addresses. Now imagine trying to tokenize the Document corresponding to "State=California". Each path thru the tree from root (State) to leaf (Name) represents a set of Properties that is used to index the "keywords" field in the "State=California" document. In other words, it takes a long time to index. This is why I'm looking for a way to just copy one field to another.

    There is a lot more to our design to facilitate this hierarchical structure but this is probably more than you wanted to know. :)

    thanks in advance,
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


  • Erick Erickson at Jun 27, 2008 at 5:38 pm
    How sure are you that the TokenStream is that expensive? But
    assuming you are AND that the values for these properties
    aren't that big, the simple-minded approach that comes to my
    simple mind is to just iterate through the stream yourself, assemble
    a string from the returned tokens and pass the string to the two add
    calls.

    This might be worth it if your tokenizer is going to the DB or something....
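
A minimal sketch of that idea follows, written in plain Java so it stands alone. The expensive TokenStream is represented here by an ordinary `Iterator<String>`, and the names `TokenBuffer`, `drain` and `toFieldText` are hypothetical, not part of Lucene. The point is only that the costly source is walked once and the result is reused.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class TokenBuffer {
    /** Drains the (expensive) token source exactly once and keeps the tokens in memory. */
    static List<String> drain(Iterator<String> expensiveTokens) {
        List<String> buffered = new ArrayList<>();
        while (expensiveTokens.hasNext()) {
            buffered.add(expensiveTokens.next());
        }
        return buffered;
    }

    /** Joins the buffered tokens with single spaces so the same text can be handed to several fields. */
    static String toFieldText(List<String> tokens) {
        return String.join(" ", tokens);
    }
}
```

The resulting string could then be passed to both add calls as a tokenized string Field, assuming the analyzer in use splits it back into the same tokens (a whitespace-based one, for instance). Tokens that themselves contain spaces would not survive this round trip, and the whole token list has to fit in memory.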

    Best
    Erick

  • Bill Chesky at Jun 27, 2008 at 6:44 pm
    Erick,

    Thanks for the response. I'm very sure the TokenStream is expensive. Not always, but in some cases, yes, it can take a long time to complete. However, I do like your approach. I'm going to try a different approach suggested by another poster first, but this is very interesting.

    Thank you!
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy


    -----Original Message-----
    From: Erick Erickson
    Sent: Friday, June 27, 2008 1:37 PM
    To: [email protected]
    Subject: Re: Can you create a Field that is a copy of another Field?

    How sure are you that the TokenStream is that expensive? But
    assuming you are AND that the values for these properties
    aren't that big, the simple-minded approach that comes to my
    simple mind is to just iterate through the stream yourself, assemble
    a string from the returned tokens and pass the string to the two add
    calls.

    This might be worth it if your tokenizer is going to the DB or something....

    Best
    Erick

    On Fri, Jun 27, 2008 at 10:56 AM, wrote:

    Grant,

    Thanks for the reply. What we're trying to do is kind of esoteric and hard
    to explain without going into a lot of gory details so I was trying to keep
    it simple. But I'll try to summarize.

    We're trying to index entities in a relational database. One of the
    entities we're trying to index is something called a Property. Think of a
    Property kind of like the java.util.Properties class, i.e. a name/value
    pair. So some examples of Properties might be:

    State=California
    City=Sacremento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Etc., etc.

    (Note: this isn't the type of data we're storing... just trying to keep it
    simple.)

    Imagine that the above list represents the the set of Properties that
    specify the address for a single person, Joe Smith. Each Property in the
    set will be indexed by the values on the right-hand side of all the other
    name/value pairs in the set, i.e.: California, Sacremento, 94203, South,
    Main, 1234, Joe and Smith.

    There are two types of queries that we want to do.
    1) retrieve every Property matching the specified search terms, regardless
    of its left-hand side. For this we want to create a field in EVERY Document
    called "keywords" and index it by the right-hand side values as described
    above.
    2) retrieve every Property with a given left-hand side that matches the
    specified search terms. For example, find all the 'City' Properties that
    match the term 'South'. For this we want to create a field with the name of
    the left-hand side (e.g. State, City, ZipCode, etc.) but only in those
    Documents that correspond to a Property with that left-hand side. Again
    this field will be indexed by the right-hand side values as described above.

    So a couple of examples from the above list might look something like:

    Document: State=California
    Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'State' indexed by 'California', 'Sacremento', '94203', etc.

    Document: City=Sacremento
    Field: 'keywords' indexed by 'California', 'Sacremento', '94203', etc.
    Field: 'City' indexed by 'California', 'Sacremento', '94203', etc.

    Now if I'm interested in all the Properties that match the word "South", I
    search the index on the "keywords" field for the term "South". This will
    return both documents above.

    But if I'm only interested in any 'City' Properties that match the term
    'South' I search the index on the "City" field for the term "South". This
    will only return the 'City=Sacremento' document above because it's the only
    Document of the two that even has a 'City' field in it.

    But in any case, the 'State' field and the 'City' field are indexed exactly
    the same way as the 'keywords' field. Which is why I was wondering if there
    was a way to just create these fields as copies of the 'keywords' field.

    Here is a code sample where I'm creating the index. We're using Hibernate
    search to search the indexes, thus the "id" and "_hibernate_class" fields.

    Query q = em.createQuery("select p from Property p");

    List<Property> properties = q.getResultList();

    for (Property p : properties)
    {
        // Index one Property as one Document.
        Document doc = new Document();
        doc.add(new Field("id",
                          Integer.toString(p.getId()),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        doc.add(new Field("_hibernate_class",
                          Property.class.getCanonicalName(),
                          Field.Store.YES,
                          Field.Index.UN_TOKENIZED));
        TokenStream keywordsStream = new PropertyTokenStream(p);
        doc.add(new Field("keywords", keywordsStream));
        // Here is where I would like to add the second field that is a copy
        // of the "keywords" field just created above. Note: the call
        // p.getCharacteristic().getName() is getting the name of the
        // left-hand side of the Property as described above.
        TokenStream copyStream = new PropertyTokenStream(p);
        doc.add(new Field(p.getCharacteristic().getName(), copyStream));
        // Add the Document once, after both fields are attached; the
        // streams are read while addDocument() runs.
        propertyIndexWriter.addDocument(doc);
        keywordsStream.close();
        copyStream.close();
    }
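
One way to picture "create the second field as a copy" is to drain the slow token source once, keep the tokens in memory, and replay them for each field. The sketch below models that in plain Java and makes no claim about Lucene's own API: `Iterator<String>` stands in for a TokenStream, and `TokenReplaySketch`, `expensiveTokens`, and `drain` are hypothetical names. The TeeTokenFilter and SinkTokenizer classes mentioned elsewhere in this thread are not used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of "tokenize once, index under several fields".
class TokenReplaySketch {
    // Counts how many times the slow source has been built.
    static int expensiveRuns = 0;

    // Hypothetical stand-in for PropertyTokenStream: assume this is slow.
    static Iterator<String> expensiveTokens() {
        expensiveRuns++;
        return Arrays.asList("California", "Sacramento", "94203").iterator();
    }

    // Read a source to the end and keep its tokens in memory.
    static List<String> drain(Iterator<String> source) {
        List<String> cached = new ArrayList<>();
        while (source.hasNext()) {
            cached.add(source.next());
        }
        return cached;
    }

    public static void main(String[] args) {
        // The slow source is consumed exactly once.
        List<String> cached = drain(expensiveTokens());

        // Each field gets its own fresh iterator over the cached tokens.
        Iterator<String> forKeywords = cached.iterator();
        Iterator<String> forCity = cached.iterator();

        System.out.println("keywords: " + drain(forKeywords));
        System.out.println("City:     " + drain(forCity));
        System.out.println("slow source built " + expensiveRuns + " time(s)");
    }
}
```

The trade-off is memory: every token of the document is held at once, which may matter when a single document produces a very large number of tokens.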

    Hope that clears it up.

    BTW, in case this seems like a strange way to index things, I will also add
    that we are doing it this way in order to impose a hierarchical structure on
    Properties. So my example above should really look like this:

    State=California
    City=Sacramento
    ZipCode=94203
    StreetName=South Main
    StreetNumber=1234
    Name=Joe Smith

    Use your imagination to visualize what the tree might look like with
    millions of people's addresses. Now imagine trying to tokenize the Document
    corresponding to "State=California". Each path thru the tree from root
    (State) to leaf (Name) represents a set of Properties that is used to index
    the "keywords" field in the "State=California" document. In other words it
    takes a long time to index. This is why I'm looking for a way to just copy
    one field to another.
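
To make the cost concrete, here is a small plain-Java sketch of the tree. It rests on one reading of the description above, which is an assumption: the terms for a node are the values on every root-to-leaf path through it, that is, its ancestors, the node itself, and its whole subtree. `PropertyTreeSketch`, `Node`, and `keywordsFor` are hypothetical names, and values are kept whole rather than split into words.

```java
import java.util.ArrayList;
import java.util.List;

class PropertyTreeSketch {
    // One Property (name=value) in the hierarchy.
    static class Node {
        final String name;
        final String value;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name, String value, Node parent) {
            this.name = name;
            this.value = value;
            this.parent = parent;
            if (parent != null) {
                parent.children.add(this);
            }
        }
    }

    // Values of the node's ancestors (root first), then the node and its subtree.
    static List<String> keywordsFor(Node node) {
        List<String> terms = new ArrayList<>();
        for (Node a = node.parent; a != null; a = a.parent) {
            terms.add(0, a.value);
        }
        collectSubtree(node, terms);
        return terms;
    }

    // Depth-first walk that appends every value below (and including) node.
    private static void collectSubtree(Node node, List<String> out) {
        out.add(node.value);
        for (Node child : node.children) {
            collectSubtree(child, out);
        }
    }

    public static void main(String[] args) {
        Node state = new Node("State", "California", null);
        Node city = new Node("City", "Sacramento", state);
        Node zip = new Node("ZipCode", "94203", city);
        new Node("StreetName", "South Main", zip);

        System.out.println(keywordsFor(state));
        System.out.println(keywordsFor(zip));
    }
}
```

Under this reading the root document has to visit every node beneath it, so its token stream grows with the size of the whole tree, which is why producing it a second time for a copied field is expensive.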

    There is a lot more to our design to facilitate this hierarchical structure
    but this is probably more than you wanted to know. :)

    thanks in advance,
    --
    Bill Chesky * Sr. Software Developer * SunGard * FAME Energy * 1194 Oak
    Valley Drive * Ann Arbor, MI 48103
    Tel 734-332-4405 * Fax 734-332-4440 * [email protected]
    www.sungard.com/energy



Discussion Overview
group: java-user @
categories: lucene
posted: Jun 27, '08 at 4:03a
active: Jun 30, '08 at 3:09p
posts: 13
users: 4
website: lucene.apache.org
