FAQ
I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering. Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene? Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner. For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range. In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff

Search Discussions

  • Chris Hostetter at Feb 28, 2006 at 8:35 pm
    : Geo definition:
    : Boxing around a center point. It's not critical to do a radius search with
    : a given circle. A boxed approach allows for taller or wider frames of
    : reference, which are applicable for our use.

    if you are just loking to confine your results to a box then i think
    RangeFiltering on both the X and Y axis will be more efficient then the
    individual term queries you are producing.

    It will have the added bonus of not artificially affecting the scores of
    hte documents based on how often a particular digit apears in a particular
    position of hte latitue accross your corpus.

    Once you've filtered down to a particular bounding box, you might consider
    going back to the function query approach to score documents inside that
    box based on their actual distance from the center point. I don't recall
    at the moment but i believe FunctionQuery's Scorer supports skipTo in such
    a way that it won't bother computing the function for a document that has
    been skiped (ie: when containing in a BooleanQuery with another clause
    that has already prohibited it, or when executed in the context of a
    Filter)



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jeff Rodenburg at Feb 28, 2006 at 9:41 pm
    Very good points, I hadn't considered the term frequency of the digits
    affecting scoring. As an aside, can that aspect of the score be ignored for
    these fields?

    I need to spend more time with FunctionQuery, I haven't given it the
    attention it deserves.

    Great feedback, thanks for the notes.

    -- jeff
    On 2/28/06, Chris Hostetter wrote:


    : Geo definition:
    : Boxing around a center point. It's not critical to do a radius search
    with
    : a given circle. A boxed approach allows for taller or wider frames of
    : reference, which are applicable for our use.

    if you are just loking to confine your results to a box then i think
    RangeFiltering on both the X and Y axis will be more efficient then the
    individual term queries you are producing.

    It will have the added bonus of not artificially affecting the scores of
    hte documents based on how often a particular digit apears in a particular
    position of hte latitue accross your corpus.

    Once you've filtered down to a particular bounding box, you might consider
    going back to the function query approach to score documents inside that
    box based on their actual distance from the center point. I don't recall
    at the moment but i believe FunctionQuery's Scorer supports skipTo in such
    a way that it won't bother computing the function for a document that has
    been skiped (ie: when containing in a BooleanQuery with another clause
    that has already prohibited it, or when executed in the context of a
    Filter)



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Chris Hostetter at Feb 28, 2006 at 10:13 pm
    : Very good points, I hadn't considered the term frequency of the digits
    : affecting scoring. As an aside, can that aspect of the score be ignored for
    : these fields?

    The easiest way is to use a boost that is so low it's insignificant, or
    you could subclass TermQuery and override getSimilarity to return a
    DelegateSimilarity which wraps the real instance and returns constant
    values for things like tf() and idf() ... but i'm 95% sure that using a
    RangeFilter (or a ConstantScoreRangeQuery) is going to be faster then all
    of those TermQueries no matter what.

    : I need to spend more time with FunctionQuery, I haven't given it the
    : attention it deserves.

    i would start by trying out an apples to apples comparison of your current
    approach with one where your index only has one indexed field each for
    long/lat that uses ConstantScoreRangeQuery to do the boxing. Compare both
    the size of the resulting indexes, the memory footprint while open, and
    the time spent executing comparable queries. You should probably compare
    queries that involve both large boxes and small boxes, and depending on
    hte usage pattern you expect consider caching your Filters if you expect
    many boxes to be reused frequently.

    once you've found the "best" way to do your boxing ... then look into
    using FunctionQueries to influence your scores based on distance fro mthe
    center of hte box.

    :
    : Great feedback, thanks for the notes.
    :
    : -- jeff
    :
    : On 2/28/06, Chris Hostetter wrote:
    : >
    : >
    : > : Geo definition:
    : > : Boxing around a center point. It's not critical to do a radius search
    : > with
    : > : a given circle. A boxed approach allows for taller or wider frames of
    : > : reference, which are applicable for our use.
    : >
    : > if you are just loking to confine your results to a box then i think
    : > RangeFiltering on both the X and Y axis will be more efficient then the
    : > individual term queries you are producing.
    : >
    : > It will have the added bonus of not artificially affecting the scores of
    : > hte documents based on how often a particular digit apears in a particular
    : > position of hte latitue accross your corpus.
    : >
    : > Once you've filtered down to a particular bounding box, you might consider
    : > going back to the function query approach to score documents inside that
    : > box based on their actual distance from the center point. I don't recall
    : > at the moment but i believe FunctionQuery's Scorer supports skipTo in such
    : > a way that it won't bother computing the function for a document that has
    : > been skiped (ie: when containing in a BooleanQuery with another clause
    : > that has already prohibited it, or when executed in the context of a
    : > Filter)
    : >
    : >
    : >
    : > -Hoss
    : >
    : >
    : > ---------------------------------------------------------------------
    : > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    : > For additional commands, e-mail: java-user-help@lucene.apache.org
    : >
    : >
    :



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jeff Rodenburg at Mar 1, 2006 at 5:02 pm
    Thanks to everyone on the replies. I'm going to try several of these
    approaches and with equivalent data sets and run some side-by-side tests.

    No timeframes guarantees here, but I'll report back with the different
    approaches and the test results.

    cheers,
    -- j

    On 2/28/06, Chris Hostetter wrote:


    : Very good points, I hadn't considered the term frequency of the digits
    : affecting scoring. As an aside, can that aspect of the score be ignored
    for
    : these fields?

    The easiest way is to use a boost that is so low it's insignificant, or
    you could subclass TermQuery and override getSimilarity to return a
    DelegateSimilarity which wraps the real instance and returns constant
    values for things like tf() and idf() ... but i'm 95% sure that using a
    RangeFilter (or a ConstantScoreRangeQuery) is going to be faster then all
    of those TermQueries no matter what.

    : I need to spend more time with FunctionQuery, I haven't given it the
    : attention it deserves.

    i would start by trying out an apples to apples comparison of your current
    approach with one where your index only has one indexed field each for
    long/lat that uses ConstantScoreRangeQuery to do the boxing. Compare both
    the size of the resulting indexes, the memory footprint while open, and
    the time spent executing comparable queries. You should probably compare
    queries that involve both large boxes and small boxes, and depending on
    hte usage pattern you expect consider caching your Filters if you expect
    many boxes to be reused frequently.

    once you've found the "best" way to do your boxing ... then look into
    using FunctionQueries to influence your scores based on distance fro mthe
    center of hte box.

    :
    : Great feedback, thanks for the notes.
    :
    : -- jeff
    :
    : On 2/28/06, Chris Hostetter wrote:
    : >
    : >
    : > : Geo definition:
    : > : Boxing around a center point. It's not critical to do a radius
    search
    : > with
    : > : a given circle. A boxed approach allows for taller or wider frames
    of
    : > : reference, which are applicable for our use.
    : >
    : > if you are just loking to confine your results to a box then i think
    : > RangeFiltering on both the X and Y axis will be more efficient then
    the
    : > individual term queries you are producing.
    : >
    : > It will have the added bonus of not artificially affecting the scores
    of
    : > hte documents based on how often a particular digit apears in a
    particular
    : > position of hte latitue accross your corpus.
    : >
    : > Once you've filtered down to a particular bounding box, you might
    consider
    : > going back to the function query approach to score documents inside
    that
    : > box based on their actual distance from the center point. I don't
    recall
    : > at the moment but i believe FunctionQuery's Scorer supports skipTo in
    such
    : > a way that it won't bother computing the function for a document that
    has
    : > been skiped (ie: when containing in a BooleanQuery with another clause
    : > that has already prohibited it, or when executed in the context of a
    : > Filter)
    : >
    : >
    : >
    : > -Hoss
    : >
    : >
    : > ---------------------------------------------------------------------
    : > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    : > For additional commands, e-mail: java-user-help@lucene.apache.org
    : >
    : >
    :



    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Bryzek.Michael at Feb 28, 2006 at 8:49 pm
    Jeff -

    This is an interesting approach. On our end, we have experimented with
    two variants:

    Variant 1: Use Range Query

    Rather than precomputing the boolean clauses yourself, index encoded
    latitude and longitude values and use a Range Query. We encode by
    adding 1000 to each of the values. Note: We only deal with zip codes
    in the US for which this encoding works fine, but is worth another
    look if you have a broader range of coordinates.

    Example: Compute the box as you describe, then encode each value and
    use a RangeQuery. Using the box you provided and lucene's query
    parser:

    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031

    gives us this query:

    latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]

    Lucene then handles expanding this query into the appropriate set of
    Boolean Queries.


    Variant 2: Combine the above w/ a custom scorer

    For us, boxing works well to retrieve relevant documents, but our
    users want those documents sorted by distance. We modified the custom
    scoring class that Hatcher provides in Lucene in Action to compute the
    distance between two points for only those documents which actually
    match. Thus, we do our search using a Range Query, then for all
    matching documents we compute an actual score that incorporates the
    distance from the actual location from which the user is searching.

    On our data set, we can still end up with 1000s of matching documents
    after boxing. Thus, we still see a bottleneck computing the score for
    even this smaller set of documents which we are still working through.

    -Mike


    -----Original Message-----
    From: Jeff Rodenburg
    Sent: Tue 2/28/06 3:10 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: Hacking proximity search: looking for feedback

    I've been wrestling with a way to index and search data with a
    geo-positional aspect. By a geo-positional search, I want to constrain
    search results within a given location range. Furthermore, I want to allow
    the user to set/change the geo-positional boundaries as needed for their
    search. This isn't a foreign concept to anyone who's needed to do the same.

    There's been some work done in this area publicly (or at least what I could
    find via the list archives and google), but not a lot. The guys at
    eventax.de have done some very impressive work. Their implementation is
    within the constructs of nutch; there's more here at
    http://wiki.apache.org/nutch/GeoPosition. Their approach is very
    interesting and is predicated by having data indexed in a certain format.
    I've considered building something based on the FunctionQuery class as
    well. Within this class, I could actually do the mathematical calculations
    (Haversine formula, anyone?) for geo-positional filtering. Range queries on
    this data set were an option as well.

    I've hit a performance wall with these approaches. The geo-positional
    calculations are proving to be intensive. With the combination of my data
    set, hardware, and OS, this just wasn't getting it done.

    So, I reversed my thinking. Instead of getting Lucene to do geo-math, what
    if I made geo-math do Lucene? Lucene is exceptional at string lookups; how
    could I do geo-positional search in that framework? I seized upon an
    approach that I've validated in testing, but wanted to get more feedback
    from the community.

    ********************************************************************************************************

    Data structure:
    Latitude & longitude values in decimal format, i.e. latitude=47.480227,
    longitude=-122.333670 (btw, a tire shop I recently visited).

    Geo definition:
    Boxing around a center point. It's not critical to do a radius search with
    a given circle. A boxed approach allows for taller or wider frames of
    reference, which are applicable for our use.


    How I'm solving:
    Treat the data as strings and formulate boolean query lookups based on
    positional comparison. Here are the steps:

    Indexing:
    1. Break up lat/long values to individual characters, and store each field
    in progression. Field storage type = Keyword. The tire shop example:
    Lat1 = [4]
    Lat2 = [7]
    Lat3 = [.]
    Lat4 = [4]
    Lat5 = [8]
    ...
    Long1 = [-]
    Long2 = [1]
    Long3 = [2]
    ...

    Searching:
    1. Start with box coordinates - North/West corner, South/East corner. For
    example, a sample box around Seattle, WA:
    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031
    2. Break up lat/long values in a manner similar to indexing.
    3. Begin to build boolean query. Query will contain two required clauses:
    the North/South query, and the West/East query. Both queries are built
    using the same progressional evaluation of characters by position.
    4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
    chart:

    North = [4][7][.][7][7][7][0][4]
    South = [4][7][.][3][4][8][2][7]

    Qualifying records will have latitude values between 47.77704 and 47.34827.
    With lexicographical ordering in mind, I can safely include this phrase in
    my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

    The logic is that any latitude with the first three values matching, and the
    fourth position containing either 4 or 5 or 6 will yield a qualifying match.

    What other queries are needed? Ones that match 7 or 3 in the fourth
    position should be considered, but they need further qualification. The
    further qualification is based on values from the fifth position. I can
    safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
    Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

    The logic here is simply an extension of the first query, extended to next
    position in the latitude range. In the Northern latitude, with the first
    four positions matching those values exactly, anything less than 7 in the
    fifth position will yield a matching latitude. Same goes for the South
    (bottom) query.

    *******************************************************************************************

    This approach yields a higher number of boolean query clauses, the more
    granular you get. In this scenario, I've estimated approximately 145
    clauses within the final constructed query. In validation testing, this
    approach has proven to be:

    1) Accurate.
    2) Performant (thus far).


    At last, my question to everyone who cares to respond (and read this far):
    feedback?

    Thanks,
    -- jeff
  • Jeff Rodenburg at Feb 28, 2006 at 9:33 pm
    Michael -

    Great thoughts, and thanks for the feedback.

    Following on the Range Query approach, how is performance? I found the
    range approach (albeit with the exact values) to be slower than the
    parsed-string approach I posited.

    On the custom scoring, is the distance element for sorting or as a component
    of relevance? We have a need for distance sorting, but I'm trying to slay
    that beast at a later stage.

    -- jeff
    On 2/28/06, Bryzek.Michael wrote:

    Jeff -

    This is an interesting approach. On our end, we have experimented with
    two variants:

    Variant 1: Use Range Query

    Rather than precomputing the boolean clauses yourself, index encoded
    latitude and longitude values and use a Range Query. We encode by
    adding 1000 to each of the values. Note: We only deal with zip codes
    in the US for which this encoding works fine, but is worth another
    look if you have a broader range of coordinates.

    Example: Compute the box as you describe, then encode each value and
    use a RangeQuery. Using the box you provided and lucene's query
    parser:

    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031

    gives us this query:

    latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]

    Lucene then handles expanding this query into the appropriate set of
    Boolean Queries.


    Variant 2: Combine the above w/ a custom scorer

    For us, boxing works well to retrieve relevant documents, but our
    users want those documents sorted by distance. We modified the custom
    scoring class that Hatcher provides in Lucene in Action to compute the
    distance between two points for only those documents which actually
    match. Thus, we do our search using a Range Query, then for all
    matching documents we compute an actual score that incorporates the
    distance from the actual location from which the user is searching.

    On our data set, we can still end up with 1000s of matching documents
    after boxing. Thus, we still see a bottleneck computing the score for
    even this smaller set of documents which we are still working through.

    -Mike


    -----Original Message-----
    From: Jeff Rodenburg
    Sent: Tue 2/28/06 3:10 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: Hacking proximity search: looking for feedback

    I've been wrestling with a way to index and search data with a
    geo-positional aspect. By a geo-positional search, I want to constrain
    search results within a given location range. Furthermore, I want to
    allow
    the user to set/change the geo-positional boundaries as needed for their
    search. This isn't a foreign concept to anyone who's needed to do the
    same.

    There's been some work done in this area publicly (or at least what I
    could
    find via the list archives and google), but not a lot. The guys at
    eventax.de have done some very impressive work. Their implementation is
    within the constructs of nutch; there's more here at
    http://wiki.apache.org/nutch/GeoPosition. Their approach is very
    interesting and is predicated by having data indexed in a certain format.
    I've considered building something based on the FunctionQuery class as
    well. Within this class, I could actually do the mathematical
    calculations
    (Haversine formula, anyone?) for geo-positional filtering. Range queries
    on
    this data set were an option as well.

    I've hit a performance wall with these approaches. The geo-positional
    calculations are proving to be intensive. With the combination of my data
    set, hardware, and OS, this just wasn't getting it done.

    So, I reversed my thinking. Instead of getting Lucene to do geo-math,
    what
    if I made geo-math do Lucene? Lucene is exceptional at string lookups;
    how
    could I do geo-positional search in that framework? I seized upon an
    approach that I've validated in testing, but wanted to get more feedback
    from the community.


    ********************************************************************************************************

    Data structure:
    Latitude & longitude values in decimal format, i.e. latitude=47.480227,
    longitude=-122.333670 (btw, a tire shop I recently visited).

    Geo definition:
    Boxing around a center point. It's not critical to do a radius search
    with
    a given circle. A boxed approach allows for taller or wider frames of
    reference, which are applicable for our use.


    How I'm solving:
    Treat the data as strings and formulate boolean query lookups based on
    positional comparison. Here are the steps:

    Indexing:
    1. Break up lat/long values to individual characters, and store each field
    in progression. Field storage type = Keyword. The tire shop example:
    Lat1 = [4]
    Lat2 = [7]
    Lat3 = [.]
    Lat4 = [4]
    Lat5 = [8]
    ...
    Long1 = [-]
    Long2 = [1]
    Long3 = [2]
    ...

    Searching:
    1. Start with box coordinates - North/West corner, South/East corner. For
    example, a sample box around Seattle, WA:
    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031
    2. Break up lat/long values in a manner similar to indexing.
    3. Begin to build boolean query. Query will contain two required clauses:
    the North/South query, and the West/East query. Both queries are built
    using the same progressional evaluation of characters by position.
    4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
    chart:

    North = [4][7][.][7][7][7][0][4]
    South = [4][7][.][3][4][8][2][7]

    Qualifying records will have latitude values between 47.77704 and 47.34827
    .
    With lexicographical ordering in mind, I can safely include this phrase in
    my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

    The logic is that any latitude with the first three values matching, and
    the
    fourth position containing either 4 or 5 or 6 will yield a qualifying
    match.

    What other queries are needed? Ones that match 7 or 3 in the fourth
    position should be considered, but they need further qualification. The
    further qualification is based on values from the fifth position. I can
    safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
    Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
    Lat5:9)

    The logic here is simply an extension of the first query, extended to next
    position in the latitude range. In the Northern latitude, with the first
    four positions matching those values exactly, anything less than 7 in the
    fifth position will yield a matching latitude. Same goes for the South
    (bottom) query.


    *******************************************************************************************

    This approach yields a higher number of boolean query clauses, the more
    granular you get. In this scenario, I've estimated approximately 145
    clauses within the final constructed query. In validation testing, this
    approach has proven to be:

    1) Accurate.
    2) Performant (thus far).


    At last, my question to everyone who cares to respond (and read this far):
    feedback?

    Thanks,
    -- jeff





    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Doug Cutting at Mar 1, 2006 at 7:27 pm

    Jeff Rodenburg wrote:
    Following on the Range Query approach, how is performance? I found the
    range approach (albeit with the exact values) to be slower than the
    parsed-string approach I posited.
    Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps
    ConstantScoreRangeQuery, which is implemented in terms of RangeFilter.
    But RangeQuery will be slower than either RangeFilter or
    ConstantScoreRangeQuery.

    Doug

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jeff Rodenburg at Mar 1, 2006 at 11:49 pm
    Very good note, I missed that. I need the development environment in front
    of me to remember all the different class names correctly. ;-)

    -- j

    On 3/1/06, Doug Cutting wrote:

    Jeff Rodenburg wrote:
    Following on the Range Query approach, how is performance? I found the
    range approach (albeit with the exact values) to be slower than the
    parsed-string approach I posited.
    Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps
    ConstantScoreRangeQuery, which is implemented in terms of RangeFilter.
    But RangeQuery will be slower than either RangeFilter or
    ConstantScoreRangeQuery.

    Doug

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • John Powers at Feb 28, 2006 at 8:55 pm
    I don't know if this matters, but we do all of our geolocating in sql
    with decent speed. All the trig is in the query itself and then we can
    limit top 5, top 10 etc for what we show. Is the data such that you
    need lucene? Can I ask what causes it to be beyond a databases
    ability?

    -----Original Message-----
    From: Bryzek.Michael
    Sent: Tuesday, February 28, 2006 2:49 PM
    To: java-user@lucene.apache.org
    Subject: RE: Hacking proximity search: looking for feedback

    Jeff -

    This is an interesting approach. On our end, we have experimented with
    two variants:

    Variant 1: Use Range Query

    Rather than precomputing the boolean clauses yourself, index encoded
    latitude and longitude values and use a Range Query. We encode by
    adding 1000 to each of the values. Note: We only deal with zip codes
    in the US for which this encoding works fine, but is worth another
    look if you have a broader range of coordinates.

    Example: Compute the box as you describe, then encode each value and
    use a RangeQuery. Using the box you provided and lucene's query
    parser:

    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031

    gives us this query:

    latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
    877.77969]

    Lucene then handles expanding this query into the appropriate set of
    Boolean Queries.


    Variant 2: Combine the above w/ a custom scorer

    For us, boxing works well to retrieve relevant documents, but our
    users want those documents sorted by distance. We modified the custom
    scoring class that Hatcher provides in Lucene in Action to compute the
    distance between two points for only those documents which actually
    match. Thus, we do our search using a Range Query, then for all
    matching documents we compute an actual score that incorporates the
    distance from the actual location from which the user is searching.

    On our data set, we can still end up with 1000s of matching documents
    after boxing. Thus, we still see a bottleneck computing the score for
    even this smaller set of documents which we are still working through.

    -Mike


    -----Original Message-----
    From: Jeff Rodenburg
    Sent: Tue 2/28/06 3:10 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: Hacking proximity search: looking for feedback

    I've been wrestling with a way to index and search data with a
    geo-positional aspect. By a geo-positional search, I want to constrain
    search results within a given location range. Furthermore, I want to
    allow
    the user to set/change the geo-positional boundaries as needed for their
    search. This isn't a foreign concept to anyone who's needed to do the
    same.

    There's been some work done in this area publicly (or at least what I
    could
    find via the list archives and google), but not a lot. The guys at
    eventax.de have done some very impressive work. Their implementation is
    within the constructs of nutch; there's more here at
    http://wiki.apache.org/nutch/GeoPosition. Their approach is very
    interesting and is predicated by having data indexed in a certain
    format.
    I've considered building something based on the FunctionQuery class as
    well. Within this class, I could actually do the mathematical
    calculations
    (Haversine formula, anyone?) for geo-positional filtering. Range
    queries on
    this data set were an option as well.

    I've hit a performance wall with these approaches. The geo-positional
    calculations are proving to be intensive. With the combination of my
    data
    set, hardware, and OS, this just wasn't getting it done.

    So, I reversed my thinking. Instead of getting Lucene to do geo-math,
    what
    if I made geo-math do Lucene? Lucene is exceptional at string lookups;
    how
    could I do geo-positional search in that framework? I seized upon an
    approach that I've validated in testing, but wanted to get more feedback
    from the community.

    ************************************************************************
    ********************************

    Data structure:
    Latitude & longitude values in decimal format, i.e. latitude=47.480227,
    longitude=-122.333670 (btw, a tire shop I recently visited).

    Geo definition:
    Boxing around a center point. It's not critical to do a radius search
    with
    a given circle. A boxed approach allows for taller or wider frames of
    reference, which are applicable for our use.


    How I'm solving:
    Treat the data as strings and formulate boolean query lookups based on
    positional comparison. Here are the steps:

    Indexing:
    1. Break up lat/long values to individual characters, and store each
    field
    in progression. Field storage type = Keyword. The tire shop example:
    Lat1 = [4]
    Lat2 = [7]
    Lat3 = [.]
    Lat4 = [4]
    Lat5 = [8]
    ...
    Long1 = [-]
    Long2 = [1]
    Long3 = [2]
    ...

    Searching:
    1. Start with box coordinates - North/West corner, South/East corner.
    For
    example, a sample box around Seattle, WA:
    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031
    2. Break up lat/long values in a manner similar to indexing.
    3. Begin to build boolean query. Query will contain two required
    clauses:
    the North/South query, and the West/East query. Both queries are built
    using the same progressional evaluation of characters by position.
    4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
    chart:

    North = [4][7][.][7][7][7][0][4]
    South = [4][7][.][3][4][8][2][7]

    Qualifying records will have latitude values between 47.77704 and
    47.34827.
    With lexicographical ordering in mind, I can safely include this phrase
    in
    my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

    The logic is that any latitude with the first three values matching, and
    the
    fourth position containing either 4 or 5 or 6 will yield a qualifying
    match.

    What other queries are needed? Ones that match 7 or 3 in the fourth
    position should be considered, but they need further qualification. The
    further qualification is based on values from the fifth position. I can
    safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
    Lat5:2
    Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
    Lat5:9)

    The logic here is simply an extension of the first query, extended to
    next
    position in the latitude range. In the Northern latitude, with the
    first
    four positions matching those values exactly, anything less than 7 in
    the
    fifth position will yield a matching latitude. Same goes for the South
    (bottom) query.

    ************************************************************************
    *******************

    This approach yields a higher number of boolean query clauses, the more
    granular you get. In this scenario, I've estimated approximately 145
    clauses within the final constructed query. In validation testing, this
    approach has proven to be:

    1) Accurate.
    2) Performant (thus far).


    At last, my question to everyone who cares to respond (and read this
    far):
    feedback?

    Thanks,
    -- jeff




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Bryzek.Michael at Feb 28, 2006 at 9:00 pm
    Our geo searches are combined with keyword searches. We previously performed all of our queries in the database (Oracle 10g w/ interMedia for the unstructured portion) but found that it was easier to scale search outside the database than within.


    -----Original Message-----
    From: John Powers
    Sent: Tue 2/28/06 3:53 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: RE: Hacking proximity search: looking for feedback

    I don't know if this matters, but we do all of our geolocating in sql
    with decent speed. All the trig is in the query itself and then we can
    limit top 5, top 10 etc for what we show. Is the data such that you
    need lucene? Can I ask what causes it to be beyond a databases
    ability?

    -----Original Message-----
    From: Bryzek.Michael
    Sent: Tuesday, February 28, 2006 2:49 PM
    To: java-user@lucene.apache.org
    Subject: RE: Hacking proximity search: looking for feedback

    Jeff -

    This is an interesting approach. On our end, we have experimented with
    two variants:

    Variant 1: Use Range Query

    Rather than precomputing the boolean clauses yourself, index encoded
    latitude and longitude values and use a Range Query. We encode by
    adding 1000 to each of the values. Note: We only deal with zip codes
    in the US for which this encoding works fine, but is worth another
    look if you have a broader range of coordinates.

    Example: Compute the box as you describe, then encode each value and
    use a RangeQuery. Using the box you provided and lucene's query
    parser:

    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031

    gives us this query:

    latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
    877.77969]

    Lucene then handles expanding this query into the appropriate set of
    Boolean Queries.


    Variant 2: Combine the above w/ a custom scorer

    For us, boxing works well to retrieve relevant documents, but our
    users want those documents sorted by distance. We modified the custom
    scoring class that Hatcher provides in Lucene in Action to compute the
    distance between two points for only those documents which actually
    match. Thus, we do our search using a Range Query, then for all
    matching documents we compute an actual score that incorporates the
    distance from the actual location from which the user is searching.

    On our data set, we can still end up with 1000s of matching documents
    after boxing. Thus, we still see a bottleneck computing the score for
    even this smaller set of documents which we are still working through.

    -Mike


    -----Original Message-----
    From: Jeff Rodenburg
    Sent: Tue 2/28/06 3:10 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: Hacking proximity search: looking for feedback

    I've been wrestling with a way to index and search data with a
    geo-positional aspect. By a geo-positional search, I want to constrain
    search results within a given location range. Furthermore, I want to
    allow
    the user to set/change the geo-positional boundaries as needed for their
    search. This isn't a foreign concept to anyone who's needed to do the
    same.

    There's been some work done in this area publicly (or at least what I
    could
    find via the list archives and google), but not a lot. The guys at
    eventax.de have done some very impressive work. Their implementation is
    within the constructs of nutch; there's more here at
    http://wiki.apache.org/nutch/GeoPosition. Their approach is very
    interesting and is predicated by having data indexed in a certain
    format.
    I've considered building something based on the FunctionQuery class as
    well. Within this class, I could actually do the mathematical
    calculations
    (Haversine formula, anyone?) for geo-positional filtering. Range
    queries on
    this data set were an option as well.

    I've hit a performance wall with these approaches. The geo-positional
    calculations are proving to be intensive. With the combination of my
    data
    set, hardware, and OS, this just wasn't getting it done.

    So, I reversed my thinking. Instead of getting Lucene to do geo-math,
    what
    if I made geo-math do Lucene? Lucene is exceptional at string lookups;
    how
    could I do geo-positional search in that framework? I seized upon an
    approach that I've validated in testing, but wanted to get more feedback
    from the community.

    ************************************************************************
    ********************************

    Data structure:
    Latitude & longitude values in decimal format, i.e. latitude=47.480227,
    longitude=-122.333670 (btw, a tire shop I recently visited).

    Geo definition:
    Boxing around a center point. It's not critical to do a radius search
    with
    a given circle. A boxed approach allows for taller or wider frames of
    reference, which are applicable for our use.


    How I'm solving:
    Treat the data as strings and formulate boolean query lookups based on
    positional comparison. Here are the steps:

    Indexing:
    1. Break up lat/long values to individual characters, and store each
    field
    in progression. Field storage type = Keyword. The tire shop example:
    Lat1 = [4]
    Lat2 = [7]
    Lat3 = [.]
    Lat4 = [4]
    Lat5 = [8]
    ...
    Long1 = [-]
    Long2 = [1]
    Long3 = [2]
    ...

    Searching:
    1. Start with box coordinates - North/West corner, South/East corner.
    For
    example, a sample box around Seattle, WA:
    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031
    2. Break up lat/long values in a manner similar to indexing.
    3. Begin to build boolean query. Query will contain two required
    clauses:
    the North/South query, and the West/East query. Both queries are built
    using the same progressional evaluation of characters by position.
    4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
    chart:

    North = [4][7][.][7][7][7][0][4]
    South = [4][7][.][3][4][8][2][7]

    Qualifying records will have latitude values between 47.77704 and
    47.34827.
    With lexicographical ordering in mind, I can safely include this phrase
    in
    my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

    The logic is that any latitude with the first three values matching, and
    the
    fourth position containing either 4 or 5 or 6 will yield a qualifying
    match.

    What other queries are needed? Ones that match 7 or 3 in the fourth
    position should be considered, but they need further qualification. The
    further qualification is based on values from the fifth position. I can
    safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
    Lat5:2
    Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
    Lat5:9)

    The logic here is simply an extension of the first query, extended to
    next
    position in the latitude range. In the Northern latitude, with the
    first
    four positions matching those values exactly, anything less than 7 in
    the
    fifth position will yield a matching latitude. Same goes for the South
    (bottom) query.

    ************************************************************************
    *******************

    This approach yields a higher number of boolean query clauses, the more
    granular you get. In this scenario, I've estimated approximately 145
    clauses within the final constructed query. In validation testing, this
    approach has proven to be:

    1) Accurate.
    2) Performant (thus far).


    At last, my question to everyone who cares to respond (and read this
    far):
    feedback?

    Thanks,
    -- jeff




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Jeff Rodenburg at Feb 28, 2006 at 9:26 pm
    I'm in the same boat as Michael on this one. It's not a matter of finding
    the right technology to do geo-locational calculations, but rather being
    able to accomplish that task in conjunction with keyword search.

    -- j
    On 2/28/06, Bryzek.Michael wrote:

    Our geo searches are combined with keyword searches. We previously
    performed all of our queries in the database (Oracle 10g w/ interMedia for
    the unstructured portion) but found that it was easier to scale search
    outside the database than within.


    -----Original Message-----
    From: John Powers
    Sent: Tue 2/28/06 3:53 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: RE: Hacking proximity search: looking for feedback

    I don't know if this matters, but we do all of our geolocating in sql
    with decent speed. All the trig is in the query itself and then we can
    limit top 5, top 10 etc for what we show. Is the data such that you
    need lucene? Can I ask what causes it to be beyond a databases
    ability?

    -----Original Message-----
    From: Bryzek.Michael
    Sent: Tuesday, February 28, 2006 2:49 PM
    To: java-user@lucene.apache.org
    Subject: RE: Hacking proximity search: looking for feedback

    Jeff -

    This is an interesting approach. On our end, we have experimented with
    two variants:

    Variant 1: Use Range Query

    Rather than precomputing the boolean clauses yourself, index encoded
    latitude and longitude values and use a Range Query. We encode by
    adding 1000 to each of the values. Note: We only deal with zip codes
    in the US for which this encoding works fine, but is worth another
    look if you have a broader range of coordinates.

    Example: Compute the box as you describe, then encode each value and
    use a RangeQuery. Using the box you provided and lucene's query
    parser:

    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031

    gives us this query:

    latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
    877.77969]

    Lucene then handles expanding this query into the appropriate set of
    Boolean Queries.


    Variant 2: Combine the above w/ a custom scorer

    For us, boxing works well to retrieve relevant documents, but our
    users want those documents sorted by distance. We modified the custom
    scoring class that Hatcher provides in Lucene in Action to compute the
    distance between two points for only those documents which actually
    match. Thus, we do our search using a Range Query, then for all
    matching documents we compute an actual score that incorporates the
    distance from the actual location from which the user is searching.

    On our data set, we can still end up with 1000s of matching documents
    after boxing. Thus, we still see a bottleneck computing the score for
    even this smaller set of documents which we are still working through.

    -Mike


    -----Original Message-----
    From: Jeff Rodenburg
    Sent: Tue 2/28/06 3:10 PM
    To: java-user@lucene.apache.org
    Cc:
    Subject: Hacking proximity search: looking for feedback

    I've been wrestling with a way to index and search data with a
    geo-positional aspect. By a geo-positional search, I want to constrain
    search results within a given location range. Furthermore, I want to
    allow
    the user to set/change the geo-positional boundaries as needed for their
    search. This isn't a foreign concept to anyone who's needed to do the
    same.

    There's been some work done in this area publicly (or at least what I
    could
    find via the list archives and google), but not a lot. The guys at
    eventax.de have done some very impressive work. Their implementation is
    within the constructs of nutch; there's more here at
    http://wiki.apache.org/nutch/GeoPosition. Their approach is very
    interesting and is predicated by having data indexed in a certain
    format.
    I've considered building something based on the FunctionQuery class as
    well. Within this class, I could actually do the mathematical
    calculations
    (Haversine formula, anyone?) for geo-positional filtering. Range
    queries on
    this data set were an option as well.

    I've hit a performance wall with these approaches. The geo-positional
    calculations are proving to be intensive. With the combination of my
    data
    set, hardware, and OS, this just wasn't getting it done.

    So, I reversed my thinking. Instead of getting Lucene to do geo-math,
    what
    if I made geo-math do Lucene? Lucene is exceptional at string lookups;
    how
    could I do geo-positional search in that framework? I seized upon an
    approach that I've validated in testing, but wanted to get more feedback
    from the community.

    ************************************************************************
    ********************************

    Data structure:
    Latitude & longitude values in decimal format, i.e. latitude=47.480227,
    longitude=-122.333670 (btw, a tire shop I recently visited).

    Geo definition:
    Boxing around a center point. It's not critical to do a radius search
    with
    a given circle. A boxed approach allows for taller or wider frames of
    reference, which are applicable for our use.


    How I'm solving:
    Treat the data as strings and formulate boolean query lookups based on
    positional comparison. Here are the steps:

    Indexing:
    1. Break up lat/long values to individual characters, and store each
    field
    in progression. Field storage type = Keyword. The tire shop example:
    Lat1 = [4]
    Lat2 = [7]
    Lat3 = [.]
    Lat4 = [4]
    Lat5 = [8]
    ...
    Long1 = [-]
    Long2 = [1]
    Long3 = [2]
    ...

    Searching:
    1. Start with box coordinates - North/West corner, South/East corner.
    For
    example, a sample box around Seattle, WA:
    North latitude = 47.77704
    West longitude = -122.41909
    South latitude = 47.34827
    East longitude = -122.22031
    2. Break up lat/long values in a manner similar to indexing.
    3. Begin to build boolean query. Query will contain two required
    clauses:
    the North/South query, and the West/East query. Both queries are built
    using the same progressional evaluation of characters by position.
    4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
    chart:

    North = [4][7][.][7][7][7][0][4]
    South = [4][7][.][3][4][8][2][7]

    Qualifying records will have latitude values between 47.77704 and
    47.34827.
    With lexicographical ordering in mind, I can safely include this phrase
    in
    my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

    The logic is that any latitude with the first three values matching, and
    the
    fourth position containing either 4 or 5 or 6 will yield a qualifying
    match.

    What other queries are needed? Ones that match 7 or 3 in the fourth
    position should be considered, but they need further qualification. The
    further qualification is based on values from the fifth position. I can
    safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
    Lat5:2
    Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
    Lat5:9)

    The logic here is simply an extension of the first query, extended to
    next
    position in the latitude range. In the Northern latitude, with the
    first
    four positions matching those values exactly, anything less than 7 in
    the
    fifth position will yield a matching latitude. Same goes for the South
    (bottom) query.

    ************************************************************************
    *******************

    This approach yields a higher number of boolean query clauses, the more
    granular you get. In this scenario, I've estimated approximately 145
    clauses within the final constructed query. In validation testing, this
    approach has proven to be:

    1) Accurate.
    2) Performant (thus far).


    At last, my question to everyone who cares to respond (and read this
    far):
    feedback?

    Thanks,
    -- jeff




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org




    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedFeb 28, '06 at 8:11p
activeMar 1, '06 at 11:49p
posts12
users5
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase