FAQ
I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering. Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene? Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range. In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff

## Search Discussions

•  at Feb 28, 2006 at 8:35 pm ⇧
: Geo definition:
: Boxing around a center point. It's not critical to do a radius search with
: a given circle. A boxed approach allows for taller or wider frames of
: reference, which are applicable for our use.

if you are just loking to confine your results to a box then i think
RangeFiltering on both the X and Y axis will be more efficient then the
individual term queries you are producing.

It will have the added bonus of not artificially affecting the scores of
hte documents based on how often a particular digit apears in a particular
position of hte latitue accross your corpus.

Once you've filtered down to a particular bounding box, you might consider
going back to the function query approach to score documents inside that
box based on their actual distance from the center point. I don't recall
at the moment but i believe FunctionQuery's Scorer supports skipTo in such
a way that it won't bother computing the function for a document that has
been skiped (ie: when containing in a BooleanQuery with another clause
that has already prohibited it, or when executed in the context of a
Filter)

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Feb 28, 2006 at 9:41 pm ⇧
Very good points, I hadn't considered the term frequency of the digits
affecting scoring. As an aside, can that aspect of the score be ignored for
these fields?

I need to spend more time with FunctionQuery, I haven't given it the
attention it deserves.

Great feedback, thanks for the notes.

-- jeff
On 2/28/06, Chris Hostetter wrote:

: Geo definition:
: Boxing around a center point. It's not critical to do a radius search
with
: a given circle. A boxed approach allows for taller or wider frames of
: reference, which are applicable for our use.

if you are just loking to confine your results to a box then i think
RangeFiltering on both the X and Y axis will be more efficient then the
individual term queries you are producing.

It will have the added bonus of not artificially affecting the scores of
hte documents based on how often a particular digit apears in a particular
position of hte latitue accross your corpus.

Once you've filtered down to a particular bounding box, you might consider
going back to the function query approach to score documents inside that
box based on their actual distance from the center point. I don't recall
at the moment but i believe FunctionQuery's Scorer supports skipTo in such
a way that it won't bother computing the function for a document that has
been skiped (ie: when containing in a BooleanQuery with another clause
that has already prohibited it, or when executed in the context of a
Filter)

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Feb 28, 2006 at 10:13 pm ⇧
: Very good points, I hadn't considered the term frequency of the digits
: affecting scoring. As an aside, can that aspect of the score be ignored for
: these fields?

The easiest way is to use a boost that is so low it's insignificant, or
you could subclass TermQuery and override getSimilarity to return a
DelegateSimilarity which wraps the real instance and returns constant
values for things like tf() and idf() ... but i'm 95% sure that using a
RangeFilter (or a ConstantScoreRangeQuery) is going to be faster then all
of those TermQueries no matter what.

: I need to spend more time with FunctionQuery, I haven't given it the
: attention it deserves.

i would start by trying out an apples to apples comparison of your current
approach with one where your index only has one indexed field each for
long/lat that uses ConstantScoreRangeQuery to do the boxing. Compare both
the size of the resulting indexes, the memory footprint while open, and
the time spent executing comparable queries. You should probably compare
queries that involve both large boxes and small boxes, and depending on
hte usage pattern you expect consider caching your Filters if you expect
many boxes to be reused frequently.

once you've found the "best" way to do your boxing ... then look into
using FunctionQueries to influence your scores based on distance fro mthe
center of hte box.

:
: Great feedback, thanks for the notes.
:
: -- jeff
:
: On 2/28/06, Chris Hostetter wrote:
: >
: >
: > : Geo definition:
: > : Boxing around a center point. It's not critical to do a radius search
: > with
: > : a given circle. A boxed approach allows for taller or wider frames of
: > : reference, which are applicable for our use.
: >
: > if you are just loking to confine your results to a box then i think
: > RangeFiltering on both the X and Y axis will be more efficient then the
: > individual term queries you are producing.
: >
: > It will have the added bonus of not artificially affecting the scores of
: > hte documents based on how often a particular digit apears in a particular
: > position of hte latitue accross your corpus.
: >
: > Once you've filtered down to a particular bounding box, you might consider
: > going back to the function query approach to score documents inside that
: > box based on their actual distance from the center point. I don't recall
: > at the moment but i believe FunctionQuery's Scorer supports skipTo in such
: > a way that it won't bother computing the function for a document that has
: > been skiped (ie: when containing in a BooleanQuery with another clause
: > that has already prohibited it, or when executed in the context of a
: > Filter)
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: > For additional commands, e-mail: java-user-help@lucene.apache.org
: >
: >
:

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Mar 1, 2006 at 5:02 pm ⇧
Thanks to everyone on the replies. I'm going to try several of these
approaches and with equivalent data sets and run some side-by-side tests.

No timeframes guarantees here, but I'll report back with the different
approaches and the test results.

cheers,
-- j

On 2/28/06, Chris Hostetter wrote:

: Very good points, I hadn't considered the term frequency of the digits
: affecting scoring. As an aside, can that aspect of the score be ignored
for
: these fields?

The easiest way is to use a boost that is so low it's insignificant, or
you could subclass TermQuery and override getSimilarity to return a
DelegateSimilarity which wraps the real instance and returns constant
values for things like tf() and idf() ... but i'm 95% sure that using a
RangeFilter (or a ConstantScoreRangeQuery) is going to be faster then all
of those TermQueries no matter what.

: I need to spend more time with FunctionQuery, I haven't given it the
: attention it deserves.

i would start by trying out an apples to apples comparison of your current
approach with one where your index only has one indexed field each for
long/lat that uses ConstantScoreRangeQuery to do the boxing. Compare both
the size of the resulting indexes, the memory footprint while open, and
the time spent executing comparable queries. You should probably compare
queries that involve both large boxes and small boxes, and depending on
hte usage pattern you expect consider caching your Filters if you expect
many boxes to be reused frequently.

once you've found the "best" way to do your boxing ... then look into
using FunctionQueries to influence your scores based on distance fro mthe
center of hte box.

:
: Great feedback, thanks for the notes.
:
: -- jeff
:
: On 2/28/06, Chris Hostetter wrote:
: >
: >
: > : Geo definition:
: > : Boxing around a center point. It's not critical to do a radius
search
: > with
: > : a given circle. A boxed approach allows for taller or wider frames
of
: > : reference, which are applicable for our use.
: >
: > if you are just loking to confine your results to a box then i think
: > RangeFiltering on both the X and Y axis will be more efficient then
the
: > individual term queries you are producing.
: >
: > It will have the added bonus of not artificially affecting the scores
of
: > hte documents based on how often a particular digit apears in a
particular
: > position of hte latitue accross your corpus.
: >
: > Once you've filtered down to a particular bounding box, you might
consider
: > going back to the function query approach to score documents inside
that
: > box based on their actual distance from the center point. I don't
recall
: > at the moment but i believe FunctionQuery's Scorer supports skipTo in
such
: > a way that it won't bother computing the function for a document that
has
: > been skiped (ie: when containing in a BooleanQuery with another clause
: > that has already prohibited it, or when executed in the context of a
: > Filter)
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: > For additional commands, e-mail: java-user-help@lucene.apache.org
: >
: >
:

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Feb 28, 2006 at 8:49 pm ⇧
Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.

Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike

-----Original Message-----
From: Jeff Rodenburg
Sent: Tue 2/28/06 3:10 PM
To: java-user@lucene.apache.org
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering. Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene? Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range. In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff
•  at Feb 28, 2006 at 9:33 pm ⇧
Michael -

Great thoughts, and thanks for the feedback.

Following on the Range Query approach, how is performance? I found the
range approach (albeit with the exact values) to be slower than the
parsed-string approach I posited.

On the custom scoring, is the distance element for sorting or as a component
of relevance? We have a need for distance sorting, but I'm trying to slay
that beast at a later stage.

-- jeff
On 2/28/06, Bryzek.Michael wrote:

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.

Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike

-----Original Message-----
From: Jeff Rodenburg
Sent: Tue 2/28/06 3:10 PM
To: java-user@lucene.apache.org
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering. Range queries
on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene? Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search
with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827
.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range. In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Mar 1, 2006 at 7:27 pm ⇧

Jeff Rodenburg wrote:
Following on the Range Query approach, how is performance? I found the
range approach (albeit with the exact values) to be slower than the
parsed-string approach I posited.
Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps
ConstantScoreRangeQuery, which is implemented in terms of RangeFilter.
But RangeQuery will be slower than either RangeFilter or
ConstantScoreRangeQuery.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Mar 1, 2006 at 11:49 pm ⇧
Very good note, I missed that. I need the development environment in front
of me to remember all the different class names correctly. ;-)

-- j

On 3/1/06, Doug Cutting wrote:

Jeff Rodenburg wrote:
Following on the Range Query approach, how is performance? I found the
range approach (albeit with the exact values) to be slower than the
parsed-string approach I posited.
Note that Hoss suggested RangeFilter, not RangeQuery. Or perhaps
ConstantScoreRangeQuery, which is implemented in terms of RangeFilter.
But RangeQuery will be slower than either RangeFilter or
ConstantScoreRangeQuery.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Feb 28, 2006 at 8:55 pm ⇧
I don't know if this matters, but we do all of our geolocating in sql
with decent speed. All the trig is in the query itself and then we can
limit top 5, top 10 etc for what we show. Is the data such that you
need lucene? Can I ask what causes it to be beyond a databases
ability?

-----Original Message-----
From: Bryzek.Michael
Sent: Tuesday, February 28, 2006 2:49 PM
To: java-user@lucene.apache.org
Subject: RE: Hacking proximity search: looking for feedback

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.

Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike

-----Original Message-----
From: Jeff Rodenburg
Sent: Tue 2/28/06 3:10 PM
To: java-user@lucene.apache.org
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain
format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering. Range
queries on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my
data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene? Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

************************************************************************
********************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search
with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each
field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required
clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and
47.34827.
With lexicographical ordering in mind, I can safely include this phrase
in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to
next
position in the latitude range. In the Northern latitude, with the
first
four positions matching those values exactly, anything less than 7 in
the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

************************************************************************
*******************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this
far):
feedback?

Thanks,
-- jeff

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Feb 28, 2006 at 9:00 pm ⇧
Our geo searches are combined with keyword searches. We previously performed all of our queries in the database (Oracle 10g w/ interMedia for the unstructured portion) but found that it was easier to scale search outside the database than within.

-----Original Message-----
From: John Powers
Sent: Tue 2/28/06 3:53 PM
To: java-user@lucene.apache.org
Cc:
Subject: RE: Hacking proximity search: looking for feedback

I don't know if this matters, but we do all of our geolocating in sql
with decent speed. All the trig is in the query itself and then we can
limit top 5, top 10 etc for what we show. Is the data such that you
need lucene? Can I ask what causes it to be beyond a databases
ability?

-----Original Message-----
From: Bryzek.Michael
Sent: Tuesday, February 28, 2006 2:49 PM
To: java-user@lucene.apache.org
Subject: RE: Hacking proximity search: looking for feedback

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.

Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike

-----Original Message-----
From: Jeff Rodenburg
Sent: Tue 2/28/06 3:10 PM
To: java-user@lucene.apache.org
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain
format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering. Range
queries on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my
data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene? Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

************************************************************************
********************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search
with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each
field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required
clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and
47.34827.
With lexicographical ordering in mind, I can safely include this phrase
in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to
next
position in the latitude range. In the Northern latitude, with the
first
four positions matching those values exactly, anything less than 7 in
the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

************************************************************************
*******************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this
far):
feedback?

Thanks,
-- jeff

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
•  at Feb 28, 2006 at 9:26 pm ⇧
I'm in the same boat as Michael on this one. It's not a matter of finding
the right technology to do geo-locational calculations, but rather being
able to accomplish that task in conjunction with keyword search.

-- j
On 2/28/06, Bryzek.Michael wrote:

Our geo searches are combined with keyword searches. We previously
performed all of our queries in the database (Oracle 10g w/ interMedia for
the unstructured portion) but found that it was easier to scale search
outside the database than within.

-----Original Message-----
From: John Powers
Sent: Tue 2/28/06 3:53 PM
To: java-user@lucene.apache.org
Cc:
Subject: RE: Hacking proximity search: looking for feedback

I don't know if this matters, but we do all of our geolocating in sql
with decent speed. All the trig is in the query itself and then we can
limit top 5, top 10 etc for what we show. Is the data such that you
need lucene? Can I ask what causes it to be beyond a databases
ability?

-----Original Message-----
From: Bryzek.Michael
Sent: Tuesday, February 28, 2006 2:49 PM
To: java-user@lucene.apache.org
Subject: RE: Hacking proximity search: looking for feedback

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.

Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike

-----Original Message-----
From: Jeff Rodenburg
Sent: Tue 2/28/06 3:10 PM
To: java-user@lucene.apache.org
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect. By a geo-positional search, I want to constrain
search results within a given location range. Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search. This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot. The guys at
eventax.de have done some very impressive work. Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition. Their approach is very
interesting and is predicated by having data indexed in a certain
format.
I've considered building something based on the FunctionQuery class as
well. Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering. Range
queries on
this data set were an option as well.

I've hit a performance wall with these approaches. The geo-positional
calculations are proving to be intensive. With the combination of my
data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking. Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene? Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework? I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

************************************************************************
********************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point. It's not critical to do a radius search
with
a given circle. A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.

How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison. Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each
field
in progression. Field storage type = Keyword. The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query. Query will contain two required
clauses:
the North/South query, and the West/East query. Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values. Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and
47.34827.
With lexicographical ordering in mind, I can safely include this phrase
in
my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed? Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification. The
further qualification is based on values from the fifth position. I can
safely included the following phrases in my North/South query:

(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
Lat5:2
Lat5:1 Lat5:0)
(+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to
next
position in the latitude range. In the Northern latitude, with the
first
four positions matching those values exactly, anything less than 7 in
the
fifth position will yield a matching latitude. Same goes for the South
(bottom) query.

************************************************************************
*******************

This approach yields a higher number of boolean query clauses, the more
granular you get. In this scenario, I've estimated approximately 145
clauses within the final constructed query. In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).

At last, my question to everyone who cares to respond (and read this
far):
feedback?

Thanks,
-- jeff

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org

## Related Discussions

Discussion Overview
 group java-user categories lucene posted Feb 28, '06 at 8:11p active Mar 1, '06 at 11:49p posts 12 users 5 website lucene.apache.org

### 5 users in discussion

Content

People

Support

Translate

site design / logo © 2022 Grokbase