Hi,

I'm having problems understanding the query parser's handling of AND and OR
if there's more than one operator.

E.g.
a OR b AND c
gives the same number of hits as
b AND c
(only scores are different)

and
a AND b OR c AND d
seems to be equivalent to
a AND b AND c AND d

which doesn't seem logical to me.

I'd expect AND to have higher precedence than OR (like logical AND / OR in
C or Java), so that a OR b AND c would be equivalent to a OR (b AND c)
and a AND b OR c AND d equivalent to (a AND b) OR (c AND d).

When I look at the query parser's sources, I find that -- unless parentheses
are used -- all these terms are added to one boolean query, and the
AND operator makes the terms left and right of it required (unless there
are NOT operators making them prohibited).
So
a OR b AND c gives one boolean query where b and c are required, whereas
a is not.
a AND b OR c AND d produces a boolean query where a, b, c and d are required,
which is indeed the same as a AND b AND c AND d.
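
The flag assignment described above can be modelled without Lucene. What follows is a minimal sketch, not the parser's actual code: `FlatFlagsSketch` and `flatten` are hypothetical names, the input is assumed to consist only of bare terms joined by explicit AND/OR, and OR is assumed to be the default operator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of the behaviour described above; it is not
// Lucene code. Only bare terms joined by explicit AND/OR are handled.
final class FlatFlagsSketch {

    // Returns the flat clause list in the notation of Query.toString():
    // "+x" marks a required term, a bare "x" an optional one.
    static String flatten(String query) {
        List<String> terms = new ArrayList<>();
        List<Boolean> required = new ArrayList<>();
        boolean introducedByAnd = false;
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("AND")) {
                // AND makes the term to its left required ...
                required.set(required.size() - 1, true);
                introducedByAnd = true;
            } else if (token.equals("OR")) {
                introducedByAnd = false; // OR changes no flag in this model
            } else {
                // ... and the term to its right as well.
                terms.add(token);
                required.add(introducedByAnd);
                introducedByAnd = false;
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) out.append(' ');
            if (required.get(i)) out.append('+');
            out.append(terms.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(flatten("a OR b AND c"));       // a +b +c
        System.out.println(flatten("a AND b OR c AND d")); // +a +b +c +d
    }
}
```

In this model the OR token never changes a flag, which is why a term next to an OR ends up required as soon as an AND touches it from the other side.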

Should this be considered a bug?

greetings
Morus

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org

## Search Discussions

•  at Dec 9, 2003 at 6:49 pm ⇧

On Tue, Dec 09, 2003 at 10:57:51AM +0100, Morus Walter wrote:
> Hi,
>
> I'm having problems understanding the query parser's handling of AND and OR
> if there's more than one operator.
>
> E.g.
> a OR b AND c
> gives the same number of hits as
> b AND c
> (only scores are different)

This would make sense if all the documents that have a also have both b
and c in them.

> and
> a AND b OR c AND d
> seems to be equivalent to
> a AND b AND c AND d

That's not what I get.
http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+AND+clark+AND+gephardt&days=
returns 479 items
but
http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+OR+clark+AND+gephardt&days=
returns 564 items, which indicates that the OR does make a difference.
As expected, you end up getting more items with the OR.

Regards,

Dror
--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

•  at Dec 10, 2003 at 9:01 am ⇧
Hi Dror,

>> I'm having problems understanding the query parser's handling of AND and OR
>> if there's more than one operator.
>>
>> E.g.
>> a OR b AND c
>> gives the same number of hits as
>> b AND c
>> (only scores are different)
> This would make sense if all the documents that have a also have both b
> and c in them.

Then the query should be equivalent to (a OR b) AND c.
But it isn't. For specific a, b and c I get 766 hits for a OR b AND c
and 1086 for (a OR b) AND c.

>> and
>> a AND b OR c AND d
>> seems to be equivalent to
>> a AND b AND c AND d
> That's not what I get.
> http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+AND+clark+AND+gephardt&days=
> returns 479 items
> but
> http://www.fastbuzz.com/search/results.jsp?query=dean+AND+kerry+OR+clark+AND+gephardt&days=
> returns 564 items, which indicates that the OR does make a difference.
> As expected, you end up getting more items with the OR.

Hmm. I was sloppy not specifying the Lucene version.
My tests were on 1.2.
But I reindexed a part of my documents using 1.3rc3 and find the same.
What version does fastbuzz use?

I wrote a small test program indexing all documents consisting of
one or zero occurrences of a, b, c and d (ignoring order, so without
the empty document, that's just 15 docs) and performing some queries
on it.
The program is below; this is what I get:

a OR b AND c -> a +b +c
 4 documents found
    a b c
    a b c d
    b c
    b c d
(a OR b) AND c -> +(a b) +c
 6 documents found
    a b c
    a b c d
    a c
    b c
    a c d
    b c d
a OR (b AND c) -> a (+b +c)
 10 documents found
    a b c
    a b c d
    b c
    a
    b c d
    a b
    a c
    a d
    a b d
    a c d
b AND c -> +b +c
 4 documents found
    b c
    a b c
    b c d
    a b c d
a AND b OR c AND d -> +a +b +c +d
 1 documents found
    a b c d
(a AND b) OR (c AND d) -> (+a +b) (+c +d)
 7 documents found
    a b c d
    a b
    c d
    a b c
    a b d
    a c d
    b c d
a AND (b OR c) AND d -> +a +(b c) +d
 3 documents found
    a b c d
    a b d
    a c d
((a AND b) OR c) AND d -> +((+a +b) c) +d
 5 documents found
    a b c d
    a b d
    c d
    a c d
    b c d
a AND (b OR (c AND d)) -> +a +(b (+c +d))
 5 documents found
    a b c d
    a c d
    a b
    a b c
    a b d
a AND b AND c AND d -> +a +b +c +d
 1 documents found
    a b c d

I get this using 1.3rc3, 1.3rc2 or 1.3rc1; with 1.2 I get the same results
in a slightly different order.

So I still get the same results for
a OR b AND c and b AND c
and for
a AND b OR c AND d and a AND b AND c AND d
(note that the result of the toString method of the query is equal in
both cases)
but different results for any operator grouping I can think of.
So to me the question remains: what do AND and OR mean if they are
combined in one expression?
I can understand all the query results where AND and OR queries are
explicitly grouped by parentheses, and the results are what I expect.
But the rules for combined AND and OR aren't what I would expect.
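
For comparison, the hit counts that conventional precedence would give on the same 15 documents can be computed without Lucene. This is only a sketch under stated assumptions: `PrecedenceCount`, `has` and `count` are hypothetical names, each document is modelled as a bitmask over the four terms, and Java's own `&&` / `||` stand in for AND / OR.

```java
import java.util.function.IntPredicate;

// Stand-alone check, independent of Lucene: counts how many of the 15
// non-empty subsets of {a, b, c, d} satisfy a boolean expression.
final class PrecedenceCount {

    // One "document" is a bitmask: bit 0 = a, bit 1 = b, bit 2 = c, bit 3 = d.
    static boolean has(int doc, char term) {
        return (doc & (1 << (term - 'a'))) != 0;
    }

    static int count(IntPredicate expr) {
        int hits = 0;
        for (int doc = 1; doc <= 15; doc++) { // 0 would be the empty document
            if (expr.test(doc)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        // Java's && binds tighter than ||, so no parentheses are needed
        // to get the grouping a OR (b AND c).
        System.out.println(count(d -> has(d, 'a') || has(d, 'b') && has(d, 'c')));   // 10
        System.out.println(count(d -> (has(d, 'a') || has(d, 'b')) && has(d, 'c'))); // 6
        System.out.println(count(d -> has(d, 'b') && has(d, 'c')));                  // 4
        System.out.println(count(d -> has(d, 'a') && has(d, 'b')
                                   || has(d, 'c') && has(d, 'd')));                  // 7
    }
}
```

The counts 10, 6, 4 and 7 agree with the explicitly parenthesized queries in the listing above, while the unparenthesized queries returned 4 and 1 documents.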

greetings
Morus

PS: the test program:

import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest
{
    static String[] docs = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };

    static String[] queries = {
        "a OR b AND c",
        "(a OR b) AND c",
        "a OR (b AND c)",
        "b AND c",
        "a AND b OR c AND d",
        "(a AND b) OR (c AND d)",
        "a AND (b OR c) AND d",
        "((a AND b) OR c) AND d",
        "a AND (b OR (c AND d))",
        "a AND b AND c AND d"
    };

    public static void main(String argv[]) throws Exception {
        Directory dir = new RAMDirectory();
        // empty stop word list, so that the single-letter terms are indexed
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);

        IndexWriter writer = new IndexWriter(dir, analyzer, true);

        // one document per entry, kept in a stored, tokenized field "text"
        for ( int i=0; i < docs.length; i++ ) {
            Document doc = new Document();
            doc.add(Field.Text("text", docs[i]));
            writer.addDocument(doc);
        }
        writer.close();

        Searcher searcher = new IndexSearcher(dir);
        for ( int i=0; i < queries.length; i++ ) {
            // print the query as typed and as parsed, then all its hits
            Query query = QueryParser.parse(queries[i], "text", analyzer);
            System.out.println(queries[i] + " -> " + query.toString("text"));
            Hits hits = searcher.search(query);
            System.out.println(" " + hits.length() + " documents found");
            for ( int j=0; j < hits.length(); j++ ) {
                Document doc = hits.doc(j);
                System.out.println("\t"+doc.get("text"));
            }
        }
    }
}

•  at Dec 10, 2003 at 4:30 pm ⇧
What Morus is saying is right: an expression without parentheses, when
interpreted, assumes terms on either side of an AND clause are compulsory
terms, and any terms on either side of an OR clause are optional. However,
if you combine AND and OR in an expression, the optional terms have no
effect because the others are compulsory.

What needs to be done is that the query parser should process any query
string that has AND, and "put brackets" round it first. As it stands it is
no use, as the OR does not work in the way you would think. AND should be
given implicit priority.
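
The effect described here can be illustrated with a small matching model. It is a sketch only and not Lucene's `BooleanQuery`: `FlatMatchSketch`, `matches` and `hits` are hypothetical names, and the rule that optional terms decide a match only when no term is required is an assumption taken from the observations in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a flat boolean query, not Lucene's BooleanQuery.
// Clauses: "+x" is required, "-x" prohibited, a bare "x" optional.
final class FlatMatchSketch {

    static final String[] DOCS = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };

    static boolean matches(String query, String doc) {
        List<String> docTerms = Arrays.asList(doc.split(" "));
        boolean anyRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : query.trim().split("\\s+")) {
            char flag = clause.charAt(0);
            boolean flagged = flag == '+' || flag == '-';
            boolean present = docTerms.contains(flagged ? clause.substring(1) : clause);
            if (flag == '+') {
                anyRequired = true;
                if (!present) return false;  // a required term is missing
            } else if (flag == '-') {
                if (present) return false;   // a prohibited term is present
            } else if (present) {
                anyOptionalHit = true;
            }
        }
        // ASSUMPTION: optional terms decide the match only if nothing is required.
        return anyRequired || anyOptionalHit;
    }

    static List<String> hits(String query) {
        List<String> result = new ArrayList<>();
        for (String doc : DOCS) {
            if (matches(query, doc)) result.add(doc);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(hits("a +b +c")); // [b c, a b c, b c d, a b c d]
        System.out.println(hits("+b +c"));   // [b c, a b c, b c d, a b c d]
    }
}
```

Under this model the optional `a` in `a +b +c` never adds or removes a hit, which matches the report that only the scores differ.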

•  at Dec 28, 2003 at 4:24 pm ⇧

Jamie Stallwood wrote:

> What Morus is saying is right: an expression without parentheses, when
> interpreted, assumes terms on either side of an AND clause are compulsory
> terms, and any terms on either side of an OR clause are optional. However,
> if you combine AND and OR in an expression, the optional terms have no
> effect because the others are compulsory.
>
> What needs to be done is that the query parser should process any query
> string that has AND, and "put brackets" round it first. As it stands it is
> no use, as the OR does not work in the way you would think. AND should be
> given implicit priority.

I had a closer look at this and wrote a patch that implements this by
changing the vector of boolean clauses into a vector of vectors of boolean
clauses in the addClause method of the query parser. A new sub-vector is
created whenever an explicit OR operator is used.

Queries using explicit AND/OR are grouped by precedence of AND over OR.
That is, a OR b AND c becomes a OR (b AND c).
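
The grouping idea can be sketched on its own, outside the parser. The code below is not the patch: `OrGroupSketch`, `group` and `render` are hypothetical names, and the input is assumed to contain only bare terms joined by explicit AND/OR, with at least one term on each side of every operator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the grouping idea, not the patch itself. It handles
// only bare terms joined by explicit AND/OR.
final class OrGroupSketch {

    // Splits the query into groups at each explicit OR; the terms inside
    // one group were joined by AND.
    static List<List<String>> group(String query) {
        List<List<String>> groups = new ArrayList<>();
        groups.add(new ArrayList<>());
        for (String token : query.trim().split("\\s+")) {
            if (token.equals("OR")) {
                groups.add(new ArrayList<>());              // start a new or-group
            } else if (!token.equals("AND")) {
                groups.get(groups.size() - 1).add(token);   // term joins the current group
            }
        }
        return groups;
    }

    // Renders the groups in the notation of Query.toString().
    static String render(List<List<String>> groups) {
        List<String> parts = new ArrayList<>();
        for (List<String> g : groups) {
            if (g.size() == 1) {
                parts.add(g.get(0));                        // a lone term stays optional
            } else {
                String conj = "+" + String.join(" +", g);   // every term in the group is required
                // a single group needs no nesting; several groups become sub-queries
                parts.add(groups.size() == 1 ? conj : "(" + conj + ")");
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(render(group("a OR b AND c")));       // a (+b +c)
        System.out.println(render(group("a AND b OR c AND d"))); // (+a +b) (+c +d)
        System.out.println(render(group("a AND b")));            // +a +b
    }
}
```

The rendered strings are the same ones the unpatched parser prints for the explicitly parenthesized forms `a OR (b AND c)` and `(a AND b) OR (c AND d)`.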

Queries using implicit AND/OR (depending on the default operator) are handled
as before (so one can still use a +b -c to create one boolean query, where
b is required, c forbidden and a optional).

It's less clear how a query using both explicit AND/OR and implicit operators
should be handled.
Since the patch groups on explicit OR operators, a query
a OR b c is read as a (b c)
whereas
a AND b c is read as +a +b c
(given that the default operator OR is used).

There's one issue left:
The old query parser reads a query
`a OR NOT b' as `a -b' which is the same as `a AND NOT b'.
The modified query parser reads this as `a (-b)'.
While this looks better (at least to me), it does not produce the result
of a OR NOT b. Instead the (-b) part seems to be silently dropped.
While I understand that this query is illegal (just searching for one negative
term) I don't think that silently dropping this part is an appropriate
way to deal with that. But I don't think that's a query parser issue.
The only question is, if the query parser should take care of that.
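
The three readings of `a OR NOT b` can be compared on the 15 test documents with a small model. This is a sketch, not Lucene code: `OrNotSketch` and its methods are hypothetical names, and the rule that a sub-query consisting only of a prohibited clause matches nothing is an assumption based on the behaviour observed above.

```java
import java.util.function.IntPredicate;

// Hypothetical model, independent of Lucene. A "document" is a bitmask over
// the terms a..d (bit 0 = a, ..., bit 3 = d); the empty document is excluded.
final class OrNotSketch {

    static boolean has(int doc, char term) {
        return (doc & (1 << (term - 'a'))) != 0;
    }

    static int count(IntPredicate query) {
        int hits = 0;
        for (int doc = 1; doc <= 15; doc++) {
            if (query.test(doc)) hits++;
        }
        return hits;
    }

    // `a -b`: a is the only positive clause, b is prohibited.
    static int oldReading() {
        return count(d -> has(d, 'a') && !has(d, 'b'));
    }

    // `a (-b)`: ASSUMPTION: the nested query with only a prohibited clause
    // matches no document, so only a contributes hits.
    static int patchedReading() {
        IntPredicate onlyNotB = d -> false;
        return count(d -> has(d, 'a') || onlyNotB.test(d));
    }

    // Plain boolean logic: a OR (NOT b).
    static int logicalReading() {
        return count(d -> has(d, 'a') || !has(d, 'b'));
    }

    public static void main(String[] args) {
        System.out.println(oldReading());     // 4
        System.out.println(patchedReading()); // 8
        System.out.println(logicalReading()); // 11
    }
}
```

Neither parser reading reaches the 11 documents that plain boolean logic would select, because the documents containing neither a nor b are only reachable through a purely negative clause.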

I attached the patch (made against 1.3rc3 but working for 1.3final as well)
and a test program.
The test program parses a number of queries with the default-OR and default-AND
operators and reparses the result of the toString method of the created query.
It outputs the initial query, the query parsed with default OR, the reparsed
query, the query parsed with default AND, and its reparsed query.
If called with the -q option, it also runs the queries against an index
consisting of all documents containing one or zero occurrences of a, b, c and d.
Using an unpatched and a patched version of Lucene in the classpath, one
can look at the effect of the patch in detail.

I'm interested in your comments. If no one objects to the patch, I'll enter
a bug report so it doesn't get lost.

Morus
•  at Dec 28, 2003 at 4:46 pm ⇧

Morus Walter writes:

> I attached the patch (made against 1.3rc3 but working for 1.3final as well)
> and a test program.

Seems the attachments got stripped...

So once again:

The patch:

===File lucene/QueryParser.jj.patch===============
*** QueryParser.jj.org Mon Dec 22 11:47:30 2003
--- QueryParser.jj Mon Dec 22 13:20:57 2003
***************
*** 233,255 ****

protected void addClause(Vector clauses, int conj, int mods, Query q) {
boolean required, prohibited;
!
! // If this term is introduced by AND, make the preceding term required,
if (conj == CONJ_AND) {
! BooleanClause c = (BooleanClause) clauses.elementAt(clauses.size()-1);
! if (!c.prohibited)
! c.required = true;
! }
!
! if (operator == DEFAULT_OPERATOR_AND && conj == CONJ_OR) {
! // If this term is introduced by OR, make the preceding term optional,
! // unless it's prohibited (that means we leave -a OR b but +a OR b-->a OR b)
! // notice if the input is a OR b, first term is parsed as required; without
! // this modification a OR b would parsed as +a OR b
! BooleanClause c = (BooleanClause) clauses.elementAt(clauses.size()-1);
! if (!c.prohibited)
! c.required = false;
}

// We might have been passed a null query; the term might have been
--- 233,249 ----

protected void addClause(Vector clauses, int conj, int mods, Query q) {
boolean required, prohibited;
! // System.out.println(conj+ " " + mods + " " + q.toString("text"));
! // If this term is introduced by AND, check if the previous term is the
! // first term in this or-group and make that term required,
if (conj == CONJ_AND) {
! Vector clauses2 = (Vector)clauses.elementAt(clauses.size()-1);
! //if ( clauses2.size() == 1 ) {
! BooleanClause c = (BooleanClause) clauses2.elementAt(clauses2.size()-1);
! if (!c.prohibited)
! c.required = true;
! //}
}

// We might have been passed a null query; the term might have been
***************
*** 257,277 ****
if (q == null)
return;

if (operator == DEFAULT_OPERATOR_OR) {
- // We set REQUIRED if we're introduced by AND or +; PROHIBITED if
// introduced by NOT or -; make sure not to set both.
prohibited = (mods == MOD_NOT);
! required = (mods == MOD_REQ);
! if (conj == CONJ_AND && !prohibited) {
! required = true;
! }
! } else {
! // We set PROHIBITED if we're introduced by NOT or -; We set REQUIRED
! // if not PROHIBITED and not introduced by OR
prohibited = (mods == MOD_NOT);
! required = (!prohibited && conj != CONJ_OR);
}
}

/**
--- 251,279 ----
if (q == null)
return;

+ // start new or-group if there's an explit or
+ if ( conj == CONJ_OR ) {
+ }
+
if (operator == DEFAULT_OPERATOR_OR) {
// introduced by NOT or -; make sure not to set both.
prohibited = (mods == MOD_NOT);
! // for explizit conjunctions: set required to true
! if ( conj == CONJ_AND ) {
! required = true;
! }
! else {
! // default OR -> required only when requested
! required = (mods == MOD_REQ);
! }
! } else { // operator == DEFAULT_OPERATOR_AND
! // We set PROHIBITED if we're introduced by NOT or -
prohibited = (mods == MOD_NOT);
! // always REQUIRED unless PROHIBITED
! required = (!prohibited);
}
}

/**
***************
*** 359,369 ****
*/
protected Query getBooleanQuery(Vector clauses) throws ParseException
{
! BooleanQuery query = new BooleanQuery();
! for (int i = 0; i < clauses.size(); i++) {
! }
! return query;
}

/**
--- 361,389 ----
*/
protected Query getBooleanQuery(Vector clauses) throws ParseException
{
! BooleanQuery query = new BooleanQuery();
! if ( clauses.size() == 1 ) {
! clauses = (Vector)clauses.elementAt(0);
! for (int i = 0; i < clauses.size(); i++) {
! }
! }
! else {
! for ( int i = 0; i < clauses.size(); i++ ) {
! Vector clauses2 = (Vector)clauses.elementAt(i);
! if ( clauses2.size() == 1 && ((BooleanClause)clauses2.elementAt(0)).prohibited == false ) {
! }
! else if ( clauses2.size() >= 1 ) {
! BooleanQuery query2 = new BooleanQuery();
! for ( int j = 0; j < clauses2.size(); j++ ) {
! }
! }
! }
! }
! return query;
}

/**
***************
*** 551,556 ****
--- 571,577 ----
Query Query(String field) :
{
Vector clauses = new Vector();
Query q, firstQuery=null;
int conj, mods;
}
***************
*** 566,572 ****
{ addClause(clauses, conj, mods, q); }
)*
{
! if (clauses.size() == 1 && firstQuery != null)
return firstQuery;
else {
return getBooleanQuery(clauses);
--- 587,593 ----
{ addClause(clauses, conj, mods, q); }
)*
{
! if (clauses.size() == 1 && ((Vector)clauses.elementAt(0)).size() == 1 && firstQuery != null)
return firstQuery;
else {
return getBooleanQuery(clauses);
============================================================

and the test program:

===File lucene/LuceneTest.java===============
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest
{
    static String[] docs = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };

    static String[] queries = {
        "a OR b AND c",
        "(a OR b) AND c",
        "a OR (b AND c)",
        "a AND b",
        "a AND b OR c AND d",
        "(a AND b) OR (c AND d)",
        "a AND (b OR c) AND d",
        "((a AND b) OR c) AND d",
        "a AND (b OR (c AND d))",
        "a AND b AND c AND d",

        "a OR b AND NOT c",
        "(a OR b) AND NOT c",
        "a OR (b AND NOT c)",
        "a AND NOT d",
        "a AND NOT b OR c AND NOT d",
        "(a AND NOT b) OR (c AND NOT d)",
        "a AND NOT (b OR c) AND NOT d",
        "((a AND NOT b) OR c) AND NOT d",
        "a AND NOT (b OR (c AND NOT d))",
        "a AND NOT b AND NOT c AND NOT d",

        "a OR NOT b",
        "a OR NOT a",

        "a b",
        "a b c",
        "a b (c d e)",
        "+a +b",
        "a -b",
        "a +b -c",
        "+a b -c",
        "+a -b c",
        "a -b -c",
        "-a b c",

        "a OR b c AND d",
        "a OR b c",
        "a AND b c",
        "a OR b c OR d",
        "a OR b c d OR e",
        "a AND b c AND d",
        "a AND b c d AND e"
    };

    public static void main(String argv[]) throws Exception {
        Directory dir = new RAMDirectory();
        // empty stop word list, so that the single-letter terms are indexed
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);

        IndexWriter writer = new IndexWriter(dir, analyzer, true);

        // one document per entry, kept in a stored, tokenized field "text"
        for ( int i=0; i < docs.length; i++ ) {
            Document doc = new Document();
            doc.add(Field.Text("text", docs[i]));
            writer.addDocument(doc);
        }
        writer.close();

        Searcher searcher = new IndexSearcher(dir);
        for ( int i=0; i < queries.length; i++ ) {
            // second parser instance with AND as the default operator
            QueryParser parser = new QueryParser("text", analyzer);
            parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

            Query [] query = new Query[4];

            // 0: default OR, 1: reparse of 0, 2: default AND, 3: reparse of 2
            query[0] = QueryParser.parse(queries[i], "text", analyzer);
            query[1] = QueryParser.parse(query[0].toString("text"), "text", analyzer);
            query[2] = parser.parse(queries[i]);
            query[3] = QueryParser.parse(query[2].toString("text"), "text", analyzer);

            System.out.println(i + ": " + queries[i] + " ==> " + query[0].toString("text") + " -> " + query[1].toString("text") + " / " + query[2].toString("text") + " -> " + query[3].toString("text"));
            // with -q, also run all four queries against the index
            if ( argv.length > 0 && argv[0].equals("-q") ) {
                for ( int k=0; k<4; k++ ) {
                    Hits hits = searcher.search(query[k]);
                    System.out.println(k + " " + query[k].toString("text") + "\t" + hits.length() + " documents found");
                    for ( int j=0; j < hits.length(); j++ ) {
                        Document doc = hits.doc(j);
                        System.out.println("\t"+doc.get("text"));
                    }
                }
            }
        }
    }
}
============================================================

•  at Dec 29, 2003 at 12:11 am ⇧
Morus,

I haven't had time to think through all of the issues and the patch you
submitted, but I suggest that you go ahead and attach this to a
Bugzilla issue so that it can be addressed more formally and avoid
being lost in the mounds of e-mail we all get.

Thanks,
Erik

•  at Dec 29, 2003 at 8:07 pm ⇧
My $.02.

Before having patches, I think it's a good idea to agree on what the
"right" solution is. Most of it is obvious using boolean logic, but we
have some additional requirements like not having a query that only has
a NOT clause. Is this the only exception?

As for the actual patch, I suspect that a better approach than
using Java would be to use precedence operations in the actual parser.
I've never used JavaCC, and it's been years since I've used yacc/bison,
but one of the basic capabilities of parsers is to define precedence. It
should be quite easy to fix it this way, and it should be more "bullet
proof." I looked a bit at the JavaCC code, but I don't really have the
time right now to analyze it. It certainly seems like the strategy of
having all the operators together is problematic:

<DEFAULT> TOKEN : {
  <AND:       ("AND" | "&&") >
| <OR:        ("OR" | "||") >
| <NOT:       ("NOT" | "!") >
| <PLUS:      "+" >
| <MINUS:     "-" >
| <LPAREN:    "(" >
| <RPAREN:    ")" >
| <COLON:     ":" >
| <CARAT:     "^" > : Boost
| <QUOTED:    "\"" (~["\""])+ "\"">
| <TERM:      <_TERM_START_CHAR> (<_TERM_CHAR>)* >
| <FUZZY:     "~" >
| <SLOP:      "~" (<_NUM_CHAR>)+ >
| <PREFIXTERM: <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
| <WILDTERM:  <_TERM_START_CHAR>
              (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
| <RANGEIN_START: "[" > : RangeIn
| <RANGEEX_START: "{" > : RangeEx
}

Something like http://www.lysator.liu.se/c/ANSI-C-grammar-y.html where
different operators are grouped differently according to precedence
would work better.
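
A grammar layered by precedence can be sketched as a hand-written recursive-descent parser. This is an illustration only: it is neither JavaCC input nor Lucene code, `PrecedenceParserSketch` and the rule names are hypothetical, and it does not handle `+`, `-`, implicit operators or any validation of terms.

```java
// Hypothetical recursive-descent sketch of a precedence-aware grammar:
//   orExpr  := andExpr ( "OR" andExpr )*
//   andExpr := notExpr ( "AND" notExpr )*
//   notExpr := "NOT" notExpr | primary
//   primary := TERM | "(" orExpr ")"
// One rule per precedence level; the lowest-binding operator is outermost.
final class PrecedenceParserSketch {
    private final String[] tokens;
    private int pos = 0;

    private PrecedenceParserSketch(String query) {
        // Pad parentheses with spaces so a whitespace split tokenizes them.
        tokens = query.replace("(", " ( ").replace(")", " ) ").trim().split("\\s+");
    }

    // Returns the query fully parenthesized, which makes the grouping visible.
    static String parse(String query) {
        PrecedenceParserSketch p = new PrecedenceParserSketch(query);
        String result = p.orExpr();
        if (p.pos != p.tokens.length) {
            throw new IllegalArgumentException("unexpected token: " + p.tokens[p.pos]);
        }
        return result;
    }

    // Consumes the next token if it equals t.
    private boolean accept(String t) {
        if (pos < tokens.length && tokens[pos].equals(t)) {
            pos++;
            return true;
        }
        return false;
    }

    private String orExpr() {
        String left = andExpr();
        while (accept("OR")) {
            left = "(" + left + " OR " + andExpr() + ")";
        }
        return left;
    }

    private String andExpr() {
        String left = notExpr();
        while (accept("AND")) {
            left = "(" + left + " AND " + notExpr() + ")";
        }
        return left;
    }

    private String notExpr() {
        if (accept("NOT")) {
            return "(NOT " + notExpr() + ")";
        }
        return primary();
    }

    private String primary() {
        if (accept("(")) {
            String inner = orExpr();
            if (!accept(")")) throw new IllegalArgumentException("missing )");
            return inner;
        }
        if (pos >= tokens.length) throw new IllegalArgumentException("unexpected end of query");
        return tokens[pos++]; // simplification: any other token counts as a term
    }

    public static void main(String[] args) {
        System.out.println(parse("a OR b AND c"));       // (a OR (b AND c))
        System.out.println(parse("a AND b OR c AND d")); // ((a AND b) OR (c AND d))
        System.out.println(parse("(a OR b) AND c"));     // ((a OR b) AND c)
    }
}
```

Because `andExpr` is called from inside `orExpr`, AND groups are completed before OR combines them, so precedence comes from the shape of the grammar rather than from flag manipulation after the fact.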

As is often the case, trying to *correctly* parse a string is not
trivial.

Regards,

Dror

On Sun, Dec 28, 2003 at 07:11:22PM -0500, Erik Hatcher wrote:
Morus,

I haven't had time to think through all of the issues and the patch you
submitted, but I suggest that you go ahead and attach this to a
Bugzilla issue so that it can be addressed more formally and avoid
being lost in the mounds of e-mail we all get.

Thanks,
Erik

On Dec 28, 2003, at 11:46 AM, Morus Walter wrote:

Morus Walter writes:
I attached the patch (made against 1.3rc3 but working for 1.3final as
well)
and a test program.
Seems the attachments got stripped...

So once again:

The patch:

===File lucene/QueryParser.jj.patch===============
*** QueryParser.jj.org Mon Dec 22 11:47:30 2003
--- QueryParser.jj Mon Dec 22 13:20:57 2003
***************
*** 233,255 ****

protected void addClause(Vector clauses, int conj, int mods, Query
q) {
boolean required, prohibited;
!
! // If this term is introduced by AND, make the preceding term
required,
if (conj == CONJ_AND) {
! BooleanClause c = (BooleanClause)
clauses.elementAt(clauses.size()-1);
! if (!c.prohibited)
! c.required = true;
! }
!
! if (operator == DEFAULT_OPERATOR_AND && conj == CONJ_OR) {
! // If this term is introduced by OR, make the preceding term
optional,
! // unless it's prohibited (that means we leave -a OR b but +a
OR b-->a OR b)
! // notice if the input is a OR b, first term is parsed as
required; without
! // this modification a OR b would parsed as +a OR b
! BooleanClause c = (BooleanClause)
clauses.elementAt(clauses.size()-1);
! if (!c.prohibited)
! c.required = false;
}

// We might have been passed a null query; the term might have
been
--- 233,249 ----

protected void addClause(Vector clauses, int conj, int mods, Query
q) {
boolean required, prohibited;
! // System.out.println(conj+ " " + mods + " " +
q.toString("text"));
! // If this term is introduced by AND, check if the previous term
is the
! // first term in this or-group and make that term required,
if (conj == CONJ_AND) {
! Vector clauses2 = (Vector)clauses.elementAt(clauses.size()-1);
! //if ( clauses2.size() == 1 ) {
! BooleanClause c = (BooleanClause)
clauses2.elementAt(clauses2.size()-1);
! if (!c.prohibited)
! c.required = true;
! //}
}

// We might have been passed a null query; the term might have
been
***************
*** 257,277 ****
if (q == null)
return;

if (operator == DEFAULT_OPERATOR_OR) {
- // We set REQUIRED if we're introduced by AND or +; PROHIBITED
if
// introduced by NOT or -; make sure not to set both.
prohibited = (mods == MOD_NOT);
! required = (mods == MOD_REQ);
! if (conj == CONJ_AND && !prohibited) {
! required = true;
! }
! } else {
! // We set PROHIBITED if we're introduced by NOT or -; We set
REQUIRED
! // if not PROHIBITED and not introduced by OR
prohibited = (mods == MOD_NOT);
! required = (!prohibited && conj != CONJ_OR);
}
}

/**
--- 251,279 ----
if (q == null)
return;

+ // start new or-group if there's an explit or
+ if ( conj == CONJ_OR ) {
+ }
+
if (operator == DEFAULT_OPERATOR_OR) {
// introduced by NOT or -; make sure not to set both.
prohibited = (mods == MOD_NOT);
! // for explizit conjunctions: set required to true
! if ( conj == CONJ_AND ) {
! required = true;
! }
! else {
! // default OR -> required only when requested
! required = (mods == MOD_REQ);
! }
! } else { // operator == DEFAULT_OPERATOR_AND
! // We set PROHIBITED if we're introduced by NOT or -
prohibited = (mods == MOD_NOT);
! // always REQUIRED unless PROHIBITED
! required = (!prohibited);
}
BooleanClause(q, required, prohibited));
}

/**
***************
*** 359,369 ****
*/
protected Query getBooleanQuery(Vector clauses) throws ParseException
{
! BooleanQuery query = new BooleanQuery();
! for (int i = 0; i < clauses.size(); i++) {
! }
! return query;
}

/**
--- 361,389 ----
*/
protected Query getBooleanQuery(Vector clauses) throws ParseException
{
! BooleanQuery query = new BooleanQuery();
! if ( clauses.size() == 1 ) {
! clauses = (Vector)clauses.elementAt(0);
! for (int i = 0; i < clauses.size(); i++) {
! }
! }
! else {
! for ( int i = 0; i < clauses.size(); i++ ) {
! Vector clauses2 = (Vector)clauses.elementAt(i);
! if ( clauses2.size() == 1 && ((BooleanClause)clauses2.elementAt(0)).prohibited == false ) {
BooleanClause(((BooleanClause)clauses2.elementAt(0)).query, false,
false));
! }
! else if ( clauses2.size() >= 1 ) {
! BooleanQuery query2 = new BooleanQuery();
! for ( int j = 0; j < clauses2.size(); j++ ) {
! }
! }
! }
! }
! return query;
}

/**
***************
*** 551,556 ****
--- 571,577 ----
Query Query(String field) :
{
Vector clauses = new Vector();
Query q, firstQuery=null;
int conj, mods;
}
***************
*** 566,572 ****
{ addClause(clauses, conj, mods, q); }
)*
{
! if (clauses.size() == 1 && firstQuery != null)
return firstQuery;
else {
return getBooleanQuery(clauses);
--- 587,593 ----
{ addClause(clauses, conj, mods, q); }
)*
{
! if (clauses.size() == 1 && ((Vector)clauses.elementAt(0)).size() == 1 && firstQuery != null)
return firstQuery;
else {
return getBooleanQuery(clauses);
============================================================

and the test program:

===File lucene/LuceneTest.java===============
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.QueryParser;

class LuceneTest
{
    static String[] docs = {
        "a", "b", "c", "d",
        "a b", "a c", "a d", "b c", "b d", "c d",
        "a b c", "a b d", "a c d", "b c d",
        "a b c d"
    };

    static String[] queries = {
        "a OR b AND c",
        "(a OR b) AND c",
        "a OR (b AND c)",
        "a AND b",
        "a AND b OR c AND d",
        "(a AND b) OR (c AND d)",
        "a AND (b OR c) AND d",
        "((a AND b) OR c) AND d",
        "a AND (b OR (c AND d))",
        "a AND b AND c AND d",

        "a OR b AND NOT c",
        "(a OR b) AND NOT c",
        "a OR (b AND NOT c)",
        "a AND NOT d",
        "a AND NOT b OR c AND NOT d",
        "(a AND NOT b) OR (c AND NOT d)",
        "a AND NOT (b OR c) AND NOT d",
        "((a AND NOT b) OR c) AND NOT d",
        "a AND NOT (b OR (c AND NOT d))",
        "a AND NOT b AND NOT c AND NOT d",

        "a OR NOT b",
        "a OR NOT a",

        "a b",
        "a b c",
        "a b (c d e)",
        "+a +b",
        "a -b",
        "a +b -c",
        "+a b -c",
        "+a -b c",
        "a -b -c",
        "-a b c",

        "a OR b c AND d",
        "a OR b c",
        "a AND b c",
        "a OR b c OR d",
        "a OR b c d OR e",
        "a AND b c AND d",
        "a AND b c d AND e"
    };

    public static void main(String argv[]) throws Exception {
        Directory dir = new RAMDirectory();
        String[] stop = {};
        Analyzer analyzer = new StandardAnalyzer(stop);

        IndexWriter writer = new IndexWriter(dir, analyzer, true);

        for ( int i=0; i < docs.length; i++ ) {
            Document doc = new Document();
            // The loop body is missing from this archived copy. The two lines below
            // are a reconstruction: the field name "text" is the one searched and
            // read back further down; the use of Field.Text is an assumption.
            doc.add(Field.Text("text", docs[i]));
            writer.addDocument(doc);
        }
        writer.close();

        Searcher searcher = new IndexSearcher(dir);
        for ( int i=0; i < queries.length; i++ ) {
            // query[0], query[1]: default operator OR (static parse);
            // query[2], query[3]: default operator AND (parser instance).
            // The odd entries re-parse the string form of the even ones.
            QueryParser parser = new QueryParser("text", analyzer);
            parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);

            Query [] query = new Query[4];

            query[0] = QueryParser.parse(queries[i], "text", analyzer);
            query[1] = QueryParser.parse(query[0].toString("text"), "text", analyzer);
            query[2] = parser.parse(queries[i]);
            query[3] = QueryParser.parse(query[2].toString("text"), "text", analyzer);

            System.out.println(i + ": " + queries[i] + " ==> "
                + query[0].toString("text") + " -> " + query[1].toString("text") + " / "
                + query[2].toString("text") + " -> " + query[3].toString("text"));
            if ( argv.length > 0 && argv[0].equals("-q") ) {
                for ( int k=0; k<4; k++ ) {
                    Hits hits = searcher.search(query[k]);
                    System.out.println(k + " " + query[k].toString("text")
                        + "\t" + hits.length() + " documents found");
                    for ( int j=0; j < hits.length(); j++ ) {
                        Document doc = hits.doc(j);
                        System.out.println("\t"+doc.get("text"));
                    }
                }
            }
        }
    }
}
============================================================
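
For readers who want to see why the unpatched parser flattens everything, here is a minimal standalone sketch of the flag assignment as described at the start of this thread and in the old side of the `*** 257,277 ****` hunk, for default operator OR only. It does not use Lucene; `FlatFlagsSketch`, `Clause` and `flatten` are hypothetical names, and '+'/'-' prefixes and parentheses are left out on purpose.

```java
import java.util.ArrayList;
import java.util.List;

class FlatFlagsSketch {
    /** Hypothetical stand-in for a BooleanClause: a term plus its two flags. */
    static final class Clause {
        final String term;
        boolean required;
        final boolean prohibited;

        Clause(String term, boolean required, boolean prohibited) {
            this.term = term;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    /** Flattens bare terms joined by AND / OR / NOT into one flag list (default operator OR). */
    static String flatten(String query) {
        List<Clause> clauses = new ArrayList<>();
        String conj = "";          // conjunction that introduces the next term, if any
        boolean negated = false;   // true when the next term was introduced by NOT
        for (String tok : query.trim().split("\\s+")) {
            if (tok.equals("AND") || tok.equals("OR")) { conj = tok; continue; }
            if (tok.equals("NOT")) { negated = true; continue; }
            // An AND makes the preceding clause required, unless it is prohibited.
            if (conj.equals("AND") && !clauses.isEmpty()) {
                Clause prev = clauses.get(clauses.size() - 1);
                if (!prev.prohibited) prev.required = true;
            }
            // The new clause is required only when introduced by AND and not negated.
            clauses.add(new Clause(tok, conj.equals("AND") && !negated, negated));
            conj = "";
            negated = false;
        }
        StringBuilder out = new StringBuilder();
        for (Clause c : clauses) {
            if (out.length() > 0) out.append(' ');
            out.append(c.prohibited ? "-" : c.required ? "+" : "").append(c.term);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(flatten("a OR b AND c"));        // a +b +c
        System.out.println(flatten("a AND b OR c AND d"));  // +a +b +c +d
    }
}
```

Because all clauses land in one list, an OR never starts a new group; it only fails to set a flag. That is the effect the patch works around with its vector of or-groups.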

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

•  at Dec 30, 2003 at 10:40 am ⇧

Dror Matalon writes:
my $.02.

Before having patches, I think it's a good idea to agree on what the
"right" solution is.
I tried to raise that question in the first place. But there wasn't much
response.
So I decided to make a concrete suggestion, how to change things.
Most of it is obvious using boolean logic, but we
have some additional requirements like not having a query that only has
a NOT clause. Is this the only exception?
To me the problem is, that there are two forms of queries
- boolean queries (a OR b AND c...)
- list of terms where some are flagged required and some are flagged forbidden
(a +b -c ...) (in two forms: with default or and default and)

For each of these it seems pretty clear, what they mean, but if you start
to combine the two in one query, I don't know what that should mean.

What's the meaning of a OR b c +d ?
(Actually there must be two meanings, one for default or, one for default and).
Maybe it's obvious, but I fail to see it.
As far as the actual patch, I would suspect that a better approach than
using java would be to use precedence operations in the actual parser.
Then you decide to do a complete rewrite of the query parser.
That's something I wanted to avoid.

I don't think that it matters how you implement a grammar though.
The problem here is, that you have to define the grammar first.

But I agree that doing it by JavaCC means is less error prone.
Something like http://www.lysator.liu.se/c/ANSI-C-grammar-y.html where
different operators are grouped differently according to precedence
would work better.

As is often the case, trying to *correctly* parse a string is not
trivial.
Right. Especially if there's no definition of correct...

Morus
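
As an illustration of the precedence grouping discussed above (one grammar level per operator, as in the referenced C grammar), here is a minimal recursive-descent sketch in plain Java. It is not JavaCC and not the Lucene grammar; `PrecedenceSketch` and `group` are hypothetical names, and only bare terms, AND, OR and parentheses are handled.

```java
class PrecedenceSketch {
    private final String[] tokens;
    private int pos = 0;

    private PrecedenceSketch(String[] tokens) {
        this.tokens = tokens;
    }

    /** Returns the query fully parenthesized, with AND binding tighter than OR. */
    static String group(String query) {
        String spaced = query.replace("(", " ( ").replace(")", " ) ").trim();
        PrecedenceSketch p = new PrecedenceSketch(spaced.split("\\s+"));
        String result = p.orExpr();
        if (p.pos != p.tokens.length) {
            throw new IllegalArgumentException("unexpected token: " + p.tokens[p.pos]);
        }
        return result;
    }

    // orExpr := andExpr ( "OR" andExpr )*
    private String orExpr() {
        String left = andExpr();
        while (peekIs("OR")) {
            pos++;
            left = "(" + left + " OR " + andExpr() + ")";
        }
        return left;
    }

    // andExpr := primary ( "AND" primary )*
    private String andExpr() {
        String left = primary();
        while (peekIs("AND")) {
            pos++;
            left = "(" + left + " AND " + primary() + ")";
        }
        return left;
    }

    // primary := TERM | "(" orExpr ")"
    private String primary() {
        if (pos >= tokens.length) {
            throw new IllegalArgumentException("unexpected end of query");
        }
        String tok = tokens[pos++];
        if (tok.equals("(")) {
            String inner = orExpr();
            if (!peekIs(")")) throw new IllegalArgumentException("missing )");
            pos++;
            return inner;
        }
        if (tok.equals(")") || tok.equals("AND") || tok.equals("OR")) {
            throw new IllegalArgumentException("term expected, found: " + tok);
        }
        return tok;
    }

    private boolean peekIs(String s) {
        return pos < tokens.length && tokens[pos].equals(s);
    }

    public static void main(String[] args) {
        System.out.println(group("a OR b AND c"));        // (a OR (b AND c))
        System.out.println(group("a AND b OR c AND d"));  // ((a AND b) OR (c AND d))
    }
}
```

The sketch deliberately says nothing about the implicit (default) operator or the '+'/'-' prefixes, which is exactly the part of the grammar that has no agreed definition yet.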

•  at Dec 30, 2003 at 6:36 pm ⇧
Hi,

First, let me say something that wasn't obvious from my first mail.
While I had opinions about the implementation, I have a lot of respect
for your finding a problem, and going ahead and coding a solution.
On Tue, Dec 30, 2003 at 11:40:17AM +0100, Morus Walter wrote:
Dror Matalon writes:
my $.02.

Before having patches, I think it's a good idea to agree on what the
"right" solution is.
I tried to raise that question in the first place. But there wasn't much
response.
Might be the time of the year when many people are busy with other
stuff.
So I decided to make a concrete suggestion, how to change things.
Most of it is obvious using boolean logic, but we
have some additional requirements like not having a query that only has
a NOT clause. Is this the only exception?
To me the problem is, that there are two forms of queries
- boolean queries (a OR b AND c...)
- list of terms where some are flagged required and some are flagged forbidden
(a +b -c ...) (in two forms: with default or and default and)

For each of these it seems pretty clear, what they mean, but if you start
to combine the two in one query, I don't know what that should mean.

What's the meaning of a OR b c +d ?
(Actually there must be two meanings, one for default or, one for default and).
Maybe it's obvious, but I fail to see it.
You're right, it is confusing. Assuming default OR I would guess that the
above means
b c +d
and assuming default AND it would mean
+b +c +d
Is there another interpretation?
As far as the actual patch, I would suspect that a better approach than
using java would be to use precedence operations in the actual parser.
Then you decide to do a complete rewrite of the query parser.
That's something I wanted to avoid.
Ouch. I think you might be right. It might be a good idea to move this
discussion to lucene-dev where we'd get more attention from the
developers. This seems more like a developer issue than a user issue.
I don't think that it matters how you implement a grammar though.
The problem here is, that you have to define the grammar first.

But I agree that doing it by JavaCC means is less error prone.
Something like http://www.lysator.liu.se/c/ANSI-C-grammar-y.html where
different operators are grouped differently according to precedence
would work better.

As is often the case, trying to *correctly* parse a string is not
trivial.
Right. Especially if there's no definition of correct...
Morus

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

•  at Dec 30, 2003 at 8:13 pm ⇧
Hi Dror,

Before having patches, I think it's a good idea to agree on what the
"right" solution is.
I tried to raise that question in the first place. But there wasn't much
response.
Might be the time of the year when many people are busy with other
stuff.
Probably.
My impression was that many people don't have a problem with this issue.
Otherwise I'd expect that the issue was raised earlier.
So I decided to make a concrete suggestion, how to change things.
Most of it is obvious using boolean logic, but we
have some additional requirements like not having a query that only has
a NOT clause. Is this the only exception?
To me the problem is, that there are two forms of queries
- boolean queries (a OR b AND c...)
- list of terms where some are flagged required and some are flagged forbidden
(a +b -c ...) (in two forms: with default or and default and)

For each of these it seems pretty clear, what they mean, but if you start
to combine the two in one query, I don't know what that should mean.

What's the meaning of a OR b c +d ?
(Actually there must be two meanings, one for default or, one for default and).
Maybe it's obvious, but I fail to see it.
You're right, it is confusing. Assuming default OR I would guess that the
above means
b c +d
and assuming default AND it would mean
+b +c +d
Is there another interpretation?
You left out the 'a' which I intended to be part of the query (sorry if this
was unclear).

define this type of queries formally, is to give the default operator its own
precedence relative to the precedence of 'OR' and 'AND'.
So there are two possibilities:
either the default operator has higher precedence than 'AND' or lower than
'OR'.
For default OR in the first case
`a OR b c +d' would be equal to `(a OR b) c +d' == (a b) c +d
in the second to `a OR (b c +d)' == a (b c +d)
For default AND one has `+(a b) +c +d' and `a (+b +c +d)'

(a b) c +d searches all documents containing d, occurrences of a, b and c
influence scoring
a (b c +d) searches documents containing `a' joined with documents
containing `d' (where b and c influence scoring)
Now, what's closer to what one might have meant by `a OR b c +d'?

+(a b) +c +d searches documents containing c, d and either a or b.
a (+b +c +d) searches documents containing a or each of b, c and d.
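
To make the four readings above concrete, here is a small sketch that counts matches for each of them over the 15 documents of the test program (every non-empty combination of a, b, c, d). It does not use Lucene. It assumes a simplified matching rule: every required clause must match, no prohibited clause may match, and if there is no required clause at least one optional clause must match. Scoring is ignored, and all names (`ClauseMatchSketch`, `Matcher`, `bool`, `opt`, `req`, `not`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ClauseMatchSketch {
    /** A query is modelled as a yes/no test on a document's term set. */
    interface Matcher { boolean matches(Set<String> doc); }

    static final class Clause {
        final Matcher query;
        final boolean required;
        final boolean prohibited;

        Clause(Matcher query, boolean required, boolean prohibited) {
            this.query = query;
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    static Matcher term(String t) { return doc -> doc.contains(t); }
    static Clause opt(Matcher q) { return new Clause(q, false, false); }
    static Clause req(Matcher q) { return new Clause(q, true, false); }
    static Clause not(Matcher q) { return new Clause(q, false, true); }

    /** Assumed rule: all required match, no prohibited match, else one optional must match. */
    static Matcher bool(Clause... clauses) {
        return doc -> {
            boolean anyRequired = false;
            boolean anyOptionalHit = false;
            for (Clause c : clauses) {
                boolean hit = c.query.matches(doc);
                if (c.prohibited) {
                    if (hit) return false;
                } else if (c.required) {
                    anyRequired = true;
                    if (!hit) return false;
                } else if (hit) {
                    anyOptionalHit = true;
                }
            }
            return anyRequired || anyOptionalHit;
        };
    }

    /** The 15 test documents: every non-empty subset of {a, b, c, d}. */
    static List<Set<String>> docs() {
        String[] terms = {"a", "b", "c", "d"};
        List<Set<String>> docs = new ArrayList<>();
        for (int mask = 1; mask < 16; mask++) {
            Set<String> doc = new LinkedHashSet<>();
            for (int bit = 0; bit < 4; bit++) {
                if ((mask & (1 << bit)) != 0) doc.add(terms[bit]);
            }
            docs.add(doc);
        }
        return docs;
    }

    static int count(Matcher q) {
        int n = 0;
        for (Set<String> doc : docs()) {
            if (q.matches(doc)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Matcher a = term("a"), b = term("b"), c = term("c"), d = term("d");
        System.out.println(count(bool(opt(bool(opt(a), opt(b))), opt(c), req(d))));  // (a b) c +d   -> 8
        System.out.println(count(bool(opt(a), opt(bool(opt(b), opt(c), req(d))))));  // a (b c +d)   -> 12
        System.out.println(count(bool(req(bool(opt(a), opt(b))), req(c), req(d))));  // +(a b) +c +d -> 3
        System.out.println(count(bool(opt(a), opt(bool(req(b), req(c), req(d))))));  // a (+b +c +d) -> 9
        System.out.println(count(bool(not(a))));                                     // -a alone     -> 0
    }
}
```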

The other alternative would be to forbid queries mixing default operators and
explicit and/or. This is what I'd probably vote for at the moment.

The patch doesn't implement any of these, as it handles the default operator
on the same level as AND.
As far as the actual patch, I would suspect that a better approach than
using java would be to use precedence operations in the actual parser.
Then you decide to do a complete rewrite of the query parser.
That's something I wanted to avoid.
Ouch. I think you might be right. It might be a good idea to move this
discussion to lucene-dev where we'd get more attention from the
developers. This seems more like a developer issue than a user issue.
Hmm. That'd be up to the developers.
Don't know how many of them are reading lucene-user.

I'd prefer to keep this on the user list since the query parser is only
loosely coupled to Lucene's core, while it is strongly coupled to the users'
needs. So I think the users should be included in the discussion and I think
the user list is the best place for that.

Morus

•  at Dec 30, 2003 at 8:25 pm ⇧

On Dec 30, 2003, at 3:13 PM, Morus Walter wrote:
Hmm. That'd be up to the developers.
Don't know how many of them are reading lucene-user.
I suspect we're all here!

QueryParser is Lucene's red-headed step-child. It works "well enough",
but it has more than its share of issues. It is almost a shame it is
part of Lucene's core because of its loose coupling, but it does make
Lucene quite approachable for simple applications at least.

A complete rewrite of QueryParser would certainly be welcomed by most,
I think.

Erik

•  at Dec 30, 2003 at 8:42 pm ⇧

On Tue, Dec 30, 2003 at 03:25:08PM -0500, Erik Hatcher wrote:
On Dec 30, 2003, at 3:13 PM, Morus Walter wrote:
Hmm. That'd be up to the developers.
Don't know how many of them are reading lucene-user.
I suspect we're all here!
Great.
QueryParser is Lucene's red-headed step-child. It works "well enough",
but it has more than its share of issues. It is almost a shame it is
part of Lucene's core because of its loose coupling, but it does make
Lucene quite approachable for simple applications at least.
And to make things worse, I suspect that it works well enough for most
users so that there's not enough motivation to fix it.

I'll confess that I seldom use anything but the defaults not only with
A complete rewrite of QueryParser would certainly be welcomed by most,
I think.

Erik

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

•  at Dec 30, 2003 at 9:10 pm ⇧
On Tue, Dec 30, 2003 at 09:13:30PM +0100, Morus Walter wrote:
...
What's the meaning of a OR b c +d ?
(Actually there must be two meanings, one for default or, one for default and).
Maybe it's obvious, but I fail to see it.
You're right, it is confusing. Assuming default OR I would guess that the
above means
b c +d
and assuming default AND it would mean
+b +c +d
Is there another interpretation?
You left out the 'a' which I intended to be part of the query (sorry if this
was unclear).
Oops, my mistake.
define this type of queries formally, is to give the default operator its own
precedence relative to the precedence of 'OR' and 'AND'.
So there are two possibilities:
either the default operator has higher precedence than 'AND' or lower than
'OR'.
For default OR in the first case
`a OR b c +d' would be equal to `(a OR b) c +d' == (a b) c +d
in the second to `a OR (b c +d)' == a (b c +d)
For default AND one has `+(a b) +c +d' and `a (+b +c +d)'

(a b) c +d searches all documents containing d, occurrences of a, b and c
influence scoring
a (b c +d) searches documents containing `a' joined with documents
containing `d' (where b and c influence scoring)
Now, what's closer to what one might have meant by `a OR b c +d'?

+(a b) +c +d searches documents containing c, d and either a or b.
a (+b +c +d) searches documents containing a or each of b, c and d.
I don't think this is a good idea. Mostly because it would be hard to
explain/document, and you don't want end users to have to think and read
a lot of documentation when doing a search.

For one thing, I would advocate for using the '+' notation as the
underlying syntax and migrating to boolean operators, since many
more people are used to that syntax, and I believe it's better
understood.
The other alternative would be to forbid queries mixing default operators and
explicit and/or. This is what I'd probably vote for at the moment.
At first I was inclined to agree but as a rule I think we should adopt
the WWGD (What Would Google Do) philosophy, since that's the syntax and
behavior that most people are used to.

It looks like Google basically adds an "AND" between any two terms that
don't have an operator between them. We could do the same for both the
default AND and the default OR. Once you've done that, you just use the
standard boolean logic precedence rule.
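
A minimal sketch of that idea, assuming whitespace-separated tokens: wherever two operands are adjacent with no AND or OR between them, the default operator is written out explicitly, so that only explicit operators remain for a precedence rule to work on. `ImplicitOperatorSketch` and `makeExplicit` are hypothetical names; '+' and '-' prefixes simply stay attached to their terms, and parentheses come out space-separated.

```java
class ImplicitOperatorSketch {
    /** Writes the default operator between adjacent operands that have no operator between them. */
    static String makeExplicit(String query, String defaultOp) {
        String spaced = query.replace("(", " ( ").replace(")", " ) ").trim();
        StringBuilder out = new StringBuilder();
        String prev = null;
        for (String tok : spaced.split("\\s+")) {
            if (prev != null) {
                if (endsOperand(prev) && startsOperand(tok)) {
                    out.append(' ').append(defaultOp);
                }
                out.append(' ');
            }
            out.append(tok);
            prev = tok;
        }
        return out.toString();
    }

    private static boolean isBinaryOperator(String tok) {
        return tok.equals("AND") || tok.equals("OR");
    }

    // A term or ")" can end an operand; an operator, NOT or "(" cannot.
    private static boolean endsOperand(String tok) {
        return !isBinaryOperator(tok) && !tok.equals("NOT") && !tok.equals("(");
    }

    // A term, NOT or "(" can start an operand; a binary operator or ")" cannot.
    private static boolean startsOperand(String tok) {
        return !isBinaryOperator(tok) && !tok.equals(")");
    }

    public static void main(String[] args) {
        System.out.println(makeExplicit("a OR b c", "AND"));   // a OR b AND c
        System.out.println(makeExplicit("a b (c d)", "AND"));  // a AND b AND ( c AND d )
        System.out.println(makeExplicit("a b", "OR"));         // a OR b
    }
}
```

Note that this treats '+a +b' with default OR as '+a OR +b', so the distinction between required, optional and prohibited terms in one BooleanQuery is not preserved by this approach.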

Now the good news on all of this is that it seems (I did a small test),
that if you use parentheses the parser does the right thing. In my mind,
it's a good idea to use parentheses whenever you're creating complex
expressions.
The patch doesn't implement any of these, as it handles the default operator
on the same level as AND.
As far as the actual patch, I would suspect that a better approach than
using java would be to use precedence operations in the actual parser.
Then you decide to do a complete rewrite of the query parser.
That's something I wanted to avoid.
Ouch. I think you might be right. It might be a good idea to move this
discussion to lucene-dev where we'd get more attention from the
developers. This seems more like a developer issue than a user issue.
Hmm. That'd be up to the developers.
Don't know how many of them are reading lucene-user.

I'd prefer to keep this on the user list since the query parser is only
loosely coupled to Lucene's core, while it is strongly coupled to the users'
needs. So I think the users should be included in the discussion and I think
the user list is the best place for that.
And Erik indicated that they're here anyway, so it's fine.

Regards,

Dror
Morus

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

•  at Dec 30, 2003 at 10:19 pm ⇧
Hi Dror,
define this type of queries formally, is to give the default operator its own
precedence relative to the precedence of 'OR' and 'AND'.
So there are two possibilities:
either the default operator has higher precedence than 'AND' or lower than
'OR'.
For default OR in the first case
`a OR b c +d' would be equal to `(a OR b) c +d' == (a b) c +d
in the second to `a OR (b c +d)' == a (b c +d)
For default AND one has `+(a b) +c +d' and `a (+b +c +d)'

(a b) c +d searches all documents containing d, occurrences of a, b and c
influence scoring
a (b c +d) searches documents containing `a' joined with documents
containing `d' (where b and c influence scoring)
Now, what's closer to what one might have meant by `a OR b c +d'?

+(a b) +c +d searches documents containing c, d and either a or b.
a (+b +c +d) searches documents containing a or each of b, c and d.
I don't think this is a good idea. Mostly because it would be hard to
explain/document, and you don't want end users to have to think and read
a lot of documentation when doing a search.

For one thing, I would advocate for using the '+' notation as the
underlying syntax and migrating to boolean operators, since many
more people are used to that syntax, and I believe it's better
understood.
I'm not sure if I understand what you mean here.
The other alternative would be to forbid queries mixing default operators and
explicit and/or. This is what I'd probably vote for at the moment.
At first I was inclined to agree but as a rule I think we should adopt
the WWGD (What Would Google Do) philosophy, since that's the syntax and
behavior that most people are used to.

It looks like Google basically adds an "AND" between any two terms that
don't have an operator between them. We could do the same for both the
default AND and the default OR. Once you've done that, you just use the
standard boolean logic precedence rule.
Hmm. Then you lose the possibility to create BooleanQuery-objects where
some of the terms are required some forbidden and some have neither flag.
To have this possibility is the reason why I say that implicit AND/OR and
explicit AND/OR need to be different things.
If an implicit OR equals an explicit OR, you would have '+a +b' = '+a OR +b'
= '(+a) OR (+b)' = 'a OR b' which is probably not what was intended.
So either the '+' operator is removed or it is used as an alternative to AND
in which case it could not be a prefix. So instead of '+a +b' one would use
'a + b'.

A consequence of pure boolean operators is, that there won't be a way of
serializing an arbitrary query to a parsable string in standard query parser
syntax.

So for completeness and compatibility with the current query parser, I would
keep the current behaviour of queries without explicit boolean operators.

The problem for users isn't that big IMHO.
Unless a user decides to make use of the '+' operator things are pretty clear:
a b c searches for documents containing one or all of these terms (depending
on the default operator). Using terms with the '-' operator also does what
one expects. Only if the user starts to use the '+' operator explicitly,
things are getting more complicated. So he just shouldn't do that unless
he knows what he does.
The same thing applies to queries using AND/OR as long as you don't mix it
with implicit operators. IMO whoever does the latter gets what he deserves,
if he has to deal with the difficulties of such queries. One just should
not do that, and it should be pretty clear, that the meaning of such a query
is unclear (unless parentheses are used, in which case there is no mixing
any longer).
That is, why I think my patch is good enough, even if it leaves the evaluation
of such queries without clear definition.
Now the good news on all of this is that it seems (I did a small test),
that if you use parentheses the parser does the right thing. In my mind,
it's a good idea to use parentheses whenever you're creating complex
expressions.
Sure. All we are talking about is what happens if there are no explicit
parentheses. If you use parentheses you break the query into simple parts
(e.g. (a AND b) OR (c AND d) are two queries of type 'x AND y' and one
query of type 'x OR y' (where x and y are queries, not just terms)), which
are handled correctly even by the current query parser.
That's one of the reasons, why this hasn't been a big problem in the past.
If you use (a AND b) OR (c AND d) you will get what you expect.
It's just that I think the query parser should also create a reasonable
query if the parentheses are removed.

Morus

•  at Dec 30, 2003 at 11:37 pm ⇧

On Tue, Dec 30, 2003 at 11:19:38PM +0100, Morus Walter wrote:
Hi Dror,
For one thing, I would advocate for using the '+' notation as the
underlying syntax and migrating to boolean operators, since many
more people are used to that syntax, and I believe it's better
understood.
I'm not sure if I understand what you mean here.
I meant that the query parser would accept AND and OR, which get translated
into '+' and '-' but does not accept the '+' and '-' directly.
The other alternative would be to forbid queries mixing default operators and
explicit and/or. This is what I'd probably vote for at the moment.
At first I was inclined to agree but as a rule I think we should adopt
the WWGD (What Would Google Do) philosophy, since that's the syntax and
behavior that most people are used to.

It looks like Google basically adds an "AND" between any two terms that
don't have an operator between them. We could do the same for both the
default AND and the default OR. Once you've done that, you just use the
standard boolean logic precedence rule.
Hmm. Then you lose the possibility to create BooleanQuery-objects where
some of the terms are required some forbidden and some have neither flag.
To have this possibility is the reason why I say that implicit AND/OR and
explicit AND/OR need to be different things.
If an implicit OR equals an explicit OR, you would have '+a +b' = '+a OR +b'
= '(+a) OR (+b)' = 'a OR b' which is probably not what was intended.
So either the '+' operator is removed or it is used as an alternative to AND
in which case it could not be a prefix. So instead of '+a +b' one would use
'a + b'.
Which is my point above. It's too confusing to have:
1. '+' and '-'
2. Explicit AND and OR
3. Implicit AND or OR

There's some redundancy between all three, and it's quite easy to get
confused.
A consequence of pure boolean operators is, that there won't be a way of
serializing an arbitrary query to a parsable string in standard query parser
syntax.

So for completeness and compatibility with the current query parser, I would
keep the current behaviour of queries without explicit boolean operators.

The problem for users isn't that big IMHO.
Unless a user decides to make use of the '+' operator things are pretty clear:
a b c searches for documents containing one or all of these terms (depending
on the default operator). Using terms with the '-' operator also does what
one expects. Only if the user starts to use the '+' operator explicitly,
things are getting more complicated. So he just shouldn't do that unless
he knows what he does.
Fair enough.
The same thing applies to queries using AND/OR as long as you don't mix it
with implicit operators. IMO whoever does the latter gets what he deserves,
if he has to deal with the difficulties of such queries. One just should
not do that, and it should be pretty clear, that the meaning of such a query
is unclear (unless parentheses are used, in which case there is no mixing
any longer).
That is, why I think my patch is good enough, even if it leaves the evaluation
of such queries without clear definition.
I guess I can be convinced. Clearly things are broken, and clearly if
your patch works as advertised, it should make things better rather than
worse. And a partial solution is better than no solution. So, if the
developers bless the patch, run it through the test suite and it comes
out looking good, I'm for it.

Again, thanks for spending the time on this.

Regards,

Dror
Now the good news on all of this is that it seems (I did a small test),
that if you use parentheses the parser does the right thing. In my mind,
it's a good idea to use parentheses whenever you're creating complex
expressions.
Sure. All we are talking about is what happens if there are no explicit
parentheses. If you use parentheses you break the query into simple parts
(e.g. (a AND b) OR (c AND d) are two queries of type 'x AND y' and one
query of type 'x OR y' (where x and y are queries, not just terms)), which
are handled correctly even by the current query parser.
That's one of the reasons, why this hasn't been a big problem in the past.
If you use (a AND b) OR (c AND d) you will get what you expect.
It's just that I think the query parser should also create a reasonable
query if the parentheses are removed.

Morus

--
Dror Matalon
Zapatec Inc
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com

•  at Jan 2, 2004 at 4:39 pm ⇧
Hello all,

I am new to Lucene and working through the Lucene examples on the Jakarta
site.
In the IndexHTML example,
when I type in (from my Tomcat webapps directory)
java org.apache.lucene.demo.IndexHTML -create -index{index}..

It creates an index, but when I search using
http://localhost:8000/luceneweb/
The page works but I do not get any replies.

1. How do you specify which directory is to be searched
( I assumed it was the current directory ie tomcat\webapps but when I put in
more searchable content nothing comes up in the search
I have also tried typing java
org.apache.lucene.demo.IndexHTML -create -index{content}.. where content is
the directory with the content but this still doesn't work)

2. What is the easiest way to specify fields (such as title, etc) to be
searched?
(i.e. what file needs changed to allow me to search for specific fields)

3. Is there a very simple step by step guide for someone new on how to use
lucene.
(I have looked at Jakarta's site but still do not have the answers to the above)

Thanking you in anticipation,

Colin.

•  at Jan 2, 2004 at 6:45 pm ⇧

On Jan 2, 2004, at 11:49 AM, Colin McGuigan wrote:
1. How do you specify which directory is to be searched
( I assumed it was the current directory ie tomcat\webapps but when I
put in
more searchable content nothing comes up in the search
I have also tried typing java
org.apache.lucene.demo.IndexHTML -create -index{content}.. where
content is
the directory with the content but this still doesn't work)
for a nice sales pitch or starter demo to lure folks in. It is my plan
(eventually - more later than sooner at this point, but you can
definitely count on it from me) to enhance the demo application to be
quite nice and easy to use.
2. What is the easiest way to specify fields (such as title, etc) to be
searched?
(i.e. what file needs changed to allow me to search for specific
fields)
The source code to HTMLDocument shows what fields are indexed. To
search on a specific field, use the syntax you see here:
<http://jakarta.apache.org/lucene/docs/queryparsersyntax.html>
3. Is there a very simple step by step guide for someone new on how to
use
lucene.
(I have looked at Jakarta's site but still do not have the answers to the
above)
There are articles available on the resources page:
<http://jakarta.apache.org/lucene/docs/resources.html>, and a new one
of mine that isn't listed there (yet) at
<http://today.java.net/pub/a/today/2003/11/07/QueryParserRules.html>

My recommendation is for you to do your own experimenting and not try
to tinker with the demo application. What you need to know to use
Lucene effectively is actually quite simple and you can glean all of
that from the articles in a cleaner way than the demo app.

Erik

•  at Jan 3, 2004 at 5:18 pm ⇧
Erik, Leo, Daniel,

just a short note to thank you for your help in the above.
I realise I have a lot of work ahead of myself but am keen to continue with
Lucene as I have been impressed with what I have got working.

best regards,

Colin.
•  at Jan 2, 2004 at 7:51 pm ⇧

Colin McGuigan wrote:
It creates an index, but when I search using
http://localhost:8000/luceneweb/
The page works but I do not get any replies.

1. How do you specify which directory is to be searched
<snip>
I agree with Erik that you would rather use an application which is
ready for use in a minute. IMHO Lucene is a library/API, and unless you are
a Java developer, it does not fit your needs. Some applications are
listed here:
http://dmoz.org/Computers/Programming/Languages/Java/Server-Side/Search_Engines/
Omit the Lucene link, else you will be in an endless loop... ;-)

If you must use Lucene, try to find something that suits you here:
http://jakarta.apache.org/lucene/docs/powered.html
You may be interested in i2a, but their demo (@24.9.177.111) is dead
right now.

Cheers,
Leo

•  at Jan 2, 2004 at 8:52 pm ⇧

On Friday 02 January 2004 20:50, Leo Galambos wrote:

IMHO Lucene is library/API and unless you are
a JAVA developer, it does not fit your needs.
One reason for the confusion might be that the homepage states that Lucene
is a "full-featured text search engine". IMHO this should be replaced by
"a powerful Java library for full-text indexing" or something like that.

Regards
Daniel

--
http://www.danielnaber.de

•  at Dec 30, 2003 at 10:39 am ⇧
Hi Erik,
I haven't had time to think through all of the issues and the patch you
submitted, but I suggest that you go ahead and attach this to a
Bugzilla issue so that it can be addressed more formally and avoid
being lost in the mounds of e-mail we all get.
Well, I'd have taken care that it doesn't get lost.
But if you think that it's better to have the issue as a bug report, no
problem.
See:
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=25820

Morus


Discussion Overview
group: java-user
categories: lucene
posted: Dec 9, '03 at 9:58a
active: Jan 3, '04 at 5:18p
posts: 22
users: 8
website: lucene.apache.org
