escaping special characters
Hi,

I used the following procedure to escape special characters:

String escapedKeywords = QueryParser.escape(keywords);
Query query = new QueryParser("content", new StandardAnalyzer()).parse(escapedKeywords);

This works with most of the special characters, like * and ~, but not with \.
I can't do a search for a keyword like "ho\w" and get results.
Am I doing anything wrong here?
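For reference, QueryParser.escape simply prefixes each character that is special to the query syntax with a backslash. The sketch below is an illustrative re-implementation, not the shipped method; the SPECIALS set follows the special-character list in the Lucene query syntax docs:

```java
// A minimal sketch of what QueryParser.escape does: prefix every
// character that is special to the query syntax with a backslash.
// SPECIALS is taken from the Lucene query syntax docs; treat this
// as illustration, not a copy of the shipped method.
public class EscapeSketch {
    static final String SPECIALS = "\\+-!():^[]\"{}~*?|&";

    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIALS.indexOf(c) >= 0) {
                sb.append('\\');   // prefix the special character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The backslash itself is in the set, so "ho\w" becomes "ho\\w":
        // the parser then sees a literal backslash, not an escape.
        System.out.println(escape("ho\\w"));
    }
}
```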


Thanks,
Kalani
--
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa


  • Chris Hostetter at Aug 6, 2008 at 11:06 pm
    : String escapedKeywords = QueryParser.escape(keywords);
    : Query query = new QueryParser("content", new StandardAnalyzer()).parse(escapedKeywords);
    :
    : This works with most of the special characters, like * and ~, but not with \.
    : I can't do a search for a keyword like "ho\w" and get results.
    : Am I doing anything wrong here?

    QueryParser.escape will in fact escape a backslash, but keep in mind
    StandardAnalyzer splits on backslash so that may be what's confusing you.
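A toy illustration of the effect Hoss describes (StandardAnalyzer is a real grammar-based tokenizer, not a regex split, but for this input the outcome is the same):

```java
// StandardAnalyzer is not a plain regex, but for this input the effect
// is the same as splitting on the backslash: the (escaped) query text
// "ho\w" still tokenizes into two terms, "ho" and "w", so the parsed
// query searches for those terms rather than the literal "ho\w".
public class SplitSketch {
    public static void main(String[] args) {
        String keyword = "ho\\w";
        String[] tokens = keyword.split("\\\\");   // stand-in for the analyzer
        for (String t : tokens) {
            System.out.println(t);                 // prints "ho" then "w"
        }
    }
}
```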


    -Hoss


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Aravind Yarram at Aug 7, 2008 at 4:36 am
    Can I escape built-in Lucene keywords like OR and AND as well?

    Regards,
    Aravind R Yarram
    This message contains information from Equifax Inc. which may be confidential and privileged. If you are not an intended recipient, please refrain from any disclosure, copying, distribution or use of this information and note that such actions are prohibited. If you have received this transmission in error, please notify by e-mail postmaster@equifax.com.
  • Chris Hostetter at Aug 11, 2008 at 6:15 pm
    : Can I escape built-in Lucene keywords like OR and AND as well?

    As of the last time I checked: no, they're baked into the grammar.

    (That may have changed when it switched from a JavaCC to a JFlex grammar,
    though, so I'm not 100% positive.)


    -Hoss


  • Steven A Rowe at Aug 11, 2008 at 6:33 pm

    On 08/11/2008 at 2:14 PM, Chris Hostetter wrote:

    : : Can I escape built-in Lucene keywords like OR and AND as well?
    :
    : As of the last time I checked: no, they're baked into the grammar.

    I have not tested this, but I've read somewhere on this list that enclosing OR and AND in double quotes effectively escapes them.

    : (That may have changed when it switched from a JavaCC to a JFlex grammar,
    : though, so I'm not 100% positive.)

    Although the StandardTokenizer was switched about a year ago from a JavaCC to a JFlex grammar, QueryParser's grammar remains in the JavaCC camp.

    Steve

  • Matthew Hall at Aug 11, 2008 at 6:44 pm
    You can simply lowercase your input string before passing it to the
    analyzers, which will give you the effect of escaping the boolean
    operators (i.e., you will now search on "and", "or", and "not").
    Remember, however, that these are extremely common words, and chances
    are high that your analyzer is removing them via its stop-word list.
    This also assumes you are using an analyzer that lowercases as part
    of its normal processing, which many do.

    Matt


    --
    Matthew Hall
    Software Engineer
    Mouse Genome Informatics
    mhall@informatics.jax.org
    (207) 288-6012



  • Mark Miller at Aug 11, 2008 at 6:49 pm

    Steven A Rowe wrote:

    : I have not tested this, but I've read somewhere on this list that
    : enclosing OR and AND in double quotes effectively escapes them.

    Yeah, this works: it short-circuits the token's treatment as an
    operator by triggering a quoted match instead, which eventually just
    pops out the single term in the quotes.

    But also, have you tried escaping with a simple backslash? It seems to
    work for me in a simple test.
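Putting the two workarounds from this thread side by side: both produce a query string in which AND is no longer treated as an operator. Only the string building runs here; the parse calls are left as comments since they need a live QueryParser:

```java
// Two ways to keep the parser from treating AND as an operator, per
// the discussion above: wrap it in double quotes, or escape it with a
// backslash. quoted/escaped are invented helper names for illustration.
public class OperatorEscapeSketch {
    static String quoted(String word)  { return "\"" + word + "\""; }
    static String escaped(String word) { return "\\" + word; }

    public static void main(String[] args) {
        System.out.println(quoted("AND"));   // "AND"  (quoted match of one term)
        System.out.println(escaped("AND"));  // \AND   (escaped, plain term)

        // parser.parse(quoted("AND"));  // parsed as a quoted match
        // parser.parse(escaped("AND")); // parsed as an ordinary term
    }
}
```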

  • Aravind Yarram at Aug 11, 2008 at 7:14 pm
    The documentation for the delete operation seems confusing (I am going
    through the book and have also posted in the book's forums), so I'd
    appreciate it if someone could tell me whether my understanding below
    is correct.

    When I delete a document from the index:

    1) It is marked for deletion in the buffer until I commit/close the
    writer. Does that mean the document is still visible to the Searcher?

    2) Once I commit/close the writer, it is just marked for deletion in
    the index. At this point the document is NOT visible to the Searcher,
    but it is still taking up space in the index.

    3) Once the index is merged (optimized), it is removed from the index.
    Regards,
    Aravind R Yarram
  • Chris Hostetter at Aug 12, 2008 at 4:01 am
    : When I delete a document from the index
    ...

    The answer to all of your questions is yes; however, documents marked
    for deletion are also "removed" from segments whenever those segments
    are merged, which can happen on any add.

    PS...

    : In-Reply-To: <48A08996.6070709@gmail.com>
    : Subject: Clarification on deletion process...

    http://people.apache.org/~hossman/#threadhijack
    Thread Hijacking on Mailing Lists

    When starting a new discussion on a mailing list, please do not reply to
    an existing message, instead start a fresh email. Even if you change the
    subject line of your email, other mail headers still track which thread
    you replied to and your question is "hidden" in that thread and gets less
    attention. It makes following discussions in the mailing list archives
    particularly difficult.
    See Also: http://en.wikipedia.org/wiki/Thread_hijacking




    -Hoss


  • Michael McCandless at Aug 12, 2008 at 10:08 am
    Some more details below...

    Aravind Yarram wrote:

    : 1) It is marked for deletion in the buffer until I commit/close the
    : writer. Does that mean the document is still visible to the Searcher?

    Right: IndexWriter simply records, in RAM, the fact that you want to
    delete all docs matching query X or term Y.

    : 2) Once I commit/close the writer, it is just marked for deletion in
    : the index. At this point the document is NOT visible to the Searcher,
    : but it is still taking up space in the index.

    Yes. Every so often (or when you explicitly commit or close),
    IndexWriter translates the buffered delete requests into _X_N.del
    files, which record exactly which docIDs are now deleted. If you
    reopen a searcher after this point, the documents won't be seen.

    : 3) Once the index is merged (optimized), it is removed from the index.

    As Hoss said, ordinary merges also reclaim the space consumed by
    deleted docs. You can also call expungeDeletes, which forces any
    segments containing deletions to be merged.

    Note that with ConcurrentMergeScheduler, ordinary merges are kicked
    off and complete in background threads.
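The three stages described above can be modeled with a toy in-memory index. This is a sketch of the bookkeeping only; the class and method names are invented for illustration and are not Lucene APIs (bufferedDeleteTerms stands in for IndexWriter's RAM buffer, deletedDocIds for the _X_N.del files, and merge() for a segment merge):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the delete lifecycle: deletes are buffered in RAM, a
// commit turns them into per-doc deletion marks, and only a merge
// physically drops the documents. Bookkeeping only, not Lucene code.
public class DeleteLifecycle {
    List<String> docs = new ArrayList<>();
    List<String> bufferedDeleteTerms = new ArrayList<>(); // in RAM
    Set<Integer> deletedDocIds = new HashSet<>();         // the ".del" marks

    void delete(String term) { bufferedDeleteTerms.add(term); }

    void commit() { // translate buffered deletes into docID marks
        for (String term : bufferedDeleteTerms)
            for (int i = 0; i < docs.size(); i++)
                if (docs.get(i).equals(term)) deletedDocIds.add(i);
        bufferedDeleteTerms.clear();
    }

    // A searcher opened now skips marked docs, which still occupy space.
    List<String> search() {
        List<String> visible = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++)
            if (!deletedDocIds.contains(i)) visible.add(docs.get(i));
        return visible;
    }

    void merge() { // physically reclaim the space
        docs = search();
        deletedDocIds.clear();
    }

    public static void main(String[] args) {
        DeleteLifecycle idx = new DeleteLifecycle();
        idx.docs.add("a"); idx.docs.add("b");

        idx.delete("a");
        System.out.println(idx.search().size()); // 2: delete only buffered

        idx.commit();
        System.out.println(idx.search().size()); // 1: marked, invisible
        System.out.println(idx.docs.size());     // 2: still taking space

        idx.merge();
        System.out.println(idx.docs.size());     // 1: space reclaimed
    }
}
```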

    Mike

  • Aravind Yarram at Aug 11, 2008 at 9:06 pm
    Hi all -

    I know in advance that none of the fields I index goes beyond 1000
    tokens. Can I gain any performance improvement while writing the index
    by limiting maxFieldLength to 200?

    Thanks,
    Regards,
    Aravind R Yarram
  • Mark Miller at Aug 11, 2008 at 9:14 pm

    Aravind.Yarram@equifax.com wrote:

    : I know in advance that none of the fields I index goes beyond 1000.
    : Can I gain any performance improvement while writing the index by
    : limiting the maxFieldLength to 200?

    It's 10000 by default. Sure, if you have a lot of docs with between
    200 and 10000 tokens, indexing less will be faster. But you will only
    be able to search on the first 200 tokens of any longer doc.

  • Aravind Yarram at Aug 11, 2008 at 9:21 pm
    Thanks for the response, but I think I didn't make my question clear...

    If I am indexing a field that can contain at most 1000 tokens, does it
    help performance if I let Lucene know about the 1000 IN ADVANCE?

  • Mark Miller at Aug 11, 2008 at 9:28 pm
    The gist is: it doesn't help. maxFieldLength simply cuts long
    documents off at the knees, on the assumption that they're long enough
    already and that more won't add much value (and may add noise). It's
    not used for any sort of optimization; it's a straight "just use the
    first n tokens from the document."
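In other words, the effective behavior of maxFieldLength can be sketched in a few lines (indexTokens is an invented name for illustration; the real cutoff happens inside IndexWriter, which simply stops adding tokens for a field once the limit is hit):

```java
import java.util.Arrays;
import java.util.List;

// maxFieldLength, as described above, is a plain cutoff: only the
// first n tokens of a field get indexed, and no optimization is tied
// to knowing the limit in advance.
public class MaxFieldLengthSketch {
    static List<String> indexTokens(List<String> tokens, int maxFieldLength) {
        // keep only the first maxFieldLength tokens; the rest are dropped
        return tokens.subList(0, Math.min(tokens.size(), maxFieldLength));
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("t1", "t2", "t3", "t4", "t5");
        List<String> indexed = indexTokens(tokens, 3);
        System.out.println(indexed); // [t1, t2, t3] -- t4 and t5 unsearchable
    }
}
```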

  • Mark Miller at Aug 11, 2008 at 9:32 pm
    No. It's just a simple limiter on the number of terms taken from the
    document, used on the assumption that text that long stops adding
    value, starts adding noise, or runs into diminishing returns at that
    point. No optimizations are based on it.
  • Chris Hostetter at Aug 12, 2008 at 3:54 am
    : In-Reply-To: <48A08996.6070709@gmail.com>
    : Subject: Field sizes: maxFieldLength

    http://people.apache.org/~hossman/#threadhijack
    Thread Hijacking on Mailing Lists




    -Hoss



Discussion Overview
group: java-user
category: lucene
posted: Aug 4, '08 at 10:06a
active: Aug 12, '08 at 10:08a
posts: 16
users: 7
website: lucene.apache.org
