Hi,

I am working on a program to index and search chemical elements and compounds. Say I write an analyzer that extracts chemical terms, such as H2O. I noticed that I can specify a token's type. Can I construct a token as
new Token("H2", start, end, "chem");

My question is:
How do I search for all the tokens of type "chem", such as H2O, O2, etc.? Is there a sample like this?

If this approach doesn't work, what is the best approach?

Thanks,
Ethan


  • Pierrick Brihaye at Apr 16, 2005 at 6:31 am

    ethandev@comcast.net wrote:

    > I am working on a program to index and search chemical elements and compounds. Say I write an analyzer that extracts chemical terms, such as H2O. I noticed that I can specify a token's type. Can I construct a token as
    > new Token("H2", start, end, "chem");
    >
    > My question is:
    > How do I search for all the tokens of type "chem", such as H2O, O2, etc.? Is there a sample like this?
    >
    > If this approach doesn't work, what is the best approach?

    You may assign a type to the tokens, and you may then filter them
    according to their type, *but* the index does not keep this information,
    since it stores *terms* (field/value pairs).

    Compare:
    http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/Token.html
    and
    http://lucene.apache.org/java/docs/api/org/apache/lucene/index/Term.html

    Note, however, that the relative position of each term (the
    Token's positionIncrement, default = 1) is also stored in the index;
    this allows proximity searches.

    So, what can you do?

    1) Use a dedicated field "chem" in which only chemical content is allowed
    (filter out every token whose type is not "chem").
    2) Modify your termText, e.g. "chem_H2", and do the same in your queries.
    3) Work on the query rather than on the index content: filter out
    what is not chemical.
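
Options 1 and 2 above can be sketched in plain Java, with no Lucene dependency. The class ChemTermSketch, its method names, and the deliberately naive formula pattern are all hypothetical and only illustrate the idea; in a real analyzer the equivalent logic would sit in a TokenFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: decides a token's type and derives
// the two index-side workarounds (dedicated field, prefixed term text).
final class ChemTermSketch {
    // Assumption: a "formula" is a run of element symbols with optional counts.
    private static final Pattern FORMULA = Pattern.compile("([A-Z][a-z]?\\d*)+");

    // Returns "chem" for tokens such as H2O or O2, "word" otherwise.
    // A digit is required so that capitalised words like "He" or "I" are not tagged.
    static String typeOf(String token) {
        boolean hasDigit = token.chars().anyMatch(Character::isDigit);
        return hasDigit && FORMULA.matcher(token).matches() ? "chem" : "word";
    }

    // Option 1: the content of a dedicated "chem" field, i.e. only chemical tokens.
    static List<String> chemFieldTerms(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (typeOf(token).equals("chem")) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Option 2: a single field in which chemical tokens carry a "chem_" prefix.
    // The same prefix has to be applied to the query terms.
    static List<String> prefixedTerms(List<String> tokens) {
        List<String> terms = new ArrayList<>();
        for (String token : tokens) {
            terms.add(typeOf(token).equals("chem") ? "chem_" + token : token);
        }
        return terms;
    }
}
```

For the token list [Water, is, H2O, and, oxygen, is, O2], chemFieldTerms keeps only H2O and O2, while prefixedTerms turns them into chem_H2O and chem_O2 and leaves the other tokens untouched.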

    There may be other solutions...

    Cheers,

    p.b.



    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Paul Libbrecht at Apr 16, 2005 at 9:16 pm

    On 16 Apr 2005, at 08:31, Pierrick Brihaye wrote:

    >> How do I search for all the tokens of type "chem", such as H2O,
    >> O2, etc.? Is there a sample like this? If this approach doesn't work,
    >> what is the best approach?

    Nifty question... I'm working on indexing text with math formulae, so
    there may be similarities!

    > You may assign a type to the tokens, and you may then filter them
    > according to their type, *but* the index does not keep this information,
    > since it stores *terms* (field/value pairs). [...]
    > 1) Use a dedicated field "chem" in which only chemical content is allowed
    > (filter out every token whose type is not "chem").
    > 2) Modify your termText, e.g. "chem_H2", and do the same in your queries.
    > 3) Work on the query rather than on the index content: filter out
    > what is not chemical.

    So it really seems that chem_H2 is the only choice, doesn't it?

    What are your requirements or expectations?
    - match a formula in the middle of a sentence?
    - or simply match documents that contain both the sentence's words and
    the formula? (In the latter case, I think solution 1 is valid.)
    - how would you do wildcards with formulae?

    A related question, at least for me, is how to match a+(b+1) when the
    query is X+Y, i.e. a subtree cut.
    Does this occur in chemical formulae as well?

    paul

  • Ethandev at Apr 22, 2005 at 2:49 am
    Thanks Pierrick.

    Are you saying that I should construct the Token in the analyzer like this:
    new Token("chem_H2O", 100, 103, "chem");

    Note that chem_ is a prefix added to H2O, and that 100 to 103 covers the length of H2O rather than that of chem_H2O?

    I also have a further problem, and I am not sure whether it can be solved by this approach.

    I want to index H2O within a compound, say H2O-CH2, and then run a query that finds H2O inside a compound. How can I do that?

    Thanks,
    Ethan

  • Pierrick Brihaye at Apr 22, 2005 at 7:34 am
    Hi,

    ethandev@comcast.net wrote:

    > Thanks Pierrick.
    >
    > Are you saying that I should construct the Token in the analyzer like this:
    > new Token("chem_H2O", 100, 103, "chem");
    >
    > Note that chem_ is a prefix added to H2O, and that 100 to 103 covers the length of H2O rather than that of chem_H2O?

    Well... 100 to 103 are offsets provided by the reader (and are thus
    usually offsets in the source file). These offsets may help you with
    some computations, but they are lost when the token is indexed.
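
To make the distinction concrete, here is a small plain-Java sketch with no Lucene dependency. The classes OffsetSketch and Tok are hypothetical stand-ins for a token; they only show that the term text may carry the chem_ prefix while the offsets still describe the span of the original text in the source.

```java
// Hypothetical model of a token: term text plus offsets into the source.
final class OffsetSketch {
    static final class Tok {
        final String termText; // what would be indexed, e.g. "chem_H2O"
        final int start;       // offset of the first source character
        final int end;         // offset just past the last source character
        final String type;     // analysis-time label, e.g. "chem"

        Tok(String termText, int start, int end, String type) {
            this.termText = termText;
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Builds a prefixed token whose offsets still point at the unprefixed source text.
    static Tok chemToken(String source, int start, int end) {
        return new Tok("chem_" + source.substring(start, end), start, end, "chem");
    }

    // Recovers the original text from the offsets, as a highlighter would need to.
    static String sourceText(String source, Tok token) {
        return source.substring(token.start, token.end);
    }
}
```

With the source string "Water is H2O." and offsets 9 to 12, the token's term text is chem_H2O (8 characters), while the offsets span the 3 characters of H2O.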
    > I want to index H2O within a compound, say H2O-CH2, and then run a query that finds H2O inside a compound. How can I do that?

    Dirty solution: use wildcard queries (chem_H2O-*).
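
A trailing-wildcard query of this kind amounts to a prefix scan over the sorted term dictionary. The sketch below is plain Java with no Lucene dependency; PrefixScanSketch and termsWithPrefix are hypothetical names, and it assumes each compound has been indexed as one single term such as chem_H2O-CH2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical illustration of what a query like chem_H2O-* has to do.
final class PrefixScanSketch {
    // Returns every dictionary term that starts with the given prefix.
    // In a sorted set these terms are contiguous, so the scan can stop
    // at the first term that no longer matches.
    static List<String> termsWithPrefix(SortedSet<String> dictionary, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }
}
```

Under that assumption the prefix chem_H2O- finds chem_H2O-CH2 but not chem_CH2-H2O, where H2O is not the first component. That limitation is one reason to prefer the position-based approach.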

    Smart solution :

    Treat the compound as a sequence of "words", i.e. chem_H2O followed by
    chem_CH2 (strange enough ;-)), and run phrase queries or
    MultiPhraseQueries against it. See:

    http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/TestPhraseQuery.java?rev=150740&view=markup
    http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/TestMultiPhraseQuery.java?rev=150733&view=markup

    Setting the phrase slop may help you.

    You may also want to adjust the positions of the tokens to
    allow or prevent hits from your PhraseQuery: give your Tokens a relevant
    positionIncrement. See:
    http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/test/org/apache/lucene/search/TestPositionIncrement.java?rev=150585&view=markup
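
The interplay of positions, positionIncrement and slop can be sketched in plain Java, again with no Lucene dependency. PositionalSketch is a hypothetical single-document positional index, and its notion of slop is simplified on purpose: it counts the extra positions allowed between two terms that appear in order, which is not exactly how Lucene computes slop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-document positional index.
final class PositionalSketch {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int position = -1;

    // Adds a term. An increment of 1 places it directly after the previous term;
    // a larger increment leaves a gap that exact phrase matches cannot cross.
    void add(String term, int positionIncrement) {
        position += positionIncrement;
        postings.computeIfAbsent(term, key -> new ArrayList<>()).add(position);
    }

    // True if `second` occurs after `first` with at most `slop` positions in between.
    boolean phrase(String first, String second, int slop) {
        List<Integer> none = Collections.emptyList();
        for (int p : postings.getOrDefault(first, none)) {
            for (int q : postings.getOrDefault(second, none)) {
                int gap = q - p - 1;
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

If chem_H2O and chem_CH2 are added with an increment of 1 each, the exact phrase (slop 0) matches in that order only. If the next compound starts with a larger increment, say 5, an exact phrase cannot span the two compounds, but a query with enough slop can.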

    Cheers,

    --
    Pierrick Brihaye, informaticien
    Service régional de l'Inventaire
    DRAC Bretagne
    mailto:pierrick.brihaye@culture.gouv.fr
    +33 (0)2 99 29 67 78

  • Paul Libbrecht at Apr 22, 2005 at 10:19 am

    On 22 Apr 2005, at 09:36, Pierrick Brihaye wrote:

    >> Are you saying that I should construct the Token in the analyzer like this:
    >> new Token("chem_H2O", 100, 103, "chem");
    >> Note that chem_ is a prefix added to H2O, and that 100 to 103 covers the
    >> length of H2O rather than that of chem_H2O?
    > Well... 100 to 103 are offsets provided by the reader (and are thus
    > usually offsets in the source file). These offsets may help you with
    > some computations, but they are lost when the token is indexed.

    Not in all situations, surely? Otherwise you could not map a hit back to
    the source text, as is done in search-result highlighting.

    One thing I keep wondering about is how far this parameter could, again,
    be something different...

    In particular, I'd much prefer it to be a tree path instead of a
    plain number. I don't have plain numeric offsets from a reader, and
    they are often lost in an XML content base.

    Is there any hope of that?


    thanks

    paul


Discussion Overview
group: java-user @
categories: lucene
posted: Apr 16, '05 at 2:21a
active: Apr 22, '05 at 10:19a
posts: 6
users: 4
website: lucene.apache.org
