FAQ
HI All,
I am indexing a set file type (html, js,jsp,xml etc). All the file type have a common field called as text. This field contains all the file data. Can I have different analyzer for depending upon file type.

Note: I am indexing all file type with same indexer.

Regards,
Allahbaksh

Allahbaksh Mohammedali Asadullah

http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>




**************** CAUTION - Disclaimer *****************
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely
for the use of the addressee(s). If you are not the intended recipient, please
notify the sender by e-mail and delete the original message. Further, you are not
to copy, disclose, or distribute this e-mail or its contents to any other person and
any such actions are unlawful. This e-mail may contain viruses. Infosys has taken
every reasonable precaution to minimize this risk, but is not liable for any damage
you may sustain as a result of any virus in this e-mail. You should carry out your
own virus checks before opening the e-mail or attachment. Infosys reserves the
right to monitor and review the content of all messages sent to or from this e-mail
address. Messages sent to or from this e-mail address may be stored on the
Infosys e-mail system.
***INFOSYS******** End of Disclaimer ********INFOSYS***

Search Discussions

  • Ian Lea at Nov 25, 2008 at 3:59 pm
    Yes, you can. But it is generally best to use the same analyzer for
    indexing and for searching so, assuming that you want searches to find
    matches in files of whatever type, I'd recommend pre-processing the
    files to a common text format before indexing and then using the same
    analyzer for all of them.


    --
    Ian.


    On Tue, Nov 25, 2008 at 3:40 PM, Allahbaksh Mohammedali Asadullah
    wrote:
    HI All,
    I am indexing a set file type (html, js,jsp,xml etc). All the file type have a common field called as text. This field contains all the file data. Can I have different analyzer for depending upon file type.

    Note: I am indexing all file type with same indexer.

    Regards,
    Allahbaksh

    Allahbaksh Mohammedali Asadullah

    http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Nov 25, 2008 at 4:10 pm
    Hmmmm, how would you do this without open/closing your
    IndexWriter around different types of documents? And as far
    as querying is concerned, I doubt the input would be a file, so
    one of the canned analyzers should do. Although "care should
    be taken <G>"....

    Best
    Erick
    On Tue, Nov 25, 2008 at 10:57 AM, Ian Lea wrote:

    Yes, you can. But it is generally best to use the same analyzer for
    indexing and for searching so, assuming that you want searches to find
    matches in files of whatever type, I'd recommend pre-processing the
    files to a common text format before indexing and then using the same
    analyzer for all of them.


    --
    Ian.


    On Tue, Nov 25, 2008 at 3:40 PM, Allahbaksh Mohammedali Asadullah
    wrote:
    HI All,
    I am indexing a set file type (html, js,jsp,xml etc). All the file type
    have a common field called as text. This field contains all the file data.
    Can I have different analyzer for depending upon file type.
    Note: I am indexing all file type with same indexer.

    Regards,
    Allahbaksh

    Allahbaksh Mohammedali Asadullah

    http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Nov 25, 2008 at 4:09 pm
    I'm assuming that you want a different analyzer to handle extracting
    the relevant information to put into a "text" field of the Lucene document.
    I know of no way you can attach different analyzers to a single field.
    You can certainly attach different analyzers to *different* fields...

    The first thing I'd try would be to write your own analyzer that keeps
    track of what kind of file it's currently analyzing and knows how to
    "do the right thing" to extract the next token for the text field.

    A cruder way would be to detect the type of document yourself,
    extract the text into a string (or some such) then feed that into
    your document.

    Best
    Erick

    On Tue, Nov 25, 2008 at 10:40 AM, Allahbaksh Mohammedali Asadullah wrote:

    HI All,
    I am indexing a set file type (html, js,jsp,xml etc). All the file type
    have a common field called as text. This field contains all the file data.
    Can I have different analyzer for depending upon file type.

    Note: I am indexing all file type with same indexer.

    Regards,
    Allahbaksh

    Allahbaksh Mohammedali Asadullah

    http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>




    **************** CAUTION - Disclaimer *****************
    This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
    solely
    for the use of the addressee(s). If you are not the intended recipient,
    please
    notify the sender by e-mail and delete the original message. Further, you
    are not
    to copy, disclose, or distribute this e-mail or its contents to any other
    person and
    any such actions are unlawful. This e-mail may contain viruses. Infosys has
    taken
    every reasonable precaution to minimize this risk, but is not liable for
    any damage
    you may sustain as a result of any virus in this e-mail. You should carry
    out your
    own virus checks before opening the e-mail or attachment. Infosys reserves
    the
    right to monitor and review the content of all messages sent to or from
    this e-mail
    address. Messages sent to or from this e-mail address may be stored on the
    Infosys e-mail system.
    ***INFOSYS******** End of Disclaimer ********INFOSYS***
  • Allahbaksh Mohammedali Asadullah at Nov 26, 2008 at 12:45 pm
    Hi Erik,
    Thanks a lot for your reply. You are right I want different analyzer on same field depending upon the other fields in the document.

    For Example
    Doc1
    Text:"Some text here"
    type:html

    Doc2
    Text: "Some jsp text here"
    type:jsp

    Now depending upon the type I wanted to use different analyzer for extracting and searching. Is there any way to do this?

    I am using the way which you suggested.

    Warm Regards,
    Allahbaksh

    ________________________________________
    From: Erick Erickson [erickerickson@gmail.com]
    Sent: Tuesday, November 25, 2008 9:38 PM
    To: java-user@lucene.apache.org
    Subject: Re: Analyzer

    I'm assuming that you want a different analyzer to handle extracting
    the relevant information to put into a "text" field of the Lucene document.
    I know of no way you can attach different analyzers to a single field.
    You can certainly attach different analyzers to *different* fields...

    The first thing I'd try would be to write your own analyzer that keeps
    track of what kind of file it's currently analyzing and knows how to
    "do the right thing" to extract the next token for the text field.

    A cruder way would be to detect the type of document yourself,
    extract the text into a string (or some such) then feed that into
    your document.

    Best
    Erick

    On Tue, Nov 25, 2008 at 10:40 AM, Allahbaksh Mohammedali Asadullah wrote:

    HI All,
    I am indexing a set file type (html, js,jsp,xml etc). All the file type
    have a common field called as text. This field contains all the file data.
    Can I have different analyzer for depending upon file type.

    Note: I am indexing all file type with same indexer.

    Regards,
    Allahbaksh

    Allahbaksh Mohammedali Asadullah

    http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>




    **************** CAUTION - Disclaimer *****************
    This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
    solely
    for the use of the addressee(s). If you are not the intended recipient,
    please
    notify the sender by e-mail and delete the original message. Further, you
    are not
    to copy, disclose, or distribute this e-mail or its contents to any other
    person and
    any such actions are unlawful. This e-mail may contain viruses. Infosys has
    taken
    every reasonable precaution to minimize this risk, but is not liable for
    any damage
    you may sustain as a result of any virus in this e-mail. You should carry
    out your
    own virus checks before opening the e-mail or attachment. Infosys reserves
    the
    right to monitor and review the content of all messages sent to or from
    this e-mail
    address. Messages sent to or from this e-mail address may be stored on the
    Infosys e-mail system.
    ***INFOSYS******** End of Disclaimer ********INFOSYS***
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erick Erickson at Nov 26, 2008 at 1:55 pm
    Why not index into different fields based upon the file type? That would be
    MUCH easier and PerFieldAnalyzerWrapper would be your friend. For
    example.....

    Doc1
    text_html: "some text here"
    type: html

    Doc2
    text_jsp: "some jsp text here'
    type: html

    etc...

    Now your PerFieldAnalyzerWrapper has your html extractor for the text_html
    field and your jsp extractor for the text_jsp field.

    Searching would require you to search across all your different types, but
    that's do-able. Or you'd have to restrict your searching to one of the
    text_* fields if you really only wanted to search that type.

    But I still think you're not clear on the difference when searching. Your
    search query analyzer will probably have no relation to your indexing
    analyzer since the type of input is irrelevant to your query (I guess
    anyway). There you're probably using one of the analyzers provided out of
    the box.

    Best
    Erick

    On Wed, Nov 26, 2008 at 7:30 AM, Allahbaksh Mohammedali Asadullah wrote:

    Hi Erik,
    Thanks a lot for your reply. You are right I want different analyzer on
    same field depending upon the other fields in the document.

    For Example
    Doc1
    Text:"Some text here"
    type:html

    Doc2
    Text: "Some jsp text here"
    type:jsp

    Now depending upon the type I wanted to use different analyzer for
    extracting and searching. Is there any way to do this?

    I am using the way which you suggested.

    Warm Regards,
    Allahbaksh

    ________________________________________
    From: Erick Erickson [erickerickson@gmail.com]
    Sent: Tuesday, November 25, 2008 9:38 PM
    To: java-user@lucene.apache.org
    Subject: Re: Analyzer

    I'm assuming that you want a different analyzer to handle extracting
    the relevant information to put into a "text" field of the Lucene document.
    I know of no way you can attach different analyzers to a single field.
    You can certainly attach different analyzers to *different* fields...

    The first thing I'd try would be to write your own analyzer that keeps
    track of what kind of file it's currently analyzing and knows how to
    "do the right thing" to extract the next token for the text field.

    A cruder way would be to detect the type of document yourself,
    extract the text into a string (or some such) then feed that into
    your document.

    Best
    Erick


    On Tue, Nov 25, 2008 at 10:40 AM, Allahbaksh Mohammedali Asadullah <
    Allahbaksh_Asadullah@infosys.com> wrote:
    HI All,
    I am indexing a set file type (html, js,jsp,xml etc). All the file type
    have a common field called as text. This field contains all the file data.
    Can I have different analyzer for depending upon file type.

    Note: I am indexing all file type with same indexer.

    Regards,
    Allahbaksh

    Allahbaksh Mohammedali Asadullah

    http://allahbaksh.blogspot.com<http://allahbaksh.blogspot.com/>




    **************** CAUTION - Disclaimer *****************
    This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended
    solely
    for the use of the addressee(s). If you are not the intended recipient,
    please
    notify the sender by e-mail and delete the original message. Further, you
    are not
    to copy, disclose, or distribute this e-mail or its contents to any other
    person and
    any such actions are unlawful. This e-mail may contain viruses. Infosys has
    taken
    every reasonable precaution to minimize this risk, but is not liable for
    any damage
    you may sustain as a result of any virus in this e-mail. You should carry
    out your
    own virus checks before opening the e-mail or attachment. Infosys reserves
    the
    right to monitor and review the content of all messages sent to or from
    this e-mail
    address. Messages sent to or from this e-mail address may be stored on the
    Infosys e-mail system.
    ***INFOSYS******** End of Disclaimer ********INFOSYS***
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedNov 25, '08 at 3:40p
activeNov 26, '08 at 1:55p
posts6
users3
websitelucene.apache.org

People

Translate

site design / logo © 2022 Grokbase