Hi
I am currently indexing uploaded documents (PDF, MS Word, etc.). These
documents can be searched, and what the search returns to the user are
summaries of the documents. Currently the summaries are extracted when
indexing the file (the summary is constructed by taking the first 10 lines of
the document and is stored in the index as a field). This is not ideal (a static
summary), and I was wondering if it would be possible to create a dynamic
summary when a hit is found and highlight the terms found. The content of
the document is not stored in the index.

So basically what I'm looking to do is:

1) PDF indexed
2) PDF body contains the word "search"
3) Do a search and return the hit
4) Construct a summary with the term "search" included.

I'm not sure how to go about doing this (I presume it is possible). I would
be grateful for any advice.


Cheers
Amin

  • Michael McCandless at Mar 7, 2009 at 10:41 am
    You should look at contrib/highlighter, which does exactly this.

    Mike

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Erik Hatcher at Mar 7, 2009 at 10:51 am
    With the caveat that if you're not storing the text you want
    highlighted, you'll have to retrieve it somehow and send it into the
    Highlighter yourself.

    Erik
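
    As a rough illustration of what a dynamic summary involves once you have
    the document text in hand, here is a minimal plain-Java sketch. It is not
    the contrib/highlighter API and does not use Lucene at all: the class and
    method names (DynamicSummarySketch, summarize) are hypothetical, and it
    matches raw substrings rather than analyzed tokens, so treat it only as a
    picture of the idea (find the term, cut a fragment around it, mark the hit).

```java
public class DynamicSummarySketch {

    /** Case-insensitive indexOf; returns -1 when term is not found at or after from. */
    public static int indexOfIgnoreCase(String text, String term, int from) {
        for (int i = Math.max(0, from); i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Returns a fragment of text around the first occurrence of term, with each
     * occurrence that lies fully inside the fragment wrapped in <b>...</b>.
     * Returns an empty string when the term is empty or does not occur.
     * (Hypothetical helper for illustration; fragments may cut words in half.)
     */
    public static String summarize(String text, String term, int context) {
        if (term.isEmpty()) {
            return "";
        }
        int hit = indexOfIgnoreCase(text, term, 0);
        if (hit < 0) {
            return "";
        }
        int start = Math.max(0, hit - context);
        int end = Math.min(text.length(), hit + term.length() + context);

        StringBuilder out = new StringBuilder();
        if (start > 0) {
            out.append("...");
        }
        int pos = start;
        int next = hit;
        // Copy plain text up to each hit, then the hit itself wrapped in tags.
        while (next >= 0 && next + term.length() <= end) {
            out.append(text, pos, next);
            out.append("<b>").append(text, next, next + term.length()).append("</b>");
            pos = next + term.length();
            next = indexOfIgnoreCase(text, term, pos);
        }
        out.append(text, pos, end);
        if (end < text.length()) {
            out.append("...");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String body = "Lucene is a library. You can search documents quickly with it.";
        System.out.println(summarize(body, "search", 12));
    }
}
```

    The text passed to summarize() has to come from somewhere: either a stored
    field or a fresh extraction from the original file, which is the trade-off
    raised in this reply.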
  • Amin Mohammed-Coleman at Mar 7, 2009 at 11:37 am
    Hi,
    That's what I was thinking. I would need to get the file, extract the
    text again, and then pass it through the highlighter. The other option is
    storing the content in the index, the downside being that the index is
    going to be large. Which would be the recommended approach?

    Cheers

    Amin
  • Erik Hatcher at Mar 7, 2009 at 11:46 am
    It depends :)

    It's a trade-off. If storing is not prohibitive, I recommend that as
    it makes life easier for highlighting.

    Erik
  • Uwe Schindler at Mar 7, 2009 at 11:51 am
    You could store the text contents compressed; I think extracting text from
    PDF files is much more time-intensive than decompressing a stored field. Also,
    text-only contents often compress very well. In my opinion, if the
    (uncompressed) contents of the docs are not very large (by "very large" I mean
    several megabytes each), I would prefer storing them in the index.

    -----
    Uwe Schindler
    H.-H.-Meier-Allee 63, D-28213 Bremen
    http://www.thetaphi.de
    eMail: uwe@thetaphi.de
  • Amin Mohammed-Coleman at Mar 7, 2009 at 12:11 pm
    Cool. I will use compression and store the content in the index. Is there
    anything special I need to do to decompress the text? I presume I can just do
    doc.get("content")?
    Thanks for your advice, all!
  • Uwe Schindler at Mar 7, 2009 at 12:22 pm

    No, nothing special is needed: just use Field.Store.COMPRESS when adding to
    the index and Document.get() when fetching. The decompression is done
    automatically.

    You may wonder why not enable compression for all fields. The reason is that
    compression is pure overhead for very small, short fields, so you should only
    use it for large contents (it is the same as compressing very small files with
    ZIP/GZIP: such files mostly end up larger than without compression).

    Uwe
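
    The point about small fields can be checked with the JDK alone. The sketch
    below is an illustration under assumptions, not Lucene code: it uses
    java.util.zip.Deflater directly, and whether that matches the exact
    settings Lucene applies to compressed stored fields is an assumption. The
    class name CompressionOverheadSketch and the sample strings are made up for
    the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressionOverheadSketch {

    /** Returns the number of bytes produced by DEFLATE-compressing the UTF-8 bytes of s. */
    public static int compressedSize(String s) {
        byte[] input = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end(); // release native resources
        return out.size();
    }

    public static void main(String[] args) {
        // A short value, such as a name or an id field.
        String shortValue = "Amin";

        // A long, repetitive value standing in for extracted document text.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            sb.append("Lucene highlighting and dynamic summaries. ");
        }
        String longValue = sb.toString();

        System.out.println("short: " + shortValue.length()
                + " bytes -> " + compressedSize(shortValue) + " bytes");
        System.out.println("long:  " + longValue.length()
                + " bytes -> " + compressedSize(longValue) + " bytes");
    }
}
```

    Running it shows the short value growing (the compressed stream carries a
    header and checksum) while the long value shrinks considerably, which is
    why compression is worth enabling only for the large body field.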


  • Amin Mohammed-Coleman at Mar 7, 2009 at 12:26 pm
    Thanks! The final piece that I needed to do for the project!
    Cheers

    Amin
  • Amin Mohammed-Coleman at Mar 7, 2009 at 4:33 pm
    Hi
    Got it working! Thanks again for your help!


    Amin
  • Amin Mohammed-Coleman at Mar 9, 2009 at 7:51 am
    Hi
    I am seeing some strange behaviour with the highlighter and I'm wondering if
    anyone else is experiencing this. In certain instances I don't get a
    summary being generated. I perform the search and the search returns the
    correct document. I can see that the lucene document contains the text in
    the field. However after doing:

    SimpleHTMLFormatter simpleHTMLFormatter =
        new SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");

    // required for highlighting
    Query query2 = multiSearcher.rewrite(query);
    Highlighter highlighter =
        new Highlighter(simpleHTMLFormatter, new QueryScorer(query2));

    ...

    String text = doc.get(FieldNameEnum.BODY.getDescription());
    TokenStream tokenStream = analyzer.tokenStream(
        FieldNameEnum.BODY.getDescription(), new StringReader(text));
    String result = highlighter.getBestFragments(tokenStream, text, 3, "...");




    the string result is empty. This is very strange: if I try a different term
    that exists in the document, then I get a summary. For example, I have a Word
    document that contains the terms "document" and "aspectj". If I search for
    "document" I get the correct document but no highlighted summary. However,
    if I search using "aspectj" I get the same document with a highlighted
    summary.


    Just to mention, I do rewrite the original query before performing the
    highlighting.


    I'm not sure what I'm missing here. Any help would be appreciated.


    Cheers

    Amin
  • Amin Mohammed-Coleman at Mar 11, 2009 at 6:00 pm
    Hi

    Apologies for resending this mail. Just wondering if anyone has
    experienced the issue below. I'm not sure if this could happen due to the
    nature of the document. It does seem strange that a search on one term
    returns a summary while another does not, even though the same document is
    being returned.

    I'm asking so I can code around this if it is normal.


    Apologies again for re sending this mail

    Cheers

    Amin

  • Markharw00d at Mar 11, 2009 at 6:12 pm
    If you can supply a JUnit test that recreates the problem, I think we can
    start to make progress on this.



  • Amin Mohammed-Coleman at Mar 12, 2009 at 7:48 am
    Hi
    Please find attached a test case plus a document. Just to mention, this
    also occurs sometimes with other files.


    Cheers
    Amin
    On Wed, Mar 11, 2009 at 6:11 PM, markharw00d wrote:

    If you can supply a Junit test that recreates the problem I think we can
    start to make progress on this.



    Amin Mohammed-Coleman wrote:
    Hi

    Apologies for re sending this mail. Just wondering if anyone has
    experienced the below. I'm not sure if this could happen due nature of
    document. It does seem strange one term search returns summary while another
    does not even though same document is being returned.

    I'm asking this so I can code around this if is normal.


    Apologies again for re sending this mail

    Cheers

    Amin

    Sent from my iPhone

    On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman wrote:

    Hi
    I am seeing some strange behaviour with the highlighter and I'm wondering
    if anyone else is experiencing this. In certain instances I don't get a
    summary being generated. I perform the search and the search returns the
    correct document. I can see that the lucene document contains the text in
    the field. However after doing:

    SimpleHTMLFormatter simpleHTMLFormatter = new
    SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
    //required for highlighting
    Query query2 = multiSearcher.rewrite(query);
    Highlighter highlighter = new Highlighter(simpleHTMLFormatter,
    new QueryScorer(query2));
    ...

    String text= doc.get(FieldNameEnum.BODY.getDescription());
    TokenStream tokenStream =
    analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new
    StringReader(text));
    String result = highlighter.getBestFragments(tokenStream,
    text, 3, "...");


    the string result is empty. This is very strange, if i try a different
    term that exists in the document then I get a summary. For example I have a
    word document that contains the term "document" and "aspectj". If I search
    for "document" I get the correct document but no highlighted summary.
    However if I search using "aspectj" I get the same doucment with
    highlighted summary.

    Just to mentioned I do rewrite the original query before performing the
    highlighting.

    I'm not sure what i'm missing here. Any help would be appreciated.

    Cheers
    Amin

    On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <aminmc@gmail.com>
    wrote:
    Hi

    Got it working! Thanks again for your help!


    Amin


    On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <aminmc@gmail.com>
    wrote:
    Thanks! The final piece that I needed to do for the project!

    Cheers

    Amin
    On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler wrote:
    cool. i will use compression and store in index. is there anything special
    i need to do for decompressing the text? i presume i can just do
    doc.get("content")?
    thanks for your advice all!

    No, just use Field.Store.COMPRESS when adding to the index and Document.get()
    when fetching. The decompression is done automatically.

    You may wonder why compression isn't enabled for all fields. The reason is
    that it adds overhead for very small and short fields, so you should only use
    it for large contents (it's the same as compressing very small files with
    ZIP/GZIP: these files mostly end up larger than without compression).

    Uwe
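
Uwe's point about small fields can be checked with a short, self-contained sketch. It uses java.util.zip.Deflater from the JDK as a stand-in compressor; whether that matches what Lucene does internally for Field.Store.COMPRESS is an assumption here, and the class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustration only: compares raw and DEFLATE-compressed sizes of a string.
final class CompressionOverheadSketch {

    /** Number of bytes in the UTF-8 encoding of text. */
    static int rawSize(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes produced by compressing the UTF-8 encoding of text. */
    static int compressedSize(String text) {
        Deflater deflater = new Deflater();
        deflater.setInput(text.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        byte[] buffer = new byte[1024];
        int total = 0;
        // Drain the compressor; only the byte count matters for this comparison.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Repeats unit the given number of times (avoids needing String.repeat). */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String small = "id-42";
        String large = repeat("lucene highlighter ", 500);
        System.out.println("small: raw=" + rawSize(small) + " compressed=" + compressedSize(small));
        System.out.println("large: raw=" + rawSize(large) + " compressed=" + compressedSize(large));
    }
}
```

The short value comes out larger after compression because of the fixed header and checksum bytes, while the long, repetitive text shrinks considerably.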


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org



  • Mark harwood at Mar 12, 2009 at 9:52 am
    The attachment didn't make it through here. Can you add it as an attachment to a new JIRA issue?

    Thanks,
    Mark





  • Amin Mohammed-Coleman at Mar 12, 2009 at 11:29 am
    Hi

    Did both attachments not come through?

    Cheers
    Amin


  • Amin Mohammed-Coleman at Mar 12, 2009 at 11:49 am
    JIRA raised:

    https://issues.apache.org/jira/browse/LUCENE-1559

    Thanks


  • Amin Mohammed-Coleman at Mar 12, 2009 at 5:57 pm
    Hi

    I have found that this is not an issue with POI. I extracted the text using
    POI in a different way and the term is extracted properly. When I store the
    text and retrieve it, the term exists. However, running the text through
    the highlighter doesn't work.

    I will post a test case with a plain text file on JIRA. Currently on a
    cramped train!

    Cheers

  • Amin Mohammed-Coleman at Mar 12, 2009 at 6:42 pm
    JIRA updated. It includes a new test case which shows the highlighter not
    working as expected.
  • Amin Mohammed-Coleman at Mar 12, 2009 at 9:58 pm
    I did the following:

    highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);


    which works.
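
Why raising the limit helps can be illustrated with a toy model. The sketch below is not Lucene's Highlighter; it is a simplified, hypothetical stand-in (the class and method names are invented) that assumes the highlighter only looks at a limited prefix of the text, which is what the fix above suggests. A term that first appears beyond that prefix produces an empty result even though the document matched the search.

```java
// Toy model only: this is NOT Lucene's Highlighter. It illustrates how a cap on
// the number of characters analyzed can hide matches that occur late in a text.
final class TruncatedHighlightSketch {

    /**
     * Wraps the first occurrence of term in <b>...</b>, but only inspects the
     * first maxCharsToAnalyze characters of text. Returns "" when the term is
     * not found inside that window.
     */
    static String highlight(String text, String term, int maxCharsToAnalyze) {
        int limit = Math.min(text.length(), Math.max(0, maxCharsToAnalyze));
        String window = text.substring(0, limit);
        int start = window.indexOf(term);
        if (start < 0) {
            return ""; // term is absent or lies beyond the analyzed window
        }
        int end = start + term.length();
        return window.substring(0, start)
                + "<b>" + window.substring(start, end) + "</b>"
                + window.substring(end);
    }

    public static void main(String[] args) {
        // "aspectj" is near the start; "document" only appears at the very end.
        StringBuilder sb = new StringBuilder("aspectj ");
        for (int i = 0; i < 100; i++) {
            sb.append('x');
        }
        sb.append(" document");
        String text = sb.toString();

        System.out.println("[" + highlight(text, "aspectj", 20) + "]");
        System.out.println("[" + highlight(text, "document", 20) + "]");
        System.out.println("[" + highlight(text, "document", Integer.MAX_VALUE) + "]");
    }
}
```

In this model the search for "document" yields an empty summary with a small window and a highlighted one once the window covers the whole text, which mirrors the behaviour reported in this thread. Analyzing the entire text costs more time on large documents, so an unlimited setting is a trade-off rather than a free fix.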
  • Michael McCandless at Mar 12, 2009 at 10:03 pm
    IndexWriter has such behavior too, and because it was such a common trap
    (developers could not understand why their content was being truncated), we
    made that setting explicit, up front, so you were aware of it.

    I think this in general is a reasonable approach for settings that "lose"
    stuff (content, highlighted terms, etc.).

    Maybe we should do the same for highlighter?

    Mike
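
One way to picture the suggestion is a constructor that refuses to pick a limit on the caller's behalf. The sketch below is purely hypothetical: none of these types exist in Lucene, and the names are invented for illustration. It only shows the design idea of forcing a limit that can "lose" content to be chosen explicitly.

```java
// Hypothetical API sketch: none of these types exist in Lucene.
final class ExplicitLimitSketch {

    /** A limit the caller must choose up front; there is deliberately no default. */
    static final class MaxCharsToAnalyze {
        static final MaxCharsToAnalyze UNLIMITED = new MaxCharsToAnalyze(Integer.MAX_VALUE);

        final int limit;

        MaxCharsToAnalyze(int limit) {
            if (limit < 0) {
                throw new IllegalArgumentException("limit must be >= 0");
            }
            this.limit = limit;
        }
    }

    /** Stand-in for a highlighter whose only constructor requires the limit. */
    static final class SketchHighlighter {
        private final MaxCharsToAnalyze max;

        SketchHighlighter(MaxCharsToAnalyze max) {
            if (max == null) {
                throw new IllegalArgumentException("a limit must be chosen explicitly");
            }
            this.max = max;
        }

        /** Returns the portion of text that would be analyzed under the chosen limit. */
        String analyzedWindow(String text) {
            return text.substring(0, Math.min(text.length(), max.limit));
        }
    }

    public static void main(String[] args) {
        SketchHighlighter capped = new SketchHighlighter(new MaxCharsToAnalyze(5));
        SketchHighlighter full = new SketchHighlighter(MaxCharsToAnalyze.UNLIMITED);
        System.out.println(capped.analyzedWindow("highlighter"));
        System.out.println(full.analyzedWindow("highlighter"));
    }
}
```

Because the limit appears at every construction site, a truncated summary can be traced back to a choice someone made rather than to a hidden default.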

    Amin Mohammed-Coleman wrote:
    I did the following:

    highlighter.setMaxDocCharsToAnalyze(Integer.MAX_VALUE);


    which works.

    On Thu, Mar 12, 2009 at 6:41 PM, Amin Mohammed-Coleman <aminmc@gmail.com
    wrote:
    JIRA updated. Includes new testcase which shows highlighter not
    working as
    expected.


    On Thu, Mar 12, 2009 at 5:56 PM, Amin Mohammed-Coleman <aminmc@gmail.com
    wrote:
    Hi

    I have found that it is not issue with POI. I extracted text using
    PoI but
    differenlty and the term is extracted properly. When I store the
    text and
    retrieve it the term exists. However running the text through
    highlighter
    doesn't work

    I will post test case with plain text file on JIRA. Currently on a
    cramped
    train!

    Cheers



    On 11 Mar 2009, at 18:11, markharw00d <markharw00d@yahoo.co.uk>
    wrote:

    If you can supply a Junit test that recreates the problem I think
    we can
    start to make progress on this.



    Amin Mohammed-Coleman wrote:
    Hi

    Apologies for re sending this mail. Just wondering if anyone has
    experienced the below. I'm not sure if this could happen due
    nature of
    document. It does seem strange one term search returns summary
    while another
    does not even though same document is being returned.

    I'm asking this so I can code around this if is normal.


    Apologies again for re sending this mail

    Cheers

    Amin


    On 9 Mar 2009, at 07:50, Amin Mohammed-Coleman <aminmc@gmail.com>
    wrote:

    Hi
    I am seeing some strange behaviour with the highlighter and I'm wondering
    if anyone else is experiencing this. In certain instances I don't get a
    summary being generated. I perform the search and the search returns the
    correct document. I can see that the Lucene document contains the text in
    the field. However, after doing:

    SimpleHTMLFormatter simpleHTMLFormatter =
        new SimpleHTMLFormatter("<span class=\"highlight\"><b>", "</b></span>");
    // required for highlighting
    Query query2 = multiSearcher.rewrite(query);
    Highlighter highlighter =
        new Highlighter(simpleHTMLFormatter, new QueryScorer(query2));
    ...

    String text = doc.get(FieldNameEnum.BODY.getDescription());
    TokenStream tokenStream =
        analyzer.tokenStream(FieldNameEnum.BODY.getDescription(), new StringReader(text));
    String result = highlighter.getBestFragments(tokenStream, text, 3, "...");


    the string result is empty. This is very strange: if I try a different
    term that exists in the document then I get a summary. For example, I have
    a Word document that contains the terms "document" and "aspectj". If I
    search for "document" I get the correct document but no highlighted
    summary. However, if I search using "aspectj" I get the same document with
    a highlighted summary.

    Just to mention, I do rewrite the original query before performing the
    highlighting.

    I'm not sure what I'm missing here. Any help would be appreciated.

    Cheers
    Amin

    On Sat, Mar 7, 2009 at 4:32 PM, Amin Mohammed-Coleman <aminmc@gmail.com>
    wrote:
    Hi

    Got it working! Thanks again for your help!


    Amin


    On Sat, Mar 7, 2009 at 12:25 PM, Amin Mohammed-Coleman <aminmc@gmail.com>
    wrote:
    Thanks! The final piece that I needed to do for the project!

    Cheers

    Amin

    On Sat, Mar 7, 2009 at 12:21 PM, Uwe Schindler <uwe@thetaphi.de>
    wrote:
    cool. i will use compression and store in index. is there anything special
    i need to do for decompressing the text? i presume i can just do
    doc.get("content")?
    thanks for your advice all!
    No, just use Field.Store.COMPRESS when adding to the index and
    Document.get() when fetching. The decompression is done automatically.

    You may think, why not enable compression for all fields? The reason is
    that this is an overhead for very small and short fields. So you should
    only use it for large contents (it's the same as compressing very small
    files with ZIP/GZIP: these files mostly get larger than without
    compression).

    Uwe
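
    Uwe's point about small fields can be checked with the JDK alone. The
    sketch below uses GZIP from java.util.zip, following the ZIP/GZIP analogy
    in his message; it is not the codec Lucene uses internally, and the class
    and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

/** Compares raw and GZIP-compressed sizes; an analogy, not Lucene's codec. */
final class CompressionOverheadSketch {

    /** Size in bytes of the UTF-8 encoding of s. */
    static int rawSize(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Size in bytes of the GZIP-compressed UTF-8 encoding of s. */
    static int gzipSize(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        String small = "id-42"; // stands in for a short field value
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("lucene highlighter "); // stands in for a large body field
        }
        String large = sb.toString();

        System.out.println("small: raw=" + rawSize(small) + " gzip=" + gzipSize(small));
        System.out.println("large: raw=" + rawSize(large) + " gzip=" + gzipSize(large));
    }
}
```

    The short value grows because the compressed format carries a fixed
    header and trailer, while the long repetitive text shrinks considerably.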


  • Amin Mohammed-Coleman at Mar 13, 2009 at 5:50 am
    Hi

    I think that would be good. Probably a silly thing to ask but I guess
    there is a performance implication by setting it to max value.

    Is there a general setting that other developers use?

    Cheers

    Amin



  • Michael McCandless at Mar 13, 2009 at 10:11 am

    Amin Mohammed-Coleman wrote:

    I think that would be good.
    I'll open an issue.
    Probably a silly thing to ask but I guess there is a performance
    implication by setting it to max value.
    Right. And it's tough choosing a default in situations like this --
    performance vs losing stuff.

    However, there's a new highlighter:

    https://issues.apache.org/jira/browse/LUCENE-1522

    which looks like it may have promising performance and no default
    "loses highlighted terms" limit, I think.

    Mike

  • Amin Mohammed-Coleman at Mar 13, 2009 at 10:36 am
    Sweet! When will this highlighter be available? Can I use this now?

    Cheers!

  • Michael McCandless at Mar 13, 2009 at 10:42 am
    Well, it's not yet committed.

    You can use it now by pulling the patch attached to the issue &
    testing it yourself. If you do so, please report back! This is how
    Lucene improves.

    I'm hoping we can include it in 2.9...

    Mike
  • Amin Mohammed-Coleman at Mar 13, 2009 at 12:09 pm
    Absolutely! I have received considerable help from the community and there
    is so much more I want to ask!

    Cheers!

    Amin
  • Amin Mohammed-Coleman at Mar 13, 2009 at 2:20 pm
    OK. I tried to apply the patch(es) and completely messed it up (user
    error). Is there a full example of the highlighter available that I can
    apply and test?

    Cheers
    Amin


Discussion Overview
group: java-user
categories: lucene
posted: Mar 7, '09 at 9:39a
active: Mar 13, '09 at 2:20p
posts: 27
users: 5
website: lucene.apache.org
