Hi,

I need to parse Java log files with Lucene 3.0.3. The StandardAnalyzer is
OK, except for its handling of dots.

E.g. it handles "java.lang.NullPointerException" as one word, so searching for
"NullPointerException" returns nothing.

I need an Analyzer that works like StandardAnalyzer, but handles dots as
word separators (e.g. the way it handles commas).

Please advise.

Thanks.

Regards,
Benzion.


  • Erick Erickson at Jan 1, 2011 at 5:37 pm
    Have you looked at:
    http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

    Best
    Erick

  • Hasan Diwan at Jan 1, 2011 at 6:21 pm

    On 31 December 2010 11:12, Benzion G wrote:
    > I need an Analyzer that will work as StandardAnalyzer, but will handle dots as
    > word separators (e.g. as it handles commas).

    Before you hand it to the Analyzer, why not run a line.replace(".", ",")?
    --
    Sent from my mobile device

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Benzion G at Jan 1, 2011 at 9:47 pm
    Hi,

    Of course I thought about replacing dots with commas or blanks. But I add this
    field with Field.Store.YES.
    If I replace the dots with commas, the text will appear with commas in search
    results.

    I also considered adding it as 2 fields:
    1. With dots replaced by commas, indexed, with Field.Store.NO
    2. The original message with Field.Store.YES and not indexed.

    But I'm afraid it will make my index files much bigger. Since I'm indexing
    log files, the index will be too big anyway, so I can't make it even bigger.
    --
    View this message in context: http://lucene.472066.n3.nabble.com/parsing-Java-log-file-with-Lucene-3-0-3-tp2173046p2177453.html
    Sent from the Lucene - Java Users mailing list archive at Nabble.com.

  • Hasan Diwan at Jan 1, 2011 at 9:49 pm

    On 1 January 2011 21:47, Benzion G wrote:
    > But I'm afraid it will make my index files much bigger. Since I'm indexing
    > log files the index will be anyway too big so I can't make it even bigger.

    Have you tried it out? How large are your log files and how large do
    you expect them to get?
  • Benzion G at Jan 1, 2011 at 9:50 pm
    I tried to understand where StandardAnalyzer and the other Standard* classes
    handle these dots and commas, and how I can change that behaviour. I
    debugged it as well, but I failed to understand it.
  • Benzion G at Jan 1, 2011 at 9:56 pm
    I'm testing it with ~50 MB of log files. But in the production environment
    the log files will be ~10 GB.
  • Erick Erickson at Jan 2, 2011 at 2:24 am
    <<<If I'll replace dot with commas it will appear with commas in search
    results.>>>

    No, that is not the case. Storing a field stores an exact copy of the
    input, without any analysis. The intent of storing a field is to return
    something to display in the results list that reflects the original
    document. What use would it be to store something that had gone
    through the analysis chain? Would you really want to show the user
    say, the stemmed version of the input text?

    Best
    Erick
  • Benzion G at Jan 2, 2011 at 4:45 am
    Of course I want to store the original message and then show it to the user.
    That's why I can't change it, and why the place to handle the dots is the
    Analyzer. So how can I make StandardAnalyzer handle dots the way it handles
    commas?
  • Erick Erickson at Jan 2, 2011 at 3:59 pm
    Some days I just can't read...

    First question: Why do you require StandardAnalyzer? Are you really making
    use of the special processing? Take a look at other analyzer options:
    PatternAnalyzer, SimpleAnalyzer, etc.

    If you really require StandardAnalyzer, consider using two fields,
    field_original and field_processed. Store (but don't index) the original
    string in field_original. Pre-process and analyze (but don't store) the
    text in field_processed. Search against field_processed and display from
    field_original.

    This won't bloat your index, since the operations are orthogonal anyway.
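
A minimal sketch of the two-field idea above, written as a plain-Java model rather than against the Lucene API. The class TwoFieldIndexDemo, its methods, and the sample log lines are hypothetical. In Lucene 3.0.x the same split would presumably be expressed with Field.Store.YES plus Field.Index.NO for field_original, and Field.Store.NO plus Field.Index.ANALYZED for field_processed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of "store the original, index a processed copy". NOT Lucene code.
final class TwoFieldIndexDemo {
    // Plays the role of field_original: kept verbatim for display, never searched.
    private final List<String> storedOriginals = new ArrayList<>();
    // Plays the role of field_processed: term -> ids of the lines containing it.
    // Only the terms are kept, not the processed text itself.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void add(String logLine) {
        int docId = storedOriginals.size();
        storedOriginals.add(logLine);
        // Pre-process a copy: dots become separators, then lowercase and split on whitespace.
        String processed = logLine.replace('.', ' ').toLowerCase(Locale.ROOT);
        for (String term : processed.split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each line at most once per term.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
    }

    // Searches the processed terms, returns the untouched originals.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        String key = term.toLowerCase(Locale.ROOT);
        for (int docId : postings.getOrDefault(key, Collections.<Integer>emptyList())) {
            hits.add(storedOriginals.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        TwoFieldIndexDemo index = new TwoFieldIndexDemo();
        index.add("ERROR java.lang.NullPointerException at Foo.bar");
        System.out.println(index.search("NullPointerException"));
    }
}
```

The hit is displayed with its dots intact, because only the searchable copy was pre-processed.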

    Best
    Erick
  • Benzion G at Jan 4, 2011 at 7:33 am
    Thank you guys! Looks like SimpleAnalyzer is OK for my application. I'm still
    testing, but so far it looks good.
  • Benzion G at Jan 4, 2011 at 11:23 am
    Problem with SimpleAnalyzer! It ignores digits.

    For the text "customer 123 found" it will take only "customer" and "found", and
    will ignore "123". StandardAnalyzer handles digits fine, but it has the dots
    problem I mentioned before.

    Is there an understandable guide on how to write my own Analyzer - a hybrid of
    StandardAnalyzer and SimpleAnalyzer?
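
The digit behaviour described above can be reproduced with a small plain-Java model. This is a sketch, not Lucene code: it assumes that SimpleAnalyzer builds lowercased tokens from runs of letters only, and the class and method names (TokenCharDemo, tokenize) are hypothetical. Swapping the character test from "letter" to "letter or digit" is enough to keep "123".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-Java model of character-based tokenizing. NOT Lucene code.
final class TokenCharDemo {

    // Builds lowercased tokens from runs of characters accepted by isTokenChar.
    // Any rejected character ends the current token.
    static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "customer 123 found";
        // Letters only: the digits are treated as separators and vanish.
        System.out.println(tokenize(text, Character::isLetter));        // [customer, found]
        // Letters or digits: "123" survives as its own token.
        System.out.println(tokenize(text, Character::isLetterOrDigit)); // [customer, 123, found]
    }
}
```

With the second predicate, dots still act as separators, so "java.lang.NullPointerException" splits into three tokens.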
  • Erick Erickson at Jan 4, 2011 at 12:49 pm
    Lucene in Action has an example of creating a SynonymAnalyzer that
    you can adapt. The general idea is to subclass Analyzer and
    implement the required methods, perhaps wrapping a Tokenizer
    in a chain of Filters.

    You might be able to crib some ideas from
    solr.analysis.WordDelimiterFilter.
    Best
    Erick


  • Benzion G at Jan 4, 2011 at 5:48 pm
    OK, I managed to write the Analyzer I need. I can't say that I understand
    all of Lucene's Analyzer-Tokenizer-Filter logic, but here is MyAnalyzer.
    I hope it helps somebody else.


    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.CharTokenizer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;

    public class MyAnalyzer extends Analyzer
    {
        @Override
        public TokenStream tokenStream(String field, final Reader reader)
        {
            // Split on every character that is not a letter or a digit,
            // then lowercase the tokens and drop English stop words.
            TokenStream result = new MyCharTokenizer(reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(true, result,
                    StopAnalyzer.ENGLISH_STOP_WORDS_SET);

            return result;
        }

        static class MyCharTokenizer extends CharTokenizer
        {
            // Used only by the commented-out alternative in isTokenChar.
            public static final char[] BAD_CHARACTERS =
                { '.', ',', ':', '(', ')', ' ', '[', ']', ';', '\'', '"', '|', '-', '_',
                  '*', '<', '>', '=', '+', '%', '#', '~', '`', '^' };

            public MyCharTokenizer(Reader input)
            {
                super(input);
            }

            @Override
            protected boolean isTokenChar(char paramChar)
            {
                // Letters and digits form tokens; everything else separates them.
                return Character.isLetterOrDigit(paramChar);

                // If you need to filter out specific characters, and not just
                // non-digits-or-letters as above, replace the return with:
                //for (int i = 0; i < BAD_CHARACTERS.length; i++)
                //{
                //    if (BAD_CHARACTERS[i] == paramChar)
                //    {
                //        return false;
                //    }
                //}
                //return true;
            }
        }
    }
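
To see what a chain like the one in MyAnalyzer does to a log line, here is a plain-Java model of its steps: split on non-alphanumerics, lowercase, drop stop words. It is only a sketch and does not call Lucene. AnalyzerChainDemo and analyze are hypothetical names, the stop list is an abbreviated stand-in for StopAnalyzer.ENGLISH_STOP_WORDS_SET, and StandardFilter is left out on the assumption that it has nothing to do for tokens that do not come from StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java model of the chain tokenizer -> lowercase -> stop filter. NOT Lucene code.
final class AnalyzerChainDemo {
    // Hypothetical, abbreviated stop list; the real Lucene set contains more words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "at", "in", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Step 1: split on runs of characters that are neither letters nor decimal digits,
        // which is intended to match the Character.isLetterOrDigit test in MyCharTokenizer.
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator yields an empty first piece
            }
            // Step 2: lowercase, as LowerCaseFilter would.
            String token = raw.toLowerCase(Locale.ROOT);
            // Step 3: drop stop words, as StopFilter would.
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("java.lang.NullPointerException at line 42"));
        // [java, lang, nullpointerexception, line, 42]
    }
}
```

Because the query text goes through the same steps, a search for "NullPointerException" is reduced to "nullpointerexception" and matches the indexed token.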


Discussion Overview
group: java-user
categories: lucene
posted: Jan 1, '11 at 5:36p
active: Jan 4, '11 at 5:48p
posts: 14
users: 3
website: lucene.apache.org
