Hi,

I'm about to write an application that does very simple text analysis,
namely dictionary-based entity extraction. The alternative is to do
in-memory matching with substring search:

String text; // could be any size, but normally newspaper-article length
List<String> matches = new ArrayList<String>();
for( String wordOrPhrase : dictionary ) {
    if( text.indexOf( wordOrPhrase ) >= 0 ) {
        matches.add( wordOrPhrase );
    }
}

I am concerned that the above code will be quite CPU intensive; it will also
be case sensitive and will not leave any room for fuzzy matching.
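(As a point of comparison, the case-sensitivity and partial-word problems of the naive loop can at least be addressed with java.util.regex before bringing in Lucene. This is only a sketch of that baseline; the class and method names are illustrative, not from any existing code:)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DictionaryMatcher {
    // Returns the dictionary entries that occur in the text as whole
    // words/phrases, ignoring case.
    static List<String> match(String text, List<String> dictionary) {
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            // \b anchors avoid matching "art" inside "article";
            // Pattern.quote escapes any regex metacharacters in the entry.
            Pattern p = Pattern.compile(
                "\\b" + Pattern.quote(wordOrPhrase) + "\\b",
                Pattern.CASE_INSENSITIVE);
            if (p.matcher(text).find()) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}
```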

I thought this task could also be solved by indexing every piece of text that
is to be analyzed, and then executing a query per dictionary entry:

(pseudo)

lucene.index(text)
List matches
for( String wordOrPhrase : dictionary ) {
    if( lucene.search( wordOrPhrase, text_id ) gives a hit ) {
        matches.add( wordOrPhrase )
    }
}

I have not used Lucene very much, so I don't know whether it is a good idea
to use Lucene for this task at all. Could anyone please share their
thoughts on this?

Thanks,
Geir


  • Ian Lea at Jul 23, 2010 at 1:05 pm
    So, if I've understood this correctly, you've got some text and want
    to loop through a list of words and/or phrases, and see which of those
    match the text.

    e.g.

    text "some random article about something or other of some random length"

    words

    some - matches
    many - no match
    article - matches
    word - no match

    You can certainly do that with lucene. Load the text into a document
    and loop round the words or phrases searching for each. You are
    likely to need to look into analyzers depending on your requirements
    around stop words, punctuation, case, etc. And phrase/span queries
    for phrases.
    There are also probably some lucene techniques for speeding this up,
    but as ever, start simple - lucene is usually plenty fast enough.
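    (Ian's suggestion, index the text as a single document and loop over the
    dictionary, might look roughly like the sketch below. It assumes the
    Lucene 3.x API that was current at the time; the field name "content"
    and the class name are illustrative, and quoting each entry makes
    multi-word entries parse as phrase queries.)

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneDictionaryMatcher {
    // Index the text once as a single document, then run one phrase
    // query per dictionary entry against it.
    static List<String> match(String text, List<String> dictionary) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        IndexWriter writer = new IndexWriter(dir, analyzer,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("content", text, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            // Quote the entry so multi-word entries become phrase queries.
            Query q = parser.parse("\"" + QueryParser.escape(wordOrPhrase) + "\"");
            if (searcher.search(q, 1).totalHits > 0) {
                matches.add(wordOrPhrase);
            }
        }
        searcher.close();
        return matches;
    }
}
```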


    --
    Ian.


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Geir Gullestad Pettersen at Jul 27, 2010 at 8:35 pm
    Thanks for your feedback, Ian.

    I have written a first implementation of this service, and it works well.
    You mentioned there may be techniques for speeding up Lucene, something I
    am interested in knowing more about. Would you, or anyone, mind
    elaborating a bit, or giving me some pointers?

    For the record, I am using the in-memory RAMDirectory instead of a
    file-based index. I don't know if it is relevant in terms of speeding
    things up, but thought I'd mention it just to be safe.

    Thank you,

    Geir

  • William Newport at Jul 27, 2010 at 10:01 pm
    RAMDirectorys seem useful, but as the index gets larger, Java heap
    sizes can become a problem in terms of garbage collection pauses. Some
    customers are looking to use data grid products such as IBM WebSphere
    eXtreme Scale or Oracle Coherence to act as the directory for the
    index. This stores the index in memory in a set of remote JVMs, which
    avoids expensive disk I/O and considerably reduces the memory needed
    by each JVM running the Lucene engine. It's not as fast as the RAM
    directory; in our tests it is similar to a fast-disk setup.

    Once the index has been copied from disk to the grid directory, Lucene
    JVMs can be cycled and reconnected to that remote grid, avoiding
    the need to copy it to RAM on every JVM start.

    I work for IBM and am the chief architect for IBM's data grid product,
    WebSphere eXtreme Scale.

    Sent from my iPad
  • Ian Lea at Jul 28, 2010 at 8:54 am
    You could also look at MemoryIndex or InstantiatedIndex, both in
    lucene's contrib area. I think that I was also wondering if you might
    gain from using TermDocs or TermVectors or something directly.
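    (MemoryIndex is designed for exactly this one-document, many-queries
    pattern: a throwaway index for a single text, with no Directory or
    IndexWriter involved. A rough sketch, again assuming the Lucene 3.x
    contrib API and illustrative names:)

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.util.Version;

public class MemoryIndexMatcher {
    // Build a throwaway in-memory index for one text, then score each
    // dictionary entry against it; a score > 0 means the entry occurred.
    static List<String> match(String text, List<String> dictionary) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        MemoryIndex index = new MemoryIndex();
        index.addField("content", text, analyzer);

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        List<String> matches = new ArrayList<String>();
        for (String wordOrPhrase : dictionary) {
            // Quote the entry so multi-word entries are treated as phrases.
            float score = index.search(
                parser.parse("\"" + QueryParser.escape(wordOrPhrase) + "\""));
            if (score > 0.0f) {
                matches.add(wordOrPhrase);
            }
        }
        return matches;
    }
}
```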


    --
    Ian.




Discussion Overview
group: java-user
categories: lucene
posted: Jul 22, '10 at 10:31p
active: Jul 28, '10 at 8:54a
posts: 5
users: 3
website: lucene.apache.org
