Hello everybody

1. I have tons of digitized text containing the usual errors introduced
by the OCR process.
2. I have indexed it with Solr, and that is working OK.
3. I have added an index-based spellchecker for words and phrases, hoping
to offer suggestions in the form of "suspicious" alternative query
expressions, or expressions related to the current one. The intention is
to find documents that contain the original expression with OCR errors
(for example, the user searches for "state and democracy" and the
interface offers "stete and demcraci" as an alternate query expression).

My first problem is that I need suggestions even when the expression
has returned results. It seems that suggestions only appear when there
are no results. Is there a way to get them in both cases?

The second question is: for the purposes I've mentioned, is it best to
use the spellchecker or the MLT component? Or something else (such as a
fuzzy query)?

Thanks a lot
German

  • Chris Hostetter at Dec 15, 2009 at 8:58 pm
    : My first problem appears because I need suggestions inclusive when the
    : expression has returned results. It's seems that only appear
    : suggestions when there are no results. Is there a way to do so?

    can you give us an example of what your queries look like? with the
    example configs, i can get matches, as well as suggestions...


    http://localhost:8983/solr/spell?q=ide&spellcheck=true

    : The second question is: For the purposes that I've mentioned, is the
    : best way to use spellchecker or mlt component? Or some other (as a
    : fuzzy query)?

    there's no clear cut answer to that -- i don't remember anyone else ever
    asking about anything particularly similar to what you're doing, so i
    don't know that there is any precedent for a "best" way to go about it.



    -Hoss
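
The `/spell` request shown in the reply above can also be assembled programmatically. This is a minimal Python sketch, not a definitive client: the helper name `spellcheck_url` is hypothetical, and it assumes the example `/spell` handler on localhost from the reply. The `spellcheck.count` and `spellcheck.extendedResults` parameters are standard SpellCheckComponent options added here for illustration; whether suggestions appear alongside matches still depends on the handler configuration.

```python
from urllib.parse import urlencode


def spellcheck_url(query, base="http://localhost:8983/solr/spell", count=5):
    """Build a Solr request that asks for matches and spelling suggestions.

    `base` is the example handler from the thread; `count` is an
    illustrative default, not a recommended value.
    """
    params = {
        "q": query,
        "spellcheck": "true",                  # enable the spellcheck component
        "spellcheck.count": count,             # max suggestions per term
        "spellcheck.extendedResults": "true",  # include term frequencies
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(spellcheck_url("state and democracy"))
```

Fetching that URL and comparing the `response` and `spellcheck` sections of the result would show whether both are populated for a given query.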
  • Lance Norskog at Dec 17, 2009 at 9:00 pm
    Character-based NGrams are a good tool for this problem. MLT is a
    document-wide numerical analysis.

    If the common types of OCR mistakes differ from the variations that
    NGrams cover, you might tune the ngram generator. For example, swapped
    letters might not happen very often, while single- and multi-word
    errors must happen a lot.
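
To illustrate why character n-grams tolerate OCR noise, here is a small standalone Python sketch. It is not Solr's n-gram filter: the function names, the `_` padding character, and the choice of Jaccard overlap are assumptions made for the example. It only shows that a garbled word still shares most of its n-grams with the intended word.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of `word`, padded with '_'
    so that the first and last letters also form n-grams."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # The OCR-damaged form shares many bigrams with the intended word;
    # an unrelated word shares few or none.
    print(ngram_similarity("democracy", "demcraci"))
    print(ngram_similarity("democracy", "state"))
```

Changing `n` is one way to experiment with the tuning mentioned above: smaller n-grams are more forgiving of errors but also match more unrelated words.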

    If you do a facet query on your indexed terms, you will get a lot of
    facets with only one appearance in the index. These are often
    misspellings. It is possible to automate pulling these and creating a
    matching set of synonyms for words that appear in the spelling index.
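
The automation described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested pipeline: `term_counts` stands in for the term/count pairs a facet query would return, `dictionary` stands in for the words in the spelling index, the helper name and the `cutoff` value are hypothetical, and Python's `difflib` is used as a stand-in matcher. The output lines use the `misspelling => word` form accepted by Solr synonym files.

```python
import difflib


def singleton_synonyms(term_counts, dictionary, cutoff=0.75):
    """Map terms that occur only once to their closest dictionary word.

    term_counts: dict of indexed term -> number of occurrences (assumed
                 to come from a facet query on the indexed field).
    dictionary:  list of trusted words (assumed: the spelling index).
    Returns synonym lines such as "stete => state".
    """
    lines = []
    for term, count in sorted(term_counts.items()):
        if count != 1 or term in dictionary:
            continue  # keep only singleton terms that are not known words
        match = difflib.get_close_matches(term, dictionary, n=1, cutoff=cutoff)
        if match:
            lines.append(f"{term} => {match[0]}")
    return lines


if __name__ == "__main__":
    counts = {"state": 120, "stete": 1, "democracy": 80, "demcraci": 1}
    for line in singleton_synonyms(counts, ["state", "democracy"]):
        print(line)
```

The generated mappings would need a manual review before use, since rare but legitimate words also show up as singletons.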

    --
    Lance Norskog
    [email protected]
  • Lance Norskog at Dec 17, 2009 at 9:10 pm
    Another thing you might look into is stemming. The Porter stemmer
    included in Solr is "aggressive", meaning that it will tend to do
    weird things with misspellings. A different stemmer called KStem,
    available from www.lucidimagination.com/Downloads, is less
    aggressive. Porter turns "changes" and "changing" into "chang",
    while KStem does not go this far.

    http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem

    --
    Lance Norskog
    [email protected]

Discussion Overview
group: solr-user @
categories: lucene
posted: Dec 7, '09 at 5:03a
active: Dec 17, '09 at 9:10p
posts: 4
users: 3
website: lucene.apache.org...
