[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899812#action_12899812 ]

Robert Muir commented on SOLR-1860:
-----------------------------------

Committed this as rev 986612 (and to 3x as rev 986615).

improve stopwords list handling
-------------------------------

Key: SOLR-1860
URL: https://issues.apache.org/jira/browse/SOLR-1860
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
Attachments: SOLR-1860.patch


Currently Solr makes it easy to use English stopwords for StopFilter or CommonGramsFilter.
Recently in Lucene, we added stopword lists (mostly, but not all, from Snowball) to all the language analyzers.
So it would be nice if a user could easily specify that they want a French stopword list and use it for StopFilter or CommonGrams.
The ones from Snowball, however, are formatted differently from the others (although in Lucene we have parsers to deal with this).
Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers.
There are two approaches. I think I prefer the first, but I'm not sure it matters as long as we have good examples (maybe a foreign-language example schema?)
1. The user would specify something like:
<filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
2. We add support for Snowball-formatted stopword lists, and the user could specify something like:
<filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... />
The disadvantage to this is that they have to know where the list is, what format it's in, etc. For example, Snowball doesn't provide Romanian or Turkish stopword lists to go along with its stemmers, so we had to add our own.
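
For the second option, the loader would have to understand the Snowball layout. A minimal sketch, assuming the usual Snowball conventions (a `|` starts a comment that runs to the end of the line, and a line may hold several whitespace-separated words). The class name `SnowballStopwords` and the sample text are made up for illustration; this is not the Lucene parser itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Solr.
class SnowballStopwords {

    /** Collects the words of a Snowball-style list, ignoring '|' comments and blank lines. */
    static Set<String> parse(Reader reader) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Everything from the first '|' to the end of the line is a comment.
                int comment = line.indexOf('|');
                if (comment >= 0) {
                    line = line.substring(0, comment);
                }
                // A single line may carry more than one word.
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Illustrative excerpt only, not the contents of french_stop.txt.
        String sample =
              " | an illustrative stop word list\n"
            + "au             |  a + le\n"
            + "aux            |  a + les\n"
            + "\n"
            + "avec           |  with\n";
        System.out.println(parse(new StringReader(sample)));
    }
}
```

A plain one-word-per-line list passes through this parser unchanged, which is one reason a `format` switch may matter less than knowing where the file lives.
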
Let me know what you guys think, and I will create a patch.
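
The first option boils down to a reflective call. A rough sketch of what a `fromAnalyzer` attribute might do, assuming only that the named class exposes a public static `getDefaultStopSet()` as described above. `StopSetFromAnalyzer` is an invented name, and `DemoFrenchAnalyzer` is a hypothetical stand-in so the example runs without Lucene on the classpath.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;

// Hypothetical helper, not the code that was committed for this issue.
class StopSetFromAnalyzer {

    /** Loads the named class and invokes its public static getDefaultStopSet(). */
    static Set<?> load(String analyzerClassName) {
        try {
            Class<?> clazz = Class.forName(analyzerClassName);
            Method method = clazz.getMethod("getDefaultStopSet");
            if (!Modifier.isStatic(method.getModifiers())) {
                throw new IllegalArgumentException(
                    analyzerClassName + ".getDefaultStopSet() is not static");
            }
            // Static method, so there is no receiver object.
            return (Set<?>) method.invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "Cannot load a stop set from " + analyzerClassName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load("DemoFrenchAnalyzer").contains("avec"));
    }
}

// Stand-in for an analyzer with a default stop set; not the real FrenchAnalyzer.
class DemoFrenchAnalyzer {
    public static Set<String> getDefaultStopSet() {
        return Set.of("au", "aux", "avec");
    }
}
```

The caller never learns which file or format backs the set, which is the appeal of this option.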
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

  • Lance Norskog (JIRA) at Aug 21, 2010 at 3:47 am
    [ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900969#action_12900969 ]

    Lance Norskog commented on SOLR-1860:
    -------------------------------------

    This is a nice piece of work. One thing I've learned is that configurations should be as flat and transparent as possible. Pushing all of these word lists out of the classes and into files is a great improvement. The Greek Analyzer, for example, is (was) nothing but a default list of stopwords.

    But, having the stopwords as text files runs smack into character encoding wackiness (why, yes, I do use windows). Can the file format or importer at least support the XML or URL notations for Unicode characters? Maybe a list of words that include prot&#x0274; ge for protege?

  • Robert Muir (JIRA) at Aug 21, 2010 at 4:48 am
    [ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900975#action_12900975 ]

    Robert Muir commented on SOLR-1860:
    -----------------------------------

    {quote}
    The Greek Analyzer, for example, is (was) nothing but a default list of stopwords.
    {quote}

    This is no longer true. There is a stemmer, too.

    {quote}
    But, having the stopwords as text files runs smack into character encoding wackiness (why, yes, I do use windows).
    {quote}

    What wackiness? The files are all unicode UTF-8, which windows too supports.

    {quote}
    Can the file format or importer at least support the XML or URL notations for Unicode characters?
    {quote}

    Only if we escape *ALL* English strings in all files too. But I prefer things to be readable.

  • Lance Norskog (JIRA) at Aug 21, 2010 at 9:53 pm
    [ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901103#action_12901103 ]

    Lance Norskog commented on SOLR-1860:
    -------------------------------------

    bq. What wackiness? The files are all unicode UTF-8, which windows too supports.

    'Supports' does not mean 'you can get it done without a pounding headache'. UTF-8 is not the default and you cannot make it the default. I'm guessing some Linux editors don't understand the funky binary starting bytes that mark a UTF-8 file. Having UTF-8 characters in the Java source blows up also. An XML file format would go a long way toward usability.
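
The starting bytes mentioned above are the byte order mark, which decodes to the single character U+FEFF. Java's UTF-8 decoder hands that character through like any other, so a loader that wants to tolerate such files has to drop it itself. A small sketch of one way to do that; `BomSkipper` is an invented name and nothing here is taken from the Solr or Lucene loaders.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only.
class BomSkipper {
    private static final int BOM = 0xFEFF;

    /** Returns a reader positioned just after a leading U+FEFF, if one is present. */
    static Reader skipBom(Reader reader) {
        try {
            PushbackReader in = new PushbackReader(reader, 1);
            int first = in.read();
            // Put the character back unless it was the BOM or end of input.
            if (first != -1 && first != BOM) {
                in.unread(first);
            }
            return in;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Reader reader = skipBom(new StringReader("\uFEFFavec"));
        System.out.println((char) reader.read());
    }
}
```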

  • Uwe Schindler (JIRA) at Aug 21, 2010 at 10:03 pm
    [ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901104#action_12901104 ]

    Uwe Schindler commented on SOLR-1860:
    -------------------------------------

    If it's documented to be UTF-8, it's clear what you have to provide (in Solr). If you use Lucene directly, the stopword file parser does not care about encodings at all; it simply takes a java.io.Reader.
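
To illustrate the split described above: the parser below only ever sees characters from a `java.io.Reader`, and the caller decides how bytes become characters. The names `EncodingAgnosticParser`, `parse`, and `utf8` are invented for this sketch, which assumes one word per line and is not the Lucene loader.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical names throughout; a sketch, not library code.
class EncodingAgnosticParser {

    /** Reads one word per line. The parser never learns which encoding produced the characters. */
    static Set<String> parse(Reader reader) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    /** The caller picks the charset explicitly instead of relying on the platform default. */
    static Reader utf8(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        byte[] fileBytes = "prot\u00e9g\u00e9\n\u00e9t\u00e9\n".getBytes(StandardCharsets.UTF_8);
        Set<String> words = parse(utf8(new ByteArrayInputStream(fileBytes)));
        System.out.println(words.contains("prot\u00e9g\u00e9"));
    }
}
```

Passing the charset explicitly is what makes the result independent of the operating system's default encoding.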
  • Robert Muir (JIRA) at Aug 21, 2010 at 10:09 pm
    [ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901105#action_12901105 ]

    Robert Muir commented on SOLR-1860:
    -----------------------------------

    Lance, I don't know what your OS problems are, but the whole reason UTF-8 exists is so that things like these files can be viewed and edited in their own languages and not encoded in hex.

    So, I don't plan on making life cryptic for people who use languages other than English because you are scared of UTF-8 or don't know how to configure your computer.

