Grokbase Groups Lucene dev July 2010
[ https://issues.apache.org/jira/browse/SOLR-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-1860:
------------------------------

Attachment: SOLR-1860.patch

Here is a first step: two of the analyzers (Brazilian, Czech) use embedded stopword sets.
I think this was an oversight; this patch moves them to .txt files like the rest.

improve stopwords list handling
-------------------------------

Key: SOLR-1860
URL: https://issues.apache.org/jira/browse/SOLR-1860
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Affects Versions: 3.1
Reporter: Robert Muir
Assignee: Robert Muir
Priority: Minor
Attachments: SOLR-1860.patch


Currently Solr makes it easy to use English stopwords for StopFilter or CommonGramsFilter.
Recently in Lucene, we added stopword lists (mostly, but not all, from Snowball) to all the language analyzers.
So it would be nice if a user could easily specify that they want a French stopword list, and use it for StopFilter or CommonGrams.
The ones from Snowball, however, are formatted differently from the others (although in Lucene we have parsers to deal with this).
Additionally, we abstract this from Lucene users by adding a static getDefaultStopSet to all analyzers.
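
For readers unfamiliar with the Snowball list format mentioned above: a line may hold several whitespace-separated words, and a vertical bar (|) starts a comment that runs to the end of the line. The following is a minimal sketch of such a parser using only the JDK. It is not Lucene's own loader; the class name SnowballStopwordSketch and the sample text are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class SnowballStopwordSketch {

    /**
     * Parses Snowball-style stopword text.
     * Everything after '|' on a line is a comment; the remaining
     * text is split on whitespace and each token is one stopword.
     */
    static Set<String> parse(String text) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : text.split("\r?\n")) {
            int comment = line.indexOf('|');
            if (comment >= 0) {
                line = line.substring(0, comment); // drop the trailing comment
            }
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {             // blank lines yield one empty token
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Illustrative sample only, not the contents of any real list.
        String sample =
              " | a comment-only line\n"
            + "au             | a word with a trailing comment\n"
            + "aux  avec      | two words on one line\n"
            + "\n";
        System.out.println(parse(sample)); // prints [au, aux, avec]
    }
}
```

A plain one-word-per-line list would parse the same way here, but its comment convention may differ, which is why the format would have to be stated somewhere.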
There are two approaches. I think I prefer the first, but I'm not sure it matters as long as we have good examples (maybe a foreign-language example schema?).
1. The user would specify something like:
<filter class="solr.StopFilterFactory" fromAnalyzer="org.apache.lucene.analysis.FrenchAnalyzer" .../>
This would just grab the CharArraySet from the FrenchAnalyzer's getDefaultStopSet method; who cares where it comes from or how it's loaded.
2. We add support for Snowball-formatted stopword lists, and the user could do something like:
<filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/snowball/french_stop.txt" format="snowball" ... />
The disadvantage to this is they have to know where the list is, what format it's in, etc. For example, Snowball doesn't provide Romanian or Turkish
stopword lists to go along with its stemmers, so we had to add our own.
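
To make approach 1 concrete, here is a rough sketch of how a factory might resolve a class name to a stop set by calling a static getDefaultStopSet reflectively. It is not the patch. DemoFrenchAnalyzer is a hypothetical stand-in so the example runs without Lucene on the classpath, and it returns a plain Set rather than a CharArraySet. The helper name stopSetFromAnalyzer is also an assumption.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class FromAnalyzerSketch {

    /** Hypothetical stand-in for an analyzer that exposes a static default stop set. */
    public static class DemoFrenchAnalyzer {
        private static final Set<String> DEFAULT_STOP_SET =
            Collections.unmodifiableSet(new LinkedHashSet<>(Arrays.asList("au", "aux", "avec")));

        public static Set<String> getDefaultStopSet() {
            return DEFAULT_STOP_SET;
        }
    }

    /** Loads the named class and invokes its public static no-arg getDefaultStopSet(). */
    static Set<?> stopSetFromAnalyzer(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            Method method = clazz.getMethod("getDefaultStopSet");
            if (!Modifier.isStatic(method.getModifiers())) {
                throw new IllegalArgumentException(className + ".getDefaultStopSet() is not static");
            }
            Object result = method.invoke(null); // null receiver: static call
            if (!(result instanceof Set)) {
                throw new IllegalArgumentException(className + ".getDefaultStopSet() did not return a Set");
            }
            return (Set<?>) result;
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing method, and failures inside the call.
            throw new IllegalArgumentException("Cannot load stop set from " + className, e);
        }
    }

    public static void main(String[] args) {
        // In the proposal, this name would come from the fromAnalyzer attribute.
        Set<?> stopSet = stopSetFromAnalyzer(DemoFrenchAnalyzer.class.getName());
        System.out.println(stopSet); // prints [au, aux, avec]
    }
}
```

The appeal of this route is that the schema only names an analyzer; file locations and list formats stay an implementation detail of that analyzer.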
Let me know what you guys think, and I will create a patch.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Posted: Jul 26, '10 at 1:04p