Grokbase Groups Lucene dev July 2009
I'd like to enable ShingleFilter to only create shingles for a set of
(stop) words (rather than for all N tokens).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


  • Steven A Rowe at Jul 27, 2009 at 8:22 pm
    Hi Jason,

    On 7/27/2009 at 3:15 PM, Jason Rutherglen wrote:
    > I'd like to enable ShingleFilter to only create shingles for a set of
    > (stop) words (rather than for all N tokens).

    For purposes of discussion, here's some example input (first sentence from <http://en.wikipedia.org/wiki/Manufacturing>):

    Manufacturing is the use of machines, tools and labor
    to make things for use or sale.

    For n=2 and stoplist = { is, the, of, and, to, for, or }, and assuming WhitespaceAnalyzer and no unigram output, I think you want ShingleFilter to *exclude* only the following shingles; every other bigram contains at least one stopword, so it would be output:

    /machines, tools/
    /make things/

    Is this what you want?
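
The selection rule in this example can be sketched in plain Java, without Lucene. This is only an illustration of the intended output, not ShingleFilter code: the class name `StopShingles` and the method `stopBigrams` are hypothetical, and tokens are split on whitespace to mimic WhitespaceAnalyzer (so punctuation stays attached, as in "machines,").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; it does not use Lucene's ShingleFilter.
final class StopShingles {

    // Returns the bigrams of `tokens` in which at least one token is a stopword.
    static List<String> stopBigrams(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String first = tokens.get(i);
            String second = tokens.get(i + 1);
            if (stopWords.contains(first) || stopWords.contains(second)) {
                out.add(first + " " + second);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(
            Arrays.asList("is", "the", "of", "and", "to", "for", "or"));
        // Whitespace tokenization keeps punctuation attached to the token.
        List<String> tokens = Arrays.asList(
            "Manufacturing is the use of machines, tools and labor to make things for use or sale."
                .split("\\s+"));
        for (String shingle : stopBigrams(tokens, stop)) {
            System.out.println(shingle);
        }
    }
}
```

With the example sentence, the two bigrams listed above ("machines, tools" and "make things") are the only ones left out.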

    It might make sense, rather than modifying ShingleFilter, to create a new TokenFilter that can exclude terms you don't like.

    Solr has KeepWordFilter, which is close to what you want (the inverse of StopFilter), with the exception that you want to keep shingles that *contain* words from a list you supply.

    Perhaps a new TokenFilter subclass that can take in a regular expression would work? (Maybe called KeepRegexFilter.) Stopword lists are generally small enough to make building a regex to match them fairly simple, e.g. for the above list:

    (?:^|\s)(?:is|the|of|and|to|for|or)(?:\s|$)

    Alternatively/additionally, maybe a Keep{Term,Phrase,Keyword}Filter that takes in a list of words, then builds a regex like above?
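
A minimal sketch of the "take a word list, build a regex like the one above" idea, using only `java.util.regex`. No KeepRegexFilter exists in the thread beyond the suggestion, so the class `KeepRegexSketch` and its methods are hypothetical names, and the sketch tests plain strings rather than Lucene tokens.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Lucene or Solr.
final class KeepRegexSketch {

    // Builds (?:^|\s)(?:w1|w2|...)(?:\s|$) from the given word list.
    static Pattern stopWordPattern(List<String> stopWords) {
        StringBuilder alternation = new StringBuilder();
        for (String word : stopWords) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each word so regex metacharacters are matched literally.
            alternation.append(Pattern.quote(word));
        }
        return Pattern.compile("(?:^|\\s)(?:" + alternation + ")(?:\\s|$)");
    }

    // A shingle is kept when it contains at least one whole stopword.
    static boolean keep(Pattern pattern, String shingle) {
        return pattern.matcher(shingle).find();
    }
}
```

Because the pattern anchors each word on whitespace or a string boundary, a shingle such as "theory test" is not kept merely because "the" occurs inside "theory".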

    Having this functionality separate from ShingleFilter would be nice, I think, because it would be useful in other contexts.

    Steve


  • Jason Rutherglen at Jul 27, 2009 at 9:06 pm
    > Is this what you want?

    Yes.

    ShingleFilter outputs both the original tokens and the shingled tokens,
    so applying something like KeepWordFilter before ShingleFilter may not
    solve the issue.

    We can probably extend ShingleFilter to create a
    ShingleStopWordFilter, since this type of problem seems
    fairly common.

  • Steven A Rowe at Jul 27, 2009 at 9:19 pm

    On 7/27/2009 at 5:07 PM, Jason Rutherglen wrote:
    > ShingleFilter outputs the actual tokens and the shingled tokens
    > so doing things like KeepWordFilter before ShingleFilter may not
    > solve the issue.

    Actually, it's *after* ShingleFilter that the technique I was describing would be applied.

    Steve


  • Tom Burton-West at Jul 28, 2009 at 4:09 pm

    > I'd like to enable ShingleFilter to only create shingles for a set of
    > (stop) words (rather than for all N tokens).

    Hi Jason,

    You might want to take a look at our port of the Nutch CommonGrams filter to
    Solr: https://issues.apache.org/jira/browse/SOLR-908. It does what you
    want and should work with Lucene without much trouble, provided you deal with
    the dependency on org.apache.solr.analysis.BufferedTokenStream.
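
As a rough illustration of the common-grams idea, here is a plain-Java sketch. It is not the SOLR-908 code: the class `CommonGramsSketch` is a hypothetical name, and two behaviors are assumptions rather than details stated in this thread, namely that every original token is still emitted and that bigrams are joined with "_".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; it does not depend on Solr or Lucene classes.
final class CommonGramsSketch {

    // Emits every token, plus a bigram wherever either neighbor is a common word.
    static List<String> commonGrams(List<String> tokens, Set<String> commonWords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String current = tokens.get(i);
            out.add(current); // assumption: unigrams are kept
            if (i + 1 < tokens.size()) {
                String next = tokens.get(i + 1);
                if (commonWords.contains(current) || commonWords.contains(next)) {
                    out.add(current + "_" + next); // assumption: "_" separator
                }
            }
        }
        return out;
    }
}
```

Note that, unlike the bigram-only output discussed earlier in the thread, this sketch also emits the single tokens, so a later filter would be needed if only the stopword bigrams are wanted.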

    Tom Burton-West



