Hi Folks,

I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This corpus
contains English word n-grams and their observed frequency counts; the
n-grams range from unigrams (single words) to five-grams.)

I'm loading each n-gram (each row is one n-gram) as an individual Document.
This way I'll be able to search for each n-gram separately, but I'm ending up
with huge indexes, which makes the index very hard to load and read.

Is there a better way to load and read n-grams with a Lucene index? Maybe
using a lower-level API?


More info about the Google Web 1T 5-gram corpus at:
<http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>

Thanks,

Rafael
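
For reference, the corpus rows described above are tab-separated lines of the
form "<n-gram>\t<count>" (per the LDC readme). A minimal parsing sketch; the
class and method names here are illustrative, not from any existing API:

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical sketch: parse one Web 1T row of the form "<n-gram>\t<count>".
public class NgramRow {
    public static Map.Entry<String, Long> parse(String line) {
        int tab = line.lastIndexOf('\t');          // the count follows the last tab
        String ngram = line.substring(0, tab);
        long count = Long.parseLong(line.substring(tab + 1).trim());
        return new AbstractMap.SimpleEntry<>(ngram, count);
    }
}
```

Each parsed entry would then become one Lucene Document in the scheme
described above, which is where the index-size problem comes from.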


  • Julien Nioche at Apr 23, 2008 at 11:39 am
Hi Rafael,

    We initially tried to do the same but ended up developing our own API for
    querying the Web 1T. You can find more details at
    http://digitalpebble.com/resources.html
    There might be a way to reuse elements from Lucene, e.g. only the Term
    index, but I could not find an obvious way to achieve that.

    Best,

    Julien

    --
    DigitalPebble Ltd
    http://www.digitalpebble.com

  • Rafael Turk at Apr 24, 2008 at 1:13 am
    Thanks Julien,

I'll definitely give it a try!

    []s

    Rafael
  • Mathieu Lecarme at Apr 23, 2008 at 1:17 pm

What do you want to do?
    If you want an n-gram => popularity map, just use a Berkeley DB, and use
    that information in your Lucene application. Lucene is an inverted index;
    Berkeley DB is a plain index.

    M.
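
A minimal in-memory sketch of the "n-gram => popularity map" idea above,
using a plain TreeMap as a stand-in for an on-disk store such as Berkeley DB;
all names are illustrative:

```java
import java.util.TreeMap;

// Sketch of the forward (n-gram -> count) lookup a key-value store provides,
// as opposed to Lucene's inverted (term -> documents) index.
public class NgramMap {
    private final TreeMap<String, Long> counts = new TreeMap<>();

    public void put(String ngram, long count) {
        counts.put(ngram, count);
    }

    /** Returns the observed frequency, or 0 if the n-gram was never seen. */
    public long frequency(String ngram) {
        return counts.getOrDefault(ngram, 0L);
    }
}
```

A real deployment would replace the TreeMap with an on-disk store, since the
full Web 1T corpus will not fit in memory; the access pattern stays the same.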


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: [email protected]
    For additional commands, e-mail: [email protected]
  • Rafael Turk at Apr 24, 2008 at 1:21 am
    Hi Mathieu,

*What do you want to do?*

    A spell checker and related keyword suggestions.

    If you want an n-gram => popularity map, just use a Berkeley DB, and use
    that information in your Lucene application. Lucene is an inverted index;
    Berkeley DB is a plain index.

    *Great idea! Berkeley DB is definitely worth a try, simple and effective,
    but I'll have to preprocess the data first. I was hoping to take advantage
    of Lucene's built-in features.*

    *[]s*
  • Mathieu Lecarme at Apr 24, 2008 at 7:59 am

Rafael Turk wrote:

    Hi Mathieu,

    *What do you want to do?*

    A spell checker and related keyword suggestions.

    Here is a spell checker which I am trying to finalize:
    https://admin.garambrogne.net/projets/revuedepresse/browser/trunk/src/java

    If you want an n-gram => popularity map, just use a Berkeley DB, and use
    that information in your Lucene application. Lucene is an inverted index;
    Berkeley DB is a plain index.

    *Great idea! Berkeley DB is definitely worth a try, simple and effective,
    but I'll have to preprocess the data first. I was hoping to take advantage
    of Lucene's built-in features.*

    Lucene provides nice tools without the need to index. Analyzer and
    TokenFilter can help you, I guess.

    M.
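
One way the frequency counts could feed the spell checker discussed above:
rank candidate corrections (however they were generated, e.g. by edit
distance) by their observed corpus frequency. A hedged sketch; the class and
method names are illustrative:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch: among candidate corrections, prefer the one the corpus saw most.
public class FrequencyRanker {
    /** Picks the candidate with the highest observed corpus frequency. */
    public static String best(List<String> candidates, Map<String, Long> freq) {
        return candidates.stream()
                .max(Comparator.comparingLong(c -> freq.getOrDefault(c, 0L)))
                .orElse(null);
    }
}
```

With the full 5-gram counts, the same ranking could also condition on the
surrounding words instead of unigram frequency alone.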

  • Karl Wettin at Apr 24, 2008 at 11:02 am

Rafael Turk wrote:

    *Great idea! Berkeley DB is definitely worth a try, simple and effective,
    but I'll have to preprocess the data first.*

    JDBM has a more appealing license, if you ask the ASF.


    karl



Discussion Overview
group: java-user
categories: lucene
posted: Apr 23, '08 at 11:26a
active: Apr 24, '08 at 11:02a
posts: 7
users: 4
website: lucene.apache.org
