Rafael Turk
at Apr 24, 2008 at 1:21 am
Hi Mathieu,
What do you want to do?
*A spell checker and related keyword suggestion.*
If you want an ngram => popularity map, just use a Berkeley DB, and use this
information in your Lucene application. Lucene is an inverted index; Berkeley
DB is a key-value index.
*Great idea! Berkeley DB is definitely worth a try, simple and effective, but
I'll have to preprocess the data first. I was hoping to take advantage of
Lucene's built-in features.*
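(A minimal sketch of that ngram => count map with Berkeley DB Java Edition; the
environment directory, database name, and the tab-separated "ngram<TAB>count"
input format are assumptions for illustration, not something from this thread.)

import com.sleepycat.bind.tuple.LongBinding;
import com.sleepycat.bind.tuple.StringBinding;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;

public class NgramCountStore {
    public static void main(String[] args) throws Exception {
        // Open (or create) a Berkeley DB JE environment; the directory must already exist.
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("ngram-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "ngram-counts", dbConfig);

        // Load "ngram<TAB>count" lines into the key/value store: ngram -> frequency.
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.lastIndexOf('\t');
            StringBinding.stringToEntry(line.substring(0, tab), key);
            LongBinding.longToEntry(Long.parseLong(line.substring(tab + 1).trim()), value);
            db.put(null, key, value);
        }
        in.close();
        db.close();
        env.close();
    }
}

The Lucene application would then look up popularities with simple db.get() calls
keyed by the n-gram string.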
[]s
On Wed, Apr 23, 2008 at 10:16 AM, Mathieu Lecarme wrote:
Rafael Turk wrote:
Hi Folks,
I'm trying to load the Google Web 1T 5-gram corpus into Lucene. (This corpus
contains English word n-grams and their observed frequency counts. The length
of the n-grams ranges from unigrams (single words) to five-grams.)
I'm loading each n-gram (each row is an n-gram) as an individual Document.
This way I'll be able to search for each n-gram separately, but I'm ending up
with huge indexes, which makes them very hard to load and read.
Is there a better way to load and read n-grams into a Lucene index? Maybe
using a lower-level API?
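(For reference, the one-Document-per-n-gram approach described above looks
roughly like this with the Lucene 2.x API current at the time; the field names
and the tab-separated "ngram<TAB>count" input format are assumptions for
illustration.)

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import java.io.BufferedReader;
import java.io.FileReader;

public class NgramIndexer {
    public static void main(String[] args) throws Exception {
        // One IndexWriter for the whole load; "true" creates a fresh index.
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("ngram-index"),
                new WhitespaceAnalyzer(),
                true);

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            int tab = line.lastIndexOf('\t');
            Document doc = new Document();
            // The n-gram text, tokenized on whitespace so individual words are searchable.
            doc.add(new Field("ngram", line.substring(0, tab),
                    Field.Store.YES, Field.Index.TOKENIZED));
            // The raw frequency count, stored for retrieval but not indexed.
            doc.add(new Field("count", line.substring(tab + 1).trim(),
                    Field.Store.YES, Field.Index.NO));
            writer.addDocument(doc);
        }
        in.close();
        writer.optimize();
        writer.close();
    }
}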
More info about the Google Web 1T 5-gram corpus at:
<http://www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt>
Thanks,
Rafael
What do you want to do?
If you want an ngram => popularity map, just use a Berkeley DB, and use
this information in your Lucene application. Lucene is an inverted index;
Berkeley DB is a key-value index.
M.