FAQ
Hi,
I'm using Lucene in an open source java project at
http://belletmen.dev.java.net .
In the project there are several dictionaries with a simple structure.
All items are composed of a "phrase", and a "definition". Both parts
might contain a single word, or have lots of words.
Since both parts might contain multiple words, I used the following:
private Document buildDocument(SozlukBirimi birim){
Document doc = new Document();
doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word
in Turkish
doc.add(Field.Text("soz1", birim.getSoz()));//the same as
keyword part
doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means
meaning in Turkish
return doc;
}
As you can see, I used the first part both as a keyword field, and a
text field. The reason is that the program will try to find phrases, or
single words in the first part also.
At the first stages of the application, there were a single
English-Turkish dictionary, and I had used an analyzer in which both
English and Turkish stop words are included.
And, here my questions:
1- Do you think whether the above system is a good solution for a
dictionary, or not?
2- I'm in hesitation now, about using stop words in a dictionary. What
do you think?
3- I have a quite big timing problem. For a 107155 items of an
English-English dictionary, it took 1436 seconds to complete the
indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it
normal? Or, am I in a completely wrong way?
I'm waiting for your suggestions.
Thanks a lot.
Ahmet Aksoy


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Paul Elschot at Sep 3, 2005 at 6:46 pm
    Ahmet,
    On Saturday 03 September 2005 10:12, Ahmet Aksoy wrote:
    Hi,
    I'm using Lucene in an open source java project at
    http://belletmen.dev.java.net .
    In the project there are several dictionaries with a simple structure.
    All items are composed of a "phrase", and a "definition". Both parts
    might contain a single word, or have lots of words.
    Since both parts might contain multiple words, I used the following:
    private Document buildDocument(SozlukBirimi birim){
    Document doc = new Document();
    doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word
    in Turkish
    doc.add(Field.Text("soz1", birim.getSoz()));//the same as
    keyword part
    doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means
    meaning in Turkish
    return doc;
    }
    As you can see, I used the first part both as a keyword field, and a
    text field. The reason is that the program will try to find phrases, or
    single words in the first part also.
    At the first stages of the application, there were a single
    English-Turkish dictionary, and I had used an analyzer in which both
    English and Turkish stop words are included.
    And, here my questions:
    1- Do you think whether the above system is a good solution for a
    dictionary, or not?
    Keyword is used for searching exactly the same term, and this is normally
    reserved for things like a primary key. When the word (soz) has that role,
    it is ok. One way to determine a primary key is to consider how to
    delete a Document from the index.

    In case you anticipate searching on both the words and their meanings,
    you might also consider to index them concatenated together in another field.
    2- I'm in hesitation now, about using stop words in a dictionary. What
    do you think?
    For the word field, they are needed, stopwords are words.
    For the meaning field, you need to determine whether stopwords are
    necessary in the index.
    It is possible to use a different analyzer per field.
    3- I have a quite big timing problem. For a 107155 items of an
    English-English dictionary, it took 1436 seconds to complete the
    indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it
    normal? Or, am I in a completely wrong way?
    Since each entry is probably rather small, it is possible to keep many
    of them in RAM before writing them to disk during index building.
    You could try and increase IndexWriter.minMergeFactor first
    and consider the other merge factors of IndexWriter there later.

    Regards,
    Paul Elschot


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ahmet Aksoy at Sep 3, 2005 at 9:34 pm
    Hi Paul,

    I decided to use a minimum number of stop words in my application. I
    hope, it will work better.

    According to your suggestion, I made a few trials, and found my optimum
    values.

    The following values look like best in my case:
    mergeFactor = 100;
    minMergeDocs = 500;
    maxMergeDocs = 1000;

    Now, indexing is performed 8 to 30 times faster than before.
    Thanks a lot.
    Ahmet Aksoy

    Paul Elschot wrote:
    Ahmet,

    On Saturday 03 September 2005 10:12, Ahmet Aksoy wrote:

    Hi,
    I'm using Lucene in an open source java project at
    http://belletmen.dev.java.net .
    In the project there are several dictionaries with a simple structure.
    All items are composed of a "phrase", and a "definition". Both parts
    might contain a single word, or have lots of words.
    Since both parts might contain multiple words, I used the following:
    private Document buildDocument(SozlukBirimi birim){
    Document doc = new Document();
    doc.add(Field.Keyword("soz", birim.getSoz()));//soz means word
    in Turkish
    doc.add(Field.Text("soz1", birim.getSoz()));//the same as
    keyword part
    doc.add(Field.Text("anlam", birim.getAnlam()));//anlam means
    meaning in Turkish
    return doc;
    }
    As you can see, I used the first part both as a keyword field, and a
    text field. The reason is that the program will try to find phrases, or
    single words in the first part also.
    At the first stages of the application, there were a single
    English-Turkish dictionary, and I had used an analyzer in which both
    English and Turkish stop words are included.
    And, here my questions:
    1- Do you think whether the above system is a good solution for a
    dictionary, or not?
    Keyword is used for searching exactly the same term, and this is normally
    reserved for things like a primary key. When the word (soz) has that role,
    it is ok. One way to determine a primary key is to consider how to
    delete a Document from the index.

    In case you anticipate searching on both the words and their meanings,
    you might also consider to index them concatenated together in another field.


    2- I'm in hesitation now, about using stop words in a dictionary. What
    do you think?
    For the word field, they are needed, stopwords are words.
    For the meaning field, you need to determine whether stopwords are
    necessary in the index.
    It is possible to use a different analyzer per field.


    3- I have a quite big timing problem. For a 107155 items of an
    English-English dictionary, it took 1436 seconds to complete the
    indexing on a 600MHz Pentium 4 Laptop with 256 MB of memory. Is it
    normal? Or, am I in a completely wrong way?
    Since each entry is probably rather small, it is possible to keep many
    of them in RAM before writing them to disk during index building.
    You could try and increase IndexWriter.minMergeFactor first
    and consider the other merge factors of IndexWriter there later.

    Regards,
    Paul Elschot


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedSep 3, '05 at 8:13a
activeSep 3, '05 at 9:34p
posts3
users2
websitelucene.apache.org

2 users in discussion

Ahmet Aksoy: 2 posts Paul Elschot: 1 post

People

Translate

site design / logo © 2022 Grokbase