Hi everybody:

I have a big problem making parallel searches across big indexes.
I have indexed over 60,000 articles with Lucene, and I have distributed the
indexes across 10 computer nodes so that each index does not exceed 60 MB. When
I run parallel searches across those indexes, I get the results after
40 MINUTES! Then I put the indexes in memory to do the parallel searches,
but I still get the results after 3 minutes. That's too much time to wait!
How can I reduce the search time?
Could you help me, please?
I need help!

Greetings


  • Erick Erickson at Oct 12, 2006 at 12:37 am
    Something's extremely not right <G>....

    First of all, I'm running a 1.4G index on a single machine and getting very
    good results, under 10 seconds even for the most complex queries I'm firing.
    This is with 870,000 documents, and includes sorting by criteria other than
    relevance. And using span queries. And using wildcards that build their own
    filters.

    So, something must be very different about how you are using lucene to get
    such poor search times.

    So, please tell us significantly more about the structure of your index and
    post the shortest example you can of your search code that demonstrates the
    problem, and maybe some of the wiser heads than mine can help out too.

    There should be no need to put the index in RAM, the index is just not big
    enough.

    So, some of the things I think would help analyze your problems....

    1> hardware and operating systems you're running on, including how much
    memory you're allowing your JVM to have.
    2> network topology. If you're running the searchers locally and just
    storing the indexes on remote machines, you're possibly having network
    latency problems. Personally, I don't think your problem is properly
    addressed by splitting your index. 600MB of index is just not big enough to
    need this.
    3> This *should* work on a local machine with just a single index. How much
    trouble would it be to set it up that way? Can you try that and see what
    difference it makes?
    4> how did you build your index? Is it optimized? Can you give us an idea of
    how many fields you are storing and some indication of the relative sizes of
    each? Mostly, I'm asking whether you have a bunch of small fields and some
    other very large ones.
    5> Put one of the indexes on your local machine and get a copy of Luke
    (google luke lucene) and fire off a few queries via Luke and tell us what
    kind of results you get. Actually, this is probably the first thing you
    should try. If you get radically different results with Luke than your code,
    you can be pretty sure you're doing something out of the ordinary.
    6> Timings of *only* the search code. By that I mean the time it takes for
    searcher.search to complete. It's vaguely possible that the search is fine,
    but something you're doing when processing the results is taking forever. I
    have no evidence for this, of course, but it'd be a useful bit of
    information.
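    Point 6 can be measured with a simple harness; here is a minimal sketch in
    plain Java, where the Runnable placeholders stand in for the actual
    searcher.search call and the Hits processing (all names are illustrative,
    not from the thread):

```java
// Times a single step in milliseconds, so the search call and the
// result-processing loop can be measured separately.
class SearchTimer {
    static long timeMillis(Runnable step) {
        long start = System.nanoTime();
        step.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}

class TimingDemo {
    public static void main(String[] args) {
        // Placeholders: substitute searcher.search(query) and the loop
        // that walks the Hits (highlighting, field access, ...).
        long searchMs  = SearchTimer.timeMillis(() -> { /* searcher.search(query) */ });
        long processMs = SearchTimer.timeMillis(() -> { /* process the Hits */ });
        System.out.println("search=" + searchMs + "ms, processing=" + processMs + "ms");
    }
}
```

    If searchMs is small and processMs is huge, the problem is in the result
    handling, not in Lucene.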

    I don't know if this helps much, but from your description, I think there's
    a fundamental, correctable problem because nobody would use the product if
    it gave such poor search times. And lots of people use it.

    Best
    Erick
  • Ariel Isaac Romero Cartaya at Oct 16, 2006 at 12:44 pm
    First of all, what is your machine architecture? Do you have a super PC?
    I'm running this on a dual Xeon with Hyper-Threading at 2.4 GHz, 1 GB RAM,
    and a SATA hard disk.
    I cannot get the times you get. I think the problem may be in the structure
    of my index: for example, I use a special analyzer for English that filters
    stopwords and does stemming and synonym detection with WordNet, but I need
    all of that to get the results I get. Could you help me fix this problem?
    I'm desperate.
    Greetings
  • Ariel Isaac Romero Cartaya at Oct 17, 2006 at 1:55 pm
    Here are pieces of my source code:

    First of all, I search all the indexes for a given query string with a
    parallel searcher. As you can see, I build a multi-field query. Below that
    you can see the index format I use; I store all the fields in the index. My
    index is optimized.

    public Hits search(String query) throws IOException {

        AnalyzerHandler analizer = new AnalyzerHandler();
        Query pquery = null;

        try {
            pquery = MultiFieldQueryParser.parse(query,
                    new String[] {"title", "sumary", "filename", "content", "author"},
                    analizer.getAnalyzer());
        } catch (ParseException e1) {
            e1.printStackTrace();
        }

        Searchable[] searchables = new Searchable[IndexCount];

        for (int i = 0; i < IndexCount; i++) {
            searchables[i] = new IndexSearcher(
                    RAMIndexsManager.getInstance().getDirectoryAt(i));
        }

        Searcher parallelSearcher = new ParallelMultiSearcher(searchables);

        return parallelSearcher.search(pquery);
    }

    Then, in another method, I obtain the fragments where the terms occur. As
    you can see, I use an EnglishAnalyzer that filters stopwords and does
    stemming and synonym detection:

    public Vector getResults(Hits h, String string) throws IOException {

        Vector resultItems = new Vector();
        int cantHits = h.length();
        if (cantHits != 0) {

            QueryParser qparser = new QueryParser("content",
                    new AnalyzerHandler().getAnalyzer());
            Query query1 = null;
            try {
                query1 = qparser.parse(string);
            } catch (ParseException e1) {
                e1.printStackTrace();
            }

            QueryScorer scorer = new QueryScorer(query1);
            Highlighter highlighter = new Highlighter(scorer);
            Fragmenter fragmenter = new SimpleFragmenter(150);
            highlighter.setTextFragmenter(fragmenter);

            for (int i = 0; i < cantHits; i++) {

                org.apache.lucene.document.Document doc = h.doc(i);

                String filename = doc.get("filename");
                filename = filename.substring(filename.indexOf("/") + 1);
                String filepath = doc.get("filepath");
                Integer id = new Integer(h.id(i));
                String score = h.score(i) + "";
                int fileSize = Integer.parseInt(doc.get("filesize"));
                String title = doc.get("title");
                String summary = doc.get("sumary");

                // fragment: highlight the best passages from the stored body
                String body = doc.get("content");
                TokenStream stream = new EnglishAnalyzer()
                        .tokenStream("content", new StringReader(body));
                String[] fragment = highlighter.getBestFragments(stream, body, 4);

                if (fragment.length == 0) {
                    fragment = new String[] { "" };
                }

                StringBuilder buffer = new StringBuilder();
                for (int j = 0; j < fragment.length; j++) {
                    buffer.append(validateCad(fragment[j])).append("...\n");
                }
                String stringFragment = buffer.toString();

                ResultItem result = new ResultItem();
                result.setFilename(filename);
                result.setFilepath(filepath);
                result.setFilesize(fileSize);
                result.setScore(Double.parseDouble(score));
                result.setFragment(fragment);
                result.setId(id);
                result.setSummary(summary);
                result.setTitle(title);
                resultItems.add(result);
            }
        }

        return resultItems;
    }


    So these are the main methods that perform the search. Could you tell me if
    I am doing something wrong or inefficient?
    As you can see, I make a parallel search. I have a dual Xeon machine with
    two Hyper-Threading CPUs at 2.4 GHz and 512 MB RAM, but when I run the
    parallel searcher I can see at my Linux command prompt that 3 of my 4 CPUs
    are always idle while only one is working. Why does that happen, if the
    parallel searcher should saturate all the CPUs with work?

    I hope you can help me.
  • Karl wettin at Oct 17, 2006 at 4:21 pm

    On 17 Oct 2006, at 15:55, Ariel Isaac Romero Cartaya wrote:

    Here are pieces of my source code:

    public Hits search(String query) throws IOException {
        for (int i = 0; i < IndexCount; i++) {
            searchables[i] = new IndexSearcher(
                    RAMIndexsManager.getInstance().getDirectoryAt(i));
    I didn't look further than this. You can initially try to reuse your
    IndexSearchers. That should help a lot.

    http://lucene.apache.org/java/docs/api/org/apache/lucene/search/IndexSearcher.html
    "For performance reasons it is recommended to open only one
    IndexSearcher and use it for all of your searches."
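    The reuse Karl recommends amounts to constructing the searcher once and
    sharing it across all queries. A minimal sketch of the pattern in plain
    Java, where the hypothetical ExpensiveSearcher stands in for opening the
    ParallelMultiSearcher and its ten IndexSearchers (none of these names are
    from the thread):

```java
// Initialization-on-demand holder: the shared instance is built exactly
// once, on first use, and reused for every subsequent search.
class ExpensiveSearcher {
    static int constructions = 0;            // counts how often the setup cost is paid
    ExpensiveSearcher() { constructions++; } // stands in for opening all index readers

    String search(String query) {
        return "results for " + query;
    }
}

class SearcherHolder {
    private static final ExpensiveSearcher INSTANCE = new ExpensiveSearcher();
    static ExpensiveSearcher get() { return INSTANCE; }
}

class ReuseDemo {
    public static void main(String[] args) {
        // Two queries, one searcher: the constructor runs only once.
        SearcherHolder.get().search("lucene");
        SearcherHolder.get().search("parallel");
        System.out.println("constructions = " + ExpensiveSearcher.constructions);
    }
}
```

    In the code posted above, the equivalent fix would be to build the
    Searchable array and the ParallelMultiSearcher once at startup instead of
    inside every call to search().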


    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Doron Cohen at Oct 12, 2006 at 12:48 am
    These times really are not reasonable, but 60K docs is not much for Lucene.
    I once indexed ~1M docs of ~20KB each, i.e. an input collection of ~20GB.
    The resulting index was ~2.5GB, and search time for a short 2-3 word
    free-text (OR) query was ~300ms for a "hot" query and ~900ms for a "cold"
    query. This was on a single machine.

    It may very well be that your settings are different - e.g. fields stored
    or not (I don't recall if my fields were stored), types of queries, etc.

    If you can provide more information on the scenario, people on the list
    running similar settings would be able to comment:
    - what queries?
    - what happens if you run the same query again?
    - how often are you updating the index, optimizing, opening new searchers?
    - do you reuse searchers or open a new one for each query?
    - how many results are you asking for? (more than 50?)
    - did you measure search time on a single searcher?
    (without the distribution, even for a subset of your documents)
    what times do you measure here?
    - did you try to use MultiReader instead of MultiSearcher?

    Also, I understand that you partitioned the index, so that you have 10
    indexes each "covering" 6K docs out of your 60K docs, right? If so, is this
    just a test in preparation for a bigger index? Because if 60K docs is all
    you intend to index, then at least in my experience partitioning the index
    is not a must.
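    The MultiReader suggestion in the list above would look roughly like this
    (a sketch against the Lucene API of that era, untested here; directories
    and IndexCount are the names used in the thread):

```
// One IndexSearcher over a single MultiReader, instead of a
// ParallelMultiSearcher over many Searchables:
IndexReader[] readers = new IndexReader[IndexCount];
for (int i = 0; i < IndexCount; i++)
    readers[i] = IndexReader.open(directories[i]);   // open once, at startup
Searcher searcher = new IndexSearcher(new MultiReader(readers));
Hits hits = searcher.search(query);                  // reuse the searcher for every query
```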

    The FAQ is also helpful, for instance
    http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-47995886fbb41d8e7da103f8cda2d935f99dc6c8


    Last, a word of encouragement: based on my experience and what I have seen
    on this list, I am certain you will be able to speed this up.

    - Doron

    "Ariel Isaac Romero Cartaya" <isaacrc82@gmail.com> wrote on 11/10/2006
    14:36:31:
    Hi everybody:

    I have a big problem making prallel searches in big indexes.
    I have indexed with lucene over 60 000 articles, I have distributed the
    indexes in 10 computers nodes so each index not exceed the 60 MB of size. I
    makes parallel searches in those indexes but I get the search results after
    40 MINUTES !!! Then I put the indexes in memory to do the parallel searches
    But still I get the search results after 3 minutes !!! that`s to mucho time
    waiting !!!
    How Can I reduce the time of search ???
    Could you help me please ???
    I need help !!!!!

    Greetings

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Discussion Overview
group: java-user @ lucene
posted: Oct 11, '06 at 9:37p
active: Oct 17, '06 at 4:21p
posts: 6
users: 4
website: lucene.apache.org
