Hello,

My name is Mihai and I'm trying to write a Java search (later I'll need to port
it to PyLucene) over billions of mentions, such as Twitter statuses. Mentions
are grouped by the keywords they contain.

I'm thinking of partitioning the index for faster results as follows:

- a common index for the past week
- for earlier documents: a common index for the small groups, plus individual
  indexes for the very large groups

My questions are:

- If I search last week's index plus the individual index (does that one need to
  be opened at search request time?), will it be faster than using a single huge
  index for all groups and all weeks?
- Is
  IndexSearcher searcher = new IndexSearcher(IndexReader.open(writer, false));
  read-only? If not, how can I build numerous near-real-time readers on the same
  writer (index)?
- How can I give near-real-time access to an IndexWriter started in another
  application?
- How can I store all documents from the results? Something like an AllDocs
  (equivalent to TopDocs) from an AllDocsCollector (equivalent to
  TopDocsCollector).

I understand that Twitter submitted the code for their real-time architecture to
Lucene. Can I get my hands on that?

Thank you in advance,
Mihai


  • Ian Lea at Jul 14, 2011 at 11:51 am
    Searching billions of anything is likely to be challenging. Mark
    Miller's document at
    http://www.lucidimagination.com/content/scaling-lucene-and-solr looks
    well worth a read.
    > -if i search on last week's index and the individual index (this needs to be
    > opened at search request!?) will it be faster than using a single huge index
    > for all groups, for all weeks?
    Too many variables to say.

    > -is IndexSearcher searcher = new
    > IndexSearcher(IndexReader.open(writer,false)); read only?
    Surely searchers are read only, by definition.

    > -How can i give NearRealTime access to an IndexWriter started in another
    > application.
    Sounds impossible.

    > -How can i store all documents from results. Something like AllDocs
    > (equivalent to TopDocs) of AllDocsCollector(
    > TopDocsCollector).
    Not clear what you are asking here, but you can pass whatever you like
    as the max doc count to the assorted search methods, and do whatever
    you want with the results. Storing all docs from search results on a
    massive index doesn't sound a very clever idea.

    > I understood that Twitter submitted their code on realTime architecture to
    > lucene, can i get my hands on that ?
    No idea.


    --
    Ian.

  • Mihai Caraman at Jul 14, 2011 at 12:46 pm
    Thank you for the reply, if you need more info to understand the question,
    I'll try to be as prompt as possible.
    > -if i search on last week's index and the individual index (this needs to be
    > opened at search request!?) will it be faster than using a single huge index
    > for all groups, for all weeks?
    Searches will be made on only one group at a time (the groups are disjoint
    sets of documents). So for very large groups, the old documents will have their
    own index. I'm hoping that this lightens the workload of a search, since it
    won't have to skip over other groups while searching through large quantities
    of old documents. In other words, I won't need a filter for the group being
    searched, which would be the filter with the most hits.
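The routing described above could be sketched roughly as follows. This is plain Java with no Lucene calls, and every name in it (PartitionRouter, the partition name strings, the set of large groups) is a hypothetical placeholder rather than anything from the thread. It only decides which partitions a single-group search has to open, and it assumes that the shared partitions would still need a group filter while a dedicated archive would not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: picks the index partitions a single-group search must open. */
class PartitionRouter {
    static final String RECENT = "recent-week";            // shared index for the past week
    static final String SHARED_ARCHIVE = "shared-archive"; // shared index, older docs of small groups
    static final String GROUP_PREFIX = "group-archive-";   // one dedicated index per very large group

    private final Set<String> largeGroups;

    PartitionRouter(Set<String> largeGroups) {
        this.largeGroups = largeGroups;
    }

    /** Returns the partition names to search for one group. */
    List<String> partitionsFor(String group, boolean includeOlderThanWeek) {
        List<String> result = new ArrayList<>();
        result.add(RECENT); // recent documents of every group live here
        if (includeOlderThanWeek) {
            result.add(largeGroups.contains(group) ? GROUP_PREFIX + group : SHARED_ARCHIVE);
        }
        return result;
    }

    /** Shared partitions mix groups, so only a dedicated archive can skip the group filter. */
    boolean needsGroupFilter(String partition) {
        return !partition.startsWith(GROUP_PREFIX);
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(new HashSet<>(Arrays.asList("bigbrand")));
        System.out.println(router.partitionsFor("bigbrand", true));
        System.out.println(router.partitionsFor("smallbrand", true));
    }
}
```

Whether this layout is actually faster than one large index still depends on measurements; the sketch only makes the lookup rule explicit.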


    > Storing all docs from search results on a
    > massive index doesn't sound a very clever idea.

    It's a necessary step because I need to group results by their date, so
    I'm only storing doc IDs and retrieving their date field.
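A minimal sketch of that grouping step, again in plain Java. DateGrouper and the dateOf lookup are hypothetical stand-ins: in a real Lucene application the doc IDs would presumably come from a custom collector and the date from an indexed or stored field, neither of which is shown here. The point is only that one int per hit plus a per-day counter is enough state.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.IntFunction;

/** Hypothetical helper: counts matching documents per day, keeping only doc IDs as input. */
class DateGrouper {
    /**
     * @param docIds IDs of all matching documents
     * @param dateOf stand-in for the date-field lookup; returns a day key such as "2011-07-13"
     */
    static Map<String, Integer> countByDay(int[] docIds, IntFunction<String> dateOf) {
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps the day keys sorted
        for (int docId : docIds) {
            counts.merge(dateOf.apply(docId), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy data: the array index plays the role of the doc ID.
        String[] days = {"2011-07-13", "2011-07-12", "2011-07-13", "2011-07-13"};
        System.out.println(countByDay(new int[] {0, 1, 2, 3}, id -> days[id]));
    }
}
```

With billions of documents, the cost is likely dominated by the date lookup per hit, so how dateOf is backed matters more than the grouping itself.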
    >> I understood that Twitter submitted their code on realTime architecture to
    >> lucene, can i get my hands on that ?
    > No idea.


    "... we’re planning on contributing all these changes back to Lucene; some
    of which have already made it into Lucene’s trunk and its new realtime
    branch."
    Read more: http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html

Discussion Overview
group: java-user
category: lucene
posted: Jul 13, 2011 at 9:10 am
active: Jul 14, 2011 at 12:46 pm
posts: 3
users: 2
website: lucene.apache.org
