I've been wandering through this list and other Lucene resources on
the web for a while, trying to figure out the possible outlines of a
search solution that could fit my case. But as a Lucene newbie I
decided to ask for your help.

Now this is the scenario. I am building a webmail application that
allows users to aggregate their email addresses. Every user can set
up as many as 6 mailboxes on his user account on the website. From
day 1 the site enables a new user to start downloading emails from
his accounts.
Email messages and attachments are permanently stored (I am not
describing that here; just presume that the storage side of the
problem is solved).

Now the task is to let users run full-text searches over their stored
emails (forget the attachments).

Say that an average user receives and downloads 30 emails per day
per account. That makes 180 mails/day, 5,400 mails/month, and 32,400
mails in the first 6 months.
Say that on day 1 we are faced with 100,000 registered users and that
this user base does not change over the 6-month timeframe (OK, it's
not realistic; it's just to set a sense of scale).

We will face 100,000 * 32,400 = 3,240,000,000 mails. At, say, 20 KB
per message, that is 64,800,000,000 KB, which makes some 60 TB (!!).

The Lucene index would be some 20 TB (assuming it takes about 30% of
the space needed for the actual content, as stated elsewhere on this
list).
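
As a sanity check, here is the arithmetic as a throwaway calculation
(the 20 KB per message and the 30% index-to-content ratio are the
assumptions stated above):

    // Back-of-the-envelope check of the numbers above.
    public class CapacityEstimate {
        public static void main(String[] args) {
            long users = 100_000L;
            long mailsPerUser = 30L * 6 * 30 * 6; // 30/day x 6 accounts x ~30 days x 6 months = 32,400
            long totalMails = users * mailsPerUser;        // 3,240,000,000
            long rawKB = totalMails * 20;                  // assumed 20 KB per message
            double rawTB = rawKB / (1024.0 * 1024 * 1024); // KB -> TB
            double indexTB = rawTB * 0.30;                 // assumed 30% of content size
            System.out.printf("mails=%d raw=%.1f TB index=%.1f TB%n",
                              totalMails, rawTB, indexTB);
        }
    }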

OK, let's finally presume that my forecast is unrealistic or just
foolish; let's say that we will have to handle a Lucene index of only
3 or 4 TB.

I was thinking that the best and simplest approach would be to
create one index per user: every user searches only his own mail,
querying only his own index. Many not-too-big indexes could easily be
moved and scaled out to additional disks or machines.
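
In Lucene terms, the per-user layout would look roughly like this (a
minimal sketch against a recent Lucene API; the indexes/<userId>
directory layout and the field names are only my illustration):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class PerUserIndexer {
        // One small directory per user, e.g. indexes/42/, so each index
        // can be moved to another disk or machine on its own.
        public static void indexMail(String userId, String messageId,
                                     String subject, String body) throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("indexes", userId));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("id", messageId, Store.YES));  // exact-match key
                doc.add(new TextField("subject", subject, Store.YES)); // analyzed, searchable
                doc.add(new TextField("body", body, Store.NO));        // searchable, not stored
                writer.addDocument(doc);
            }
            // A real indexer would keep the IndexWriter open and reuse it
            // rather than opening one per message.
        }
    }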

But I've read that even 100 open indexes could be too many (not
least because of OS limits on the number of concurrently open files).

Maybe my scenario is not a common one, but where would you start
when outlining a solution that within 6 months could grow to a
3-4 TB index?

Browsing the web, it seems that scaling or distributing Lucene
indexes is quite a task. I have too many different options for a
possible architecture in mind, and I don't know where to start.

As you may understand, I'm not a search specialist. I presume that
if the project gets off to a good start I will get some help or,
better, some money to hire someone. The task is to "resist" until
that time without creating an unmanageable, unscalable situation in
the first 3-6 months.

Thanks for sharing your points of view.

And I may add that the search feature is absolutely needed from the
start; building it in a second phase is out of the question.

Giovanni



  • Otis Gospodnetic at Aug 27, 2008 at 5:27 am
    Giovanni,

    You could try the approach you described: one index per user. When I built Simpy (see http://simpy.com ) a few years ago I chose the same approach, and I never regretted it. The hardware behind Simpy is very modest, usage is high, and I have never had problems with too many indices open (but you do have to keep track of them and close them when they are no longer needed).
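
    To illustrate the "close them when they are no longer needed" part: one common pattern is an LRU cache of open readers that closes whatever falls out of it. A rough sketch with a plain LinkedHashMap (the cache size of 100 and the indexes/<userId> layout are arbitrary examples; a production version would also need reference counting, e.g. Lucene's SearcherManager, and would reopen readers when a user's index changes):

        import java.io.IOException;
        import java.nio.file.Paths;
        import java.util.LinkedHashMap;
        import java.util.Map;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.store.FSDirectory;

        public class SearcherCache {
            private static final int MAX_OPEN = 100; // stay well under the OS file-handle limit

            // Access-ordered map: the least recently used reader is evicted and closed.
            private final Map<String, DirectoryReader> readers =
                new LinkedHashMap<String, DirectoryReader>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, DirectoryReader> eldest) {
                        if (size() > MAX_OPEN) {
                            try { eldest.getValue().close(); } catch (IOException ignored) {}
                            return true;
                        }
                        return false;
                    }
                };

            public synchronized IndexSearcher searcherFor(String userId) throws IOException {
                DirectoryReader reader = readers.get(userId);
                if (reader == null) {
                    reader = DirectoryReader.open(FSDirectory.open(Paths.get("indexes", userId)));
                    readers.put(userId, reader);
                }
                return new IndexSearcher(reader);
            }
        }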

    Another approach is to put users in buckets, so that a single index holds data for multiple users. This is a very common approach, not just with Lucene but also in database scaling.
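
    In practice the bucket approach usually means stamping every document with a user id field and filtering on it at query time, so one shared index can serve many users. A minimal sketch (the "user" field name is made up; it assumes each document was indexed with StringField("user", userId)):

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanClause.Occur;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermQuery;

        public class BucketQueries {
            // Restrict any full-text query to a single user's mail inside a shared index.
            public static Query forUser(String userId, Query fullText) {
                return new BooleanQuery.Builder()
                    .add(fullText, Occur.MUST)                                   // scored match
                    .add(new TermQuery(new Term("user", userId)), Occur.FILTER)  // match only, no score
                    .build();
            }
        }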

    With the numbers you described, you have to think about distributed indexing, distributed search, lots of servers, horizontal scaling, index data/user partitioning, etc.
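
    For the partitioning part, the usual first step is to route each user to a fixed shard, for example by hashing the user id; then each shard (a set of per-user indexes or one bucketed index) can live on its own disk or machine. A trivial sketch, with the shard count as an arbitrary example:

        public class ShardRouter {
            private static final int NUM_SHARDS = 64; // sized to your capacity plan

            // Every index and search request for a user goes to the same shard.
            public static int shardFor(String userId) {
                return Math.floorMod(userId.hashCode(), NUM_SHARDS);
            }
        }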


    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


    ----- Original Message ----
    From: Giovanni Mascia <gmascia@gmail.com>
    To: java-user@lucene.apache.org
    Sent: Tuesday, August 26, 2008 7:56:25 PM
    Subject: system design for big numbers

