Hi all,

I'm building a Web service which, under the hood, accepts a continuous
stream of data from a user (~30 pushes per second from a JS client), then
for each of these 'pieces' of data:
- converts data into a format suitable for querying a large library of
descriptors (total size not known as yet, but could be into the millions)
- executes that query in parallel across the cluster (the part I'm
struggling with); each calculation against a descriptor will return some
numerical metric indicating how strong the match is to a results collector
- results collector aggregates the metrics returned and determines the best
match for the query across everything in the library
- this is passed on to some other bolts that handle returning something to
the user (don't worry about that here)
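The scatter/gather shape described above (score every descriptor, then keep the single best match) can be sketched in plain Java. Everything here is an illustrative stand-in, not Storm code; `parallelStream` merely plays the role of the parallel bolts plus results collector:

```java
import java.util.Comparator;
import java.util.List;

// Toy scatter/gather: all names and the metric are illustrative stand-ins.
public class BestMatch {
    // Stand-in similarity metric: descriptors closer to the query score higher.
    static double score(double query, double descriptor) {
        return -Math.abs(query - descriptor);
    }

    // Scatter the query across the whole library in parallel, then gather
    // the single best-scoring descriptor -- the bolts + collector shape.
    static double bestMatch(double query, List<Double> library) {
        return library.parallelStream()
                .max(Comparator.comparingDouble(d -> score(query, d)))
                .orElseThrow(() -> new IllegalArgumentException("empty library"));
    }

    public static void main(String[] args) {
        List<Double> library = List.of(1.0, 5.0, 9.0, 4.2);
        System.out.println(bestMatch(4.0, library)); // prints 4.2
    }
}
```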

The latency between the JS client pushing up a piece of data, and the
result being returned, needs to be well under a second as this is a (very)
real-time app. Hence processing each query against the library in a serial
manner would be doomed from the start as it would take far too long.

In my case, guaranteed processing of each piece of data is not as important
as returning a timely result - I believe Storm will allow us to drop old
bits of data that are taking too long to process, whereas a batch system
like Hadoop might keep trying to process everything even if this takes
minutes, then return data that is no longer wanted.

My problem is deciding how best to partition the library across Storm so
that queries against it run with as much parallelism, and as little
latency, as possible.

I've wondered if I could assign n items from the library to each bolt
(where n = total pieces of data in library / number of bolts instantiated,
i.e. the parallelism hint), but I don't think you can do this - Storm
handles their instantiation so I think they have to be identical.
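For what it's worth, one possible workaround (assuming Storm's `TopologyContext`, which a bolt receives in `prepare()` and which exposes `getThisTaskIndex()` and `getComponentTasks(...)`): identical bolt code can still load different slices of the library by deriving a shard from its own task index. The slice arithmetic alone, as runnable plain Java:

```java
// Contiguous sharding: task t of k total tasks owns library indices
// [start, end). Pure arithmetic -- no Storm classes needed to test it.
// In a real bolt, t would come from TopologyContext.getThisTaskIndex().
public class Shard {
    static int start(int task, int tasks, int n) {
        return (int) ((long) n * task / tasks);
    }

    static int end(int task, int tasks, int n) {
        return (int) ((long) n * (task + 1) / tasks);
    }

    public static void main(String[] args) {
        int n = 10, tasks = 4;
        for (int t = 0; t < tasks; t++) {
            // Shards: [0,2) [2,5) [5,7) [7,10) -- sizes differ by at most 1.
            System.out.println("task " + t + ": [" + start(t, tasks, n)
                    + ", " + end(t, tasks, n) + ")");
        }
    }
}
```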

Any ideas? (Additionally, is Storm the right tool for this job?)

Thanks in advance


  • Mr Chris K at Dec 26, 2012 at 6:37 pm
    Think I've partially solved this. I have switched to a Trident (DRPC)
    topology which, if my tenuous understanding of the TridentReach example in
    storm-starter is anything to go by...

    1. queries the database in the stateQuery() call
    2. returns all suitable matches for the query
    3. processes batches of those results in parallel by spinning up
    multiple instances of whatever function comes after the query

    So Storm handles the parallelism instead of us - we just tell it what we
    want done.
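    Those three stages can be mimicked in plain Java to show the data flow. This is only a stand-in, not Trident code: the store, key, and per-match work are toy examples, and `parallelStream` substitutes for the parallelism Trident would provide via multiple function instances.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java stand-in for the three Trident stages described above.
public class DrpcSketch {
    // Stages 1-2: stand-in for stateQuery() -- return all candidate matches.
    static List<String> query(String key, List<String> store) {
        return store.stream()
                .filter(s -> s.startsWith(key))
                .collect(Collectors.toList());
    }

    // Stage 3: the per-match function Trident would fan out across many
    // instances; parallelStream merely stands in for that parallelism.
    static List<String> process(List<String> matches) {
        return matches.parallelStream()
                .map(String::toUpperCase)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> store = List.of("cat", "car", "dog");
        System.out.println(process(query("ca", store))); // prints [CAR, CAT]
    }
}
```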
    On Saturday, December 15, 2012 1:26:42 PM UTC, mr.ch...@gmail.com wrote:

    [quoted original message snipped]

Discussion Overview
group: storm-user
posted: Dec 15, 2012 at 8:54 pm
active: Dec 26, 2012 at 6:37 pm
posts: 2
users: 1 (Mr Chris K: 2 posts)
website: storm-project.net
irc: #storm-user