Hi all,
I'm building a web service which, under the hood, accepts a continuous
stream of data from a user (~30 pushes per second from a JS client), then,
for each of these 'pieces' of data:
- converts the data into a format suitable for querying a large library of
descriptors (total size not known yet, but it could run into the millions)
- executes that query in parallel across the cluster (the part I'm
struggling with); each calculation against a descriptor emits a numerical
metric, indicating how strong the match is, to a results collector
- the results collector aggregates the returned metrics and determines the
best match for the query across everything in the library
- the best match is then passed on to some other bolts that handle
returning something to the user (don't worry about that part here)
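To make the scatter/gather concrete, here's a rough sketch of what I mean, in plain Java with no Storm classes: each "shard" scores the query against its slice of the library, and the collector keeps the best-scoring descriptor overall. The `score` function and the numeric library are placeholders for whatever the real descriptor matching does.

```java
import java.util.Comparator;
import java.util.stream.IntStream;

public class ScatterGatherSketch {
    // Placeholder scoring function: higher = stronger match.
    // In the real system this would be the per-descriptor calculation.
    static double score(double query, double descriptor) {
        return -Math.abs(query - descriptor);
    }

    // Each shard scores the query against its slice of the library in
    // parallel; the collector then picks the global best match.
    static int bestMatch(double[] library, double query, int shards) {
        return IntStream.range(0, shards).parallel()   // one task per shard
            .mapToObj(s -> {
                int lo = s * library.length / shards;
                int hi = (s + 1) * library.length / shards;
                int best = lo;
                for (int i = lo; i < hi; i++)
                    if (score(query, library[i]) > score(query, library[best]))
                        best = i;
                return best;                           // shard-local winner
            })
            .max(Comparator.comparingDouble(i -> score(query, library[i])))
            .orElseThrow();
    }

    public static void main(String[] args) {
        double[] library = {0.1, 0.4, 0.9, 0.35, 0.75};
        // 0.35 (index 3) is closest to the query 0.37
        System.out.println(bestMatch(library, 0.37, 2)); // prints 3
    }
}
```

In Storm terms, I picture the shards as instances of one query bolt and the collector as a single downstream aggregator bolt.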
The latency between the JS client pushing up a piece of data and the
result being returned needs to be well under a second, as this is a (very)
real-time app. Processing each query against the library serially would
therefore be doomed from the start: it would take far too long.
In my case, guaranteed processing of each piece of data is not as important
as returning a timely result. I believe Storm will allow us to drop old
pieces of data that are taking too long to process, whereas a batch system
like Hadoop might keep trying to process everything, even if that takes
minutes, and then return results that are no longer wanted.
My problem is deciding how best to split up the library in Storm so that
queries against it can be processed with as much parallelism, and as little
latency, as possible.
I've wondered if I could assign n items from the library to each bolt
instance (where n = total items in the library / number of bolt instances,
i.e. the parallelism hint), but I don't think you can do this: Storm
handles bolt instantiation, so as far as I can tell all instances of a
bolt have to be identical.
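For what it's worth, this is the slice arithmetic I imagine each (identical) bolt instance running at startup, if an instance could somehow discover its own index and the total instance count (I'm assuming something like Storm's TopologyContext.getThisTaskIndex() would provide the index; I may be wrong about that, which is really my question). The remainder items are spread over the first few instances so shard sizes differ by at most one.

```java
public class LibraryShard {
    // Half-open [start, end) slice of the library owned by one bolt
    // instance. Items left over from the integer division are given to
    // the first (librarySize % numInstances) instances.
    static int[] slice(int librarySize, int numInstances, int instanceIndex) {
        int base = librarySize / numInstances;
        int extra = librarySize % numInstances;
        int start = instanceIndex * base + Math.min(instanceIndex, extra);
        int end = start + base + (instanceIndex < extra ? 1 : 0);
        return new int[]{start, end};
    }

    public static void main(String[] args) {
        // 10 library items over 3 instances -> slice sizes 4, 3, 3
        for (int i = 0; i < 3; i++) {
            int[] s = slice(10, 3, i);
            System.out.println("instance " + i + " loads [" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

Each instance would then load only its own slice of the descriptor library, rather than the whole thing.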
Any ideas? (Additionally, is Storm the right tool for this job?)
Thanks in advance