I am using Cascalog with Amazon EMR. All of my data is (mostly) in plain-text,
tab-delimited files across S3. I've had a lot of success with Cascalog, and
now that I'm more comfortable with the framework, I'm looking to squeeze more
efficiency out of everything. There are a lot of directions I can go, and
before I start running down a few paths, I was looking for any early advice
some of you may have...
I know I have a small-file problem and am looking to consolidate files. I
read up on the "Consolidator" but haven't really been able to find much more
about it. I see the GitHub project dfs-datastores and the Consolidator source
files in there, but I'm not sure if I'm looking in the right place or if this
utility is available elsewhere. I figure this is the quickest bang for the
buck I can get. I have existing processes that consolidate data by month.
I'd like to consolidate further and more optimally (i.e., to some factor of
the Hadoop block size). Not all my files are created equal, either, so some
data could be consolidated over larger time ranges than others. The
Consolidator utility sounds perfect, and it seems like it would save me some
time. Any pointers to more details?
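In the meantime, the only workaround I know of is a trivial Cascalog
pass-through query that reads a directory of small files and rewrites them
through a single sink. A minimal sketch of what I mean, assuming plain
textline taps and placeholder paths:

    (use 'cascalog.api)

    ;; Buffer that re-emits every line unchanged. Hanging the query off
    ;; one global buffer forces a reduce phase, so the output is written
    ;; by a single reducer (one big file) rather than one part file per
    ;; small input split. Fine per month-sized chunk, not for the whole
    ;; dataset at once.
    (defbufferop emit-lines [tuples]
      (map identity tuples))

    (defn consolidate [in-path out-path]
      (let [src (hfs-textline in-path)]
        (?- (hfs-textline out-path)
            (<- [?out]
                (src ?line)
                (emit-lines ?line :> ?out)))))

    ;; e.g. (consolidate "s3://my-bucket/events/2011-10"
    ;;                   "s3://my-bucket/events/2011-10-consolidated")

But Consolidator sounds like it handles the sizing decisions (block-size
factors, etc.) that a dumb pass like this doesn't.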
A cross-language format is ideal. I'd rather not go the sequence-file route
unless it is truly the best way forward. I'd prefer a format that plays nice
with S3/EMR, is suitable for use with the streaming APIs, and is approachable
by other non-map/reduce processes. That seems to rule out sequence files, so
I'm likely looking at something line-oriented (if streaming is a must), or
some other block-oriented format if I ditch streaming API access but still
want to retain accessibility from non-map/reduce tasks (without a
translation/data-export utility). Is my reasoning here sound?
My format currently follows a column-oriented design, with each column of
data going into its own file (along with some record/row identifier). This
provides a nice way to deal with wide, sparse rows that change shape over
time. However, it also contributes to my small-file problem. Currently, I'm
leaning away from this and towards something like Thrift or Protocol Buffers,
to return to wider records while still retaining that flexibility for sparse,
evolving rows.
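To make the current layout concrete: getting a record back together today
means joining the per-column files on the record id, roughly like this
(parse-col, the paths, and the one "id<TAB>value" pair per line are all
placeholder assumptions):

    (use 'cascalog.api)
    (require '[clojure.string :as str])

    ;; Placeholder parser: assumes each column file holds one
    ;; "id<TAB>value" pair per line.
    (defmapop parse-col [line]
      (str/split line #"\t"))

    ;; Rejoin two column files on the shared record id.
    (defn join-columns [col-a col-b out-path]
      (let [a (hfs-textline col-a)
            b (hfs-textline col-b)]
        (?- (hfs-textline out-path)
            (<- [?id ?va ?vb]
                (a ?la) (parse-col ?la :> ?id ?va)
                (b ?lb) (parse-col ?lb :> ?id ?vb)))))

Every extra column is another file and another join, which is exactly the
overhead (and file count) I'd like to trade away.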
Given the above two things, I imagine I'm looking for either a file format
like a sequence file with Thrift/protobuf data entries (possibly LZO
compressed), or Base64-encoded binary records in a line-oriented file. I
stumbled across the elephant-bird project, which seems pretty close to what I
need. I also stumbled across Nathan's cascading-thrift support, which also
seems close. Which direction would you guys suggest? What are the tradeoffs?
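For clarity, by "Base64-encoded binary records" I mean one encoded record per
line, so the file stays line-oriented and splittable for streaming. A sketch
of the idea, assuming a generated Thrift struct and commons-codec:

    (import '[org.apache.thrift TSerializer TDeserializer]
            '[org.apache.commons.codec.binary Base64])

    ;; Encode one Thrift record (any generated TBase struct) as a
    ;; single Base64 line.
    (defn record->line [record]
      (Base64/encodeBase64String (.serialize (TSerializer.) record)))

    ;; Decode a line back into a freshly constructed struct instance.
    (defn line->record [empty-record line]
      (.deserialize (TDeserializer.) empty-record (Base64/decodeBase64 line))
      empty-record)

The obvious cost is the ~33% Base64 size overhead versus a real binary
container like a sequence file.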
I've been following your threads on Kryo, and while that seems cool, it
doesn't seem to fit my needs (from an input/output perspective, that is; as
an intermediate representation it sounds like awesomeness, and I assume it's
transparent ;-). Correct me if I'm wrong, but Kryo is like Thrift/Protocol
Buffers without the schema, IDL code generation, versioning, and
cross-language support. So it's an efficient serialization format, but not a
serialization format designed for interoperability. Again, please correct me
if I'm wrong.
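For reference, this is roughly my mental model of Kryo (API details may
differ by version): no IDL or generated classes, you just hand it Java
objects directly, e.g.:

    (import '[com.esotericsoftware.kryo Kryo]
            '[com.esotericsoftware.kryo.io Input Output]
            '[java.io ByteArrayOutputStream ByteArrayInputStream])

    ;; No schema, no IDL: serialize any Java object directly. Great
    ;; inside the JVM, but the bytes aren't a cross-language contract.
    (defn kryo-roundtrip [obj]
      (let [kryo (Kryo.)
            baos (ByteArrayOutputStream.)
            out  (Output. baos)]
        (.writeClassAndObject kryo out obj)
        (.close out)
        (with-open [in (Input. (ByteArrayInputStream. (.toByteArray baos)))]
          (.readClassAndObject kryo in))))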
Last but not least, what is the recommended way to work with RDBMS data?
Ideally I'd like to pick a storage format that is suitable to export the DB
into. In other words, my data warehouse/data vault collects all data from
all operational systems and logs, and it is analyzed by map/reduce and other
processes (intermediate formats/views/data marts are created along the way as
well). Do others do something similar? Or should I be looking to utilize
the JDBC taps more regularly? I'm just looking for some best
practices/suggestions for working with both log/event data and
operational/transactional DB data (either current or archived).
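For what it's worth, my current RDBMS hand-off is the moral equivalent of a
plain JDBC dump to tab-delimited text before anything touches Hadoop; a rough
sketch with placeholder URL/table names:

    (require '[clojure.string :as str])
    (import '[java.sql DriverManager])

    ;; Dump a table to tab-delimited text. The JDBC URL, credentials,
    ;; and table name are placeholders.
    (defn table->tsv [jdbc-url table ^java.io.Writer w]
      (with-open [conn (DriverManager/getConnection jdbc-url)
                  stmt (.createStatement conn)
                  rs   (.executeQuery stmt (str "SELECT * FROM " table))]
        (let [ncols (.getColumnCount (.getMetaData rs))]
          (while (.next rs)
            (.write w (str/join "\t" (map #(.getString rs (int %))
                                          (range 1 (inc ncols)))))
            (.write w "\n")))))

It works, but it's hand-rolled per table, which is part of why I'm wondering
whether the JDBC taps are the better habit.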
Thanks in advance for any advice/guidance.