Hey all,

I am using Cascalog with Amazon EMR. All my data is (mostly) in plain-text
(tab-delimited) files across S3. I've had a lot of success with Cascalog,
and now that I'm more comfortable with the framework, I'm looking to
squeeze more efficiency out of everything. There are a lot of directions I
could go, so before I start running down a few paths, I'm looking for any
early advice some of you may have...


I know I have a small-files problem and am looking to consolidate files. I
read up on the "Consolidator" but haven't really been able to find much
more about it. I see the GitHub project dfs-datastores and the Consolidator
source files in there; I'm not sure if I'm looking in the right place or if
this utility is available elsewhere. I figure this is the quickest bang for
the buck I can get. I have existing processes that consolidate data by
month. I'd like to consolidate further and more optimally (i.e., to some
multiple of the Hadoop block size). Not all my files are created equal
either, so some data could be consolidated over larger time ranges than
others. The Consolidator utility sounds perfect, and it seems like it would
save me some time. Any pointers to more details?
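
As a crude stopgap while I dig into the Consolidator, I'm picturing something
like the Cascalog pass below: an identity query that re-sinks a month of small
files, leaning on Cascalog's default distinct behavior to force a reduce so the
output lands in roughly num-reducers part files. The paths, the reducer count,
and the assumption of a Cascalog recent enough to have with-job-conf are all
illustrative; this is a sketch, not the dfs-datastores Consolidator.

(use 'cascalog.api)

;; Not the dfs-datastores Consolidator -- just an identity query that reads a
;; month of small text files and writes them back out.  Cascalog queries are
;; distinct by default, which forces a reduce step, so the output ends up in
;; roughly num-reducers part files.  The distinct also merges byte-identical
;; lines, which is fine here because every line carries a record id.
(defn consolidate! [in-path out-path num-reducers]
  (with-job-conf {"mapred.reduce.tasks" (str num-reducers)}
    (?- (hfs-textline out-path)
        (<- [?line]
            ((hfs-textline in-path) ?line)))))

;; e.g. (consolidate! "s3n://bucket/events/2011-11/" "s3n://bucket/consolidated/2011-11" 4)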

A cross-language format is ideal. I'd rather not go the sequence-file route
unless it is truly the best way forward. I'd prefer a format that plays
nicely with S3/EMR, is suitable for use with the streaming APIs, and is
approachable by other non-map/reduce processes. That seems to rule out
SequenceFiles, and I'm likely looking at something line-oriented (if
streaming is a must), or some other block-oriented format if I ditch
streaming API access but still want to retain accessibility from
non-map/reduce tasks (without a translation/data-export utility). Is my
understanding correct?

My current format follows a column-oriented design, with each column of
data in its own file (along with a record/row identifier). This provides a
nice way to deal with wide, sparse rows that change shape over time.
However, it also contributes to my small-files problem. Currently, I'm
leaning away from this and towards something like Thrift or Protocol
Buffers, to return to wider records while still retaining backwards
compatibility over time.

Given the above two things, I imagine I'm looking for either a file format
like a sequence file with Thrift/protobuf data entries (possibly LZO
compressed), or Base64-encoded binary records for use in a line-oriented
file. I stumbled across the Elephant-Bird project, which seems pretty close
to what I need. I also stumbled across Nathan's cascading-thrift support,
which also seems close. Which direction would you suggest, and what are the
tradeoffs?
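
For the Base64-per-line option, I'm picturing something along these lines,
where MyRecord stands in for a hypothetical Thrift-generated class and the
helpers lean on libthrift's TSerializer/TDeserializer plus commons-codec
(1.4+) for Base64 -- a sketch, not working code for any particular schema:

(import '[org.apache.thrift TSerializer TDeserializer]
        '[org.apache.commons.codec.binary Base64])

;; MyRecord below is a placeholder for any Thrift-generated record class.
(defn record->line
  "Serialize a Thrift record with the default binary protocol and Base64 it,
   so it can live as one field of a tab-delimited, line-oriented file."
  [record]
  (Base64/encodeBase64String (.serialize (TSerializer.) record)))

(defn line->record
  "Decode a Base64 string back into the supplied (empty) Thrift record."
  [line empty-record]
  (.deserialize (TDeserializer.) empty-record (Base64/decodeBase64 line))
  empty-record)

;; e.g. (record->line (doto (MyRecord.) (.setId 42)))
;;      (line->record encoded-line (MyRecord.))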

I've been following your threads on Kryo, and while that seems cool, it
doesn't seem to fit my needs from an input/output perspective (as an
intermediate representation it sounds awesome, and I assume transparent ;-).
Correct me if I'm wrong, but Kryo is like Thrift/Protocol Buffers without
the schema, IDL code generation, versioning, and cross-language support. So
it's an efficient serialization format, but not one designed for
interoperability. Again, please correct me if I'm wrong.


Last but not least, what is the recommended way of working with RDBMS data?
Ideally I'd like to pick a storage format that is suitable for exporting
the database into. In other words, my data warehouse/data vault collects
all data from all operational systems and logs, and is analyzed by
map/reduce and other processes (intermediate formats/views/data marts are
created along the way as well). Do others do something similar? Or should
I be looking to use the JDBC taps more regularly? I'm just looking for some
best practices/suggestions for working with both log/event data and
operational/transactional DB data (either current or archived).
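
To make that concrete, the lowest-tech version of what I'm describing is a
plain-JDBC dump of each operational table into the same tab-delimited layout
the log data uses, which then gets pushed up to S3. The connection URL,
credentials, table name, and output path below are all placeholders:

(require '[clojure.string :as s])
(import '[java.sql DriverManager]
        '[java.io BufferedWriter FileWriter])

;; Dumps one table as tab-delimited lines, ready to upload to S3 alongside
;; the log/event data.  Assumes the JDBC driver is on the classpath.
;; NULLs become empty strings; embedded tabs/newlines are not escaped.
(defn export-table! [jdbc-url user pass table out-file]
  (with-open [conn (DriverManager/getConnection jdbc-url user pass)
              stmt (.createStatement conn)
              rs   (.executeQuery stmt (str "SELECT * FROM " table))
              out  (BufferedWriter. (FileWriter. out-file))]
    (let [cols (.getColumnCount (.getMetaData rs))]
      (while (.next rs)
        (.write out (s/join "\t" (for [i (range 1 (inc cols))]
                                   (str (.getObject rs (int i))))))
        (.newLine out)))))

;; e.g. (export-table! "jdbc:mysql://db-host/warehouse" "user" "pass"
;;                     "orders" "/tmp/orders.tsv")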


Thanks in advance for any advice/guidance.

Cheers,
... Mike


  • Kovas boguta at Dec 13, 2011 at 2:16 am

    On Mon, Dec 12, 2011 at 4:07 PM, Mike Stanley wrote:

    A cross-language format is ideal.  I'd rather not go the sequence file route
    unless it is truly the best way forward.   I'd prefer a format that played
    nice with S3/EMR, suitable for use with streaming APIs, and approachable by
    other non-map/reduce processes.  Seems like that rules out Sequence Files,
    http://avro.apache.org/

    This is the "official" successor to sequence files, designed with
    cross-language compatibility in mind. Many major projects already
    support it or are moving to doing so.
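
    (To make the Avro suggestion concrete: a rough Clojure interop sketch that
    writes a splittable Avro container file; the "Event" schema and its fields
    are invented. The schema is embedded in the file header, so Pig/Hive/Python
    readers don't need any compiled classes.)

    (import '[org.apache.avro Schema]
            '[org.apache.avro.generic GenericData$Record GenericDatumWriter]
            '[org.apache.avro.file DataFileWriter]
            '[java.io File])

    ;; Toy schema -- real field names/types would come from your data.
    (def event-schema
      (Schema/parse
       "{\"type\":\"record\",\"name\":\"Event\",
         \"fields\":[{\"name\":\"id\",\"type\":\"long\"},
                     {\"name\":\"payload\",\"type\":\"string\"}]}"))

    ;; Write one Avro container file from a seq of [id payload] pairs.
    (defn write-events! [path events]
      (with-open [w (doto (DataFileWriter. (GenericDatumWriter. event-schema))
                      (.create event-schema (File. path)))]
        (doseq [[id payload] events]
          (.append w (doto (GenericData$Record. event-schema)
                       (.put "id" (long id))
                       (.put "payload" payload))))))

    ;; e.g. (write-events! "events.avro" [[1 "click"] [2 "view"]])
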
  • Andrew Xue at Dec 13, 2011 at 6:06 am
    Hey Mike -- I have dealt with this using tab-delimited text files that
    have a few "columns" I know are always populated and necessary; the
    sparser ones go into a JSON map, which sits in a column of its own.

    I played with Avro very briefly, but the main thing that sort of
    annoyed me about it was that w/ Cascading, I don't think you are able
    to read the schema that comes with the data, which is one of Avro's
    primary benefits.

    I also work with s3/EMR.
  • Mike Stanley at Dec 13, 2011 at 5:15 pm
    Hi guys,

    Thanks for the replies. I have to admit, I'd glanced over Avro in the past
    as just another serialization format in the same vein as Thrift, Google
    Protobuf, MessagePack, etc. I didn't get it (it felt like a "not invented
    here" kind of thing)... But now, after picking up some more battle wounds
    from Hadoop, I understand how it differentiates itself from the others and
    where it fits into this puzzle.

    Correct me if I'm wrong, though, but if I go the Avro container file
    approach, the streaming API is no longer an option (as it requires
    line-oriented files). I'm OK with this, given everything we do is
    gravitating towards Cascalog, and other m/r libraries (Pig, Hive, custom)
    are still an option for others if they want them.

    Andrew - you mention not having access to the "schema". Is this any worse
    off than the tab-delimited files? I imagine it just means you need to know
    the contents of the file ahead of time (which would be the same as with
    protobuf, etc.). I can see how it would be useful to have the schema, but
    without it, doesn't Avro still have the benefit of being a splittable
    container for objects (something you would have to build yourself with
    other serialization formats)?

    Thanks again for the advice. Still trying to wrap my head around it all.


    Cheers,
    ... Mike
  • Andrew Xue at Dec 13, 2011 at 7:40 pm
    Hey Mike -- yeah, it's no worse than tab-delimited files. I was looking
    into Avro mostly because I didn't want to have to package my Cascalog
    jar with a static schema file in order to read from my data sources.

    So this is what I have now: my jobs read a static file with schema info
    in JSON format, for example {"COLUMNS": ["id", "date_id", "key_val_map"],
    "TYPE": ["Integer", "Integer", "String"]}. The column names and types are
    then used to create TextDelimited taps.

    The "key_val_map" column houses the sparser data; a read-json somewhere in
    the cascalog query does the trick (see the rough sketch at the end of this
    message).

    Anyway, that's just been my experience; a lot of the decision was based
    on getting things off the ground with a minimum of hassle/learning
    curve/etc. -- I am pretty new to this and trying to wrap my head around
    it too. I would love for someone to critique this; I am sure there is a
    more optimal way. Good luck, and please share any insights.

    Andy
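
    (Roughly, the pattern Andy describes looks like the sketch below; the
    schema-file layout, paths, Cascading 1.x TextDelimited import, and
    cheshire for JSON parsing are illustrative assumptions, not his actual
    code.)

    (use 'cascalog.api)
    (require '[cheshire.core :as json])
    (import '[cascading.scheme TextDelimited]   ; Cascading 1.x package
            '[cascading.tuple Fields])

    ;; Build a tap whose field names come from the JSON schema file,
    ;; e.g. {"COLUMNS": ["id", "date_id", "key_val_map"], ...}
    (defn delimited-tap [schema-path data-path]
      (let [cols (get (json/parse-string (slurp schema-path)) "COLUMNS")]
        (hfs-tap (TextDelimited. (Fields. (into-array Comparable cols)) "\t")
                 data-path)))

    ;; Pull a single key out of the sparse JSON column.
    (defmapop extract-kv [json-str k]
      (get (json/parse-string json-str) k))

    ;; "campaign" is just an example key that might live in the JSON map.
    (defn ids-with-campaign [schema-path data-path]
      (<- [?id ?campaign]
          ((delimited-tap schema-path data-path) ?id _ ?kv)
          (extract-kv ?kv "campaign" :> ?campaign)))

    ;; e.g. (?- (stdout) (ids-with-campaign "schema.json" "s3n://bucket/data/"))
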
  • Sam Ritchie at Dec 13, 2011 at 7:22 pm
    Hey Mike, I'll send a more detailed follow-up soon, but you're absolutely
    right that Kryo is not the right choice for cross-language work. I'm
    working on an extension to dfs-datastores that will allow for long-term
    storage of Clojure data structures, but it sounds like you're looking for
    something a bit more general.

    Here at Twitter we use the dfs-datastores project with Cascading-Thrift.
    I'm just getting into the Elephant-Bird code and should have some LZO taps
    written up in the next few days.

    More later,
    Sam
    --
    Sam Ritchie, Twitter Inc
    703.662.1337
    @sritchie09

    (Too brief? Here's why! http://emailcharter.org)
  • Mike Stanley at Dec 13, 2011 at 7:28 pm
    Cool. Thanks, Sam.

    While you're coming up to speed with Elephant-Bird, an additional *new*
    question I have is about the differences between the various LZO
    Protobuf/Thrift formats and Avro's out-of-the-box containers.

    I look forward to hearing more about what you're doing (it feels like I'm
    nipping at your heels a bit ;-).

    I'm also interested in the long-term storage for Clojure data structures
    as well (it may not be right for what I'm asking about at the moment, but
    on a more general level it sounds awesome ;-).


    Cheers,
    ... Mike
