Pig user mailing list, March 2011
I've got a general question about the output of various Pig scripts: where
are people storing that data, and in what format?

I read Dmitriy's article on Apache log processing and noticed that the
output of the scripts was in a format more suitable for reporting and
graphing: TSV files.

At present the results from my Pig scripts end up in HDFS in Pig's bag/tuple
format, and I wondered whether that is best practice for organising large
amounts of data. Is anybody using Hive to store the intermediate Pig data
and reporting off that instead? Or are people generating graphs and analyses
directly from the raw Pig output in HDFS?

Many thanks,
Jon.


  • Alex McLintock at Mar 23, 2011 at 7:10 pm

    On 23 March 2011 18:12, Jonathan Holloway wrote:

    I've got a general question about the output of various Pig scripts:
    where are people storing that data, and in what format?
    ...
    At present the results from my Pig scripts end up in HDFS in Pig's
    bag/tuple format, and I wondered whether that is best practice for
    organising large amounts of data.

    I too would like to know this. My plan was to convert my HDFS data into a
    read-only Project Voldemort key/value database. I've been told it can be
    done but haven't investigated fully yet.

    I am not sure when I should use Hive or what the alternatives are.

    Alex
  • Josh Devins at Mar 23, 2011 at 7:59 pm
    Hey Jon,

    I think a common approach is to use Pig (and MR/Hadoop in general) as purely
    the heavy lifter, doing all the merge-downs, aggregations and such on the
    data. At Nokia we tend to output a lot of data from Pig/MR as TSV or CSV
    (using PigStorage) and then use Sqoop to push that into a MySQL DB (or the
    RDBMS of your choice). You (and any BI folks you have) can then do whatever
    'traditional' reporting or graphing you like off the MySQL DB. This actually
    works great for us, and we run regular reports and build dashboards this
    way. You can of course push back into any data store as well; some are just
    better suited to bulk loading than others (HBase, Cassandra, etc. fit quite
    well when done right). I've not played with an HDFS-backed Voldemort yet,
    but that would be fun too, and LinkedIn use this setup.
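
    For concreteness, a minimal Pig Latin sketch of that pattern (the relation
    names, fields and paths are invented for the example): aggregate in Pig,
    store as TSV with PigStorage, then export with Sqoop.

        -- Aggregate page views per day and write the result as tab-separated
        -- text that downstream tools (Sqoop, spreadsheets, etc.) can read.
        logs   = LOAD '/logs/access' USING PigStorage('\t')
                 AS (day:chararray, url:chararray, bytes:long);
        by_day = GROUP logs BY day;
        daily  = FOREACH by_day GENERATE group AS day,
                                         COUNT(logs) AS hits,
                                         SUM(logs.bytes) AS total_bytes;
        STORE daily INTO '/reports/daily_hits' USING PigStorage('\t');

    Something like sqoop export --export-dir /reports/daily_hits
    --input-fields-terminated-by '\t' (plus your --connect and --table
    details) then loads those files into the MySQL table.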

    If you want to leave your Pig/MR output data on HDFS for jobs further down a
    pipeline, LZO compressed output (check out Elephant Bird from Twitter) is
    good for that, or even just regular SequenceFiles.
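
    Not the LZO/Elephant Bird route, but as a built-in alternative for data
    that stays on HDFS: PigStorage will write bzip2-compressed output if the
    output path ends in .bz2, and a later job can load it straight back
    (paths and names again made up):

        -- Keep intermediate output on HDFS, compressed, for downstream jobs.
        -- The .bz2 extension makes PigStorage write bzip2 part files, which
        -- Pig can still split when reading them back.
        STORE daily INTO '/warehouse/intermediate/daily_hits.bz2'
              USING PigStorage('\t');

        -- A later script can load it directly:
        daily_in = LOAD '/warehouse/intermediate/daily_hits.bz2'
                   USING PigStorage('\t')
                   AS (day:chararray, hits:long, total_bytes:long);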

    You can use Hive, of course, but what you will probably find is that there
    are not (yet) a lot of off-the-shelf products and components that can
    natively read from Hive. If it's purely a reporting/BI function you are
    looking for, definitely look at Datameer (http://datameer.com), as they
    integrate with raw files in HDFS or Hive tables.

    As you can imagine, this is potentially a really, really long topic! Feel
    free to email me directly if you want more details or ideas.

    Cheers,

    Josh

    On 23 March 2011 19:12, Jonathan Holloway wrote:

    I've got a general question about the output of various Pig scripts:
    where are people storing that data, and in what format?
    ...
  • Dmitriy Ryaboy at Mar 23, 2011 at 11:50 pm
    What we do in production is a combination of two approaches:

    1) TSV-delimited files (well, \u0001-delimited actually, to avoid comma and
    tab escaping complexities; a small sketch of this follows below). Some but
    not all of these get bulk-loaded into our reporting database in a
    post-processing step. We don't insert directly into the DB from Pig, to
    avoid taking it down and to make it easy to reload data if needed without
    rerunning the Pig script.

    2) Protocol Buffers or Thrift files that serve as inputs to other jobs.

    It's mostly TSVs. We use Protocol Buffers or Thrift when the schema is well
    known and settled upon, there is a need to keep and read the data in
    Hadoop, and we are likely to need to revisit the data often, or when we are
    creating data sets to be ingested into other services. We like the binary
    formats for the space savings and easy schemas, but there's something to be
    said for easily human-readable files and not needing to predefine the
    schemas.
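
    As a tiny sketch of the delimiter point from (1), with invented names, and
    noting that depending on the Pig version you may need to pass the literal
    control character rather than the escape shown here:

        -- Write ctrl-A (\u0001) separated output so commas and tabs inside
        -- field values never need escaping before bulk-loading into the
        -- reporting database.
        STORE report INTO '/reports/ctrl_a_separated' USING PigStorage('\u0001');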

    D
    On Wed, Mar 23, 2011 at 11:12 AM, Jonathan Holloway wrote:

    I've got a general question about the output of various Pig scripts:
    where are people storing that data, and in what format?
    ...
