I think a common approach is to use Pig (and MR/Hadoop in general) as purely
the heavy lifter, doing all the merge-downs, aggregations and such of the
data. At Nokia we tend to output a lot of data from Pig/MR as TSV or CSV
(using PigStorage) and then use Sqoop to push that into a MySQL DB (or the
RDBMS of your choice). You (and any BI folks you have) can then do whatever
'traditional' reporting or graphing off of the MySQL DB. This actually works
great for us and we run regular reports and build dashboards this way. You
can of course push back into any data store as well; some are just better
suited to bulk loading than others (HBase, Cassandra, etc. fit quite well
when done right). I've not played with an HDFS-backed Voldemort yet, but
that would be fun too and LinkedIn use this setup.
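To make the TSV-then-RDBMS flow a bit more concrete, here is a rough Python sketch of the idea. It is not what we actually run: the standard-library sqlite3 module stands in for both Sqoop and MySQL so the example is self-contained, and the table name (page_hits), the column layout and the sample rows are all made up for illustration. The only thing taken from the setup above is that the Pig output is tab-delimited text, which is PigStorage's default.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, columns, lines):
    """Bulk-insert tab-separated rows into a table.

    columns is a list of (name, sql_type) pairs. The table name and
    column names are trusted here; this is a sketch, not production code.
    """
    ddl = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({ddl})")
    # PigStorage writes tab-delimited text by default, so the csv
    # module with delimiter="\t" can parse each output line.
    rows = csv.reader(lines, delimiter="\t")
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    conn.commit()


# Hypothetical sample standing in for a part file written by Pig.
sample = io.StringIO(
    "2011-03-22\t/index.html\t1042\n"
    "2011-03-22\t/about.html\t87\n"
    "2011-03-23\t/index.html\t1310\n"
)

# In-memory SQLite is an assumption standing in for the MySQL DB.
conn = sqlite3.connect(":memory:")
load_tsv(
    conn,
    "page_hits",
    [("day", "TEXT"), ("url", "TEXT"), ("hits", "INTEGER")],
    sample,
)

# The kind of 'traditional' reporting query a BI tool would issue.
report = conn.execute(
    "SELECT day, SUM(hits) FROM page_hits GROUP BY day ORDER BY day"
).fetchall()
print(report)
```

The point is only that once the aggregated output is small and tabular, ordinary SQL does the rest; the heavy lifting stays in Pig/MR.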
If you want to leave your Pig/MR output data on HDFS for jobs further down a
pipeline, LZO compressed output (check out Elephant Bird from Twitter) is
good for that, or even just regular SequenceFiles.
You can use Hive, of course, but what you will probably find is that there
are not (yet) a lot of off-the-shelf products and components that can
natively read from Hive. If it's purely a reporting/BI function you are
looking for, definitely look at Datameer (http://datameer.com) as they
integrate with raw files in HDFS or Hive tables.
As you can imagine, this is potentially a really, really long topic! Feel
free to email me directly if you want more details or ideas.
On 23 March 2011 19:12, Jonathan Holloway wrote:
I've got a general question surrounding the output of various Pig scripts
and generally where people are storing that data and in what kind of format?
I read Dmitriy's article on Apache log processing and noticed that the
output of the scripts was a format more suitable for reporting and graphing
upon - that of TSV files.
At present the results from my Pig scripts end up in HDFS in Pig bag/tuple
format and I just wondered whether that was the best practice for large
amounts of data in terms of organisation. Is anybody using Hive to store the
intermediate Pig data and reporting off that instead? Or, are people
generating graphs and analyses based off the raw Pig data in HDFS?