Twitter hosted this month's Pig contributor meeting.
Developers from Yahoo, Twitter, LinkedIn, RichRelevance, and Cloudera were
in attendance.

1. Howl presentation
First, Alan Gates demoed Howl, a project whose goal is to provide a table
management service for all of Hadoop. The vision is that ultimately you will
be able to write data using regular MapReduce, Pig, or Hive, and read it
using any of those three, with full support of a partition-aware metadata
store that will tell you what data is available, what its schema is, etc.,
reusing a single table abstraction.
Currently, tables are created using (a restricted subset of) Hive DDL
statements; a Howl CLI will be created for this, and will enforce the
restrictions. Writing to the table using Pig or MapReduce is supported.
Reading can already be done using all three.
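For illustration, a partitioned table definition in Hive-style DDL might look like the following (the table and column names are made up, and exactly which statements fall inside the restricted subset is not specified here):

```sql
-- Illustrative Hive-style DDL for a partitioned table; Howl accepts
-- a restricted subset of statements along these lines.
CREATE TABLE web_logs (
  user_id BIGINT,
  url     STRING,
  action  STRING
)
PARTITIONED BY (dt STRING)
STORED AS TEXTFILE;
```

The `PARTITIONED BY` clause is what the partition-aware metadata store keys on when telling readers what data is available.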
At the moment, a single Pig store statement can only store into a single
partition; adding ability to "spray" across partitions is on the roadmap.
This, and a good API for interacting with the metastore, are the two areas
that were identified as good opportunities for the wider developer community
to get involved with the project. The source code is on GitHub, and is at
the moment synchronized with the development trunk manually; Yahoo folks
will look into changing this.
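A single-partition store might look like the sketch below; the loader/storer names and the partition-spec syntax are assumptions for illustration, not the final Howl API:

```pig
-- Hypothetical: store one day's data into a single partition of a Howl table.
-- HowlLoader/HowlStorer and the 'dt=20101015' spec are illustrative names.
raw    = LOAD 'web_logs' USING HowlLoader();
clicks = FILTER raw BY action == 'click';
STORE clicks INTO 'click_summary' USING HowlStorer('dt=20101015');
```

The planned "spray" feature would remove the need to name one partition in the STORE, letting a single statement write records into whichever partition they belong to.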
Security is a concern, and Yahoo will be working on it. Making it possible
for Hive to write to the tables is currently not as high a priority as the
other items listed; it would basically involve writing a Hive SerDe (the
equivalent of Pig's StoreFunc).
2. Azkaban presentation
Russell Jurney and Richard Park from LinkedIn presented the workflow
management tool open-sourced by LinkedIn, called Azkaban. It allows you to
declare job dependencies, has a web interface for launching and monitoring
jobs, etc. It has a special exec mode for Pig that lets you set some
Pig-specific options on a per-job basis. It does not currently have
triggering or job-instance parameter substitution (it does have job-level
parameter substitution). When asked what Pig could do to make life
easier for Azkaban, the two things Richard identified were registering jars
through the grunt command line and a way to monitor the running job -- both
of these are already in trunk, so we're in pretty good shape for 0.8.
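As a rough sketch, an Azkaban job is declared in a key=value properties file; the file below is illustrative, and the exact property names (especially the Pig-specific ones) should be checked against the Azkaban documentation:

```properties
# daily-report.job -- illustrative Azkaban job file (key=value properties).
# 'dependencies' declares upstream jobs; 'type' selects the execution mode.
type=pig
pig.script=reports/daily.pig
dependencies=load-logs,clean-logs
# job-level parameter substitution: referenced as ${output.dir} in the script
output.dir=/data/reports/daily
```

Dependency declarations like `dependencies=load-logs,clean-logs` are what the web interface uses to order and monitor job runs.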
3. Piggybank discussion
Kevin Weil led a discussion of the piggybank. There are a few problems with
it -- it's released on the Pig schedule, and has quite a few barriers to
submission that are, anecdotally at least, preventing people from
contributing. Several options were discussed, with the group finally
settling on starting a community-curated GitHub project for piggybank. It
will have a number of committers from different companies, and will aim to
make it easy for folks to contribute (all contribs will still have to have
tests, and be Apache 2.0-licensed). More details will be forthcoming as we
figure them out. Initially this project will be seeded with the current
Piggybank functions some time after 0.8 is branched. The initial list of
committers includes Kevin Weil (Twitter), Dmitriy Ryaboy (Twitter), Carl
Steinbach (Cloudera), and Russell Jurney (LinkedIn). Yahoo will also
nominate someone.
Please send us any thoughts you might have on this subject. It was suggested
that a lot of common code might be shared with Hive UDFs, which have the
same problems as Piggybank does, and that perhaps the project can be another
collaboration point between the projects. It's not clear how that would
work; Carl will talk to other Hive people.
4. Pig 0.9 roadmap
So far the items on the list for 0.9 are: a better type propagation /
resolution story and documentation, perhaps a different parser (ANTLR?), some
performance tweaks, and map types with fixed-type values. Much still to be
decided.
The next contributor meeting will be hosted by LinkedIn in October.