On Fri, Aug 9, 2013, Dmitriy Ryaboy wrote:

Briefly on parquet-mr, though that discussion should go to the parquet-dev
list, as Nong suggested:

Parquet's Java APIs are specifically built to let people implement logic
for reading Parquet files into their own object representation schemes. So
far we've used these APIs to support conversion to/from Thrift and to/from
Avro. Protocol Buffers are very similar and can be handled just as
generically -- meaning, you don't need to write a converter for your
specific proto message, but a general one for all protocol buffers; you can
then use it for any current or future protobuf. See the parquet-thrift and
parquet-avro packages for example implementations.
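The "general converter" idea can be illustrated in plain Java, independent of parquet-mr: discover a message's fields at runtime instead of hand-coding per-type logic. Everything below (`SampleMessage`, `toRecord`) is a hypothetical toy using reflection; a real converter would walk the protobuf Descriptor and feed values to Parquet's record writer instead of building a map.

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a generated protobuf message class (hypothetical).
class SampleMessage {
    String name = "ada";
    int id = 7;
}

public class GenericConverter {
    // One converter for *any* message type: walk the fields at runtime
    // instead of writing per-type serialization code. parquet-mr's write
    // path does the analogous walk over a schema rather than via reflection.
    static Map<String, Object> toRecord(Object message) throws IllegalAccessException {
        Map<String, Object> record = new LinkedHashMap<>();
        for (Field f : message.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            record.put(f.getName(), f.get(message));
        }
        return record;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toRecord(new SampleMessage()));
    }
}
```

Because the converter never names `SampleMessage`'s fields, the same code handles any message type you hand it, which is exactly why one generic protobuf converter can serve all current and future protos.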

On Fri, Aug 9, 2013 at 10:28 AM, Nong Li wrote:

When you run the insert in Impala, what is the format of 'raw_table'? Can
you provide any more information about the crash?

The questions about parquet-mr are better suited to the parquet-dev mailing
list. You'll need to build a component that reads the fields out of your
protobufs and feeds them to the parquet-mr writer. Maybe the DynamicMessage
API is what you are looking for?

https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/DynamicMessage
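For reference, the DynamicMessage route looks roughly like the sketch below. This is not compiled here; it assumes protobuf-java on the classpath and that you already have the message's Descriptor (for example, from a serialized FileDescriptorSet):

```java
// Sketch: parse raw bytes against a Descriptor, then walk the fields
// generically -- no generated message classes required.
Descriptors.Descriptor descriptor = ...; // however you obtain your schema
DynamicMessage msg = DynamicMessage.parseFrom(descriptor, rawBytes);
for (Map.Entry<Descriptors.FieldDescriptor, Object> e : msg.getAllFields().entrySet()) {
    String fieldName = e.getKey().getName();
    Object value = e.getValue();
    // hand (fieldName, value) to the parquet-mr writer here
}
```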


On Thu, Aug 8, 2013 at 9:56 PM, Gautam wrote:

Hello,
I'm trying to get around 100 TB of data from protobuf to
Parquet. One way is to run "INSERT INTO TABLE parquet_table SELECT *
FROM raw_table" in Impala, but this can take a while and cause crashes.
Instead I'm trying to implement an MR job to import the data into Parquet.

I've gone through the github.com/Parquet/parquet-mr repo. Unless I'm
missing something, it's not exactly straightforward to integrate with
Parquet, since it can't work with proto-compiled schemas and I need to
implement my own writer and my own Type class. Unlike protobuf, there's no
auto-generated reader/writer for Parquet schemas: one has to hand-code the
serialization logic against the writer interface. This can get cumbersome
with a verbose message schema, although in my case almost all fields are
primitive types (string, int), plus one repeated field and two user-defined
fields.


Is there any alternative to doing this? Or is there some quick way to
extend ParquetOutputFormat? The latter is more desirable, since we could
then also use the Hive SerDe to read the Parquet-imported data.

-Gautam.

Discussion Overview
group: impala-user
categories: hadoop
posted: Aug 9, '13 at 4:56a
active: Aug 9, '13 at 5:47p
posts: 3
users: 3
website: cloudera.com
irc: #hadoop

3 users in discussion: Gautam (1 post), Dmitriy Ryaboy (1 post), Nong Li (1 post)
