Serialization framework should use SequenceFile/TFile/Other metadata to instantiate deserializer

Key: HADOOP-4243
URL: https://issues.apache.org/jira/browse/HADOOP-4243
Project: Hadoop Core
Issue Type: Improvement
Components: contrib/serialization
Reporter: Pete Wyckoff

SequenceFile metadata is useful for storing additional information about the serialized data. For RecordIO, for example, it can record whether the data is CSV or Binary. The same applies to Thrift: Binary, JSON, ...

For Hive, this may be especially important because it has a dynamic, generic serializer/deserializer that takes its DDL at runtime. RecordIO and Thrift, by contrast, require pre-compilation into a specific class whose name can be stored as the sequence file key or value class. In Hive's case, the class name is like Record.java in RecordIO: it tells you nothing without the DDL.
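To illustrate why the class name alone is not enough, here is a minimal sketch of a generic row parser whose behavior depends entirely on a schema string supplied at runtime. The class DynamicRowSketch, the toy DDL format (comma-separated column names), and the tab-separated row layout are all assumptions made for illustration; this is not Hive's actual serializer/deserializer or DDL syntax.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generic record parser: the same class handles any schema,
// so its class name says nothing about the layout of the data.
class DynamicRowSketch {

    // Assumed toy DDL: comma-separated column names, e.g. "id,name".
    // Assumed row layout: tab-separated values, one per column.
    static Map<String, String> parse(String ddl, String row) {
        String[] columns = ddl.split(",");
        String[] values = row.split("\t", -1);
        if (columns.length != values.length) {
            throw new IllegalArgumentException(
                "row has " + values.length + " fields, DDL declares " + columns.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            record.put(columns[i].trim(), values[i]);
        }
        return record;
    }
}
```

The same bytes decode to different records under different DDL strings, which is why the DDL would have to travel with the file, for instance in its metadata.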

One way to address this would be to pass the sequence file metadata to the getDeserializer call in the Serialization interface. The API would then be something like getDeserializer(Class<?>, Map<Text, Text> metadata), or take a Properties metadata argument instead.

But, I am open to proposals.
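As a rough sketch of what the proposed call might look like, the following uses simplified stand-ins rather than the real Hadoop types. The interfaces SketchDeserializer and MetadataAwareSerialization, the class RecordSerializationSketch, and the metadata key "record.format" are hypothetical names; String replaces Text so the example has no Hadoop dependency, and the "binary" branch is only a placeholder.

```java
import java.util.Map;

// Simplified stand-in for a deserializer; the real Hadoop one reads from a stream.
interface SketchDeserializer<T> {
    T deserialize(String raw);
}

// Hypothetical variant of the Serialization interface with the extra metadata argument.
interface MetadataAwareSerialization<T> {
    SketchDeserializer<T> getDeserializer(Class<T> c, Map<String, String> metadata);
}

// Hypothetical record serialization that chooses a format from the file metadata.
class RecordSerializationSketch implements MetadataAwareSerialization<String[]> {
    static final String FORMAT_KEY = "record.format"; // assumed key name

    @Override
    public SketchDeserializer<String[]> getDeserializer(
            Class<String[]> c, Map<String, String> metadata) {
        // Fall back to "binary" when the file carries no format entry.
        String format = metadata.getOrDefault(FORMAT_KEY, "binary");
        if ("csv".equals(format)) {
            return raw -> raw.split(",", -1);
        }
        // Placeholder for binary decoding: return the payload as a single field.
        return raw -> new String[] { raw };
    }
}
```

With this shape, the caller reads the metadata from the file header once and hands it to getDeserializer, so the format decision happens when the deserializer is created rather than per record.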

This also means that knowing a class implements Writable is not necessarily enough to deserialize it, since the deserializer may behave differently depending on the metadata. For example, RecordIO might use the metadata to decide whether to use CSV rather than the default Binary deserialization.

There is also the related issue of having getSerializer return the metadata to be written to the SequenceFile or TFile.
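The write side could be sketched along these lines, assuming the serializer itself exposes the metadata that the file writer should store. The interface MetadataProvidingSerializer, its getMetadata method, the class CsvRecordSerializerSketch, and the key "record.format" are hypothetical and not part of the existing Hadoop API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical serializer that also reports the metadata describing its output.
interface MetadataProvidingSerializer<T> {
    String serialize(T value);

    // Entries the file writer would copy into the file's metadata section.
    Map<String, String> getMetadata();
}

// Hypothetical CSV serializer that advertises its format through metadata.
class CsvRecordSerializerSketch implements MetadataProvidingSerializer<String[]> {

    @Override
    public String serialize(String[] fields) {
        // No quoting or escaping: fields are assumed not to contain commas.
        return String.join(",", fields);
    }

    @Override
    public Map<String, String> getMetadata() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("record.format", "csv"); // assumed key and value
        return Collections.unmodifiableMap(metadata);
    }
}
```

A writer would call getMetadata once when creating the file, so that a reader can later pass the same entries back when it requests a deserializer.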

Posted to common-dev on Sep 22, '08 at 6:31p
