Having seen a few emails on this list, I think the following email
exchange between John and me may be of interest to a broader audience.



Runping





________________________________

From: Runping Qi
Sent: Sunday, April 13, 2008 8:58 AM
To: 'JJ'
Subject: RE: streaming + binary input/output data?







That is basically what I envisioned originally.



One issue is the data format of streaming mapper output and the format
of streaming reducer output.

Those data are parsed by the streaming framework into key/value pairs.
The framework assumes that the key and the value are separated by a tab
char, and that the key/value pairs are separated by a newline "\n".

That means the keys and values cannot contain those two chars. If the
mapper and the reducer can encode those chars, then it will be fine.

Encoding the values with base64 will do it. Things related to keys are a
bit tricky, since the framework will need to apply a compare function to
them in order to do the sorting (and partitioning).

However, in most cases, it will be acceptable to avoid binary data for
keys.
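
A minimal sketch of that framing in Python, assuming text keys and binary
values. The helper names encode_record and decode_record are hypothetical
and are not part of Hadoop streaming; the point is only that standard
base64 output never contains a tab or a newline.

```python
import base64


def encode_record(key, value):
    """Frame one record as key<TAB>base64(value)<NEWLINE>."""
    if "\t" in key or "\n" in key:
        raise ValueError("key must not contain tab or newline")
    # Standard base64 uses only A-Z, a-z, 0-9, '+', '/' and '=',
    # so the encoded value cannot collide with the two separators.
    return key + "\t" + base64.b64encode(value).decode("ascii") + "\n"


def decode_record(line):
    """Split a framed line back into (key, raw value bytes)."""
    key, _, encoded = line.rstrip("\n").partition("\t")
    return key, base64.b64decode(encoded)
```

The key is left as plain text here, so sorting and partitioning on it are
not affected by the encoding.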



Another issue is to read binary input data and write binary data to dfs.

This issue can be addressed by implementing custom InputFormat and
OutputFormat classes (only the users know how to parse a specific binary
data format).

For each input key/value pair, the streaming framework basically writes
the following to the stdin of the streaming mapper:

key.toString() + "\t" + value.toString() + "\n"



As long as you implement the toString methods to ensure proper base64
encoding for the value (and the key if necessary), then you will be
fine.
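
On the mapper side, a rough sketch of reading such lines could look like
the following. It assumes each value is a base64-encoded array of packed
little-endian 32-bit floats, and the doubling step is only a placeholder
for real work; map_line and run are hypothetical names.

```python
import base64
import struct


def map_line(line):
    """Decode one 'key<TAB>base64(value)' line; return an output line or None."""
    key, sep, encoded = line.rstrip("\n").partition("\t")
    if not sep:
        return None  # no tab found: not a key/value record, skip it
    payload = base64.b64decode(encoded)
    count = len(payload) // 4
    # Assumption: the value is a packed array of little-endian 32-bit floats.
    floats = struct.unpack("<%df" % count, payload[:count * 4])
    # Placeholder per-record work: double every float.
    result = struct.pack("<%df" % count, *(2.0 * x for x in floats))
    return key + "\t" + base64.b64encode(result).decode("ascii") + "\n"


def run(stdin, stdout):
    """Read framed records from stdin and write framed results to stdout."""
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out)
```

To use this as a streaming mapper script, one would call
run(sys.stdin, sys.stdout) at the end of the file.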



So, in summary, all these issues can be addressed by the user's code.

Initially, I was wondering whether the framework can be extended somehow
so that the user may only need to set some configuration variables to
handle binary data.

However, it is still unclear what the extension should be for a broad
class of applications.

Maybe it is the best approach for each user to do something like what I
outlined above to address his/her specific problem.



Hope this helps.



Runping







________________________________

From: [email protected] On Behalf Of JJ
Sent: Sunday, April 13, 2008 8:18 AM
To: Runping Qi
Subject: Re: streaming + binary input/output data?



thx for the info,
what do you think about the idea of encoding the binary data with base64
to text before streaming it with hadoop?

John

2008/4/13, Runping Qi <[email protected]>:


No implementation/solution yet.
If there are more real use cases/user interests, then somebody may have
enough interest to provide a patch.

Runping

-----Original Message-----
From: [email protected]
Sent: Sunday, April 13, 2008 7:30 AM
To: Runping Qi
Subject: RE: streaming + binary input/output data?

i just read the jira. these are interesting suggestions, but how do they
translate into a solution for my problem/question? has all or at least
some of this been implemented or not?

thx
John

Runping Qi wrote:

Actually, there is an old jira about the same issue:
https://issues.apache.org/jira/browse/HADOOP-1722

Runping

-----Original Message-----
From: John Menzer
Sent: Saturday, April 12, 2008 2:45 PM
To: [email protected]
Subject: RE: streaming + binary input/output data?


so you mean you changed the hadoop streaming source code?
actually i am not really willing to change the source code if it's not
necessary.

so i thought about simply encoding the input binary data to txt (e.g. with
base64) and then adding a '\n' after each line to make it splittable for
streaming.
after reading from stdin my C program would just have to decode it,
map/reduce it and then encode it back to base64 and write to stdout.

what do you think about that? worth a try?
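
The pre-encoding step described here could be sketched as follows. This
assumes the binary input consists of fixed-size records; the function name
and the record_size parameter are hypothetical and would depend on the
actual data layout.

```python
import base64


def binary_to_base64_lines(src, dst, record_size):
    """Read fixed-size binary records from src and write one base64 line each to dst.

    src and dst are binary file objects. A final short record, if any,
    is encoded as its own line.
    """
    while True:
        chunk = src.read(record_size)
        if not chunk:
            break
        # b64encode never emits a newline, so '\n' safely ends the record.
        dst.write(base64.b64encode(chunk) + b"\n")
```

Base64 makes the data roughly one third larger, so this trades input size
for compatibility with the line-oriented streaming format.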



Joydeep Sen Sarma wrote:
actually - this is possible - but changes to streaming are required.
at one point - we had gotten rid of the '\n' and '\t' separators between
the keys and the values in the streaming code and streamed byte arrays
directly to scripts (and then decoded them in the script). it worked
perfectly fine. (in fact we were streaming thrift generated byte streams -
encoded in java land and decoded in python land :-))

the binary data on hdfs is best stored as sequencefiles (if u store binary
data in (what looks to hadoop as) a text file - then bad things will
happen). if stored this way - hadoop doesn't care about newlines and tabs
- those are purely artifacts of streaming.

also - the streaming code (for unknown reasons) doesn't allow a
SequenceFileInputFormat. there were minor tweaks we had to make to the
streaming driver to allow this stuff ..


-----Original Message-----
From: Ted Dunning
Sent: Mon 4/7/2008 7:43 AM
To: [email protected]
Subject: Re: streaming + binary input/output data?


I don't think that binary input works with streaming because of the
assumption of one record per line.

If you want to script map-reduce programs, would you be open to a Groovy
implementation that avoids these problems?

On 4/7/08 6:42 AM, "John Menzer" wrote:


hi,

i would like to use binary input and output data in combination with hadoop
streaming.

the reason why i want to use binary data is, that parsing text to float
seems to consume a big lot of time compared to directly reading the binary
floats.

i am using a C-coded mapper (getting streaming data from stdin and writing
to stdout) and no reducer.

so my question is: how do i implement binary input output in this context?
as far as i understand i need to put an '\n' char at the end of each
binary-'line'. so hadoop knows how to split/distribute the input data among
the nodes and how to collect it for output(??)

is this approach reasonable?

thanks,
john

--
View this message in context:
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Quoted from:
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16658687.html

Posted to common-user (hadoop) on Apr 14, '08 at 11:00p.