exchange between me and john may be of interest to a broader audience.
Runping
________________________________
From: Runping Qi
Sent: Sunday, April 13, 2008 8:58 AM
To: 'JJ'
Subject: RE: streaming + binary input/output data?
That is basically what I envisioned originally.
One issue is the data format of streaming mapper output and the format
of streaming reducer output.
Those data are parsed by the streaming framework into key/value pairs.
The framework assumes that the key and values are separated by tab
char, and the key/value pairs are separated by newline "\n".
That means the keys and values cannot contain those two chars. If the
mapper and the reducer can encode those chars, then it will be fine.
Encoding the values with base64 will do it. Things related to keys are a
bit tricky, since the framework will need to apply a compare function on
them in order to do the sorting (and partitioning).
However, in most cases, it will be acceptable to avoid binary data for
keys.
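To make the encoding idea above concrete, here is a minimal Python sketch. It is only an illustration under assumptions: the value is assumed to be an array of little-endian 32-bit floats, and the helper names encode_record and decode_record are hypothetical, not part of Hadoop streaming.

```python
import base64
import struct

def encode_record(key, floats):
    """Pack floats as little-endian 32-bit values and emit one text line."""
    raw = struct.pack("<%df" % len(floats), *floats)
    # Standard base64 output never contains tab or newline characters.
    value = base64.b64encode(raw).decode("ascii")
    return key + "\t" + value + "\n"

def decode_record(line):
    """Inverse of encode_record: split on the first tab, then decode."""
    key, value = line.rstrip("\n").split("\t", 1)
    raw = base64.b64decode(value)
    floats = struct.unpack("<%df" % (len(raw) // 4), raw)
    return key, list(floats)

# Round trip of one record (the key is kept as plain text).
line = encode_record("row1", [1.5, -2.0, 0.25])
key, floats = decode_record(line)
```

The key stays plain text here, which matches the suggestion to avoid binary data for keys so that sorting and partitioning keep working.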
Another issue is to read binary input data and write binary data to dfs.
This issue can be addressed by implementing custom InputFormat and
OutputFormat classes (only the users know how to parse a specific binary
data format).
For each input key/value pair, the streaming framework basically writes
the following to the stdin of the streaming mapper:
key.toString() + "\t" + value.toString() + "\n"
As long as you implement the toString methods to ensure proper base64
encoding for the value (and the key if necessary), then you will be
fine.
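On the script side, a mapper that consumes such lines could look roughly like the Python sketch below. It assumes each stdin line is a text key, a tab, and a base64-encoded value holding little-endian 32-bit floats. The names map_value and run_mapper are hypothetical, and the doubling step is just placeholder work.

```python
import base64
import io
import struct

def map_value(raw):
    """Hypothetical per-record work: double every 32-bit float in the payload."""
    n = len(raw) // 4
    floats = struct.unpack("<%df" % n, raw[:n * 4])
    return struct.pack("<%df" % n, *(2.0 * x for x in floats))

def run_mapper(lines, out):
    """Read key<TAB>base64(value) lines, process, and write the same format."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("\t")
        raw = base64.b64decode(value)
        result = base64.b64encode(map_value(raw)).decode("ascii")
        out.write(key + "\t" + result + "\n")

# Small in-memory demo; a real streaming mapper would call
# run_mapper(sys.stdin, sys.stdout) instead.
payload = base64.b64encode(struct.pack("<2f", 1.0, 2.5)).decode("ascii")
demo_in = io.StringIO("k1\t" + payload + "\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
```

The output uses the same key, tab, base64 value, newline layout, so a reducer or a custom OutputFormat can decode it the same way.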
So, in summary, all these issues can be addressed by the user's code.
Initially, I was wondering whether the framework can be extended somehow
so that the user may only need to set some configuration variables to
handle binary data.
However, it seems that it is still unclear what the extension should be
for a broad class of applications.
Maybe it is the best approach for each user to do something like what I
outlined above to address his/her specific problem.
Hope this helps.
Runping
________________________________
From: [email protected] On Behalf
Of JJ
Sent: Sunday, April 13, 2008 8:18 AM
To: Runping Qi
Subject: Re: streaming + binary input/output data?
thx for the info,
what do you think about the idea of encoding the binary data with base64
to text before streaming it with hadoop?
John
2008/4/13, Runping Qi <[email protected]>:
No implementation/solution yet.
If there are more real use cases/user interests, then somebody may have
enough interest to provide a patch.
Runping
-----Original Message-----
From: [email protected]
Sent: Sunday, April 13, 2008 7:30 AM
To: Runping Qi
Subject: RE: streaming + binary input/output data?
i just read the jira. these are interesting suggestions, but how do they
translate into a solution for my problem/question? has all or at least
some of this been implemented or not?
thx
John
Runping Qi wrote:
Actually, there is an old jira about the same issue:
https://issues.apache.org/jira/browse/HADOOP-1722
Runping
-----Original Message-----
From: John Menzer
Sent: Saturday, April 12, 2008 2:45 PM
To: [email protected]
Subject: RE: streaming + binary input/output data?
so you mean you changed the hadoop streaming source code?
actually i am not really willing to change the source code if it's
necessary.
so i thought about simply encoding the input binary data to txt
with
base64) and then adding a '\n' after each line to make it
for
streaming.
after reading from stdin my C program would just have to decode it,
map/reduce it and then encode it back to base64 to write to stdout.
what do you think about that? worth a try?
Joydeep Sen Sarma wrote:
actually - this is possible - but changes to streaming are
at one point - we had gotten rid of the '\n' and '\t' separators
the keys and the values in the streaming code and streamed byte
directly to scripts (and then decoded them in the script). it
perfectly fine. (in fact we were streaming thrift generated byte
-
encoded in java land and decoded in python land :-))
the binary data on hdfs is best stored as sequencefiles (if u
binary
data in (what looks to hadoop as) a text file - then bad things
happen). if stored this way - hadoop doesn't care about newlines
tabs
- those are purely artifacts of streaming.
also - the streaming code (for unknown reasons) doesn't allow a
SequencefileInputFormat. there were minor tweaks we had to make
the
streaming driver to allow this stuff ..
-----Original Message-----
From: Ted Dunning
Sent: Mon 4/7/2008 7:43 AM
To: [email protected]
Subject: Re: streaming + binary input/output data?
I don't think that binary input works with streaming because of
assumption of one record per line.
If you want to script map-reduce programs, would you be open to a
implementation that avoids these problems?
On 4/7/08 6:42 AM, "John Menzer" wrote:
hi,
i would like to use binary input and output data in combination
hadoop
streaming.
the reason why i want to use binary data is, that parsing text
float
seems to consume a big lot of time compared to directly reading
binary
floats.
i am using a C-coded mapper (getting streaming data from stdin
writing
to stdout) and no reducer.
so my question is: how do i implement binary input output in
context?
as far as i understand i need to put an '\n' char at the end of
binary-'line'. so hadoop knows how to split/distribute the input
among
the nodes and how to collect it for output(??)
is this approach reasonable?
thanks,
john
View this message in context:
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16656661.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
http://www.nabble.com/streaming-%2B-binary-input-output-data--tp16537427p16658687.html