I have a problem where my input data contains various control characters. I
thought I could load this data (100+ GB of tab-delimited files) and then
run a Perl streaming script to clean it up (I wanted to take advantage of
the parallelization of the Hadoop framework). However, since some of the data
contains "^M" and other special characters, the input records get broken up
into multiple records in HDFS. I am trying to load this data as-is (not as a
sequence file).

For example, this is some of the data in the input file:

1232141:32432 test.com/template
Next > ^M\

I have a Perl script that takes care of this issue when I run it against the
input file locally (not in HDFS), but unfortunately it does not work in HDFS
streaming. I think this is because the special characters get treated as
line breaks when the data is loaded and read.
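If the records are breaking on "^M", that matches how Hadoop's TextInputFormat behaves: its LineRecordReader treats CR, LF, and CRLF as record terminators, so the split happens before any streaming mapper ever sees the record. A minimal Python sketch of that behavior (the sample record is adapted from the data above):

```python
# Hadoop's LineRecordReader ends a record at \r, \n, or \r\n, so an
# embedded ^M (\r) turns one logical record into two. Python's
# str.splitlines() also splits on a lone \r, which mirrors this.
record = "1232141:32432\ttest.com/template Next > \r\\"

pieces = record.splitlines()
print(pieces)       # one input line comes out as two "records"
print(len(pieces))  # 2
```

This is why a streaming mapper alone cannot rejoin the records: by the time the mapper runs, the framework has already split on the carriage return.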

Is there any way I could still use Hadoop to clean up the data? Or should I
just clean it up first and then load it into HDFS?
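One common approach is the second option: strip the stray control characters in a single streaming pass on the way into HDFS, so the cluster only ever sees clean records. A minimal sketch, shown here in Python for illustration (the same filter could be a Perl one-liner or `tr -d '\r'`; the filename and HDFS path are made up):

```python
import re
import sys

# Remove every ASCII control character except tab (\x09, the field
# delimiter) and newline (\x0a, the record delimiter); \r (^M) falls
# inside the stripped ranges.
CONTROL = re.compile(rb'[\x00-\x08\x0b-\x1f\x7f]')

def clean(line: bytes) -> bytes:
    return CONTROL.sub(b'', line)

def main() -> None:
    # Read in bytes mode: sys.stdin.buffer splits on \n only, so an
    # embedded \r survives long enough to be stripped (text mode would
    # translate \r to \n and re-break the records).
    for line in sys.stdin.buffer:
        sys.stdout.buffer.write(clean(line))

# usage sketch (paths are illustrative):
#   cat raw.tsv | python clean.py | hadoop fs -put - /data/clean.tsv
```

Working in bytes rather than text is the important design choice here: any layer that applies universal-newline translation would convert the `\r` before the filter could remove it.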


group: common-user
posted: Apr 14, '11 at 12:45a
1 user in discussion: Macmarcin (1 post)


