I thought I could load this data (100+ GB of tab-delimited files) and then
run a Perl streaming script to clean it up (I wanted to take advantage of
the parallelization of the Hadoop framework). However, since some of the data
contains "^M" and other special characters, the input records get broken into
multiple records in HDFS. I am trying to load this data as is (not as a sequence file).
For example, here is some of the data in the input file:
1232141:32432 test.com/template
^M\
^M\
Next > ^M\
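
For reference, here is a minimal sketch of the kind of Perl filter I mean
(simplified; the real script handles a few more cases). It reads STDIN and
writes STDOUT, stripping carriage returns and other control characters while
keeping the tab delimiters:

    #!/usr/bin/env perl
    # Simplified cleanup filter: strip ^M and other control characters,
    # keeping tabs (field delimiters) and newlines (record delimiters).
    use strict;
    use warnings;
    while (my $line = <STDIN>) {
        $line =~ s/\r//g;                       # drop carriage returns (^M)
        $line =~ tr/\x00-\x08\x0B-\x1F\x7F//d;  # drop remaining control chars
        print $line;
    }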
I have a Perl script that takes care of this issue when I run it against the
input file outside of HDFS, but unfortunately it does not work with Hadoop
Streaming; I think this is because the special characters get translated into
line breaks by the HDFS loading utility.
Is there any way I could still use Hadoop to clean up the data? Or should I
just clean it up first and then load it into HDFS?
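
If cleaning first is the way to go, I was thinking of something along these
lines, with the filter above saved as clean.pl and the HDFS path just a
placeholder:

    perl clean.pl < input.tsv | hadoop fs -put - /data/cleaned/input.tsv

i.e. pipe the cleaned stream straight into HDFS rather than staging a second
100+ GB copy on local disk.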
cheers,
Marcin