I have looked at the various file types and input/output formats, but got quite confused, and I am not sure how to pipe the output of one format into the next.
Here is what I would like to do:
1. Pass in a string to my hadoop program, and it will write this
single key-value pair to a file on the fly.
2. The first job will read from this file, do some processing, and
write more key-value pairs to other files (the same format as the file
in step 1). Subsequent jobs will read from those files generated by
the first job. This will continue in an iterative manner until some
terminal condition has been reached.
3. Both the key and value in the file should be text (i.e. human-readable).
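To make the steps above concrete, here is roughly the driver loop I have in mind. This is only a sketch: the identity mapper/reducer and the fixed 10-iteration cutoff are stand-ins for my real processing and terminal condition, and I am assuming a Hadoop release where KeyValueTextInputFormat is available in the new org.apache.hadoop.mapreduce API.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class IterativeDriver {

    // Identity map step; the real per-pair processing would go here.
    public static class IterMapper extends Mapper<Text, Text, Text, Text> {
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value);
        }
    }

    // Identity reduce step.
    public static class IterReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            for (Text v : values) ctx.write(key, v);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path("iter-0");          // the seed file from step 1
        boolean done = false;
        for (int i = 0; !done; i++) {
            Job job = new Job(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(IterMapper.class);
            job.setReducerClass(IterReducer.class);
            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class); // writes key TAB value lines
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            Path output = new Path("iter-" + (i + 1));
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            job.waitForCompletion(true);
            done = (i >= 9);                      // stand-in for the real terminal condition
            input = output;                       // this round's output feeds the next round
        }
    }
}
```

The point of pairing TextOutputFormat with KeyValueTextInputFormat, if I understand them correctly, is that one writes tab-separated key/value lines and the other reads them back, so each job's output should be directly consumable by the next.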
While this sounds simple, I have been having trouble figuring out the
correct formats to use, and here is why:
JobConf.setInputKeyClass and setInputValueClass are both deprecated,
so I am avoiding them.
SequenceFileOutputFormat doesn't work because the key has to be
IntWritable; a Text key causes the code to blow up (which I still
don't quite understand, because when I use a SequenceFile.Writer,
it can take Text for both keys and values).
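For reference, this is roughly the SequenceFile.Writer usage that happily takes Text on both sides for me (the file name is made up), which is why the output-format failure puzzles me:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Declare Text for both the key and value classes; no complaints here.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("seed.seq"), Text.class, Text.class);
        writer.append(new Text("start"), new Text("42"));
        writer.close();
    }
}
```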
KeyValueTextInputFormat looks promising, but I am not sure how to
bootstrap the first file mentioned in step 1, i.e. what format and
writer I should use to create the file that holds the initial argument...
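In case it helps clarify what I mean by bootstrapping: since KeyValueTextInputFormat splits each line on the first tab by default, I assume the seed file from step 1 could be written with any plain writer as a single "key TAB value" line, then copied into HDFS. The file name and the pair below are made up:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Bootstrap {
    // Write a single key-value pair as one tab-separated line, which is
    // the line format KeyValueTextInputFormat splits on by default.
    static Path writeSeed(Path file, String key, String value) throws IOException {
        String line = key + "\t" + value + "\n";
        return Files.write(file, line.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path seed = writeSeed(Paths.get("seed.txt"), "start", "42");
        System.out.println(Files.readAllLines(seed).get(0));
    }
}
```

Is a plain text file like this really all the "format" that is needed here, or does the first job expect something more structured?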
I have a feeling that this is actually a very simple problem, and that
I am just not looking in the right direction. Your help would be
greatly appreciated.