Has anyone come across this scenario? If not, does anyone have any suggestions?
Suppose you store different types of data in HDFS: XML, text, binary, sequence files, etc. You now want to run a query against ALL of the data stored in HDFS via a map/reduce job. How do you do this when the input data is in different formats?
For the simplest example, say you want to find all the terms/words matching a pattern, count them, and return where they occur within each data source. Even word count would serve as an example, except that not all of the data is line-by-line text. The terms/words could be contained in XML, in a sequence file, or in some other format stored in your HDFS. What if you want to find those terms/words across ALL data sets, which may not share the same format?
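To make the query concrete, here is a rough sketch in plain Java (no Hadoop classes) of what I would like to run over every record. It assumes each format can somehow be turned into text before matching, which is exactly the part I do not know how to do in one job. The class name TermQuery and the method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the query only; not tied to any Hadoop API.
final class TermQuery {

    // "Map" side: for one record's text, collect each matching term
    // together with the character offsets where it occurs.
    static Map<String, List<Integer>> findTerms(Pattern pattern, String text) {
        Map<String, List<Integer>> hits = new TreeMap<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(m.start());
        }
        return hits;
    }

    // "Reduce" side: total number of hits per term across all sources.
    static Map<String, Integer> totalCounts(
            Map<String, Map<String, List<Integer>>> hitsBySource) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, List<Integer>> hits : hitsBySource.values()) {
            for (Map.Entry<String, List<Integer>> e : hits.entrySet()) {
                totals.merge(e.getKey(), e.getValue().size(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("\\bhad\\w*");
        // Source names and contents are invented sample data.
        Map<String, Map<String, List<Integer>>> bySource = new TreeMap<>();
        bySource.put("notes.txt", findTerms(pattern, "hadoop runs on hadoop"));
        bySource.put("doc.xml", findTerms(pattern, "<doc>hadoop</doc>"));
        System.out.println(bySource);              // source -> term -> offsets
        System.out.println(totalCounts(bySource)); // term -> total count
    }
}
```

The matching logic is the same for every source; only the step that produces the text would differ per format.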
I understand that a map/reduce job specifies a single input format up front. However, if you have different data formats in HDFS and you want to run the exact same query against all of them within one map/reduce job, how is this even possible?
Can you even run a single query in a single map/reduce job against all the data across HDFS that is in different formats?
If not, any suggestions on how to handle this?
Thanks.