FAQ: Query against different data types within HDFS using Map/Reduce
Has anyone come across this scenario, and does anyone have any suggestions?

What if you store different types of data within HDFS? You store XML, text, binary, sequence files, etc., and you now want to run a query against ALL of the data stored in HDFS via a map/reduce job. How do you do this when the inputs are of different types?
For the simplest example, you want to find all terms/words matching a pattern, count them, and report where they occur within each data source. Plain word count would do as an example, except that not all of the data is line-oriented text: the terms/words could be inside XML, inside a sequence file, or in some other format stored in your HDFS. How do you find those terms/words across ALL data sets when they are not stored in the same format?

I understand that a Map/Reduce job specifies a specific input format up front; however, if you have different data formats within HDFS and you want to run the exact same query against all of them in one map/reduce job, how is this even possible?

Can you even run a single query in a single map/reduce job against all the data across HDFS that is in different formats?
If not, any suggestions on how to handle this?

Thanks.





  • Ted Dunning at May 5, 2008 at 4:02 pm
    You just have to write an adapted input format that reads multiple kinds of input.

    It can key off the contents of the file or the name. Depending on names is bad practice, but it has a long lineage, so people tend to deal with it reasonably well.

    It isn't very hard to write.
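
    For illustration, an input format along the lines Ted describes might look roughly like the sketch below with the old org.apache.hadoop.mapred API; the class name and the ".seq" extension check are assumptions for this example, not Ted's actual code:

    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.*;

    // Sketch only: picks a RecordReader per split based on the file name.
    public class MixedInputFormat extends FileInputFormat<Writable, Writable> {

      private final TextInputFormat textFormat = new TextInputFormat();
      private final SequenceFileInputFormat<Writable, Writable> seqFormat =
          new SequenceFileInputFormat<Writable, Writable>();

      @SuppressWarnings("unchecked")
      public RecordReader<Writable, Writable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        String name = ((FileSplit) split).getPath().getName();
        if (name.endsWith(".seq")) {
          // sequence files carry their own key/value classes
          return seqFormat.getRecordReader(split, job, reporter);
        }
        // everything else is read as line-oriented text; the mapper sees
        // (LongWritable, Text) pairs for these splits
        textFormat.configure(job);
        return (RecordReader) textFormat.getRecordReader(split, job, reporter);
      }
    }

    The job would then set conf.setInputFormat(MixedInputFormat.class) and declare its mapper over (Writable, Writable), which brings things back to the mapper working out what it was handed. For what it's worth, later Hadoop releases also added org.apache.hadoop.mapred.lib.MultipleInputs, which lets each input path be paired with its own InputFormat and mapper.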

  • Jason Venner at May 5, 2008 at 4:42 pm
    We do this all the time.
    In one case we have the mapper work out the input type by examining the
    input file name and the record data. We tend to do this for textual
    key<TAB>value records.
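
    As a rough sketch of that first approach (not Jason's actual code): the old mapred API exposes the current split's file path to each map task as the "map.input.file" job property, so the mapper can branch on the file name and fall back to sniffing the record itself. The ".xml" test and the tab-splitting below are assumptions about the data layout:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class TypeSniffingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);
      private boolean xmlInput;

      public void configure(JobConf job) {
        // the old mapred API sets "map.input.file" to the path of the split's file
        String inputFile = job.get("map.input.file", "");
        xmlInput = inputFile.endsWith(".xml");
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        if (xmlInput || line.trim().startsWith("<")) {
          // XML fragment: extracting the terms is format-specific and omitted here
          reporter.incrCounter("TypeSniffingMapper", "xml-records", 1);
        } else {
          // assume a textual key<TAB>value record
          int tab = line.indexOf('\t');
          String term = (tab >= 0) ? line.substring(0, tab) : line;
          output.collect(new Text(term), ONE);
        }
      }
    }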

    In another case we have a container object that can hold any Writable,
    which we pass around. We do this for data that includes binary payloads too
    large to be worth base64 encoding, or where we have to reduce multiple data
    types together and can't readily tell what each record's type is.
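
    For that second approach, Hadoop's org.apache.hadoop.io.GenericWritable is a container of exactly this kind: you subclass it and enumerate the concrete Writable types it is allowed to wrap. A minimal sketch, with placeholder types:

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    public class AnyRecord extends GenericWritable {

      // the concrete Writable types this container may wrap; the tag written to
      // the wire is the index into this array, so keep its order stable
      @SuppressWarnings("unchecked")
      private static final Class<? extends Writable>[] TYPES =
          (Class<? extends Writable>[]) new Class[] { Text.class, BytesWritable.class };

      protected Class<? extends Writable>[] getTypes() {
        return TYPES;
      }

      // usage: AnyRecord rec = new AnyRecord(); rec.set(new Text("..."));
      // ...and rec.get() on the receiving side, with instanceof to branch on type
    }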



  • Kayla Jay at May 5, 2008 at 4:50 pm
    Awesome. Thanks for the replies. Do you mind sharing your code or providing high-level details on the implementation?



Discussion Overview
group: common-user
category: hadoop
posted: May 5, '08 at 1:19p
active: May 5, '08 at 4:50p
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
