FAQ
Hi,
Does every block of a file in HDFS have to be in the same file format when
writing MapReduce applications? More specifically, when dealing with CSV
files, can we have a header in the file? I have seen Mahout applications
using the UCI repository file format, which is similar to CSV without a
header. Is that because all map tasks must run symmetrically, so that
having a header would make one map task different from the others?

Regards,

Xiaobo Gu


  • Xiaobo Gu at Jul 6, 2011 at 10:24 am
    Hi Sean,

    Thanks for your reply. So we must write specific code to handle the
    CSV header if we have it in the file, right?

    Xiaobo Gu


    On Wed, Jul 6, 2011 at 6:11 PM, Sean Owen wrote:
    A block is a piece of a file. It does not (necessarily) have a meaning, or a
    "file format", by itself. You would not address HDFS blocks individually
    from this level. So I suppose the first answer is, no, they do not have
    different formats, though the question is not well-formed.

    You can have whatever you like in whatever HDFS file you want. Your
    application (be it Mahout, or any MapReduce application) just needs to be
    prepared to read it. If your input is a CSV file with a header line, one
    mapper will read that first chunk with the header line. You don't know which
    mapper that will be. Only one will read it, so no you would not construct a
    MapReduce app that depends on all mappers seeing some header line, because
    they don't.

    Yes, so, you would not observe any Mahout job doing this, because it doesn't
    work.
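
To make the point above concrete, here is a minimal sketch in plain Python. It is a simulation, not Hadoop code: the function names (`read_records`, `make_splits`, `map_split`) and the sample data are hypothetical. It assumes the record reader hands each line to the map task together with its byte offset in the file, as Hadoop's TextInputFormat does with its key, so the header can be recognized as the line at offset 0 no matter which task receives it.

```python
import csv

# Hypothetical sample file: one header line followed by four data rows.
CSV_TEXT = "id,label,value\n1,a,0.5\n2,b,1.5\n3,a,2.5\n4,b,3.5\n"


def read_records(data: bytes):
    """Yield (byte_offset, line) pairs, like a line-oriented record reader."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.decode("utf-8").rstrip("\r\n")
        offset += len(raw)


def make_splits(records, n_splits):
    """Deal consecutive records into chunks (a stand-in for input splits)."""
    records = list(records)
    size = -(-len(records) // n_splits)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_split(split):
    """Hypothetical map task: parse every line except the one at offset 0."""
    rows = []
    for offset, line in split:
        if offset == 0:
            continue  # header line of the file; only one split contains it
        rows.append(next(csv.reader([line])))
    return rows


splits = make_splits(read_records(CSV_TEXT.encode("utf-8")), n_splits=3)
results = [map_split(s) for s in splits]
```

Every task runs the same code, and the offset check is a no-op in all splits but one, so no task needs to know in advance whether it holds the header. With several input files, each file's header sits at offset 0 of that file, so the same check would apply per file under the same assumption.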
