Hi all,

I have been working with MapReduce and HDFS for some time. The procedure I
normally follow is:

1) copy the input file from the Local File System to HDFS

2) run the map reduce module

3) copy the output file from HDFS back to the Local File System

But I feel steps 1 and 3 add a lot of overhead to the entire process!

My queries are:

1) I receive the files on the Local File System over a socket connection from
another node. Can I arrange for the data arriving at the Hadoop node to be
written directly to HDFS, instead of landing on the Local File System first
and then performing a copyFromLocal?

2) Can I copy the reduce output (which forms the final output file) directly
to the Local File System instead of writing it into HDFS (and thereby
spreading it across the HDFS nodes), so that I can minimize the overhead? I
expect this to take much less time than writing to HDFS and then performing a
copyToLocal. Finally, I should be able to send this file on to another node
using socket communication.

Looking forward to your suggestions!

Thanks,

Matthew John


  • Sebastian Schoenherr at Nov 15, 2010 at 9:40 am
    Hi Matthew,
    of course, you can copy directly to HDFS and vice versa. Use
    IOUtils (org.apache.hadoop.io.IOUtils) like this:

    FileSystem fileSystem = FileSystem.get(conf); // org.apache.hadoop.fs.FileSystem

    // "in" and "out" are the streams; here "out" is the HDFS output stream,
    // e.g. obtained from fileSystem.create(path)
    IOUtils.copyBytes(in, out, fileSystem.getConf());

    hope this helps,
    sebastian
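
    To make that concrete, here is a minimal, self-contained sketch of both
    directions: a socket streamed straight into HDFS (query 1) and HDFS
    streamed straight out to a socket (query 2). The class name, port, host,
    and paths are hypothetical placeholders, not anything from Sebastian's
    mail.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SocketHdfsBridge {

        // Query 1: accept a connection from the sending node and write the
        // incoming bytes directly into HDFS; no local file, no copyFromLocal.
        public static void socketToHdfs(int port, String hdfsPath) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            ServerSocket server = new ServerSocket(port);
            Socket client = server.accept();
            InputStream in = client.getInputStream();
            FSDataOutputStream out = fs.create(new Path(hdfsPath));
            try {
                IOUtils.copyBytes(in, out, conf, false); // false: we close the streams ourselves
            } finally {
                out.close();
                in.close();
                client.close();
                server.close();
            }
        }

        // Query 2 ("vice versa"): open the reduce output directly from HDFS
        // and push it over a socket; no copyToLocal in between.
        public static void hdfsToSocket(String hdfsPath, String host, int port) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Socket socket = new Socket(host, port);
            FSDataInputStream in = fs.open(new Path(hdfsPath));
            OutputStream out = socket.getOutputStream();
            try {
                IOUtils.copyBytes(in, out, conf, false);
            } finally {
                in.close();
                out.close();
                socket.close();
            }
        }
    }

    Note that a MapReduce job writes one part file per reducer (part-00000,
    part-00001, ...), so for the HDFS-to-socket direction you would either
    open each part file in turn or run the job with a single reducer.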

  • Zooni79 at Nov 17, 2010 at 1:51 pm
    Hi,

    As an extension to the problem statement: is it possible to fuse steps 1
    and 2 into one step? That is, can the map tasks pick up their input from
    an external filesystem instead of HDFS? Could FTPFileSystem or
    RawLocalFileSystem be of any help here?

    ./zahoor
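
    For what it's worth, a minimal sketch of zahoor's idea, assuming the
    input is reachable through one of Hadoop's FileSystem implementations;
    the ftp:// URI, credentials, and paths below are hypothetical
    placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ExternalInputJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "external-input");
            job.setJarByClass(ExternalInputJob.class);
            // A fully qualified URI selects the FileSystem implementation:
            // ftp:// resolves to FTPFileSystem, file:// to the local file
            // system. The map tasks then read their splits from that
            // filesystem instead of HDFS.
            FileInputFormat.addInputPath(job,
                    new Path("ftp://user:password@ftphost/data/input"));
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));
            job.waitForCompletion(true);
        }
    }

    One caveat: reading input this way gives up data locality (every split is
    pulled over the network), and file:// only makes sense if the same path
    is visible on every node, so this is mainly a fit for small inputs or
    single-node setups.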
