bulk data transfer to HDFS remotely (e.g. via wan)
I am considering a basic task of loading data into a Hadoop cluster in this scenario: the Hadoop cluster and the bulk data reside on different boxes, e.g. connected via LAN or WAN.

An example of this is moving data from Amazon S3 to EC2, which is supported in the latest Hadoop by specifying s3(n)://authority/path in distcp.

But generally speaking, what is the best way to load data into a Hadoop cluster from a remote box? Clearly, in this scenario, it is unreasonable to first copy the data to the local name node and then issue a command like "hadoop fs -copyFromLocal" to put it in the cluster (besides this, the choice of data transfer tool is also a factor: scp or sftp, GridFTP, ..., compression and encryption, ...).

I am not aware of generic support for fetching data from a remote box (like that for s3 or s3n), so I am thinking about the following solution (run on the remote boxes to push data to Hadoop):

cat datafile | ssh hadoopbox 'hadoop fs -put - dst'

There are pros (it is simple and does the job without storing a local copy of each data file and then running a command like 'hadoop fs -copyFromLocal') and cons (it obviously needs many such pipelines running in parallel to speed up the job, at the cost of creating processes on the remote machines to read data and maintain SSH connections; so if the data files are small, it is better to archive them into a tar file before calling 'cat'). As an alternative to 'cat', a program could be written that keeps reading data files and feeding them into the pipe in parallel.
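
(As a rough sketch of that idea, assuming passwordless SSH to the Hadoop box, 'hadoop' on its PATH, and hypothetical host names and paths, the parallel push could look like the following; file names with spaces are not handled.)

#!/bin/sh
# Sketch: push files from this remote box into HDFS without staging a local copy.
# Large files: run a few cat | ssh | 'hadoop fs -put' pipelines concurrently.
ls /data/bulk/*.dat | xargs -P 4 -I{} sh -c \
  'cat "{}" | ssh hadoopbox "hadoop fs -put - /user/michael/incoming/$(basename {})"'

# Many small files: bundle them into one tar stream so a single ssh connection
# carries them all, instead of paying per-file connection overhead.
tar -C /data/bulk/small -cf - . | ssh hadoopbox 'hadoop fs -put - /user/michael/incoming/small-files.tar'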

Any comments about this or thoughts about a better solution?

Thanks,
--
Michael


  • Jiang licht at Mar 2, 2010 at 8:01 am
    Another thought: to speed up partitioning and distribution of data across the Hadoop cluster, it seems favorable to me if the transfer task could be run as a map/reduce job. Does that make sense?

    Thanks,
    --
    Michael

  • Brian Bockelman at Mar 2, 2010 at 2:39 pm
    Hey Michael,

    distcp does a MapReduce job to transfer data between two clusters - but it might not be acceptable security-wise for your setup.
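
    (As a concrete sketch of that MapReduce-driven copy, with hypothetical host names, a cluster-to-cluster run looks roughly like the line below; hftp:// is the usual read-only source scheme when the two clusters run different Hadoop versions.)

    hadoop distcp hftp://source-namenode:50070/data/logs hdfs://dest-namenode:8020/data/logs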

    Locally, we use GridFTP between two clusters (not necessarily Hadoop!) and a protocol called SRM to load-balance between GridFTP servers. GridFTP was selected because it is common in our field, and we already have the certificate infrastructure well set up.

    GridFTP is fast too - many Gbps is not too hard.
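
    (For reference, a typical GridFTP transfer with parallel streams looks something like the following; the host name and paths are hypothetical, -p sets the number of parallel data streams, and -fast is the commonly recommended data-channel mode for GridFTP servers.)

    globus-url-copy -vb -fast -p 8 file:///data/bulk/datafile gsiftp://gridftp.example.org/data/datafile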

    YMMV

    Brian
  • Jiang licht at Mar 2, 2010 at 6:11 pm
    Hi Brian,

    Thanks a lot for sharing your experience. Here are some questions to bother you with for more help :)

    So, basically, that means data transfer in your case is a two-step job: first, use GridFTP to make a local copy of the data on the target; second, load the data into the target cluster with something like "hadoop fs -put". If this is correct, I am wondering whether this consumes too much disk space on your target box (since the data sits in a local file system before being distributed to the Hadoop cluster). Also, do you do an integrity check for each file transferred (one straightforward method might be a 'cksum'-like comparison, but is that doable in terms of efficiency)?
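
    (One way to sketch such a check without a second staging copy, using hypothetical paths, is to stream the stored copy back out of HDFS through the same checksum tool used on the source and compare the two sums:)

    # on the source box
    md5sum /data/bulk/datafile

    # on any box with an HDFS client: checksum the stored copy by streaming it out
    hadoop fs -cat /user/michael/incoming/datafile | md5sum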

    I am not familiar with GridFTP except that I know it is a better choice than scp, sftp, etc. in that it can tune TCP settings and create parallel transfers. So, I want to know: does it keep a log of which files have been successfully transferred and which have not, and does GridFTP do a file integrity check? Right now, I only have one box for data storage (not in the Hadoop cluster) and want to transfer that data to Hadoop. Can I just install GridFTP on this box and on the name node box to enable GridFTP transfers from the first to the second?

    Thanks,
    --

    Michael

  • Brian Bockelman at Mar 2, 2010 at 9:00 pm
    Hey Michael,

    We've developed a GridFTP server plugin that writes directly into Hadoop, so there's no intermediate data staging required. You can just use your favorite GridFTP client on the source machine and transfer it directly into Hadoop. Globus GridFTP can do checksums as it goes, but I haven't tried it - it might not work with our plugin. The GridFTP server does not need to co-exist with any Hadoop processes - it just needs a network connection to the WAN and a network connection to the LAN.

    The GridFTP server is automatically installed with our yum packaging, along with our organization's CA certs. If this is a one-off transfer - or you don't already have the CA certificate/grid infrastructure available in your organization - you might be better served by another solution.

    The setup works well for us because (a) the other 40 sites use GridFTP as a common protocol, (b) we have a long history with using GridFTP, and (c) we need to transfer many TB on a daily basis.

    Brian

  • Jiang licht at Mar 2, 2010 at 9:52 pm
    Thanks, Brian.

    There is no certificate/grid infrastructure for us as of now. But I guess I can still use GridFTP, having noticed the following on its FAQ page: "GridFTP can be run in a mode using standard SSH security credentials. It can also be run in anonymous mode and with username/password authentication."
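
    (If that SSH mode applies, such a transfer is typically driven with sshftp:// URLs from globus-url-copy, roughly as below; the host name and paths are hypothetical, and the server side must have the sshftp service enabled.)

    globus-url-copy file:///data/bulk/datafile sshftp://michael@gridftp.example.org/data/datafile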

    I am wondering how GridFTP can be used in a generic scenario: transferring bulk data from a box (not in the Hadoop cluster) to a remote Hadoop cluster at a regular interval (maybe hourly, or every couple of minutes). So, I guess I can install the GridFTP server on the Hadoop name node and the GridFTP client on the remote data box. But to bypass the intermediate step of keeping a local copy on the Hadoop name node, I need something like the plugin you mentioned. Is that correct?

    Since I don't have the plugin you have, I found a helpful article that might address the problem:

    http://osg-test2.unl.edu/documentation/hadoop/gridftp-hdfs

    It seems that it can write data directly to Hadoop (although I don't know exactly how). But I am not sure how to direct the GridFTP client to write data to Hadoop, with something like "globus-url-copy localurl hdfs://hadoopnamenode/pathinhdfs"? Otherwise, there might be some mapping on the GridFTP server side to relay data to Hadoop.

    I think this is interesting if it works. Basically, this is a "push" mode.

    Even better would be a "pull" mode: I still want something built into Hadoop (so that it runs in map/reduce) that acts like "hadoop distcp s3://123:456@nutch/ hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998" or "hadoop distcp -f filelistA hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998", where filelistA looks like

    s3://123:456@nutch/file1
    s3://123:456@nutch/fileN

    So, just like accessing local files, we might have something like "hadoop distcp file://remotehost/path hdfs://namenode/path" or "hadoop distcp -f filelistB hdfs://hostname/path", where filelistB looks like

    file://remotehost/path1/file1
    file://remotehost/path2/fileN

    (file:// works for the local file system, but in this case it would point to a remote file system, or be replaced with something like remote://). Some middleware would then sit on the remote host and the name node to exchange data (GridFTP, in this case?), provided they agree on protocols (ports, etc.).

    If security is an issue, the data can be GPG-encrypted before doing a "distcp".
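
    (A minimal sketch of that encryption step, with hypothetical paths; gpg prompts for a passphrase here, so unattended runs would need a passphrase file or agent, and the files would be decrypted after they land in HDFS:)

    # encrypt each file symmetrically before exposing it for transfer
    gpg --symmetric --cipher-algo AES256 -o /data/outgoing/datafile.gpg /data/bulk/datafile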

    Thanks,
    --

    Michael


  • Brian Bockelman at Mar 2, 2010 at 10:01 pm

    On Mar 2, 2010, at 3:51 PM, jiang licht wrote:

    > Thanks, Brian.
    >
    > There is no certificate/grid infrastructure for us as of now. But I guess I can still use GridFTP, having noticed the following on its FAQ page: "GridFTP can be run in a mode using standard SSH security credentials. It can also be run in anonymous mode and with username/password authentication."

    Ah, I guess I've never gone down that road - I always use SSL certificates.

    > I am wondering how GridFTP can be used in a generic scenario: transferring bulk data from a box (not in the Hadoop cluster) to a remote Hadoop cluster at a regular interval (maybe hourly, or every couple of minutes). So, I guess I can install the GridFTP server on the Hadoop name node and the GridFTP client on the remote data box. But to bypass the intermediate step of keeping a local copy on the Hadoop name node, I need something like the plugin you mentioned. Is that correct?

    This is correct.

    > Since I don't have the plugin you have, I found a helpful article that might address the problem:
    >
    > http://osg-test2.unl.edu/documentation/hadoop/gridftp-hdfs

    You can get the plugin here if you're using the same Hadoop version as us:

    https://twiki.grid.iu.edu/bin/view/Storage/Hadoop

    The source code is

    svn://t2.unl.edu/brian/gridftp_hdfs

    (browse it online here: http://t2.unl.edu:8094/browser/gridftp_hdfs)

    Heck, I've even updated the RPM .spec file to be compatible with the Cloudera packaging of 0.20.x (we plan on moving to Cloudera's distribution for that version). I haven't tested it lately against Cloudera's packaging.

    > It seems that it can write data directly to Hadoop (although I don't know exactly how). But I am not sure how to direct the GridFTP client to write data to Hadoop, with something like "globus-url-copy localurl hdfs://hadoopnamenode/pathinhdfs"? Otherwise, there might be some mapping on the GridFTP server side to relay data to Hadoop.

    No, it would be something like this:

    globus-url-copy localurl gsiftp://gridftp-server.example.com/path/in/hdfs

    I would never recommend running any server on the Hadoop namenode besides the Hadoop namenode. :)

    > I think this is interesting if it works. Basically, this is a "push" mode.
    >
    > Even better would be a "pull" mode: I still want something built into Hadoop (so that it runs in map/reduce) that acts like "hadoop distcp s3://123:456@nutch/ hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998" or "hadoop distcp -f filelistA hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998", where filelistA looks like
    >
    > s3://123:456@nutch/file1
    > s3://123:456@nutch/fileN
    >
    > So, just like accessing local files, we might have something like "hadoop distcp file://remotehost/path hdfs://namenode/path" or "hadoop distcp -f filelistB hdfs://hostname/path", where filelistB looks like
    >
    > file://remotehost/path1/file1
    > file://remotehost/path2/fileN
    >
    > (file:// works for the local file system, but in this case it would point to a remote file system, or be replaced with something like remote://). Some middleware would then sit on the remote host and the name node to exchange data (GridFTP, in this case?), provided they agree on protocols (ports, etc.).
    >
    > If security is an issue, the data can be GPG-encrypted before doing a "distcp".
    This might serve you better. It's probably a fairly minor modification to distcp? Doesn't sound too hard to code.
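
    (Until something like that exists in distcp, a low-tech "pull" can be driven from the cluster side with the same pipe idea in reverse; host names and paths are hypothetical, and it assumes a box with an HDFS client has SSH access to the data box.)

    ssh databox 'cat /data/bulk/datafile' | hadoop fs -put - /user/michael/incoming/datafile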

    Brian

  • Jiang licht at Mar 2, 2010 at 10:37 pm
    Brian,

    Thanks a lot for sharing your work and experience!

    I'm not sure if I will use it, but an HDFS-aware GridFTP server absolutely provides a good solution for bulk data transfer to Hadoop, taking care of efficiency and security (reliability and integrity checks?). In summary, the GridFTP client talks to the GridFTP server, and the server talks to the Hadoop cluster to relay data without storing a local copy. All three pieces can sit on geographically separated boxes. Great work!

    Thanks,
    --

    Michael

  • Mattmann, Chris A (388J) at Mar 2, 2010 at 10:07 pm
    Hi All,

    I co-authored a paper about this that was published at the NASA/IEEE Mass Storage conference in 2006 [1]. Also, my Ph.D. Dissertation [2] contains information about making these types of data movement selections when needed. Thought I'd throw it out there in case it helps.

    HTH,
    Chris

    [1] http://sunset.usc.edu/~mattmann/pubs/MSST2006.pdf
    [2] http://sunset.usc.edu/~mattmann/Dissertation.pdf




    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Chris Mattmann, Ph.D.
    Senior Computer Scientist
    NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
    Office: 171-266B, Mailstop: 171-246
    Email: Chris.Mattmann@jpl.nasa.gov
    WWW: http://sunset.usc.edu/~mattmann/
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Adjunct Assistant Professor, Computer Science Department
    University of Southern California, Los Angeles, CA 90089 USA
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
