Is there an rsyncd for HDFS
Hello,

Does anyone know of a modified "rsync" that gets/puts files to/from the DFS instead of the normal, mounted filesystems? I'm guessing that since the DFS can't be mounted like a "normal" filesystem, rsync would need to be modified to access it, as would any other program. We use rsync --daemon a lot for moving files around, making backups, etc., so I think it would be a logical fit... at least I hope so.

I'm new to Hadoop and just got my first standalone node configured. Apologies if this has been answered before, or if I'm missing something obvious.

Thanks
gregc
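
Since HDFS is reached through its own API rather than a mount, a minimal sketch of a "put" from Java might look like the following (a hedged example, not from the thread; the paths are placeholders and the cluster settings are assumed to come from the usual Hadoop configuration files):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutToDfs {
        public static void main(String[] args) throws Exception {
            // Reads filesystem settings from the Hadoop config files on the classpath.
            Configuration conf = new Configuration();
            FileSystem dfs = FileSystem.get(conf);   // the configured default filesystem (HDFS)

            Path local = new Path("/var/log/app/example.log");   // placeholder local path
            Path remote = new Path("/backups/example.log");      // placeholder HDFS path

            // Equivalent of "hadoop fs -put": copies the local file into the DFS.
            dfs.copyFromLocalFile(local, remote);
            dfs.close();
        }
    }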


  • Ted Dunning at Jan 2, 2008 at 4:32 pm
    That is a good idea. I currently use a shell script that does the rough
    equivalent of rsync -av, but it wouldn't be bad to have a one-liner that
    solves the same problem.

    One (slight) benefit to the scripted approach is that I get a list of
    directories to which files have been moved. That lets me reprocess entire
    directories for aggregates when something changes. I expect that a clean
    implementation of rsync could give me a list of files that I could sed into a
    list of directories.
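
    A rough sketch of that kind of scripted one-way sync, written against the Java FileSystem API rather than as a shell script (the local and HDFS roots and the "copy if missing or newer" rule are assumptions for illustration, not the actual script described above):

        import java.io.File;
        import java.util.HashSet;
        import java.util.Set;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class DfsSyncSketch {
            public static void main(String[] args) throws Exception {
                FileSystem dfs = FileSystem.get(new Configuration());
                File localRoot = new File("/data/incoming");      // placeholder local root
                Path dfsRoot = new Path("/warehouse/incoming");   // placeholder HDFS root
                Set<String> changedDirs = new HashSet<String>();

                File[] files = localRoot.listFiles();
                if (files == null) return;
                for (File f : files) {
                    if (!f.isFile()) continue;
                    Path dst = new Path(dfsRoot, f.getName());
                    // Copy only files that are missing on the HDFS side or newer locally.
                    if (!dfs.exists(dst)
                            || dfs.getFileStatus(dst).getModificationTime() < f.lastModified()) {
                        dfs.copyFromLocalFile(new Path(f.getAbsolutePath()), dst);
                        changedDirs.add(dst.getParent().toString());
                    }
                }
                // The list of directories that received files, for reprocessing aggregates.
                for (String dir : changedDirs) {
                    System.out.println(dir);
                }
            }
        }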

  • Joydeep Sen Sarma at Jan 2, 2008 at 7:39 pm
    HDFS doesn't allow random overwrites or appends, so even if HDFS were mountable, I'm guessing we couldn't just point rsync at a DFS mount (I've never looked at the rsync code, but I assume it does appends and random writes). Any emulation of rsync would end up having to delete and recreate changed files in HDFS.

    If your data/processing is mostly log files, replication to HDFS can take advantage of some strong assumptions (a file only changes at the end, and one file can be converted into multiple files as long as the mapping can be inferred easily).
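
    As an illustration of those assumptions, a hedged sketch of shipping only the newly appended bytes of a local log to HDFS as a fresh part file, so file.log maps to file.log.part0, file.log.part1, and so on (the offset bookkeeping and names here are invented for the example):

        import java.io.File;
        import java.io.FileInputStream;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;

        public class LogTailShipper {
            // Ships the bytes written to the local log after 'offset' into HDFS as a new
            // part file and returns the new offset; the caller persists the offset between runs.
            public static long shipTail(FileSystem dfs, File localLog, Path dfsDir,
                                        long offset, int partNumber) throws Exception {
                long length = localLog.length();
                if (length <= offset) {
                    return offset;   // nothing appended since the last run
                }
                FileInputStream in = new FileInputStream(localLog);
                long skipped = 0;
                while (skipped < offset) {           // skip() may skip fewer bytes than asked
                    skipped += in.skip(offset - skipped);
                }
                Path part = new Path(dfsDir, localLog.getName() + ".part" + partNumber);
                FSDataOutputStream out = dfs.create(part);
                IOUtils.copyBytes(in, out, 64 * 1024, true);   // copies the tail, closes both streams
                return length;
            }
        }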

  • Greg Connor at Jan 2, 2008 at 9:36 pm

    From: Joydeep Sen Sarma

    HDFS doesn't allow random overwrites or appends, so even if
    HDFS were mountable, I'm guessing we couldn't just point
    rsync at a DFS mount (I've never looked at the rsync code,
    but I assume it does appends and random writes). Any
    emulation of rsync would end up having to delete and
    recreate changed files in HDFS.

    Thanks for the reply. Most of the functions I've used rsync for are probably compatible with this... I believe the default is to write to a hidden temporary file, close it, and then rename it to the final name. So, if someone were to take certain filesystem calls and replace them with HDFS API calls, it would probably work seamlessly for most users.
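
    A small sketch of that write-then-rename pattern done directly against the FileSystem API (class and path names are placeholders):

        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class AtomicDfsWrite {
            // Write to a hidden temporary name, close the file, then rename it to the
            // final name so readers never see a half-written file.
            public static void writeAtomically(FileSystem dfs, Path finalPath, byte[] data)
                    throws Exception {
                Path tmp = new Path(finalPath.getParent(), "." + finalPath.getName() + ".tmp");
                FSDataOutputStream out = dfs.create(tmp, true);   // overwrite any stale temp file
                out.write(data);
                out.close();
                if (dfs.exists(finalPath)) {
                    dfs.delete(finalPath, false);   // rename in HDFS won't replace an existing file
                }
                dfs.rename(tmp, finalPath);
            }
        }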

    I know rsync has a partial-checksum feature for when the file already exists on the destination: instead of transferring the whole thing, it determines fairly intelligently which blocks have changed and sends only those. I admit I don't actually know whether it writes a second file in that case. For my purposes I would be fine with just disabling the features that modify files in place... I probably wouldn't even go to the trouble of working around them.
    If your data/processing is mostly log files, replication to
    HDFS can take advantage of some strong assumptions (a file
    only changes at the end, and one file can be converted into
    multiple files as long as the mapping can be inferred
    easily).

    Excellent... that confirms what I was thinking. Our app is mostly small files (around 1 MB), but they are almost all write-once, read-many... it's extremely rare to replace a file after it's written, and even then it's almost always a complete replacement.

    Thanks again!

