[ https://issues.apache.org/jira/browse/HADOOP-3939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626739#action_12626739 ]
Chris Douglas commented on HADOOP-3939:
---------------------------------------
* Would it make sense to require either \-update or \-overwrite if \-delete is specified? Without either of these options, the semantics are a little confusing. For example:
** In this case, the destination doesn't exist. Everything that isn't the source is deleted, which seems reasonable.
{noformat}
$ bin/hadoop fs -ls a b
Found 2 items
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42 /user/someuser/a/part-00000
Found 4 items
-rw-r--r-- 1 someuser somegroup 105177784 2008-08-28 11:46 /user/someuser/b/part-00000
-rw-r--r-- 1 someuser somegroup 105177884 2008-08-28 11:46 /user/someuser/b/part-00001
-rw-r--r-- 1 someuser somegroup 105177754 2008-08-28 11:46 /user/someuser/b/part-00002
$ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
08/08/28 11:51:18 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
08/08/28 11:51:18 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
Deleted hdfs://host/user/someuser/b/part-00000
Deleted hdfs://host/user/someuser/b/part-00001
Deleted hdfs://host/user/someuser/b/part-00002
[snip]
$ bin/hadoop fs -ls a b
Found 2 items
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42 /user/someuser/a/part-00000
Found 2 items
drwxr-xr-x - someuser somegroup 0 2008-08-28 11:51 /user/someuser/b/a
{noformat}
** Here, the destination does exist, but it is deleted anyway, as though \-overwrite were specified.
{noformat}
$ bin/hadoop fs -lsr a b
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42 /user/someuser/a/part-00000
-rw-r--r-- 1 someuser somegroup 105177784 2008-08-28 11:51 /user/someuser/b/part-00000
-rw-r--r-- 1 someuser somegroup 105177884 2008-08-28 11:51 /user/someuser/b/part-00001
-rw-r--r-- 1 someuser somegroup 105177754 2008-08-28 11:51 /user/someuser/b/part-00002
drwxr-xr-x - someuser somegroup 0 2008-08-28 13:34 /user/someuser/b/a
-rw-r--r-- 1 someuser somegroup 105177784 2008-08-28 13:34 /user/someuser/b/a/part-00000
$ bin/hadoop distcp -delete hdfs://host:8020/user/someuser/a hdfs://host:8020/user/someuser/b
08/08/28 13:35:14 INFO tools.DistCp: srcPaths=[hdfs://host:8020/user/someuser/a]
08/08/28 13:35:14 INFO tools.DistCp: destPath=hdfs://host:8020/user/someuser/b
Deleted hdfs://host:8020/user/someuser/b/part-00000
Deleted hdfs://host:8020/user/someuser/b/part-00001
Deleted hdfs://host:8020/user/someuser/b/part-00002
Deleted hdfs://host:8020/user/someuser/b/a
[snip]
$ bin/hadoop fs -lsr a b
-rw-r--r-- 1 someuser somegroup 92934 2008-08-11 21:42 /user/someuser/a/part-00000
drwxr-xr-x - someuser somegroup 0 2008-08-28 13:35 /user/someuser/b/a
-rw-r--r-- 1 someuser somegroup 92934 2008-08-28 13:35 /user/someuser/b/a/part-00000
{noformat}
Adding this dependency would also help prevent casual errors and potentially serious mistakes if the Trash is disabled.
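The proposed dependency between the options could be enforced while the arguments are parsed. The following is only a minimal sketch under stated assumptions: `DeleteOptionCheck` and its `validate` method are hypothetical names, not part of DistCp or the attached patches, and the flags are treated as plain strings rather than going through DistCp's real argument parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of DistCp or the patch. */
final class DeleteOptionCheck {
    private DeleteOptionCheck() {}

    /**
     * Rejects a command line that passes -delete without -update or -overwrite.
     * Assumes the flags appear verbatim among the arguments.
     */
    static void validate(String[] args) {
        Set<String> flags = new HashSet<>(Arrays.asList(args));
        boolean delete = flags.contains("-delete");
        boolean updateOrOverwrite =
            flags.contains("-update") || flags.contains("-overwrite");
        if (delete && !updateOrOverwrite) {
            throw new IllegalArgumentException(
                "-delete requires either -update or -overwrite");
        }
    }
}
```

With a check like this, both invocations shown above would fail before anything is deleted, while the same commands with \-update or \-overwrite added would proceed.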
* It might help to always wrap the exception in an IOException whose message says that FsShell failed, with the original exception set as the cause, rather than:
{noformat}
+ } catch(Exception e) {
+ throw e instanceof IOException? (IOException)e: new IOException(e);
+ }
{noformat}
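One way to read this suggestion is to wrap every failure, including an IOException, in a new IOException that names the failing operation and keeps the original exception as its cause. This is a sketch only: `FsShellFailure` and `callWithContext` are hypothetical names, and a `Callable` stands in for the actual FsShell invocation in the patch.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical wrapper for illustration; not the code in the patch. */
final class FsShellFailure {
    private FsShellFailure() {}

    /**
     * Runs the action and, on any failure, throws an IOException whose
     * message names the operation and whose cause is the original exception.
     */
    static <T> T callWithContext(String description, Callable<T> action)
            throws IOException {
        try {
            return action.call();
        } catch (Exception e) {
            // Always add context, even when e is already an IOException.
            throw new IOException(description + " failed", e);
        }
    }
}
```

The trade-off is one extra level of nesting in the stack trace in exchange for a message that always says which step failed.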
* When \-delete is specified, the client is doing a lot of work to recursively list the destination, then to delete individual files there. In the future it might make sense to leave it to the maps to delete entries, since the source list is sorted. The client (or a reduce) would have to do some work on the boundaries, but it should scale well. The current patch is clearer given distcp's current organization, though.
* The fix to FileStatus makes sense, but when is the Path null?
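To illustrate why a sorted source list helps with \-delete, here is a rough sketch of the comparison step as a single linear merge. It assumes both listings are available as sorted lists of paths relative to their roots; `SortedDiff` and `onlyInDst` are hypothetical names, and the sketch ignores how the work would be split across maps or handled at split boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not how the current patch computes deletions. */
final class SortedDiff {
    private SortedDiff() {}

    /**
     * Returns the entries of sortedDst that do not appear in sortedSrc.
     * Both lists must be sorted in natural String order.
     */
    static List<String> onlyInDst(List<String> sortedSrc, List<String> sortedDst) {
        List<String> toDelete = new ArrayList<>();
        int i = 0;
        for (String dst : sortedDst) {
            // Skip source entries that sort before the current destination entry.
            while (i < sortedSrc.size() && sortedSrc.get(i).compareTo(dst) < 0) {
                i++;
            }
            // No matching source entry means the destination entry is a deletion candidate.
            if (i == sortedSrc.size() || !sortedSrc.get(i).equals(dst)) {
                toDelete.add(dst);
            }
        }
        return toDelete;
    }
}
```

Because each list is traversed once, a contiguous range of the sorted listing could in principle be handled independently, which is what would let the deletions move into the maps.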
DistCp should support an option for deleting non-existing files.
----------------------------------------------------------------
Key: HADOOP-3939
URL: https://issues.apache.org/jira/browse/HADOOP-3939
Project: Hadoop Core
Issue Type: New Feature
Components: tools/distcp
Reporter: Tsz Wo (Nicholas), SZE
Assignee: Tsz Wo (Nicholas), SZE
Attachments: 3939_20080825.patch, 3939_20080825b.patch, 3939_20080826.patch
One use case of DistCp is to sync two directories. Currently, DistCp has an -update option for overwriting dst files if src is different from dst. However, this is not enough for a sync: if some files exist in dst but not in src, there is no easy way to delete them. We should add a new option, say -delete, so that DistCp deletes the files in dst that do not exist in src.