distcp does not skip copying file if we are updating single file
----------------------------------------------------------------

Key: HADOOP-6051
URL: https://issues.apache.org/jira/browse/HADOOP-6051
Project: Hadoop Core
Issue Type: Bug
Components: tools/distcp
Affects Versions: 0.21.0
Reporter: Ravi Gummadi
Fix For: 0.21.0


distcp does not skip the copy when -update is run on a single file and the destination file already exists.

When we run

hadoop distcp -update srcfilename destfilename

it seems to compare the checksums of srcfilename and destfilename/srcfilename, so the copy is never skipped. It should compare the checksums of srcfilename and destfilename.
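
The mismatch described above can be illustrated with a small model. This is only a sketch and not distcp code: the function names, the paths, and the toy `existing` table are hypothetical. It shows why a check against destfilename/srcfilename can never find the existing destination file.

```python
import posixpath

def compared_path_reported(src: str, dest: str) -> str:
    """Path the report says is checked: dest/<basename of src>."""
    return posixpath.join(dest, posixpath.basename(src))

def compared_path_expected(src: str, dest: str) -> str:
    """Path the reporter expects for a single-file update: dest itself."""
    return dest

# Toy namespace (hypothetical data): path -> (size, checksum).
existing = {"/data/destfilename": (1024, "abc123")}

src = "/in/srcfilename"
dest = "/data/destfilename"

reported = compared_path_reported(src, dest)
expected = compared_path_expected(src, dest)

# The reported path is absent from the namespace, so nothing can match
# and the file is copied again on every run.
print(reported, reported in existing)
print(expected, expected in existing)
```

In this model the lookup under the reported path always misses, which corresponds to the repeated copy described in the report.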

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

  • Tsz Wo (Nicholas), SZE (JIRA) at Jun 16, 2009 at 11:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720416#action_12720416 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-6051:
    ------------------------------------------------

    No one should use distcp to copy a single file anymore! Just kidding ... we have to fix this. :)
  • Tsz Wo (Nicholas), SZE (JIRA) at Jun 17, 2009 at 9:20 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720902#action_12720902 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-6051:
    ------------------------------------------------
    {quote}
    hadoop distcp -update srcfilename destfilename

    it seems to be comparing checksums of srcfilename and destfilename/srcfilename and so skip is not done. It should compare checksums of srcfilename and destfilename.
    {quote}
    Actually, this is the correct behavior according to the [doc|http://hadoop.apache.org/core/docs/r0.20.0/distcp.html]. Quoted -update description below:
    {quote}
    As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. As discussed in the [following|http://hadoop.apache.org/core/docs/r0.20.0/distcp.html#uo], it also changes the semantics for generating destination paths, so users should use this carefully.
    {quote}
  • Ravi Gummadi (JIRA) at Jun 18, 2009 at 4:45 am
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721060#action_12721060 ]

    Ravi Gummadi commented on HADOOP-6051:
    --------------------------------------

    Earlier, only file sizes were checked. In trunk, checksums are now also checked after the file sizes.
    In any case, if I run the following command multiple times

    hadoop distcp -update srcfile destfile

    and destfile doesn't exist initially, -update should copy the file only once. From the second run onwards it should not copy, because the file sizes (and checksums) are the same.
    But the problem seems to be that distcp is not comparing the file sizes and checksums of srcfile and destfile. It seems to compare srcfile with the path destfile/srcfile (i.e. srcfile in the destfile directory), which is wrong.
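
The expected behaviour across repeated runs can be sketched as follows. This is a simplified model under stated assumptions, not the distcp implementation: `FileMeta`, `should_skip`, `run_update`, and the in-memory `namespace` dict are hypothetical names used only for illustration. The order of checks (size first, then checksum) follows the description above.

```python
from typing import Dict, NamedTuple, Optional

class FileMeta(NamedTuple):
    size: int
    checksum: str

def should_skip(src: FileMeta, dst: Optional[FileMeta]) -> bool:
    """Skip only if the destination exists and matches in size, then checksum."""
    if dst is None:
        return False            # nothing at the destination yet
    if src.size != dst.size:
        return False            # sizes differ, so the file must be copied
    return src.checksum == dst.checksum

def run_update(src: FileMeta, dest_path: str, namespace: Dict[str, FileMeta]) -> bool:
    """Simulate one -update run; return True if a copy happened."""
    if should_skip(src, namespace.get(dest_path)):
        return False
    namespace[dest_path] = src  # "copy" the file into the toy namespace
    return True

namespace: Dict[str, FileMeta] = {}
srcfile = FileMeta(size=1024, checksum="abc123")

first_run_copied = run_update(srcfile, "/data/destfile", namespace)
second_run_copied = run_update(srcfile, "/data/destfile", namespace)
print(first_run_copied, second_run_copied)
```

The second run is skipped only because the lookup uses the same path the first run wrote to. Looking up a different path would make every run copy again.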
  • Tsz Wo (Nicholas), SZE (JIRA) at Jun 18, 2009 at 4:52 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721340#action_12721340 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-6051:
    ------------------------------------------------
    {quote}
    But the problem here seems to be it is not comparing the file sizes and checksums of srcfile and destfile. distcp seems to be comparing srcfile with the path destfile/srcfile (i.e. srcfile in destfile directory), which is wrong.
    {quote}
    According to the doc, "hadoop distcp -update foo bar" means copying foo to bar/foo, not copying foo to bar. Could you check it?
  • Ravi Gummadi (JIRA) at Jun 18, 2009 at 5:08 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721344#action_12721344 ]

    Ravi Gummadi commented on HADOOP-6051:
    --------------------------------------

    Currently -update writes to bar only, and I think that is correct.
    It copies to bar/foo only if bar is a directory that already exists (similar to what happens without -update). If "bar" doesn't exist at the destination, then foo is copied to bar. If "bar" exists at the destination and is a file, it is overwritten if it differs from the source (this is the case where the overwrite happens again and again, though it should not).
    I don't see any path difference between running with -update and running without it, in any case (whether the destination exists or not). Am I missing a case where -update writes to a different path than a run without the -update option?
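
The three single-file cases listed in this comment can be written down as a small function. This is a sketch of the behaviour as described in the comment, not the actual distcp logic. `DestKind` and `resolve_dest` are hypothetical names.

```python
import posixpath
from enum import Enum

class DestKind(Enum):
    MISSING = "missing"
    FILE = "file"
    DIRECTORY = "directory"

def resolve_dest(src: str, dest: str, dest_kind: DestKind) -> str:
    """Return the path a single source file would be written to."""
    if dest_kind is DestKind.DIRECTORY:
        # Existing directory: the file lands inside it under its own name.
        return posixpath.join(dest, posixpath.basename(src))
    # Missing destination or existing file: the file is written to dest itself.
    return dest

print(resolve_dest("/in/foo", "/out/bar", DestKind.DIRECTORY))
print(resolve_dest("/in/foo", "/out/bar", DestKind.MISSING))
print(resolve_dest("/in/foo", "/out/bar", DestKind.FILE))
```

Under this reading, the path used for the size and checksum comparison should be the same path that `resolve_dest` returns for the write.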
  • Ravi Gummadi (JIRA) at Jun 18, 2009 at 5:28 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721356#action_12721356 ]

    Ravi Gummadi commented on HADOOP-6051:
    --------------------------------------

    I just saw that the doc says something different from what I understood. But distcp -update works only as I described. I rechecked just now after looking at the doc.
    For the case of distcp -update foo/a foo/b bar mentioned in the doc, the effect of the command is to create bar/a/aa, bar/a/ab, bar/b/ba and bar/b/bb.

    I guess we need to change the doc. Or do we really need to change the behaviour of distcp?
  • Tsz Wo (Nicholas), SZE (JIRA) at Jun 18, 2009 at 6:04 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721384#action_12721384 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-6051:
    ------------------------------------------------
    {quote}
    I guess we need to change the doc. Or do we really need to change the behaviour of distcp?
    {quote}
    Thanks for checking this, Ravi. The doc is actually not clear about copying a single file, which, as always, is a not-so-useful special case. We are free to go either way. It is probably better to minimize the code complexity, since distcp is already complicated enough.
  • Ravi Gummadi (JIRA) at Jun 18, 2009 at 6:12 pm
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721391#action_12721391 ]

    Ravi Gummadi commented on HADOOP-6051:
    --------------------------------------

    With -update, the path to which a single file is copied is correct. The issue is just that the file is copied again and again even with -update, which should not happen. I will upload a patch soon and will also try to update the doc to reflect the correct behaviour.
  • Ravi Gummadi (JIRA) at Jun 19, 2009 at 8:53 am
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi reassigned HADOOP-6051:
    ------------------------------------

    Assignee: Ravi Gummadi
  • Ravi Gummadi (JIRA) at Jun 19, 2009 at 8:55 am
    [ https://issues.apache.org/jira/browse/HADOOP-6051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Ravi Gummadi updated HADOOP-6051:
    ---------------------------------

    Attachment: d_singlefile_update.patch

    Attaching a patch that fixes the path used for the -update checksum check in the single-file case.

    Please review and provide your comments.

Discussion Overview
group: common-dev @
categories: hadoop
posted: Jun 16, '09 at 8:34a
active: Jun 19, '09 at 8:55a
posts: 11
users: 1
website: hadoop.apache.org...
irc: #hadoop

1 user in discussion

Ravi Gummadi (JIRA): 11 posts
