Distcp truncates some files when copying
----------------------------------------

Key: HADOOP-2725
URL: https://issues.apache.org/jira/browse/HADOOP-2725
Project: Hadoop Core
Issue Type: Bug
Components: util
Affects Versions: 0.16.0
Environment: Nightly build: http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
With patches for HADOOP-2095 and HADOOP-2119.
Reporter: Murtaza A. Basrai
Priority: Critical
Fix For: 0.16.0


We used distcp to copy ~100 TB of data between two clusters of ~1400 nodes each.

Command used (it was run on the src cluster):
hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n hdfs://tgt-namenode:8600//dst-dir

Distcp completed without errors, but when we checked the file sizes on the src and tgt clusters, we noticed differences in file sizes for 9 files (~6 GB).

src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
src-file-3 692172075 bytes -> tgt-file-3 0 bytes

All target files are truncated at block boundaries (some have 0 size).
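
The block-boundary observation can be checked with simple arithmetic. The sketch below is only an illustration: the report does not state the block size configured on the target cluster, so `BLOCK_SIZE` (64 MiB) is an assumption, and the function names are hypothetical.

```python
# Assumed block size (64 MiB); the report does not state the actual value.
BLOCK_SIZE = 64 * 1024 * 1024

# (source bytes, target bytes) pairs quoted in the report.
SIZES = {
    "file-1": (666762714, 134217728),
    "file-2": (673791814, 536870912),
    "file-3": (692172075, 0),
}


def is_block_aligned(size, block_size=BLOCK_SIZE):
    """Return True if size is a whole number of blocks (0 counts as aligned)."""
    return size % block_size == 0


def truncated_at_block_boundary(src, dst, block_size=BLOCK_SIZE):
    """A target looks truncated at a block boundary if it is shorter
    than its source and its length is a whole number of blocks."""
    return dst < src and is_block_aligned(dst, block_size)


if __name__ == "__main__":
    for name, (src, dst) in SIZES.items():
        print(name, dst // BLOCK_SIZE, "full blocks,",
              src - dst, "bytes missing,",
              truncated_at_block_boundary(src, dst))
```

Under this assumption the three targets hold 2, 8 and 0 full blocks, while none of the source sizes is block-aligned. The same target sizes are also whole multiples of 128 MiB, so the conclusion does not depend on which of the two block sizes was in use.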


I looked at the log files, and noticed a few things:

1. There are 31059 log files (same as the number of Maps the job had).

2. 246 of the log files are non-empty.

3. All non-empty log files are of the form:

SKIP: hdfs://src-namenode/src-dir-a/src-file-x
SKIP: hdfs://src-namenode/src-dir-b/src-file-y
SKIP: hdfs://src-namenode/src-dir-c/src-file-z

4. All 9 files which were truncated were included in the log files as skipped files.

5. All 9 files were the last entry in their respective log files.

e.g.
Non-empty logfile 1:

SKIP: hdfs://src-namenode/src-dir-a/src-file-x
SKIP: hdfs://src-namenode/src-dir-b/src-file-y
SKIP: hdfs://src-namenode/src-dir-c/src-file-z <-- Truncated file

Non-empty logfile 2:
SKIP: hdfs://src-namenode/src-dir-p/src-file-m
SKIP: hdfs://src-namenode/src-dir-q/src-file-n <-- Truncated file
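
The log inspection described above could be automated along the following lines. This is a rough sketch, assuming the log contents have already been read into strings; the function names and the `logs` dictionary are hypothetical, and only the `SKIP: <path>` line format is taken from the report.

```python
def skipped_paths(log_text):
    """Return the paths named on 'SKIP:' lines, in the order they appear."""
    paths = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("SKIP:"):
            paths.append(line[len("SKIP:"):].strip())
    return paths


def last_skipped(logs):
    """Map each non-empty log name to the last path it reports as skipped.

    In the report, every truncated file was the last entry of its log,
    so these paths are the first candidates to check for truncation.
    """
    result = {}
    for name, text in logs.items():
        paths = skipped_paths(text)
        if paths:
            result[name] = paths[-1]
    return result


if __name__ == "__main__":
    # Placeholder log contents using the anonymised names from the report.
    logs = {
        "logfile-1": "SKIP: hdfs://src-namenode/src-dir-a/src-file-x\n"
                     "SKIP: hdfs://src-namenode/src-dir-c/src-file-z\n",
        "logfile-2": "",
    }
    for name, path in last_skipped(logs).items():
        print(name, "->", path)
```

A skipped file is not necessarily truncated (246 logs were non-empty but only 9 files were affected), so each candidate still has to be compared against its source size.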

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • dhruba borthakur (JIRA) at Jan 29, 2008 at 1:22 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563361#action_12563361 ]

    dhruba borthakur commented on HADOOP-2725:
    ------------------------------------------

    It is possible that a map task copied a few blocks on the file before it encountered an error. The map task failed and got restarted. Now, the destination file already exists in DFS. The re-run of the map task sees that the file already exists and skips it.

    Maybe distcp should copy each file to a temporary filename in the destination folder and then, once the entire copy is successful, rename it to the real filename. The rename is atomic in DFS,
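
The failure mode and the proposed remedy can be illustrated on a local filesystem. The sketch below is an analogy only, not distcp or HDFS code: `os.replace` stands in for the DFS rename, the `.tmp` suffix is a hypothetical naming choice, and `fail_after` simulates a map task dying partway through a copy.

```python
import os
import tempfile


def copy_skip_existing(src, dst, fail_after=None):
    """Naive copy: skip if dst exists, otherwise write dst directly.

    fail_after simulates a task failure after that many bytes.
    """
    if os.path.exists(dst):
        return "SKIP"
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        data = fin.read()
        if fail_after is not None:
            fout.write(data[:fail_after])
            raise IOError("simulated task failure")
        fout.write(data)
    return "COPIED"


def copy_via_temp(src, dst, fail_after=None):
    """Copy to a temporary name, then rename onto dst when complete."""
    if os.path.exists(dst):
        return "SKIP"
    tmp = dst + ".tmp"  # hypothetical temporary name
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        data = fin.read()
        if fail_after is not None:
            fout.write(data[:fail_after])
            raise IOError("simulated task failure")
        fout.write(data)
    os.replace(tmp, dst)  # dst only ever appears fully written
    return "COPIED"


def run_with_retry(copy_fn, src, dst):
    """First attempt fails after 100 bytes; the retry runs normally."""
    try:
        copy_fn(src, dst, fail_after=100)
    except IOError:
        pass
    return copy_fn(src, dst), os.path.getsize(dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src-file")
        with open(src, "wb") as f:
            f.write(b"x" * 1000)
        print(run_with_retry(copy_skip_existing, src, os.path.join(d, "naive")))
        print(run_with_retry(copy_via_temp, src, os.path.join(d, "atomic")))
```

With the naive version the retry sees the partial destination and skips it, which matches the truncated, logged-as-skipped files in the report. With the temporary name, the retry overwrites the leftover temporary file and the destination only appears once the copy is complete.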
  • Koji Noguchi (JIRA) at Jan 29, 2008 at 1:23 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563362#action_12563362 ]

    Koji Noguchi commented on HADOOP-2725:
    --------------------------------------

    We have been using '-update' to avoid this problem for now. (0.15)

  • Chris Douglas (JIRA) at Jan 29, 2008 at 2:29 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12563366#action_12563366 ]

    Chris Douglas commented on HADOOP-2725:
    ---------------------------------------

    bq. Maybe distcp should copy each file to a temporary filename in the destination folder and then, once the entire copy is successful, rename it to the real filename.

    This is a good idea. However, since distcp accepts multiple sources, it is possible for multiple sources to map to the same destination. In the default case, skipping files that are already present prevents both accidental deletion of data at the destination and, now that files appear as soon as they are created, map tasks overwriting a file that another map is copying or has already copied. If one doesn't expect files to be skipped, searching the logs for skipped files is necessary.

    Copying to a temporary dir and renaming can distinguish part of the latter case, since collisions at creation time are unambiguously part of the copy. The problem changes, however, because now we must distinguish between files being copied by another map and files that were part of a failed attempt (in the temp dir). The distcp user still needs to review the log to determine what, if any, of the cruft left in the temp dir is relevant.

    Running the task a second time with '-update' seems easier, if less efficient.
  • Sameer Paranjpye (JIRA) at Jan 30, 2008 at 9:19 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sameer Paranjpye updated HADOOP-2725:
    -------------------------------------

    Fix Version/s: 0.16.1 (was: 0.16.0)
  • Sameer Paranjpye (JIRA) at Jan 31, 2008 at 4:41 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Sameer Paranjpye updated HADOOP-2725:
    -------------------------------------

    Assignee: Tsz Wo (Nicholas), SZE
  • Robert Chansler (JIRA) at Feb 1, 2008 at 12:31 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Robert Chansler updated HADOOP-2725:
    ------------------------------------

    Component/s: dfs
  • Tsz Wo (Nicholas), SZE (JIRA) at Feb 4, 2008 at 10:15 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565566#action_12565566 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-2725:
    ------------------------------------------------

    Shall we make "-update" the default option? It would be like cp in Unix, i.e. cp overwrites files by default.
  • Sameer Paranjpye (JIRA) at Feb 6, 2008 at 7:21 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566284#action_12566284 ]

    Sameer Paranjpye commented on HADOOP-2725:
    ------------------------------------------
    bq. Shall we make "-update" the default option? It would be like cp in Unix, i.e. cp overwrites files by default.

    I'd rather not make it the default. It's too easy to clobber vast amounts of data inadvertently.

    From the discussion above, it seems the problem is that partial copies aren't clearly distinguishable from successfully copied inputs. One has to compare the source and destination lists by name and size to determine the set of unsuccessful copies. The use of temporary filenames should make it easier to find partial copies.

    Another enhancement that would help is for distcp to delete partially copied files at the destination.
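
The name-and-size comparison mentioned above might look roughly like this. It is a sketch under stated assumptions: the two listings are assumed to have been collected already into dictionaries mapping a relative path to a size in bytes, and `find_partial_copies` is a hypothetical helper, not part of distcp.

```python
def find_partial_copies(src_sizes, dst_sizes):
    """Return {path: (source size, destination size)} for every source
    file whose destination is missing (size None) or differs in size."""
    mismatches = {}
    for path, src_size in src_sizes.items():
        dst_size = dst_sizes.get(path)
        if dst_size != src_size:
            mismatches[path] = (src_size, dst_size)
    return mismatches


if __name__ == "__main__":
    # Sizes for files 1 to 3 are the ones quoted in the report;
    # "file-ok" is an invented entry for a file that copied correctly.
    src = {"file-1": 666762714, "file-2": 673791814,
           "file-3": 692172075, "file-ok": 1024}
    dst = {"file-1": 134217728, "file-2": 536870912,
           "file-3": 0, "file-ok": 1024}
    for path, (s, d) in sorted(find_partial_copies(src, dst).items()):
        print(path, s, "->", d)
```

A size match does not prove the contents are identical; it only narrows down the set of files that need to be copied again.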


  • Tsz Wo (Nicholas), SZE (JIRA) at Feb 6, 2008 at 7:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566305#action_12566305 ]

    Tsz Wo (Nicholas), SZE commented on HADOOP-2725:
    ------------------------------------------------

    Chris and I have a better solution for this issue:
    # Verify the source list to make sure there is no duplication, since duplicated source files do not make sense in copying.
    # Then, use dhruba's idea: first copy each file to a temporary file and then rename it.

    Sounds good?
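
The first step, checking the source list for duplication, could be sketched as below. This is not the implementation in the patch: the input format (pairs of source path and destination-relative path) and the function name are assumptions made for illustration, and duplicates are found by sorting on the destination path and comparing neighbours.

```python
def find_duplicate_targets(entries):
    """Given (source path, destination-relative path) pairs, return
    {destination-relative path: [source paths]} for every destination
    that more than one source would be copied to."""
    ordered = sorted(entries, key=lambda e: e[1])  # stable sort by destination
    duplicates = {}
    for prev, cur in zip(ordered, ordered[1:]):
        if prev[1] == cur[1]:
            # First collision for this destination records the earlier source too.
            duplicates.setdefault(cur[1], [prev[0]])
            duplicates[cur[1]].append(cur[0])
    return duplicates


if __name__ == "__main__":
    # Hypothetical source list: two sources map to the same target name.
    entries = [
        ("hdfs://src-namenode/src-dir-1/part-0", "part-0"),
        ("hdfs://src-namenode/src-dir-2/part-1", "part-1"),
        ("hdfs://src-namenode/src-dir-3/part-0", "part-0"),
    ]
    for dst, sources in find_duplicate_targets(entries).items():
        print("duplicate destination", dst, "from", sources)
```

Rejecting such a list before the job starts means that any file found at the destination during the copy cannot have come from a second source in the same run.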
  • Tsz Wo (Nicholas), SZE (JIRA) at Feb 7, 2008 at 2:01 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tsz Wo (Nicholas), SZE updated HADOOP-2725:
    -------------------------------------------

    Attachment: 2725_20080206.patch

    2725_20080206.patch: check source duplication and then do atomic copy (i.e. copy to tmp and rename).
  • Chris Douglas (JIRA) at Feb 8, 2008 at 9:59 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567220#action_12567220 ]

    Chris Douglas commented on HADOOP-2725:
    ---------------------------------------

    Only a few minor nits:
    * Path::makeRelative should use String::split rather than StringTokenizer
    * Path::makeRelative probably needs a new test case in o.a.h.fs.TestPath
    * This uses SequenceFile.Sorter::sort, which has been unused in core for a while. It might be worthwhile to add some noise to the duplication test case (i.e. more than 2 files) to ensure that any regressions introduced through changes to the sort code are detected.

    In addition to its enhancements, this patch makes distcp much cleaner. +1
    Distcp truncates some files when copying
    ----------------------------------------

    Key: HADOOP-2725
    URL: https://issues.apache.org/jira/browse/HADOOP-2725
    Project: Hadoop Core
    Issue Type: Bug
    Components: dfs, util
    Affects Versions: 0.16.0
    Environment: Nightly build: http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
    With patches for HADOOP-2095 and HADOOP-2119.
    Reporter: Murtaza A. Basrai
    Assignee: Tsz Wo (Nicholas), SZE
    Priority: Critical
    Fix For: 0.16.1

    Attachments: 2725_20080206.patch


  • Tsz Wo (Nicholas), SZE (JIRA) at Feb 8, 2008 at 11:51 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tsz Wo (Nicholas), SZE updated HADOOP-2725:
    -------------------------------------------

    Attachment: 2725_20080208.patch

    Thanks, Chris! Here is an update.

    2725_20080208.patch:
    - The changes to makeRelative are reverted. We should probably work on it in a separate issue.
    - TestCopyFiles.testCopyDuplication() indeed initializes two 20-file sets and sorts them.
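
    The thread mentions a sort and a duplication test but does not show how they fit together. One plausible reading, stated here as an assumption, is that sorting brings identical destination paths next to each other so duplicates can be found by comparing neighbours. A minimal sketch of that idea with plain JDK collections (not SequenceFile.Sorter; all names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; the real patch sorts a SequenceFile instead.
class DuplicateDestinations {
    // Returns each destination path that occurs more than once, reported once, in sorted order.
    public static List<String> findDuplicates(List<String> destinations) {
        List<String> sorted = new ArrayList<>(destinations);
        Collections.sort(sorted);
        List<String> duplicates = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            String current = sorted.get(i);
            boolean sameAsPrevious = current.equals(sorted.get(i - 1));
            boolean alreadyReported = !duplicates.isEmpty()
                    && duplicates.get(duplicates.size() - 1).equals(current);
            if (sameAsPrevious && !alreadyReported) {
                duplicates.add(current);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> destinations =
                Arrays.asList("/dst/b", "/dst/a", "/dst/b", "/dst/c", "/dst/b");
        System.out.println(findDuplicates(destinations)); // [/dst/b]
    }
}
```

    Using more than two entries, as in the 20-file sets above, exercises the case where duplicates are separated by other paths before sorting.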
  • Tsz Wo (Nicholas), SZE (JIRA) at Feb 9, 2008 at 12:05 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tsz Wo (Nicholas), SZE updated HADOOP-2725:
    -------------------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Feb 9, 2008 at 1:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567260#action_12567260 ]

    Hadoop QA commented on HADOOP-2725:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12375120/2725_20080208.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 3 new or modified tests.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1764/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1764/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1764/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1764/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Feb 9, 2008 at 1:27 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-2725:
    ----------------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Nicholas!
  • Chris Douglas (JIRA) at Feb 9, 2008 at 3:08 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas reopened HADOOP-2725:
    -----------------------------------


    I reverted this patch because its test case (TestCopyFiles) took nearly 400s (up from 26s) on my machine, due to a silently failing local-to-local test case. All 20 files are copied successfully, but then fail in the rename:

    {noformat}
    2008-02-08 18:36:14,246 INFO util.CopyFiles (CopyFiles.java:map(390)) - FAIL 2522487525519213817 : java.io.IOException: Fail to rename tmp file (=file:/path/build/test/data/destdat/_distcp_tmp_cq5yoa/2522487525519213817) to destination file (=file:/path/build/test/data/destdat/2522487525519213817)
        at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.rename(CopyFiles.java:336)
        at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.copy(CopyFiles.java:317)
        at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.map(CopyFiles.java:382)
        at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.map(CopyFiles.java:202)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:208)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:132)
    Caused by: java.io.IOException: Target file:/path/build/test/data/destdat/.2522487525519213817.crc already exists
        at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:269)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
        at org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:211)
        at org.apache.hadoop.fs.ChecksumFileSystem.rename(ChecksumFileSystem.java:403)
        at org.apache.hadoop.util.CopyFiles$FSCopyFilesMapper.rename(CopyFiles.java:333)
        ... 6 more
    {noformat}

    At a glance, this looks like a problem in LocalFileSystem, but I'm reverting this patch for now.
  • Chris Douglas (JIRA) at Feb 9, 2008 at 3:58 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567276#action_12567276 ]

    Chris Douglas commented on HADOOP-2725:
    ---------------------------------------

    This looks related to HADOOP-730. Since RawLocalFileSystem uses copy to rename, the .crc is already regenerated when ChecksumFileSystem::rename attempts to move it.
  • Hudson (JIRA) at Feb 9, 2008 at 12:05 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567300#action_12567300 ]

    Hudson commented on HADOOP-2725:
    --------------------------------

    Integrated in Hadoop-trunk #395 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/395/])
  • Chris Douglas (JIRA) at Feb 12, 2008 at 1:41 am
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567895#action_12567895 ]

    Chris Douglas commented on HADOOP-2725:
    ---------------------------------------

    As it turns out, the observed failure is *not* related to HADOOP-730. Rather, the .crc files are included in the source file list, which causes the collision during rename. HADOOP-2754 has been merged into 0.16.1, so the source list should contain only "real" targets.
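
    To make the collision concrete: if the source list contains both a data file and its checksum sidecar (named like .2522487525519213817.crc in the stack trace), the sidecar is copied as an ordinary file and then clashes with the checksum file written for the data file. The sketch below filters such names out of a list. It is not the HADOOP-2754 change, which is not shown in this thread; the class and method names are hypothetical, and the ".<name>.crc" pattern is taken from the stack trace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; not part of Hadoop.
class ChecksumFilter {
    // True for names shaped like ".<name>.crc" (assumed sidecar naming).
    public static boolean looksLikeChecksumFile(String fileName) {
        return fileName.startsWith(".") && fileName.endsWith(".crc");
    }

    // Keeps only the entries that are not checksum sidecars, preserving order.
    public static List<String> realTargets(List<String> fileNames) {
        List<String> kept = new ArrayList<>();
        for (String name : fileNames) {
            if (!looksLikeChecksumFile(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "2522487525519213817", ".2522487525519213817.crc", "data.crc");
        System.out.println(realTargets(listing)); // [2522487525519213817, data.crc]
    }
}
```

    Note that "data.crc" is kept: only names with both the leading dot and the .crc suffix are treated as sidecars in this sketch.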
  • Tsz Wo (Nicholas), SZE (JIRA) at Feb 12, 2008 at 6:57 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tsz Wo (Nicholas), SZE updated HADOOP-2725:
    -------------------------------------------

    Attachment: 2725_20080212.patch

    2725_20080212.patch: also fixed HADOOP-2807
  • Chris Douglas (JIRA) at Feb 12, 2008 at 7:13 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-2725:
    ----------------------------------

    Status: Patch Available (was: Reopened)
    Distcp truncates some files when copying
    ----------------------------------------

    Key: HADOOP-2725
    URL: https://issues.apache.org/jira/browse/HADOOP-2725
    Project: Hadoop Core
    Issue Type: Bug
    Components: dfs, util
    Affects Versions: 0.16.0
    Environment: Nightly build: http://hadoopqa.yst.corp.yahoo.com:8080/hudson/job/Hadoop-LinuxTest/770/
    With patches for HADOOP-2095 and HADOOP-2119.
    Reporter: Murtaza A. Basrai
    Assignee: Tsz Wo (Nicholas), SZE
    Priority: Critical
    Fix For: 0.16.1

    Attachments: 2725_20080206.patch, 2725_20080208.patch, 2725_20080212.patch


    We used distcp to copy ~100 TB of data across two clusters ~1400 nodes each.
    Command used (it was run on the src cluster):
    hadoop distcp -log /logdir/logfile hdfs://src-namenode:8600//src-dir-1 hdfs://src-namenode:8600//src-dir-2 ... hdfs://src-namenode:8600//src-dir-n hdfs://tgt-namenode:8600//dst-dir
    Distcp completed without errors, but when we checked the file sizes on the src and tgt clusters, we noticed differences in file sizes for 9 files (~6 GB).
    src-file-1 666762714 bytes -> tgt-file-1 134217728 bytes
    src-file-2 673791814 bytes -> tgt-file-2 536870912 bytes
    src-file-3 692172075 bytes -> tgt-file-3 0 bytes
    All target files are truncated at block boundaries (some have 0 size).
    I looked at the log files, and noticed a few things:
    1. There are 31059 log files (same as the number of Maps the job had).
    2. 246 of the log files are non-empty.
    3. All non-empty log files are of the form:
    SKIP: hdfs://src-namenode/src-dir-a/src-file-x
    SKIP: hdfs://src-namenode/src-dir-b/src-file-y
    SKIP: hdfs://src-namenode/src-dir-c/src-file-z
    4. All 9 files which were truncated were included in the log files as skipped files.
    5. All 9 files were the last entry in their respective log files.
    e.g.
    Non-empty logfile 1:
    SKIP: hdfs://src-namenode/src-dir-a/src-file-x
    SKIP: hdfs://src-namenode/src-dir-b/src-file-y
    SKIP: hdfs://src-namenode/src-dir-c/src-file-z <-- Truncated file
    Non-empty logfile 2:
    SKIP: hdfs://src-namenode/src-dir-p/src-file-m
    SKIP: hdfs://src-namenode/src-dir-q/src-file-n <-- Truncated file
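A minimal sketch of how the last skipped entry of each log file could be extracted for comparison against the truncated files. It assumes each log line has the form "SKIP: <path>" as shown above, and that the "<-- Truncated file" marker is the reporter's annotation rather than part of the log. The function names are hypothetical, and the log text is passed in as a string rather than read from HDFS.

```python
def skipped_paths(log_text):
    """Return the paths from every 'SKIP: <path>' line, in file order."""
    paths = []
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("SKIP:"):
            paths.append(line[len("SKIP:"):].strip())
    return paths


def last_skipped(log_text):
    """Return the final skipped path in a log, or None if the log has none."""
    paths = skipped_paths(log_text)
    return paths[-1] if paths else None


if __name__ == "__main__":
    # Sample contents taken from the two example log files in the report.
    logfile_1 = (
        "SKIP: hdfs://src-namenode/src-dir-a/src-file-x\n"
        "SKIP: hdfs://src-namenode/src-dir-b/src-file-y\n"
        "SKIP: hdfs://src-namenode/src-dir-c/src-file-z\n"
    )
    logfile_2 = (
        "SKIP: hdfs://src-namenode/src-dir-p/src-file-m\n"
        "SKIP: hdfs://src-namenode/src-dir-q/src-file-n\n"
    )
    for text in (logfile_1, logfile_2):
        print(last_skipped(text))
```

Per the report, the path returned for each non-empty log is the candidate to compare by size against its copy on the target cluster.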
  • Hadoop QA (JIRA) at Feb 12, 2008 at 9:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568313#action_12568313 ]

    Hadoop QA commented on HADOOP-2725:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12375405/2725_20080212.patch
    against trunk revision 619744.

    @author +1. The patch does not contain any @author tags.

    tests included +1. The patch appears to include 3 new or modified tests.

    javadoc +1. The javadoc tool did not generate any warning messages.

    javac +1. The applied patch does not generate any new javac compiler warnings.

    release audit +1. The applied patch does not generate any new release audit warnings.

    findbugs +1. The patch does not introduce any new Findbugs warnings.

    core tests +1. The patch passed core unit tests.

    contrib tests +1. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1785/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1785/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1785/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/1785/console

    This message is automatically generated.
  • Chris Douglas (JIRA) at Feb 12, 2008 at 10:01 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-2725:
    ----------------------------------

    Resolution: Fixed
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Nicholas!
  • Hudson (JIRA) at Feb 13, 2008 at 12:12 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568525#action_12568525 ]

    Hudson commented on HADOOP-2725:
    --------------------------------

    Integrated in Hadoop-trunk #399 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/399/])