Part files on the output filesystem are created irrespective of whether the corresponding task has anything to write there
--------------------------------------------------------------------------------------------------------------------------

Key: HADOOP-4927
URL: https://issues.apache.org/jira/browse/HADOOP-4927
Project: Hadoop Core
Issue Type: Bug
Reporter: Devaraj Das
Fix For: 0.20.0


When OutputFormat.getRecordWriter is invoked, a part file is created on the output filesystem. But the created RecordWriter is not used until the OutputCollector.collect call is made by the task (user's code). This results in empty part files even if the OutputCollector.collect is never invoked by the corresponding tasks.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Devaraj Das (JIRA) at Dec 22, 2008 at 6:29 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658455#action_12658455 ]

    Devaraj Das commented on HADOOP-4927:
    -------------------------------------

    Okay, so I figured that I was referring to the old MapReduce API *smile*
    There seem to be two approaches anyway. For the old API:
    Today, the getRecordWriter calls relevant to the tasks are made in two places: in DirectMapOutputCollector (in the constructor) and in ReduceTask.java (just before starting to call the user's reduce method). We can probably move the calls to the respective OutputCollector.collect implementations:
    {code}
    // Create the RecordWriter (and hence the part file) only on the first collect() call.
    if (out == null) {
      out = job.getOutputFormat().getRecordWriter(fs, job, finalName, reporter);
    }
    {code}

    For the new API, I am not yet sure what the good approach is. Maybe we could delay creating the RecordWriter until TaskInputOutputContext.write is invoked.

    The other approach is to delay the creation of the files on the output filesystem, until it is necessary, in the respective RecordWriter implementations. But this requires users (who have implemented RecordWriters or are implementing them in the future) to be aware of such a change and is thus vulnerable to problems.
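
A minimal sketch of this second approach, where the writer implementation itself defers file creation until the first record. The class name `LazyFileLineWriter` and its `write(String, String)` signature are hypothetical stand-ins, not the Hadoop RecordWriter API, and a local `java.nio` path is assumed in place of the output filesystem.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for a RecordWriter implementation; not the Hadoop API.
final class LazyFileLineWriter implements AutoCloseable {
    private final Path path;
    private BufferedWriter out; // stays null until the first record arrives

    LazyFileLineWriter(Path path) {
        this.path = path; // nothing is created on the filesystem here
    }

    void write(String key, String value) throws IOException {
        if (out == null) {
            // First record: only now does the part file come into existence.
            out = Files.newBufferedWriter(path, StandardCharsets.UTF_8);
        }
        out.write(key + "\t" + value);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        // A writer that never received a record has nothing to close,
        // so no empty file is left behind.
        if (out != null) {
            out.close();
        }
    }
}
```

Every writer implementation would need the same null check, which is the maintenance burden described above.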

    Thoughts?
  • Doug Cutting (JIRA) at Dec 22, 2008 at 5:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658602#action_12658602 ]

    Doug Cutting commented on HADOOP-4927:
    --------------------------------------

    I'm not convinced this is a bug. If you specify N output partitions then you should generate N output files, even if some of them are empty, no? One could write an OutputFormat that lazily creates its output files, but that's not the contract of FileOutputFormat.
  • Koji Noguchi (JIRA) at Dec 23, 2008 at 5:59 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658909#action_12658909 ]

    Koji Noguchi commented on HADOOP-4927:
    --------------------------------------

    On one of our clusters, I counted the number of empty "part-" files.

    Out of 30 million files/dirs, 4.5 million part- files were empty. 40 users having more than 10,000 empty files.

    bq. If you specify N output partitions then you should generate N output files,
    I believe some users did mention that the feature of having exactly N output files is useful.

    If we could somehow make the no-empty-part-files feature configurable, it'll ease up our support work a lot.
    (Instead of asking our users to implement a custom outputformat, I can just ask them to set the jobconf.)


  • Doug Cutting (JIRA) at Dec 23, 2008 at 6:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Doug Cutting updated HADOOP-4927:
    ---------------------------------

    Issue Type: New Feature (was: Bug)

    Changing this from a bug to a feature request. It seems reasonable for FileOutputFormat to support a mode where files are created lazily when the first record is written.
    bq. Out of 30 million files/dirs, 4.5 million part- files were empty. 40 users having more than 10,000 empty files.
    It sounds like there's also perhaps another problem here. Are these folks perhaps specifying way too many reduces? For jobs with lots of empty files, how many non-empty files are there, and how big are they?

  • Tsz Wo (Nicholas), SZE (JIRA) at Dec 23, 2008 at 6:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tsz Wo (Nicholas), SZE updated HADOOP-4927:
    -------------------------------------------

    Component/s: mapred
    Issue Type: Bug (was: New Feature)
    bq. I believe some users did mention that the feature of having exactly N output files is useful.
    I also believe it is useful in some cases, especially when all output files are empty (ah, you may argue that the entire job is not useful in this case :) ). However, it is costly to maintain empty files in HDFS and I believe it is USELESS in many cases. Could we have an option for not creating them or cleaning them?
  • Tsz Wo (Nicholas), SZE (JIRA) at Dec 23, 2008 at 6:35 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Tsz Wo (Nicholas), SZE updated HADOOP-4927:
    -------------------------------------------

    Issue Type: New Feature (was: Bug)

    I was setting "component/s" to mapred and did not intend to change "issue type". Changing this back to "New Feature".
  • Devaraj Das (JIRA) at Dec 24, 2008 at 2:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-4927:
    --------------------------------

    Fix Version/s: 0.21.0 (was: 0.20.0)

    Resetting the Fix Version to 0.21
  • Jothi Padmanabhan (JIRA) at Jan 6, 2009 at 6:14 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan reassigned HADOOP-4927:
    -----------------------------------------

    Assignee: Jothi Padmanabhan
  • Jothi Padmanabhan (JIRA) at Jan 22, 2009 at 9:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Jothi Padmanabhan (JIRA) at Jan 22, 2009 at 9:12 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927.patch

    This patch implements lazy file creation by wrapping the RecordWriter functionality in a wrapper class and instantiating it only on a call to output.collect.
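
The wrapper idea can be sketched roughly as follows. This is not the code from the attached patch: `SimpleRecordWriter` and `LazyWriterWrapper` are hypothetical names, and a `Supplier` stands in for whatever call actually creates the underlying writer (and, with it, the part file).

```java
import java.util.function.Supplier;

// Hypothetical minimal writer interface, standing in for Hadoop's RecordWriter.
interface SimpleRecordWriter<K, V> {
    void write(K key, V value);
    void close();
}

// Wrapper that defers creating the real writer until the first record is written.
final class LazyWriterWrapper<K, V> implements SimpleRecordWriter<K, V> {
    private final Supplier<SimpleRecordWriter<K, V>> factory;
    private SimpleRecordWriter<K, V> delegate; // null until the first write

    LazyWriterWrapper(Supplier<SimpleRecordWriter<K, V>> factory) {
        this.factory = factory;
    }

    @Override
    public void write(K key, V value) {
        if (delegate == null) {
            // The expensive step (for example, creating the output file) happens here.
            delegate = factory.get();
        }
        delegate.write(key, value);
    }

    @Override
    public void close() {
        // If nothing was ever written, the real writer was never created.
        if (delegate != null) {
            delegate.close();
        }
    }
}
```

Because the wrapper only needs a factory, existing writer implementations would not have to change.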
  • Doug Cutting (JIRA) at Jan 22, 2009 at 6:02 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666231#action_12666231 ]

    Doug Cutting commented on HADOOP-4927:
    --------------------------------------

    The class might better be called LazyRecordWriter. I'd prefer that we use no wrapper when lazy output creation is disabled, and we should add a static method to access the config property. So, putting these together, we might have a method that looks like:

    {code}
    static RecordWriter ReduceTask#createRecordWriter(JobConf job, ...) {
      if (ReduceTask.useLazyOutputCreation(job)) {
        return new LazyRecordWriter(...);  // defer output creation to the first write
      } else {
        return ...;                        // no wrapper when lazy creation is disabled
      }
    }
    {code}
  • Jothi Padmanabhan (JIRA) at Jan 23, 2009 at 5:20 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Open (was: Patch Available)

    Canceling patch to incorporate review comments
  • Jothi Padmanabhan (JIRA) at Jan 27, 2009 at 9:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927-v1.patch

    Patch incorporating review comments
  • Jothi Padmanabhan (JIRA) at Jan 27, 2009 at 9:35 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Jan 27, 2009 at 9:53 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667829#action_12667829 ]

    Hadoop QA commented on HADOOP-4927:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12398803/hadoop-4927-v1.patch
    against trunk revision 737944.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 core tests. The patch passed core unit tests.

    -1 contrib tests. The patch failed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3761/console

    This message is automatically generated.
  • Jothi Padmanabhan (JIRA) at Jan 28, 2009 at 4:09 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667927#action_12667927 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    The test failure, org.apache.hadoop.chukwa.datacollection.adaptor.filetailer.TestStartAtOffset.testStartAfterOffset, is not related to this patch.
  • Jothi Padmanabhan (JIRA) at Jan 29, 2009 at 2:59 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668491#action_12668491 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    Doug, could you please review this patch. Thank you.
  • Doug Cutting (JIRA) at Jan 29, 2009 at 8:46 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Doug Cutting updated HADOOP-4927:
    ---------------------------------

    Status: Open (was: Patch Available)

    A few nits:
    - we need a public setter method for lazy file creation, setLazyOutput(boolean). This should probably go on mapred.JobConf and mapreduce.Job.
    - this does not need to be in hadoop-default.xml. That file is for options that are configured site-wide in hadoop-site.xml, not per-job parameters.
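
A rough sketch of what such a per-job setter and its matching accessor could look like. `java.util.Properties` stands in for the job configuration object here, and the property key `example.output.lazy` is an assumption made up for illustration, not a real Hadoop setting.

```java
import java.util.Properties;

// Hypothetical helper; Properties stands in for mapred.JobConf / mapreduce.Job.
final class LazyOutputConfig {
    // Assumed key name, for illustration only.
    static final String LAZY_OUTPUT_KEY = "example.output.lazy";

    private LazyOutputConfig() {}

    // Public setter so that users never need to know the raw property name.
    static void setLazyOutput(Properties jobConf, boolean lazy) {
        jobConf.setProperty(LAZY_OUTPUT_KEY, Boolean.toString(lazy));
    }

    // Static accessor; defaults to false so existing jobs keep creating all part files.
    static boolean useLazyOutputCreation(Properties jobConf) {
        return Boolean.parseBoolean(jobConf.getProperty(LAZY_OUTPUT_KEY, "false"));
    }
}
```

Since the default lives in the accessor, a per-job option like this needs no entry in a site-wide defaults file.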


  • Jothi Padmanabhan (JIRA) at Jan 30, 2009 at 12:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927-v2.patch

    Patch incorporating review comments
  • Jothi Padmanabhan (JIRA) at Jan 30, 2009 at 12:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Chris Douglas (JIRA) at Jan 31, 2009 at 12:13 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Chris Douglas updated HADOOP-4927:
    ----------------------------------

    Status: Open (was: Patch Available)

    This feature doesn't require framework changes, does it? Configuring FileOutputFormat or writing an OutputFormat that lazily creates its files should be sufficient.
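
    A minimal sketch of what "lazily creates its files" could look like in user space. The `SimpleWriter` interface and both classes are hypothetical stand-ins, not the Hadoop `RecordWriter` API; the point is only that the real writer (and so its part file) is obtained on the first write rather than up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a record writer; not the Hadoop interface.
interface SimpleWriter {
    void write(String key, String value);
    void close();
}

// Defers creating the real writer until the first record is written.
final class LazySimpleWriter implements SimpleWriter {
    private final Supplier<SimpleWriter> factory;
    private SimpleWriter delegate; // stays null until the first write

    LazySimpleWriter(Supplier<SimpleWriter> factory) {
        this.factory = factory;
    }

    @Override
    public void write(String key, String value) {
        if (delegate == null) {
            delegate = factory.get(); // this is where a part file would be created
        }
        delegate.write(key, value);
    }

    @Override
    public void close() {
        if (delegate != null) {
            delegate.close(); // nothing to close if nothing was ever written
        }
    }
}

// In-memory writer used here in place of a file-backed one.
final class ListWriter implements SimpleWriter {
    final List<String> records = new ArrayList<>();
    boolean closed;

    @Override
    public void write(String key, String value) {
        records.add(key + "\t" + value);
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

    A task that never writes never calls the factory, so under these assumptions no empty output would be left behind.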
  • Jothi Padmanabhan (JIRA) at Feb 1, 2009 at 3:31 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669328#action_12669328 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    True, this feature can be achieved by modifying RecordWriter implementations. But that would mean people who have their own implementations would need to add it themselves; see Devaraj's [comment|https://issues.apache.org/jira/browse/HADOOP-4927?focusedCommentId=12658455#action_12658455]. Also, this feature is generic across the different output formats, so having the framework support it would be useful. No?
  • Devaraj Das (JIRA) at Feb 2, 2009 at 1:40 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669596#action_12669596 ]

    Devaraj Das commented on HADOOP-4927:
    -------------------------------------

    I am okay with the current patch in terms of the behavior and the framework support it adds for lazy creation of the destination output "something" (where "something" is easy to explain when interpreted as a file).
  • Chris Douglas (JIRA) at Feb 3, 2009 at 12:40 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669812#action_12669812 ]

    Chris Douglas commented on HADOOP-4927:
    ---------------------------------------

    bq. this feature is a sort of generic across the different output formats and having the framework support this would be useful.

    True. Still, while it is generic functionality, it's neither difficult nor inefficient in user-space. Absent either of the latter criteria, putting it into the framework seems premature, at least. If this should be abstracted, wouldn't it make sense as an OutputFormat in lib? That seems no less brittle than using a framework configuration variable and I've difficulty seeing this setting scoped to the whole cluster...

    bq. I am okay with the current patch in terms of the behavior and the framework support it adds for lazy creation of the destination output "something" (where "something" is easy to explain when interpreted as a file).

    Unless there's a non-FileOutputFormat use case that's also easy to explain, it remains unclear that this is the correct abstraction. Creating a setting on FileOutputFormat seems like a good idea. Whether that's implemented in FileOutputFormat only or via an OutputFormat wrapper class depends on how generally applicable the abstraction is. Since its motivation is expensive clutter in HDFS, it's not obvious to me that the latter is justified, let alone tight integration with the framework.
  • Doug Cutting (JIRA) at Feb 3, 2009 at 5:42 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670028#action_12670028 ]

    Doug Cutting commented on HADOOP-4927:
    --------------------------------------

    bq. Unless there's a non-FileOutputFormat use case [ ... ]

    I see Chris's point and agree. Unless there's a strong reason to put features in the kernel we should prefer to put them in library code, keeping the kernel minimal. Are there non-FileOutputFormats that need this feature?

    A wrapper implementation is a bit harder to use, since folks would need to both set the job's outputformat to the wrapper, and set the wrapper's parameter to the real output format: two changes instead of just setting a single parameter, although it is more generic. We could perhaps implement both: a flag for FileOutputFormat and a wrapper OutputFormat for folks who've not subclassed FileOutputFormat?

  • Jothi Padmanabhan (JIRA) at Feb 4, 2009 at 12:18 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670314#action_12670314 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    So, if I understand this correctly, we would have a new wrapper OutputFormat class:

    {code}

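    // Wrapper that delegates to the real (raw) OutputFormat.
    LazyOutputFormat {

      LazyOutputFormat(OutputFormat rawOF) {
        this.rawOutputFormat = rawOF;
      }

      getRecordWriter(...) {
        // The underlying writer would be created on first use.
        return new LazyRecordWriter(...);
      }
    }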

    {code}

    Users will then do

    {code}
    job.setOutputFormat(new LazyOutputFormat(actualOutputFormat));
    {code}


    The other option would be

    {code}
    FileOutputFormat.setLazy(true);

    JobConf::getOutputFormat() {
      OutputFormat o = getOutputFormat(); // the format configured for the job
      if (o instanceof FileOutputFormat && lazyFlag) {
        return new LazyOutputFormat(o);
      } else {
        return o;
      }
    }

    {code}
  • Jothi Padmanabhan (JIRA) at Feb 4, 2009 at 12:46 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670323#action_12670323 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    Sorry, we set class objects in setOutputFormat, so it would be

    LazyOutputFormat.set(actualoutputformat.class) and
    job.setOutputFormat(LazyOutputFormat.class)
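
    A rough sketch of why the wrapper needs its own setter when only class objects are configured: the wrapped format has to be recorded by name and instantiated later. Everything here is hypothetical (the `Format` interface, the `LazyFormatConfig` helper, the configuration key, and the use of a plain `Map` in place of a job configuration); it is not the API from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an output format.
interface Format {
    String name();
}

final class TextFormat implements Format {
    @Override
    public String name() {
        return "text";
    }
}

final class LazyFormatConfig {
    // Hypothetical configuration key; not a real Hadoop property.
    static final String BASE_CLASS_KEY = "example.lazy.base.class";

    private LazyFormatConfig() {}

    // Record the wrapped format by class name, since the configuration holds strings.
    static void setBaseClass(Map<String, String> conf, Class<? extends Format> cls) {
        conf.put(BASE_CLASS_KEY, cls.getName());
    }

    // Look up the recorded class and instantiate it reflectively.
    static Format newBaseFormat(Map<String, String> conf) {
        String className = conf.get(BASE_CLASS_KEY);
        if (className == null) {
            throw new IllegalStateException(BASE_CLASS_KEY + " is not set");
        }
        try {
            Class<? extends Format> cls = Class.forName(className).asSubclass(Format.class);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }
}
```

    With a helper like this, the first of the two lines above corresponds to `setBaseClass`, and the wrapper would call `newBaseFormat` when it needs the real format.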
  • Doug Cutting (JIRA) at Feb 4, 2009 at 6:06 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670392#action_12670392 ]

    Doug Cutting commented on HADOOP-4927:
    --------------------------------------

    bq. LazyOutputFormat.set(actualoutputformat.class) and
    bq. job.setOutputFormat(LazyOutputFormat.class)

    Right. That's the two-line penalty of a wrapper. If we built it into FileOutputFormat then it would only take one line:

    FileOutputFormat.setLazyOutput(true);

    but it would then also only work for subclasses of FileOutputFormat, rather than any OutputFormat implementation. This is a tough call, since most, but not all, OutputFormats do subclass FileOutputFormat. I'm leaning towards the wrapper, since, while a bit more complex for users, it is a cleaner layering, making FileOutputFormat less of a kitchen-sink of features.

  • Devaraj Das (JIRA) at Feb 5, 2009 at 6:54 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670642#action_12670642 ]

    Devaraj Das commented on HADOOP-4927:
    -------------------------------------

    It might make sense to add a new API, job.setOutputFormat(OutputFormat, boolean lazy). That would be shorthand for writing the two lines explicitly. Also, we need to handle the case of streaming/pipes (where the "lazyOutputCreation" setting could probably be taken as a command-line argument).
  • Jothi Padmanabhan (JIRA) at Feb 5, 2009 at 12:12 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927-v3.patch

    Attaching patch to elicit comments. This patch implements a wrapper for OutputFormat. It also handles the streaming case by supporting a flag "-lazyOutput". Pipes still needs to be handled.
  • Doug Cutting (JIRA) at Feb 5, 2009 at 6:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670855#action_12670855 ]

    Doug Cutting commented on HADOOP-4927:
    --------------------------------------

    - LazyOutputFormat should keep a field for the nested output format, not create it again for each call, no?
    - We might implement generic FilterOutputFormat and FilterRecordWriter that LazyOutputFormat and LazyRecordWriter extend. This is probably not the last time someone will need to wrap an OutputFormat or a RecordWriter.
    - JobConf#setOutputFormatClass(Class, boolean) should instead be static LazyOutputFormat#setClass(Job, Class, boolean). This localizes the change, and it's still a one-line change for applications.
    - Similarly, JobContext#getLazyOutputFormatClass() should instead be static LazyOutputFormat#getClass(JobContext). This feature can be entirely contained in LazyOutputFormat and should not require changes to the kernel.
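
    The first two points could be sketched roughly as follows. The interfaces and class names (`RecordSink`, `SinkFormat`, `FilterSinkFormat`, `LazySinkFormat`) are hypothetical simplifications, not the classes from the patch: a generic filter keeps the nested format in a field and delegates to it, and the lazy variant only changes when the nested sink is requested.

```java
import java.util.function.Supplier;

// Hypothetical minimal interfaces; the real Hadoop types take more parameters.
interface RecordSink {
    void write(String record);
}

interface SinkFormat {
    RecordSink getSink(String name);
}

// Generic pass-through wrapper: subclasses override only what they change.
class FilterSinkFormat implements SinkFormat {
    private final Supplier<SinkFormat> baseFactory;
    private SinkFormat base; // created once, then reused on every call

    FilterSinkFormat(Supplier<SinkFormat> baseFactory) {
        this.baseFactory = baseFactory;
    }

    protected SinkFormat base() {
        if (base == null) {
            base = baseFactory.get();
        }
        return base;
    }

    @Override
    public RecordSink getSink(String name) {
        return base().getSink(name);
    }
}

// Lazy variant: asks the nested format for a sink only when a record arrives.
final class LazySinkFormat extends FilterSinkFormat {
    LazySinkFormat(Supplier<SinkFormat> baseFactory) {
        super(baseFactory);
    }

    @Override
    public RecordSink getSink(String name) {
        return new RecordSink() {
            private RecordSink real; // null until the first write

            @Override
            public void write(String record) {
                if (real == null) {
                    real = base().getSink(name);
                }
                real.write(record);
            }
        };
    }
}
```

    Because the nested format lives in a field of the filter, requesting several sinks instantiates it only once, and other wrappers could extend the same base class.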
  • Jothi Padmanabhan (JIRA) at Feb 12, 2009 at 4:04 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Jothi Padmanabhan (JIRA) at Feb 12, 2009 at 4:04 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927-v4.patch

    * Patch that implements FilterOutputFormat and FilterRecordWriter that LazyOutputFormat and LazyRecordWriter extend.
    * Pipes and Streaming support the -lazyOutput flag
    * Added test cases for both mapred and mapreduce packages
  • Hadoop QA (JIRA) at Feb 13, 2009 at 9:46 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673194#action_12673194 ]

    Hadoop QA commented on HADOOP-4927:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12400110/hadoop-4927-v4.patch
    against trunk revision 744000.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 14 new or modified tests.

    -1 patch. The patch command could not apply the patch.

    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3844/console

    This message is automatically generated.
  • Jothi Padmanabhan (JIRA) at Feb 13, 2009 at 11:36 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Open (was: Patch Available)
  • Jothi Padmanabhan (JIRA) at Feb 13, 2009 at 12:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Jothi Padmanabhan (JIRA) at Feb 13, 2009 at 12:44 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927-v5.patch

    Synched the patch with trunk
  • Jothi Padmanabhan (JIRA) at Feb 14, 2009 at 7:42 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673477#action_12673477 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    Testpatch and ant test passed on my local box with this patch
  • Hadoop QA (JIRA) at Feb 15, 2009 at 1:32 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673559#action_12673559 ]

    Hadoop QA commented on HADOOP-4927:
    -----------------------------------

    -1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12400180/hadoop-4927-v5.patch
    against trunk revision 744406.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 14 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3859/console

    This message is automatically generated.
  • Jothi Padmanabhan (JIRA) at Feb 15, 2009 at 4:27 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673678#action_12673678 ]

    Jothi Padmanabhan commented on HADOOP-4927:
    -------------------------------------------

    The test that timed out, org.apache.hadoop.mapred.TestTaskLimits.testTaskLimits, passed on my local machine; it is unrelated to the patch.
  • Doug Cutting (JIRA) at Feb 18, 2009 at 10:20 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12674809#action_12674809 ]

    Doug Cutting commented on HADOOP-4927:
    --------------------------------------

    A minor nit: in FilterOutputFormat, the checks for a null baseOut and a null rawWriter might be better factored into private getBaseOut() and getRawWriter() methods, so that the body of most filter methods is just one line, e.g., 'return getBaseOut().getRecordWriter(...);'. Other than that, +1.
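
The refactoring suggested in this comment could look roughly like the sketch below. It is only an illustration of the pattern: SimpleWriter, SimpleOutputFormat and FilterFormatSketch are hypothetical stand-ins, not Hadoop's OutputFormat or RecordWriter types, and the exception thrown on a null base is an assumption rather than what the patch does.

```java
// Hypothetical stand-ins for illustration only; these are NOT Hadoop's interfaces.
interface SimpleWriter {
    void write(String key, String value);
}

interface SimpleOutputFormat {
    SimpleWriter getRecordWriter(String name);
}

// A filter that forwards every call to a wrapped ("base") output format.
class FilterFormatSketch implements SimpleOutputFormat {
    private final SimpleOutputFormat baseOut; // may be null if never configured

    FilterFormatSketch(SimpleOutputFormat baseOut) {
        this.baseOut = baseOut;
    }

    // The null check lives in exactly one place (assumed failure mode: an exception).
    private SimpleOutputFormat getBaseOut() {
        if (baseOut == null) {
            throw new IllegalStateException("base output format is not set");
        }
        return baseOut;
    }

    // Each filter method then shrinks to a single delegating line.
    @Override
    public SimpleWriter getRecordWriter(String name) {
        return getBaseOut().getRecordWriter(name);
    }
}
```

With the check centralized, adding another delegating method does not need another copy of the null test.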

  • Jothi Padmanabhan (JIRA) at Feb 19, 2009 at 8:23 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Attachment: hadoop-4927-v6.patch

    Incorporated the review comment: the null checks are now done in private methods.
  • Jothi Padmanabhan (JIRA) at Feb 19, 2009 at 8:25 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Open (was: Patch Available)
  • Jothi Padmanabhan (JIRA) at Feb 19, 2009 at 8:25 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Jothi Padmanabhan updated HADOOP-4927:
    --------------------------------------

    Status: Patch Available (was: Open)
  • Hadoop QA (JIRA) at Feb 19, 2009 at 11:40 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675178#action_12675178 ]

    Hadoop QA commented on HADOOP-4927:
    -----------------------------------

    +1 overall. Here are the results of testing the latest attachment
    http://issues.apache.org/jira/secure/attachment/12400482/hadoop-4927-v6.patch
    against trunk revision 745934.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 14 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

    Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/testReport/
    Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
    Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/artifact/trunk/build/test/checkstyle-errors.html
    Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3884/console

    This message is automatically generated.
  • Devaraj Das (JIRA) at Feb 23, 2009 at 8:23 am
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Devaraj Das updated HADOOP-4927:
    --------------------------------

    Resolution: Fixed
    Hadoop Flags: [Reviewed]
    Status: Resolved (was: Patch Available)

    I just committed this. Thanks, Jothi!
  • Hudson (JIRA) at Feb 23, 2009 at 3:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-4927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675926#action_12675926 ]

    Hudson commented on HADOOP-4927:
    --------------------------------

    Integrated in Hadoop-trunk #763 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/763/])
    Adds a generic wrapper around OutputFormat to allow creation of output on demand. Contributed by Jothi Padmanabhan.
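
A minimal sketch of what "creation of output on demand" means, assuming a hypothetical RecordSink interface and LazyRecordSink wrapper (neither is the class added by the patch): the real writer, and therefore the part file, is only created when the first record is written, so a task that writes nothing leaves no empty file behind.

```java
import java.util.function.Supplier;

// Hypothetical writer interface for illustration; not Hadoop's RecordWriter.
interface RecordSink {
    void write(String key, String value);
    void close();
}

// Wrapper that defers creating the real sink until the first write.
class LazyRecordSink implements RecordSink {
    private final Supplier<RecordSink> factory; // creates the real sink (and its file)
    private RecordSink delegate;                // stays null until something is written

    LazyRecordSink(Supplier<RecordSink> factory) {
        this.factory = factory;
    }

    @Override
    public void write(String key, String value) {
        if (delegate == null) {
            delegate = factory.get(); // first record: create the output now
        }
        delegate.write(key, value);
    }

    @Override
    public void close() {
        // Nothing was ever written, so there is nothing to close or clean up.
        if (delegate != null) {
            delegate.close();
        }
    }
}
```

Because the deferral sits in a wrapper, existing writer implementations do not have to change, which addresses the concern raised earlier in the thread about pushing the change into every RecordWriter.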


Discussion Overview
group: common-dev @
categories: hadoop
posted: Dec 22, '08 at 6:01a
active: Feb 23, '09 at 3:19p
posts: 48
users: 1
website: hadoop.apache.org...
irc: #hadoop
