[ https://issues.apache.org/jira/browse/HADOOP-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574192#action_12574192 ]

Joydeep Sen Sarma commented on HADOOP-2429:
-------------------------------------------

+1

It seems that we should be able to implement a default byte-ordered sort right away and offer it as an option (perhaps via a configuration setting). We could write an optimized C/JNI implementation, provided we can parse the SequenceFiles. I would try the Java one first; the C one sounds hairy with compression.
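To make the idea of a default byte-ordered sort concrete, here is a minimal sketch in plain Java. It does not use any Hadoop classes; the names `ByteOrderSort`, `BYTE_ORDER` and `sortKeys` are hypothetical, and the choice to compare bytes as unsigned values is an assumption rather than something stated in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: sort serialized keys by their raw bytes only.
final class ByteOrderSort {
    // Lexicographic comparison that treats each byte as unsigned (0..255).
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // When one key is a prefix of the other, the shorter key sorts first.
        return a.length - b.length;
    };

    // Returns a sorted copy; the input list is left untouched.
    static List<byte[]> sortKeys(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort(BYTE_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<byte[]> keys = Arrays.asList(
            new byte[] {(byte) 0x80}, new byte[] {0x01, 0x02}, new byte[] {0x01});
        for (byte[] key : sortKeys(keys)) {
            System.out.println(Arrays.toString(key));
        }
    }
}
```

No key is ever deserialized here, which is the point of the proposal. The cost is that the resulting order only matches the logical order of the keys when their serialized form happens to sort the same way.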
The lowest level map-reduce APIs should be byte oriented
--------------------------------------------------------

Key: HADOOP-2429
URL: https://issues.apache.org/jira/browse/HADOOP-2429
Project: Hadoop Core
Issue Type: Improvement
Components: mapred
Reporter: eric baldeschwieler

As discussed here:
https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
The templates, serializers and other complexities that allow map-reduce to use arbitrary types complicate the design and lead to many object creations and other overhead that a byte-oriented design would not suffer. I believe the lowest-level implementation of Hadoop map-reduce should have byte-string-oriented APIs (for keys and values). This API would be more performant, simpler and easier to use across languages.
The existing API could be maintained as a thin layer on top of the leaner API.
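One way to picture the layering described above is the following sketch. All interface and method names (`RawMapper`, `RawCollector`, `StringMapper`, `StringCollector`, `adapt`, `runOnce`) are hypothetical and are not Hadoop APIs; the use of UTF-8 strings as the typed layer is an assumption made only to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ByteApiSketch {
    // Hypothetical lowest-level API: keys and values are plain byte strings.
    interface RawCollector {
        void collect(byte[] key, byte[] value);
    }

    interface RawMapper {
        void map(byte[] key, byte[] value, RawCollector out);
    }

    // Hypothetical typed layer: user code sees Strings, the framework sees bytes.
    interface StringCollector {
        void collect(String key, String value);
    }

    interface StringMapper {
        void map(String key, String value, StringCollector out);
    }

    // Thin adapter: decode on the way in, encode on the way out.
    static RawMapper adapt(StringMapper typed) {
        return (key, value, out) -> typed.map(
            new String(key, StandardCharsets.UTF_8),
            new String(value, StandardCharsets.UTF_8),
            (k, v) -> out.collect(
                k.getBytes(StandardCharsets.UTF_8),
                v.getBytes(StandardCharsets.UTF_8)));
    }

    // Pushes one record through a typed mapper via the raw API
    // and returns the emitted pairs as "key=value" strings.
    static List<String> runOnce(StringMapper typed, String key, String value) {
        List<String> results = new ArrayList<>();
        adapt(typed).map(
            key.getBytes(StandardCharsets.UTF_8),
            value.getBytes(StandardCharsets.UTF_8),
            (k, v) -> results.add(
                new String(k, StandardCharsets.UTF_8) + "="
                    + new String(v, StandardCharsets.UTF_8)));
        return results;
    }
}
```

In this arrangement a non-Java client would only need to implement the raw interfaces, while Java users could keep writing typed mappers through an adapter.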
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Owen O'Malley (JIRA) at Mar 2, 2008 at 8:17 am
    [ https://issues.apache.org/jira/browse/HADOOP-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574196#action_12574196 ]

    Owen O'Malley commented on HADOOP-2429:
    ---------------------------------------

    -1

    The object-based interfaces are far easier for Java programmers to use than byte-oriented ones would be, therefore you aren't simplifying the design, but significantly complicating it by adding another layer.

    Other than trivial cases like IdentityMapper and IdentityReducer, I can't think of any Mapper or Reducer that doesn't need to construct the object anyway. You can see in Google's map/reduce paper that even their C++ word count example serializes and deserializes the values. Granted, they do it in the mapper and reducer, but I don't see that as more efficient. Looking more deeply at the word count example, because they do the serialization in the mapper and reducer, they do a *bad* job of it. Their published example converts the numbers to ASCII strings, while our example uses raw bytes and is therefore more efficient.
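The difference between the two encodings mentioned above can be sketched as follows. The class and method names are hypothetical, and the raw form assumed here is a fixed-width 4-byte big-endian int written with `java.nio.ByteBuffer`; neither word count example is reproduced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CountEncoding {
    // ASCII form: variable length, needs formatting and parsing.
    static byte[] asAscii(int count) {
        return Integer.toString(count).getBytes(StandardCharsets.US_ASCII);
    }

    static int fromAscii(byte[] bytes) {
        return Integer.parseInt(new String(bytes, StandardCharsets.US_ASCII));
    }

    // Raw form: always 4 bytes, big-endian, no text conversion.
    static byte[] asRaw(int count) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(count).array();
    }

    static int fromRaw(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        int count = 1234567;
        System.out.println("ascii bytes: " + asAscii(count).length);
        System.out.println("raw bytes:   " + asRaw(count).length);
    }
}
```

Note that for very small counts the ASCII form can be shorter than four bytes, so any saving from the raw form would come mainly from skipping string formatting and parsing, not necessarily from size.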

    It should also be noted that the object-based API can be treated as a byte-oriented one: as long as the InputFormat and OutputFormat read and write a chunk of bytes, the data will flow as bytes through the rest of the system. Such an InputFormat was added in HADOOP-2603, and it would be easy to write the corresponding OutputFormat if there is a need.

    Also note that when I wrote the C++ API, I did choose to use bytes rather than objects, partially because dealing with objects in C++ is a pain. But for Java the pain points are different...
  • Runping Qi (JIRA) at Mar 2, 2008 at 3:23 pm
    [ https://issues.apache.org/jira/browse/HADOOP-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574221#action_12574221 ]

    Runping Qi commented on HADOOP-2429:
    ------------------------------------


    Even in the current framework, if you use Buffer as the key/value class throughout a job, that should be equivalent to a byte-oriented interface, modulo the cost of serializing/deserializing Buffer objects, which should be minimal.

  • eric baldeschwieler (JIRA) at Mar 14, 2008 at 5:43 am
    [ https://issues.apache.org/jira/browse/HADOOP-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578622#action_12578622 ]

    eric baldeschwieler commented on HADOOP-2429:
    ---------------------------------------------

    Owen, Runping, I disagree completely with the assumptions from which you reach your conclusions.

    I think you are not looking at the total system costs. This design ends up spewing Java types everywhere. The fact of the matter is that our average user is not coding in Java. Even in Java we end up doing gymnastics to support standard types, and it makes multi-language interoperability difficult and non-standard. Types end up polluting all of our containers too, which makes them non-interoperable.

    And after paying all this cost, we still end up compromising the integrity of the type-based design: one cannot merge/sort efficiently if one must instantiate objects to do it, so we require byte-based comparators to be written. Yuck!
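To illustrate why a type-specific byte-based comparator ends up being written, here is a minimal sketch for keys serialized as 4-byte big-endian signed ints. It is plain Java, not a Hadoop class; the names `RawIntComparator`, `encode`, `readInt` and `compare` are hypothetical, and the big-endian layout is an assumption.

```java
import java.nio.ByteBuffer;

final class RawIntComparator {
    // Serializes an int as 4 big-endian bytes (assumed key layout).
    static byte[] encode(int value) {
        return ByteBuffer.allocate(Integer.BYTES).putInt(value).array();
    }

    // Reads a big-endian int straight out of the buffer, without creating a key object.
    static int readInt(byte[] buf, int off) {
        return ((buf[off] & 0xff) << 24)
            | ((buf[off + 1] & 0xff) << 16)
            | ((buf[off + 2] & 0xff) << 8)
            | (buf[off + 3] & 0xff);
    }

    // Compares two serialized keys in numeric order, in place.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    // Plain unsigned comparison of the first byte, for contrast.
    static int compareFirstByteUnsigned(byte[] b1, byte[] b2) {
        return (b1[0] & 0xff) - (b2[0] & 0xff);
    }
}
```

A generic unsigned byte-order comparison would place -1 (bytes FF FF FF FF) after 1 (bytes 00 00 00 01), so the comparator has to know the type's encoding even though it never instantiates the type. That is the duplication of knowledge this comment objects to.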

    I'm interested in suggestions for a compromise design that raises other languages to first-class citizens in Hadoop and keeps us from having to do gymnastics to use even native Java types.

Discussion Overview
group: common-dev
categories: hadoop
posted: Mar 2, '08 at 6:25a
active: Mar 14, '08 at 5:43a
posts: 4
users: 1
website: hadoop.apache.org...
irc: #hadoop
