Design a pluggable interface to place replicas of blocks in HDFS
----------------------------------------------------------------

Key: HADOOP-3799
URL: https://issues.apache.org/jira/browse/HADOOP-3799
Project: Hadoop Core
Issue Type: Improvement
Components: dfs
Reporter: dhruba borthakur


The current HDFS code typically places one replica on the local rack, the second replica on a random remote rack, and the third replica on a random node of that remote rack. This algorithm is baked into the NameNode's code. It would be nice to make the block placement algorithm a pluggable interface. This would allow experimentation with different placement algorithms based on workloads, availability guarantees, and failure models.
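
To make the proposal concrete, here is a minimal, self-contained sketch of what a pluggable policy could look like. None of the names below are HDFS classes; they are illustrative stand-ins only, since the real interface is what this issue proposes to design.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Illustrative stand-in, not an HDFS class.
    class Node {
      final String host, rack;
      Node(String host, String rack) { this.host = host; this.rack = rack; }
    }

    // The pluggable contract: given the cluster, pick targets for one new block.
    interface BlockPlacement {
      List<Node> chooseTargets(int numReplicas, Node writer, List<Node> cluster);
    }

    // A trivial alternative policy: ignore racks entirely, pick nodes at random.
    class RandomPlacement implements BlockPlacement {
      public List<Node> chooseTargets(int numReplicas, Node writer,
                                      List<Node> cluster) {
        List<Node> shuffled = new ArrayList<Node>(cluster);
        Collections.shuffle(shuffled);
        return shuffled.subList(0, Math.min(numReplicas, shuffled.size()));
      }
    }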

  • Lohit Vijayarenu (JIRA) at Jul 22, 2008 at 8:08 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615767#action_12615767 ]

    Lohit Vijayarenu commented on HADOOP-3799:
    ------------------------------------------

It would be good if this could also be used during re-balancing. Even better if there were an option to change the policy of a file.
  • dhruba borthakur (JIRA) at Nov 7, 2008 at 6:11 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645695#action_12645695 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------

Another alternative would be to allow an application to specify that a set of files has the same "block affinity". HDFS would then try to allocate blocks of these files on a small set of datanodes.
  • dhruba borthakur (JIRA) at May 2, 2009 at 9:40 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur reassigned HADOOP-3799:
    ----------------------------------------

    Assignee: dhruba borthakur
  • dhruba borthakur (JIRA) at May 2, 2009 at 9:43 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-3799:
    -------------------------------------

    Attachment: BlockPlacementPluggable.txt

This patch makes the HDFS block placement algorithm pluggable. The NameNode reads the configuration parameter dfs.block.replicator.classname to find the class that chooses block locations. The default policy remains the same as before and can be found in ReplicationTargetChooser.java.

A new block placement policy has to implement the interface specified in BlockPlacementInterface.java.

I would have liked the interface to also expose the name of the file to which the block belongs. But the file name cannot be materialized cheaply when the block manager is doing block replication.
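
    For illustration, a deployment might select a policy like this; dfs.block.replicator.classname is the parameter introduced by the patch, while org.example.MyPolicy is a hypothetical implementing class (the same property can be set in hdfs-site.xml):

        import org.apache.hadoop.conf.Configuration;

        public class PolicySelectionExample {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Point the NameNode at a custom policy implementation
            // (org.example.MyPolicy is hypothetical).
            conf.set("dfs.block.replicator.classname", "org.example.MyPolicy");
            System.out.println(conf.get("dfs.block.replicator.classname"));
          }
        }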
  • Hong Tang (JIRA) at May 2, 2009 at 11:20 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705258#action_12705258 ]

    Hong Tang commented on HADOOP-3799:
    -----------------------------------

    @dhruba

Any specific block replication policies you might be interested in experimenting with?

Additionally, Ben Reed pointed me to this paper from NSDI '09, where the authors describe a scheme that allows applications to express storage cues (some of which can be replication policies):

    Flexible, Wide-Area Storage for Distributed Systems with WheelFS
Jeremy Stribling, MIT CSAIL; Yair Sovran, New York University; Irene Zhang and Xavid Pretzer, MIT CSAIL; Jinyang Li, New York University; M. Frans Kaashoek and Robert Morris, MIT CSAIL
http://www.usenix.org/events/nsdi09/tech/full_papers/stribling/stribling.pdf

  • dhruba borthakur (JIRA) at May 2, 2009 at 5:33 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705313#action_12705313 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------

@Hong: Thanks for the paper. I will look at it shortly.
> Any specific block replication policies you might be interested in experimenting with?
I am interested in co-locating blocks from two specified datasets on the same set of datanodes.
  • Stefan Will (JIRA) at May 2, 2009 at 7:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705320#action_12705320 ]

    Stefan Will commented on HADOOP-3799:
    -------------------------------------

Here's something I'd love to have: a way to lock a complete replica of a set of files to one or more nodes. Together with HADOOP-4801, this would allow Lucene indexes to be served efficiently right out of HDFS, rather than having to create yet another copy on the local disk.
  • dhruba borthakur (JIRA) at May 3, 2009 at 12:55 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705361#action_12705361 ]

    dhruba borthakur edited comment on HADOOP-3799 at 5/2/09 5:55 PM:
    ------------------------------------------------------------------

@Stefan: I completely agree with you. This patch should enable researchers to experiment with various modes of HDFS block placement without changing core HDFS code. I plan on using this to co-locate blocks from HDFS datasets that are frequently scanned together on a small number of datanodes, so that such a "join" operation gets better node/rack locality.
  • Tom White (JIRA) at May 8, 2009 at 1:26 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707337#action_12707337 ]

    Tom White commented on HADOOP-3799:
    -----------------------------------

Dhruba, these look like good changes - glad to see this moving forward. More comments below:

    * Can BlockPlacementInterface be an abstract class? I would also change its name to not have the "Interface" suffix, something like ReplicationPolicy, or BlockPlacementPolicy. ReplicationTargetChooser could be renamed something like DoubleRackReplicationPolicy or DoubleRackBlockPlacementPolicy or similar, to better describe its role.
    * Why doesn't ReplicationPolicy simply pass through verifyBlockPlacement()? It seems odd that it's doing extra work here.
    * BlockPlacementInterface#chooseTarget(). Make excludedNodes a List<DatanodeDescriptor>. Implementations may choose to turn it into a map if they need to, but for the interface, it should just be a list, shouldn't it?
    * For future evolution, can we pass a Configuration to the initialize() method, rather than the considerLoad boolean?
* Rather than passing the full FSNamesystem to the initialize method, it would be preferable to create an interface for the part that the block placement strategy needs. Something like FSNamespaceStats, which only needs getTotalLoad() for the moment. I think this is an acceptable use of an interface, since it is only used by developers writing a new block placement strategy. There's a similar situation for job scheduling in MapReduce: JobTracker implements the package-private TaskTrackerManager interface so that TaskScheduler doesn't have to pull in the whole JobTracker. This helps a lot with testing. (A rough sketch of this shape follows this list.)
    * These changes should make it possible to unit test ReplicationTargetChooser directly. This could be another Jira.
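
    Taken together, these suggestions imply roughly the following shape. Only the names BlockPlacementPolicy, FSNamespaceStats and getTotalLoad() come from the comments above; the method signatures and the DatanodeDescriptor stub are assumptions:

        import java.util.List;
        import org.apache.hadoop.conf.Configuration;

        // Stub standing in for HDFS's real DatanodeDescriptor class.
        class DatanodeDescriptor { }

        // Exposes only what placement needs from FSNamesystem, so a policy
        // doesn't pull in the whole namesystem.
        interface FSNamespaceStats {
          int getTotalLoad();
        }

        // An abstract class rather than an interface, initialized from a
        // Configuration for future evolution; excludedNodes is a plain List.
        abstract class BlockPlacementPolicy {
          abstract void initialize(Configuration conf, FSNamespaceStats stats);

          abstract DatanodeDescriptor[] chooseTarget(int numReplicas,
              DatanodeDescriptor writer,
              List<DatanodeDescriptor> chosenNodes,
              List<DatanodeDescriptor> excludedNodes);
        }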
  • Hairong Kuang (JIRA) at May 8, 2009 at 5:19 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707426#action_12707426 ]

    Hairong Kuang commented on HADOOP-3799:
    ---------------------------------------

Dhruba, what are the implications of a pluggable block placement policy for the balancer and for the handling of under-replicated and over-replicated blocks? Currently they all assume there is only one replication policy. Is it possible for a block to be created with one replication policy and then, later in its life, have a different policy handle its over/under-replication and balancing? Or do we need to persist the initial placement policy on disk?
  • dhruba borthakur (JIRA) at May 10, 2009 at 11:14 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707779#action_12707779 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------

@Hairong: fsck uses verifyReplication() to check whether replicas are placed according to the configured policy, and this method is part of the proposed BlockPlacement policy interface. The balancer does not use the configured policy but ensures that the number of unique racks for the block is not reduced. Thus, both these components should work OK with an externally configured replication policy, shouldn't they?
  • Jingkei Ly (JIRA) at May 11, 2009 at 1:46 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708030#action_12708030 ]

    Jingkei Ly commented on HADOOP-3799:
    ------------------------------------

Hi Dhruba. I'm really glad you raised this JIRA, because I need something like this in order to co-locate blocks based on the filename of the file the blocks belong to - similar, I think, to what you mentioned you would like to do. However, I'm not sure how you would do this with the interface you are proposing, as BlockPlacementInterface#chooseTarget() isn't passed anything that identifies the block or filename being written - am I missing something? How would this be done?
  • dhruba borthakur (JIRA) at May 11, 2009 at 1:58 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708035#action_12708035 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------

I would like to figure out how to expose the filename in the API. It can be done easily when the API is invoked at the time a new block is being allocated to a file. However, when the replicator takes an under-replicated block and tries to find a new replica location for it, it is costly to find the filename.

The filename can be deduced by traversing the INode tree that is maintained by the namenode. But it is a costly operation, because one has to traverse the entire branch from the specified INode to the root. One option is to pass the INode into the block-placement algorithm. If the algorithm needs the complete path name of the file in question, then it has to do the costly operation of generating the full path name itself. However, this makes the interface somewhat less elegant. I am still debating how to do it right.
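
    To illustrate the cost: the full path can only be recovered by chasing parent links from the INode up to the root, roughly as below (INode here is a simplified stand-in for the namenode's class):

        // Simplified stand-in for the namenode's INode.
        class INode {
          String name;
          INode parent;   // null at the root
          INode(String name, INode parent) { this.name = name; this.parent = parent; }

          // An O(depth) pointer chase plus string building for every block
          // being re-replicated - the cost described above.
          String fullPath() {
            StringBuilder sb = new StringBuilder();
            for (INode n = this; n != null && n.parent != null; n = n.parent) {
              sb.insert(0, "/" + n.name);
            }
            return sb.length() == 0 ? "/" : sb.toString();
          }
        }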
  • Hairong Kuang (JIRA) at May 11, 2009 at 5:36 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708122#action_12708122 ]

    Hairong Kuang commented on HADOOP-3799:
    ---------------------------------------
> The balancer does not use the configured policy but ensures that the number of unique racks for the block is not reduced. Thus, both these components should work OK with an externally configured replication policy, shouldn't they?
Replication policies may have requirements other than the number of racks, for example co-location. I don't think the balancer handles this right now.
  • dhruba borthakur (JIRA) at May 12, 2009 at 2:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708447#action_12708447 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------
> Replication policies may have requirements other than the number of racks
Of course, this is true. But it depends on what type of replication policy one comes up with, and I would like to leave that for a future time. "Co-location" of blocks would typically be based on their pathnames, and the balancer could be extended to invoke the same BlockPlacement policy interface to adhere to the policy.

However, in this patch, I would like to expose the name of the file via the BlockPlacement policy interface. Any ideas here?
  • Jingkei Ly (JIRA) at May 12, 2009 at 3:18 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708460#action_12708460 ]

    Jingkei Ly commented on HADOOP-3799:
    ------------------------------------
> However, in this patch, I would like to expose the name of the file via the BlockPlacement policy interface. Any ideas here?
I think having two versions of BlockPlacementInterface#chooseTarget() would be the most efficient - one that accepts the filename (to be called from FSNamesystem#getAdditionalBlock()) and another that accepts the INode (to be called from FSNamesystem#computeReplicationWorkForBlock()). As you said, it does make the interface rather inelegant, though.

An alternative is to pass the Block object to chooseTarget() and let the plugin code look up the INode itself in the FSNamesystem map - not particularly efficient, but perhaps the plugin code could cache INode-to-filename mappings to mitigate it a bit.
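
    A minimal sketch of this two-overload idea, with all signatures guessed and DatanodeDescriptor/INode reduced to stubs:

        // Stubs standing in for the real HDFS classes.
        class DatanodeDescriptor { }
        class INode { }

        // Hypothetical pair of overloads, one per call site.
        interface BlockPlacementInterface {
          // Called from FSNamesystem#getAdditionalBlock(), where the
          // file name is already at hand.
          DatanodeDescriptor[] chooseTarget(String srcPath, int numReplicas,
                                            DatanodeDescriptor writer);

          // Called from FSNamesystem#computeReplicationWorkForBlock(),
          // where only the INode is cheap to obtain.
          DatanodeDescriptor[] chooseTarget(INode file, int numReplicas,
                                            DatanodeDescriptor writer);
        }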
  • Hairong Kuang (JIRA) at May 12, 2009 at 5:51 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708525#action_12708525 ]

    Hairong Kuang commented on HADOOP-3799:
    ---------------------------------------
> I would like to leave that for a future time
I do not think ignoring the balancer in this design is a good idea. Yahoo! runs the balancer almost 24/7 on all clusters. We do not want the balancer to break whatever block placement policy is configured on the cluster.

Dhruba, I think this jira makes a big change to the core of DFS. It is not trivial at all. If we want to make this change, we should have a design document first, explaining the semantics of a pluggable placement policy and how all block-placement-related components should change to support the feature.
  • Owen O'Malley (JIRA) at May 12, 2009 at 11:17 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708683#action_12708683 ]

    Owen O'Malley commented on HADOOP-3799:
    ---------------------------------------

I agree with Hairong. I think any answer that ignores the rebalancer is broken, especially since, in order to be stable under the rebalancer, HDFS probably needs to store placement-specific metadata with the files or blocks.

    -1.
  • dhruba borthakur (JIRA) at May 13, 2009 at 5:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708766#action_12708766 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------
> especially since, in order to be stable under the rebalancer, HDFS probably needs to store placement-specific metadata with the files or blocks
Oh guys, you are going too far! I am talking about a faster cycle of innovation and iteration. A pluggable interface allows the Hadoop community to experiment with newer methods of block placement. Only once such a placement algorithm proves beneficial and helpful do the related questions of "how to make the balancer work with the new placement policy" come to mind. If experiments prove that there isn't any viable alternative pluggable policy, then the question of "does the balancer work with a pluggable policy" is moot.
I do not like the metadata approach. It makes HDFS heavy, clunky and difficult to maintain. Have you seen what happened to other file systems that tried to do everything inside the file system, e.g. DCE-DFS? It is possible that HDFS might allow generic blobs to be stored with files (aka extended file attributes) where application-specific data can be kept. But that should be decoupled from a "requirement" that the archival policy has to be stored with the file metadata.

Again folks, I agree completely that a "finished product" needs to encompass the balancer. But to start experimenting and figure out whether a different placement policy is beneficial at all, I need the pluggability feature; otherwise I have to keep changing my Hadoop source code every time I want to experiment. My experiments will probably take three to six months, especially because I want to benchmark results at large scale.

For installations that go with the default policy, there is no impact at all.
  • Jingkei Ly (JIRA) at May 13, 2009 at 9:47 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708862#action_12708862 ]

    Jingkei Ly commented on HADOOP-3799:
    ------------------------------------
> HDFS probably needs to store placement-specific metadata with the files or blocks
Instead of storing replication-policy metadata with the blocks, could the responsibility for this stay with the replica-placement plugin code?

So, assuming the balancer is also updated to use the block placement interface: if a cluster has to support multiple replication policies, it could be the plugin code's responsibility to decide which policy to use based on the file owner/permissions/filename for the block. One advantage is that all the replication code for the cluster is kept in one place.
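
    As an illustration, a plugin could keep that dispatch logic entirely on its own side, along these lines (everything here is hypothetical):

        // Hypothetical policy contract used only for this sketch.
        interface PlacementPolicy {
          String[] chooseTargets(String srcPath, int numReplicas);
        }

        // Route each file to a sub-policy by its path, so the namenode,
        // balancer and fsck all consult one configured plugin.
        class DispatchingPolicy implements PlacementPolicy {
          private final PlacementPolicy indexPolicy;    // e.g. node-locked replicas
          private final PlacementPolicy defaultPolicy;  // rack-aware default

          DispatchingPolicy(PlacementPolicy indexPolicy,
                            PlacementPolicy defaultPolicy) {
            this.indexPolicy = indexPolicy;
            this.defaultPolicy = defaultPolicy;
          }

          public String[] chooseTargets(String srcPath, int numReplicas) {
            // Dispatch on the file name; owner/permission checks could go here too.
            if (srcPath.startsWith("/lucene/indexes/")) {
              return indexPolicy.chooseTargets(srcPath, numReplicas);
            }
            return defaultPolicy.chooseTargets(srcPath, numReplicas);
          }
        }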
  • dhruba borthakur (JIRA) at May 13, 2009 at 5:24 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709016#action_12709016 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------

After sleeping on it, I think it is necessary to ensure that the balancer does at least the bare minimum to work elegantly with an external block placement policy.
> if a cluster has to support multiple replication policies, it could be the plugin code's responsibility to decide which policy to use based on the file owner/permissions/filename for the block
That's my plan. One of my ideas is to change the block placement policy for a file directory based on access patterns. The plugin will analyze past access patterns (stored in an external db) to decide what type of placement is "currently" best for a dataset.
  • Konstantin Shvachko (JIRA) at May 13, 2009 at 5:40 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709021#action_12709021 ]

    Konstantin Shvachko commented on HADOOP-3799:
    ---------------------------------------------
> The plugin will analyze past access patterns (stored in an external db)
What external db? Dhruba, could you please elaborate on what you are trying to do, maybe in the form of a document?
Mixing different placement policies in the same (instance of a) file system is not a good idea, imo. The policies may contradict each other.
I would rather allow a file system to be formatted with a specific policy that is then kept constant for the lifespan of the system.
This leaves enough room for experimenting with different policies.
But I agree that the balancer and fsck should be policy-aware.
  • dhruba borthakur (JIRA) at May 13, 2009 at 5:54 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709027#action_12709027 ]

    dhruba borthakur commented on HADOOP-3799:
    ------------------------------------------
> What external db? Dhruba, could you please elaborate on what you
We have deployed the scripts from HADOOP-3708 to collect (online) job logs into a MySQL DB. This DB also contains the Hive query that the job runs. It is easy to see which datasets are used together in most queries. My idea is to dynamically co-locate blocks based on these access patterns. I will present those ideas in a separate JIRA (once this JIRA gets through).
> I would rather allow a file system to be formatted with a specific policy that is then kept constant for the lifespan of the system.
That would be a fine goal for your cluster. However, I would like the API to be more flexible than that. Does that sound reasonable?
  • dhruba borthakur (JIRA) at May 31, 2009 at 9:52 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    dhruba borthakur updated HADOOP-3799:
    -------------------------------------

    Attachment: BlockPlacementPluggable2.txt

    I incorporated all of Tom's comments. Thanks Tom.

This patch ensures that the following code paths conform to the configured block placement policy:
1. Allocation of a new block for a file
2. Deletion of excess replicas of a block
3. Creation of new replicas of a block
4. Movement of blocks triggered by the balancer
5. Verification of block placement by fsck


  • Jingkei Ly (JIRA) at Jun 1, 2009 at 9:34 am
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715036#action_12715036 ]

    Jingkei Ly commented on HADOOP-3799:
    ------------------------------------

    I have a use case for controlling the replication targets of a block based on filename at file creation time - is it possible to accommodate this in your proposed API?
  • Tom White (JIRA) at Jun 19, 2009 at 3:07 pm
    [ https://issues.apache.org/jira/browse/HADOOP-3799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721805#action_12721805 ]

    Tom White commented on HADOOP-3799:
    -----------------------------------

    Hi Dhruba,

    A couple more comments:
> BlockPlacementInterface#chooseTarget(). Make excludedNodes a List<DatanodeDescriptor>. Implementations may choose to turn it into a map if they need to, but for the interface, it should just be a list, shouldn't it?
I think you missed this change.

I'm not convinced that ReplicationPolicyChooser is needed. Couldn't we add a static method (e.g. getInstance()) to BlockPlacementPolicy that constructs a BlockPlacementPolicy from the dfs.block.replicator.classname property? We could also add an overloaded chooseTarget() method to BlockPlacementPolicy which doesn't take a chosenNodes argument (BTW this is misspelt as "choosenNodes" in BlockPlacementPolicy).
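
    A rough sketch of the getInstance() factory being suggested; dfs.block.replicator.classname is the property from the patch, while the surrounding class body and DefaultPolicy are stand-ins:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.util.ReflectionUtils;

        abstract class BlockPlacementPolicy {
          // Hypothetical default implementation to fall back to.
          static class DefaultPolicy extends BlockPlacementPolicy { }

          /** Construct the configured policy, falling back to the default. */
          static BlockPlacementPolicy getInstance(Configuration conf) {
            Class<?> clazz = conf.getClass("dfs.block.replicator.classname",
                                           DefaultPolicy.class);
            return (BlockPlacementPolicy) ReflectionUtils.newInstance(clazz, conf);
          }
        }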
