Hi,

Has anyone looked into the Microsoft Dryad project?

Their basic idea is to use a DAG (computational "vertices" connected by
communication "edges") to model distributed computing flows. They also have
something called DryadLINQ, which seems to be the Hive equivalent.
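
A minimal sketch of the vertex/edge idea, assuming nothing about Dryad's real API: `run_dag`, the vertex names, and the toy functions are all hypothetical. Each vertex is a plain function, each edge carries one vertex's output to the next, and a vertex runs once everything upstream of it has finished.

```python
from collections import defaultdict, deque

def run_dag(vertices, edges):
    """Run each vertex once all of its upstream vertices have finished.

    vertices: dict name -> function taking a list of upstream outputs
    edges:    list of (upstream, downstream) pairs (the "communication" edges)
    """
    downstream = defaultdict(list)
    pending = {name: 0 for name in vertices}
    for src, dst in edges:
        downstream[src].append(dst)
        pending[dst] += 1

    inbox = defaultdict(list)          # outputs delivered along incoming edges
    ready = deque(name for name, n in pending.items() if n == 0)
    results = {}
    while ready:
        name = ready.popleft()
        results[name] = vertices[name](inbox[name])
        for nxt in downstream[name]:
            inbox[nxt].append(results[name])
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(results) != len(vertices):
        raise ValueError("graph has a cycle")
    return results

# Hypothetical flow: two readers feed a merge vertex, which feeds a total.
vertices = {
    "read_a": lambda ins: [1, 2, 3],
    "read_b": lambda ins: [10, 20],
    "merge": lambda ins: sorted(x for part in ins for x in part),
    "total": lambda ins: sum(ins[0]),
}
edges = [("read_a", "merge"), ("read_b", "merge"), ("merge", "total")]
results = run_dag(vertices, edges)
```

A map/reduce job is one particular shape of such a graph; the general form allows any number of stages and fan-in patterns.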

Since the DAG model doesn't distinguish between the inter-job (workflow) and
intra-job (map/reduce, etc.) layers, their approach of doing query
translation, workflow/job scheduling, and execution in one box may offer better
optimization and fine-tuning opportunities than the Hadoop/Hive
combo.

Also, given that the majority of the hard work will be encapsulated and performed
by the translation/optimizing layer, the simplicity and
beauty of Map/Reduce becomes irrelevant, or even a hindrance, because it doesn't
permit operations as generic and flexible
as Dryad's.


Seems M$ got it right this time, at least on paper :-P ...thoughts?



Qing


  • Jeff Hammerbacher at Oct 15, 2009 at 5:45 pm
    Hey Qing,

    You can download Dryad and see for yourself:
    http://connect.microsoft.com/site/sitehome.aspx?SiteID=891. There's no
    accompanying distributed file system, unfortunately, and I've never seen a
    benchmark of Dryad scaling to more than 300 nodes, so it's not clear that
    it's the "right" model for all workloads. There's certainly room for a
    richer set of physical operators in the Hadoop project, but the nice thing
    about Hadoop and Hive is that it's a full suite of storage, data flow
    execution, and a higher-level syntax that works today at scale. If you'd
    like to try your hand at an implementation of the Dryad model of query
    execution over HDFS and underneath HiveQL, that would certainly be an
    interesting project.

    Regards,
    Jeff

  • Qing Yan at Oct 16, 2009 at 3:43 am
    Hi Jeff,

    Actually I care less about Dryad's implementation - few people will adopt it
    today due to its immature and/or proprietary nature. But strictly from the
    design and architecture perspective, reading through their literature
    makes one feel Dryad has certain advantages over Hadoop/Hive.

    E.g. Hive treats Hadoop as an execution black box. Say a Hadoop job
    involves a large dataset: if a partial data error causes the job to fail,
    there is no easy way for Hive to know what happened, and the whole job needs
    to be re-run later, whereas in Dryad you get more control and fine-tuning
    opportunities.
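
    A rough sketch of the kind of control meant here, not Dryad's or Hive's
    actual mechanism: the function name `run_with_vertex_retry` and the toy
    stages are hypothetical. When the scheduler sees each stage's outcome, it
    can retry just the stage that failed and reuse the output of the stages
    that already finished.

```python
def run_with_vertex_retry(stages, max_attempts=3):
    """Run stages in order, retrying only the stage that failed.

    stages: list of (name, fn); fn receives the previous stage's output.
    Returns (final_output, attempts), where attempts maps name -> tries used.
    """
    attempts = {}
    output = None
    for name, fn in stages:
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                # On failure `output` is left untouched, so the retry
                # starts from the previous stage's completed result.
                output = fn(output)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
    return output, attempts

calls = {"clean": 0}

def load(_):
    return [3, 1, 2]

def clean(rows):
    # Hypothetical transient failure on the first attempt only.
    calls["clean"] += 1
    if calls["clean"] == 1:
        raise RuntimeError("transient bad record")
    return sorted(rows)

def total(rows):
    return sum(rows)

final, attempts = run_with_vertex_retry(
    [("load", load), ("clean", clean), ("total", total)]
)
```

    Here only `clean` runs twice; `load` is not repeated.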

    As for implementing the Dryad model of query execution over HDFS and
    underneath HiveQL, the question is
    how much Hive depends on Map/Reduce. It is probably difficult to
    share the same translator/optimizer between Hadoop and Dryad without sacrificing
    Dryad's capabilities. We could make Dryad operate only in M/R mode, but why
    bother :-P



    Regards

    Qing

  • Zheng Shao at Oct 16, 2009 at 4:05 am
    Hi Qing,

    Talking about high-level design and architecture, I think the ideas proposed
    in Hive will help SQL -> DryadLINQ translation as well.

    Hive internally translates the SQL query into a DAG plan which should fit
    Dryad - but because of Hadoop's limitations, we have to cut the DAG plan into
    separate map-reduce jobs.
    Also, as a side note, this paper from SOSP 2009:
    http://www.sigops.org/sosp/sosp09/papers/yu-sosp09.pdf (also from Microsoft)
    has the same idea as the hash-based aggregation in Hive.
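
    For readers unfamiliar with the term, here is a generic illustration of
    hash-based aggregation, not Hive's or the paper's code; the function names
    and sample rows are made up. Each partition folds its rows into a hash
    table of partial sums before anything is shuffled, and the partials are
    merged afterwards.

```python
from collections import defaultdict

def partial_aggregate(rows):
    """Map side: fold (key, value) rows into one partial sum per key."""
    table = defaultdict(int)
    for key, value in rows:
        table[key] += value
    return dict(table)

def merge_partials(partials):
    """Reduce side: combine the per-partition partial sums."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

# Two hypothetical input partitions.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]
partials = [partial_aggregate(p) for p in partitions]
totals = merge_partials(partials)
```

    The first partition ships two entries instead of three rows; the saving
    grows with the number of repeated keys per partition.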

    Nothing is blocking people from implementing the architecture of Hive on top
    of Dryad, and it should be as effective (just remove the last step that
    chops the plan into separate map-reduce jobs). But I do agree we won't be
    able to (and it does not make sense to) share the same code.
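
    To make the "chopping" step concrete, here is a simplified sketch that is
    not Hive's actual planner: `cut_into_jobs` is a hypothetical name, and the
    plan is flattened to a linear operator list rather than a full DAG. The
    only rule it encodes is that one map-reduce job contains at most one
    shuffle.

```python
def cut_into_jobs(plan):
    """Split a linear operator plan into map-reduce jobs.

    plan: list of (operator_name, needs_shuffle) in execution order.
    Each job may contain at most one shuffle, so a new job starts
    whenever an operator would need a second one.
    """
    jobs, current, shuffled = [], [], False
    for name, needs_shuffle in plan:
        if needs_shuffle and shuffled:
            jobs.append(current)
            current, shuffled = [], False
        current.append(name)
        shuffled = shuffled or needs_shuffle
    if current:
        jobs.append(current)
    return jobs

# Hypothetical query plan with three shuffle-requiring operators.
plan = [
    ("scan", False),
    ("filter", False),
    ("join", True),
    ("group_by", True),
    ("order_by", True),
]
jobs = cut_into_jobs(plan)
```

    On a DAG engine this cutting step would be skipped and the whole plan
    submitted as one graph.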

    So, we can either take the architecture of Hive and implement it on Dryad,
    or take the architecture of Dryad and implement it on Hadoop (NOTE: Hadoop
    HDFS and map-reduce are broken apart now, which makes it easier than ever to
    do that) and put Hive on top of that, just as Jeff mentioned. I do prefer
    the latter because Hadoop is a much more widely accessible platform for both
    academia and industry.

    What do you think? Let us know if you want to start a project on this. It
    looks very interesting to me.

    Zheng

  • Qing Yan at Oct 17, 2009 at 7:38 am
    Hi Zheng,
    I second the idea of taking Dryad's architecture and applying it to
    Hadoop. It would get the best of both worlds. The top part of Hive and the
    bottom part
    of Hadoop can be reused, while the Hadoop M/R layer is refactored to support
    arbitrary operations, Hive's DAG is expanded to cover
    node-level execution, and finally Hive and Hadoop are integrated. I
    think this is the right direction! Does this match your vision, and how should we
    proceed?
    Should I just start a new project, or should more people in the community
    be involved in planning and discussing this?


    Regards,

    Qing

  • Zheng Shao at Oct 17, 2009 at 10:42 am
    Hi Qing,

    The usual way of open-source development is that you need to get at least a
    prototype running before others will spend time and join your efforts. It
    would be great to have a prototype so that others can try it out, a design
    spec so that others can understand how it works, and a roadmap so others
    know how to help and contribute.

    Are you aiming for a research prototype or a real system? What is your
    eventual goal?

    Zheng

Discussion Overview
group: user @
categories: hive, hadoop
posted: Oct 15, '09 at 7:31a
active: Oct 17, '09 at 10:42a
posts: 6
users: 3
website: hive.apache.org
