After sitting through some HDFS/HBase presentations yesterday, I
started thinking that the Hive model of doing its map/reduce over raw
files from HDFS is great, but a dedicated caching/region server could
be a big benefit in answering real-time queries.

I calculated that one data center (not counting non-cachable content)
could have about 378 MB of logs a day. Based on Facebook's information
here:
http://www.facebook.com/note.php?note_id=110207012002

"The log files are named with the date and time of collection.
Individual hourly files are around 55 MB when compressed, so eight
months of compressed data takes up about 300 GB of space."
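
As a rough sanity check on the quoted Facebook numbers (assuming ~30-day months; all figures approximate), the hourly file size and the eight-month total are consistent:

```python
# Rough sanity check of the quoted log volumes. Assumes ~30-day
# months; rounding between decimal and binary units is ignored.
hourly_mb = 55                       # compressed size of one hourly file
hours = 24 * 30 * 8                  # eight months of hourly files
total_gb = hourly_mb * hours / 1024
print(round(total_gb))               # ~309, matching "about 300 GB"
```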

During the day and week the logs are collected, one would expect the
data to be used very often, so having it in a cache would be ideal.

Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB
could be sliced off for a dedicated HiveRegion server, or it could
run as several dedicated servers, with maybe RAM and nothing else.

A Hive Region Server could contain HiveTables in a compressed
format, maybe Hive tables in a Derby format, indexes we are creating,
and some usage information so that different caching algorithms
could evict sections. We could use ZooKeeper to manage the HiveRegions,
as HBase does.

The Hive query optimizer would check whether the data was in the
HiveRegionServer, and otherwise run as normal.
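
A minimal sketch of that planner check, under the stated assumptions (the names `HiveRegionCache` and `run_query` are hypothetical, and Python stands in for illustration only):

```python
# Hypothetical sketch of the proposed check: serve a query from a
# HiveRegion cache when the partition is resident, otherwise fall
# back to a normal scan. All names here are illustrative.

class HiveRegionCache:
    def __init__(self):
        self._regions = {}                 # (table, partition) -> rows

    def put(self, table, partition, rows):
        self._regions[(table, partition)] = rows

    def lookup(self, table, partition):
        return self._regions.get((table, partition))

def run_query(cache, table, partition, scan_fn):
    """Use cached rows if present; otherwise run the normal scan."""
    rows = cache.lookup(table, partition)
    if rows is not None:
        return rows                        # answered from the region server
    rows = scan_fn(table, partition)       # stand-in for a MapReduce job
    cache.put(table, partition, rows)      # warm the cache for next time
    return rows
```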

Has anyone ever thought of this?
Edward


  • Zheng Shao at Oct 4, 2009 at 11:26 pm
    +1

    Making input data and query results available with a short delay is definitely
    a very attractive feature for Hive.
    There are multiple approaches to achieve this, mainly depending on how much
    we leverage HBase.

    The simplest way to go is probably to have a good Hive/HBase integration
    like HIVE-705, HIVE-806, etc.
    This can help us leverage the efforts done by HBase to the maximum degree.
    The potential drawback is that HBase tables support random writes,
    which may cause additional overhead for simple sequential writes.

    Eventually we may (or may not) need our own HiveRegionServer which hosts
    data in any format supported by Hive (on top of just the internal file
    format supported by HBase), but I feel it might be a good start to first
    try integrating the two.

    Zheng
  • Edward Capriolo at Oct 5, 2009 at 10:33 pm

    I agree that Hive/HBase integration is a good thing, but I think the
    differences between Hive and HBase are vast. Hive is row-oriented with
    column support, while HBase is column-oriented. HBase works on
    sparse files and needs random inserts, while Hive data is mostly Write
    Once Read Many. HBase works mostly in memory with a commit log,
    while Hive writes directly to HDFS during the map and reduce phases.

    The way I look at it, there is already a lot of waste. Imagine jobs that
    run simultaneously or right after each other on relatively small data sets:


    select name, count(*) from people group by name;
    select * from people;
    select * from people where name='sarah';

    With a HiveRegionServer, sections of data might already be in memory in
    a fast binary form, or on disk in an embedded DB like the one used by
    map-side joins. Disks would be used for intermediate results rather
    than reprocessing the same chunks of data repeatedly.

    Managing HiveRegionServers would be much less complex than managing
    HBase regions that have high random read/insert rates. In its simplest
    form it would just be an exact duplicate of the data; in a more complex
    form, an optimized binary form of the data.
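
    One way to sketch the "usage information so different caching algorithms
    could evict sections" idea is a tiny LRU cache over table sections; the
    repeated queries over the people table above would all hit the same cached
    section rather than rescanning HDFS. This is only an illustration, and the
    `SectionCache` name and keys are hypothetical:

```python
# Illustrative LRU eviction over cached table sections: usage is
# tracked so the least-recently-used section is evicted first.
from collections import OrderedDict

class SectionCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._sections = OrderedDict()       # section key -> cached rows

    def get(self, key):
        if key not in self._sections:
            return None                      # miss: caller rescans HDFS
        self._sections.move_to_end(key)      # mark as recently used
        return self._sections[key]

    def put(self, key, rows):
        self._sections[key] = rows
        self._sections.move_to_end(key)
        if len(self._sections) > self.capacity:
            self._sections.popitem(last=False)   # evict the LRU section
```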
  • Zheng Shao at Oct 6, 2009 at 1:32 am
    That's true.

    My thoughts are more constrained here because of resources :)

    What do you think is the major bottleneck for speed now? I have a vague
    feeling that it's the shuffling phase between map and reduce.

    Zheng
  • Edward Capriolo at Oct 6, 2009 at 2:24 am
    While I have not profiled this, I would think we can save on repeated
    deserialization and reuse.

    For example, suppose we have a weblog table partitioned by hour. After
    the information is moved into the DFS, we can assume that several
    processes will want to use it to generate summaries, so a given
    block may be used multiple times in the next hour. If we keep
    that block in memory in a binary form, we will not need to deserialize
    it again. We may also be able to read directly from memory rather than
    from disk, freeing up disk resources for shuffle-sorting.

    Also, suppose we have a process "select x,y,z from tablea into table
    tableb"; another process may operate on tableb to create tablec. If
    we wrote tableb to a cache as it was being created, it would be
    available for tablec.

    I have been pondering some possible implementations; this could
    probably be done at the Hadoop layer, as a CachedInputFormat or
    CachedFileSystem. I am just thinking RAM caches would have to help.
    Considering that blocks do not change, caching should not be difficult.
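
    Because HDFS blocks are write-once, a CachedInputFormat-style cache never
    needs invalidation: a block read once can be served from RAM forever. A
    minimal read-through sketch, with the `BlockCache` name and the
    (path, offset) key chosen purely for illustration:

```python
# Sketch of the CachedInputFormat idea: a read-through cache keyed by
# (path, block_offset). Since HDFS blocks are immutable once written,
# a cached block never needs invalidation. Names are hypothetical.

class BlockCache:
    def __init__(self, read_block):
        self._read_block = read_block    # fn(path, offset) -> bytes
        self._blocks = {}

    def read(self, path, offset):
        key = (path, offset)
        if key not in self._blocks:      # miss: read from HDFS once
            self._blocks[key] = self._read_block(path, offset)
        return self._blocks[key]         # hit: served from RAM
```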
  • Zheng Shao at Oct 6, 2009 at 8:04 am
    +1. Since blocks are read-only, the caching logic should be pretty simple.


    Zheng
  • Edward Capriolo at Oct 6, 2009 at 2:39 pm
    This might help as well.
    http://issues.apache.org/jira/browse/HADOOP-288

Discussion Overview

group: user @
categories: hive, hadoop
posted: Oct 4, '09 at 3:43p
active: Oct 6, '09 at 2:39p
posts: 7
users: 2
website: hive.apache.org

2 users in discussion

Edward Capriolo: 4 posts; Zheng Shao: 3 posts
