After looking at the HBase RegionServer and its functionality, I began
wondering if there is a more general use case for memory caching of
HDFS blocks/files. In many use cases people wish to store data on
Hadoop indefinitely; however, the last day's, last week's, or last
month's data is probably the most actively used. For some Hadoop
clusters the amount of raw new data could be less than the total RAM
in the cluster.

Also, some data will be used repeatedly: the same source data may be
used to generate multiple result sets, and those results may be used
as input to other processes.

I am thinking an answer could be to dedicate an amount of physical
memory on each DataNode, or on several dedicated nodes, to a
distributed memcached-like layer. Managing this cache should be
straightforward, since HDFS blocks are pretty much static. (So, say,
for a DataNode with 8 GB of memory, dedicate 1 GB to a
HadoopCacheServer.) If you had 1,000 nodes, that cache would be quite
large.

Additionally, we could create a new filesystem type, cachedhdfs,
implemented as a facade over HDFS, or possibly implement a
CachedInputFormat or CachedOutputFormat.
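
To make this concrete, here is a minimal sketch of the bookkeeping such a
cache layer might need. Since HDFS blocks are immutable there is no
invalidation problem, only eviction, so a size-bounded LRU map is enough to
demonstrate the idea (all names hypothetical, concurrency ignored):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Hypothetical sketch: an LRU cache of immutable HDFS block data,
    // bounded by a byte budget (e.g. 1 GB of an 8 GB DataNode).
    // Not thread-safe; evicts at most one block per insert, which is
    // fine when blocks are a uniform 64 MB.
    public class BlockLruCache extends LinkedHashMap<Long, byte[]> {
      private final long maxBytes;
      private long currentBytes = 0;

      public BlockLruCache(long maxBytes) {
        super(16, 0.75f, true);        // access-order iteration gives LRU
        this.maxBytes = maxBytes;
      }

      @Override
      public byte[] put(Long blockId, byte[] data) {
        currentBytes += data.length;
        byte[] prev = super.put(blockId, data);
        if (prev != null) {
          currentBytes -= prev.length; // replaced an existing entry
        }
        return prev;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        if (currentBytes > maxBytes) {
          currentBytes -= eldest.getValue().length;
          return true;                 // drop the least-recently-used block
        }
        return false;
      }
    }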

I know that the underlying filesystems have a cache, but I think
Hadoop writing intermediate data is going to evict some of the data
that "should be" semi-permanent.

So, has anyone looked into something like this? This was the closest
thing I found:

http://issues.apache.org/jira/browse/HADOOP-288

My goal here is to keep recent data in memory so that tools like Hive
can get a big boost on queries for new data.

Does anyone have any ideas?


  • Aaron Kimball at Oct 6, 2009 at 10:13 pm
    Edward,

    Interesting concept. I imagine that implementing "CachedInputFormat" over
    something like memcached would make for the most straightforward
    implementation. You could store 64MB chunks in memcached and try to retrieve
    them from there, falling back to the filesystem on failure. One obvious
    potential drawback of this is that a memcached cluster might store those
    blocks on different servers than the file chunks themselves, leading to an
    increased number of network transfers during the mapping phase. I don't know
    if it's possible to "pin" the objects in memcached to particular nodes;
    you'd want to do this for mapper locality reasons.
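
    A minimal sketch of that read-through pattern, assuming the spymemcached
    client and a hypothetical "path + chunk index" key scheme (note that
    stock memcached caps item size at 1 MB by default, so 64 MB values would
    need a raised limit or sub-chunking):

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import net.spy.memcached.MemcachedClient;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Illustrative only: try memcached first, fall back to HDFS on a
        // miss, then populate the cache for the next reader.
        public class CachedChunkReader {
          private static final int CHUNK = 64 * 1024 * 1024; // block-sized
          private final FileSystem fs;
          private final MemcachedClient cache;

          public CachedChunkReader(FileSystem fs, String host) throws IOException {
            this.fs = fs;
            this.cache = new MemcachedClient(new InetSocketAddress(host, 11211));
          }

          public byte[] readChunk(Path file, long chunkIndex) throws IOException {
            String key = file.toUri().getPath() + "#" + chunkIndex;
            byte[] data = (byte[]) cache.get(key);      // null on a miss
            if (data == null) {
              data = new byte[CHUNK];                   // NB: a file's final
              FSDataInputStream in = fs.open(file);     // chunk may be short
              try {
                in.readFully(chunkIndex * CHUNK, data); // positioned read
              } finally {
                in.close();
              }
              cache.set(key, 3600, data);               // arbitrary 1 h expiry
            }
            return data;
          }
        }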

    I would say, though, that 1 GB out of 8 GB on a DataNode is somewhat
    ambitious. It's been my observation that people tend to write memory-hungry
    mappers. If you've got 8 cores in a node, and 1 GB each has already gone to
    the OS, the DataNode, and the TaskTracker, that leaves only 5 GB for task
    processes. Running 6 or 8 map tasks concurrently can easily gobble that up.
    On a 16 GB DataNode with 8 cores, you might get that much wiggle room,
    though.

    - Aaron

    On Tue, Oct 6, 2009 at 8:16 AM, Edward Capriolo wrote:

    [quoted message snipped]
  • Edward Capriolo at Oct 7, 2009 at 12:25 am

    On Tue, Oct 6, 2009 at 6:12 PM, Aaron Kimball wrote:
    [quoted message snipped]

    Aaron,

    Yes, 1 GB out of 8 GB was just an arbitrary value I chose. Remember
    that 16K of RAM did get a man to the moon. :) I am thinking the value
    would be configurable, say dfs.cache.mb.
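
    For illustration, reading such a property would be trivial; dfs.cache.mb
    below is the proposed key, not an existing Hadoop parameter:

        import org.apache.hadoop.conf.Configuration;

        public class CacheBudget {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Proposed key, not a real property; 0 = caching disabled.
            long cacheMb = conf.getLong("dfs.cache.mb", 0);
            System.out.println("cache budget: " + (cacheMb << 20) + " bytes");
          }
        }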

    Also, there are the details of cache eviction, and possibly of
    including and excluding paths and files.

    Other than the InputFormat concept, we could plug the cache directly
    into the DFSClient. That way the cache would always end up on the
    node where the data is; otherwise the InputFormat would have to manage
    that, which would be a lot of work. I think if we prove the concept we
    can then follow up and optimize it.

    I am poking around the Hadoop internals to see what options we have.
    For my first implementation I will probably patch some code, run some
    tests, and profile performance.
  • Todd Lipcon at Oct 7, 2009 at 6:34 am
    I think this is the wrong angle to go about it - like you mentioned in your
    first post, the Linux file system cache *should* be taking care of this for
    us. That it is not is a fault of the current implementation and not an
    inherent problem.

    I think one solution is HDFS-347 - I'm putting the finishing touches on a
    design doc for that JIRA and should have it up in the next day or two.

    -Todd

    On Tue, Oct 6, 2009 at 5:25 PM, Edward Capriolo wrote:
    [quoted message snipped]
  • Edward Capriolo at Oct 7, 2009 at 2:46 pm

    On Wed, Oct 7, 2009 at 2:33 AM, Todd Lipcon wrote:
    [quoted message snipped]

    Todd,

    I do think it could be an inherent problem. With all the reading and
    writing of intermediate data Hadoop does, the filesystem cache would
    likely never contain the initial raw data you want to work with.
    The HBase RegionServer seems to be successful, so there must be some
    place for caching.

    Once I get something into HDFS, like the last hour's log data, about
    40 different processes are going to repeatedly re-read it from disk.
    I think if I can force that data into a cache I can get much faster
    processing.

    HDFS-347 sounds great though.

    Edward
  • Todd Lipcon at Oct 7, 2009 at 2:49 pm

    On Wed, Oct 7, 2009 at 7:45 AM, Edward Capriolo wrote:
    [quoted message snipped]

    In cases like this, we should expose access-type hints like
    posix_fadvise's POSIX_FADV_DONTNEED for the data we don't want to end
    up in the cache. There's already a JIRA out there for a JNI library
    for platform-specific optimizations, and I think this is one that
    will be worth doing.
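
    The Java side of such a library might look something like this sketch;
    the class, method, and library names are hypothetical, and the advice
    constants use their Linux values:

        import java.io.FileDescriptor;

        // Hypothetical JNI binding; the C side would resolve the raw fd
        // and call posix_fadvise(fd, offset, len, advice).
        public final class NativePosixAdvise {
          public static final int POSIX_FADV_WILLNEED = 3; // Linux value
          public static final int POSIX_FADV_DONTNEED = 4; // Linux value

          static {
            System.loadLibrary("posixadvise"); // hypothetical native lib
          }

          public static native int fadvise(FileDescriptor fd, long offset,
                                           long len, int advice);

          private NativePosixAdvise() {}
        }

    A DataNode could then hint DONTNEED after streaming intermediate data,
    so shuffle traffic stops evicting the "semi-permanent" blocks from the
    page cache.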

    -Todd
  • Edward Capriolo at Oct 7, 2009 at 3:21 pm

    On Wed, Oct 7, 2009 at 10:48 AM, Todd Lipcon wrote:
    On Wed, Oct 7, 2009 at 7:45 AM, Edward Capriolo wrote:


    Todd,

    I do think it could be an inherent problem. With all the reading and
    writing of intermediate data hadoop does, the file system cache would
    would likely never contain the initial raw data you want to work with.
    The HBase RegionServer seems to be successful, so there must be some
    place for caching.

    Once I get something in HDFS, like lasts hours log data, about 40
    different processes are going to repeatedly re/read it from disk. I
    think if i can force that data into a cache I can get much faster
    processing.

    In cases like this, we should expose access type hints like posix_fadvise
    POSIX_ADV_DONTNEED for the data we dont' want to end up in the cache.
    There's already a JIRA out there for a JNI library for platform specific
    optimization, and I think this is one that will be worth doing.

    -ToddEdward
    Those make sense.

    This started with the HBase RegionServer; now we are at VFS hints and JNI.

    I think the optimizations could be done in lots of places: close to
    the application, with an InputFormat and memcached, or, at the other
    end, we could go the Oracle route and write to raw disk
    partitions. :)

Discussion Overview
group: common-user
categories: hadoop
posted: Oct 6, '09 at 3:17p
active: Oct 7, '09 at 3:21p
posts: 7
users: 3
website: hadoop.apache.org...
irc: #hadoop
