Hello,

I am trying to write MapReduce jobs that read data from JSON files and load
it into HBase tables.
Please suggest an efficient way to do this. I am trying to do it using the
Spring Data HBase template, to make the writes thread safe and to enable
table locking.

I use the Map methods to read and parse the JSON files, and the Reduce
methods to call the HBase template and store the data in the HBase tables.

My questions:
1. Is this the right approach, or should I do all of the above in the Map
method?
2. How can I pass the Java object I create (holding the data read from the
JSON file) from the Map method to the Reduce method, where it needs to be
saved to the HBase table? It seems I can only pass the built-in data types
from my mapper to the reduce method.
3. I thought of using the distributed cache for the above problem: store the
object in the cache and pass only its key to the reduce method. But how do I
generate a unique key for each object I store in the distributed cache?

Please help me with the above, and please tell me if I am missing or
overlooking some important detail.

Thanking You,


--
Regards,
Ouch Whisper
010101010101
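
To make the flow described above concrete, here is a minimal sketch of one
way to wire this up, using the plain HBase MapReduce classes rather than the
Spring Data HBase template: the mapper parses one JSON document per line and
emits it under its row key, and the reducer turns each document's fields into
an HBase Put. The table name "events", the column family "d", and the "id"
field used as the row key are illustrative assumptions, as is the use of
Jackson for parsing.

import java.io.IOException;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToHBase {

    // Map: parse each line as a JSON document and emit it under its row key.
    public static class JsonMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final ObjectMapper om = new ObjectMapper();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            JsonNode doc = om.readTree(line.toString());
            // Assumption: each document carries an "id" field usable as the row key.
            ctx.write(new Text(doc.get("id").asText()), line);
        }
    }

    // Reduce: extract each field into its own cell and write one Put per document.
    public static class HBaseWriter extends TableReducer<Text, Text, ImmutableBytesWritable> {
        private final ObjectMapper om = new ObjectMapper();

        @Override
        protected void reduce(Text rowKey, Iterable<Text> docs, Context ctx)
                throws IOException, InterruptedException {
            for (Text json : docs) {
                Put put = new Put(Bytes.toBytes(rowKey.toString()));
                Iterator<Map.Entry<String, JsonNode>> fields =
                        om.readTree(json.toString()).fields();
                while (fields.hasNext()) {
                    Map.Entry<String, JsonNode> f = fields.next();
                    // addColumn is the HBase 1.x+ API; older releases use put.add(...).
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes(f.getKey()),
                            Bytes.toBytes(f.getValue().asText()));
                }
                ctx.write(null, put); // TableOutputFormat routes the Put to the table
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "json-to-hbase");
        job.setJarByClass(JsonToHBase.class);
        job.setMapperClass(JsonMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Wires in TableOutputFormat and the reducer against the "events" table.
        TableMapReduceUtil.initTableReducerJob("events", HBaseWriter.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}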


  • Mohammad Tariq at Feb 7, 2013 at 11:29 am
    Hello Panshul,

    My answers:
    1- You can serialize the entire JSON into a byte[] and store it in a
    single cell. (Is it important for you to extract the individual values
    from your JSON and put them into the table?)
    2- You can write your own datatype to pass your object to the reducer,
    but it must be Writable+Comparable. Alternatively, you can use Avro. (A
    sketch of such a datatype follows below.)
    3- For generating unique keys, you can use MR counters.
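
    As a sketch of answer 2: for a map-output value, implementing Writable is
    sufficient (see the correction in the next message). The class and field
    names below are illustrative, not from the thread.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    // Minimal custom value type: Hadoop serializes it with write() and
    // rebuilds it with readFields() on the reducer side.
    public class JsonRecordWritable implements Writable {
        private String rowKey = "";
        private String payload = ""; // e.g. the raw JSON document;
                                     // note writeUTF caps strings at 64 KB

        public JsonRecordWritable() {} // Hadoop requires a no-arg constructor

        public JsonRecordWritable(String rowKey, String payload) {
            this.rowKey = rowKey;
            this.payload = payload;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(rowKey);
            out.writeUTF(payload);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            rowKey = in.readUTF();
            payload = in.readUTF();
        }

        public String getRowKey() { return rowKey; }
        public String getPayload() { return payload; }
    }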

    Warm Regards,
    Tariq
    https://mtariq.jux.com/
    cloudfront.blogspot.com

  • Mohammad Tariq at Feb 7, 2013 at 11:35 am
    One correction: if your datatype is going to be used only for values, it
    doesn't actually need to be comparable. But if you need it to act as a
    key as well, then it must be both Writable and Comparable.
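
    Building on the (hypothetical) JsonRecordWritable sketched earlier, the
    key-capable variant would add WritableComparable, roughly:

    import org.apache.hadoop.io.WritableComparable;

    public class JsonRecordKey extends JsonRecordWritable
            implements WritableComparable<JsonRecordKey> {

        @Override
        public int compareTo(JsonRecordKey other) {
            // The sort order defined here drives the shuffle's key ordering.
            return getRowKey().compareTo(other.getRowKey());
        }

        @Override
        public int hashCode() {
            // HashPartitioner routes keys by hashCode(), so keep it
            // consistent with compareTo().
            return getRowKey().hashCode();
        }
    }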

    Warm Regards,
    Tariq
    https://mtariq.jux.com/
    cloudfront.blogspot.com

  • Panshul Whisper at Feb 7, 2013 at 11:36 am
    Hello,

    Thank you for the reply.
    1. I cannot serialize the JSON and store it as a whole. I need to extract
    the individual values and store them, as I later need to query the stored
    values in various aggregation algorithms.
    2. Can you please point me in the direction of where I can find out how
    to write a data type that is Writable+Comparable? I will look into Avro,
    but I prefer to write my own data type.
    3. I will look into MR counters. (One related pattern is sketched below.)
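
    On point 3, one coordination-free pattern (a sketch with illustrative
    names, not the only reading of the counter suggestion): combine the
    task's ID with a per-task sequence number, which yields keys that are
    unique across the whole job without any shared state.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class UniqueKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private int taskId;
        private long seq = 0;

        @Override
        protected void setup(Context ctx) {
            // Each map task has a distinct task ID within the job.
            taskId = ctx.getTaskAttemptID().getTaskID().getId();
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // taskId-seq is unique job-wide: no two tasks share a taskId,
            // and seq never repeats within a task.
            ctx.write(new Text(taskId + "-" + (seq++)), line);
        }
    }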

    Regards,

  • Mohammad Tariq at Feb 7, 2013 at 11:42 am
    You might find these links helpful:
    http://stackoverflow.com/questions/10961474/how-in-hadoop-is-the-data-put-into-map-and-reduce-functions-in-correct-types/10965026#10965026
    http://stackoverflow.com/questions/13877077/how-do-i-set-an-object-as-the-value-for-map-output-in-hadoop-mapreduce/13877688#13877688

    HTH

    Warm Regards,
    Tariq
    https://mtariq.jux.com/
    cloudfront.blogspot.com

  • Damien Hardy at Feb 7, 2013 at 11:57 am
    Hello,
    Why not use a Pig script for that? Make the JSON file available on HDFS,
    load it with
    http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/JsonLoader.html
    and store it with
    http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/backend/hadoop/hbase/HBaseStorage.html
  • Panshul Whisper at Feb 7, 2013 at 2:25 pm
    I am using the MapReduce approach. I was looking into Avro to create my
    own custom data types to pass from mapper to reducer. With Avro I need to
    maintain a schema for every type of JSON file I am receiving, and since
    there will be many different MapReduce jobs running, that means a
    different schema for every type.
    1. Since the JSON schema might change very frequently (almost three times
    every month), is it advisable to use Avro to create custom data types? Or
    can I use the distributed cache, store the Java object in the cache, and
    pass only the object's key to the reducer?
    2. Will there be any performance issues with using the distributed cache?
    The data will be very large, and very high-speed performance is required.

    Thanking You,
    Regards,
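
    Two notes on the questions above. First, the distributed cache is
    read-only and is populated before the job starts, so it cannot carry
    objects from mappers to reducers; whatever goes between map and reduce
    has to travel through the shuffle. Second, if the schema shifts every few
    weeks, one schema-free alternative to Avro or a fixed custom Writable is
    Hadoop's built-in MapWritable, where field names travel with each record.
    A sketch (the "id" field and class names are illustrative, and Jackson is
    assumed for parsing):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class FlexibleJsonMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
        private final ObjectMapper om = new ObjectMapper();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            JsonNode doc = om.readTree(line.toString());
            // Copy every field into a MapWritable: no compiled-in schema,
            // so a new or renamed JSON field needs no code change here.
            MapWritable fields = new MapWritable();
            Iterator<Map.Entry<String, JsonNode>> it = doc.fields();
            while (it.hasNext()) {
                Map.Entry<String, JsonNode> f = it.next();
                fields.put(new Text(f.getKey()), new Text(f.getValue().asText()));
            }
            ctx.write(new Text(doc.get("id").asText()), fields); // "id" assumed
        }
    }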

    On Thu, Feb 7, 2013 at 2:23 PM, Mohammad Tariq wrote:

    Size is not a problem; a frequently changing schema might be.



    On Thu, Feb 7, 2013 at 6:25 PM, Panshul Whisper <ouchwhisper@gmail.com> wrote:
    Hello,

    Thank you for the replies.

    I have not used Pig yet; I am looking into it. I wanted to implement both
    approaches.
    Are Pig scripts maintainable? The JSON structure that I will be receiving
    changes quite often, almost three times a month.
    I will be processing 24 million JSON files per month. I am getting one
    big file with almost 3 million JSON documents aggregated, one JSON
    document per line. I need to process this file and store all the values
    into HBase.
    Thanking You,

    On Thu, Feb 7, 2013 at 12:59 PM, Mohammad Tariq wrote:

    Good point, sir. If Pig fits into Panshul's requirements then it's a much
    better option.

    --
    Regards,
    Ouch Whisper
    010101010101
