Hi Guys,

For those who have followed Cloudera's example for streaming tweets
(https://github.com/cloudera/cdh-twitter-example), I have made a few
prototype enhancements to write to HBase instead of HDFS (inspired by Dan
Sandler's Flume HBase example, https://github.com/DataDanSandler/log_analysis)
and then to report off it with Impala instead of Hive.
It also gets around the Hive table partitioning time delay in their initial
example.

In HBase you need to create a table:
   sudo -u hdfs hbase shell
   create 'tweets', {NAME => 'tweet'}, {NAME => 'retweeted_status'}, {NAME => 'entities'}, {NAME => 'user'}
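
Once the Flume agent is running you can sanity-check that rows are arriving
straight from the HBase shell (standard shell commands, not part of my
prototype):
   describe 'tweets'
   scan 'tweets', {LIMIT => 5}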

I then created an Impala table to read the HBase table:
CREATE EXTERNAL TABLE HB_IMPALA_TWEETS (
   id int,
   id_str string,
   text string,
   created_at timestamp,
   geo_latitude double,
   geo_longitude double,
   user_screen_name string,
   user_location string,
   user_followers_count string,
   user_profile_image_url string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key,tweet:id_str,tweet:text,tweet:created_at,tweet:geo_latitude,tweet:geo_longitude,user:screen_name,user:location,user:followers_count,user:profile_image_url"
)
TBLPROPERTIES("hbase.table.name" = "tweets");
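
Note that STORED BY is Hive syntax, so the DDL above is run in the Hive
shell; Impala then picks the table up after a metadata refresh. An
illustrative report query from impala-shell (my example here, not part of
the prototype):

INVALIDATE METADATA;

SELECT user_screen_name, count(*) AS tweet_count
FROM hb_impala_tweets
GROUP BY user_screen_name
ORDER BY tweet_count DESC
LIMIT 10;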


My prototype code for the Flume sink is available at:
https://github.com/AronMacDonald/Twitter_Hbase_Impala/blob/master/README.md

NOTE: The code is in rough form at the moment, with temporary log files, no
error handling, etc., but I still welcome any constructive tips and tricks
for improving it. It also includes logic to send data to SAP HANA, which you
may need to remove if you wish to use it.
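
To give a feel for the shape of the code without reading the repo, here is a
minimal sketch of a Flume sink that writes tweets into the 'tweets' table.
It is illustrative only, not the actual prototype: the class name is made
up, only two columns are written, and the JSON parsing, batching, logging
and HANA pieces are all simplified away. It assumes the CDH4-era HBase
client API (HTable / Put.add) and org.json for parsing.

import java.io.IOException;

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.json.JSONObject;

public class TweetHBaseSink extends AbstractSink implements Configurable {

  private static final byte[] CF_TWEET = Bytes.toBytes("tweet");

  private HTable table;

  @Override
  public void configure(Context context) {
    // The table name could be read from the agent properties here;
    // it is hard-coded below to match the 'create' statement above.
  }

  @Override
  public synchronized void start() {
    try {
      table = new HTable(HBaseConfiguration.create(), "tweets");
    } catch (IOException e) {
      throw new RuntimeException("Could not open HBase table 'tweets'", e);
    }
    super.start();
  }

  @Override
  public synchronized void stop() {
    try {
      if (table != null) table.close();
    } catch (IOException e) {
      // ignore on shutdown
    }
    super.stop();
  }

  @Override
  public Status process() throws EventDeliveryException {
    Channel channel = getChannel();
    Transaction txn = channel.getTransaction();
    txn.begin();
    try {
      Event event = channel.take();
      if (event == null) {          // channel empty: back off and retry
        txn.commit();
        return Status.BACKOFF;
      }
      // The event body is the raw tweet JSON emitted by the Twitter source.
      JSONObject tweet = new JSONObject(new String(event.getBody(), "UTF-8"));
      String idStr = tweet.getString("id_str");

      Put put = new Put(Bytes.toBytes(idStr));   // row key = tweet id
      put.add(CF_TWEET, Bytes.toBytes("id_str"), Bytes.toBytes(idStr));
      put.add(CF_TWEET, Bytes.toBytes("text"),
              Bytes.toBytes(tweet.getString("text")));
      table.put(put);

      txn.commit();
      return Status.READY;
    } catch (Throwable t) {
      txn.rollback();
      throw new EventDeliveryException("Failed to write tweet to HBase", t);
    } finally {
      txn.close();
    }
  }
}

A real sink would batch the Puts and populate the remaining columns from the
hbase.columns.mapping above in the same way.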


Cheers
Aron


  • Todd Lipcon at Aug 8, 2013 at 8:03 am
    Hey Aron,

    This is a neat example. I'm curious, do you have any benchmark results on
    the ingest throughput you're able to get? How about the query speed for
    some typical "interesting" queries?

    -Todd



  • Aron MacDonald at Aug 8, 2013 at 2:21 pm
    Hi Todd,

    The 'Big Data' keywords I've used to filter the Twitter Firehose only
    push through fewer than ~100 tweets a minute, which doesn't appear to
    cause Flume or HBase much of a problem.
    I've not tried to ingest more yet.
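
    For reference, the keyword filtering lives in the Flume agent config,
    along the lines of the cdh-twitter-example's flume.conf (the keys and
    keyword list below are placeholders):

    TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
    TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
    TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
    TwitterAgent.sources.Twitter.accessToken = <your access token>
    TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
    TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, hbase, impala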

    So far I've collected fewer than 20,000 tweets, and response times in
    Impala are very fast, even on a tiny Hadoop cluster.

    I'd originally hoped to do some interesting geography-based queries, but
    so far the data flowing through lacks much useful geographic info. I'd
    have hoped that even if Twitter users don't supply their geo info (or
    location info), a tweet would contain some details related to the IP
    source of each tweet. Sadly I've not seen that yet.

    In terms of 'real-time' feeds into Hadoop using Flume, do you think it
    makes sense to store in HBase and report via Impala, or would storing in
    HDFS files be a better approach?
    I like using HBase so I don't have to manage files; however, Impala
    queries running off HBase appear slower than off other table types.

    Going off topic a bit: in terms of text analysis of a tweet, SAP HANA has
    a special indexing capability that can analyse free text for
    voice-of-the-customer analysis.
    http://scn.sap.com/community/developer-center/hana/blog/2013/06/19/real-time-sentiment-rating-of-movies-on-sap-hana-one

    I've not tried 'Cloudera Search' yet, but do you think it could achieve
    something similar?

    Cheers
    Aron


