I'm looking at the feasibility of using Impala to do analytics on a couple
of large fact tables, but we have a star schema with slowly changing
dimensions, so I'm wondering how I can update those other than reloading
them entirely. Or is this not one of the use cases Impala aims to address?

We also use Infobright, a purpose-built column-oriented analytics
database; although its typical use case is loading files, it also supports
INSERT and UPDATE DML (even though they are very slow), precisely to deal
with changing dimensions.

Thanks in advance.


  • Greg Rahn at Oct 4, 2013 at 2:23 pm
    [...] I'm wondering how I can update [slowly changing dimensions] other
    than reloading them entirely.

    Given that HDFS is an append-only filesystem, a bulk reload is really the
    only option.
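
    For concreteness, a bulk reload in Impala might look like the following
    sketch; the table and column names are hypothetical, and it assumes the
    rebuilt dimension has been staged in a separate table first:

        -- Hypothetical sketch: rebuild a dimension table in place.
        INSERT OVERWRITE dim_customer
        SELECT customer_id, name, segment
        FROM staging_dim_customer;

        -- Alternatively, if new files were copied directly into the
        -- table's HDFS directory, tell Impala to pick them up:
        REFRESH dim_customer;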

  • Mauricio Aristizabal at Oct 4, 2013 at 6:43 pm
    Yes, HBase probably would work, as I wouldn't have to do scans on the
    dimension tables and they're not very big (2K-200K records). But would I
    be able to have my dimensions in HBase and fact tables in, say, Parquet,
    and still do joins between them?

    Also, using the HBase APIs is something we could definitely do, but even
    better would be to continue using a SQL interface... do you see any problem
    using Phoenix on top of HBase, alongside Impala?

    Thanks Alan.

    On Friday, October 4, 2013 11:01:42 AM UTC-7, Alan wrote:

    Hi Mauricio,

    Another option to consider is HBase. HBase allows insert/update/delete,
    and it's good at pinpoint lookups. However, it's a lot slower when
    scanning large amounts of data. If you're not scanning a lot of data,
    maybe you can consider HBase?

    Thanks,
    Alan


  • Alex Behm at Oct 4, 2013 at 6:53 pm

    On Fri, Oct 4, 2013 at 11:43 AM, Mauricio Aristizabal wrote:

    [...] would I be able to have my dimensions in HBase and fact tables in,
    say, Parquet, and still do joins between them?

    Yes, absolutely! You can use any mix of supported table formats in a
    single query in Impala.
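
    As an illustration, a single query can join an HBase-backed dimension
    with a Parquet fact table; the table and column names below are
    hypothetical:

        -- dim_customer: HBase-backed table; fact_sales: Parquet table.
        -- Impala scans each table with the appropriate reader and joins
        -- the results like any other tables.
        SELECT d.segment, SUM(f.amount) AS total_amount
        FROM fact_sales f
        JOIN dim_customer d ON f.customer_id = d.id
        GROUP BY d.segment;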
  • Alan Choi at Oct 4, 2013 at 6:59 pm
    Here's the link to our doc:

    http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html

    And yes, you can use the SQL interface through Impala to insert into,
    query, and join HBase tables.
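
    For reference, Impala sees an HBase table through a Hive metastore
    mapping. A minimal sketch, with a hypothetical table name and column
    family, would be created in the Hive shell like this:

        -- Run in Hive: map an existing HBase table into the metastore.
        CREATE EXTERNAL TABLE dim_customer (
          id STRING,
          name STRING,
          segment STRING
        )
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES (
          'hbase.columns.mapping' = ':key,attrs:name,attrs:segment'
        )
        TBLPROPERTIES ('hbase.table.name' = 'dim_customer');

        -- Then, in impala-shell, make the new table visible:
        INVALIDATE METADATA;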

  • Alan Choi at Oct 4, 2013 at 7:14 pm
    Another point to keep in mind is concurrent queries. If you delete a file
    while a query is trying to access it, the query might fail. On the other
    hand, updating a row in HBase while another query is accessing the same
    row is perfectly fine.



  • Mauricio Aristizabal at Oct 4, 2013 at 7:46 pm
    But correct me if I'm wrong: I can't UPDATE a record through Impala's SQL
    interface. That's why I was asking about Phoenix, to do the updates while
    using Impala to do all the querying, usually with joins. Though, again,
    updating via the native HBase API would work for us too.

    https://github.com/forcedotcom/phoenix/wiki
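
    For what it's worth, Phoenix has no UPDATE statement either; its write
    path is UPSERT, which inserts or overwrites a row by primary key. A
    hypothetical statement against a dimension table defined through Phoenix
    might look like:

        -- Phoenix: insert-or-update a dimension row by primary key.
        UPSERT INTO DIM_CUSTOMER (ID, NAME, SEGMENT)
        VALUES ('C1001', 'Acme Corp', 'Enterprise');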

  • Alan Choi at Oct 4, 2013 at 7:52 pm
    You're right, we can't do updates.

  • Alex Behm at Oct 4, 2013 at 8:21 pm
    You can "update" a record in HBase by inserting its key with a new value
    (HBase will create a new version of that key). So you can actually do
    updates to HBase by doing a SQL INSERT in Impala; Impala will always read
    the newest version of each HBase record. The downside, I think, is that
    you always have to INSERT the complete record - you cannot simply change
    a single column.

    Alex
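
    Concretely, the pattern Alex describes might look like this, where
    dim_customer is assumed to be the HBase-backed table from earlier and
    the row key is hypothetical:

        -- Re-inserting an existing row key writes a new version in HBase;
        -- Impala reads only the latest version, so this INSERT behaves
        -- like a whole-row update.
        INSERT INTO dim_customer (id, name, segment)
        VALUES ('C1001', 'Acme Corp', 'Mid-Market');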


  • Mauricio Aristizabal at Oct 4, 2013 at 10:46 pm
    Excellent! Yeah I think that would work for us.

    Thanks everyone, this has been extremely helpful!

  • Alex Behm at Oct 4, 2013 at 10:52 pm
    Glad to hear it! Good luck.
  • Mauricio Aristizabal at Oct 4, 2013 at 6:51 pm
    Well, the updates would be randomly distributed, but I think your
    approach would still work (though it's only needed on the larger dims;
    reloading the small ones entirely would be a nonissue): we could 'shard'
    the dim by a mod of the natural key into, say, 100 files, and on each
    update easily calculate which of those are involved and replace only
    those. Even though each update touches random records, it's only a few
    each time.

    Thanks John.

    On Friday, October 4, 2013 11:32:14 AM UTC-7, John Russell wrote:

    Mauricio, what is the volume and rate of change for the dimension tables?

    Impala considers all the data files in a table's directory to make up
    the data for the table. So if you had some way to know which portions of
    the table's data were subject to change, and you split those rows out
    into one or more separate data files, you could replace just those files
    and then do a REFRESH on the table. (That approach wouldn't work if the
    changing data is distributed randomly or unpredictably throughout the
    table.)

    John
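
    One way to realize this sharding scheme is to make the shard a partition
    key, so each shard maps to its own directory of files and can be rebuilt
    independently. A sketch under the assumptions above, with hypothetical
    names, 100 shards, and a numeric natural key:

        -- Partition the dimension by a mod of the natural key.
        CREATE TABLE dim_customer_sharded (
          customer_num BIGINT,  -- natural key, assumed numeric
          name STRING,
          segment STRING
        )
        PARTITIONED BY (shard INT);

        -- Rebuild only the shards touched by an update, e.g. shard 42:
        INSERT OVERWRITE dim_customer_sharded PARTITION (shard = 42)
        SELECT customer_num, name, segment
        FROM staging_dim_customer
        WHERE customer_num % 100 = 42;

        -- Or, if the files were swapped at the HDFS level instead:
        REFRESH dim_customer_sharded;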

