I'm just wondering if anyone has any insights on how to speed this up.

Our fact tables consist of the primary keys of the involved dimension tables
plus numeric count columns.
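For illustration, such a fact table might be shaped roughly like this (a
sketch only; fact_z and all column names are invented):

    CREATE TABLE fact_z (
      dim_a_key NUMBER NOT NULL,  -- surrogate key from dimension A
      dim_b_key NUMBER NOT NULL,  -- ...and so on through dimension H
      cnt       NUMBER            -- numeric count column
    );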

To build our fact table, let's call it Z, we (as I understand the process)
grab the primary key of the row in dimension A plus information which
we can use to find the associated row in dimension B. We go to dimension B,
grab the primary key plus information which allows us to
find the associated row in dimension C, and so on through 8 dimension
tables.

This gives us one row in the fact table.
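In other words, each source row drives a chain of indexed lookups, roughly
like this (a hypothetical sketch; table, column, and bind names are invented):

    -- One lookup per dimension, per source row:
    SELECT dim_a_key, nat_key_b     -- A's surrogate key + pointer to B
      FROM dim_a
     WHERE nat_key_a = :src_val;

    SELECT dim_b_key, nat_key_c     -- B's surrogate key + pointer to C
      FROM dim_b
     WHERE nat_key_b = :nat_key_b;

    -- ...repeated through the remaining dimensions, then one INSERT
    -- into fact table Z with the eight surrogate keys and the counts.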

This worked OK for our ETL developers when we were dealing with 10,000 rows
in the development database.

Now we're working with a much larger source set and we're talking
100,000,000 rows. It doesn't work that well. Basically, each row in the
fact table requires a full index scan and a rowid fetch from each of the
dimension tables.

Does anybody have experience or even a theoretical insight into a better way
to do this?

Thanks


  • Ken Naim at Dec 21, 2007 at 8:58 pm
    Almost every column in your dimension and fact tables should have a
    single-column bitmap index on it, which will allow for a star
    transformation execution plan that combines all the bitmap indexes and
    finds the rows in the fact table very quickly.
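    A minimal sketch of that setup (index, table, and column names here are
    invented, and star transformation must also be enabled):

        CREATE BITMAP INDEX fact_z_dim_a_bix ON fact_z (dim_a_key);
        CREATE BITMAP INDEX fact_z_dim_b_bix ON fact_z (dim_b_key);
        -- ...one single-column bitmap index per dimension key column

        ALTER SESSION SET star_transformation_enabled = TRUE;

        -- A star query can then combine the bitmap indexes:
        SELECT /*+ STAR_TRANSFORMATION */ a.attr, SUM(f.cnt)
          FROM fact_z f, dim_a a, dim_b b
         WHERE f.dim_a_key = a.dim_a_key
           AND f.dim_b_key = b.dim_b_key
           AND a.attr = 'X'
           AND b.attr = 'Y'
         GROUP BY a.attr;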



    From: oracle-l-bounce_at_freelists.org
    On Behalf Of Thomas Day
    Sent: Friday, December 21, 2007 10:17 AM
    To: oracle-l_at_freelists.org
    Subject: Building fact tables in a data warehouse
  • Job Miller at Dec 22, 2007 at 2:24 am
    Chapter 6, p. 215 of The Data Warehouse ETL Toolkit talks about the common approach for the "Surrogate Key Pipeline" and presents a diagram of basically what you describe.

    The gist of it is to ensure that all records move through successive lookups in parallel, without going to disk, reading from cached lookup tables that contain the most recent dimension values. Ideally, you don't have to use the entire dimension, because much of the dimension contains dated data.

    However, if you have late-arriving data, that may invalidate your ability to do all lookups against smaller current lookup tables rather than against the full current dimensions.

    It goes on to say that you can do it all in one big star join query (with outer joins if you are concerned some dimension values may not match), but that that approach isn't the most efficient.
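    A hedged sketch of that one-pass star join load, with only two of the
    eight dimensions shown (staging and all other names are invented; the
    -1 "unknown" key is just one common convention):

        INSERT /*+ APPEND */ INTO fact_z (dim_a_key, dim_b_key, cnt)
        SELECT NVL(a.dim_a_key, -1),   -- -1 = "unknown" member on no match
               NVL(b.dim_b_key, -1),
               s.cnt
          FROM staging s
          LEFT JOIN dim_a a ON a.nat_key_a = s.nat_key_a
          LEFT JOIN dim_b b ON b.nat_key_b = s.nat_key_b;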

  • Jaromir nemec at Dec 23, 2007 at 8:49 pm
    Hi Thomas,
    This worked OK for our ETL developers when we were dealing with 10,000 rows
    in the development database.
    Now we're working with a much larger source set and we're talking
    100,000,000 rows. It doesn't work that well.
    Does anybody have experience or even a theoretical insight into a better way
    to do this?
    One important thing in an ETL process is to know where the bottleneck is. From a bird's-eye view it can be

    a) in the data source
    b) in the transformation
    c) in the data target

    Except for some cases, such as inefficient selects in the data source, indexes on the target, or contention with other processes, in the prevailing number of cases the problem lies in the transformation. This is due to the row-by-row processing logic of the majority of ETL tools.
    A typical answer is to use parallelism: in principle, to start more ETL sessions working on partitioned sources and targets. This could indeed help, and with your row count it will probably be required.
    But there is another possibility: to shift some of the transformation logic into the data source.
    To build our fact table, let's call it Z (as I understand the process) we
    grab the primary key of the row in dimension A plus information which
    we can use to find the associated row in dimension B. We go to dimension B,
    grab the primary key plus information which allows us to
    find the associated row in dimension C, and so on through 8 dimension
    tables.
    In this case I'd try to join the fact records with the dimensions in the database (e.g. using a parallel, partition-wise hash join), possibly pre-denormalizing the dimension hierarchy into a single table before the ETL process starts (to avoid the cascading joins).
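    As a rough sketch of that idea, with three dimensions instead of eight
    (all names invented, degree of parallelism arbitrary):

        -- Pre-denormalize the dimension hierarchy once, before the load:
        CREATE TABLE dim_denorm NOLOGGING PARALLEL 8 AS
        SELECT a.dim_a_key, b.dim_b_key, c.dim_c_key, a.nat_key_a
          FROM dim_a a
          JOIN dim_b b ON b.nat_key_b = a.nat_key_b
          JOIN dim_c c ON c.nat_key_c = b.nat_key_c;

        -- Then one parallel hash join instead of cascading lookups:
        ALTER SESSION ENABLE PARALLEL DML;
        INSERT /*+ APPEND PARALLEL(f, 8) */ INTO fact_z f
               (dim_a_key, dim_b_key, dim_c_key, cnt)
        SELECT /*+ PARALLEL(s, 8) USE_HASH(d) */
               d.dim_a_key, d.dim_b_key, d.dim_c_key, s.cnt
          FROM staging s, dim_denorm d
         WHERE d.nat_key_a = s.nat_key_a;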

    One starting point when an ETL process is in trouble is to rewrite the process (possibly simplified) as a CTAS statement. This gives you a quick orientation as to where the problem could be. If the CTAS doesn't work as expected, it is probably an a) or c) problem and should be solved in the database. Otherwise you get an estimate of the expected target throughput of the ETL transformation.
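    For example (simplified, invented names):

        -- If this runs fast, the database side is fine and the ETL tool's
        -- row-by-row logic is the bottleneck; if it is slow, look at the
        -- source query or the target instead.
        CREATE TABLE fact_z_test NOLOGGING PARALLEL AS
        SELECT a.dim_a_key, b.dim_b_key, s.cnt
          FROM staging s, dim_a a, dim_b b
         WHERE a.nat_key_a = s.nat_key_a
           AND b.nat_key_b = s.nat_key_b;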

    Regards,

    Jaromir D.B. Nemec
  • Yechiel Adar at Dec 26, 2007 at 5:04 pm
    This looks like a long process.
    Have you considered using a primary key that contains the primary key
    fields plus the fields that you need to go to the next table?
    This way you will eliminate going to the table itself.
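    In other words, a covering index: if the index carries both the key and
    the columns needed for the next lookup, Oracle can answer from the index
    alone (an INDEX RANGE SCAN with no TABLE ACCESS BY INDEX ROWID). A
    sketch with invented names:

        CREATE INDEX dim_b_lookup_ix
            ON dim_b (nat_key_b, dim_b_key, nat_key_c);

        SELECT dim_b_key, nat_key_c
          FROM dim_b
         WHERE nat_key_b = :val;   -- satisfied entirely from the index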

    Adar Yechiel
    Rechovot, Israel

