We have a huge table (> 160 million rows) which has about 20 million duplicate rows that we need to delete. What is the most efficient way to do this, as we will need to do this daily?
A single varchar2(30) column is used to identify duplicates. We could possibly have > 2 rows of duplicates.


We are doing direct path load so no unique key indexes can be put on the table to take care of the duplicates.


Platform: Oracle 10G RAC (2 node) on Solaris 10.


Thanks!





  • Bobak, Mark at Dec 12, 2006 at 10:53 pm
    Lots of solutions. Try a search on AskTom (http://asktom.oracle.com/)
    or MetaLink specifying "delete duplicate rows".

    -Mark

    --

    Mark J. Bobak

    Senior Oracle Architect

    ProQuest Information & Learning

    There is nothing so useless as doing efficiently that which shouldn't be
    done at all. -Peter F. Drucker, 1909-2005



    From: oracle-l-bounce_at_freelists.org
    On Behalf Of A Ebadi
    Sent: Tuesday, December 12, 2006 5:31 PM
    To: oracle-l_at_freelists.org
    Subject: de-dup process

    We have a huge table (> 160 million rows) which has about 20 million
    duplicate rows that we need to delete. What is the most efficient way
    to do this as we will need to do this daily?
    A single varchar2(30) column is used to identify duplicates. We could
    possibly have > 2 rows of duplicates.


    We are doing direct path load so no unique key indexes can be put on the
    table to take care of the duplicates.


    Platform: Oracle 10G RAC (2 node) on Solaris 10.


    Thanks!


    --
    http://www.freelists.org/webpage/oracle-l
  • Tboss_at_bossconsulting.com at Dec 13, 2006 at 1:42 am
    From asktom, the best way I've found is to use Tom's little code snippet below:

    delete from your_huge_table
    where rowid in
      (select rid
       from (select rowid rid,
                    row_number() over
                      (partition by varchar_that_defines_duplicates
                       order by rowid) rn
             from your_huge_table)
       where rn <> 1)
    /

    It handles multiple duplicates per key, and works far faster than any NOT EXISTS, MINUS,
    or cursor-based solution.

    A few other options exist that may be faster, if you can use them:
    1. create table as select distinct; probably faster than doing any sort of deleting.

    2. alter table mytab enable constraint pk exceptions into exceptions;
    A better way; much faster for large tables, and it lets you audit the
    duplicate rows by examining the exceptions table (you must run
    $ORACLE_HOME/rdbms/admin/utlexcpt.sql before doing this).
    Con: the exceptions table will contain ALL copies of each duplicate row in
    the source table, including the one you want to keep ... you'll have to
    delete the extras yourself (see the sketch after this list).

    3. Use unix. Perhaps the purest, fastest way is to use the unix sort/uniq commands:
    a. unload the data (select out delimited)
    b. sort filename | uniq > newfile
    c. sqlload it back in.

    This is only a viable option if your table is "thin" and only has a few columns.
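    A minimal sketch of option 2, assuming hypothetical names (big_tab for the
    table, dup_key for the varchar2(30) column):

    -- create the EXCEPTIONS table once
    @?/rdbms/admin/utlexcpt.sql

    -- add the unique constraint disabled, then try to enable it; the ENABLE
    -- fails with ORA-02299 and records the rowid of every row in every
    -- duplicate group in the EXCEPTIONS table
    alter table big_tab add constraint big_tab_uk unique (dup_key) disable;
    alter table big_tab enable constraint big_tab_uk exceptions into exceptions;

    -- audit the offenders; all copies of each duplicate are listed, so you
    -- still choose which ones to delete
    select e.row_id, t.dup_key
    from exceptions e, big_tab t
    where t.rowid = e.row_id
    order by t.dup_key;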

    hope this helps, todd
    We have a huge table (> 160 million rows) which has about 20 million duplicate rows that we need to delete. What is the most efficient way to do this as we will need to do this daily?
    A single varchar2(30) column is used to identify duplicates. We could possibly have > 2 rows of duplicates.

    We are doing direct path load so no unique key indexes can be put on the table to take care of the duplicates.

    Platform: Oracle 10G RAC (2 node) on Solaris 10.
    --
    http://www.freelists.org/webpage/oracle-l
  • Mladen Gogala at Dec 13, 2006 at 2:05 pm

    On 12/12/2006 08:42:38 PM, tboss_at_bossconsulting.com wrote:
    From asktom, the best way I've found is to use Tom's little code snippet below:
    delete from your_huge_table
    where rowid in
    (select rid
    from
    (select rowid rid,
    row_number() over
    (partition by varchar_that_defines_duplicates
    order by rowid ) rn
    from your_huge_table
    )
    where rn <> 1
    )
    /
    Good luck with that if the cardinality is 60M+ rows. I would also add a condition like
    WHERE ROWID IN (SELECT row_id FROM exceptions) to your query. You only need to populate
    the exceptions table with the duplicates, and if those duplicates are a small percentage
    of the total number of records, your task will be done two orders of magnitude faster
    than without the exceptions table.
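    As a rough sketch of that combination (hypothetical names big_tab and dup_key,
    with the EXCEPTIONS table already populated as described above), the analytic
    scan can be confined to the rows already known to be duplicates:

    delete from big_tab
    where rowid in
      (select rid
       from (select rowid rid,
                    row_number() over (partition by dup_key order by rowid) rn
             from big_tab
             where rowid in (select row_id from exceptions))
       where rn <> 1);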
  • A Ebadi at Dec 13, 2006 at 10:38 pm
    The biggest problem we've faced in coming up with a solution is that none of the solutions so far scale. In other words, things are fine if we have a 20 million row table with 2-3 million duplicates - it runs in 10-15 minutes. However, trying it on a 100+ million row table, it runs for hours!


    We've even had another tool (Informatica) select out the ROWIDs of the duplicates into a separate table, and then we use a PL/SQL cursor to delete those rows from the large table, but this doesn't scale either!


    I'm currently looking at some of the suggestions I got from this list and seeing if any of them will work for us.


    Thanks for all who replied and I'll let you know how it goes!


    Mladen Gogala wrote:

    On 12/12/2006 08:42:38 PM, tboss_at_bossconsulting.com wrote:
    From asktom, the best way I've found is to use Tom's little code snippet below:
    delete from your_huge_table
    where rowid in
    (select rid
    from
    (select rowid rid,
    row_number() over
    (partition by varchar_that_defines_duplicates
    order by rowid ) rn
    from your_huge_table
    )
    where rn <> 1
    )
    /
    Good luck with that if cardinality is +60M rows. I would also add a condition like
    WHERE ROWID in (SELECT ROW_ID from EXCEPTIONS) to your query. You only need to populate
    exceptions table with duplicates and if those duplicates are a small percentage of the
    total number of records, your task will be done two order of magnitude faster then without
    the exceptions table.

    --
    Mladen Gogala
    http://www.mladen-gogala.com

    --
    http://www.freelists.org/webpage/oracle-l

    --
    http://www.freelists.org/webpage/oracle-l
  • Tony van Lingen at Dec 14, 2006 at 2:02 am

    A Ebadi wrote:

    Biggest problem we've faced in coming up with a solution is none of
    the solutions so far scale. In other words, things are fine if we
    have a 20 million row table with 2-3 million duplicates - runs in
    10-15 minutes. However, trying it for 100+ million row table - it
    runs for hrs!
    You do of course delete non-redoable? When deleting a row, Oracle will
    create redo info which, having done a direct load, you will not need.
    This'll take time.
    We've even had another tool (Informatica) select out the ROWIDs of the
    duplicates into a separate table then we are using PL/SQL cursor to
    delete those rows from the large table, but this doesn't scale either!
    If you mean that deleting 20 million rows from a huge table is not as
    fast as deleting 2, then no, nothing will scale. Try buying more iron
    and using parallel query.

    Why don't you look at cleansing the dataset before loading it? e.g. use
    'sort -u' on the file to get rid of duplicate lines. Might be quicker
    than loading everything and deleting later on...

    Cheers,

    --
    Tony van Lingen
    Tech One Contractor
    Information Management
    Corporate Development Division
    Environmental Protection Agency

    Ph: (07) 3234 1972
    Fax: (07) 3227 6534
    Mobile: 0413 701 284
    E-mail: tony.vanlingen_at_epa.qld.gov.au

    Visit us online at www.epa.qld.gov.au

    --
    http://www.freelists.org/webpage/oracle-l
  • A Ebadi at Dec 14, 2006 at 5:47 pm
    Cannot clean the data before loading, as the data comes from many different sources that don't know about each other.


    Thanks to everyone who replied; we're still testing to find the best method!

    Tony van Lingen wrote:


    A Ebadi wrote:
    Biggest problem we've faced in coming up with a solution is none of
    the solutions so far scale. In other words, things are fine if we
    have a 20 million row table with 2-3 million duplicates - runs in
    10-15 minutes. However, trying it for 100+ million row table - it
    runs for hrs!
    You do of course delete non-redoable? When deleting a row, Oracle will
    create redo info which you, having done a direct load, will not be
    needed. This'll take time.
    We've even had another tool (Informatica) select out the ROWIDs of the
    duplicates into a separate table then we are using PL/SQL cursor to
    delete those rows from the large table, but this doesn't scale either!
    if you mean that deleting 20million rows from a huge tabel is not as
    fast as deleting 2, then no. Nothing will scale. Try buying more iron
    and use parallel query.

    Why don't you look at cleansing the dataset before loading it? e.g. use
    'sort -u' on the file to get rid of duplicate lines. Might be quicker
    than loading everything and deleting later on...

    Cheers,

    --
    Tony van Lingen
    Tech One Contractor
    Information Management
    Corporate Development Division
    Environmental Protection Agency

    Ph: (07) 3234 1972
    Fax: (07) 3227 6534
    Mobile: 0413 701 284
    E-mail: tony.vanlingen_at_epa.qld.gov.au

    Visit us online at www.epa.qld.gov.au
    --


    --
    http://www.freelists.org/webpage/oracle-l
  • Ken Naim at Dec 14, 2006 at 6:35 pm
    For the initial load, use an external table and do an insert as select
    distinct. After the initial load, use an external table for the daily file
    and use the MERGE statement to insert only the rows you want; or you can use
    UNION, insert into a third table, then rename the tables and truncate the
    original, swapping back and forth nightly between the two. The select from
    the external table can be optimized to exclude values known to be
    duplicates, e.g. anything over 6 months old or whatever criteria make sense
    for you.


    Ken
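    A hedged sketch of the MERGE variant, assuming an external table ext_daily
    over the daily file and hypothetical columns dup_key and payload; the GROUP BY
    collapses duplicates within the file before the comparison against the base table:

    merge into base_table b
    using (select dup_key, min(payload) payload
           from ext_daily
           group by dup_key) s
    on (b.dup_key = s.dup_key)
    when not matched then
      insert (b.dup_key, b.payload)
      values (s.dup_key, s.payload);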


    From: oracle-l-bounce_at_freelists.org
    On Behalf Of A Ebadi
    Sent: Thursday, December 14, 2006 12:47 PM
    To: Tony van Lingen; oracle-l_at_freelists.org
    Cc: ebadi01_at_yahoo.com
    Subject: Re: de-dup process


    Cannot clean data before loading as data is from many different sources that
    don't know about each other.


    Thanks for everyone that replied and still doing testing to find the best
    method!

    Tony van Lingen wrote:

    A Ebadi wrote:
    Biggest problem we've faced in coming up with a solution is none of
    the solutions so far scale. In other words, things are fine if we
    have a 20 million row table with 2-3 million duplicates - runs in
    10-15 minutes. However, trying it for 100+ million row table - it
    runs for hrs!
    You do of course delete non-redoable? When deleting a row, Oracle will
    create redo info which you, having done a direct load, will not be
    needed. This'll take time.
    We've even had another tool (Informatica) select out the ROWIDs of the
    duplicates into a separate table then we are using PL/SQL cursor to
    delete those rows from the large table, but this doesn't scale either!
    if you mean that deleting 20million rows from a huge tabel is not as
    fast as deleting 2, then no. Nothing will scale. Try buying more iron
    and use parallel query.

    Why don't you look at cleansing the dataset before loading it? e.g. use
    'sort -u' on the file to get rid of duplicate lines. Might be quicker
    than loading everything and deleting later on...

    Cheers,

    --
    Tony van Lingen
    Tech One Contractor
    Information Management
    Corporate Development Division
    Environmental Protection Agency

    Ph: (07) 3234 1972
    Fax: (07) 3227 6534
    Mobile: 0413 701 284
    E-mail: tony.vanlingen_at_epa.qld.gov.au

    Visit us online at www.epa.qld.gov.au
    --


    --
    http://www.freelists.org/webpage/oracle-l
  • Tim Gorman at Dec 14, 2006 at 1:19 pm

    When an UPDATE or DELETE won't scale (and it never will!), convert it
    into a parallel nologging direct-path INSERT.
    Use a query similar to Tom's to insert only the first "copy" of a row
    (i.e. use "RN = 1" instead of "RN <> 1" in the outermost WHERE
    clause) to INSERT the rows to be retained into a new table. Make the
    new table NOLOGGING, use APPEND and PARALLEL hints on the INSERT
    portion and PARALLEL hints on the SELECT portion of the INSERT ...
    SELECT statement.
    Here's the cute part. If you can't simply RENAME the two tables (i.e.
    old and new) to swap them, then make the newly-created table look
    exactly like the original table, but make it range-partitioned on a
    dummy numeric column (call it DUMMY or PARTKEY), give this column a
    constant value of zero, and name the single partition PZERO (or PDUMMY
    or whatever). Also, add this DUMMY or PARTKEY column to the original
    table as well. Then, use "ALTER TABLE <new-table> EXCHANGE PARTITION
    PZERO WITH TABLE <original-table>" to swap the new data with the old
    data.
    A direct-path parallel INSERT...SELECT will scale far better than any
    possible optimization on an UPDATE or DELETE statement. The comparison
    is not even close.
    Hope this helps...
    -Tim
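    A sketch of that sequence, with hypothetical names (big_tab, dup_key,
    payload) and an illustrative degree of parallelism; column names and types
    must of course match the real table:

    -- 1. new table shaped like the original plus a dummy partition key,
    --    range-partitioned with a single partition PZERO
    create table big_tab_dedup (
      dup_key  varchar2(30),
      payload  varchar2(100),
      partkey  number
    )
    nologging
    partition by range (partkey)
      (partition pzero values less than (maxvalue));

    -- 2. direct-path, parallel insert of only the first copy of each key (RN = 1)
    alter session enable parallel dml;

    insert /*+ append parallel(big_tab_dedup 8) */
    into big_tab_dedup (dup_key, payload, partkey)
    select dup_key, payload, 0
    from (select /*+ parallel(t 8) */ t.dup_key, t.payload,
                 row_number() over (partition by t.dup_key order by t.rowid) rn
          from big_tab t)
    where rn = 1;
    commit;

    -- 3. add the dummy column to the original table, then swap segments so
    --    the original table name now holds the de-duplicated data
    alter table big_tab add (partkey number);
    alter table big_tab_dedup exchange partition pzero with table big_tab
      without validation;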

    A Ebadi wrote:

    Biggest problem we've faced in coming up with a solution is none
    of the solutions so far scale.  In other words, things are fine if we
    have a 20 million row table with 2-3 million duplicates - runs in 10-15
    minutes.  However, trying it for 100+ million row table - it runs for
    hrs!

    We've even had another tool (Informatica) select out the ROWIDs
    of the duplicates into a separate table then we are using PL/SQL cursor
    to delete those rows from the large table, but this doesn't scale
    either!

    I'm currently looking at some of the suggestions I got from this
    list and seeing if any of them will work with us.

    Thanks for all who replied and I'll let you know how it goes!

    Mladen Gogala
    wrote:

    On 12/12/2006 08:42:38 PM, tboss@bossconsulting.com
    wrote:
    From asktom, the best way I've found is to use Tom's little
    code snippet below:
    delete from your_huge_table
    where rowid in
    (select rid
    from
    (select rowid rid,
    row_number() over
    (partition by varchar_that_defines_duplicates
    order by rowid ) rn
    from your_huge_table
    )
    where rn <> 1
    )
    /
    Good luck with that if cardinality is +60M rows. I would also add a
    condition like
    WHERE ROWID in (SELECT ROW_ID from EXCEPTIONS) to your query. You only
    need to populate
    exceptions table with duplicates and if those duplicates are a small
    percentage of the
    total number of records, your task will be done two order of magnitude
    faster then without
    the exceptions table.
  • Gints Plivna at Dec 14, 2006 at 2:19 pm
    Ok I've created a test case on our development box with 3 striped SATA
    disks, 2 CPUs and 2G of RAM, but only 500M are devoted to this 9.2
    Oracle instance.

    base_table was created from dba_source, with one unique column and a text
    column just to fill up some space.
    Then I multiplied base_table again and again until the count reached
    119,258,304 rows. The space it took was about ~30GB.

    Then I created temp_table with 20M rows to simulate your loaded table.
    Half of them, i.e. 10M rows, were different from those in base_table;
    the other 10M were the same.

    Then I created a unique index on the base_table varchar2 column, simulating
    your unique key, and altered this column to not null.

    OK. Now I was ready to do the insert.

    First I used the following insert:

    insert /*+ append */ into base_table
    select * from temp_table where ukrow in (
      select ukrow from temp_table
      minus
      select ukrow from base_table);

    with the plan you can find in the link at the very end of this message.

    It took 26 minutes and 11 seconds to complete, and it consisted of the
    following steps shown in v$session_longops:

    Seconds  Opname                Target
        281  Sort Output
         18  Sort/Merge
        316  Hash Join
        246  Sort Output
        261  Index Fast Full Scan  GINTS.BASE_TABLE
        525  Sort Output
         37  Sort/Merge
        177  Table Scan            GINTS.TEMP_TABLE
        217  Table Scan            GINTS.TEMP_TABLE

    The thing I didn't like was that GINTS.TEMP_TABLE was scanned twice.
    Writing SQL I tend to follow the principle of scanning any necessary row
    as little as possible, at best only once. After the first scan of
    GINTS.TEMP_TABLE I actually knew all the data for the insert, so the
    second scan was a simple waste.

    Then I rewrote my insert as follows:

    insert /*+ append */ into base_table
    select ukrow, text from (
      select t.ukrow, t.text,
             case when b.ukrow = t.ukrow then 0 else 1 end flag
      from temp_table t, base_table b
      where t.ukrow = b.ukrow(+)
    )
    where flag = 1;

    with the plan you can find in the link at the very end of this message.

    It took 16 minutes and 35 seconds to complete, and it consisted of the
    following steps shown in v$session_longops:

    Seconds  Opname                Target
        196  Sort Output
         39  Sort/Merge
        363  Hash Join
        244  Index Fast Full Scan  GINTS.BASE_TABLE
        147  Table Scan            GINTS.TEMP_TABLE

    So the summary is: inserting 10M rows (out of 20M candidate rows) into a
    ~120M row table took 16.5 minutes, and that included one UK index update.

    Looking at the insert we can see that it depends directly on the
    temp_table size and the base_table UK index size, and involves one hash
    join and one sort. To my mind it should scale roughly linearly, or at
    least at the same rate as the worse of the sort and the hash join.

    One more thing - consider using workarea_size_policy = manual with big
    sort_area_size and hash_area_size settings instead of
    workarea_size_policy = auto and a [possibly] miserable pga_aggregate_target.
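    For reference, a sketch of those session-level settings (the sizes are just
    placeholders, to be chosen to fit the machine's free memory):

    alter session set workarea_size_policy = manual;
    alter session set sort_area_size = 524288000;   -- bytes
    alter session set hash_area_size = 524288000;   -- bytes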

    Gints Plivna
    http://www.gplivna.eu

    P.S. resent message because the body was too big.
    The full spool file you can find in http://www.gplivna.eu/TabInsert.txt

    2006/12/14, A Ebadi :
    Biggest problem we've faced in coming up with a solution is none of the
    solutions so far scale.
    --
    http://www.freelists.org/webpage/oracle-l
  • Jaromir nemec at Dec 13, 2006 at 9:22 pm
    Hi,
    (select rowid rid,
    row_number() over
    (partition by varchar_that_defines_duplicates
    order by rowid ) rn
    A good thing about this approach is that you can control which of the
    duplicated rows will be discarded and which preserved. You may order not
    only by rowid but by any other attribute(s).
    (In case the dups are defined based on a key attribute, there is no
    guarantee that all attributes in the duplicated rows are identical.)
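    For example (hypothetical names big_tab, dup_key, and a load_ts column),
    ordering the window by a timestamp keeps the most recent copy rather than
    an arbitrary one:

    select rid
    from (select rowid rid,
                 row_number() over (partition by dup_key
                                    order by load_ts desc, rowid) rn
          from big_tab)
    where rn <> 1;   -- rowids of the copies to discard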

    A common approach with incremental loading is to check for dups in the
    newly loaded data and to eliminate them.

    Optionally, if relevant, a second step is performed.
    In the second step the new data is checked against your base table. A big
    optimisation can be achieved if there is some business rule saying that the
    dups must be close together in time (i.e. they can appear only within a
    limited interval of time). This allows you to limit the check to only the
    most recent partitions of your base table; it is not necessary to consider
    the whole base table. Provided that the base table is partitioned by time,
    of course.

    Both steps can be combined and performed in one pass.

    Regards,

    Jaromir
  • Gints Plivna at Dec 13, 2006 at 10:23 am
    Probably, instead of blindly loading all rows into the base table, you
    can do your direct path load into some temporary (staging) table and then
    insert into the base table only the distinct rows (if the base table is
    empty), or only those distinct rows that aren't already in the base table
    (if the base table is not empty), i.e. something like:
    insert into base_table select * from temp_table minus select * from base_table.

    Probably eliminating the cause will be more effective than fighting the
    consequences?

    Gints Plivna
    http://www.gplivna.eu

    2006/12/13, A Ebadi :
    We have a huge table (> 160 million rows) which has about 20 million
    duplicate rows that we need to delete. What is the most efficient way to do
    this as we will need to do this daily?
    A single varchar2(30) column is used to identify duplicates. We could
    possibly have > 2 rows of duplicates.

    We are doing direct path load so no unique key indexes can be put on the
    table to take care of the duplicates.

    Platform: Oracle 10G RAC (2 node) on Solaris 10.

    Thanks!

    --
    http://www.freelists.org/webpage/oracle-l
  • Tony van Lingen at Dec 14, 2006 at 3:04 am
    It may even be easier... You say "We are doing direct path load so no
    unique key indexes can be put on the table to take care of the
    duplicates". The Utilities guide (10gR2), however, explicitly names unique
    constraints among the constraints that are enforced during direct path loads:
    Integrity Constraints

    All integrity constraints are enforced during direct path loads,
    although not necessarily at the same time. NOT NULL constraints
    are enforced during the load. Records that fail these constraints are
    rejected.
    UNIQUE constraints are enforced both during and after the load. A
    record that violates a UNIQUE constraint is not rejected (the record
    is not available in memory when the constraint violation is detected).
    (Utilities, B14215-01, chapter 11)

    Did you actually try this?

    Cheers,
    Tony

    A Ebadi wrote:
    We have a huge table (> 160 million rows) which has about 20 million
    duplicate rows that we need to delete. What is the most efficient way
    to do this as we will need to do this daily?
    A single varchar2(30) column is used to identify duplicates. We
    could possibly have > 2 rows of duplicates.

    We are doing direct path load so no unique key indexes can be put on
    the table to take care of the duplicates.

    Platform: Oracle 10G RAC (2 node) on Solaris 10.

    Thanks!

  • Alan J. Gagne at Dec 14, 2006 at 12:30 pm
    I haven't tried this with large data sets, but it may be
    worth trying. Create the unique key, then load your data using the
    external table feature and use the new error logging feature of the
    INSERT statement to catch the dups.

    Alan

    CREATE TABLE raises (emp_id NUMBER, sal NUMBER
      CONSTRAINT check_sal CHECK (sal > 8000));

    EXECUTE DBMS_ERRLOG.CREATE_ERROR_LOG('raises', 'errlog');

    INSERT INTO raises
      SELECT employee_id, salary*1.1 FROM employees
      WHERE commission_pct > .2
      LOG ERRORS INTO errlog ('my_bad') REJECT LIMIT 10;

    SELECT ORA_ERR_MESG$, ORA_ERR_TAG$, emp_id, sal FROM errlog;

    ORA_ERR_MESG$                ORA_ERR_TAG$  EMP_ID   SAL
    ---------------------------  ------------  ------  ----
    ORA-02290: check constraint  my_bad           161  7700
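    Adapted to the de-dup case (names hypothetical: base_table with a unique
    constraint on dup_key, ext_daily as the external table), the same mechanism
    diverts duplicate rows to the error table instead of failing the load. Note
    that a direct-path (APPEND) insert does not log unique-constraint
    violations, so this needs a conventional-path insert:

    EXECUTE DBMS_ERRLOG.CREATE_ERROR_LOG('base_table', 'base_table_err');

    INSERT INTO base_table (dup_key, payload)
      SELECT dup_key, payload FROM ext_daily
      LOG ERRORS INTO base_table_err ('daily load') REJECT LIMIT UNLIMITED;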
  • Kerber, Andrew at Dec 14, 2006 at 1:31 pm
    This will probably work, but keep in mind that by adding a column you
    run the risk of breaking any application that relies on the table
    structure remaining constant (i.e., anything that relies on a SELECT * ...).
    Admittedly, it is bad programming practice to do this, but I have
    seen it done quite a bit.



    Andrew W. Kerber
    Oracle DBA
    UMB





    "If at first you dont succeed, dont take up skydiving"

    -----Original Message-----
    From: oracle-l-bounce_at_freelists.org
    On Behalf Of Tim Gorman
    Sent: Thursday, December 14, 2006 7:19 AM
    To: ebadi01_at_yahoo.com
    Cc: mgogala_at_verizon.net; tboss_at_bossconsulting.com;
    oracle-l_at_freelists.org
    Subject: Re: de-dup process



    When an UPDATE or DELETE won't scale (and it never will!), convert it
    into a parallel nologging direct-path INSERT.

    Use a query similar to Tom's to insert only the first "copy" of a row
    (i.e. use "RN = 1" instead of "RN <> 1" in the outermost WHERE clause) to
    INSERT the rows to be retained into a new table. Make the new table
    NOLOGGING, use APPEND and PARALLEL hints on the INSERT portion and
    PARALLEL hints on the SELECT portion of the INSERT ... SELECT statement.

    Here's the cute part. If you can't simply RENAME the two tables (i.e.
    old and new) to swap them, then make the newly-created table look
    exactly like the original table, but make it range-partitioned on a
    dummy numeric column (call it DUMMY or PARTKEY), give this column a
    constant value of zero, and name the single partition PZERO (or PDUMMY
    or whatever). Also, add this DUMMY or PARTKEY column to the original
    table as well. Then, use "ALTER TABLE <new-table> EXCHANGE PARTITION
    PZERO WITH TABLE <original-table>" to swap the new data with the old
    data.

    A direct-path parallel INSERT...SELECT will scale far better than any
    possible optimization on an UPDATE or DELETE statement. The comparison
    is not even close.

    Hope this helps...

    -Tim

    A Ebadi wrote:

    Biggest problem we've faced in coming up with a solution is none of the
    solutions so far scale. In other words, things are fine if we have a 20
    million row table with 2-3 million duplicates - runs in 10-15 minutes.
    However, trying it for 100+ million row table - it runs for hrs!



    We've even had another tool (Informatica) select out the ROWIDs of the
    duplicates into a separate table then we are using PL/SQL cursor to
    delete those rows from the large table, but this doesn't scale either!



    I'm currently looking at some of the suggestions I got from this list
    and seeing if any of them will work with us.

    Thanks for all who replied and I'll let you know how it goes!

    Mladen Gogala wrote:


    On 12/12/2006 08:42:38 PM, tboss_at_bossconsulting.com wrote:
    From asktom, the best way I've found is to use Tom's little
    code snippet below:
    >
    delete from your_huge_table
    where rowid in
    (select rid
    from
    (select rowid rid,
    row_number() over
    (partition by varchar_that_defines_duplicates
    order by rowid ) rn
    from your_huge_table
    )
    where rn <> 1
    )
    /
    Good luck with that if cardinality is +60M rows. I would also
    add a condition like
    WHERE ROWID in (SELECT ROW_ID from EXCEPTIONS) to your query.
    You only need to populate
    exceptions table with duplicates and if those duplicates are a
    small percentage of the
    total number of records, your task will be done two order of
    magnitude faster then without
    the exceptions table.
    --
    Mladen Gogala
    http://www.mladen-gogala.com

    --
    http://www.freelists.org/webpage/oracle-l


    http://www.freelists.org/webpage/oracle-l


    --
    http://www.freelists.org/webpage/oracle-l
  • Nigel Thomas at Dec 14, 2006 at 7:47 pm
    Cannot clean data before loading as data is from many different sources that don't know about each other.
    How many is many?
    If you really have many different load files (or can manufacture them), then you may be able to change the way the problem scales.
    How do you decide which of the duplicates to use?
    If you can use the "first come, first served" rule, your de-duplication problem becomes much simpler - duplicate avoidance rather than duplicate removal.
    Is there any easy way you can partition the incoming data so you can be sure that records in files in group A can't interfere with records in files in groups B, C, D, etc.?
    If the files 'overlap' when they come in, you may still be able to split them on a partition key - i.e. a component of the unique key - before they get loaded.
    Task I: "Reception"
    First, as data files arrive, deal them out into N non-overlapping file groups, where with luck N is an acceptable degree of (external) parallelism. This task is not internally parallelised:
    1. get the next file
    2. place it as-is into the appropriate load file group for task III, OR pass it to a splitter process for task II if necessary
    3. go back to step 1
    This is NOT parallel - like the maitre d' in a restaurant, it does one thing quickly, then hands you off:
    - "here is Annette to take you to your table" OR
    - "your table is by the window"

    Task II: "Your table is ready"
    This step deals with physically partitioning individual files outside the database, as a precursor to task III. You can skip it if you don't need to physically split files:
    1. find a file that's ready to split (e.g. it's in a specific directory)
    2. deal its records out into a file for each load group (or, if you do the split inside Oracle, into each group's temporary table - see the summary below)
    3. go back to step 1
    You can parallelise this task if necessary - have as many parallel splitter processes as the CPU and file system can stand, to ensure that there is always a file ready for every loader process in the next task.

    This is parallel - like the number of waiters in a restaurant; one of them organises you, takes your order, etc.

    Task III: "The meal"
    Within each group, set up a loader process:
    1. find a file that's ready to load for this group (e.g. it's in a specific directory)
    2. load it into temporary table TEMP_TABLE_A (truncate first) - each group has its own temp table
    3. merge TEMP_TABLE_A into your master table BASE_TABLE (using the technique Gints Plivna mentioned)
    4. mark the file as loaded (e.g. move it to another directory)
    5. go back to step 1 until the files are exhausted (or you are)
    This is parallel (like the number of tables in a restaurant). Restaurants can add more tables - but only if the kitchen can keep up... otherwise you get mad and leave.

    Summary
    You now have some flexibility to tune things:
    - the size of the individual files, to get the best out of the hash join from TEMP_TABLE_A to BASE_TABLE
    - the external degree of parallelism (how many loader group processes, how many splitter processes)
    - the (internal) degree of parallelism within the database (for each group), so that the sum of parallel servers required across all groups matches the capacity of your server (i.e. DOP * N is a function of parallel_max_servers)
    - task II could be done inside Oracle (easier to manage?) - substitute temp tables for temp files throughout
    - or task II could be pipelined into task III using FIFOs (so you don't have to double-handle the data on disc - but that can cause other problems)
    And BTW, this doesn't necessarily need the Partitioning option...

    HTH

    Regards Nigel
