Removing duplicate records - OT

  • Mickalo at Aug 12, 2001 at 12:55 pm
    Hello all,

    Sorry about the off-topic question, but this has been driving me nuts. I am
    trying to import a tab-delimited file into a MySQL database and need to remove
    duplicate email addresses before importing the data. The file is set up like
    this:

    id1 user1 pass1 email1 name1 na 1 na
    id2 user2 pass2 email2 name2 na 0 na
    id3 user3 pass3 email2 name3 na 1 na
    id4 user4 pass4 email4 name4 na 1 na
    ....etc

    Now I need to go through each data record in the file and remove all duplicate
    records where the email address is the same. E.g., above, id3 has the same email
    address as id2, so I want to remove id3, and so on. The file is sorted by the
    id value. I'm working with Perl.

    Any help much appreciated.


    Mike(mickalo)Blezien
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
    Thunder Rain Internet Publishing
    Providing Internet Solutions that work!
    http://www.thunder-rain.com
    Tel: 1(225)686-2002
    =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  • Mark Rowlands at Aug 12, 2001 at 2:08 pm

    On Sunday 12 August 2001 14:53, mickalo@bellsouth.net wrote:
    You could use a hash:

    if (exists $hash_of_email{$email}) {
        # We've seen this one before
    } else {
        # Ooh, it's a new one then!
        $hash_of_email{$email} = $whatever;
    }
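    Spelled out as a small filter script, that hash test might look like this
    (a sketch only; it assumes the email address is the fourth tab-separated
    column, as in the sample layout, and keeps the first record seen for each
    address):

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    my %seen;   # email addresses already emitted

    while (my $line = <>) {
        chomp $line;
        my @fields = split /\t/, $line;
        my $email  = $fields[3];    # fourth column in the sample layout

        next if $seen{$email}++;    # duplicate: keep the first, skip the rest
        print "$line\n";
    }
    ```

    Run it as `perl dedup.pl users.txt > users_unique.txt`; since the file is
    sorted by id, "first seen" means the lowest id survives.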
  • Mickalo at Aug 12, 2001 at 2:51 pm
    Mark,

    thanks for the input. I was able to resolve the problem with the mysqlimport
    utility, as it has "--ignore" and "--replace" options which made it a lot easier
    to import the data without a lot of screwing around. Worked perfectly.

    Much thanks to all that replied, appreciated as always :)
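    For reference, that mysqlimport invocation would look something like this
    (a sketch; the database and file names are invented, and it assumes the
    target table has a UNIQUE index on the email column so that --ignore or
    --replace has a key collision to act on):

    ```shell
    # mysqlimport derives the table name from the file name
    # ("users.txt" loads into table "users").
    # --ignore skips rows whose unique key already exists (keeping the
    # first occurrence); --replace would keep the last one instead.
    mysqlimport --local --ignore \
        --fields-terminated-by='\t' \
        mydb users.txt
    ```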

  • Steven Lembark at Aug 12, 2001 at 5:51 pm
    -- mickalo@bellsouth.net
    (a) Extract the key fields from one list into a hash and only take records
    from the list that don't exist in the hash.

    (b) Sort the second list on its key also and get two arrays. Keep shifting
    records off the arrays, ignoring items from the second one with keys
    that match the first.

    (c) Use Quantum::Superpositions: create a package with an overloaded string
    compare operator that simply does $_[0][0] eq $_[1][0] for all of the key
    fields $_[X][WHATEVER]. The deduped list is:

    {
        package NoDups;
        use overload q{eq} => sub {
            # ignores $_[2] since both objects are NoDups
            my ($a, $b) = @_[0, 1];
            $a->[0] eq $b->[0] && $a->[1] eq $b->[1];   # ...for all key fields
        };
    }

    my $list1 = bless $sth1->fetchall_arrayref, 'NoDups';
    my $list2 = bless $sth2->fetchall_arrayref, 'NoDups';

    # list1, plus the items of list2 that match nothing in list1
    my @deduped = ( @$list1, grep { $_ ne all(@$list1) } @$list2 );

    --
    Steven Lembark 2930 W. Palmer
    Workhorse Computing Chicago, IL 60647
    +1 800 762 1582
  • Hardy Merrill at Aug 13, 2001 at 2:39 pm
    Build a hash where the key is the email address, and the value
    is the whole record:

    my %email_hash = ();
    while (<IN>) {
        # use "split" (or "unpack") to get the fields from the line;
        # the email is the fourth tab-separated column
        my ($id, $user, $pass, $email) = split /\t/;
        if (! exists $email_hash{$email}) {
            $email_hash{$email} = $_;
        }
    }

    You would end up with a hash containing only lines with unique
    email addresses - you could then iterate through the hash and
    insert rows into the table. Of course this approach could use
    a LOT of memory if the file you're importing is large.

    Another approach might be to read the tab-delimited file into
    a temporary MySQL table that has an index based on the email address.
    Then with DBI (this is just pseudo-code):

    1. SELECT * FROM A
       ORDER BY EMAIL

    2. while (fetch) {
           if (email changed) {
               insert row into good table
           }
       }

    This would be the long-winded approach. I'm sure there's a
    better way, but these are the two that came to mind...
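    That second approach might be sketched with DBI like this (illustrative
    only; the table and column names `import_tmp`, `users`, etc. are invented,
    and error handling is left to RaiseError):

    ```perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:mydb', 'user', 'password',
                           { RaiseError => 1 });

    # Read the temp table sorted by email so duplicates come out adjacent.
    my $sel = $dbh->prepare(
        'SELECT id, user, pass, email, name FROM import_tmp ORDER BY email');
    my $ins = $dbh->prepare(
        'INSERT INTO users (id, user, pass, email, name) VALUES (?,?,?,?,?)');

    $sel->execute;
    my $last_email = '';
    while (my @row = $sel->fetchrow_array) {
        my $email = $row[3];
        next if $email eq $last_email;   # same as previous row: a duplicate
        $ins->execute(@row);
        $last_email = $email;
    }
    $dbh->disconnect;
    ```

    Note that ordering by email alone doesn't guarantee the lowest id survives
    among duplicates; `ORDER BY email, id` would.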

    HTH.


    --
    Hardy Merrill
    Mission Critical Linux, Inc.
    http://www.missioncriticallinux.com
  • Neil Lunn at Aug 14, 2001 at 1:10 am
    While the person who posted this message has already solved their
    issue, this might still be worthwhile.

    One thing I notice is that all the responses process every record to see
    if a duplicate record is there. Hopefully, though, the database server is a
    little more powerful than the machine running the script, so why not do the
    hash match there?

    Use something like this as the record source for items with duplicate email
    addresses:

    select *
    from table
    where email in (
        select email
        from table
        group by email
        having count(*) > 1
    )

    and delete every second record.

    __________________________________________________________________________
    Please Note :
    Only the intended recipient is authorised to access or use this e-mail.
    If you are not the intended recipient, please delete this e-mail and notify
    the sender immediately. The contents of this e-mail are the writer's
    opinion and are not necessarily endorsed by the Gunz Companies
    unless expressly stated.

    We use virus scanning software but exclude all liability for viruses or
    similar in any attachment.
  • Bart Lateur at Aug 14, 2001 at 10:27 am

    On Tue, 14 Aug 2001 11:12:23 +1000, Neil Lunn wrote:
    and delete every second record.
    Next question: how do you delete every second record? Quickly?

    --
    Bart.
  • Ronald J Kimball at Aug 14, 2001 at 1:23 pm

    On Tue, Aug 14, 2001 at 11:12:23AM +1000, Neil Lunn wrote:

    and delete every second record.
    This approach would end up deleting the wrong records if the records aren't
    returned in the expected order (you could add an order by to fix that), or
    if any email appears more than twice.

    Ronald
  • Neil Lunn at Aug 14, 2001 at 11:48 am

    -----Original Message-----
    From: Bart Lateur
    Sent: Tuesday, August 14, 2001 8:30 PM
    To: dbi-users@perl.org
    Subject: Re: Removing duplicate records - OT

    On Tue, 14 Aug 2001 11:12:23 +1000, Neil Lunn wrote:

    Next question: how do you delete every second record? Quickly?
    $count = 0;
    while ($sth->fetch) {    # or whatever
        $count++;
        if ($count > 1) {
            # delete statement here for id = current id
            $count = 0;
        }
        # otherwise skip over it
    }

    I wasn't going to write all the code and I can't be quick when I'm not here.

    --Neil

Discussion Overview
group: dbi-users
categories: perl
posted: Aug 12, '01 at 12:50p
active: Aug 14, '01 at 1:23p
posts: 10
users: 7
website: dbi.perl.org
