FAQ
I'm trying to write a script to remove duplicate e-mail addresses from a
list.
I'd like some help understanding...
1. Why does it remove all but one of the duplicate lines?
2. How can I fix it?

Thanks for any advice,
John
-------------------------------
#!/usr/bin/perl
use warnings;
use strict;

open ALLNAMES, "emails.txt" or die "File: infile failed to open: $!\n";
my @allnames = <ALLNAMES>;

my %seen = ();
my @unique = grep { ! $seen{ $_ }++ } @allnames;

print "@unique";

close ALLNAMES or die "cannot close infile";
-----------------------------------------
Here's a small test file with fourteen lines, but only ten unique lines:

one@ahoo.com
two@ahoo.com
three@bcglobal.net
four@bcglobal.net
five@ahoo.com
six@mail.com
seven@otmail.com
eight@ildblue.net
nine@arthlink.net
ten@omcast.net
one@ahoo.com
two@ahoo.com
three@bcglobal.net
four@bcglobal.net

-------------------------------

  • 亂世貓熊 at Feb 5, 2008 at 10:03 am

    -------------------------------
    #!/usr/bin/perl
    use warnings;
    use strict;

    open ALLNAMES, "emails.txt" or die "File: infile failed to open: $!\n";
    my @allnames = <ALLNAMES>;
    chomp @allnames; # it seems you need this: it strips the trailing newline from each line
    my %seen = ();
    my @unique = grep { ! $seen{ $_ }++ } @allnames;

    print "@unique";

    close ALLNAMES or die "cannot close infile";
    -----------------------------------------
    HTH
  • Chas. Owens at Feb 5, 2008 at 2:41 pm

    On Feb 5, 2008 1:18 AM, boll wrote:
    I'm trying to write a script to remove duplicate e-mail addresses from a
    list.
    I'd like some help understanding...
    1. Why does it remove all but one of the duplicate lines?
    snip

    Because that is what the code says to do. It says to print any line
    it hasn't seen before. It isn't looking forward to see if the line
    may exist again in the list.

    snip
    2. How can I fix it?
    snip

    Well, I think you should think about it for a second. Do you really
    want to throw away any lines that have duplicates? For instance:

    one@example.com
    two@example.com
    one@example.com

    When I look at that I want

    one@example.com
    two@example.com

    not

    two@example.com

    But if you really want the latter, then you will need to keep a running
    count of how many times you have seen an email and only print out the
    ones you have seen once:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %seen;
    $seen{$_}++ while <>;
    print grep { $seen{$_} == 1 } keys %seen;
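
Editor's note: the two behaviours Chas. contrasts can be compared side by side. This is only a sketch on the three-line example from his post; the variable names (`@first_occurrence`, `@only_once`, `%count`) are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the three-line example above (newlines already removed).
my @lines = ('one@example.com', 'two@example.com', 'one@example.com');

# Keep the first occurrence of every address (the grep idiom from the question).
my %seen;
my @first_occurrence = grep { !$seen{$_}++ } @lines;

# Keep only the addresses that occur exactly once in the whole list.
my %count;
$count{$_}++ for @lines;
my @only_once = grep { $count{$_} == 1 } @lines;

print "first occurrence: @first_occurrence\n";
print "seen only once:   @only_once\n";
```

The first list still contains one@example.com; the second drops it entirely because it appears twice. Unlike `keys %seen`, filtering `@lines` keeps the original order.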
  • Rob Dixon at Feb 5, 2008 at 3:22 pm

    I would guess that your output includes the last line of the file when
    you don't expect it to. You are retaining the newline character at the
    end of each line. If the final line doesn't have a newline at the end it
    will appear different from the ones that do, and so will be listed in
    the output. To fix this, just use

    my @allnames = <ALLNAMES>;
    chomp @allnames;

    and then

    print "$_\n" foreach @unique;

    at the end.

    HTH,

    Rob
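
Editor's note: a small sketch of the effect Rob describes, assuming (as he guesses) that the last line of the file has no trailing newline. The in-memory list stands in for emails.txt, and the names `%seen_raw`, `@unique_raw` and `@chomped` are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed input: three lines as read by <ALLNAMES>, where the last
# line of the file is not terminated by a newline.
my @allnames = ("four\@bcglobal.net\n", "five\@ahoo.com\n", "four\@bcglobal.net");

# Without chomp, "four@bcglobal.net\n" and "four@bcglobal.net" are
# different hash keys, so both survive the grep.
my %seen_raw;
my @unique_raw = grep { !$seen_raw{$_}++ } @allnames;

# After chomp, the two strings are identical and one is dropped.
chomp(my @chomped = @allnames);
my %seen;
my @unique = grep { !$seen{$_}++ } @chomped;

print scalar(@unique_raw), " unique lines before chomp\n";
print scalar(@unique), " unique lines after chomp\n";
print "$_\n" foreach @unique;
```

Because chomp removes the newlines, the final loop adds one back to each line, which is why Rob changes the print statement as well.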
  • Obdulio santana at Feb 5, 2008 at 7:40 pm
    Maybe this helps

    perl -lne "print if ++$D{$_} == 1" address.txt

    regards




  • Sivasakthi at Apr 29, 2008 at 10:35 am

    On Tue, 2008-02-05 at 14:40 -0500, obdulio santana wrote:

    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt

    regards

    I have tried the above command, but it shows the following error:

    Can't modify single ref constructor in preincrement (++) at -e line 1,
    near "} =="
    Execution of -e aborted due to compilation errors.
  • John W. Krahn at Apr 29, 2008 at 10:56 am

    sivasakthi wrote:
    On Tue, 2008-02-05 at 14:40 -0500, obdulio santana wrote:

    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt

    I have tried the above command, but it shows the following error:

    Can't modify single ref constructor in preincrement (++) at -e line 1,
    near "} =="
    Execution of -e aborted due to compilation errors.

    Using double quotes means that the shell interpolates the line before
    perl gets it, so you have to use single quotes instead.

    perl -lne 'print if ++$D{$_} == 1' address.txt


    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
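
Editor's note: to see why perl complains, it may help to compile the text that is left once the shell has substituted its own `$D` and `$_`. What those expand to depends on the shell and the session; this sketch simply assumes both expanded to empty strings, and `$after_shell` is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: inside double quotes the shell replaced both $D and $_
# with empty strings, so this is the program text perl received.
my $after_shell = 'print if ++{} == 1';

# Wrap the text in a sub so it is only compiled, never run.
# A compile error is reported in $@.
my $compiled = eval "sub { $after_shell }";
my $error    = $@;

print "perl saw: $after_shell\n";
print "compile error: $error";
```

With the hash name gone, `{}` is parsed as an anonymous hash constructor, which `++` cannot modify. The exact wording of the message varies between perl versions.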
  • Rob Dixon at Apr 29, 2008 at 10:29 pm

    obdulio santana wrote:
    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt
    You may prefer the cuteness of

    perl -lne "print unless $D{$_}++" address.txt
  • Rob Dixon at Apr 29, 2008 at 10:56 pm

    Rob Dixon wrote:
    obdulio santana wrote:
    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt
    You may prefer the cuteness of

    perl -lne "print unless $D{$_}++" address.txt
    My apologies. That requires single quotes:

    perl -lne 'print unless $D{$_}++' address.txt

    to avoid perpetuating the earlier problem.

    Rob
  • John W. Krahn at Apr 29, 2008 at 11:21 pm

    Rob Dixon wrote:
    obdulio santana wrote:
    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt
    You may prefer the cuteness of

    perl -lne "print unless $D{$_}++" address.txt
    Or the shortness of:

    perl -ne'$D{$_}++||print' address.txt


    John
  • Jenda Krynicky at Apr 30, 2008 at 12:09 am
    From: "John W. Krahn" <krahnj@telus.net>
    Rob Dixon wrote:
    obdulio santana wrote:
    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt
    You may prefer the cuteness of

    perl -lne "print unless $D{$_}++" address.txt
    Or the shortness of:

    perl -ne'$D{$_}++||print' address.txt

    John
    Well, since you started golfing ...

    perl -ne'$D{$_}||=print'

    one character less :-)

    Jenda
    ===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
    When it comes to wine, women and song, wizards are allowed
    to get drunk and croon as much as they like.
    -- Terry Pratchett in Sourcery
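
Editor's note: the golfed `$D{$_}||=print` relies on print returning a true value when it succeeds. A long-hand sketch of the same logic, with a hypothetical in-memory list in place of the file and an extra `$printed` counter added for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input lines, newlines retained as under -n without -l.
my @lines = ("one\n", "two\n", "one\n");

my %D;
my $printed = 0;
for (@lines) {
    # Long form of  $D{$_} ||= print : print only runs while the slot
    # is still false, and its true return value marks the line as seen.
    unless ($D{$_}) {
        $D{$_} = print;    # print with no arguments prints $_
        $printed++;
    }
}
```

The short-circuit in `||=` is what skips the print for a line whose slot is already true.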
  • Jenda Krynicky at Apr 30, 2008 at 12:09 am
    Actually this

    perl -n'${$_}||=print'

    works as well as long as the last line in the file either does end
    with a newline or doesn't match any builtin variable whose default
    value is already true.

    One more keystroke down.

    Jenda
  • John W. Krahn at Apr 30, 2008 at 3:45 am

    Jenda Krynicky wrote:
    From: "John W. Krahn" <krahnj@telus.net>
    Rob Dixon wrote:
    obdulio santana wrote:
    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt
    You may prefer the cuteness of

    perl -lne "print unless $D{$_}++" address.txt
    Or the shortness of:

    perl -ne'$D{$_}++||print' address.txt
    Actually this

    perl -n'${$_}||=print'

    works as well as long as the last line in the file either does end
    with a newline or doesn't match any builtin variable whose default
    value is already true.

    One more keystroke down.

    $ perl -c -n'$D{$_}++||print'
    Unrecognized switch: -$D{$_}++||print (-h will show valid options).



    John
  • Jenda Krynicky at Apr 30, 2008 at 11:54 am
    From: "John W. Krahn" <krahnj@telus.net>
    Jenda Krynicky wrote:
    From: "John W. Krahn" <krahnj@telus.net>
    Rob Dixon wrote:
    obdulio santana wrote:
    May be this helps

    perl -lne "print if ++$D{$_} == 1" address.txt
    You may prefer the cuteness of

    perl -lne "print unless $D{$_}++" address.txt
    Or the shortness of:

    perl -ne'$D{$_}++||print' address.txt
    Actually this

    perl -n'${$_}||=print'

    works as well as long as the last line in the file either does end
    with a newline or doesn't match any builtin variable whose default
    value is already true.

    One more keystroke down.
    $ perl -c -n'$D{$_}++||print'
    Unrecognized switch: -$D{$_}++||print (-h will show valid options).

    Sorry, it should have been

    perl -ne'${$_}||=print'

    I'm using Windows, so I have to test it with double quotes in place of
    the single quotes, and I must have made a mistake changing them back.


    From: "Chas. Owens" <chas.owens@gmail.com>
    > That is nasty. If I am reading that correctly you are using a
    > symbolic reference to create variables with the name of the line.
    > I can't remember, does Perl have a limit on the size of a variable
    > name?

    Yes, you are reading it correctly. And since the line contains the
    newline it can't match any builtin. Except possibly for the last line
    in the file.

    Does it have a limit on the size of a hash key?

    If it does, then the limit is high enough I believe.

    Jenda
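
Editor's note: a sketch of what the symbolic reference in `${$_}||=print` does, written out under strict with the check relaxed locally. The sample lines and the `@kept` array are hypothetical; the one-liner itself runs without strict, so it needs no such relaxation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input lines, newlines retained.
my @lines = ("one\@example.com\n", "two\@example.com\n", "one\@example.com\n");

my @kept;
{
    no strict 'refs';    # symbolic references are forbidden under strict
    for my $line (@lines) {
        # ${$line} is the package scalar whose *name* is the line's text.
        next if ${$line};
        ${$line} = 1;
        push @kept, $line;
    }
}

# Each distinct line now exists as a name in the main symbol table.
print "symbol table entry created\n" if exists $main::{"one\@example.com\n"};
print @kept;
```

In effect the package symbol table plays the role of the `%D` hash, which is why Jenda answers the question about variable-name length by asking about hash-key length.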


  • Chas. Owens at Apr 30, 2008 at 6:26 am
    On Tue, Apr 29, 2008 at 8:06 PM, Jenda Krynicky wrote:
    snip
    perl -n'${$_}||=print'
    snip

    That is nasty. If I am reading that correctly you are using a
    symbolic reference to create variables with the name of the line. I
    can't remember, does Perl have a limit on the size of a variable name?

    --
    Chas. Owens
    wonkden.net
    The most important skill a programmer can have is the ability to read.
  • Boll at Feb 5, 2008 at 11:38 pm


    OK, now I get it.
    The two lines appear identical, but only one has a newline appended.
    Thanks for the explanation!
    -John

Discussion Overview
group: beginners @
category: perl
posted: Feb 5, '08 at 6:18a
active: Apr 30, '08 at 11:54a
posts: 16
users: 8
website: perl.org
