Is there a way of writing a regex to find one or more occurrences of a specific
text string and replace them with a single occurrence?

e.g.:

<data tag>
AI000001
AI000001
AI000001
AI000001

needs to be replaced with
<data tag>
AI000001

Thus m/(AI\d{6}\n)/ will find one occurrence and capture it as $1 (assuming the
delimiter is set to something other than \n), but how can this regex be modified
to find multiple occurrences and replace them with a single occurrence?

thanks,
J.


  • John W. Krahn at Jun 12, 2007 at 5:04 pm

    James wrote:
    > Is there a way of writing a regex to find one or more occurrences of a specific
    > text string and replace them with a single occurrence?
    >
    > e.g.:
    >
    > <data tag>
    > AI000001
    > AI000001
    > AI000001
    > AI000001
    >
    > needs to be replaced with
    > <data tag>
    > AI000001
    >
    > Thus m/(AI\d{6}\n)/ will find one occurrence and capture it as $1 (assuming the
    > delimiter is set to something other than \n), but how can this regex be modified
    > to find multiple occurrences and replace them with a single occurrence?

    $ perl -le'
    my $data = q[<data tag>
    AI000001
    AI000001
    AI000001
    AI000001
    ];
    print $data;
    $data =~ s/(AI\d{6}\n)(?=\1)//g;
    print $data;
    '
    <data tag>
    AI000001
    AI000001
    AI000001
    AI000001

    <data tag>
    AI000001




    John
    --
    use Perl;
    program
    fulfillment
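For a whole file rather than a string literal, the same substitution needs the input in a single scalar, which is what the "delimiter" remark in the question is about. A minimal sketch, assuming the data fits in memory: the sub name collapse_ai_runs is made up for illustration, and an in-memory filehandle stands in for a real input file.

```perl
use strict;
use warnings;

# Hypothetical helper: drop each AI-style line that is immediately
# followed by an identical line, leaving one copy per run.
sub collapse_ai_runs {
    my ($text) = @_;
    $text =~ s/(AI\d{6}\n)(?=\1)//g;
    return $text;
}

# An in-memory filehandle stands in for a real file here (assumption).
my $input = "<data tag>\nAI000001\nAI000001\nAI000001\nAI000001\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

# Undefining $/ (the input record separator) makes <$fh> return the
# whole input at once, so the regex can see across line boundaries.
my $data = do { local $/; <$fh> };
close $fh;

print collapse_ai_runs($data);
```

With a real file, only the open call would change; the slurp and the substitution stay the same.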
  • Chas Owens at Jun 12, 2007 at 5:49 pm

    On 6/12/07, James wrote:
    > Is there a way of writing a regex to find one or more occurrences of a specific
    > text string and replace them with a single occurrence?

    Possibly, but using a hash is a lot easier and probably more efficient:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %h;
    while (<DATA>) {
        print unless $h{$_}++;
    }

    __DATA__
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    BBBBBB
    NNNNNNNNNNN
    NNNNNNNNNNN
    CCCCCCCCC
    CCCCCCCCC
    CCCCCCCCC
    CCCCCCCCC
  • Yitzle at Jun 12, 2007 at 5:56 pm
    Issues with both methods:

    John's doesn't work for this data:
    aaaaaa
    aaaaaa
    bbb
    cccccc
    cccccc

    I would expect:
    aaaaaa
    bbb
    cccccc

    I would get:
    aaaaaa
    bbb
    cccccc
    cccccc

    With the solution by Chas and the data:
    aaaaaa
    aaaaaa
    bbb
    aaaaaa
    aaaaaa

    I expect:
    aaaaaa
    bbb
    aaaaaa

    I get:
    aaaaaa
    bbb
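The two behaviours being compared here are different operations: collapsing runs of adjacent identical lines, and removing every later duplicate wherever it appears. A small sketch that runs both on the second data set above; the sub names collapse_runs and remove_duplicates are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Collapse each run of identical adjacent lines to a single line.
sub collapse_runs {
    my @lines = @_;
    my @out;
    for my $line (@lines) {
        # Keep the line unless it equals the one we kept last.
        push @out, $line unless @out && $out[-1] eq $line;
    }
    return @out;
}

# Keep only the first occurrence of each line, wherever it appears.
sub remove_duplicates {
    my @lines = @_;
    my %seen;
    return grep { !$seen{$_}++ } @lines;
}

my @data = qw(aaaaaa aaaaaa bbb aaaaaa aaaaaa);

print join(' ', collapse_runs(@data)), "\n";      # aaaaaa bbb aaaaaa
print join(' ', remove_duplicates(@data)), "\n";  # aaaaaa bbb
```

Which one is "correct" depends on whether the second run of aaaaaa should survive.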
  • Chas Owens at Jun 12, 2007 at 6:03 pm

    On 6/12/07, yitzle wrote:
    > Issues with both methods:
    [snip]

    If you only want to reduce runs (as opposed to removing all duplicates), then:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $cur;
    while (<DATA>) {
        print unless defined $cur and $_ eq $cur;
        $cur = $_;
    }

    __DATA__
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    BBBBBB
    AAAAAAAAAAAAA
    NNNNNNNNNNN
    NNNNNNNNNNN
    AAAAAAAAAAAAA
    CCCCCCCCC
    CCCCCCCCC
    CCCCCCCCC
    CCCCCCCCC
    AAAAAAAAAAAAA
  • John W. Krahn at Jun 12, 2007 at 7:20 pm

    yitzle wrote:
    > Issues with both methods:
    >
    > John's doesn't work for this data:
    > aaaaaa
    > aaaaaa
    > bbb
    > cccccc
    > cccccc
    >
    > I would expect:
    > aaaaaa
    > bbb
    > cccccc
    >
    > I would get:
    > aaaaaa
    > bbb
    > cccccc
    > cccccc

    It works for me:

    $ perl -le'
    my $data = q[aaaaaa
    aaaaaa
    bbb
    cccccc
    cccccc
    ];
    print $data;
    $data =~ s/(.*\n)(?=\1)//g;
    print $data;
    '
    aaaaaa
    aaaaaa
    bbb
    cccccc
    cccccc

    aaaaaa
    bbb
    cccccc




    John
    --
    use Perl;
    program
    fulfillment
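One edge case the thread does not discuss: without an anchor, (.*\n) may start matching in the middle of a line, so a line that merely ends with the text of the following line can swallow it. A hedged sketch of that case, together with an anchored variant using ^ and the /m modifier; the variant is an addition here, not something proposed in the thread.

```perl
use strict;
use warnings;

# Two different lines, where the second equals the tail of the first.
my $text = "baa\naa\n";

# Unanchored: the match can begin at the "aa" inside "baa", see an
# identical "aa\n" ahead, and delete it, which merges the two lines.
(my $unanchored = $text) =~ s/(.*\n)(?=\1)//g;

# Anchored: with /m, ^ matches only at the start of a line, so only
# whole lines are compared.
(my $anchored = $text) =~ s/^(.*\n)(?=\1)//mg;

print $unanchored eq "baa\n"     ? "unanchored lost a line\n" : "unanchored ok\n";
print $anchored   eq "baa\naa\n" ? "anchored left both lines\n" : "anchored changed input\n";
```

The anchored form still collapses genuine runs, since each repeated line begins at a line start.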
  • James at Jun 13, 2007 at 9:24 am
    Thanks all, I have something working:
    $data =~ s/(.*\n)(?=\1)//g;
    Can anyone explain the (?=\1) bit? I understand the search and replace.

    J.
  • Yitzle at Jun 13, 2007 at 1:16 pm

    On 6/13/07, James wrote:
    > Thanks all, I have something working:
    > $data =~ s/(.*\n)(?=\1)//g;
    > Can anyone explain the (?=\1) bit? I understand the search and replace.
    >
    > J.

    I didn't understand it myself, but see:
    http://www.boost.org/libs/regex/doc/syntax_perl.html

    Search for "Back references",

    and look at Perl Extended Patterns -> Lookahead:
    "(?=pattern) consumes zero characters, only if pattern matches."

    So it searches for .*\n and replaces it with "" if there is another
    copy of the same matched text right after it,
    i.e. it replaces .*\n with "" if it is immediately repeated.

    Cool!
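The "consumes zero characters" part is easiest to see on a smaller pattern than the one in the thread. A minimal sketch with made-up sample text: the lookahead decides whether "foo" matches, but only "foo" itself is replaced.

```perl
use strict;
use warnings;

my $text = "foobar foobaz";

# With the lookahead, "bar" must follow, but only "foo" is part of
# the match, so only "foo" is replaced.
(my $with_lookahead = $text) =~ s/foo(?=bar)/X/g;

# Without it, "bar" is consumed too and disappears along with "foo".
(my $without_lookahead = $text) =~ s/foobar/X/g;

print "$with_lookahead\n";     # Xbar foobaz
print "$without_lookahead\n";  # X foobaz
```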
  • Paul Lalli at Jun 13, 2007 at 1:50 pm

    On Jun 13, 5:21 am, jlum...@arrowt.co.uk (James) wrote:
    > Thanks all, I have something working:
    > $data =~ s/(.*\n)(?=\1)//g;
    > Can anyone explain the (?=\1) bit? I understand the search and replace.

    Which part do you not understand? The (?=) or the \1 or both?

    (?= ) is a "positive lookahead assertion". It "peeks" ahead in the
    string to determine if the next thing matches its contents, but
    it does not actually match those contents. It doesn't move the
    internal position pointer along, and whatever is in the (?= ) is not
    part of the actual match, so it will not be replaced.

    \1 within a pattern match means exactly what $1 will mean when the
    pattern match is finished. That is, it's whatever was matched by the
    first capturing parentheses in this pattern match. In this case,
    that's .*\n.

    So this pattern is searching for 0 or more of any character other than
    a newline, followed by the newline, and then checks to see if the next
    thing after that is exactly what was just matched. If so, the entire
    *MATCH* is replaced with nothing. Since the second instance of the .*\n
    was not actually matched, just looked for, it does not get replaced.

    For more information,
    perldoc perlretut
    perldoc perlre
    perldoc perlreref

    Paul Lalli
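Why the lookahead matters shows up on a run of three identical lines. A rough sketch on made-up data, comparing a version that consumes the repeated copy with the lookahead version from the thread; the \1+ variant is an extra alternative added here, not one proposed in the thread.

```perl
use strict;
use warnings;

# Make newlines visible when printing.
sub show { my ($s) = @_; $s =~ s/\n/\\n/g; return $s }

my $text = "a\na\na\nb\n";

# Consuming pair: matches "a\na\n", keeps one copy, then resumes after
# the pair, so the third "a\n" has nothing left to pair with.
(my $pair = $text) =~ s/(.*\n)\1/$1/g;

# Lookahead: deletes a line only when an identical line follows it;
# the following line is not consumed and can be tested in turn.
(my $lookahead = $text) =~ s/(.*\n)(?=\1)//g;

# Consuming the whole run with \1+ and keeping one copy also works.
(my $run = $text) =~ s/(.*\n)\1+/$1/g;

print show($pair), "\n";       # a\na\nb\n
print show($lookahead), "\n";  # a\nb\n
print show($run), "\n";        # a\nb\n
```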
  • Tony Heal at Aug 2, 2007 at 10:33 pm
    John's only works when the repeated strings are adjacent. If identical strings are separated by other lines, they are not removed:

    sarge-plain:~# perl -le'
    my $data = q[aaaaaa
    cccccc
    bbb
    cccccc
    aaaaaa
    ];
    print $data;
    $data =~ s/(.*\n)(?=\1)//g;
    print $data;
    '
    aaaaaa
    cccccc
    bbb
    cccccc
    aaaaaa

    aaaaaa
    cccccc
    bbb
    cccccc
    aaaaaa

    whereas Chas's works for both adjacent repeats and non-adjacent duplicates:
    #!/usr/bin/perl

    use strict;
    use warnings;

    my %h;
    while (<DATA>) {
        print unless $h{$_}++;
    }

    __DATA__
    AAAAAAAAAAAAA
    NNNNNNNNNNN
    BBBBBBBBB
    CCCCCCCCC
    AAAAAAAAAAAAA
    NNNNNNNNNNN
    BBBBBBBBB
    CCCCCCCCC
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    AAAAAAAAAAAAA
    NNNNNNNNNNN
    NNNNNNNNNNN
    NNNNNNNNNNN
    BBBBBBBBB
    BBBBBBBBB
    BBBBBBBBB
    CCCCCCCCC
    CCCCCCCCC
    CCCCCCCCC

    sarge-plain:~# ./temp.pl
    AAAAAAAAAAAAA
    NNNNNNNNNNN
    BBBBBBBBB
    CCCCCCCCC

    Tony Heal


Discussion Overview
group: beginners @
categories: perl
posted: Jun 12, '07 at 4:15p
active: Aug 2, '07 at 10:33p
posts: 10
users: 6
website: perl.org