Greetings;

I have, conservatively, dozens of html files to change.

I can find them and pass the file name to perl and
do the usual s/// changes but there is one change I can't
figure out.

There is a line in each file that looks like

<H1>This-Is-The-Title</H1>

of course, they are all different!

How can I change the hyphens to spaces in this line only?

Complicating the task is:

1. I don't know that there is only one such line per file.
I need to get them all.
2. I don't know that all <H1> are upper case.
3. Not all of the <H1> lines are the same record in the file.


TIA for any help!

Dennis

  • Xavier Noria at Aug 15, 2007 at 5:05 pm

    I think this satisfies those constraints:

    perl -0777 -pi.bak -we 's{(<h1>)(.*?)(</h1>)}{$x = $2; $x =~ tr:-: :; "$1$x$3"}geis' *.html

    We slurp the file with -0777 to be able to work across lines (I
    understand (3) that way). Then we capture the stuff between the H1
    tags, case-insensitively, and tr/// what's in between. Note that the
    $n variables are read-only; that's why we copy the capture into a
    regular variable.

    Of course that assumes a simple regexp is enough. You need to judge
    whether that's the case in your data set.
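
    For readers who want to try the idea outside a one-liner, here is a
    minimal sketch of the same slurp-mode substitution applied to a
    string. The helper name dehyphenate_h1 and the sample text are
    assumptions made for illustration; they do not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): apply the slurp-mode
# substitution to a whole document held in a single string.
# /s lets . cross newlines, /i ignores tag case, /g catches every
# element, and /e evaluates the replacement as Perl code.
# The capture is copied into $title because $2 itself is read-only.
sub dehyphenate_h1 {
    my ($html) = @_;
    $html =~ s{(<h1>)(.*?)(</h1>)}{
        my $title = $2;
        $title =~ tr/-/ /;
        "$1$title$3";
    }geis;
    return $html;
}

# Sample input: one heading split across two lines, one hyphenated
# paragraph that must stay untouched, and a lower-case heading.
my $sample = "<H1>This-Is-\nThe-Title</H1>\n<p>left-alone</p>\n<h1>Second-One</h1>\n";
print dehyphenate_h1($sample);
```

    Only text between the heading tags is changed; the hyphen in the
    paragraph is left alone.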

    -- fxn
  • Xavier Noria at Aug 15, 2007 at 5:16 pm

    On Aug 15, 2007, at 7:04 PM, Xavier Noria wrote:

    perl -0777 -pi.bak -we 's{(<h1>)(.*?)(</h1>)}{$x = $2; $x =~ tr:-: :; "$1$x$3"}geis' *.html
    A small improvement: the groups are unnecessary because the tags
    themselves are guaranteed not to contain hyphens (in general they
    could, for instance in a class name, but in this case they don't):

    perl -0777 -pi.bak -we 's{<h1>.*?</h1>}{$x = $&; $x =~ tr:-: :; $x}geis' *.html

    -- fxn
  • Paul Lalli at Aug 15, 2007 at 5:13 pm

    Untested:

    perl -lpi.bkp -e'
      if (m!(<h1>(?:[a-z]+-)+[a-z]+</h1>)!i) { # if the pattern is found on this line
        $h1_sec = $1;                          # save the offending pattern
        ($mod_sec = $h1_sec) =~ tr/-/ /;       # change the dashes to spaces
        s/$h1_sec/$mod_sec/;                   # replace the pattern found with the modified version
      }
    ' file1.html file2.html file3.html

    This should find all instances of your pattern in each file, with the
    exception of more than one instance of the pattern on the SAME line of
    the SAME file. If that's a possibility, you'd have to make it more
    complicated, putting a foreach loop around the whole thing, making the
    first pattern match globally....
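
    One possible way to cover several instances on the same line is
    sketched below. Note that it swaps the foreach loop described above
    for a single s///ge substitution, which is a different technique with
    the same aim. The helper name fix_line and the sample line are
    assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fix every matching <h1> element on one line.
# /g repeats the substitution for each match on the line, /e evaluates
# the replacement as code, /i ignores the case of the tags.
sub fix_line {
    my ($line) = @_;
    $line =~ s{(<h1>(?:[a-z]+-)+[a-z]+</h1>)}{
        (my $fixed = $1) =~ tr/-/ /;
        $fixed;
    }gie;
    return $line;
}

print fix_line("<h1>One-Two</h1> text-here <H1>Three-Four-Five</H1>"), "\n";
```

    Because the tr/// runs on a copy of each match, no second s/// with an
    interpolated variable is needed.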

    Hope that helps,
    Paul Lalli

  • Dr.Ruud at Aug 16, 2007 at 8:40 am

    Paul Lalli schreef:

    s/$h1_sec/$mod_sec/; #replace the pattern found with the
    modified version
    Many s/$search/replace/ constructs should have been written with
    quotemeta, so that they look like:

    s/\Q${search}/replace/

    --
    Affijn, Ruud

    "Gewoon is een tijger."
  • Paul Lalli at Aug 16, 2007 at 11:10 am

    Many, sure. But not this one. If you actually read the whole code,
    you'll see there's no way for $h1_sec to contain any meta characters,
    and so there is no reason to put the \Q there.

    Since you snipped it, here it is again:
    if (m!(<h1>(?:[a-z]+-)+[a-z]+</h1>)!i) {
    $h1_sec = $1;
    ($mod_sec = $h1_sec) =~ tr/-/ /;
    s/$h1_sec/$mod_sec/;
    }

    As you can see, the only possible characters that can be in $h1_sec
    are:
    letters, 1, -, <, >, and /
    None of those need to be escaped.

    Paul Lalli
  • Andrew Curry at Aug 16, 2007 at 9:02 am
    Why?

    -----Original Message-----
    From: Dr.Ruud
    Sent: 16 August 2007 09:37
    To: beginners@perl.org
    Subject: Re: One liner to change one line

    Paul Lalli schreef:
    s/$h1_sec/$mod_sec/; #replace the pattern found with the
    modified version
    Many s/$search/replace/ constructs should have been written with quotemeta,
    so that they look like:

    s/\Q${search}/replace/


    --
    To unsubscribe, e-mail: beginners-unsubscribe@perl.org For additional
    commands, e-mail: beginners-help@perl.org http://learn.perl.org/



    This e-mail is from the PA Group. For more information, see
    www.thepagroup.com.

    This e-mail may contain confidential information. Only the addressee is
    permitted to read, copy, distribute or otherwise use this email or any
    attachments. If you have received it in error, please contact the sender
    immediately. Any opinion expressed in this e-mail is personal to the sender
    and may not reflect the opinion of the PA Group.

    Any e-mail reply to this address may be subject to interception or
    monitoring for operational reasons or for lawful business practices.
  • Chas Owens at Aug 16, 2007 at 9:12 am
    On 8/16/07, Andrew Curry wrote:
    snip
    Many s/$search/replace/ constructs should have been written with quotemeta,
    so that they look like:

    s/\Q${search}/replace/
    snip
    Why?
    snip

    given

    my $search = "file.txt";

    What do you want matched? Without the quotemeta it would also match
    "fileAtxt", which is not normally the behavior you want. Basically
    it all comes down to this: always use quotemeta unless the variable is
    known to contain the string form of a regex. Also, it is a good idea
    to mark the end of the quotemeta like this

    s/\Q${search}\E/replace/
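
    The difference can be seen in a small sketch. The anchors \A and \z
    and the loop over two sample names are assumptions added for the
    demonstration; only $search and its value come from the message above.

```perl
use strict;
use warnings;

my $search = "file.txt";

# Compare an interpolated pattern with and without \Q...\E.
# Unquoted, the "." in $search matches any character; quoted, it only
# matches a literal dot.
for my $name ("file.txt", "fileAtxt") {
    my $raw    = $name =~ /\A$search\z/     ? "match" : "no match";
    my $quoted = $name =~ /\A\Q$search\E\z/ ? "match" : "no match";
    print "$name: unquoted=$raw, quoted=$quoted\n";
}
```

    With \Q in place, "fileAtxt" is no longer accepted, while "file.txt"
    matches either way.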
  • Paul Lalli at Aug 16, 2007 at 11:13 am

    On Aug 16, 5:12 am, chas.ow...@gmail.com (Chas Owens) wrote:
    Basically
    it all comes down to this: always use quotemeta unless the variable is
    known to contain the string form of a regex.
    No. It comes down to: "always use quotemeta unless the variable is
    known to contain the string form of a regexp, or known to not contain
    any meta characters."

    Paul Lalli
  • Chas Owens at Aug 16, 2007 at 2:19 pm

    *** Warning silly rhetorical argument ahead ***

    So the foo in /foo/ is not a regular expression because it contains no
    meta-characters? I don't think anyone would make that argument. So
    the fact that the string is "known to not contain any meta-characters"
    doesn't change the fact that you know what is inside it and know that
    it is a regular expression, and therefore my original rule of thumb
    should stand.

    *** end silly rhetorical argument ***
  • Paul Lalli at Aug 16, 2007 at 2:56 pm

    Silly rhetorical-ness aside, you seem unfamiliar with the term you
    introduced to this thread: "string form of a regexp":

    $ perl -le'
    $a = q{foo};
    $b = qr{foo};
    print $a;
    print $b;
    '
    foo
    (?-xism:foo)

    My assertion is that you do not need to make sure your variable is of
    the form of $b above, but only that whatever it does contain, there
    are no meta characters in it.

    Paul Lalli
  • Chas Owens at Aug 16, 2007 at 4:22 pm
    I know about the quote regex operator*, but it is not what I was
    referring to when I said "string form of a regex". I was referring
    to a string that contains a regex. qr// is just a fancy double
    quote that adds a non-capturing group and sets the appropriate options
    (in case you did something like qr/foo/i). The string "(?-xism:foo)"
    is no more or less a regex than the string "foo". By ensuring that a
    string contains no regex meta-characters you are also ensuring that it
    is a regex that will do exactly what you want. The danger inherent in
    your original code

    if (m!(<h1>(?:[a-z]+-)+[a-z]+</h1>)!i) {
    $h1_sec = $1;
    ($mod_sec = $h1_sec) =~ tr/-/ /;
    s/$h1_sec/$mod_sec/;
    }

    is that someone in the future may change the part that ensures the
    correctness of $h1_sec

    #changed code to handle _ as well as -
    if (m!(<h1>.*?</h1>)!) {
    $h1_sec = $1;
    ($mod_sec = $h1_sec) =~ tr/-_/ /; #must get rid of _ as well
    s/$h1_sec/$mod_sec/;
    }

    And everyone sits around for a few hours scratching their heads as to
    why the program no longer works like it should (especially since it
    would only work incorrectly on some input). Now, in this specific
    case the check is close to the use so the danger is minimal and a
    halfway decent coder should be able to spot the issue quickly, but if
    the use were a few pages of code away it would be much more difficult.
    This is why I state the rule the way I do. It is better to do the
    check at the same time as the use. There are two downsides to this
    advice:
    1. you have to take the time to type \Q and \E
    2. it produces slower code** (since it is running quotemeta every time)
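
    The failure mode described above can be reproduced with a short
    sketch. The helper fix_h1, its $quote switch, and the sample title
    "Why-Perl?" are assumptions for illustration; the match and the tr///
    follow the modified code quoted earlier in this message.

```perl
use strict;
use warnings;

# Hypothetical helper: the loosened match, with a switch that decides
# whether the captured text is quoted before being reused as a pattern.
sub fix_h1 {
    my ($line, $quote) = @_;
    if ($line =~ m!(<h1>.*?</h1>)!i) {
        my $h1_sec = $1;
        (my $mod_sec = $h1_sec) =~ tr/-_/ /;
        if ($quote) {
            $line =~ s/\Q$h1_sec\E/$mod_sec/;   # captured text taken literally
        } else {
            $line =~ s/$h1_sec/$mod_sec/;       # "?" is read as a quantifier
        }
    }
    return $line;
}

my $line = "<h1>Why-Perl?</h1>";
print fix_h1($line, 0), "\n";   # unquoted: the substitution finds nothing to replace
print fix_h1($line, 1), "\n";   # quoted: the hyphen becomes a space
```

    The unquoted call returns the line unchanged and raises no error,
    which is the kind of silent misbehavior on some input that the
    paragraph above warns about.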

    Now, if this is a use-once-then-throw-away situation it doesn't really
    matter and most of the rules of good software development go out the
    window. Heck, I don't know anybody who, on a regular basis, types
    something like this

    perl -Mstrict -lnwe 'our %h; my $c = y/,//; print $c unless $h{$c}++' load.csv foo

    even though use strict and use warnings are the first thing most of us
    say to new Perl users.

    * its use here would definitely be overkill since its primary use is
    to allow the definition of complex regexes in pieces like so

    my $identifier = qr/ [a-zA-Z_] \w*/x;
    my $expression = qr/ $identifier | \d+ /x;
    my $assignment = qr/ $identifier \s* = \s* $expression \s* ; /x ;

    ** this slowdown can be mostly mitigated by using the o option, but
    only if the variable will never need to be changed.
  • John W. Krahn at Aug 16, 2007 at 5:15 pm

    Chas Owens wrote:
    I know about the quote regex operator*, but it is not what I was
    referring to when I said "string form of a regex". I was referring
    to a string that contains a regex. qr// is just a fancy double
    quote that adds a non-capturing group and sets the appropriate options
    (in case you did something like qr/foo/i). The string "(?-xism:foo)"
    is no more or less a regex than the string "foo".

    If "qr// is just a fancy double quote" then \b would represent the backspace
    character and not a word boundary for example:

    $ perl -le'my $x = qq/(?i:\bfoo\b)/; my $y = qr/\bfoo\b/i; print for $x, $y'
    (?ifo)
    (?i-xsm:\bfoo\b)


    The object created by qr// *is* a compiled regex:

    $ perl -le'my $x = qq/(?i:foo)/; my $y = qr/foo/i; print ref for $x, $y'

    Regexp

    $ perl -Mre=debug -le'my $y = qr/foo/i'
    Freeing REx: `","'
    Compiling REx `foo'
    size 3 Got 28 bytes for offset annotations.
    first at 1
    1: EXACTF <foo>(3)
    3: END(0)
    stclass "EXACTF <foo>" minlen 3
    Offsets: [3]
    1[3] 0[0] 4[0]
    Freeing REx: `"foo"'




    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
  • Chas Owens at Aug 16, 2007 at 11:00 pm
    On 8/16/07, John W. Krahn wrote:
    snip
    The object created by qr// *is* a compiled regex:
    snip

    Yeah my bad, the stringification confused me for a minute.
  • Gunnar Hjalmarsson at Aug 16, 2007 at 5:35 pm

    Chas Owens wrote:
    qr// is just a fancy double
    quote that adds a non-capturing group and sets the appropriate options
    (in case you did something like qr/foo/i). The string "(?:-xism:foo)"
    is no more or less a regex than the string "foo".
    Let's look into that.

    C:\home>type test.pl
    $re1 = '(?i-xsm:foo)';
    $re2 = qr(foo)i;
    foreach ($re1, $re2) {
        print $_, "\n";
        if ( $ret = ref $_ ) {
            print "Scalar variable type: $ret\n\n";
        } else {
            print "Scalar variable type: plain string\n\n";
        }
    }

    C:\home>perl test.pl
    (?i-xsm:foo)
    Scalar variable type: plain string

    (?i-xsm:foo)
    Scalar variable type: Regexp


    C:\home>

    Apparently there is more to it; maybe it has something to do with qr//
    compiling the string that is passed to the operator.

    http://perldoc.perl.org/perlop.html#qr%2fSTRING%2fimosx-qr-%2fi-%2fm-%2fo-%2fs-%2fx

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
  • Andrew Curry at Aug 16, 2007 at 9:23 am
    Apologies, I misread the original remark.


Discussion Overview
group: beginners @
categories: perl
posted: Aug 15, '07 at 4:45p
active: Aug 16, '07 at 11:00p
posts: 16
users: 8
website: perl.org
