Hello again,

Yesterday I had a question on pattern matching. A couple of people responded
with very useful information. After some finagling, I got my rudimentary
code to work. I'm a PhD student studying computational linguistics without
any formal programming training. While there are various modules that can be
applied to my questions, our professor wants us to manually code things so
we understand the wider problems of computational linguistics. With that,
here is what I'm trying to do.

In a given file, I believe it was XML originally, insert <s> at the
beginning of every sentence and </s> at the end of every sentence. So far,
I've got the following. The output is in *bold*.

$hello = "This is some sample text.";

$hello =~ s/^../<s>/gi;
$hello =~ s/..$/<\/s>/gi;

print "$hello\n";

*<s>is is some sample tex</s>*

I can see why this is happening. I'm telling the program to do exactly what
it did. But what I want the output to look like is this.

*<s> This is some sample text.</s>*

Any comments are very appreciated. This is a very helpful crowd.

Cheers.

Zach


--
--------------------------------------------------------------------------------------------------
Zachary S. Brooks
PhD Student in Second Language Acquisition and Teaching (SLAT)
The University of Arizona - http://www.coh.arizona.edu/slat/
Graduate Associate in Teaching - Department of English
M.A. Applied Linguistics - University of Massachusetts Boston
---------------------------------------------------------------------------------------------------


  • Shawn wilson at Nov 14, 2010 at 4:58 pm

    On Sun, Nov 14, 2010 at 11:42 AM, Zachary Brooks wrote:

    > In a given file, I believe it was XML originally, insert <s> at the
    > beginning of every sentence and </s> at the end of every sentence. So far,
    > I've got the following. The output is in *bold*.

    first, forget about testing tons of regex in programs. if you're
    trying to learn, it'll make you go nuts. try something like
    http://regexpal.com/ or google for other 'regex tester' sites. there are
    also programs (ymmv).

    btw, i don't see your bold...

    > $hello = "This is some sample text.";
    >
    > $hello =~ s/^../<s>/gi;
    > $hello =~ s/..$/<\/s>/gi;

    second, why not use a place holder like someone recommended yesterday?
    something like:

    s/^(.+)$/<s>\1<\/s>/g
  • Uri Guttman at Nov 14, 2010 at 6:53 pm
    "sw" == shawn wilson writes:
    sw> second, why not use a place holder like someone recommended yesterday?
    sw> something like:
    sw> s/^(.+)$/<s>\1<\/s>/g

    what is a placeholder? there is nothing like that in regexes. what you
    have there is a backreference, and it is used in the wrong place. \1 is
    meant to be used ONLY in the regex part, not the replacement section. use
    $1 to get the first grabbed part when in the replacement part. with
    warnings enabled, your code will generate warnings:

    perl -wle '$x = "a" ; $x =~ s/(a)/\1\1/'
    \1 better written as $1 at -e line 1.
    \1 better written as $1 at -e line 1.
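
A minimal sketch of the corrected substitution, assuming the same single-line sample string from the original question. It only illustrates the `$1` point made above; the `s{...}{...}` delimiters are a choice made here so that `</s>` needs no escaping.

```perl
use strict;
use warnings;

# Sample string taken from the original question.
my $hello = "This is some sample text.";

# (.+) captures the whole line; $1 (not \1) re-inserts it in the replacement.
# Brace delimiters mean the "/" in "</s>" does not have to be escaped.
$hello =~ s{^(.+)$}{<s>$1</s>};

print "$hello\n";    # prints: <s>This is some sample text.</s>
```

Run with `perl -w`, this version produces no "better written as $1" warning.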

    uri

    --
    Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
  • Shawn wilson at Nov 14, 2010 at 7:03 pm
    On Sun, Nov 14, 2010 at 1:53 PM, Uri Guttman wrote:

    > what is a placeholder? nothing like that in regexes. what you have there
    > is a backreference and used in the wrong place. \1 is meant to be used
    > ONLY in the regex part, not the replacement section. use $1 to get
    > the first grabbed part when in the replacement part. your code will
    > generate warnings:

    yep, got confused with sed....
  • Zachary Brooks at Nov 14, 2010 at 7:04 pm
    What happened when I used the code --

    $hello =~ s/^(.+)$/<s>\1<\/s>/gis;

    -- is that it properly marked <s> at the beginning of the sentence and </s>
    at the end of the sentence, but then it only worked for one sentence.

    Any suggestions on getting <s> to appear at the beginning of every sentence
    and </s> to appear at the end of every sentence for more than one sentence?

    Zach
  • Rob Dixon at Nov 14, 2010 at 7:24 pm

    On 14/11/2010 19:04, Zachary Brooks wrote:

    > Any suggestions on getting <s> to appear at the beginning of every sentence
    > and </s> to appear at the end of every sentence for more than one sentence?

    You must think carefully about what constitutes a 'sentence'. A string
    starting with a capital letter and ending with a full stop is the most
    basic definition, but is unlikely to be sufficient for your purposes
    unless your data is very simple.

    The program below uses this definition to enclose all 'sentences' in a
    multi-line string in <s> tags. I hope it helps you to get started.

    - Rob


    use strict;
    use warnings;

    my $text = "
    This is some sample text. It has
    three sentences, all beginning with
    a capital letter and ending with a full
    stop. Proper recognition of a 'sentence'
    could get extremely complicated.";

    $text =~ s|([A-Z].*?\.)|<s>$1</s>|gs;

    print $text;

    __END__

    **OUTPUT**


    <s>This is some sample text.</s> <s>It has
    three sentences, all beginning with
    a capital letter and ending with a full
    stop.</s> <s>Proper recognition of a 'sentence'
    could get extremely complicated.</s>
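
To see why the capital-letter-to-full-stop definition is "unlikely to be sufficient", here is a small sketch that applies the same pattern to a hypothetical input containing an abbreviation. The input string and the `$tagged` variable are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: "Dr." ends in a full stop but does not end a sentence.
my $text = "Dr. Smith arrived. He was late.";

# Same pattern as in the program above: a capital letter, the shortest
# possible run of characters, then a full stop.
(my $tagged = $text) =~ s|([A-Z].*?\.)|<s>$1</s>|gs;

print "$tagged\n";
# prints: <s>Dr.</s> <s>Smith arrived.</s> <s>He was late.</s>
```

The abbreviation is tagged as a sentence of its own, which is the kind of case a fuller definition of 'sentence' would have to handle.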
  • Shawn wilson at Nov 14, 2010 at 7:16 pm
    so, if you've got a file, do something like:

    while ($line = <FH> ) {
    $line =~ m/^(.+)$/ig;
    print "<s>$1<\/s>\n";
    }
  • Jim Gibson at Nov 14, 2010 at 8:03 pm

    At 2:16 PM -0500 11/14/10, shawn wilson wrote:

    > so, if you've got a file, do something like:
    >
    > while ($line = <FH> ) {
    > $line =~ m/^(.+)$/ig;
    > print "<s>$1<\/s>\n";
    > }

    If all you want to do is print each line in the file surrounded by <s>
    tags, you don't need regular expressions, and you don't need to
    escape forward slash characters in double-quoted strings:

    while (my $line = <FH>) {
        chomp($line);               # drop the trailing newline before wrapping
        print "<s>$line</s>\n";
    }
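
For anyone who wants to try the loop above without a file on disk, here is a self-contained sketch. The sample data, the lexical filehandle `$fh`, and the `@tagged` array are assumptions added here; the loop body is otherwise the same idea.

```perl
use strict;
use warnings;

# Hypothetical input standing in for a real file. Opening a reference to a
# scalar gives an in-memory filehandle, so the sketch runs as-is.
my $data = "First line.\nSecond line.\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

my @tagged;
while (my $line = <$fh>) {
    chomp($line);                    # remove the trailing newline
    push @tagged, "<s>$line</s>";    # wrap the whole line, no regex needed
}
close($fh);

print "$_\n" for @tagged;
```

With a real file, the `open` call would take a file name in place of `\$data`. Note that this tags lines, not sentences.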

    As Rob said, the hard thing with this task is finding out where
    sentences begin and end.

    --
    Jim Gibson
    Jim@Gibson.org
  • Shawn H Corey at Nov 14, 2010 at 5:15 pm

    On 10-11-14 11:42 AM, Zachary Brooks wrote:

    > $hello = "This is some sample text.";
    >
    > $hello =~ s/^../<s>/gi;
    > $hello =~ s/..$/<\/s>/gi;
    >
    > print "$hello\n";
    >
    > *<s>is is some sample tex</s>*

    The meta-character '.' matches every character except a newline. The
    first substitution replaces 'Th' with '<s>'. The second, 't.' with '</s>'.

    When faced with a problem like this, it is best to write down everything
    that is a sentence and determine what is common in all cases. That will
    tell you how to create your matching patterns.
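
As a small illustration of the point about '.', here is a sketch of the original two substitutions with the dots removed. The anchors `^` and `$` match positions, not characters, so nothing is consumed. This assumes the single-line sample string from the question and does not address finding sentence boundaries.

```perl
use strict;
use warnings;

# Sample string taken from the original question.
my $hello = "This is some sample text.";

# ^ and $ are zero-width: they match the start and end positions only,
# so the tags are inserted and no existing characters are replaced.
$hello =~ s/^/<s>/;
$hello =~ s/$/<\/s>/;

print "$hello\n";    # prints: <s>This is some sample text.</s>
```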


    --
    Just my 0.00000002 million dollars worth,
    Shawn

    Programming is as much about organization and communication
    as it is about coding.

    The secret to great software: Fail early & often.

    Eliminate software piracy: use only FLOSS.

Discussion Overview
group: beginners @
categories: perl
posted: Nov 14, '10 at 4:43p
active: Nov 14, '10 at 8:03p
posts: 9
users: 6
website: perl.org
