Hi all,

I'm struggling with an RD grammar problem and am hoping you can help.

I've got some data that is embedded inside a file and I need to parse only the
embedded data and leave the "noise" untouched.

For example:

afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a

I want to parse the command(s), remove the DELIMITERS and preserve everything
else.

In the past, I've looped over the file with a regex looking for the delimiters
and then run RD on the text inside. However, launching several instances of
the parser is very expensive, about 80% of runtime.

I'd like to be able to use one parser and have it "do" the entire file.

What I've tried amounts to this:

chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/

However, I think the first regex is eating too much.

Any suggestions on how to do this?

TIA.

--

Take care and have fun,
Mike Diehl.


  • Ted Zlatanov at Sep 2, 2009 at 10:57 am
    On Tue, 1 Sep 2009 20:33:22 -0600 Mike Diehl wrote:

    MD> Hi all,
    MD> I'm struggling with an RD grammar problem and am hoping you can help.

    MD> I've got some data that is embedded inside a file and I need to parse only the
    MD> embedded data and leave the "noise" untouched.

    MD> For example:

    MD> afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a

    MD> I want to parse the command(s), remove the DELIMITERS and preserve everything
    MD> else.

    MD> In the past, I've looped over the file with a regex looking for the delimiters
    MD> and then running RD on the text inside. However, the cost of launching
    MD> several instances of the parser is very expensive, about 80% of runtime.

    MD> I'd like to be able to use one parser and have it "do" the entire file.

    MD> What I've tried amounts to this:

    MD> chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/

    MD> However, I think the first regex is eating too much.

    MD> Any suggestions on how to do this?

    This seems reasonable. Can you show a full runnable example that fails?

    Ted
  • Damian Conway at Sep 3, 2009 at 7:51 am
    Hi Mike,

    > What I've tried amounts to this:
    >
    > chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/

    Unfortunately that won't work, because every regex in a PRD grammar is
    independent of the rest of the grammar, so even a minimal-matching .*?
    eats everything.

    Is there some reason you can't use something like:

    my $parser = Parse::RecDescent->new($grammar);

    $text =~ s{<DELIMITER> (.*?) </DELIMITER>}
    { $parser->parse($1); q{} }gexs;

    ???

    Damian
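
A minimal sketch of the substitution approach suggested above, written so that it runs without Parse::RecDescent installed. The names `parse_commands` and `extract_commands` are hypothetical and not from the thread; `parse_commands` merely stands in for `$parser->parse($1)`. Unlike the one-liner in the post, this version also collects what the parser returns instead of discarding it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $parser->parse($1): split the delimited text
# into whitespace-separated commands. A real Parse::RecDescent rule call
# would go here instead.
sub parse_commands {
    my ($inner) = @_;
    return [ split ' ', $inner ];
}

# Remove every <DELIMITER>...</DELIMITER> region from $text, hand its
# contents to the parser callback, and return the remaining "noise"
# together with the collected parse results.
sub extract_commands {
    my ($text, $parse) = @_;
    my @results;
    $text =~ s{<DELIMITER> \s* (.*?) \s* </DELIMITER>}
              { push @results, $parse->($1); q{} }gexs;
    return ($text, \@results);
}

my $input = 'afaf asf af <DELIMITER> command command command </DELIMITER> asdf asd qer f a';
my ($noise, $results) = extract_commands($input, \&parse_commands);

print "noise:    $noise\n";
print "commands: @{ $results->[0] }\n";
```

The parser callback is built once and reused for every delimited region, so the only per-region cost is the parse itself.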
  • Mike Diehl at Sep 3, 2009 at 3:49 pm

    On Thursday 03 September 2009 01:50:58 Damian Conway wrote:
    > Hi Mike,
    >
    >> What I've tried amounts to this:
    >>
    >> chunk: /.*?/ delimiter_start command(s) delimiter_end /.*?/
    >
    > Unfortunately that won't work, because every regex in a PRD grammar is
    > independent of the rest of the grammar, so even a minimal-matching .*?
    > eats everything.

    Ya, that's what I was suspecting. In hindsight, I should have figured that;
    that's how I'd write it...

    > Is there some reason you can't use something like:
    >
    > my $parser = Parse::RecDescent->new($grammar);
    >
    > $text =~ s{<DELIMITER> (.*?) </DELIMITER>}
    > { $parser->parse($1); q{} }gexs;

    That's what I was doing, but it seems I misinterpreted my profiling results.
    I found from profiling that the function I use to create (once) and run the
    parser accounted for 80% of runtime.

    I assumed that since I only create the parser once (if !defined), creating the
    parser wasn't where the cost was. So I decided that it must be due to
    actually running the parser, which might run several times during program
    execution. My conclusion was that I needed to rewrite the grammar so that
    the parser would only run once.

    It sounds like I may need to go back to the old algorithm and start tuning the
    grammar.

    --

    Take care and have fun,
    Mike Diehl.
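
One way to check the assumption in the post above (one construction, many runs) is to count the two events separately. This is only a sketch: `build_parser` and `run_parser` are hypothetical names, and the closure returned by `build_parser` stands in for a real `Parse::RecDescent->new($grammar)` object.

```perl
use strict;
use warnings;

my $builds = 0;   # how many times the parser was constructed
my $runs   = 0;   # how many times it was actually run

# Hypothetical stand-in for Parse::RecDescent->new($grammar).
sub build_parser {
    $builds++;
    return sub {
        my ($text) = @_;
        $runs++;
        return [ split ' ', $text ];
    };
}

my $parser;       # cached across calls

# Create the parser on first use only, then run it on the given text.
sub run_parser {
    my ($text) = @_;
    $parser //= build_parser();
    return $parser->($text);
}

run_parser($_) for 'a b', 'c', 'd e f';
print "builds=$builds runs=$runs\n";
```

If the build counter stays at 1 while the function still dominates the profile, the time is going into the runs, which points at the grammar rather than at parser construction.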
  • Matthew Braid at Sep 3, 2009 at 10:23 pm
    Hi all,

    Would there be some way of manipulating the skip re to do this?

    Something along the lines of:

    top: <skip: /NOT START DELIMITER/> chunk(s) eof
    chunk: delimiter_start <skip: /NORMAL SKIP/> command(s) delimiter_end
    eof: /\Z/

    The problem there is defining a skip that won't skip a
    delimiter_start. This probably won't allow delimiter_start to _not_
    mean the start of a set of commands as well.

    Not tested, but just a suggestion.

    MB
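
As a plain-Perl illustration of "a skip that won't skip a delimiter_start", the pattern below consumes characters only while the start delimiter does not begin at the current position. The literal `START` delimiter and the variable names are assumptions for the example, and whether this pattern behaves well inside a `<skip:>` directive is untested here, just as in the post.

```perl
use strict;
use warnings;

# Match any run of characters, stopping just before the literal START.
# The negative lookahead is checked before each character is consumed.
my $not_start = qr/(?:(?!START).)*/s;

my $text = "blah blah\nSTART test; END more";

my ($skipped) = $text =~ /\A($not_start)/;
my $rest = substr($text, length $skipped);

print "skipped: [$skipped]\n";
print "rest:    [$rest]\n";
```

It shares the limitation mentioned in the post: every occurrence of the start delimiter ends the skip, whether or not commands follow it.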

  • Matthew Braid at Sep 6, 2009 at 11:47 pm
    Sorry, replying to myself, but I just stumbled across a similar
    situation and my solution might help you too.

    I needed to define a block like this:

    perl until FLAG
    PERL
    FLAG;

    which is like a 'here-doc' for inlining perl in another language that
    doesn't require actually parsing the perl code. Like your input, I
    need to match 'anything' up to the closing flag. I ended up using a
    rule similar to your original solution, except instead of having a
    /.*?/ match, I combined that with the next terminal. After playing
    around a bit, I came up with the following test script that parses out
    all valid chunks between 'START' and 'END' amongst other rubbish in
    the input in one pass:

    ====== START CODE ======

    use strict;
    use warnings;
    use Parse::RecDescent;
    use Data::Dumper;
    #$::RD_TRACE = 1;   # uncomment to trace the parse

    # assuming start/end delimiters of START and END
    my $grammar = <<'STOP';

    start:
        chunk(s?)

    chunk:
        /.*?START/s command(s) 'END' # This is the important bit
            {$item[2]}

    command:
        'test' ';'
            {"TEST COMMAND"}

    STOP

    my $text = << 'STOP';

    blah blah blayh

    asdsd kjkl

    START
    test;
    test;
    END

    kjsaljdlk
    askd

    START
    test;
    END

    sad
    asdgfdsf
    gfsfg

    STOP

    my $res = Parse::RecDescent->new($grammar)->start($text);
    print Data::Dumper::Dumper($res), "\n";

    ====== END CODE ======

    Note that the /s modifier on the 'garbage scooping' re's is important
    for this to work. Was scratching my head over that for a bit :)
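
The effect of the /s modifier can be seen with plain Perl regexes, outside Parse::RecDescent. This sketch assumes, as described earlier in the thread, that a regex terminal is tried on its own at the current position in the text; the sample input and variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "blah blah\nasdsd kjkl\nSTART\ntest;\nEND\n";

# Without /s, "." cannot cross a newline, so a START that sits on a
# later line is never reached and the match fails.
my $without_s = $text =~ /\A.*?START/ ? 1 : 0;

# With /s, the minimal match extends across lines up to the first START.
my ($garbage) = $text =~ /\A(.*?)START/s;

print "without /s matched: $without_s\n";
print "garbage scooped with /s: [$garbage]\n";
```

Because the delimiter is part of the same pattern, the minimal match has something to stop at, which is what a separate /.*?/ terminal lacks.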

    The output of that is:

    ====== START OUTPUT ======

    $VAR1 = [
              [
                'TEST COMMAND',
                'TEST COMMAND'
              ],
              [
                'TEST COMMAND'
              ]
            ];

    ====== END OUTPUT ======

    I haven't done any benchmarking, but that might be faster than
    sequential parses of 'clean' data. My original solution anchored to
    the end of the input with an eof marker and a 'trailing_guff' rule
    that matched anything after the chunk(s?) subrule, but that is
    unnecessary.

    MB


Discussion Overview
group: recdescent @
categories: perl
posted: Sep 2, '09 at 2:32a
active: Sep 6, '09 at 11:47p
posts: 6
users: 4
website: metacpan.org...
