Hello,

I want to parse this string:

version 3.5.1 {

$pid_dir = /opt/samba-3.5.1/var/locks/
$bin_dir = /opt/samba-3.5.1/bin/

service smbd {
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid
}
service nmbd {
bin = ${bin_dir}nmbd -D
pid = ${pid_dir}nmbd.pid
}
service winbindd {
bin = ${bin_dir}winbindd -D
pid = ${pid_dir}winbindd.pid
}
}

version 3.2.14 {

$pid_dir = /opt/samba-3.5.1/var/locks/
$bin_dir = /opt/samba-3.5.1/bin/

service smbd {
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid
}
service nmbd {
bin = ${bin_dir}nmbd -D
pid = ${pid_dir}nmbd.pid
}
service winbindd {
bin = ${bin_dir}winbindd -D
pid = ${pid_dir}winbindd.pid
}
}

Step 1:

version 3.2.14 {

$pid_dir = /opt/samba-3.5.1/var/locks/
$bin_dir = /opt/samba-3.5.1/bin/

service smbd {
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid
}
service nmbd {
bin = ${bin_dir}nmbd -D
pid = ${pid_dir}nmbd.pid
}
service winbindd {
bin = ${bin_dir}winbindd -D
pid = ${pid_dir}winbindd.pid
}
}

Step 2:
service smbd {
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid
}
Step 3:
$pid_dir = /opt/samba-3.5.1/var/locks/
$bin_dir = /opt/samba-3.5.1/bin/

Step 4:
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid

My Regular Expressions:
version[\s]*[\w\.]*[\s]*\{[\w\s\n\t\{\}=\$\.\-_\/]*\}
service[\s]*[\w]*[\s]*\{([\n\s\w\=]*(\$\{[\w_]*\})*[\w\s\-=\.]*)*\}

I don't think this is a good solution. I'm now trying with groups:
(service[\s\w]*)\{([\n\w\s=\$\-_\.]*)
but this part causes problems: ${bin_dir}

Kind Regards

Richi
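For what it's worth, the four extraction steps above can be sketched with the stdlib re module alone. This is a fragile sketch, not a general parser: it leans on the fact that a version block's closing brace is the only line-initial "}" followed by another "version" header or the end of input, and all the pattern names below are invented for the example.

```python
import re

SAMPLE = """\
version 3.5.1 {

$pid_dir = /opt/samba-3.5.1/var/locks/
$bin_dir = /opt/samba-3.5.1/bin/

service smbd {
bin = ${bin_dir}smbd -D
pid = ${pid_dir}smbd.pid
}
}
"""

# Step 1: a version block ends at the first line-initial "}" that is
# followed by another "version" header or the end of input.
version_re = re.compile(r"version\s+([\d.]+)\s*\{(.*?)\n\}\s*(?=version\s|\Z)", re.S)
# Step 2: a service block (no further nesting inside it).
service_re = re.compile(r"service\s+(\w+)\s*\{(.*?)\n\}", re.S)
# Step 3: "$name = value"; the mandatory word character after "$" is what
# keeps "${bin_dir}" from matching here.
assign_re = re.compile(r"\$(\w+)\s*=\s*(\S+)")
# Step 4: "name = value", one per line.
setting_re = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$", re.M)

for number, body in version_re.findall(SAMPLE):
    print("version", number)
    print("  assigns:", assign_re.findall(body))
    for name, svc_body in service_re.findall(body):
        print("  service", name, setting_re.findall(svc_body))
```

A single extra brace, or a closing brace that happens to be indented, breaks the step-1 pattern: exactly the fragility of regex-only approaches to nested input.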


  • Chris Rebert at Apr 7, 2010 at 8:52 am

    On Wed, Apr 7, 2010 at 1:37 AM, Richard Lamboj wrote:
    i want to parse this String:
    (snip)
    My Regular Expressions:
    version[\s]*[\w\.]*[\s]*\{[\w\s\n\t\{\}=\$\.\-_\/]*\}
    service[\s]*[\w]*[\s]*\{([\n\s\w\=]*(\$\{[\w_]*\})*[\w\s\-=\.]*)*\}
    (snip)
    Regular expressions != Parsers

    Every time someone tries to parse nested structures using regular
    expressions, Jamie Zawinski kills a puppy.

    Try using an *actual* parser, such as Pyparsing:
    http://pyparsing.wikispaces.com/

    Cheers,
    Chris
    --
    Some people, when confronted with a problem, think:
    "I know, I'll use regular expressions." Now they have two problems.
    http://blog.rebertia.com
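By way of illustration, a minimal sketch of such a grammar in pyparsing (this assumes the pyparsing package is installed; the rule names and layout are invented for the example, not taken from pyparsing's documentation):

```python
from pyparsing import (Group, OneOrMore, Regex, Suppress, Word,
                       ZeroOrMore, alphanums, alphas, nums)

ident = Word(alphas + "_", alphanums + "_")
# a value runs to the end of its line, so "${bin_dir}smbd -D" stays intact
value = Regex(r"[^\n]+")

assign = Group(Suppress("$") + ident + Suppress("=") + value)
setting = Group(ident + Suppress("=") + value)
service = Group(Suppress("service") + ident + Suppress("{")
                + Group(ZeroOrMore(setting)) + Suppress("}"))
version = Group(Suppress("version") + Word(nums + ".") + Suppress("{")
                + Group(ZeroOrMore(assign)) + Group(OneOrMore(service))
                + Suppress("}"))
config = OneOrMore(version)

result = config.parseString("""
version 3.5.1 {
$bin_dir = /opt/samba-3.5.1/bin/
service smbd {
bin = ${bin_dir}smbd -D
}
}
""")
print(result[0][0])                 # the version number
print(result[0][2][0][1].asList())  # settings of the first service
```

Note how the nesting that defeats a single regex is just a pair of Suppress("{") ... Suppress("}") brackets around a sub-grammar here.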
  • Richard Lamboj at Apr 7, 2010 at 9:44 am

    On Wednesday 07 April 2010 10:52:14, Chris Rebert wrote:
    On Wed, Apr 7, 2010 at 1:37 AM, Richard Lamboj wrote:
    i want to parse this String:
    (snip)
    Try using an *actual* parser, such as Pyparsing:
    http://pyparsing.wikispaces.com/
    (snip)
    Well, after some trying with regex, you're both right. I will use
    pyparsing; it seems to be the better solution.

    Kind Regards
  • Bruno Desthuilliers at Apr 7, 2010 at 9:13 am

    Richard Lamboj wrote:
    Hello,

    i want to parse this String:

    (snip)

    I think you'd be better off writing a specific parser here. Paul McGuire's
    PyParsing package might help:

    http://pyparsing.wikispaces.com/

    My 2 cents.
  • Patrick Maupin at Apr 8, 2010 at 1:25 am

    On Apr 7, 3:52 am, Chris Rebert wrote:

    Regular expressions != Parsers
    True, but lots of parsers *use* regular expressions in their
    tokenizers. In fact, if you have a pure Python parser, you can often
    get huge performance gains by rearranging your code slightly so that
    you can use regular expressions in your tokenizer, because that
    effectively gives you access to a fast, specialized C library that is
    built into practically every Python interpreter on the planet.
    Every time someone tries to parse nested structures using regular
    expressions, Jamie Zawinski kills a puppy.
    And yet, if you are parsing stuff in Python, and your parser doesn't
    use some specialized C code for tokenization (which will probably be
    regular expressions unless you are using mxtexttools or some other
    specialized C tokenizer code), your nested structure parser will be
    dog slow.

    Now, for some applications, the speed just doesn't matter, and for
    people who don't yet know the difference between regexps and parsing,
    pointing them at PyParsing is certainly doing them a valuable service.

    But that's twice today when I've seen people warned off regular
    expressions without a cogent explanation that, while the re module is
    good at what it does, it really only handles the very lowest level of
    a parsing problem.

    My 2 cents is that something like PyParsing is absolutely great for
    people who want a simple parser without a lot of work. But if people
    use PyParsing, and then find out that (for their particular
    application) it isn't fast enough, and then wonder what to do about
    it, if all they remember is that somebody told them not to use regular
    expressions, they will just come to the false conclusion that pure
    Python is too painfully slow for any real world task.

    Regards,
    Pat
  • Nobody at Apr 8, 2010 at 10:13 am

    On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:

    Regular expressions != Parsers
    True, but lots of parsers *use* regular expressions in their
    tokenizers. In fact, if you have a pure Python parser, you can often
    get huge performance gains by rearranging your code slightly so that
    you can use regular expressions in your tokenizer, because that
    effectively gives you access to a fast, specialized C library that is
    built into practically every Python interpreter on the planet.
    Unfortunately, a typical regexp library (including Python's) doesn't allow
    you to match against a set of regexps, returning the index of which one
    matched. Which is what you really want for a tokeniser.
    Every time someone tries to parse nested structures using regular
    expressions, Jamie Zawinski kills a puppy.
    And yet, if you are parsing stuff in Python, and your parser doesn't
    use some specialized C code for tokenization (which will probably be
    regular expressions unless you are using mxtexttools or some other
    specialized C tokenizer code), your nested structure parser will be
    dog slow.
    The point is that you *cannot* match arbitrarily-nested expressions using
    regexps. You could, in theory, write a regexp which will match any valid
    syntax up to N levels of nesting, for any finite N. But in practice, the
    regexp is going to look horrible (and is probably going to be quite
    inefficient if the regexp library uses backtracking rather than a DFA).

    Even tokenising with Python's regexp interface is inefficient if the
    number of token types is large, as you have to test against each regexp
    sequentially.

    Ultimately, if you want an efficient parser, you need something with a C
    component, e.g. Plex.
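That said, the stdlib can get close to "which of a set of regexps matched": join the alternatives into one alternation of named groups, and m.lastgroup reports the name of the winning alternative. A sketch with made-up token names:

```python
import re

# One alternation of named groups; order matters (FLOAT before INT so
# "-1.24e3" is not split). m.lastgroup names the alternative that matched.
TOKEN_SPEC = [
    ("FLOAT", r"[+-]?\d+\.\d*(?:[eE][+-]?\d+)?"),
    ("INT",   r"[+-]?\d+"),
    ("ID",    r"[A-Za-z_]\w*"),
    ("SKIP",  r"\s+"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(tokenize("pid_dir 42 -1.24e3")))
# -> [('ID', 'pid_dir'), ('INT', '42'), ('FLOAT', '-1.24e3')]
```

The alternatives are still tried in order, but inside the C matching engine rather than in a Python-level loop over separate patterns.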
  • Richard Lamboj at Apr 8, 2010 at 11:56 am
    At the moment I have little time, so it is painful to read about parsing, but
    it is quite interesting.

    I have taken a look at the different parsing modules, and I am reading the
    source code to understand how they work. Since yesterday I have been writing
    my own small engine, just for fun and to understand how I can get what I need.

    It seems that this is hard to code, because the logic is complex and sometimes
    confusing. It is not easy to find a "perfect" solution.

    If someone knows good links on this topic, or can explain how parsers
    should/could work, please post them.

    Thanks for the information and the help!

    Kind Regards

    Richi
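Not a tutorial, but the core idea of a hand-written ("recursive descent") parser is small enough to sketch: split the input into tokens, then write one function per grammar rule that consumes tokens, recursing for nested blocks. Everything below is a toy invented for illustration:

```python
import re

def tokenize(text):
    # braces and bare words are the only token kinds in this toy grammar
    return re.findall(r"[{}]|[^\s{}]+", text)

def parse_block(toks, i=0):
    """Consume tokens until an unmatched '}' or the end of input,
    returning (items, next_index); a '{' recurses for the nested block."""
    items = []
    while i < len(toks) and toks[i] != "}":
        if toks[i] == "{":
            sub, i = parse_block(toks, i + 1)
            if i >= len(toks) or toks[i] != "}":
                raise SyntaxError("missing closing brace")
            items.append(sub)
            i += 1
        else:
            items.append(toks[i])
            i += 1
    return items, i

tree, _ = parse_block(tokenize("version 3.5.1 { service smbd { bin x } }"))
print(tree)
# -> ['version', '3.5.1', ['service', 'smbd', ['bin', 'x']]]
```

The recursion is what handles arbitrary nesting depth, which is exactly the part a single regular expression cannot express.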
  • Charles at Apr 8, 2010 at 12:02 pm
    "Nobody" <nobody at nowhere.com> wrote in message
    news:pan.2010.04.08.10.12.59.594000 at nowhere.com...
    On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:

    Regular expressions != Parsers
    True, but lots of parsers *use* regular expressions in their
    tokenizers. In fact, if you have a pure Python parser, you can often
    get huge performance gains by rearranging your code slightly so that
    you can use regular expressions in your tokenizer, because that
    effectively gives you access to a fast, specialized C library that is
    built into practically every Python interpreter on the planet.
    Unfortunately, a typical regexp library (including Python's) doesn't allow
    you to match against a set of regexps, returning the index of which one
    matched. Which is what you really want for a tokeniser.
    [snip]

    Really!?
    I am only a Python newbie, but what about this:

    import re
    rr = [
        ( "id",    '([a-zA-Z][a-zA-Z0-9]*)' ),
        ( "int",   '([+-]?[0-9]+)' ),
        ( "float", '([+-]?[0-9]+\.[0-9]*)' ),
        ( "float", '([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' )
    ]
    tlist = [ t[0] for t in rr ]
    pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
    p = re.compile(pat)

    ss = [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]
    for s in ss:
        m = p.match(s)
        if m:
            ix = [ i-2 for i in range(2,6) if m.group(i) ]
            print "'"+s+"' matches and has type", tlist[ix[0]]
        else:
            print "'"+s+"' does not match"

    output:
    ' annc' matches and has type id
    '1234' matches and has type int
    'abcd' matches and has type id
    ' 234sz ' does not match
    '-1.24e3' matches and has type float
    '5.' matches and has type float

    seems to me to match a (small) set of regular expressions and
    indirectly return the index of the matched expression, without
    doing a sequential loop over the regular expressions.

    Of course there is a loop over the results of the match to determine
    which sub-expression matched, but a good regexp library (which
    I presume Python has) should match the sub-expressions without
    looping over them. The techniques to do this were well known in
    the 1970s when the first versions of lex were written.

    Not that I would recommend tricks like this. The regular
    expression would quickly get out of hand for any non-trivial
    list of regular expressions to match.

    Charles
  • Neil Cerutti at Apr 10, 2010 at 3:39 pm

    On 2010-04-08, Richard Lamboj wrote:
    If someone knows good links on this topic, or can explain how
    parsers should/could work, please post them.

    Thanks for the information and the help!
    I liked Crenshaw's "Let's Build a Compiler!". It's pretty trivial
    to convert his Pascal to Python, and you'll get to basic parsing
    in no time. URL:http://compilers.iecc.com/crenshaw/

    --
    Neil Cerutti
  • Patrick Maupin at Apr 10, 2010 at 4:21 pm

    On Apr 8, 5:13 am, Nobody wrote:
    On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote:
    Regular expressions != Parsers
    True, but lots of parsers *use* regular expressions in their
    tokenizers. In fact, if you have a pure Python parser, you can often
    get huge performance gains by rearranging your code slightly so that
    you can use regular expressions in your tokenizer, because that
    effectively gives you access to a fast, specialized C library that is
    built into practically every Python interpreter on the planet.
    Unfortunately, a typical regexp library (including Python's) doesn't allow
    you to match against a set of regexps, returning the index of which one
    matched. Which is what you really want for a tokeniser.
    Actually, there is some not very-well-documented code in the re module
    that will let you do exactly that. But even not using that code,
    using a first cut of re.split() or re.finditer() to break the string
    apart into tokens (without yet classifying them) is usually a huge
    performance win over anything else in the standard library (or that
    you could write in pure Python) for this task.
    Every time someone tries to parse nested structures using regular
    expressions, Jamie Zawinski kills a puppy.
    And yet, if you are parsing stuff in Python, and your parser doesn't
    use some specialized C code for tokenization (which will probably be
    regular expressions unless you are using mxtexttools or some other
    specialized C tokenizer code), your nested structure parser will be
    dog slow.
    The point is that you *cannot* match arbitrarily-nested expressions using
    regexps. You could, in theory, write a regexp which will match any valid
    syntax up to N levels of nesting, for any finite N. But in practice, the
    regexp is going to look horrible (and is probably going to be quite
    inefficient if the regexp library uses backtracking rather than a DFA).
    Trust me, I already knew that. But what you just wrote is a much more
    useful thing to tell the OP than "Every time someone tries to parse
    nested structures using regular expressions, Jamie Zawinski kills a
    puppy" which is what I was responding to. And right after
    regurgitating that inside joke, Chris Rebert then went on to say "Try
    using an *actual* parser, such as Pyparsing". Which is all well and
    good, except then the OP will download pyparsing, take a look, realize
    that it uses regexps under the hood, and possibly be very confused.
    Even tokenising with Python's regexp interface is inefficient if the
    number of token types is large, as you have to test against each regexp
    sequentially.
    It's not that bad if you do it right. You can first rip things apart,
    then use a dict-based scheme to categorize them.
    Ultimately, if you want an efficient parser, you need something with a C
    component, e.g. Plex.
    There is no doubt that you can get better performance with C than with
    Python. But, for a lot of tasks, the Python performance is
    acceptable, and, as always, algorithm, algorithm, algorithm...

    A case in point. My pure Python RSON parser is faster on my computer
    on a real-world dataset of JSON data than the json decoder that comes
    with Python 2.6, *even with* the json decoder's C speedups enabled.

    Having said that, the current subversion pure Python simplejson parser
    is slightly faster than my RSON parser, and the C reimplementation of
    the parser in current subversion simplejson completely blows the doors
    off my RSON parser.

    So, a naive translation to C, even by an experienced programmer, may
    not do as much for you as an algorithm rework.

    Regards,
    Pat
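One such under-the-radar corner is re.Scanner, present but undocumented in the stdlib (so its interface could change), which may be the code alluded to above. It pairs patterns with callbacks and returns classified tokens in one pass:

```python
import re

# Each pattern is paired with a callback taking (scanner, matched_text);
# a callback of None discards the match (used here to skip whitespace).
scanner = re.Scanner([
    (r"[A-Za-z_]\w*", lambda s, tok: ("ID", tok)),
    (r"\d+",          lambda s, tok: ("INT", tok)),
    (r"\s+",          None),
])

tokens, remainder = scanner.scan("bin_dir 42 smbd")
print(tokens)           # -> [('ID', 'bin_dir'), ('INT', '42'), ('ID', 'smbd')]
print(repr(remainder))  # whatever the scanner could not match ('' here)
```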
  • Neil Cerutti at Apr 10, 2010 at 4:35 pm

    On 2010-04-10, Patrick Maupin wrote:
    Trust me, I already knew that. But what you just wrote is a
    much more useful thing to tell the OP than "Every time someone
    tries to parse nested structures using regular expressions,
    Jamie Zawinski kills a puppy" which is what I was responding
    to. And right after regurgitating that inside joke, Chris
    Rebert then went on to say "Try using an *actual* parser, such
    as Pyparsing". Which is all well and good, except then the OP
    will download pyparsing, take a look, realize that it uses
    regexps under the hood, and possibly be very confused.
    I don't agree with that. If a person is trying to ski using
    pieces of wood that they carved themselves, I don't expect them
    to be surprised that the skis they buy are made out of similar
    materials.

    --
    Neil Cerutti
  • Patrick Maupin at Apr 10, 2010 at 5:11 pm

    On Apr 10, 11:35 am, Neil Cerutti wrote:
    On 2010-04-10, Patrick Maupin wrote:
    as Pyparsing". Which is all well and good, except then the OP
    will download pyparsing, take a look, realize that it uses
    regexps under the hood, and possibly be very confused.
    I don't agree with that. If a person is trying to ski using
    pieces of wood that they carved themselves, I don't expect them
    to be surprised that the skis they buy are made out of similar
    materials.
    But, in this case, the guy ASKED how to make the skis in his
    woodworking shop, and was told not to be silly -- you don't use wood
    to make skis -- and then directed to go buy some skis that are, in
    fact, made out of wood.

    I think it would have been perfectly appropriate to point out that it
    might take some additional woodworking equipment and a bit of
    experience and/or study and/or extra work to make decent skis out of
    wood (and, oh, by the way, here is where you can buy some ready-made
    skis cheap), but the original response didn't explain it like this.

    Regards,
    Pat
  • Stefan Behnel at Apr 10, 2010 at 6:05 pm

    Patrick Maupin, 10.04.2010 19:11:
    On Apr 10, 11:35 am, Neil Cerutti wrote:
    On 2010-04-10, Patrick Maupin wrote:
    as Pyparsing". Which is all well and good, except then the OP
    will download pyparsing, take a look, realize that it uses
    regexps under the hood, and possibly be very confused.
    I don't agree with that. If a person is trying to ski using
    pieces of wood that they carved themselves, I don't expect them
    to be surprised that the skis they buy are made out of similar
    materials.
    But, in this case, the guy ASKED how to make the skis in his
    woodworking shop, and was told not to be silly -- you don't use wood
    to make skis -- and then directed to go buy some skis that are, in
    fact, made out of wood.
    Running a Python program in CPython eventually boils down to a sequence of
    commands being executed by the CPU. That doesn't mean you should write
    those commands manually, even if you can. It's perfectly ok to write the
    program in Python instead.

    Stefan
  • Ethan Furman at Apr 10, 2010 at 11:57 pm

    Stefan Behnel wrote:
    Patrick Maupin, 10.04.2010 19:11:
    On Apr 10, 11:35 am, Neil Ceruttiwrote:
    On 2010-04-10, Patrick Maupinwrote:
    as Pyparsing". Which is all well and good, except then the OP
    will download pyparsing, take a look, realize that it uses
    regexps under the hood, and possibly be very confused.

    I don't agree with that. If a person is trying to ski using
    pieces of wood that they carved themselves, I don't expect them
    to be surprised that the skis they buy are made out of similar
    materials.

    But, in this case, the guy ASKED how to make the skis in his
    woodworking shop, and was told not to be silly -- you don't use wood
    to make skis -- and then directed to go buy some skis that are, in
    fact, made out of wood.

    Running a Python program in CPython eventually boils down to a sequence
    of commands being executed by the CPU. That doesn't mean you should
    write those commands manually, even if you can. It's perfectly ok to
    write the program in Python instead.

    Stefan
    And it's even more perfectly okay to use Python when it's the best tool
    for the job, and re when *it's* the best tool for the job.

    ~Ethan~
  • Steven D'Aprano at Apr 11, 2010 at 1:23 am

    On Sat, 10 Apr 2010 10:11:07 -0700, Patrick Maupin wrote:
    On Apr 10, 11:35 am, Neil Cerutti wrote:
    On 2010-04-10, Patrick Maupin wrote:
    as Pyparsing". Which is all well and good, except then the OP will
    download pyparsing, take a look, realize that it uses regexps under
    the hood, and possibly be very confused.
    I don't agree with that. If a person is trying to ski using pieces of
    wood that they carved themselves, I don't expect them to be surprised
    that the skis they buy are made out of similar materials.
    But, in this case, the guy ASKED how to make the skis in his woodworking
    shop, and was told not to be silly -- you don't use wood to make skis --
    and then directed to go buy some skis that are, in fact, made out of
    wood.
    As entertaining as this is, the analogy is rubbish. Skis are far too
    simple to use as an analogy for a parser (he says, having never seen skis
    up close in his life *wink*). Have you looked at PyParsing's source code?
    Regexes are only a small part of the parser, and not analogous to the
    wood of skis.

    Perhaps a better analogy would be a tennis racket, with regexes being the
    strings. You have a whole lot of strings, not just one, and they are held
    together with a strong framework. Without the framework the strings are
    useless, and without the strings the racket doesn't do anything useful.

    Using this analogy, I would say the OP was wanting to play tennis with a
    single piece of string, and asking for advise on beefing it up to make it
    work better. Perhaps a knot tied in one end will help?



    --
    Steven
  • Paul Rubin at Apr 11, 2010 at 1:38 am

    Steven D'Aprano <steve at REMOVE-THIS-cybersource.com.au> writes:
    As entertaining as this is, the analogy is rubbish. Skis are far too
    simple to use as an analogy for a parser (he says, having never seen skis
    up close in his life *wink*). Have you looked at PyParsing's source code?
    Regexes are only a small part of the parser, and not analogous to the
    wood of skis.
    The impression that I have (from a distance) is that Pyparsing is a good
    interface abstraction with a kludgy and slow implementation. That the
    implementation uses regexps just goes to show how kludgy it is. One
    hopes that someday there will be a more serious implementation, perhaps
    using llvm-py (I wonder whatever happened to that project, by the way)
    so that your parser script will compile to executable machine code on
    the fly.
  • Paul McGuire at Apr 11, 2010 at 4:32 am

    On Apr 10, 8:38 pm, Paul Rubin wrote:
    The impression that I have (from a distance) is that Pyparsing is a good
    interface abstraction with a kludgy and slow implementation. That the
    implementation uses regexps just goes to show how kludgy it is. One
    hopes that someday there will be a more serious implementation, perhaps
    using llvm-py (I wonder whatever happened to that project, by the way)
    so that your parser script will compile to executable machine code on
    the fly.
    I am definitely flattered that pyparsing stirs up so much interest,
    and among such a distinguished group. But I have to take some umbrage
    at Paul Rubin's left-handed compliment, "Pyparsing is a good
    interface abstraction with a kludgy and slow implementation,"
    especially since he forms his opinions "from a distance".

    I actually *did* put some thought into what I wanted in pyparsing
    before designing it, and this forms the basis of a chapter of "Getting Started
    with Pyparsing" (available here as a free online excerpt:
    http://my.safaribooksonline.com/9780596514235/what_makes_pyparsing_so_special#X2ludGVybmFsX0ZsYXNoUmVhZGVyP3htbGlkPTk3ODA1OTY1MTQyMzUvMTYmaW1hZ2VwYWdlPTE2),
    the "Zen of Pyparsing" as it were. My goals were:

    - build parsers using explicit constructs (such as words, groups,
    repetition, alternatives), vs. expression encoding using specialized
    character sequences, as found in regexen

    - easy parser construction from primitive elements to complex groups
    and alternatives, using Python's operator overloading for ease of
    direct implementation of parsers using ordinary Python syntax; include
    mechanisms for defining recursive parser expressions

    - implicit skipping of whitespace between parser elements

    - results returned not just as a list of strings, but as a rich data
    object, with access to parsed fields by name or by list index, taking
    interfaces from both dicts and lists for natural adoption into common
    Python idioms

    - no separate code-generation steps, a la lex/yacc

    - support for parse-time callbacks, for specialized token handling,
    conversion, and/or construction of data structures

    - 100% pure Python, to be runnable on any platform that supports
    Python

    - liberal licensing, to permit easy adoption into any user's projects
    anywhere

    So raw performance really didn't even make my short-list, beyond the
    obvious "should be tolerably fast enough."
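As a tiny illustration of the first few goals (assuming pyparsing is installed; the two-token grammar below is invented here, not taken from the book):

```python
from pyparsing import Suppress, Word, alphas, nums

# "+" composes elements, "(name)" attaches a results name, and the
# whitespace between tokens is skipped implicitly.
pair = Word(alphas + "_")("key") + Suppress("=") + Word(nums)("value")
r = pair.parseString("answer = 42")
print(r.key, r.value)
```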

    I have found myself reading posts on c.l.py with wording like "I'm
    trying to parse <blah-blah> and I've been trying for hours/days to get
    this regex working." For kicks, I'd spend 5-15 minutes working up a
    working pyparsing solution, which *does* run comparatively slowly,
    perhaps taking a few minutes to process the poster's data file. But
    the net solution is developed and running in under 1/2 an hour, which
    to me seems like an overall gain compared to hours of fruitless
    struggling with backslashes and regex character sequences. On top of
    which, the pyparsing solutions are still readable when I come back to
    them weeks or months later, instead of staring at some line-noise
    regex and just scratching my head wondering what it was for. And
    sometimes "comparatively slowly" means that it runs 50x slower than a
    compiled method that runs in 0.02 seconds - that's still getting the
    job done in just 1 second.

    And is the internal use of regexes with pyparsing really a "kludge"?
    Why? They are almost completely hidden from the parser developer. And
    yet by using compiled regexes, I retain the portability of 100% Python
    while leveraging the compiled speed of the re engine.

    It does seem that there have been many posts of late (either on c.l.py
    or the related posts on Stackoverflow) where the OP is trying to
    either scrape content from HTML, or parse some type of recursive
    expression. HTML scrapers implemented using re's are terribly
    fragile, since HTML in the wild often contains little surprises
    (unexpected whitespace; upper/lower case inconsistencies; tag
    attributes in unpredictable order; attribute values with double,
    single, or no quotation marks) which completely frustrate any re-based
    approach. Granted, there are times when an re-parsing-of-HTML
    endeavor *isn't* futile or doomed from the start - the OP may be
    working with a very restricted set of HTML, generated from some other
    script so that the output is very consistent. Unfortunately, this
    poster usually gets thrown under the same "you'll never be able to
    parse HTML with re's" bus. I can't explain the surge in these posts,
    other than to wonder if we aren't just seeing a skewed sample - that
    is, the many cases where people *are* successfully using re's to solve
    their text extraction problems aren't getting posted to c.l.py, since
    no one posts questions they already have the answers to.

    So don't be too dismissive of pyparsing, Mr. Rubin. I've gotten many e-
    mails, wiki, and forum posts from Python users at all levels of the
    expertise scale, saying that pyparsing has helped them to be very
    productive in one or another aspect of creating a command parser, or
    adding safe expression evaluation to an app, or just extracting some
    specific data from a log file. I am encouraged that most report that
    they can get their parsers working in reasonably short order, often by
    reworking one of the examples that comes with pyparsing. If you're
    offering to write that extension to pyparsing that generates the
    parser runtime in fast machine code, it sounds totally bitchin' and
    I'd be happy to include it when it's ready.

    -- Paul
  • Neil Cerutti at Apr 12, 2010 at 12:09 pm

    On 2010-04-11, Steven D'Aprano wrote:
    On Sat, 10 Apr 2010 10:11:07 -0700, Patrick Maupin wrote:
    On Apr 10, 11:35 am, Neil Cerutti wrote:
    On 2010-04-10, Patrick Maupin wrote:
    as Pyparsing". Which is all well and good, except then the OP will
    download pyparsing, take a look, realize that it uses regexps under
    the hood, and possibly be very confused.
    I don't agree with that. If a person is trying to ski using pieces of
    wood that they carved themselves, I don't expect them to be surprised
    that the skis they buy are made out of similar materials.
    But, in this case, the guy ASKED how to make the skis in his woodworking
    shop, and was told not to be silly -- you don't use wood to make skis --
    and then directed to go buy some skis that are, in fact, made out of
    wood.
    As entertaining as this is, the analogy is rubbish.
    You should have seen the car engine analogy I thought up at
    first. ;)
    Skis are far too simple to use as an analogy for a parser (he
    says, having never seen skis up close in his life *wink*).
    Have you looked at PyParsing's source code? Regexes are only a
    small part of the parser, and not analogous to the wood of
    skis.
    I was mainly trying to get across my incredulity that somebody
    should be surprised a parsing package uses regexes under the
    hood. But for the record, a set of downhill skis comes with a
    really fancy interface layer:

    URL:http://images03.olx.com/ui/1/85/66/13147966_1.jpg

    --
    Neil Cerutti
  • Patrick Maupin at Apr 11, 2010 at 6:29 am

    On Apr 10, 1:05 pm, Stefan Behnel wrote:

    Running a Python program in CPython eventually boils down to a sequence of
    commands being executed by the CPU. That doesn't mean you should write
    those commands manually, even if you can. It's perfectly ok to write the
    program in Python instead.
    Absolutely. But (as I seem to have posted many times recently) if
    somebody asks how to do "x" it may be useful to point out that it
    sounds like he really wants "y" and there are already several canned
    solutions that do "y", but if he really wants "x", here is how he
    should do it, or here is why he will have problems if he attempts to
    do it (hint: whether Jamie Zawinski decides to kill a puppy or not is
    not really a problem for somebody just asking a programming question
    -- that's really up to Jamie).

    Regards,
    Pat

Discussion Overview
group: python-list
categories: python
posted: Apr 7, '10 at 8:37a
active: Apr 12, '10 at 12:09p
posts: 19
users: 12
website: python.org
