I have a set of files with this kind of content (it's dumped from WebSphere):

[propertySet "[[resourceProperties "[[[description "This is a required
property. This is an actual database name, and its not the locally
catalogued database name. The Universal JDBC Driver does not rely on
information catalogued in the DB2 database directory."]
[name databaseName]
[required true]
[type java.lang.String]
[value DB2Foo]] [[description "The JDBC connectivity-type of a data
source. If you want to use a type 4 driver, set the value to 4. If you
want to use a type 2 driver, set the value to 2. Use of driverType 2
is not supported on WAS z/OS."]
[name driverType]
[required true]
[type java.lang.Integer]
[value 4]] [[description "The TCP/IP address or host name for the DRDA server."]
[name serverName]
[required false]
[type java.lang.String]
[value ServerFoo]] [[description "The TCP/IP port number where the
DRDA server resides."]
[name portNumber]
[required false]
[type java.lang.Integer]
[value 007]] [[description "The description of this datasource."]
[name description]
[required false]
[type java.lang.String]
[value []]] [[description "The DB2 trace level for logging to the
logWriter or trace file. Possible trace levels are: TRACE_NONE =
0,TRACE_CONNECTION_CALLS = 1,TRACE_STATEMENT_CALLS =
2,TRACE_RESULT_SET_CALLS = 4,TRACE_DRIVER_CONFIGURATION =
16,TRACE_CONNECTS = 32,TRACE_DRDA_FLOWS =
64,TRACE_RESULT_SET_META_DATA = 128,TRACE_PARAMETER_META_DATA =
256,TRACE_DIAGNOSTICS = 512,TRACE_SQLJ = 1024,TRACE_ALL = -1, ."]
[name traceLevel]
[required false]
[type java.lang.Integer]
[value []]] [[description "The trace file to store the trace output.
If you specify the trace file, the DB2 Jcc trace will be logged in
this trace file. If this property is not specified and the
WAS.database trace group is enabled, then both WebSphere trace and DB2
trace will be logged into the WebSphere trace file."]

I'm trying to figure out the best way to feed it all into
dictionaries, without having to know exactly what the contents of the
file are. There are a number of things going on. The nesting is
preserved in [] pairs, and in some cases in between double quotes.
There are also cases where double quotes are only there to preserve
spaces in a string, though. I managed to get what I needed in the
short term by just stripping the nesting altogether and flattening
out the key/value pairs, but I had to do some things that were
specific to the file contents to make it work.

Any ideas? I was considering making a list of string combinations, like so:

junk = ['[[','"[',']]']

and just using re.sub to convert them into a single character that I
could start to do split() actions on. There must be something else I
can do. Those brackets can't be a coincidence. The output came from
a Jython script.

Thanks!


  • George Sakkis at Apr 24, 2008 at 1:17 am

    On Apr 23, 9:00 pm, "Eric Wertman" wrote:

    > I have a set of files with this kind of content (it's dumped from WebSphere):
    >
    > [snipped]
    >
    > I'm trying to figure out the best way to feed it all into
    > dictionaries, without having to know exactly what the contents of the
    > file are.

    It would be pretty pointless if you had to know in advance the exact
    file content, but you still have to know the structure of the files,
    that is, the grammar they conform to.

    > Any ideas? I was considering making a list of string combinations, like so:
    >
    > junk = ['[[','"[',']]']
    >
    > and just using re.sub to convert them into a single character that I
    > could start to do split() actions on. There must be something else I
    > can do..

    Yes, find out the formal grammar of these files and use a parser
    generator [1] to specify it.

    HTH,
    George

    [1] http://wiki.python.org/moin/LanguageParsing
  • Paul McGuire at Apr 24, 2008 at 2:05 am

    On Apr 23, 8:00 pm, "Eric Wertman" wrote:

    > I have a set of files with this kind of content (it's dumped from WebSphere):
    >
    > [propertySet "[[resourceProperties "[[[description "This is a required
    > property. This is an actual database name, and its not the locally
    > catalogued database name. The Universal JDBC Driver does not rely on
    > ...

    A couple of comments first:
    - What is the significance of '"[' vs. '[' ? I stripped them all out
    using
    text = text.replace('"[','[')
    - Your input text was missing 5 trailing ]'s.

    Here's the parser I used, using pyparsing:


    from pyparsing import nestedExpr,Word,alphanums,QuotedString
    from pprint import pprint

    content = Word(alphanums+"_.") | QuotedString('"',multiline=True)
    structure = nestedExpr("[", "]", content).parseString(text)

    pprint(structure.asList())


    Prints (I've truncated the long lines, but the long quoted strings do
    parse intact):

    [['propertySet',
    [['resourceProperties',
    [[['description',
    'This is a required \nproperty. This is an actual data...
    ['name', 'databaseName'],
    ['required', 'true'],
    ['type', 'java.lang.String'],
    ['value', 'DB2Foo']],
    [['description',
    'The JDBC connectivity-type of a data \nsource. If you...
    ['name', 'driverType'],
    ['required', 'true'],
    ['type', 'java.lang.Integer'],
    ['value', '4']],
    [['description',
    '"The TCP/IP address or host name for the DRDA server."'],
    ['name', 'serverName'],
    ['required', 'false'],
    ['type', 'java.lang.String'],
    ['value', 'ServerFoo']],
    [['description',
    'The TCP/IP port number where the \nDRDA server resides.'],
    ['name', 'portNumber'],
    ['required', 'false'],
    ['type', 'java.lang.Integer'],
    ['value', '007']],
    [['description', '"The description of this datasource."'],
    ['name', 'description'],
    ['required', 'false'],
    ['type', 'java.lang.String'],
    ['value', []]],
    [['description',
    'The DB2 trace level for logging to the \nlogWriter ...
    ['name', 'traceLevel'],
    ['required', 'false'],
    ['type', 'java.lang.Integer'],
    ['value', []]],
    [['description',
    'The trace file to store the trace output. \nIf you ...
    ]]]]]]]

    -- Paul
    The pyparsing wiki is at http://pyparsing.wikispaces.com.
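
Since the original goal was dictionaries, one possible next step is to turn the nested list printed above into dicts. The following is only a minimal sketch under assumptions: `rows` is a hand-written, shortened stand-in for the list found under 'resourceProperties', and `rows_to_dict` is a hypothetical helper name, not part of pyparsing.

```python
# Hypothetical sample in the shape of the parsed output above: each row is a
# list of [key, value] pairs, and an empty list stands for "no value".
rows = [
    [["description", "This is a required property."],
     ["name", "databaseName"],
     ["required", "true"],
     ["type", "java.lang.String"],
     ["value", "DB2Foo"]],
    [["description", "The description of this datasource."],
     ["name", "description"],
     ["required", "false"],
     ["type", "java.lang.String"],
     ["value", []]],
]


def rows_to_dict(rows):
    """Index each row by its 'name' entry; map empty lists to None."""
    result = {}
    for row in rows:
        props = {key: (None if value == [] else value) for key, value in row}
        result[props["name"]] = props
    return result


properties = rows_to_dict(rows)
print(properties["databaseName"]["value"])
print(properties["description"]["value"])
```

Keying the outer dict by 'name' is a design choice that assumes names are unique within one propertySet; if they are not, a plain list of dicts would be the safer shape.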
  • Gerard Flanagan at Apr 24, 2008 at 9:52 am

    On Apr 24, 4:05 am, Paul McGuire wrote:
    > On Apr 23, 8:00 pm, "Eric Wertman" wrote:
    >
    > > I have a set of files with this kind of content (it's dumped from WebSphere):
    > > [propertySet "[[resourceProperties "[[[description "This is a required
    > > property. This is an actual database name, and its not the locally
    > > catalogued database name. The Universal JDBC Driver does not rely on
    > > ...
    >
    > A couple of comments first:
    > - What is the significance of '"[' vs. '[' ? I stripped them all out
    > using

    The data can be thought of as a serialised object. A simple attribute
    looks like:

    [name someWebsphereObject]

    or

    [jndiName []]

    if 'jndiName is None'.

    A complex attribute is an attribute whose value is itself an object
    (or dict if you prefer). The *value* is indicated with "[...]":

    [connectionPool "[[agedTimeout 0]
    [connectionTimeout 180]
    [freePoolDistributionTableSize 0]
    [maxConnections 10]
    [minConnections 1]
    [numberOfFreePoolPartitions 0]
    [numberOfSharedPoolPartitions 0]
    [unusedTimeout 1800]]"]

    However, 'propertySet' is effectively a keyword and its value may be
    thought of as a 'data table' or 'list of data rows', where 'data row'
    == dict/object

    You can see how the posted example is incomplete because the last
    'row' is missing all but one 'column'.

    > text = text.replace('"[','[')
    > - Your input text was missing 5 trailing ]'s.

    I think only 2 (the original isn't Python). To fix the example, remove
    the last 'description' and add two ]'s

    > Here's the parser I used, using pyparsing:
    >
    > from pyparsing import nestedExpr,Word,alphanums,QuotedString
    > from pprint import pprint
    >
    > content = Word(alphanums+"_.") | QuotedString('"',multiline=True)
    > structure = nestedExpr("[", "]", content).parseString(text)
    >
    > pprint(structure.asList())

    By the way, I think this would be a good example for the pyparsing
    recipes page (even an IBM developerworks article?)

    http://www.ibm.com/developerworks/websphere/library/techarticles/0801_simms/0801_simms.html

    Gerard

    example data (copied and pasted; doesn't have the case where a complex
    attribute has a complex attribute):

    [authDataAlias []]
    [authMechanismPreference BASIC_PASSWORD]
    [connectionPool "[[agedTimeout 0]
    [connectionTimeout 180]
    [freePoolDistributionTableSize 0]
    [maxConnections 10]
    [minConnections 1]
    [numberOfUnsharedPoolPartitions 0]
    [properties []]
    [purgePolicy FailingConnectionOnly]
    [reapTime 180]
    [surgeThreshold -1]
    [testConnection false]
    [testConnectionInterval 0]
    [unusedTimeout 1800]]"]
    [propertySet "[[resourceProperties "[[[description "This is a required
    property. This is an actual database name, and its not the locally
    catalogued database name. The Universal JDBC Driver does not rely on
    information catalogued in the DB2 database directory."]
    [name databaseName]
    [required true]
    [type java.lang.String]
    [value DB2Foo]] [[description "The JDBC connectivity-type of a data
    source. If you want to use a type 4 driver, set the value to 4. If you
    want to use a type 2 driver, set the value to 2. Use of driverType 2
    is not supported on WAS z/OS."]
    [name driverType]
    [required true]
    [type java.lang.Integer]
    [value 4]] [[description "The TCP/IP address or name for the DRDA
    server."]
    [name serverName]
    [required false]
    [type java.lang.String]
    [value ServerFoo]] [[description "The TCP/IP port number where the
    DRDA server resides."]
    [name portNumber]
    [required false]
    [type java.lang.Integer]
    [value 007]] [[description "The description of this datasource."]
    [name description]
    [required false]
    [type java.lang.String]
    [value []]] [[description "The DB2 trace level for logging to the
    logWriter or trace file. Possible trace levels are: TRACE_NONE =
    0,TRACE_CONNECTION_CALLS = 1,TRACE_STATEMENT_CALLS =
    2,TRACE_RESULT_SET_CALLS = 4,TRACE_DRIVER_CONFIGURATION =
    16,TRACE_CONNECTS = 32,TRACE_DRDA_FLOWS =
    64,TRACE_RESULT_SET_META_DATA = 128,TRACE_PARAMETER_META_DATA =
    256,TRACE_DIAGNOSTICS = 512,TRACE_SQLJ = 1024,TRACE_ALL = -1, ."]
    [name traceLevel]
    [required false]
    [type java.lang.Integer]
    [value []]]
    ]]
  • Mark Wooding at Apr 24, 2008 at 12:19 pm

    Eric Wertman wrote:

    > I have a set of files with this kind of content (it's dumped from
    > WebSphere):
    >
    > [propertySet "[[resourceProperties "[[[description "This is a required
    > property. This is an actual database name, and its not the locally
    > catalogued database name. The Universal JDBC Driver does not rely on
    > information catalogued in the DB2 database directory."]
    > [name databaseName]
    > [required true]
    > [type java.lang.String]
    > [value DB2Foo]] ...

    Looks to me like S-expressions with square brackets instead of the
    normal round ones. I'll bet that the correct lexical analysis is
    approximately

    [                     open-list
    propertySet           symbol
    "                     open-string
    [                     open-list
    [                     open-list
    resourceProperties    symbol
    "                     open-string (not close-string!)
    ...

    so it also looks as if strings aren't properly escaped.

    This is definitely not a pretty syntax. I'd suggest an initial
    tokenization pass for the lexical syntax

    [                open-list
    ]                close-list
    "[               open-qlist
    ]"               close-qlist
    "..."            string
    whitespace       ignore
    anything-else    symbol

    Correct nesting should give you two kinds of lists -- which I've shown
    as `list' and `qlist' (for quoted-list), though given the nastiness of
    the dump you showed, there's no guarantee of correctness.
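
The tokenization pass described above could be sketched with the standard `re` module along the following lines. This is an illustration under assumptions, not a tested parser for real WebSphere dumps: the names `TOKEN_RE` and `tokenize` are made up here, strings are assumed to contain no escaped quotes, and a string value that happens to start with `[` would be misread as an open-qlist.

```python
import re

# One named alternative per token class. Order matters: the two-character
# quoted-list delimiters are tried before plain strings and plain brackets.
TOKEN_RE = re.compile(r'''
    (?P<open_qlist>"\[)
  | (?P<close_qlist>\]")
  | (?P<string>"[^"]*")
  | (?P<open_list>\[)
  | (?P<close_list>\])
  | (?P<ws>\s+)
  | (?P<symbol>[^\s\[\]"]+)
''', re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # For example, a quote that is never closed.
            raise ValueError("cannot tokenize at offset %d" % pos)
        pos = match.end()
        kind = match.lastgroup
        if kind == "ws":
            continue
        value = match.group()
        if kind == "string":
            value = value[1:-1]  # drop the surrounding quotes
        yield kind, value


sample = '[connectionPool "[[agedTimeout 0]\n[unusedTimeout 1800]]"]'
tokens = list(tokenize(sample))
for kind, value in tokens:
    print(kind, value)
```

A later pass can then treat open-qlist and open-list alike when it builds the nested structure, or keep them apart if the distinction turns out to matter.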

    Turn the input string (or file) into a list (generator?) of lexical
    objects above; then scan that recursively. The lists (or qlists) seem
    to have two basic forms:

    * properties, that is a list of the form [SYMBOL VALUE ...] which can
    be thought of as a declaration that some property, named by the
    SYMBOL, has a particular VALUE (or maybe VALUEs); and

    * property lists, which are just lists of properties.

    Property lists can be usefully turned into Python dictionaries, indexed
    by their SYMBOLs, assuming that they don't try to declare the same
    property twice.

    There are, alas, other kinds of lists too -- one of the property lists
    contains a property `[value []]' which simply contains an empty list.

    The right first-cut rule for disambiguation is probably that a property
    list is a non-empty list, all of whose items look like properties, and a
    property is an entry in a property list, and (initially at least)
    restrict properties to the simple form [SYMBOL VALUE] rather than
    allowing multiple values.
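
The first-cut rule above can be written down directly once the input has been turned into nested Python lists (for example by a recursive scan over the tokens, or by pyparsing's nestedExpr). The sketch below is an assumption-laden illustration: `is_property`, `is_property_list` and `convert` are hypothetical names, and the sample data is hand-written and shortened.

```python
def is_property(item):
    """A property is the simple form [SYMBOL, VALUE] with a string SYMBOL."""
    return isinstance(item, list) and len(item) == 2 and isinstance(item[0], str)


def is_property_list(item):
    """A property list is a non-empty list whose items all look like properties."""
    return (isinstance(item, list) and len(item) > 0
            and all(is_property(entry) for entry in item))


def convert(item):
    """Turn property lists into dicts, recursing into values and other lists."""
    if is_property_list(item):
        result = {}
        for symbol, value in item:
            if symbol in result:
                raise ValueError("property declared twice: %r" % symbol)
            result[symbol] = convert(value)
        return result
    if isinstance(item, list):
        # Any other list (including the empty one) stays a list.
        return [convert(child) for child in item]
    return item


# Hand-written sample in the nested-list shape, shortened for illustration.
row1 = [["name", "databaseName"], ["required", "true"], ["value", "DB2Foo"]]
row2 = [["name", "description"], ["required", "false"], ["value", []]]
nested = [["propertySet", [["resourceProperties", [row1, row2]]]]]

converted = convert(nested)
print(converted["propertySet"]["resourceProperties"][0]["value"])
```

Note that the empty list in `[value []]` is left as an empty list here; mapping it to None instead would be an equally defensible choice.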

    Does any of this help?

    (In fact, this syntax looks so much like a demented kind of S-expression
    that I'd probably try to parse it, initially at least, by using a Common
    Lisp system's reader and a custom readtable, but that may not be useful
    to you.)

    -- [mdw]
  • Eric Wertman at Apr 24, 2008 at 3:42 pm
    Thanks to everyone for the help and feedback. It's amazing to me that
    I've been dealing with odd log files and other outputs for quite a
    while, and never really stumbled onto a parser as a solution.


    I got this far, with Paul's help, which manages my current set of files:

    from pyparsing import nestedExpr, Word, alphanums, QuotedString
    from pprint import pprint
    import re
    import glob

    # A symbol may contain any of these characters besides alphanumerics.
    content = (Word(alphanums + "-_./()*=#\\${}| :,;\t\n\r@?&%")
               | QuotedString('"', multiline=True))

    for path in glob.glob('wsout/*'):
        with open(path) as f:
            text = f.read()
        text = re.sub(r'"\[', ' [', text)     # These 2 lines just drop double quotes
        text = re.sub(r'\]"', '] ', text)     # that aren't related to a string
        text = re.sub(r'\[\]', 'None', text)  # this drops the empty []
        text = '[ ' + text + ' ]'             # Needs an outer layer

        structure = nestedExpr("[", "]", content).parseString(text)
        pprint(structure[0].asList())

    I'm sure there are cooler ways to do some of that. I spent most of my
    time expanding the characters that constitute content. I'm concerned
    that over time I'll have things break as other characters show up.
    Specifically, a few of the nodes use a German locale, so I could get
    some odd international characters.

    It looks like pyparsing has a constant for printable characters. I'm
    not sure if I can just use that, without worrying about it?

    At any rate, thumbs up on the parser! Definitely going to add to my toolbox.

  • Paul McGuire at Apr 24, 2008 at 4:08 pm

    On Apr 24, 10:42 am, "Eric Wertman" wrote:
    > I'm sure there are cooler ways to do some of that. I spent most of my
    > time expanding the characters that constitute content. I'm concerned
    > that over time I'll have things break as other characters show up.
    > Specifically a few of the nodes are of German locale.. so I could get
    > some odd international characters.

    If you want to add international characters without going to Unicode,
    a first cut would be to add pyparsing's string constant "alphas8bit".

    > It looks like pyparsing has a constant for printable characters. I'm
    > not sure if I can just use that, without worrying about it?

    I would discourage you from using printables, since it also includes
    '[', ']', and '"', which are significant to other elements of the
    parser (but you could create your own variable initialized with
    printables, and then use replace("[","") etc. to strip out the
    offending characters). I'm also a little concerned that you needed to
    add \t and \n to the content word - was this really necessary? None
    of your examples showed such words, and I would rather have you let
    pyparsing skip over the whitespace as is its natural behavior.

    -- Paul
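
The suggestion above, starting from printables and stripping the characters that are significant to the parser, might look like the sketch below. The variable name `symbol_chars` is made up for illustration, and the fallback branch exists only so the snippet also runs where pyparsing is not installed.

```python
import string

try:
    from pyparsing import printables
except ImportError:
    # Fallback for illustration only: pyparsing builds printables from the
    # non-whitespace characters of string.printable.
    printables = "".join(c for c in string.printable
                         if c not in string.whitespace)

# Remove the characters that delimit lists and quoted strings.
symbol_chars = printables
for special in '[]"':
    symbol_chars = symbol_chars.replace(special, "")

print(symbol_chars)

# With pyparsing available, the content expression could then be written as:
#   content = Word(symbol_chars) | QuotedString('"', multiline=True)
```

Because whitespace is not part of printables, pyparsing is left to skip over it between tokens, which matches the advice about not adding \t and \n to the content word.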
  • Eric Wertman at Apr 24, 2008 at 4:14 pm

    > I would discourage you from using printables, since it also includes
    > '[', ']', and '"', which are significant to other elements of the
    > parser (but you could create your own variable initialized with
    > printables, and then use replace("[","") etc. to strip out the
    > offending characters). I'm also a little concerned that you needed to
    > add \t and \n to the content word - was this really necessary? None
    > of your examples showed such words, and I would rather have you let
    > pyparsing skip over the whitespace as is its natural behavior.
    >
    > -- Paul

    You are right. I have taken those out and it still works. I was
    adding everything I could think of at one point in trying to determine
    what was breaking the parser. Some of the data in there is free-form
    input, which means that any developer could have put just about
    anything in there. I find a lot of ^M (carriage return) characters
    from day to day in other places.

Discussion Overview
group: python-list
categories: python
posted: Apr 24, '08 at 1:00a
active: Apr 24, '08 at 4:14p
posts: 8
users: 5
website: python.org
