Hi all,

I am writing a Perl script to parse a file. The data in the file is
separated by spaces/tabs. However, certain fields may be empty or
consist of multiple words, in which case they are double-quoted, and
this makes it difficult for me to do a split.

Example of data:
"" "This is 2nd field" 3 4
1 2 "" 4
1 2 "The field may consist of (meta) characters" ""


What I am doing is as such:
$line =~ s/""/__EMPTY__/g;
while ($line =~ /(".*?")/) {        # loops until every double-quoted string is replaced
    $tmp1 = $1;
    $tmp2 = $1;
    $tmp1 =~ s/"//g;
    $tmp1 =~ s/ /__SPACE__/g;
    $tmp2 =~ s/([\(\)])/\\$1/g;     # escape the metacharacters in $tmp2
    $line =~ s/$tmp2/$tmp1/;        # $tmp2 is used as a pattern here, hence the escaping
}
@tmp = split /\s+/, $line;
foreach $i (0..$#tmp) {
    $tmp[$i] =~ s/__SPACE__/ /g;
    $tmp[$i] =~ s/__EMPTY__//g;
    # Store data
}


Substitute "" with __EMPTY__.
While the line matches ".*?" (non-greedy match), capture the quoted string.
Assign the captured string to $tmp1 and $tmp2. Remove " from $tmp1 and replace ' ' with __SPACE__.
Escape the metacharacters in $tmp2, i.e. (meta) becomes \(meta\).
Substitute $tmp2 with $tmp1 in the line (non-global).
Do a split on /\s+/.
Replace __EMPTY__ with the empty string.
Replace __SPACE__ with " ".

Does anyone have a neater and more efficient way, either by split or
regexp?


Thanks
Shu Teng


  • Brad Baxter at Apr 8, 2008 at 12:39 pm
    If I were you, I'd use Text::ParseWords::parse_line()
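
    A minimal sketch of that suggestion, assuming each record sits on one line and fields are separated by runs of spaces or tabs. parse_line() is exported by Text::ParseWords (a core module); the helper name parse_record is hypothetical and only there for illustration:

    ```perl
    use strict;
    use warnings;
    use Text::ParseWords qw(parse_line);

    # Hypothetical helper: split one record into its fields.
    # The second argument (0) asks parse_line to drop the surrounding quotes.
    sub parse_record {
        my ($line) = @_;
        chomp $line;    # a trailing newline can otherwise produce a spurious last field
        return parse_line( '\s+', 0, $line );
    }

    for my $record ( '"" "This is 2nd field" 3 4', '1 2 "" 4' ) {
        my @fields = parse_record($record);
        print scalar(@fields), " fields: ", join( '|', @fields ), "\n";
    }
    ```

    Note that parse_line also treats single quotes and backslashes specially, so data containing apostrophes may need a different approach.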
  • Otavio at Apr 8, 2008 at 1:09 pm
    Either use the module mentioned or try a multi-stage split. It's
    uglier, but it gets the work done.

    First I'd split the data by ("\s+\""), then by ("\"\s+"), then I'd deal
    with the tabs....

    Just my two cents. ;-)
  • Wolfpack307 at Apr 9, 2008 at 12:29 am
    Yes Brad,

    I have tried the Text::ParseWords and that is exactly what I am
    looking for.

    Thanks
  • Johan Vromans at Apr 8, 2008 at 10:02 pm

    I think Text::CSV (Text::CSV_XS) can handle this. Just set the
    separator to Tab.

    -- Johan
  • John W. Krahn at Apr 9, 2008 at 5:07 am

    Hello,

    $ echo '"" "This is 2nd field" 3 4
    1 2 "" 4
    1 2 "The field may consist of (meta) characters" ""' | \
    perl -lne'
    my @x = /"[^"]*"|\S+/g;
    print "Number of fields: " . @x . " ", map " >$_<", @x;
    '
    Number of fields: 4  >""< >"This is 2nd field"< >3< >4<
    Number of fields: 4  >1< >2< >""< >4<
    Number of fields: 4  >1< >2< >"The field may consist of (meta) characters"< >""<
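
    The same pattern can be sketched as a small subroutine that also strips the surrounding quotes. The name split_fields is hypothetical, and the sketch assumes a quoted field never contains an embedded double quote:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: return the fields of one record.
    # A field is either a double-quoted string (quotes removed) or a run
    # of non-whitespace characters.
    sub split_fields {
        my ($line) = @_;
        my @fields;
        while ( $line =~ /"([^"]*)"|(\S+)/g ) {
            # $1 is defined (possibly empty) only when the quoted branch matched
            push @fields, defined $1 ? $1 : $2;
        }
        return @fields;
    }

    my @fields = split_fields('1 2 "The field may consist of (meta) characters" ""');
    print scalar(@fields), " fields: ", join( '|', @fields ), "\n";
    ```

    Capturing inside the quotes keeps empty fields as empty strings rather than dropping them, which is what the __EMPTY__ placeholder was working around.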




    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall

Discussion Overview
group: scripts @
categories: perl
posted: Apr 8, '08 at 1:12a
active: Apr 9, '08 at 5:07a
posts: 6
users: 5
website: perl.org