Hello,

I'm trying to parse HTML files. I want to extract values from tables (1) and from text fields (2).

(1)
<tr><td><img src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
<tr>
<td align="right" valign="top"><b>Floor plan:</b></td>
<td>
Ranch #1 </td>
</tr>

(2)
<input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled>

I would like to retrieve the floor plan (Ranch #1) and the date constructed (04/01/2004) from each HTML file (along with many other text boxes). What is an easy way of doing that?

Jeff


  • Shlomi Fish at Jul 25, 2011 at 8:30 pm
    Hi Jeffrey,

    On Mon, 25 Jul 2011 13:17:57 -0700
    Jeffrey Joh wrote:



    > Hello, I'm trying to parse HTML files. I want to extract values from tables
    > (1) and from text fields (2). (1)<tr><td><img src="/image.gif" alt=""
    > width="1" height="1" border="0"></td></tr> <tr> <td align="right"
    > valign="top"><b>Floor plan:</b></td> <td>
    > Ranch #1 </td>
    > </tr> (2)
    > <input type="text" name="date_constructed" id="date_constructed"
    > value="04/01/2004" size="10" disabled> I would want to retrieve the floor
    > plan (Ranch #1) and the date constructed (04/01/2004) from each HTML file
    > (along with many other text boxes). What is an easy way of doing that? Jeff

    You should use an HTML parser for that:

    http://perl-begin.org/uses/text-parsing/

    Regards,

    Shlomi Fish

  • Jim Gibson at Jul 25, 2011 at 8:35 pm
    On 7/25/11 Mon Jul 25, 2011 1:30 PM, "Shlomi Fish"
    <shlomif@shlomifish.org> scribbled:

    > On Mon, 25 Jul 2011 13:17:57 -0700
    > Jeffrey Joh wrote:
    >> Hello, I'm trying to parse HTML files.
    > You should use an HTML parser for that:
    >
    > http://perl-begin.org/uses/text-parsing/

    Also look at HTML::TableExtract (I have not used it).

    <http://search.cpan.org/~msisk/HTML-TableExtract-2.10/lib/HTML/TableExtract.pm>
  • Dr.Ruud at Jul 25, 2011 at 10:36 pm

    On 2011-07-25 22:35, Jim Gibson wrote:
    > Shlomi:
    >> Jeffrey:
    >>> Hello, I'm trying to parse HTML files.
    >> You should use an HTML parser for that:
    >>
    >> http://perl-begin.org/uses/text-parsing/
    > Also look at HTML::TableExtract (I have not used it).
    >
    > <http://search.cpan.org/~msisk/HTML-TableExtract-2.10/lib/HTML/TableExtract.pm>

    The 'permalink' is on the top-right of such a page:
    http://search.cpan.org/perldoc?HTML::TableExtract

    --
    Ruud
  • Rob Dixon at Jul 26, 2011 at 3:48 pm

    On 25/07/2011 21:17, Jeffrey Joh wrote:

    > Hello, I'm trying to parse HTML files. I want to extract values from
    > tables (1) and from text fields (2). (1)<tr><td><img
    > src="/image.gif" alt="" width="1" height="1" border="0"></td></tr>
    >
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>
    > Ranch #1</td>
    > </tr> (2)
    > <input type="text" name="date_constructed" id="date_constructed"
    > value="04/01/2004" size="10" disabled> I would want to retrieve the
    > floor plan (Ranch #1) and the date constructed (04/01/2004) from each
    > HTML file (along with many other text boxes). What is an easy way of
    > doing that? Jeff

    Hello Jeff

    I am unclear what you want to do. The HTML fragments you have shown are
    syntactically incorrect, and in any case are irrelevant out of the
    context of a complete HTML document.

    However I think I can help a little. The HTML::TreeBuilder module will
    build an HTML::Element object for you that you can navigate, modify, and
    extract data from. It is very forgiving of incorrect syntax, and will
    try to build a complete HTML document from any fragment that you offer it.

    The program below seems to do what you want, but without testing against
    the complete data that you are dealing with I cannot vouch for its
    correctness. In particular you should add checks to verify that the HTML
    you are working with looks as you expect it to. I have written a couple
    of such checks, but only you can improve on them.

    HTH,

    Rob


    use strict;
    use warnings;

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_file(*DATA);

    print "Working from HTML:\n\n";
    print $tree->as_HTML(undef, ' '), "\n\n";

    # Find an <input> element with an 'id' attribute of 'date_constructed'
    # (there should be only one). The date required comes from the 'value'
    # attribute of that element.
    #
    my $date_tr = $tree->look_down(
        _tag => 'input',
        id   => 'date_constructed',
    )
        or die "No construction date";
    my $plan_date = $date_tr->attr('value');

    # Now look up the tree to the containing <tr> element, and find its previous
    # sibling <tr> which contains the floor plan text in the second <td> child
    # element
    #
    my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
    my @tds = $plan_tr->look_down(_tag => 'td');
    die "Unexpected format" unless @tds == 2;

    my $plan_text = $tds[1]->as_trimmed_text;

    print "Plan found: $plan_text on $plan_date\n";

    __DATA__
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>
    Ranch #1 </td>
    </tr>
    <input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled>

    **OUTPUT**

    Working from HTML:

    <html>
    <head>
    </head>
    <body>
    <table>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td> Ranch #1 </td>
    </tr>
    <tr>
    <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="04/01/2004" /></td>
    </tr>
    </table>
    </body>
    </html>

    Plan found: Ranch #1 on 04/01/2004

    Tool completed successfully
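
The question also mentions "many other text boxes". As a minimal sketch, assuming every field of interest is an <input type="text"> with a unique name attribute, the same look_down call can collect all of them into a hash. The helper name text_field_values and the two-field sample are illustrative and not from the thread.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Return a hash ref mapping each text <input>'s name to its value.
# Assumption: names are unique within the document; a repeated name
# would overwrite the earlier value.
sub text_field_values {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    my %value_for;
    for my $input ( $tree->look_down( _tag => 'input', type => 'text' ) ) {
        my $name = $input->attr('name');
        next unless defined $name;    # skip inputs without a name
        $value_for{$name} = $input->attr('value');
    }

    $tree->delete;    # release the tree explicitly
    return \%value_for;
}

# Hypothetical sample: two of the fields shown in the thread
my $fields = text_field_values(<<'HTML');
<input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled>
<input type="text" name="ID" id="ID453" value="453" size="10">
HTML

print "$_ = $fields->{$_}\n" for sort keys %$fields;
```

Keying on name rather than id is a design choice here; whichever attribute is unique in the real pages would be the safer key.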
  • Jeffrey Joh at Jul 26, 2011 at 8:12 pm
    Hey Rob,

    This is awesome! However, let's say I have an unknown number of floorplans in a table that looks like this:

    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Ranch #1</td>
    </tr>
    <tr><td><input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled></td>
    <td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
    </tr>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Mission #3</td>
    </tr>
    <tr><td><input type="text" name="date_constructed" id="date_constructed" value="08/01/2009" size="10" disabled></td>
    <td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
    </tr>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Big house #9</td>
    </tr>
    <tr><td><input type="text" name="date_constructed" id="date_constructed" value="last summer" size="10" disabled></td>
    <td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
    </tr>

    I would like to retrieve all of the plan/date/IDs, AND discard all those plans that do not have a proper date_constructed such as "last summer". How could I do that?

    Jeff
  • Rob Dixon at Jul 28, 2011 at 7:25 pm

    On 26/07/2011 21:12, Jeffrey Joh wrote:
    > This is awesome! However, let's say I have an unknown number of
    > floorplans in a table that looks like this:
    >
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>Ranch #1</td>
    > </tr>
    > <tr><td><input type="text" name="date_constructed" id="date_constructed"
    > value="04/01/2004" size="10" disabled></td>
    > <td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
    > </tr>
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>Mission #3</td>
    > </tr>
    > <tr><td><input type="text" name="date_constructed" id="date_constructed"
    > value="08/01/2009" size="10" disabled></td>
    > <td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
    > </tr>
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>Big house #9</td>
    > </tr>
    > <tr><td><input type="text" name="date_constructed" id="date_constructed"
    > value="last summer" size="10" disabled></td>
    > <td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
    > </tr>

    Hi Jeff

    Please bottom-post your replies here. It is the standard for the list,
    and long and complex threads can quickly become incomprehensible if
    posts are made at both ends of the quoted message. Thank you.

    To achieve this, all you need to do is find all of the <input> elements
    with an id attribute of 'date_constructed'. The plan name can be found
    from each of these as before. Take a look at the program below.

    HTH,

    Rob



    use strict;
    use warnings;

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_file(*DATA);

    print "Working from HTML:\n\n";
    print $tree->as_HTML(undef, ' '), "\n\n";

    # Find all <input> elements with an 'id' attribute of 'date_constructed'.
    #
    my @date_tr = $tree->look_down(
        _tag => 'input',
        id   => 'date_constructed',
    )
        or die "No construction dates";

    # Look at each <input> element found, taking the date string from its 'value'
    # attribute
    #
    for my $date_tr (@date_tr) {

        my $plan_date = $date_tr->attr('value');

        # Now look up the tree to the containing <tr> element, and find its
        # previous sibling <tr> which contains the floor plan text in the
        # second <td> child element
        #
        my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
        my @tds = $plan_tr->look_down(_tag => 'td');
        die "Unexpected format" unless @tds == 2;

        my $plan_text = $tds[1]->as_trimmed_text;

        print "Plan found: $plan_text on $plan_date\n";
    }

    __DATA__
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Ranch #1</td>
    </tr>
    <tr><td><input type="text" name="date_constructed" id="date_constructed" value="04/01/2004" size="10" disabled></td>
    <td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
    </tr>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Mission #3</td>
    </tr>
    <tr><td><input type="text" name="date_constructed" id="date_constructed" value="08/01/2009" size="10" disabled></td>
    <td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
    </tr>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Big house #9</td>
    </tr>
    <tr><td><input type="text" name="date_constructed" id="date_constructed" value="last summer" size="10" disabled></td>
    <td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
    </tr>

    **OUTPUT**

    Working from HTML:

    <html>
    <head>
    </head>
    <body>
    <table>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Ranch #1</td>
    </tr>
    <tr>
    <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="04/01/2004" /></td>
    <td><input id="ID453" name="ID" size="10" type="text" value="453" /></td>
    </tr>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Mission #3</td>
    </tr>
    <tr>
    <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="08/01/2009" /></td>
    <td><input id="ID986" name="ID" size="10" type="text" value="986" /></td>
    </tr>
    <tr>
    <td align="right" valign="top"><b>Floor plan:</b></td>
    <td>Big house #9</td>
    </tr>
    <tr>
    <td><input disabled id="date_constructed" name="date_constructed" size="10" type="text" value="last summer" /></td>
    <td><input id="ID354" name="ID" size="10" type="text" value="354" /></td>
    </tr>
    </table>
    </body>
    </html>

    Plan found: Ranch #1 on 04/01/2004
    Plan found: Mission #3 on 08/01/2009
    Plan found: Big house #9 on last summer
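
The follow-up question also asked to discard plans whose date_constructed is not a proper date, and the output above still lists "last summer". One possible way to filter, assuming the dates are written MM/DD/YYYY (the thread does not say whether 04/01/2004 is month-first), is a pattern match followed by a range check with the core Time::Local module. The helper is_valid_date and the hard-coded @plans list are hypothetical stand-ins for the values collected in the loop.

```perl
use strict;
use warnings;

use Time::Local qw(timegm);

# True if $text is a real calendar date written as MM/DD/YYYY.
# Assumption: month comes first; swap the captures for DD/MM/YYYY.
sub is_valid_date {
    my ($text) = @_;
    return 0 unless defined $text;

    my ( $month, $day, $year ) = $text =~ m{\A(\d\d)/(\d\d)/(\d{4})\z}
        or return 0;

    # timegm dies on impossible dates such as 02/30/2004
    return eval { timegm( 0, 0, 0, $day, $month - 1, $year ); 1 } ? 1 : 0;
}

# Stand-in records using the values shown in the thread
my @plans = (
    { plan => 'Ranch #1',     date => '04/01/2004',  id => 453 },
    { plan => 'Mission #3',   date => '08/01/2009',  id => 986 },
    { plan => 'Big house #9', date => 'last summer', id => 354 },
);

# Keep only the plans whose date parses as a real calendar date
my @dated = grep { is_valid_date( $_->{date} ) } @plans;

print "$_->{plan} (ID $_->{id}) on $_->{date}\n" for @dated;
```

In the real loop the same test could simply skip an iteration with next unless is_valid_date($plan_date). The ID could presumably be read from the same row by calling look_down(_tag => 'input', name => 'ID') on the enclosing <tr> element.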
  • Shawn wilson at Jul 26, 2011 at 10:27 pm
    Ya know, I'm sure there's a place for all of these. However, Web::Scraper
    works great with the XPath expressions that element inspectors return. It's
    really easy to use, and you can easily return variable types that suit your
    output best, e.g. a hash with field names per table element for DBIC.
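
As a rough illustration of that suggestion, and not code from the thread, the sketch below uses Web::Scraper with XPath selectors to pull the date and ID values into arrays. The scraper name $plan_scraper, the result keys, and the trimmed two-row sample are assumptions made for the example.

```perl
use strict;
use warnings;

use Web::Scraper;

# Each 'process' rule pairs a selector (XPath here) with a result key;
# a key ending in [] collects every match into an array ref, and
# '@value' asks for the element's 'value' attribute.
my $plan_scraper = scraper {
    process '//input[@name="date_constructed"]', 'dates[]' => '@value';
    process '//input[@name="ID"]',               'ids[]'   => '@value';
};

# Hypothetical sample, trimmed from the rows shown in the thread
my $html = <<'HTML';
<table>
<tr><td><input type="text" name="date_constructed" value="04/01/2004"></td>
<td><input type="text" name="ID" value="453"></td></tr>
<tr><td><input type="text" name="date_constructed" value="08/01/2009"></td>
<td><input type="text" name="ID" value="986"></td></tr>
</table>
HTML

my $result = $plan_scraper->scrape($html);

print "dates: @{ $result->{dates} }\n";
print "ids:   @{ $result->{ids} }\n";
```

Two parallel arrays only line up if every row carries both fields; a nested scraper per row would likely be more robust for real pages.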
    On Jul 26, 2011 4:15 PM, "Jeffrey Joh" wrote:

    > Hey Rob, This is awesome! However, let's say I have an unknown number of
    > floorplans in a table that looks like this:
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>Ranch #1</td>
    > </tr>
    > <tr><td><input type="text" name="date_constructed" id="date_constructed"
    > value="04/01/2004" size="10" disabled></td>
    > <td><input type="text" name="ID" id="ID453" value="453" size="10"></td>
    > </tr>
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>Mission #3</td>
    > </tr>
    > <tr><td><input type="text" name="date_constructed" id="date_constructed"
    > value="08/01/2009" size="10" disabled></td>
    > <td><input type="text" name="ID" id="ID986" value="986" size="10"></td>
    > </tr>
    > <tr>
    > <td align="right" valign="top"><b>Floor plan:</b></td>
    > <td>Big house #9</td>
    > </tr>
    > <tr><td><input type="text" name="date_constructed" id="date_constructed"
    > value="last summer" size="10" disabled></td>
    > <td><input type="text" name="ID" id="ID354" value="354" size="10"></td>
    > </tr> I would like to retrieve all of the plan/date/IDs, AND discard all
    > those plans that do not have a proper date_constructed such as "last
    > summer". How could I do that? Jeff
