Hello,

I need help using sgmllib's SGMLParser to parse an HTML file and keep
track of the number of times each tag is used.

At the end of the program I need to print out the number of times each
tag was seen (presumably any type of tag can appear) and the linked
text.

I need help getting past the first steps. I already have this basic
program that returns hyperlinks. I can't seem to understand how to parse
any tag and keep track of it so I can print it out later.

I'm very frustrated, and any help is appreciated!



--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks


parser = HtmParser()

inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs

content = urllib.urlopen(inptAdrs)

bufff = content.read()
print 'Statistics for ', inptAdrs

print 'There are', len(bufff), 'characters in the web page'

parser.feed(bufff)
# close() forces any buffered data to be processed, so call it
# before reading the results.
parser.close()

print parser.get_hyperlinks()


---------------------------------------------------------------------------------

Any help is much appreciated.
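
A note for readers on Python 3: sgmllib, raw_input and urllib.urlopen no longer exist there, so the program above only runs on Python 2. Below is a minimal sketch of the same hyperlink collector built on the standard-library html.parser module instead. This is a substitute technique, not the poster's original code; the class name LinkParser and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser calls this for every start tag; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hyperlinks.append(value)


# Illustrative input; a real program would read the page contents instead.
sample = '<p><a href="http://python.org">Python</a> and <a name="x">no link</a></p>'

parser = LinkParser()
parser.feed(sample)
parser.close()
print(parser.hyperlinks)
```

Unlike sgmllib, html.parser has no per-tag start_a methods; every start tag goes through the single handle_starttag(tag, attrs) hook, so the tag name has to be checked by hand.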


  • Hapaboy2059 at May 2, 2006 at 6:38 pm
    Could I make a global variable and keep track of each tag count?

    Also, how would I make a list or dictionary of the tags that are found?
    How can I handle any tag that is given?
  • Heiko Wundram at May 2, 2006 at 9:48 pm

    On Tuesday, 2 May 2006 at 20:38, hapaboy2059 at gmail.com wrote:
    > Could I make a global variable and keep track of each tag count?
    >
    > Also, how would I make a list or dictionary of the tags that are found?
    > How can I handle any tag that is given?

    The following snippet does what you want:

    >>>
    from sgmllib import SGMLParser

    class MyParser(SGMLParser):

        def __init__(self):
            SGMLParser.__init__(self)
            self.tagcount = {}
            self.links = set()

        # Tag count handling
        # ------------------

        def handle_starttag(self,tag,method,args):
            self.tagcount[tag] = self.tagcount.get(tag,0) + 1
            method(args)

        def unknown_starttag(self,tag,args):
            self.tagcount[tag] = self.tagcount.get(tag,0) + 1

        # Argument handling
        # -----------------

        def start_a(self,args):
            self.links.update([value for name, value in args if name == "href"])

    parser = MyParser()
    parser.feed(file("test.html").read()) # Insert your data source here...
    parser.close()

    print parser.tagcount
    print parser.links
    >>>

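    See the sgmllib documentation for more info on handle_starttag and
    unknown_starttag. handle_starttag is called for tags that have a
    start_<tag> or do_<tag> method (here, only a), and unknown_starttag for
    all the others, which is why both of them count. The counting in
    handle_starttag might just as well have been implemented in start_a,
    but if you want argument handling for more tags, it's best to keep it
    in this one central place.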

    --- Heiko.
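
The reply above counts tags and collects href values, but the original question also asked for the linked text. Below is a rough Python 3 sketch that does both. It uses html.parser and collections.Counter in place of sgmllib, which was removed in Python 3, so it is a swapped-in technique rather than the code from the thread. The names TagCounter, _href and _text and the sample markup are assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and record (href, linked text) pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tagcount = Counter()   # tag name -> number of start tags seen
        self.links = []             # (href, linked text) pairs, in document order
        self._href = None           # href of the <a> currently open, if any
        self._text = []             # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Called for every start tag, known or not, so one counter suffices.
        self.tagcount[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Text only counts as linked text while an <a href=...> is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Illustrative input; replace with the contents of the page being analysed.
sample = (
    '<html><body><p>See <a href="a.html">the <b>first</b> page</a>.</p>'
    '<p><a href="b.html">Second</a><br></p></body></html>'
)

counter = TagCounter()
counter.feed(sample)
counter.close()

for tag, count in sorted(counter.tagcount.items()):
    print(tag, count)
for href, text in counter.links:
    print(href, "->", text)
```

Keeping the counts and links as instance attributes answers the global-variable question as well: each parser object carries its own state, so no module-level variable is needed. The sketch does not try to handle nested or unclosed a elements.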

Discussion Overview
group: python-list
category: python
posted: May 2, '06 at 4:12a
active: May 2, '06 at 9:49p
posts: 4
users: 3
website: python.org
