Is there a way to extract the source of an image in an HTML file using only
one struct (with encoding/xml)? Right now I have something like this:

type XML struct {
    A Imatge `xml:"div>img"`
}

type Imatge struct {
    I string `xml:"src,attr"`
}

It would be great to declare only something like this:

type Image struct {
    I string `xml:"div>img,src,attr"`
}

This is the HTML

<div><div><img src="hello.png"/></div></div>
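
For reference, the two-struct version from the question can be run as a complete program. This is only a minimal sketch: the names Page, Img and imgSrc are invented here (the question calls the types XML and Imatge), and it assumes the input is well-formed XML with the img sitting exactly at div>img below the root element.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// Page and Img mirror the two structs from the question under
// illustrative names.
type Page struct {
	Img Img `xml:"div>img"` // path is relative to the root element
}

type Img struct {
	Src string `xml:"src,attr"`
}

// imgSrc decodes doc and returns the src attribute of the img found at
// div>img below the root element.
func imgSrc(doc string) (string, error) {
	var p Page
	if err := xml.Unmarshal([]byte(doc), &p); err != nil {
		return "", err
	}
	return p.Img.Src, nil
}

func main() {
	src, err := imgSrc(`<div><div><img src="hello.png"/></div></div>`)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(src) // hello.png
}
```

The outer div is consumed as the root element, which is why the tag path names only one div.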


  • Krolaw at Sep 22, 2012 at 7:27 pm
    This is a known issue with the xml package.
    See: http://code.google.com/p/go/issues/detail?id=3688
    I suggest you star it :-)
  • Kyle Lemons at Sep 24, 2012 at 4:04 pm
    There is also an HTML5-compliant parser in the works; it currently exists
    at tip as exp/html <http://tip.golang.org/pkg/exp/html>.
  • Eduard Castany at Sep 25, 2012 at 4:32 pm
    Whoa! I tried exp/html and it immediately consumed all available RAM
    (4GB), so I killed the process.
    I ended up doing this terrible thing:

    i := strings.Index(s, `src="`) + 5
    j := strings.Index(s[i:], `"`)
    Img := s[i : i+j] // j is relative to s[i:], so add i back

    (imagine it for very large and ugly html files with lots of images in each)
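
For a file with many images, the same string search has to run in a loop and cope with a missing match. The following is a rough sketch of that idea only; allSrcs is a hypothetical helper name, and the approach stays fragile: it also matches data-src=" and script tags, and it misses single-quoted or unquoted values.

```go
package main

import (
	"fmt"
	"strings"
)

// allSrcs returns every value that follows the literal text src=" in s,
// up to the next double quote. It does no real HTML parsing.
func allSrcs(s string) []string {
	const marker = `src="`
	var out []string
	for {
		i := strings.Index(s, marker)
		if i < 0 {
			return out // no more matches
		}
		s = s[i+len(marker):] // s now starts at the attribute value
		j := strings.Index(s, `"`)
		if j < 0 {
			return out // unterminated value; stop
		}
		out = append(out, s[:j])
		s = s[j+1:] // continue after the closing quote
	}
}

func main() {
	page := `<p><img src="a.png"/><img alt="x" src="b.png"/></p>`
	fmt.Println(allSrcs(page)) // [a.png b.png]
}
```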

  • Francesc Campoy Flores at Sep 25, 2012 at 6:23 pm
    Hi Eduard,

    The parsing you're using seems really fragile; there are many situations
    where it won't work.

    I was thinking about how I would do it, and I wrote this:
    http://play.golang.org/p/lT1lsmGf-0

    I don't know if it's exactly what you need, but I hope it helps.

    Salut,
    Francesc

    --
  • Nigel Tao at Sep 25, 2012 at 10:47 pm

    On 26 September 2012 02:32, Eduard Castany wrote:
    > Whoa! I tried exp/html and it immediately consumed all available RAM
    > (4GB), so I killed the process.
    That's not good. If you mail me (off-list) your code and data then I
    will debug this.

    --
  • Nigel Tao at Sep 26, 2012 at 12:10 am

    Your program explodes because you're trying to marshal the result of
    html.Parse to JSON. An HTML document is a tree of nodes that is
    self-referential: parents link to children and children link to
    parents. Trying to marshal this relationship in JSON (which doesn't
    have pointers) leads to an infinite loop as it tries to flatten the
    link graph.
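
To make the cycle concrete, here is a small sketch with a hypothetical Node type (not the exp/html one) that has the same parent/child shape. It assumes one possible way around the problem, which is to tell encoding/json to skip the back-pointer with a `json:"-"` tag so that only the downward links are encoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Node is a hypothetical stand-in for a parsed HTML node.
type Node struct {
	Tag      string  `json:"tag"`
	Parent   *Node   `json:"-"` // back-pointer: skipped so the encoder never walks upward
	Children []*Node `json:"children,omitempty"`
}

// add appends a child with the given tag and wires up its Parent link.
func (n *Node) add(tag string) *Node {
	c := &Node{Tag: tag, Parent: n}
	n.Children = append(n.Children, c)
	return c
}

// encode marshals the tree rooted at n.
func encode(n *Node) (string, error) {
	b, err := json.Marshal(n)
	return string(b), err
}

func main() {
	root := &Node{Tag: "div"}
	root.add("div").add("img")
	out, err := encode(root)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out) // {"tag":"div","children":[{"tag":"div","children":[{"tag":"img"}]}]}
}
```

Without the `json:"-"` tag on Parent, the encoder would follow child, then parent, then child again, which is the loop described above.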

    Your original post wanted to extract the img src values from an HTML
    document. You don't need JSON marshaling for that. The program below
    shows how:


    package main

    import (
        "exp/html"
        "exp/html/atom"
        "fmt"
        "log"
        "os"
    )

    func main() {
        f, err := os.Open(os.Getenv("GOROOT") + "/doc/docs.html")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        doc, err := html.Parse(f)
        if err != nil {
            log.Fatal(err)
        }

        walk(doc)
    }

    func walk(n *html.Node) {
        if n.Type == html.ElementNode && n.DataAtom == atom.Img {
            for _, a := range n.Attr {
                if a.Key == "src" {
                    fmt.Println(a.Val)
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            walk(c)
        }
    }


    Sample output:

    $ grep img.*src $GOROOT/doc/docs.html
    <img class="gopher" src="/doc/gopher/doc.png"/>
    <img class="gopher" src="/doc/gopher/talks.png"/>
    <img class="gopher" src="/doc/gopher/project.png"/>
    $ go run main.go
    /doc/gopher/doc.png
    /doc/gopher/talks.png
    /doc/gopher/project.png

    --
  • Eduard Castany at Sep 26, 2012 at 2:18 am
    Good! I've been using exp/html for an hour and I really like it!
    With it I'm able to parse HTML, store the interesting data in a struct, and
    finally marshal it to JSON.

    Would it be possible for n.Attr to be a map[string]string instead of a
    slice of struct{ Namespace, Key, Val string }? I ask because I ended up
    with a lot of code like this:

    switch n.DataAtom {
    case atom.P:
        if n.Parent.Attr != nil && (n.Parent.Attr[0].Val == "p3" ||
            n.Parent.Attr[1].Val == "p3") {
            val1 = n.Attr[1].Val
            val2 = n.Attr[0].Val
        }
    ...

    I think that would reduce to:

    switch n.DataAtom {
    case atom.P:
        if n.Parent.Attr != nil && n.Parent.Attr["class"] == "p3" {
            val1 = n.Attr["alt"]
            val2 = n.Attr["src"]
        }
    ....
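
The map-style lookup can also be approximated without any change to the package, by building the map once per node. The sketch below is an illustration under assumptions: Attribute is a local stand-in with the three fields mentioned above (so the example runs with the standard library alone), and attrMap is an invented helper name.

```go
package main

import "fmt"

// Attribute is a local stand-in with the three fields described in the
// post; it is not the exp/html type itself.
type Attribute struct {
	Namespace, Key, Val string
}

// attrMap indexes attributes by key, ignoring namespaces. If a key is
// repeated, the last value wins.
func attrMap(attrs []Attribute) map[string]string {
	m := make(map[string]string, len(attrs))
	for _, a := range attrs {
		m[a.Key] = a.Val
	}
	return m
}

func main() {
	img := attrMap([]Attribute{
		{Key: "src", Val: "hello.png"},
		{Key: "alt", Val: "hello"},
	})
	// A missing key yields the empty string, so no index can go out of range.
	fmt.Println(img["src"], img["alt"], img["class"] == "") // hello.png hello true
}
```

A slice keeps attribute order and allows duplicates and namespaces, which a plain map[string]string cannot represent; that trade-off is a likely reason to keep the slice and convert only where lookups are needed.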


Discussion Overview
group: golang-nuts
categories: go
posted: Sep 22, '12 at 1:30p
active: Sep 26, '12 at 2:18a
posts: 8
users: 5
website: golang.org
