Hi all,

I'm very interested in this thread. I also have to solve the problem
of spidering web sites, creating an index (well, on this point there is the
BIG problem that Lucene can't easily be integrated with a DB),
extracting links from each page, and repeating the whole process.

For extracting links from a page I'm thinking of using JTidy. I think
that with this library you can also parse a page that is not well formed
(fetched from the web with URLConnection) by setting the property that
cleans up the page. The Tidy class returns an org.w3c.dom.Document that
you can use to analyze the whole document: for example, you can use
doc.getElementsByTagName("a") to collect all the a elements. You can
then process it as XML.
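The JTidy call itself is not shown here, but assuming you already have the org.w3c.dom.Document that Tidy hands back, the extraction step would look roughly like this sketch (it builds the Document with the standard JAXP parser purely so the example is self-contained and runnable):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LinkExtractor {

    /** Collect the href attribute of every a element in the document. */
    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }

    /** Parse a well-formed (X)HTML string into a DOM Document. */
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><a href=\"a.html\">one</a>"
                + "<p><a href=\"b.html\">two</a></p></body></html>");
        System.out.println(extractLinks(doc)); // prints [a.html, b.html]
    }
}
```

The same extractLinks method works unchanged on the Document that Tidy produces, since both implement the org.w3c.dom interfaces.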

Has anyone solved the problem of spidering web pages recursively?
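One common shape for the recursive spidering asked about above (a sketch, not code from any library in this thread): keep a queue of URLs to visit and a set of URLs already seen, and stop at a page limit. The fetch-and-extract step is passed in as a function here so the skeleton runs without the network:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class Spider {

    /**
     * Breadth-first crawl starting from seed. fetchLinks maps a URL to the
     * links found on that page (e.g. via JTidy or any HTML parser).
     * Returns every URL visited, in crawl order.
     */
    public static Set<String> crawl(String seed,
                                    Function<String, List<String>> fetchLinks,
                                    int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            if (!visited.add(url)) continue;          // already seen, skip
            for (String link : fetchLinks.apply(url)) {
                if (!visited.contains(link)) queue.add(link);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Fake three-page site: a -> b, c; b -> a; c -> (nothing)
        java.util.Map<String, List<String>> site = java.util.Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a"),
                "c", List.of());
        System.out.println(crawl("a", u -> site.getOrDefault(u, List.of()), 10));
        // prints [a, b, c]
    }
}
```

A real crawler would also normalize URLs, stay within the target host, and honor robots.txt; the visited set is what prevents the recursion from looping forever on cyclic links.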

Laura



While trying to research the same thing, I found the following... here's
a good example of link extraction:
Try http://www.quiotix.com/opensource/html-parser

It's easy to write a Visitor which extracts the links; it should take about
ten lines of code.



--
Brian Goetz
Quiotix Corporation
brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-8032
http://www.quiotix.com


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>


  • Otis Gospodnetic at Apr 20, 2002 at 2:10 pm
    Laura,

    Search the lucene-user and lucene-dev archives for things like:
    crawler
    spider
    spindle
    lucene sandbox

    Spindle is something you may want to look at, as is MoJo (not mentioned
    on lucene lists, use Google).

    Otis

    __________________________________________________
    Do You Yahoo!?
    Yahoo! Games - play chess, backgammon, pool and more
    http://games.yahoo.com/

  • Lucene at Apr 21, 2002 at 12:47 pm
    Hi Otis,

    thanks for your reply. I have been looking for Spindle and MoJo for two
    hours, but I haven't found anything.

    Can you help me? Where can I find them?

    Thanks for your help and time.


    Laura



  • Otis Gospodnetic at Apr 21, 2002 at 4:27 pm
    Laura,

    http://marc.theaimsgroup.com/?l=lucene-user&w=2&r=1&s=Spindle&q=b

    Oops, it's JoBo, not MoJo :)
    http://www.matuschek.net/software/jobo/

    Otis

  • James Cooper at Apr 22, 2002 at 5:16 am

    Spindle is at:

    http://www.bitmechanic.com/projects/spindle/

    cheers

    -- James


  • David Black at Apr 21, 2002 at 9:48 pm
    I think I have found the secret recipe for doing this...

    1. The example at Sun for link extraction; this was very easy to convert
    over to my application:
    http://developer.java.sun.com/developer/TechTips/1999/tt0923.html


    2. Brian Goetz's (great) Library at
    http://www.quiotix.com/opensource/html-parser
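A side note on item 1: the JDK itself ships an HTML parser in javax.swing.text.html that copes with non-well-formed pages. Here is a self-contained sketch of link extraction with it (an illustration along those lines, not the TechTip's exact code):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingLinkExtractor {

    /** Collect href values of a tags; tolerates sloppy, unclosed HTML. */
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) links.add(href.toString());
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Note the unclosed tags: the parser copes with them.
        System.out.println(extractLinks(
                "<body><p><a href=\"one.html\">one<p><a href=\"two.html\">two"));
    }
}
```

This avoids any third-party dependency, at the cost of Swing's fairly dated HTML 3.2 DTD.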


    While the "Visitor Design Pattern" might make your eyes cross at first,
    it's actually pretty cool. Here's a simple Visitor class that I wrote
    to extract the text from the HTML. I also reference a piece of code for
    searching and replacing strings, in a class called StripperUtils.java.
    If I understood the Visitor pattern better, I could probably produce
    something more elegant, like converting "&..;" entities to their
    appropriate unencoded text.



    ---------------begin HTMLTextVisitor.java --------

    import com.quiotix.html.parser.*;
    import java.io.*;

    public class HTMLTextVisitor extends HtmlVisitor {
        protected PrintWriter out;

        public HTMLTextVisitor(OutputStream os) {
            out = new PrintWriter(os);
        }

        public HTMLTextVisitor(OutputStream os, String encoding)
                throws UnsupportedEncodingException {
            out = new PrintWriter(new OutputStreamWriter(os, encoding));
        }

        public void finish() {
            out.flush();
        }

        public void visit(HtmlDocument.Text t) {
            String txt = t.toString();
            txt = StripperUtils.replace(txt, "&nbsp;", " ");
            // for some weird reason, the first pass doesn't get all of them
            txt = StripperUtils.replace(txt, "&nbsp;", " ");
            out.print(txt);
        }
    }

    ---------- end HTMLTextVisitor -------


    --------- begin StripperUtils.java ----------

    public class StripperUtils {

        public static String replace(String originalText,
                                     String subStringToFind,
                                     String subStringToReplaceWith) {
            int s = 0;
            int e = 0;
            StringBuffer newText = new StringBuffer();
            while ((e = originalText.indexOf(subStringToFind, s)) >= 0) {
                newText.append(originalText.substring(s, e));
                newText.append(subStringToReplaceWith);
                s = e + subStringToFind.length();
            }
            newText.append(originalText.substring(s));
            return newText.toString();
        }
    }

    --------- End StripperUtils.java --------------
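On the entity conversion mentioned above, a minimal decoder for a handful of common entities can be built in the same search-and-replace spirit; the entity table here is illustrative, not complete:

```java
public class EntityDecoder {

    private static final String[][] ENTITIES = {
        {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""},
        {"&nbsp;", " "}, {"&amp;", "&"}  // &amp; last, so "&amp;lt;" decodes to "&lt;", not "<"
    };

    /** Replace a few common HTML entities with their plain-text equivalents. */
    public static String decode(String text) {
        for (String[] pair : ENTITIES) {
            text = text.replace(pair[0], pair[1]);  // String.replace hits all occurrences
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(decode("a&nbsp;&lt;&nbsp;b &amp; c"));
        // prints a < b & c
    }
}
```

Since String.replace already substitutes every occurrence in one pass, one call per entity suffices here, unlike the double &nbsp; pass in the Visitor above.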

  • Lucene at Apr 22, 2002 at 9:43 am
    Hi all,

    Has anyone tried JoBo?

    It seems like good software that can be extended.

    Does anyone have experience with it?

    Laura

  • Paulo Gaspar at Apr 24, 2002 at 8:28 pm
    Did anyone take a look at NekoHTML?
    http://www.apache.org/~andyc/

    Con: needs Xerces.


    Have fun,
    Paulo Gaspar


Discussion Overview
group: java-user
category: lucene
posted: Apr 20, 2002 at 1:29 pm
active: Apr 24, 2002 at 8:28 pm
posts: 8
users: 5
website: lucene.apache.org
