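[Xapian-discuss] [Omindex] How to associate a web URL in search results based on a document stored as a local file?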
Folks,

I download a web page and want to add it to the index. I am using omindex (command below). When I search for the document, the hyperlink URL in the search results points to a file (e.g. http://www.mysite.com/shoe/tennis_shoe/tennis_shoe.html). What I want to do is download the HTML file, save it, and have it appear in the results with a link back to the original web URL. How can I do this? I have been looking at modifying omega.cc, but I am just getting started with the source code, so I thought there might be a better way or tool to use.

Thanks,
OSC

~/xapian/bin/omindex --db /var/data/omega/data/default --url 'http://www.mysite.com/shoe/tennis_shoe' /tmp/shoe/tennis_shoe.html



  • Olly Betts at May 16, 2006 at 9:46 am

    On Mon, May 15, 2006 at 03:43:21PM -0800, oscaruser@programmer.net wrote:
    > I download a web page and want to add it to the index. I am using
    > omindex (as below). When I search for the document, I see in search
    > results that the hyper text link URL is to a file (e.g.
    > http://www.mysite.com/shoe/tennis_shoe/tennis_shoe.html).

    That's what you asked for when you said:

    --url 'http://www.mysite.com/shoe/tennis_shoe'

    > What I want to be able to do is download the HTML file, save it, have
    > it appear with a link back to the original web URL.

    You mean like Google's "cached copy"?

    After indexing, save each page using a filename which can be derived
    from the URL (e.g. MD5SUM of the URL). Then you can write a simple
    template page in PHP or similar (called cached.php, say) which takes a
    parameter "url" and loads the text of the cached page and displays it
    with a link to the url.

    Then you can just set the link in the omegascript query template to be
    something like:

    <a href="$html{cached.php?url=$field{url}}">$field{title}</a>

    Cheers,
    Olly
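
The cache-naming step suggested above could be sketched roughly as follows. This is only an illustration, not code from the thread: the cache directory, the function names, and the `.html` suffix are assumptions, and `cached.php` is the hypothetical viewer page named in the reply. The URL is percent-encoded here before it is placed in the query string, which the template line above does not do on its own.

```python
import hashlib
from pathlib import Path
from urllib.parse import quote

# Hypothetical location for the saved pages (an assumption for this sketch).
CACHE_DIR = Path("/tmp/page_cache")


def cache_path(url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Map a URL to a stable filename: the hex MD5 digest of the URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.html"


def save_page(url: str, html: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Store the downloaded HTML under the name derived from its URL."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_path(url, cache_dir)
    path.write_text(html, encoding="utf-8")
    return path


def cached_link(url: str) -> str:
    """Build the href for the hypothetical cached.php viewer page."""
    return "cached.php?url=" + quote(url, safe="")
```

A viewer page would then recompute the same digest from its "url" parameter to find the saved file, so no separate lookup table is needed.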
  • Oscaruser at May 16, 2006 at 11:12 pm
    How can I have it display only the specified URL (e.g. http://www.mysite.com/shoe/tennis_shoe)? It is appending "/tennis_shoe.html", which represents the local file, and that is not what I want.
    Thanks
  • Oscaruser at May 17, 2006 at 1:41 am
    FYI, resolved using the following change. Note that I assume only a single file in the directory.
    Thanks

    oscar@delta:~/xapian/omega-0.9.6$ diff omindex.cc ../orig/omega-0.9.6/omindex.cc
    399,400c399
    < //string record = "url=" + baseurl + url + "\nsample=" + sample;
    < string record = "url=" + baseurl + "\nsample=" + sample;
    ---
    > string record = "url=" + baseurl + url + "\nsample=" + sample;
    oscar@delta:~/xapian/omega-0.9.6$
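
To make the effect of that one-line change concrete, here is a small Python sketch that mimics the record string from the diff. It is not omindex code: `make_record` and its `append_path` flag are hypothetical names, and the relative path "/tennis_shoe.html" is taken from the earlier message in this thread.

```python
def make_record(baseurl: str, relpath: str, sample: str,
                append_path: bool = True) -> str:
    """Mimic the record string built in the diff above.

    append_path=True  follows the stock line:   "url=" + baseurl + url
    append_path=False follows the patched line: "url=" + baseurl
    """
    url = baseurl + relpath if append_path else baseurl
    return "url=" + url + "\nsample=" + sample


base = "http://www.mysite.com/shoe/tennis_shoe"

# Stock behaviour: the file's path is appended to the base URL.
print(make_record(base, "/tennis_shoe.html", "Tennis shoes"))

# Patched behaviour: every file indexed in this run gets the same URL,
# which is why the change assumes a single file per directory.
print(make_record(base, "/tennis_shoe.html", "Tennis shoes", append_path=False))
```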

    _______________________________________________
    Xapian-discuss mailing list
    Xapian-discuss@lists.xapian.org
    http://lists.xapian.org/mailman/listinfo/xapian-discuss
  • Olly Betts at May 17, 2006 at 8:55 am

    On Tue, May 16, 2006 at 04:41:02PM -0800, oscaruser@programmer.net wrote:
    > fyi, resolved using the following change. note i assume only a single
    > file in the directory.

    OK, I think I understand what you're doing now. That change will work,
    though if you want to handle a lot of pages at once you'll probably find
    it faster to use scriptindex than to run your modified omindex for every
    page. But for scriptindex you'll need to put the page into a suitable
    format whereas omindex allows you to just read the HTML page.

    Cheers,
    Olly
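
As a rough idea of what "a suitable format" for scriptindex might involve, the sketch below turns (URL, HTML) pairs into a plain-text dump. It assumes the scriptindex input format described in the Omega documentation: one "name=value" line per field, a blank line between records, and an "=" at the start of each continuation line of a multi-line value. The field names `url`, `title`, and `text` and all function names are assumptions for this example, and a matching index script (not shown) would still be needed to tell scriptindex what to do with each field.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # non-zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth:
            self.text_parts.append(data)


def escape_value(value: str) -> str:
    """Mark continuation lines of a multi-line value with a leading '='."""
    return value.strip().replace("\n", "\n=")


def to_record(url: str, html: str) -> str:
    """Build one record with url, title and text fields."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each value fits on one line.
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join("".join(parser.text_parts).split())
    fields = [("url", url), ("title", title), ("text", text)]
    return "\n".join(f"{name}={escape_value(value)}" for name, value in fields)


def to_dump(pages) -> str:
    """pages: iterable of (url, html) pairs; records are separated by a blank line."""
    return "\n\n".join(to_record(u, h) for u, h in pages) + "\n"
```

Because each record carries its own `url` field, many pages can go into one dump file and be indexed in a single scriptindex run, which is the speed advantage mentioned above.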
