Hello, I am not sure if this is the right question for this list, but
it is in regard to search engines.

Suppose you have a website that hosts some protected content that is
accessible only to registered users. How do you make the content
searchable by Google and other popular web search engines? The idea is
not to reveal the content even via the "Google cache."

Here is what I am thinking...
Using Lucene (or one of its derivatives), skim through the "protected
content," remove all the common stop words, stem the remaining words, and
place the resulting text files in a directory available to the search
bots (via robots.txt rules). That way, even if the content is cached by
the search engines, it does not make much sense to humans, but it still
lets them search it. When they click on the link to a skimmed file, we
need to redirect them to the login/register page, and upon successful
login, they should be redirected to the actual human-readable page that
corresponds to the "skimmed content." Note that the "protected content"
may be living in a Content Management System or a database.
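The skimming step could be sketched in plain Java along these lines. This is a minimal stand-in: the class name `Skimmer`, the tiny stop-word list, and the crude suffix stemmer are all illustrative; a real version would use Lucene's StopFilter and a Porter stemmer instead.

```java
import java.util.*;

// Toy version of the "skimming" step: lowercase, drop stop words,
// and chop a few common English suffixes off the remaining words.
public class Skimmer {
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "the", "is", "in", "of", "to", "that"));

    // Very naive stemmer: strips the first matching suffix it finds.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Produces the bot-facing text: searchable, but hard to read.
    static String skim(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            if (out.length() > 0) out.append(' ');
            out.append(stem(token));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // e.g. "The searching of protected pages" -> "search protect pag"
        System.out.println(skim("The searching of protected pages"));
    }
}
```

The output is still matchable by keyword queries but reads like gibberish, which is the point of the scheme above.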

Am I overthinking/engineering it? Any ideas are really appreciated.

Thanks in advance,
Chakra
--
Visit my weblog: http://www.jroller.com/page/cyblogue

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Otis Gospodnetic at Mar 17, 2005 at 5:22 am
    Hello,

    --- Chakra Yadavalli wrote:
    Hello, I am not sure if this is the right question for this list, but
    it is in regard to search engines.
    This is not really the right place to ask these types of questions,
    but... robots at mccmedia.com may be a better place to ask, or one of
    the forums where people who track phenomena such as the Google Dance
    exchange information.
    What you described is doable. You will have to detect the Googlebot
    user agent and feed it indexable text, while redirecting real users to
    the protected area.

    Using Lucene is overkill, though. You can easily remove stop words
    with a simple Perl script, for example. Another thing you could do is
    just shuffle the words around. The current generation of search
    engines typically doesn't care about (make use of) word order, while a
    human will be lost if you shuffle the words.
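The two pieces Otis mentions (user-agent detection plus word shuffling) could look roughly like this in Java. The class name `BotGate`, the bot substrings, and the `/login` URL are illustrative assumptions, not a complete or reliable crawler-detection list.

```java
import java.util.*;

// Sketch: serve shuffled, indexable text to crawlers (identified by
// User-Agent substring) and send everyone else to the login page.
public class BotGate {
    static boolean isSearchBot(String userAgent) {
        if (userAgent == null) return false;
        String ua = userAgent.toLowerCase();
        return ua.contains("googlebot") || ua.contains("bingbot")
            || ua.contains("slurp"); // Yahoo's crawler
    }

    // Deterministic seed keeps the shuffled page stable between crawls.
    static String shuffle(String text, long seed) {
        List<String> words = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        Collections.shuffle(words, new Random(seed));
        return String.join(" ", words);
    }

    static String respond(String userAgent, String content) {
        return isSearchBot(userAgent)
            ? shuffle(content, 42L)       // searchable, not readable
            : "302 Redirect: /login";     // humans must sign in first
    }
}
```

Note that User-Agent sniffing is easy to spoof, and serving different content to bots and humans is the kind of cloaking search engines may penalize, as Kevin points out below in the thread.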

    Otis

  • Kevin L. Cobb at Mar 17, 2005 at 12:55 pm
    I worked on a website that had the same issue. We made a "search
    engine" page that listed all the documents we wanted indexed, as links
    to summaries of those documents, with links to the full documents on
    the limited-access site. Google won't be able to follow those links
    because they require user sign-on, but the link will be there for
    enticement when Googlers find the page.

    I love the idea of removing the stop words and stemming the rest for
    Google indexing. But Google is pretty picky, I am told, so I would not
    be surprised if they detected this sort of scheme and decided not to
    index your pages.

    Good luck.

    KLCobb

    -----Original Message-----
    From: Chakra Yadavalli
    Sent: Wednesday, March 16, 2005 11:44 PM
    To: java-user@lucene.apache.org
    Subject: How do you make "protected content" searchable by Google?

  • Chakra Yadavalli at Mar 17, 2005 at 2:06 pm
    I think this scheme is not misleading any users. We are not putting
    "meaningless" keywords into the pages just to get page rank. That
    would be against Google's policy, correct?



    --
    Visit my weblog: http://www.jroller.com/page/cyblogue


Discussion Overview
group: java-user
categories: lucene
posted: Mar 17, '05 at 4:44a
active: Mar 17, '05 at 2:06p
posts: 4
users: 3
website: lucene.apache.org
