Hello,


I am writing a crawler in Python that crawls Quora. I can't read Quora's content without logging in, yet Google and Bing crawl Quora. One thing I can do is use browser automation to log in to my account and then go link by link and crawl the content, but this method is slow. Can anyone tell me how I should start writing this crawler?




Thanks,
Umesh Kumar Sharma


  • Dave Angel at Aug 3, 2013 at 9:09 pm

    I had never heard of quora. And I had to hunt a bit to find a link to
    this website. When you post a question here which refers to a
    non-Python site, you really should include a link to it.


    You start with reading the page: http://www.quora.com/about/tos


    which you agreed to when you created your account with them. At one
    place it seems pretty clear that unless you make specific arrangements
    with Quora, you're limited to using their API.


    I suspect that they bend over backwards to get Google and the other big
    names to index their stuff. But that doesn't make it legal for you to
    do the same.


    In particular, the section labeled "Rules" places constraints on
    automated crawling, and so do other parts of the TOS. Crawling is
    permissible, but not scraping. What does that mean? I dunno. Perhaps
    scraping is what you're describing above as "method is slow."
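
One machine-readable counterpart to rules like these is a site's robots.txt file. The thread does not mention robots.txt, so treat the following as a side note: a minimal sketch using the standard library's `urllib.robotparser`, with a made-up rule set (`SAMPLE_ROBOTS`), a hypothetical helper (`is_allowed`), and a hypothetical user agent name (`mybot`). Passing this check does not replace reading the TOS.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration; a real crawler would
# download the file from the site it intends to visit.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS, "mybot", "http://example.com/public/page")
private_ok = is_allowed(SAMPLE_ROBOTS, "mybot", "http://example.com/private/page")
```

With the sample rules, `public_ok` is true and `private_ok` is false, because only paths under `/private/` are disallowed.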


    I'm going to be looking to see what APIs they offer, if any. I'm
    creating an account now.


    --
    DaveA
  • David Hutto at Aug 8, 2013 at 3:51 am
    I've never tried this, but if you're after a search-term type of app
    rather than the data itself, then crawl by IP address. If you're after
    keywords or metadata, then crawl and parse, as you seem to be doing, for
    keywords and the URLs associated with them, and eliminate the URLs that
    lack the keyword you pass into your function.
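
The keyword filtering described above could be sketched roughly as follows. This is only one reading of the suggestion: it uses the standard library's `html.parser`, and the names `LinkCollector`, `filter_links`, and `SAMPLE` are hypothetical. It works on an HTML string you already have and does no fetching or logging in.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute_url, anchor_text) pairs from the <a> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def filter_links(html, base_url, keyword):
    """Return the URLs whose address or anchor text mentions keyword."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    needle = keyword.lower()
    return [url for url, text in collector.links
            if needle in url.lower() or needle in text.lower()]


# Made-up page content for illustration.
SAMPLE = """
<a href="/topic/python">Python questions</a>
<a href="/topic/cooking">Recipes</a>
<a href="http://example.org/about">About Python crawlers</a>
"""

matches = filter_links(SAMPLE, "http://example.com/", "python")
```

Here `matches` keeps the first and third links and drops the cooking link, since neither its URL nor its anchor text contains the keyword.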


    Then, of course, just as stated above, some sites won't let you have access
    in other ways, which you should be able to circumvent some way.











    --
    Best Regards,
    David Hutto
    *CEO:* *http://www.hitwebdevelopment.com*

Discussion Overview
group: python-list
categories: python
posted: Aug 3, '13 at 7:01p
active: Aug 8, '13 at 3:51a
posts: 3
users: 3
website: python.org
