Hi guys. I have to implement a topical crawler as part of my
project. What language should I implement it in,
C or Python? Python has a fast development cycle, but my concern is
also speed. I want to strike a balance between development speed and
crawler speed. Since Python is an interpreted language, it is rather
slow. The crawler, which will be working on a huge set of pages, should be
as fast as possible. One possible approach would be implementing
partly in C and partly in Python so that I can have the best of both
worlds. But I don't know how to go about it. Can anyone guide me on
which parts should be implemented in C and which in Python?


  • Abhinav at Feb 16, 2006 at 7:46 am
    It is DSL broadband, 128 kbps. But that's not the point. What I am asking is
    whether Python would be fine for implementing fast crawler algorithms, or
    whether I should use C. Handling huge data, multithreading, file
    handling, heuristics for ranking, and maintaining huge data
    structures. What should the language be so as not to compromise too
    much on speed? What is the performance of Python-based crawlers vs.
    C-based crawlers? Should I use both languages (partly C and partly Python)?
    How should I decide which parts to implement in C and which
    in Python?
    Please guide me. Thanks.
  • Fuzzyman at Feb 16, 2006 at 8:29 am

    abhinav wrote:
    It is DSL broadband, 128 kbps. But that's not the point. What I am asking is
    whether Python would be fine for implementing fast crawler algorithms, or
    whether I should use C.
    But a web crawler is going to be *mainly* I/O bound, so language
    efficiency won't be the main issue. There are several web crawlers
    implemented in Python.
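    The I/O-bound point is easy to demonstrate with a toy sketch. The
    "fetch" below is a stand-in that merely sleeps rather than hitting the
    network (the URLs are made up), but the effect is the same: with
    threads overlapping their waits, twenty 0.1-second "downloads" finish
    in roughly 0.2 seconds of wall-clock time instead of the 2 seconds a
    serial loop would need.

    ```python
    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch(url, latency=0.1):
        # Stand-in for a network round trip: the thread just blocks,
        # exactly as it would while waiting on a socket.
        time.sleep(latency)
        return url

    urls = ["http://example.com/page%d" % i for i in range(20)]

    start = time.time()
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(fetch, urls))
    elapsed = time.time() - start

    # 20 fetches of 0.1 s each would take ~2 s serially; with 10 threads
    # overlapping their waits, wall-clock time is roughly 0.2 s.
    print(len(results), elapsed)
    ```

    No amount of rewriting the loop in C changes the time spent waiting.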
    Handling huge data, multithreading, file
    handling, heuristics for ranking, and maintaining huge data
    structures. What should the language be so as not to compromise too
    much on speed? What is the performance of Python-based crawlers vs.
    C-based crawlers? Should I use both languages (partly C and partly Python)? How
    If your data processing requirements are fairly heavy you will
    *probably* get a speed advantage coding them in C and accessing them
    from Python.

    The usual advice (which seems to be applicable to you) is to
    prototype in Python (which will be much more fun than in C), then test.

    Profile to find your real bottlenecks (if the Python one isn't fast
    enough - which it may be), and move your bottlenecks to C.
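    That profiling step is all standard library. A minimal sketch (the
    `hot_spot` function here is a hypothetical bottleneck invented for
    illustration, a quadratic-time duplicate check):

    ```python
    import cProfile
    import io
    import pstats

    def hot_spot(n):
        # Hypothetical bottleneck: list membership is a linear scan,
        # so this loop is quadratic in the number of distinct values.
        seen = []
        for i in range(n):
            if i % 100 not in seen:
                seen.append(i % 100)
        return seen

    profiler = cProfile.Profile()
    profiler.enable()
    hot_spot(200_000)
    profiler.disable()

    # Rank functions by cumulative time; the offender shows up at the top
    # and is the candidate for a better algorithm or a C rewrite.
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
    report = buf.getvalue()
    print(report)
    ```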

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
    should I decide which parts to implement in C and which
    in Python?
    Please guide me. Thanks.
  • Gene tani at Feb 16, 2006 at 1:27 pm

    Paul Rubin wrote:
    "abhinav" <abhinavduggal at gmail.com> writes:
    maintaining huge data structures. What should the language be so as
    not to compromise too much on speed? What is the performance of
    Python-based crawlers vs. C-based crawlers? Should I use both
    languages (partly C and partly Python)? How should I decide which parts
    to implement in C and which in Python? Please guide
    me. Thanks.
    I think if you don't know how to answer these questions for yourself,
    you're not ready to take on projects of that complexity. My advice
    is start in Python since development will be much easier. If and when
    you start hitting performance problems, you'll have to examine many
    combinations of tactics for dealing with them, and switching languages
    is just one such tactic.
    There's another potential bottleneck, parsing HTML and extracting the
    text you want, especially when you hit pages that don't meet HTML 4 or
    XHTML spec.
    http://sig.levillage.org/?pY9
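    Paul's parsing point is worth illustrating: the standard library's
    `html.parser` is event-driven and fairly tolerant of broken markup, so
    a link extractor keeps working even on pages with unclosed tags. A
    minimal sketch (the sample page is deliberately malformed):

    ```python
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects href targets from <a> tags as the parser streams
        through the document; malformed markup does not raise."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # Deliberately broken HTML: unclosed <a> tags, dangling <div>.
    page = '<p>hi <a href="/a">one<a href="/b">two</p><div>'
    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)   # ['/a', '/b']
    ```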

    Paul's advice is very sound, given what little info you've provided.

    http://trific.ath.cx/resources/python/optimization/
    Look at Psyco, Pyrex, Boost.Python, SWIG, and ctypes for bridging C and
    Python; you have a lot of options. Also look at HarvestMan, mechanize, and
    other existing libraries.
  • Gene tani at Feb 16, 2006 at 1:46 pm

    abhinav wrote:
    Hi guys. I have to implement a topical crawler as part of my
    project. What language should I implement it in,
    Oh, and there are some really good books out there, besides the O'Reilly
    "Spidering Hacks". Springer Verlag has a couple of books on "Text Mining"
    and at least a couple of books with "web intelligence" in the title.
    Expensive, but worth it.
  • Andrew Gwozdziewycz at Feb 16, 2006 at 2:31 pm

    On 15 Feb 2006 21:56:52 -0800, abhinav wrote:
    Hi guys. I have to implement a topical crawler as part of my
    project. What language should I implement it in,
    C or Python?
    Why does this keep coming up here as of late? If you search the
    archives, you can find numerous posts about spiders. One interesting
    fact is that Google itself started with spiders written in Python.
    http://www-db.stanford.edu/~backrub/google.html I'm _sure_ it'll work
    for you.



    --
    Andrew Gwozdziewycz <apgwoz at gmail.com>
    http://ihadagreatview.org
    http://plasticandroid.org
  • Steven D'Aprano at Feb 16, 2006 at 5:11 pm

    On Wed, 15 Feb 2006 21:56:52 -0800, abhinav wrote:

    Hi guys. I have to implement a topical crawler as part of my
    project. What language should I implement it in,
    C or Python? Python has a fast development cycle, but my concern is
    also speed. I want to strike a balance between development speed and
    crawler speed. Since Python is an interpreted language, it is rather
    slow.
    Python is no more interpreted than Java. Like Java, it is compiled to
    byte-code. Unlike Java, it doesn't take three weeks to start the runtime
    environment. (Okay, maybe it just *seems* like three weeks.)

    The nice clean distinctions between "compiled" and "interpreted" languages
    haven't existed in most serious programming languages for a decade or
    more. In these days of tokenizers and byte-code compilers and processors
    emulating other processors, the difference is more of degree than kind.

    It is true that standard Python doesn't compile to platform-dependent
    machine code, but that is rarely an issue, since the bottleneck for most
    applications is I/O or human interaction, not language speed. And for
    those cases where it is a problem, there are solutions, like Psyco.

    After all, it is almost never true that your code must run as fast as
    physically possible. That's called "over-engineering". It just needs to
    run as fast as needed, that's all. And that's a much simpler problem to
    solve cheaply.


    The crawler, which will be working on a huge set of pages, should be
    as fast as possible.
    Web crawler performance is almost certainly going to be I/O bound. Sounds
    to me like you are guilty of trying to optimize your code before even
    writing a single line of code. What you call "huge" may not be huge to
    your computer. Have you tried? The great thing about Python is you can
    write a prototype in maybe a tenth the time it would take you to do the
    same thing in C. Instead of trying to guess what the performance
    bottlenecks will be, you can write your code and profile it and find the
    bottlenecks with accuracy.

    One possible approach would be implementing
    partly in C and partly in Python so that I can have the best of both
    worlds.
    Sure you can do that, if you need to.
    But I don't know how to go about it. Can anyone guide me on
    which parts should be implemented in C and which in Python?
    Yes. Write it all in Python. Test it, debug it, get it working.

    Once it is working, and not before, rigorously profile it. You may find it
    is fast enough.

    If it is not fast enough, find the bottlenecks. Replace them with better
    algorithms. We had an example on comp.lang.python just a day or two ago
    where a function which was taking hours to complete was re-written with a
    better algorithm which took only seconds. And still in Python.

    If it is still too slow after using better algorithms, or if there are no
    better algorithms, then and only then re-write those bottlenecks in C for
    speed.



    --
    Steven.
  • Steve Holden at Feb 17, 2006 at 4:41 am

    abhinav wrote:
    Hi guys. I have to implement a topical crawler as part of my
    project. What language should I implement it in,
    C or Python? Python has a fast development cycle, but my concern is
    also speed. I want to strike a balance between development speed and
    crawler speed. Since Python is an interpreted language, it is rather
    slow. The crawler, which will be working on a huge set of pages, should be
    as fast as possible. One possible approach would be implementing
    partly in C and partly in Python so that I can have the best of both
    worlds. But I don't know how to go about it. Can anyone guide me on
    which parts should be implemented in C and which in Python?
    Get real. Any web crawler is bound to spend huge amounts of its time
    waiting for data to come in over network pipes. Or do you have plans for
    massive parallelism previously unheard of in the Python world?

    regards
    Steve
    --
    Steve Holden +44 150 684 7255 +1 800 494 3119
    Holden Web LLC www.holdenweb.com
    PyCon TX 2006 www.python.org/pycon/
  • Ravi Teja at Feb 17, 2006 at 7:11 am
    This is following the pattern of your previous post on language choice
    wrt. writing a mail server. It is very common for beginers to over
    emphasize performance requirements, size of the executable etc. More is
    always good. Right? Yes! But at what cost?

    The rule of thumb for all your Python vs. C questions is ...
    1.) Choose Python by default.
    2.) If your program is slow, it's your algorithm that you need to check
    first. Strictly speaking, Python will be slow because of its dynamism.
    However, most of what is performance-critical in Python is already
    implemented in C, and the speed difference of well-written Python
    programs with properly chosen extensions and algorithms is not far off.
    3.) Remember that you can always drop back to C wherever you need to,
    without throwing away all of your code. And even if you had to, Python is
    very valuable as a prototyping tool, since it is very agile. You would
    have figured out what you needed to do by then, so rewriting it in C
    will take only a fraction of the time it would have taken to write it in
    C directly.
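    The claim in point 2 -- that the performance-critical parts of Python
    are already C -- can be seen directly by racing a pure-Python loop
    against the built-in `sum`, whose loop runs in C. A small, hedged
    benchmark (exact timings will vary by machine, but the ordering won't):

    ```python
    import time

    def py_sum(nums):
        # Pure-Python loop: every iteration goes through the interpreter.
        total = 0
        for n in nums:
            total += n
        return total

    nums = list(range(1_000_000))

    t0 = time.perf_counter()
    a = py_sum(nums)
    t_python = time.perf_counter() - t0

    t0 = time.perf_counter()
    b = sum(nums)          # built-in sum: the loop itself runs in C
    t_builtin = time.perf_counter() - t0

    print(a == b, t_builtin < t_python)
    ```

    Leaning on the built-ins and C-backed libraries is usually the first
    "drop back to C" step, long before writing any C yourself.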

    Don't even start with the question "is it fast enough?" until
    you have already written it in Python and it turns out that it is not
    running fast enough despite the correctness of your code. If that
    happens, you can fix it relatively easily. It is easy to write bad code
    in C, and poorly written C code performs worse than well-written Python
    code.

    Remember Donald Knuth's quote.
    "Premature optimization is the root of all evil in programming".

    C is a language intended to be used when you NEED tight control over
    memory allocation. It has few advantages in other scenarios. Don't
    abuse it by choosing it by default.
  • Alex Martelli at Feb 17, 2006 at 3:11 pm
    Ravi Teja wrote:
    ...
    The rule of thumb for all your Python Vs C questions is ...
    1.) Choose Python by default.
    +1 QOTW!-)

    2.) If your program is slow, it's your algorithm that you need to check
    Seriously: yes, and (often even more importantly) data structure.

    However, the often even more important tip, particularly for large-scale
    systems, is to consider your program's _architecture_ (algorithms are about
    details of computation; architecture is about partitioning systems into
    components, deciding where they are deployed, and so forth). At a generic and
    lowish level: are you, for example, creating a lot of threads, each for a
    small amount of work? Then consider reusing threads from a "worker
    threads" pool. Or maybe you could avoid threads and use event-driven
    programming; or, at the other extreme, have multiple processes
    communicating by TCP/IP so you can scale your system up to tens or
    hundreds of processors -- in the latter case, inter-process communication
    may become the bottleneck, so partition your system appropriately to
    minimize it. Consider UDP when you can afford missing a packet once in a
    while -- sometimes it may let you reduce overheads compared to TCP
    connections.
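    The "worker threads pool" idea sketches out in a few lines with the
    standard library's `queue` and `threading` modules. This is a toy
    illustration (the URLs are made up and the "fetch" is a placeholder),
    but the shape -- long-lived threads pulling work off a shared queue,
    shut down by sentinels -- is the reusable part:

    ```python
    import queue
    import threading

    def worker(tasks, results):
        # Each thread is created once and then reused: it pulls URLs off
        # the shared queue until it sees the None sentinel.
        while True:
            url = tasks.get()
            if url is None:
                break
            # A real crawler would fetch and parse the page here.
            results.put((url, "fetched"))

    tasks = queue.Queue()
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(4)]
    for t in threads:
        t.start()

    for i in range(10):
        tasks.put("http://example.com/%d" % i)
    for _ in threads:          # one sentinel per worker shuts the pool down
        tasks.put(None)
    for t in threads:
        t.join()

    print(results.qsize())     # all 10 work items were processed
    ```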

    Database connections, and less importantly database cursors, are well
    worth reusing. What are you "caching", and what instead is getting
    recomputed over and over? It's possible to undercache (needless
    repeated computation) but also to overcache (tying up memory and causing
    paging). Are you making lots of system calls that you might be able to
    avoid? Each system call has a context-switching cost, after all...

    Any or all of these hints may be irrelevant to a specific category of
    applications, but then, so can the hint about algorithms be. One cool
    thing about Python is that it makes it easy and fast for you to try out
    different approaches (particularly to architecture, but to algorithms as
    well), even drastically different ones, when simple reasoning about the
    issues leaves you undecided and you need to settle them empirically.

    Remember Donald Knuth's quote.
    "Premature optimization is the root of all evil in programming".
    I believe Knuth himself said he was quoting Tony Hoare, and indeed
    referred to this as "Hoare's dictum".


    Alex
  • Magnus Lycka at Feb 20, 2006 at 6:33 pm

    abhinav wrote:
    I want to strke a balance between development speed and crawler speed.
    "The best performance improvement is the transition from the
    nonworking state to the working state." - J. Osterhout

    Try to get there are soon as possible. You can figure out what
    that means. ;^)

    When you do all your programming in Python, most of the code that
    is relevant for speed *is* written in C already. If performance
    is slow, measure! Use the profiler to see if you are spending a
    lot of time in Python code. If that is your problem, take a close
    look at your algorithms and perhaps your data structures and see
    what you can improve with Python. In the long run, going from,
    e.g., O(n^2) to O(n log n) might mean much more than going from
    Python to C. A poor algorithm in machine code still sucks when you
    have to handle enough data. Changing your code to improve on
    algorithms and structure is a lot easier in Python than in C.
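    A crawler-flavored example of that O(n^2) vs. O(n log n) point:
    deduplicating a URL frontier with a list is quadratic, while a set is
    effectively linear. A toy comparison (synthetic URLs; exact timings
    vary, but the gap is orders of magnitude):

    ```python
    import time

    def dedup_list(urls):
        # O(n^2): every membership test scans the whole list.
        seen = []
        for u in urls:
            if u not in seen:
                seen.append(u)
        return seen

    def dedup_set(urls):
        # O(n): hash-based membership, same output order preserved.
        seen = set()
        out = []
        for u in urls:
            if u not in seen:
                seen.add(u)
                out.append(u)
        return out

    urls = ["http://example.com/%d" % (i % 2000) for i in range(10_000)]

    t0 = time.perf_counter()
    a = dedup_list(urls)
    t_list = time.perf_counter() - t0

    t0 = time.perf_counter()
    b = dedup_set(urls)
    t_set = time.perf_counter() - t0

    print(a == b, t_set < t_list)
    ```

    Both versions are pure Python; only the data structure changed.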

    If you've done all these things, still have performance problems,
    and have identified a bottleneck in your Python code, it might
    be time to get that piece rewritten in C. The easiest and least
    intrusive way to do that might be with Pyrex. You might also want
    to try Psyco before you do this.

    Even if you end up writing a whole program in C, it's not unlikely
    that you will get to your goal faster if your first version is
    written in Python.

    Good luck!

    P.S. Why someone would want to write yet another web crawler is
    a puzzle to me. Surely there are plenty of good ideas that haven't
    been properly implemented yet! It's probably very difficult to
    beat Google on their home turf now, but I'd really like to see
    a good tool to manage all that information I got from the net,
    or through mail or wrote myself. I don't think they wrote that
    yet--although I'm sure they are trying.
  • Dfj225 at Feb 20, 2006 at 9:07 pm
    I think something that may be even more important to consider than
    the pure speed of your program would be ease of design, as well as the
    overall stability of your code.

    My opinion would be that writing in Python would have many benefits
    over the speed gains of using C. For instance, your crawler will have to
    handle all types of input from all over the web. Who can say what types
    of malformed or poorly written data it will come across? I think it
    would be easier to create a system to handle this type of data in
    Python than in C.

    I don't want to pigeon-hole your project, but if it is for any use
    other than a commercial product, I would say speed would be a concern
    lower on the list than accuracy or time to develop. As others have
    pointed out, if you hit many performance barriers, chances are the
    problem is the algorithm and not Python itself.

    I wish you luck and hope you will experiment in Python first. If your
    crawler is still not up to par, at the very least you might come up
    with some ideas for how Python could be improved.

Discussion Overview
group: python-list
categories: python
posted: Feb 16, '06 at 5:56a
active: Feb 20, '06 at 9:07p
posts: 12
users: 10
website: python.org
