On Jun 9, 1:48 pm, disappeare... at gmail.com wrote:
I am currently planning to write my own web crawler. I know Python but
not Perl, and I would like to know which of the two is the better
choice given the following scenario:
1) I/O issues: my biggest resource constraint will be the bandwidth
bottleneck.
2) Efficiency issues: the crawlers have to be fast, robust and as
"memory efficient" as possible. I am running all of my crawlers on
cheap PCs with about 500 MB of RAM and P3 to P4 processors.
3) Compatibility issues: most of these crawlers will run on Unix
(FreeBSD), so there should exist a pretty good compiler that can
optimize my code under these environments.
What are your opinions?

You mentioned *what* you want but not *why*. If it's for a real-world
production project, why reinvent a square wheel instead of using (or at
least extending) an existing open source crawler with years of
development behind it? If it's a learning exercise, why worry about
performance so early?
In any case, since you said you know Python but not Perl, the choice
is almost a no-brainer, unless you're looking for an excuse to learn
Perl. In terms of performance the two are comparable, and you can
probably manage crawls on the order of 10-100K pages at best. For
million-page or larger crawls, though, you'll have to resort to C/C++
sooner or later.
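
To give an idea of how little code the non-network part of a small crawler needs in Python, here is a rough sketch using only the standard library. It is not a complete crawler: fetching, politeness delays and robots.txt handling are left out. The names LinkExtractor, Frontier and extract_links are made up for this example, and keeping truncated 8-byte SHA-1 digests in the "seen" set is just one assumed way to avoid holding every full URL string in memory, at the cost of a very small chance that a hash collision skips a URL.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) links from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                url, _ = urldefrag(urljoin(self.base_url, value))
                if url.startswith(("http://", "https://")):
                    self.links.append(url)


class Frontier:
    """FIFO queue of URLs to visit that never enqueues a URL twice."""

    def __init__(self):
        self.queue = deque()
        # Truncated digests of every URL ever added (assumption: a rare
        # collision that skips a page is acceptable).
        self.seen = set()

    def add(self, url):
        digest = hashlib.sha1(url.encode("utf-8")).digest()[:8]
        if digest in self.seen:
            return False
        self.seen.add(digest)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft()

    def __len__(self):
        return len(self.queue)


def extract_links(base_url, html):
    """Return the links found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A crawl loop would pop a URL from the Frontier, download the page, pass the HTML to extract_links and add() each result back. Since the download step waits on the network, that is where most of the wall-clock time goes when bandwidth is the bottleneck.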