Spidey: Python Web Crawler

I created a web crawler in Python using a handful of its modules. It follows certain rules: before crawling a page it reads the site's robots.txt, and only if robots.txt allows the page does Spidey crawl it. It then follows the links it finds recursively. I have set some limitations, though: it does not go beyond 20 pages, since it is just a prototype, and it cannot detect crawler traps, where it would otherwise descend indefinitely.
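For illustration, the robots.txt check is the kind of thing the robotparser module handles directly. This is a minimal sketch, assuming Python 2 (matching the module names listed below), with http://python.org/ used purely as an example:

    import robotparser

    # Fetch and parse the site's robots.txt
    rp = robotparser.RobotFileParser()
    rp.set_url("http://python.org/robots.txt")
    rp.read()

    # Crawl the page only if the rules allow it for any user agent ("*")
    if rp.can_fetch("*", "http://python.org/about/"):
        print "allowed to crawl http://python.org/about/"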

Spidey is a very basic crawler, but it works just fine, at least on the websites I have tested. I tried it on http://python.org/ and http://stackoverflow.com/, and it did pretty well on both.

The modules I have used are urllib2, re, BeautifulSoup, and robotparser.

Here is the code if you want to test/use it.
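What follows is a minimal sketch of the crawler as described above rather than an exact listing; it assumes Python 2 with BeautifulSoup 3, and the names MAX_PAGES, visited, allowed, and crawl are illustrative choices:

    import re
    import urllib2
    import urlparse
    import robotparser
    from BeautifulSoup import BeautifulSoup

    MAX_PAGES = 20          # prototype limit mentioned above
    visited = set()         # URLs already crawled
    robot_cache = {}        # one RobotFileParser per host

    def allowed(url):
        """Check the site's robots.txt before fetching url."""
        parts = urlparse.urlparse(url)
        base = parts.scheme + "://" + parts.netloc
        rp = robot_cache.get(base)
        if rp is None:
            rp = robotparser.RobotFileParser()
            rp.set_url(base + "/robots.txt")
            try:
                rp.read()
            except IOError:
                return False  # skip sites whose robots.txt is unreachable
            robot_cache[base] = rp
        return rp.can_fetch("*", url)

    def crawl(url):
        """Recursively crawl url, honoring robots.txt and the page cap."""
        if len(visited) >= MAX_PAGES or url in visited:
            return
        if not allowed(url):
            return
        visited.add(url)
        try:
            html = urllib2.urlopen(url, timeout=10).read()
        except (urllib2.URLError, IOError):
            return
        print "Crawled:", url
        # Extract links and recurse into each absolute http(s) URL
        soup = BeautifulSoup(html)
        for tag in soup.findAll('a', href=True):
            link = urlparse.urljoin(url, tag['href']).split('#')[0]
            if re.match(r'https?://', link):
                crawl(link)

    if __name__ == '__main__':
        crawl("http://python.org/")

Note that the 20-page cap is also what keeps the recursion from running forever: since there is no trap detection, raising MAX_PAGES would call for something like URL normalization or per-host limits first.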