Using HTTrack to Test Crawlers
Websites are very dynamic these days.
Sometimes a crawler that worked yesterday doesn't seem to work today.
However, if you are working on a big project that requires a few days of work
against a rapidly changing piece of information, it might be smart to get a version of the
website you want to crawl onto your local computer.
Ideally you'll have FTP access to the website and thus be able to
download it locally, but that's not always an option.
Some businesses allow you to access their websites for information,
but there's no way they'll give you FTP access, not to mention hire someone
to prepare an RSS feed (or, god forbid, an API).
Sometimes you have to be covered for all of these cases, and every once in a while
you're pretty much on your own.
That's where a handy tool called HTTrack comes in.
It basically lets you copy a website for offline use (to whatever depth you want),
including rewriting the links so they point to the local copies (which is one of its major strengths).
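A basic mirror can be kicked off straight from HTTrack's command line; here's a minimal sketch driven from Python, with the URL, output directory, and depth as placeholders you'd swap for your own:

```python
# A minimal sketch of driving the httrack command line from Python.
# The URL and output directory below are placeholders.
import subprocess


def mirror_site(url: str, output_dir: str, depth: int = 2) -> None:
    """Copy `url` (and pages up to `depth` links away) into `output_dir`."""
    subprocess.run(
        [
            "httrack", url,
            "-O", output_dir,   # where the local copy is written
            f"-r{depth}",       # how many links deep to follow
            "-v",               # verbose progress output
        ],
        check=True,
    )


mirror_site("http://www.example.com/", "./example-mirror")
```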
The program's configuration can be confusing, so it's important to go
through it step by step.
For instance, if you just want a specific page (or a page along with everything it links to), you might
want to turn off the robots.txt options (because otherwise the download would take pretty much forever).
You might also want to limit the download to specific file types, set a maximum size limit, and so on.
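Those settings map onto HTTrack's command-line options as well. The sketch below assumes the -s0 (don't follow robots.txt rules), -M (overall size cap in bytes), and +/- filter options; the page URL and the exact values are placeholders, and it's worth confirming the flags with `httrack --help` for your version:

```python
# A sketch of a more selective mirror: one page plus its links, no
# robots.txt rules, an overall size cap, and some file types excluded.
import subprocess

subprocess.run(
    [
        "httrack", "http://www.example.com/somepage.html",
        "-O", "./example-mirror",
        "-r1",                # just the page and what it links to
        "-s0",                # don't follow robots.txt rules
        "-M10000000",         # stop after roughly 10 MB overall
        "-*.zip", "-*.avi",   # skip file types you don't need
        "-v",
    ],
    check=True,
)
```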
At any rate, if you'd like to get a local version of just about any website (depending on the TOS of that website, of course),
HTTrack is the way to go.
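And once the mirror is on disk, you can test your crawler's parsing logic against the local files instead of hammering the live site. A small sketch, assuming HTTrack wrote the pages under the output directory chosen above (the exact path is hypothetical):

```python
# Test link extraction against a locally mirrored page instead of the
# live site. The path below is hypothetical; adjust it to wherever
# HTTrack placed the mirrored HTML.
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collect href attributes from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


page = Path("./example-mirror/www.example.com/index.html")
collector = LinkCollector()
collector.feed(page.read_text(encoding="utf-8", errors="replace"))
print(collector.links)
```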