General Question

I need to download all of the pages of a website under www.xyz.com. How do I accomplish this in an automated fashion?

Asked by bootonthroat (344points) June 28th, 2010

I know I cannot duplicate dynamic content but I should at least be able to download all of the static content.

10 Answers

jaytkay:

http://www.httrack.com/ is a Windows & Linux tool for that task.

“HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

“It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.”
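For what it's worth, HTTrack can also be driven from the command line. A minimal sketch, assuming the httrack binary is installed and on PATH; the output directory name is made up, and www.xyz.com is the placeholder from the question:

    # Invoke HTTrack's command-line form to mirror the site into a local directory.
    # The binary name, URL, and output path are assumptions, not the asker's setup.
    import subprocess

    subprocess.run([
        "httrack",
        "http://www.xyz.com/",   # site to mirror (placeholder from the question)
        "-O", "./xyz-mirror",    # directory where the local copy is built
    ], check=True)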

grumpyfish:

http://www.gnu.org/software/wget/ will also do it, under Linux and Cygwin.

Note that wget, at least, respects the robots.txt file by default.
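A typical recursive-mirror invocation looks roughly like the sketch below. It assumes wget is installed and on PATH, uses the www.xyz.com placeholder from the question, and wraps the command in Python's subprocess purely for illustration. Drop the robots override unless you are permitted to ignore robots.txt:

    # Mirror a site with wget; the flags are standard wget options.
    import subprocess

    subprocess.run([
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the copy browses offline
        "--page-requisites",   # also fetch the images, CSS, and scripts pages need
        "--no-parent",         # do not wander above the starting directory
        "-e", "robots=off",    # optional: ignore robots.txt (only where permitted)
        "https://www.xyz.com/",
    ], check=True)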

bootonthroat:

Thank you both. I am trying HTTrack first; if that fails, I will try wget.

bootonthroat:

HTTrack failed; it cannot handle HTTPS.

bootonthroat:

wget also failed. It retrieved only a few of the pages; the ones hidden behind links generated by JavaScript (within the JavaScript menu) were never fetched. Are there additional options?

ETpro:

I use a tool called Web Site Downloader. http://www.web-site-downloader.com/entire/

The only time I have had it fail is when there were required proxy-server settings I did not have. However, for links generated client-side with JavaScript, you will probably have to click through by hand and save each such page individually. That's one reason not to assign mission-critical behavior, like linking, to client-side scripting. Another good reason is that most search spiders do not run JavaScript interpreters, so such content is invisible to them as well.

bootonthroat:

@ETpro: Thank you. I have solved the problem as follows (a rough sketch of steps 2 and 4 appears after the list):
1. Have the downloader mirror the site to a local directory.
2. Run a local webserver pointed at the mirror.
3. Click EVERYTHING (there is a click-and-drag-to-open-links Firefox add-on).
4. Have the downloader download everything listed as a 404 in the local webserver's log.
5. Repeat.
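A rough sketch of steps 2 and 4; the directory names, the choice of Python's built-in http.server, and its log format are assumptions, not the exact tooling used here. It presumes the mirror was served with "python -m http.server 8000 --directory mirror 2> access.log" (http.server writes its access log to stderr), and that the live site is the www.xyz.com placeholder from the question. Query strings are ignored for simplicity.

    # Step 4: read the local server's log, find every path it answered with a 404,
    # and fetch that path from the live site into the mirror directory.
    import os
    import re
    import urllib.request

    LIVE_SITE = "https://www.xyz.com"   # placeholder from the question
    MIRROR_DIR = "mirror"
    LOG_FILE = "access.log"

    # Collect every path the local server could not find.
    missing = set()
    with open(LOG_FILE) as log:
        for line in log:
            match = re.search(r'"GET (\S+) HTTP/[\d.]+" 404', line)
            if match:
                missing.add(match.group(1).split("?")[0])

    # Download each missing page from the live site into the mirror.
    for path in sorted(missing):
        rel = path.lstrip("/")
        if rel == "" or rel.endswith("/"):
            rel += "index.html"      # map directory requests to an index file
        local = os.path.join(MIRROR_DIR, rel)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        url = LIVE_SITE + path
        print("fetching", url, "->", local)
        urllib.request.urlretrieve(url, local)

Rerunning the crawl, clicking again, and repeating (step 5) continues until the log shows no new 404s.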

Vincentt:

wget is cool, but indeed it can't handle JavaScript. You could also have used the DownThemAll! Firefox extension, I think.

bootonthroat:

@Vincentt I added that to my list for next time.

Thanks!

ETpro:

@bootonthroat Glad you got it tamed.
