General Question

I need to download all of the pages of a website under www.xyz.com. How do I accomplish this in an automated fashion?

Asked by bootonthroat (344points) June 28th, 2010

I know I cannot duplicate dynamic content but I should at least be able to download all of the static content.

10 Answers

jaytkay:

http://www.httrack.com/ is a Windows & Linux tool for that task.

“HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

“It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.”
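For what it's worth, HTTrack can also be driven from the command line. A minimal sketch, assuming the httrack binary is installed and on PATH; the output directory name is made up, and www.xyz.com is the placeholder from the question:

    # Invoke HTTrack's command-line form to mirror the site into a local directory.
    # The binary name, URL, and output path are assumptions, not the asker's setup.
    import subprocess

    subprocess.run([
        "httrack",
        "http://www.xyz.com/",   # site to mirror (placeholder from the question)
        "-O", "./xyz-mirror",    # directory where the local copy is built
    ], check=True)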

grumpyfish:

http://www.gnu.org/software/wget/ will also do it, under Linux and Cygwin.

Note that wget, at least, respects the robots.txt file by default.
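A typical recursive-mirror invocation looks roughly like the sketch below. It assumes wget is installed and on PATH, uses the www.xyz.com placeholder from the question, and wraps the command in Python's subprocess purely for illustration. Drop the robots override unless you are permitted to ignore robots.txt:

    # Mirror a site with wget; the flags are standard wget options.
    import subprocess

    subprocess.run([
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the copy browses offline
        "--page-requisites",   # also fetch the images, CSS, and scripts pages need
        "--no-parent",         # do not wander above the starting directory
        "-e", "robots=off",    # optional: ignore robots.txt (only where permitted)
        "https://www.xyz.com/",
    ], check=True)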

bootonthroat:

Thank you both. I am trying HTTrack first; if that fails, I will try wget.

bootonthroat:

HTTrack failed; it cannot handle HTTPS.

bootonthroat:

wget also failed. It retrieved only a few of the pages; the ones hidden behind links generated by JavaScript (within the JavaScript menu) were never fetched. Are there additional options?

ETpro:

I use a tool called Web Site Downloader. http://www.web-site-downloader.com/entire/

The only time I have had it fail is when there were required proxy-server settings I did not have. However, for links generated client-side with JavaScript, you will probably have to click through by hand and save each such page individually. That's one reason not to assign mission-critical behavior, like linking, to client-side scripting. Another good reason is that most search spiders do not run JavaScript interpreters, so such content is invisible to them as well.

bootonthroat:

@ETpro: Thank you. I have solved the problem as follows (a rough sketch of steps 2 and 4 appears after the list):
1. Have the downloader mirror the site to a local directory.
2. Run a local webserver pointed at the mirror.
3. Click EVERYTHING (there is a click-and-drag-to-open-links Firefox add-on).
4. Have the downloader download everything listed as a 404 in the local webserver's log.
5. Repeat.
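A rough sketch of steps 2 and 4; the directory names, the choice of Python's built-in http.server, and its log format are assumptions, not the exact tooling used here. It presumes the mirror was served with "python -m http.server 8000 --directory mirror 2> access.log" (http.server writes its access log to stderr), and that the live site is the www.xyz.com placeholder from the question. Query strings are ignored for simplicity.

    # Step 4: read the local server's log, find every path it answered with a 404,
    # and fetch that path from the live site into the mirror directory.
    import os
    import re
    import urllib.request

    LIVE_SITE = "https://www.xyz.com"   # placeholder from the question
    MIRROR_DIR = "mirror"
    LOG_FILE = "access.log"

    # Collect every path the local server could not find.
    missing = set()
    with open(LOG_FILE) as log:
        for line in log:
            match = re.search(r'"GET (\S+) HTTP/[\d.]+" 404', line)
            if match:
                missing.add(match.group(1).split("?")[0])

    # Download each missing page from the live site into the mirror.
    for path in sorted(missing):
        rel = path.lstrip("/")
        if rel == "" or rel.endswith("/"):
            rel += "index.html"      # map directory requests to an index file
        local = os.path.join(MIRROR_DIR, rel)
        os.makedirs(os.path.dirname(local), exist_ok=True)
        url = LIVE_SITE + path
        print("fetching", url, "->", local)
        urllib.request.urlretrieve(url, local)

Rerunning the crawl, clicking again, and repeating (step 5) continues until the log shows no new 404s.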

Vincentt:

wget is cool, but indeed it can't handle JavaScript. You could also have used the DownThemAll! Firefox extension, I think.

bootonthroat:

@Vincentt I added that to my list for next time.

Thanks!

ETpro:

@bootonthroat Glad you got it tamed.
