10-01-2018, 02:15 AM
The following are abbreviated steps for downloading a Website from the Internet Archive Wayback Machine using CentOS 7.0.
If you check GitHub there are a number of downloaders available, and I don't think they're all equal in quality. I was lucky to find this one by the author hartator, as it worked for me:
https://github.com/hartator/wayback-machine-downloader
If you want, you can check through the above GitHub page, as it gives a long list of options for controlling the download of larger Websites. If you check through the issues you'll note that pages get missed - there were a number of reports about this. So it's not perfect by a long shot, but it's fun for less serious projects.
Before you start the download, it is important to check how large the Website you want to download is, and whether you have enough resources on your VPS - particularly memory, bandwidth and disk space - to handle the download efficiently. One flaw in the downloader is that it doesn't tell you the size it is going to download, and it doesn't give you the option to say "no" either. You need to do that research before you run the download command.
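One way I can think of to do that research is the downloader's own list option, which according to the project's README should print the archived file URLs without downloading anything - treat the exact flag as something to double-check against the README:
Code:
# List the archived files for the site without downloading them (per the README's --list option)
wayback_machine_downloader http://domainname.com --list
# Rough idea of the number of files, assuming one entry per line
wayback_machine_downloader http://domainname.com --list | wc -l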
Here are the commands in abbreviated form, using CentOS 7.0:
Step 1: Install Ruby
Code:
yum install ruby
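Purely as a sanity check on my side (not part of the original steps), you can confirm that Ruby and RubyGems installed correctly before moving on:
Code:
# Confirm the interpreter and the gem tool are on the PATH
ruby --version
gem --version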
Step 2: Install the Wayback Machine Downloader
Code:
gem install wayback_machine_downloader
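Again just a quick check I like to do: confirm the gem is installed and the command is available:
Code:
# Show the installed gem and its version
gem list wayback_machine_downloader
# The command itself should now be on the PATH
which wayback_machine_downloader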
Step 3: Use the downloader command to start the download:
Code:
wayback_machine_downloader http://domainname.com
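If you need more control over a larger Website, the README on the GitHub page above documents extra flags; the ones I'd look at first are, as far as I can tell, the output directory and concurrency options (double-check the exact names against the README before relying on this):
Code:
# Put the files in a specific directory (the README's --directory option)
wayback_machine_downloader http://domainname.com --directory ./my-backup
# Download several files in parallel (the README's --concurrency option)
wayback_machine_downloader http://domainname.com --concurrency 5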
If you want to interrupt the download you can use Ctrl+C. If you want to resume the download at the point where it was interrupted, just repeat the downloader command:
Code:
wayback_machine_downloader http://domainname.com
I'm happy with the outcome so far; however, I haven't taken it to its conclusion yet. My download project is very tricky in that it is a Forum instead of a static Website. I think this will work fine for uncomplicated static Websites; I'm not sure about Forums and Blogs though. There's an issue with time stamps and the way the Forums and Blogs have been archived, and of course there's no database. The layers of .html pages don't go that deep. Hopefully I'll be able to report back about this at a later stage. I'm hoping to get a snapshot of the Forum on X date. Will be interesting to see what will appear.
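For getting the Forum as it looked on a particular date, the README lists from/to timestamp filters; something along these lines is what I have in mind - the flag name and the timestamp below are just an illustration, not the actual date I'm after:
Code:
# Only take files archived on or before a given timestamp (YYYYMMDDHHMMSS), per the README's --to option
wayback_machine_downloader http://domainname.com --to 20180101000000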
In retrospect, I have decided to give up on megatools. The backup was so slow, and almost a tenth of the way in it just came to a complete stop. I'll probably go about this in a different way.