How to download an entire website from the Internet Archive Wayback Machine
#1
The following are abbreviated steps for downloading a Website from the Internet Archive Wayback Machine on CentOS 7.0.

If you check GitHub, there are a number of downloaders available, and I don't think they're all equal in quality. I was lucky to find this one by hartator, which worked for me:

https://github.com/hartator/wayback-machine-downloader

If you want, read through the GitHub page above; it gives a long list of options for controlling the download of larger Websites. If you check the issues, you'll see a number of reports of missed pages, so it's not perfect by a long shot. But it's fun for less serious projects.

Before you start the download, check how large the Website is and whether your VPS has enough resources, particularly memory, bandwidth and disk space, to handle the download efficiently. One flaw in the downloader is that it doesn't tell you how much data it is going to download, and it doesn't ask for confirmation before it starts. You need to do that research before you run the download command.
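A quick pre-flight check can cover the disk-space part of that research. The sketch below is only an illustration: has_free_space is a made-up helper name, not something the downloader provides, and the 2 GB threshold is an arbitrary assumption to replace with your own estimate.

```shell
# Hypothetical helper (not part of wayback_machine_downloader):
# succeeds only if the filesystem holding DIR has at least NEED_KB kilobytes free.
has_free_space() {
    dir=$1
    need_kb=$2
    # df -Pk prints POSIX-format output in 1K blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Assumed threshold: roughly 2 GB (2097152 KB) free in the current directory.
if has_free_space . 2097152; then
    echo "enough space"
else
    echo "not enough space"
fi
```

The downloader's README also lists a --list option that prints the archived file URLs without downloading anything, which gives a rough idea of the file count (not the size in bytes). Check wayback_machine_downloader --help on your installed version before relying on it.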

Here are the abbreviated commands for CentOS 7.0:

Step 1:  Install Ruby

yum install ruby


Step 2:  Install the Wayback Machine Downloader

gem install wayback_machine_downloader


Step 3:  Use the downloader command to start the download:

wayback_machine_downloader http://domainname.com



If you want to interrupt the download, press Ctrl+C. To resume the download at the point where it was interrupted, just repeat the downloader command:

wayback_machine_downloader http://domainname.com


I'm happy with the outcome so far, but I haven't taken it to its conclusion yet. My download project is very tricky in that it is a Forum instead of a static Website. I think this will work fine for uncomplicated static Websites, but I'm not sure about Forums and Blogs. There's an issue with timestamps and the way Forums and Blogs have been archived. And of course there is no database. The layers of .html pages don't go that deep. Hopefully I'll be able to report back about this at a later stage. I'm hoping to get a snapshot of the Forum as of X date. It will be interesting to see what appears.
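For the "snapshot as of X date" idea, the downloader's README documents --from and --to options that take 14-digit Wayback timestamps (YYYYMMDDhhmmss). Here is a minimal sketch, assuming you want everything captured on or before a given day. to_wayback_ts is a hypothetical helper name, and the date and domain are placeholders.

```shell
# Hypothetical helper: turn YYYY-MM-DD into the 14-digit timestamp form
# (YYYYMMDDhhmmss), pointing at the last second of that day.
to_wayback_ts() {
    printf '%s235959\n' "$(printf '%s' "$1" | tr -d '-')"
}

# Placeholder date standing in for "X date".
ts=$(to_wayback_ts 2018-10-01)

# Printed rather than executed, so the command can be reviewed first.
echo "wayback_machine_downloader http://domainname.com --to $ts"
```

Copy the printed command into the terminal once it looks right. Whether this gives a consistent Forum snapshot still depends on which pages the Wayback Machine captured before that date.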

In retrospect, I have decided to give up on megatools. The backup was so slow, and almost a tenth of the way through it came to a complete stop. I'll probably go about this in a different way.
Terminal
Thank you to Post4VPS and VirMach for my awesome VPS 9!  
#2
What does it download? Only the index page or all the pages? And can I download the MySQL database with it?
#3
(10-07-2018, 07:49 PM)youssefbasha Wrote: Downloads what? Only the index page or all the pages? And can i download the MySQL with it?

Lol no, it's not possible.

Never heard of this. If it actually does what the title says, then it's seriously something else lol.
#4
(10-07-2018, 07:49 PM)youssefbasha Wrote: Downloads what? Only the index page or all the pages? And can i download the MySQL with it?

No, you can only download HTML pages. It's like an HTML snapshot of every available page that a guest can view. Logically, it can't download a database. If it could, that would obviously have caused a big uproar among site owners, since anyone would be able to steal the content of a Forum.

Also, not all of the pages are included, only the top layers of links. So if it's a Forum, you'll notice that when you click deeper you won't find all of the pages. I'd say maybe two, or at most three, layers.
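One way to check how deep a finished download actually goes is to count the saved files per directory level. This is a minimal sketch, assuming the download sits in a single directory tree (the downloader's README gives ./websites/<domain>/ as the default location). files_per_depth is a hypothetical helper name, and the demo tree is made up purely for illustration.

```shell
# Hypothetical helper: print "<depth> <file count>" for every directory level below ROOT.
files_per_depth() {
    root=$1
    # Run find from inside ROOT so every path starts with "./";
    # a path with NF slash-separated fields then sits NF - 2 levels below ROOT.
    (cd "$root" && find . -type f) |
        awk -F/ '{ depth[NF - 2]++ } END { for (d in depth) print d, depth[d] }' |
        sort -n
}

# Made-up demo tree: one file at the top, one a level down, two at the second level.
demo=$(mktemp -d)
mkdir -p "$demo/forum/thread"
touch "$demo/index.html" "$demo/forum/index.html" \
      "$demo/forum/thread/1.html" "$demo/forum/thread/2.html"
files_per_depth "$demo"
rm -rf "$demo"
```

Pointing it at the real download directory instead of the demo tree shows at a glance whether anything was saved below the second or third level.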

It also doesn't take HTML snapshots of restricted pages. Only the Forums that can be viewed as a guest, and nothing more than that.

(10-08-2018, 05:05 AM)Abinash Wrote: Never heard of this, and if it actually does the thing like the title tells then it's seriously something else lol.
What exactly do you mean by this? Did you read the tutorial? There was no claim that it can capture a MySQL database. In fact, if you had read the last paragraph of my tutorial, you'd have noticed that I said it can't extract a database.

deanhills Wrote:I'm happy with the outcome so far, however haven't taken it to its conclusion yet.  My download project is very tricky in that it is a Forum instead of a static Website.  I think for uncomplicated static Websites this will work fine.  Not sure about Forums and Blogs though.  There's an issue with time stamps and the way the Forums and Blogs have been archived.  And of course no database.  The layers of .html pages don't go that deep.  Hopefully I'll be able to report back about this at a later stage.  I'm hoping to get a snapshot of the Forum on X date.  Will be interesting to see what will appear.
I doubt the intention of the Wayback Machine was ever to copy Websites in detail, only to "snapshot" them, that is, to create a superficial HTML representation.

If you do want detailed backups of Forums from the Wayback Machine, I'm sure it could be arranged on a paid basis by contacting the guys there.

But that is not what this tutorial is about. It contains simple, abbreviated steps for installing the Wayback downloader and downloading pages from the Internet Archive. Similar scripts are available from quite a number of authors; check GitHub or Google "wayback downloader". The reviews of these scripts all agree that you don't get a consistent result: you don't get all of the pages, and two downloads of the same site don't come out the same. But for simple Websites you do get a snapshot of exactly what the Website looked like.
#5
(10-08-2018, 05:23 AM)deanhills Wrote:
Exactly what do you mean with this?  Did you read the tutorial?  There was no claim that it can capture a mySQL database - in fact if you had read the last paragraph of my tutorial you'd have noticed that I said that it can't extract a database. 

I do know that it's not possible, and that is what I said in my reply to @youssefbasha, lol -

(10-08-2018, 05:05 AM)Abinash Wrote: Lol no, it's not possible.


(10-08-2018, 05:05 AM)Abinash Wrote: Never heard of this, and if it actually does the thing like the title tells then it's seriously something else lol.

Now where do you see me talking about capturing a MySQL database in the above reply?
I just meant that if it really captures the site, then that's something cool.
#6
(10-08-2018, 06:38 AM)Abinash Wrote: I do know that it's not a possible thing and that is what i said in my reply to @yousefbasha, lol -




Now where do you see me talking about capturing mysql database in the above reply?
I just meant that if it really captures the site then it's like something cool.

Oh good. Glad you clarified it, and thanks for the feedback.

I probably need to test this downloader script with a much smaller Website and see how it performs.  I'll check around if I can find a smaller Forum or Website and retest the script.
#7
Try using their API if it is available; I believe it is. If not, just Ctrl+S through all the available pages or crawl them. Note that the Wayback Machine has a really slow connection, so crawling will probably take very long even if your internet is fast.
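On the API point: the Internet Archive does expose a public CDX query endpoint that lists the captures it holds for a URL. The sketch below only builds the query URL and does not contact the server. cdx_url is a hypothetical helper name, and the parameter choices (prefix match, one row per unique URL, a row limit) are assumptions for illustration.

```shell
# Hypothetical helper: build a CDX query URL that lists captures under SITE,
# collapsed to one row per unique URL and capped at LIMIT rows.
cdx_url() {
    printf 'https://web.archive.org/cdx/search/cdx?url=%s&matchType=prefix&collapse=urlkey&fl=timestamp,original&limit=%s\n' "$1" "$2"
}

# Placeholder domain and limit.
cdx_url domainname.com 50

# To actually fetch the list (needs network access and curl):
#   curl -s "$(cdx_url domainname.com 50)"
```

Each returned row pairs a timestamp with the original URL, which is enough to see what was archived before deciding whether a full download is worth the wait.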
Terminal
humanpuff69@FPAX:~$ Thanks To Shadow Hosting And Post4VPS for VPS 5

