Any one with experience - scraping WP site from Wayback Machine?
(12-10-2020, 02:08 PM)tbelldesignco Wrote: @deanhills actually I had to do something similar last year when it came to a website for a board I was assisting with. I used this tool I found from GitHub and it worked really well so we could claim back some assets being used on the old site before it was removed.
https://github.com/sangaline/wayback-machine-scraper

Thanks @tbelldesignco. The one I tried must have been similar, but maybe I should try this one too. This is the one I tried:
https://github.com/hartator/wayback-machine-downloader

It is good, but it can't really download a WordPress site, because it can't get the database. It managed to get some themes, but not the theme that was actually in use. There are too many holes to restore the blog as it used to be. He would have to start a completely new blog and then use the available material to recreate the posts and pages from scratch.
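To see where the holes are before downloading anything, one could ask the Wayback Machine which URLs it actually captured. Below is a minimal sketch, assuming the public CDX search endpoint at web.archive.org and its JSON output, where the first row is a header. The function names and `example.com` are made up for illustration, and this is not how either GitHub tool works internally.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=50):
    """Build a CDX query listing successfully captured URLs under a domain."""
    params = {
        "url": domain + "/*",                 # everything below the domain
        "output": "json",                     # JSON rows instead of plain text
        "fl": "timestamp,original,mimetype",  # fields to return
        "filter": "statuscode:200",           # only captures that returned 200
        "collapse": "urlkey",                 # one row per distinct URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn CDX JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def list_captures(domain, limit=50):
    """Fetch and parse the capture list (needs network access)."""
    with urlopen(build_cdx_url(domain, limit)) as resp:
        return parse_cdx_rows(json.loads(resp.read().decode("utf-8")))
```

Calling `list_captures("example.com", limit=20)` and printing each entry's `timestamp`, `mimetype` and `original` gives a quick overview of which pages, style sheets and scripts exist in the archive at all.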

To be honest, I think my way of going straight to the page source of the Wayback Machine pages got me more material for the guy's blog posts, project posts and technical posts. For example, I opened the index page, viewed the source, and managed to get all of the style sheets, JavaScript, Bootstrap and more, which I then copied with Notepad++. I was able to get the index page up. The GitHub tool was able to do that too, but it was unable to get the blog, project and technical posts. By going deeper I got the blog posts as one very long page, and the same for the project posts and technical posts. One could probably separate the individual posts out of those pages and copy and paste them when recreating them. My few pages were more meaningful in the end than the downloader's 680 files.
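Separating the posts out of one very long saved page could be partly automated. Here is a rough sketch using only Python's standard library, assuming each post sits inside its own `<article>` element, which many WordPress themes do but which I have not checked for this site. `PostSplitter` and `split_posts` are names I made up.

```python
from html.parser import HTMLParser


class PostSplitter(HTMLParser):
    """Collect the text of each top-level <article> element as one post."""

    def __init__(self):
        super().__init__()
        self.posts = []    # finished posts, as plain text
        self._depth = 0    # how many <article> tags we are currently inside
        self._chunks = []  # text fragments of the post being read

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._depth += 1
            if self._depth == 1:
                self._chunks = []  # a new top-level post starts here

    def handle_endtag(self, tag):
        if tag == "article" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join(" ".join(self._chunks).split())
                if text:
                    self.posts.append(text)

    def handle_data(self, data):
        if self._depth > 0:
            self._chunks.append(data)


def split_posts(html):
    """Return a list with the plain text of every post found in the page."""
    parser = PostSplitter()
    parser.feed(html)
    parser.close()
    return parser.posts
```

This only keeps the text, so images and formatting would still have to be added by hand, and any inline script or style text inside a post would come along too. If the theme wraps posts in something other than `<article>`, the tag check would need changing.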

BUT, maybe I'll try your suggestion as well.  
https://github.com/sangaline/wayback-machine-scraper

I had to do an all-nighter last night. It started at a reasonable hour when I decided I was going to use my paid HostUS VPS for the wayback-machine-downloader. One has to install Ruby first and then the downloader. But when I tried to update with yum, it came back with serious errors about mirrors. I hadn't realized it, but my VPS was still on CentOS 6. Anyway, it took me hours of Googling, as there were so many reports about it. The solutions looked iffy and I didn't want to lose the content, but the more answers I found the more complicated it got, until I decided I'd save time if I just rebuilt the VPS with CentOS 7 and set up the content again. CentOS 8 still has some hiccups with some of the scripts I use, and CentOS 7 is just right. After I'd put it back together with a healthy yum, I decided to use the host's random port number. It gets created online, and then they send it to you by e-mail. I then locked myself out of my VPS, as my SSH was blocked. It took more searches to finally figure out that I had to provide an accept rule for fail2ban for the new port. After a reboot it was fine again. By that time it was morning over here.

I then very successfully managed to install Ruby and the wayback-machine-downloader. Everything went well. I downloaded the site to my VPS, and then transferred it to my computer with FileZilla. The downloader scraped all of the files on the domain, over and above the ones for the blog. I focused only on the blog, but had to give it up after a while. I'm really sad about the database thing. If I were enough of a specialist, maybe I could have built a database and a WordPress theme, but this project taught me how limited I am.

OK, I'll now try your suggestion and will let you know how it went.

Anyway, I've just looked at your suggestion, and it looks like one can search in more detail. For example, if I went into the Wayback Machine I could get the IDs of the blog, project and other pages. I'll see if I have enough energy to do that. But wow, it's very time consuming. Educational too, but I'm wondering whether I'm wasting time now, particularly since I'm not well versed enough in the contents of the WordPress site to know what to look for in the theme. I searched all of the pages and not one of them showed the theme, except a micro blog. I got a style sheet for a BlueDiamond theme, but it's a serious premium theme, and it doesn't look at all like his index page. Maybe it wasn't the one in use. In any case the downloader only got parts of the theme, and when I started a new WordPress installation, it showed the theme as broken. So I'm hopeful that the guy has copies of the theme on his computer. To be honest, I haven't heard back from him either. Hope he is OK.
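One way to check which theme a saved page was really using is to look for `/wp-content/themes/<name>/` in its source, since WordPress loads a theme's style sheets and scripts from that folder. A small sketch along those lines follows. `find_theme_slugs` is a name I made up, and the folder name is only a hint, because a child theme or a renamed folder can hide the real theme.

```python
import re
from collections import Counter

# WordPress serves theme assets from /wp-content/themes/<folder>/...
THEME_PATH = re.compile(r"/wp-content/themes/([A-Za-z0-9_.-]+)/")


def find_theme_slugs(html):
    """Return (theme_folder, count) pairs, most frequently referenced first."""
    return Counter(THEME_PATH.findall(html)).most_common()
```

Running this over the saved index page source and over the long blog page would show whether they reference the same theme folder. The pattern also matches inside the rewritten links that the Wayback Machine puts in front of the original URLs.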


