Anyone with experience scraping a WP site from the Wayback Machine?
#1
Is there anyone here with experience scraping material from the WayBack Machine? I Googled it, and there are so many options, but I'm not comfortable with how legit some of them are.

So this starts as a sad story. When Gigarocket closed, one of the members, from Iran, failed to respond in time and as a consequence lost his blog. Initially I didn't take it to heart; in this day and age it is very difficult for me to understand someone not making regular backups of their website, particularly one they value. But then I learned this guy has rheumatoid arthritis and can barely use his fingers. The blog was really important to him. I had hoped that the owner of Gigarocket had made backups of all of the hosting accounts, but learned that this had not happened, so the blog was no more.

I first checked with DuckDuckGo to see whether I could find bits and pieces, which I could. When I read through them I began to understand that this guy must be a lecturer or professor in engineering with a lot of IT capability. The blog posts were written in excellent English, so he must have put a lot of energy into it. It must be a great loss to him. I then thought to check the WayBack Machine to see whether I could find any material there, doubtful at first, as I wouldn't have thought the WayBack Machine would take copies of a personal blog. But then I discovered that the blog had been "photographed" by the WayBack Machine 57 times, the last time on 31 October. It was that good a blog, with substance.

The blog is a WordPress blog, but I get the feeling some of it was designed by the owner himself; the design looks to be based on a WordPress micro blog template. Obviously I can't get the database, but I was able to copy and paste the source of the pages and the CSS stylesheets into Notepad++. There is still a lot of material missing though, such as the images.
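In case it helps show what I mean, this is roughly the step I've been doing by hand, scripted. It pulls one archived page in its original form (the "id_" modifier after the timestamp asks the Wayback Machine for the page without its own toolbar markup) and then lists the image URLs I'm still missing. The blog URL and timestamp here are just placeholders, not the real site:

Code:
# Fetch one archived page in its original form and list the <img> URLs it
# references, so those can be rescued from the archive as well.
# The snapshot URL below is a placeholder.
import re
import urllib.request

SNAPSHOT = "https://web.archive.org/web/20201031000000id_/http://example-blog.com/"

with urllib.request.urlopen(SNAPSHOT) as resp:
    html = resp.read().decode("utf-8", errors="replace")

# Keep a local copy of the page source.
with open("index.html", "w", encoding="utf-8") as f:
    f.write(html)

# Crude scan for image references that still need to be downloaded.
for src in re.findall(r'<img[^>]+src=["\']([^"\']+)', html, flags=re.I):
    print(src)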

So if anyone has suggestions for how to scrape the WordPress site deeper than I've been able to, it will be much appreciated.
Thank you to Post4VPS and VirMach for my awesome VPS 9!  
#2
@deanhills actually I had to do something similar last year for a website belonging to a board I was assisting with. I used this tool I found on GitHub, and it worked really well; we were able to claim back some of the assets used on the old site before it was removed.
https://github.com/sangaline/wayback-machine-scraper
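As far as I understand, scrapers like this work off the Wayback Machine's public CDX index, so you can also query it directly first to see what was actually captured before downloading anything. A rough Python sketch of that query; the domain is a placeholder:

Code:
# Ask the Wayback Machine's CDX index which snapshots it holds for a page.
# "example-blog.com" stands in for the real blog address.
import json
import urllib.request

CDX = ("https://web.archive.org/cdx/search/cdx"
       "?url=example-blog.com&output=json&fl=timestamp,statuscode")

with urllib.request.urlopen(CDX) as resp:
    body = resp.read().decode("utf-8")

rows = json.loads(body) if body.strip() else []
captures = rows[1:]  # the first row is the field-name header
print(len(captures), "snapshots found")
for timestamp, status in captures:
    print(timestamp, status)

That gives you the same kind of list you see on the Wayback calendar page, just in a form you can feed into other scripts.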
Thank you to CubeData and Posts4VPS for the services of VPS 8.
#3
(12-10-2020, 02:08 PM)tbelldesignco Wrote: @deanhills actually I had to do something similar last year for a website belonging to a board I was assisting with. I used this tool I found on GitHub, and it worked really well; we were able to claim back some of the assets used on the old site before it was removed.
https://github.com/sangaline/wayback-machine-scraper

Thx @tbelldesignco. The one I tried must have been similar, but maybe I should give yours a go too. This is the one I used:
https://github.com/hartator/wayback-machine-downloader

It is good, but at the same time it can't really download a WordPress site, as it can't get the database. It managed to get themes, but not the theme that was actually in use. There are too many holes to really restore the blog as it used to be. He would have to start a completely new blog and then use the materials available to recreate the posts and pages from scratch.
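One trick that might help pin down which theme was actually in use: the front page normally loads its stylesheet from /wp-content/themes/<theme-name>/, so the active theme's folder name can usually be read straight out of an archived copy of the index page. A rough sketch, with a placeholder snapshot URL:

Code:
# Read an archived copy of the front page and print which WordPress theme
# folders its assets are loaded from. The snapshot URL is a placeholder.
import re
import urllib.request

SNAPSHOT = "https://web.archive.org/web/20201031000000id_/http://example-blog.com/"

with urllib.request.urlopen(SNAPSHOT) as resp:
    html = resp.read().decode("utf-8", errors="replace")

themes = set(re.findall(r"/wp-content/themes/([\w-]+)/", html))
print("Theme folders referenced by the front page:",
      ", ".join(sorted(themes)) or "none found")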

To be honest, I think that by going at the source of the WayBack Machine pages myself I got more material for the guy's blog posts, project posts and technical posts. I ended up with one long page containing many blog posts, and the same again for the project and technical posts, so one could probably separate the individual posts out of there and copy and paste them when recreating them. For example, I opened the index page, viewed its source, and managed to get all of the stylesheets, JavaScript, Bootstrap and more, which I then saved with Notepad++. I was able to get the index page up - the GitHub tool managed that too, but it couldn't get the blog, project and technical posts. I was also able to go deeper and pull the blog posts as one very long page, and likewise the project and technical posts. In the end my few pages were more meaningful than the 680 files.
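Separating the posts out of that long page doesn't have to be done entirely by hand either; most WordPress themes wrap each post in an <article> element, so a saved copy can be cut into pieces automatically. A minimal sketch, assuming the long page was saved locally as blog.html (just an example file name):

Code:
# Split a saved WordPress listing page into individual posts by cutting on
# <article> elements, which most WP themes use to wrap each post.
# "blog.html" is a placeholder for the page saved from the Wayback Machine.
import re

with open("blog.html", encoding="utf-8") as f:
    html = f.read()

posts = re.findall(r"<article\b.*?</article>", html, flags=re.S | re.I)
for i, post in enumerate(posts, start=1):
    with open(f"post-{i:03d}.html", "w", encoding="utf-8") as out:
        out.write(post)
print(f"Wrote {len(posts)} post fragments")

If the theme doesn't use <article>, the same idea works with whatever wrapper the source shows around each post.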

BUT, maybe I'll try your suggestion as well.  
https://github.com/sangaline/wayback-machine-scraper

I had to pull an all-nighter last night. It started at a reasonable hour, when I decided I was going to use my paid HostUS VPS for the wayback-machine-downloader. One has to install Ruby first and then the downloader, but when I tried to update with yum it threw serious errors about the mirrors. I hadn't realised my VPS was still on CentOS 6. It took me hours of Googling, as there were so many reports about it, but every solution looked iffy and I didn't want to lose the content; the more answers I found, the more complicated it got, until I decided I'd save time by simply rebuilding the VPS with CentOS 7 and redoing the content. CentOS 8 still has some hiccups with some of the scripts I use, whereas CentOS 7 is just right.

Anyway, after I'd put it back together with a healthy yum, I decided to use the host's random SSH port number - it gets generated online and then sent to you by e-mail. I promptly locked myself out of my VPS because SSH on the new port was blocked. It took more searching to figure out that I had to add an accept rule in fail2ban for the new port; after a reboot it was fine again. By that time it was morning over here.

I then managed to install Ruby and the wayback-machine-downloader without any trouble. Everything went well. I downloaded the site to my VPS, and then with FileZilla to my computer. The downloader scraped all of the files on the domain, over and above the ones for the blog, but I focused only on the blog. I had to give it up after a while though. I'm really sad about the database thing. If I were specialist enough, maybe I could have built a database and a WordPress theme, but this project taught me how limited I am.
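In hindsight, it is also possible to ask the archive up front what it holds under just the blog path, rather than pulling the whole domain and sorting through 680 files afterwards. A short sketch against the same CDX index mentioned above; the domain and path are placeholders:

Code:
# List only the URLs the Wayback Machine captured under the blog path,
# using a prefix match. The domain/path below is a placeholder.
import json
import urllib.request

CDX = ("https://web.archive.org/cdx/search/cdx"
       "?url=example-blog.com/blog/&matchType=prefix"
       "&output=json&fl=original&collapse=urlkey")

with urllib.request.urlopen(CDX) as resp:
    body = resp.read().decode("utf-8")

rows = json.loads(body) if body.strip() else []
for (original,) in rows[1:]:
    print(original)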

OK, I'll now try your suggestion and will let you know how it goes.

Anyway, I've just looked at your suggestion, and it looks like one can search in more detail. If I went into the Wayback Machine I could get the IDs of the blog, project and other pages. I'll see if I have enough energy to do that. But wow, it's very time consuming. Educational too, but I'm wondering whether I'm wasting time now, particularly since I'm not well versed enough in the contents of the WordPress site to know what to look for in the theme. I searched all of the pages and not one of them came up with the theme except a micro blog. I did get a stylesheet for a BlueDiamond theme, but that's a serious premium theme and it doesn't look at all like his index page. Maybe it wasn't the one in use; the downloader only got parts of the theme, and when I loaded it into a new WordPress installation it came up as broken. So I am hopeful the guy maybe has copies of the theme on his computer. To be honest I haven't heard back from him either. Hope he is OK.
Thank you to Post4VPS and VirMach for my awesome VPS 9!  
#4
A couple of alternatives:

http://waybackdownloader.com/ - this is a service that will download and archive a site from the Wayback Machine. It's not free - pricing starts at $15 USD for one site - but it may be worth it for the person in question to save the hassle. $15 is a small price to pay to salvage years of work (imo). Disclaimer: I've not used this service, so I can't offer any comments as to its effectiveness.

https://archivarix.com/en/ - another paid service, but this one gives you the first 200 files free and then charges $0.005 USD for each additional file. Same disclaimer as above.
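For scale, if I'm reading that pricing right, a pull on the order of the 680 files mentioned above would work out to roughly (680 - 200) × $0.005 = $2.40 in per-file charges.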

I found this info on a wiki (http://www.archiveteam.org/index.php?title=Restoring); there are a few other things there possibly worth looking at if one wants to research further (the hartator downloader is mentioned there).
#5
(12-11-2020, 08:26 PM)deanhills Wrote: Thx @tbelldesignco. The one I tried must have been similar, but maybe I should give yours a go too. [...] It is good, but at the same time it can't really download a WordPress site, as it can't get the database. It managed to get themes, but not the theme that was actually in use. [...] I'm really sad about the database thing. If I were specialist enough, maybe I could have built a database and a WordPress theme, but this project taught me how limited I am. [...] So I am hopeful the guy maybe has copies of the theme on his computer. To be honest I haven't heard back from him either. Hope he is OK.

Yeah, I don't think there is a way to pull down the database of a WP site once the main site has been taken off the internet. The tool I suggested should give you access to most if not all of the website's assets, but you will have to copy and paste from the Wayback Machine into the new installation. We used it to get the CSS and assets of the old site so I could start building a WordPress theme to get us through to the next phase of development.
Thank you to CubeData and Posts4VPS for the services of VPS 8.
#6
(12-11-2020, 09:23 PM)fitkoh Wrote: https://archivarix.com/en/ - another paid service, but this one gives you the first 200 files free and then charges $0.005 USD for each additional file. Same disclaimer as above.

I like that service very much and have recommended that the guy go for it. When I checked in at Gigarocket earlier on, there was a PM from him. He gave me his e-mail address, as his previous address was on the domain that stopped working when the hosting did. Interestingly, when I sent him an e-mail from Yahoo it got blocked. I guess there must be a ban from the States, as the block message said it was "due to security concerns".

Anyway, I was able to zip up the materials I've already collected and upload them to my new cPanel hosting account, so hopefully that won't be blocked either. The hosting is in Sri Lanka. Hopefully they're friendlier towards Iran.   Tongue

I then recommended that the guy use Archivarix, as I studied it tonight as well. For $10 you get all the materials nicely cleaned up, PLUS software to search through them and edit them. When I tried to work through all the files from the GitHub wayback downloader it was tedious: most of the files sit in their own folders, one folder per file. The Archivarix service cleans all of that up and also makes the files searchable. For $10 I think it is well worth it.
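For anyone who still wants to work with the raw downloader output, the searching part can at least be approximated with a few lines that walk the dump and report which files mention a term. The folder name and search term below are placeholders; adjust the path to wherever the download landed:

Code:
# Walk the directory tree produced by a Wayback download and report which
# HTML files contain a search term. Path and term are placeholders.
import os

ROOT = "websites/example-blog.com"  # adjust to wherever the download landed
TERM = "project"

for dirpath, _dirs, files in os.walk(ROOT):
    for name in files:
        if not name.endswith((".html", ".htm")):
            continue
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="replace") as f:
            if TERM.lower() in f.read().lower():
                print(path)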

@tbelldesignco I had a look at the GitHub scraper's download method, and I think it's going to be very much the same as the other one I already used. I'm a bit exhausted and want to move on to the next project, so I've come to the end of the road with my WayBack Machine project. It was a very interesting learning curve. I may one day try to download a site from the WayBack Machine just for the fun of it, to try out Archivarix. Thinking about it, this may be a lucrative way to create source material for new websites.
Thank you to Post4VPS and VirMach for my awesome VPS 9!  
#7
A long time ago I went through the same problem. I got careless and lost a blog. In the end I just copied the content and put it into a new WP blog. I realised it was easier, and that I could do it in the time I would waste trying to find a perfect and easy solution. It can be a real pain doing it manually, but it did work.

01. Set the permalinks the same as in the old blog
02. Created posts and pages the same as in the old blog
03. Copied the content into those posts and pages

This can take long if it's a large blog, but if you are desperate it is the best solution. It's even harder if the person has some sort of trouble using a keyboard properly, though the copying step can at least be scripted (see the sketch below).
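A minimal sketch of scripting steps 02 and 03 against the new blog through the WordPress REST API, assuming an application password has been created for the account. The site URL, credentials and file name are all placeholders, and the 'requests' package is assumed to be installed:

Code:
# Create a post on a fresh WordPress install through its REST API, so the
# rescued content can be re-posted without much typing. The site URL, user,
# application password and file name below are placeholders.
import requests

SITE = "https://example-blog.com"
AUTH = ("admin", "xxxx xxxx xxxx xxxx xxxx xxxx")  # WP application password

with open("rescued-post.html", encoding="utf-8") as f:  # content copied out of the archive
    content = f.read()

post = {
    "title": "Rescued post title",
    "slug": "original-permalink-slug",  # keep the old permalink structure
    "content": content,
    "status": "draft",                  # review before publishing
}

resp = requests.post(SITE + "/wp-json/wp/v2/posts", auth=AUTH, json=post)
resp.raise_for_status()
print("Created post id", resp.json()["id"])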


~ Be yourself everybody else is taken ~




#8
(12-12-2020, 01:22 AM)xdude Wrote: In the end I just copied the content and put it into a new WP blog. [...]
01. Set the permalinks the same as in the old blog
02. Created posts and pages the same as in the old blog
03. Copied the content into those posts and pages

I think this is pretty much what the guy will have to do, as there is no database. But at least I was able to get the blog posts, project posts and technical posts manually, by copying and pasting the source code from the WayBack Machine. He'll have to recreate the WP blog and then redo the posts.

What happened to you, @xdude, also happened to me once at Gigarocket, around 2014 I think it was. cPanel melted down and the backups got corrupted at the same time. It was completely out of the blue. I had backups, but I had done lots of work on the WordPress site that I had not backed up. That was very painful at the time: first figuring out what was lost, and then working on fixing it. It was a great lesson; ever since then I make a backup every time I've updated or added new material to a blog.
Thank you to Post4VPS and VirMach for my awesome VPS 9!  
#9
(12-11-2020, 11:56 PM)deanhills Wrote: [...] I may one day try to download a site from the WayBack Machine just for the fun of it, to try out Archivarix. Thinking about it, this may be a lucrative way to create source material for new websites.

That's what I do: typically I'll see a site where I'm intrigued by how it functions or looks, and I'll scrape it so I can study the JS and CSS, learn from it, and add that functionality to my rolodex.
Thank you to CubeData and Posts4VPS for the services of VPS 8.

