12-10-2020, 11:20 AM
Is there anyone here with experience scraping material from the Wayback Machine? I Googled it and found plenty of options, but I'm not comfortable with how legit some of them are.
So this starts as a sad story. When Gigarocket closed, one of the members from Iran failed to respond in time, and as a consequence lost his blog. Initially I didn't take it to heart, as in this day and age I find it very difficult to understand why someone wouldn't make regular backups of their website, particularly a valued one. But then I learned this guy has rheumatoid arthritis and can barely use his fingers. The blog was really important to him. I had hoped that the owner of Gigarocket had made backups of all the hosting accounts, but learned that this had not happened, so the blog was no more.
I first checked with DuckDuckGo to see whether I could find bits and pieces, which I could. Reading them, I began to realise that this guy must be a lecturer or professor in engineering with lots of IT capabilities. The blog posts were written in excellent English, so he must have put a lot of energy into them. It must be a great loss to him. I then thought to check the Wayback Machine to see whether I could find any material there. I was doubtful initially, as I wouldn't have thought the Wayback Machine would take copies of a personal blog. But then I discovered that the blog had been "photographed" by the Wayback Machine 57 times, the last time on 31 October. It was that good a blog, with real substance.
The blog is a WordPress blog, but I get the feeling some of it was designed by the owner himself, with the design based on a WordPress Micro Blog template. Obviously I can't get the database. I was able to copy and paste the source of the pages and the CSS stylesheets into Notepad++, but there is still a lot of material missing, such as the images.
So if anyone has suggestions for how to scrape the WordPress site more deeply than I've been able to, it would be much appreciated.