WP Extractor | Simple WordPress Posts & Pages Extractor in Python
#1
WPExtractor - WordPress Blog Post Extractor in JSON Format

WPExtractor is a Python-based tool made for building datasets for artificial intelligence projects. It collects data from blogs, which can then be used to train bots in many useful ways.
Features
  • Automatically extract all posts from a WordPress website within seconds.

  • Saves the data in the JSON file in the directory for you.

  • Easily understandable JSON format to make your life easier.

  • Responsive developers. Just open an issue and we'll fix it for you.

Usage
python main.py -u https://csrockers.in

By default, it will fetch posts from the website. To fetch pages, use the following.

python main.py -u https://fulltimehosting.net --pages
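As the replies in this thread point out, the script reads the site's /wp-json/ REST API. The following is only a rough sketch of how such an extraction could work, not the tool's actual code. It assumes the standard WordPress endpoints /wp-json/wp/v2/posts and /wp-json/wp/v2/pages, the per_page and page query parameters, and the X-WP-TotalPages response header. The names http_fetch, extract_all and save_json are made up for illustration, and urllib is used instead of requests only to keep the sketch standard-library only.

```python
import json
import urllib.parse
import urllib.request


def http_fetch(url, params):
    """Fetch one page of results; return (list of items, total page count)."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=30) as response:
        # WordPress reports how many pages exist in this response header.
        total_pages = int(response.headers.get("X-WP-TotalPages", 1))
        return json.load(response), total_pages


def extract_all(site, kind="posts", fetch=http_fetch):
    """Collect every item of the given kind ("posts" or "pages")."""
    endpoint = f"{site.rstrip('/')}/wp-json/wp/v2/{kind}"
    items = []
    page = 1
    while True:
        batch, total_pages = fetch(endpoint, {"per_page": 100, "page": page})
        items.extend(batch)
        if page >= total_pages:
            break
        page += 1
    return items


def save_json(items, path):
    """Write the collected items to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(items, handle, ensure_ascii=False, indent=2)
```

With this sketch, something like save_json(extract_all("https://csrockers.in"), "posts.json") would roughly mirror the default run, and kind="pages" would mirror the --pages flag. The fetch parameter is there so the paging logic can be tried without touching the network.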

Credits
Manal Shaikh & Somil Gumber.
Premium Web Hosting | ShadowCrypt | Manal Shaikh Official Website
If you find my post/thread useful, you're supposed to +rep me. 
#2
Ooh, I am curious about this. I am working on creating an app for my business and could possibly use something like this to feed content to my app as JSON structs. Thank you! I will definitely be looking into this tool further!
Thank you to CubeData and Posts4VPS for the services of VPS 8.
#3
Just looking at the script. I never knew that WordPress has that /wp-json/ directory. I tested it with my WordPress website and it shows all of my pages and posts. Nice work, Manal! This could help us show the posts on a non-WP website.

Off topic, but how did you find the URL to that JSON?
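For showing posts on a non-WP website, the raw JSON usually needs trimming first. Here is a small sketch of that step, assuming the usual REST API post shape with id, link, date, title.rendered and content.rendered fields. The function name simplify and the output keys are my own choices, and the regex tag stripping is deliberately crude.

```python
import html
import re


def simplify(post):
    """Reduce one REST API post object to a few display-friendly fields."""
    raw_html = post.get("content", {}).get("rendered", "")
    # Crude tag removal; a real HTML parser would be more robust.
    text = html.unescape(re.sub(r"<[^>]+>", " ", raw_html))
    return {
        "id": post.get("id"),
        "title": html.unescape(post.get("title", {}).get("rendered", "")),
        "link": post.get("link"),
        "date": post.get("date"),
        "text": " ".join(text.split()),  # collapse runs of whitespace
    }
```

A list of such records is small enough to embed in another site or app without carrying the full WordPress markup along.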
Thanks to Limitless Hosting and Post4VPS for providing me excellent VPS 13!
#4
I just want to confirm how amazing this tool is. I currently have my app fetching content from my site and populating dynamic content into the app. I will get screenshots up when I'm back on my MacBook.
#5
Nice project. It is sad when open source software becomes abandonware. You did a very nice job picking it up and reviving it.
I did a very quick read of the code and you seem to be using the wp-json endpoints. It is a good choice, but why not use the sitemap?

Moreover, you could replace the ifs inside the error handling with a switch. It does the same thing but is more "appropriate" for that situation.
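One note on the switch idea: Python had no switch statement before the match statement arrived in 3.10, so a dictionary lookup is a common stand-in for a chain of ifs. The sketch below is purely illustrative. The status codes, messages and the describe_error name are hypothetical and not taken from the tool's code.

```python
# Hypothetical mapping of HTTP status codes to messages; the tool's real
# error handling may check different conditions.
ERROR_MESSAGES = {
    401: "Authentication required.",
    403: "Access to the REST API is forbidden.",
    404: "No REST API found at this URL.",
    500: "The server reported an internal error.",
}


def describe_error(status_code):
    """Look up a message for a status code, with a generic fallback."""
    return ERROR_MESSAGES.get(
        status_code, f"Unexpected HTTP status {status_code}."
    )
```

Adding a new case then means adding one dictionary entry rather than another if branch.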
Thanks to Post4VPS and Bladenode for VPS 14
#6
This looks like an interesting tool. I'm not familiar with Python, but I can see its value. Unfortunately I wasn't able to get it working (remember, Python newbie).

First, I installed python.

~$ sudo apt-get install python


No problems there. Then I tried to run main.py as recommended in your readme.

~$ python main.py -u https://url.url
Traceback (most recent call last):
  File "main.py", line 5, in <module>
    import requests
ImportError: No module named requests

Okay, so something is missing. I did some googling and found that I first need to install pip.
~$ sudo apt-get install python3-pip
Reading package lists... Done
Building dependency tree       
Reading state information... Done
python3-pip is already the newest version (20.0.2-5ubuntu1.1).

That one looks good. So let's try to install requests...

~$ sudo pip3 install requests
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.22.0)

So I have requests installed but for some reason my python installation isn't recognizing it. I'm thinking maybe the problem is due to multiple versions of python installed but I'm not sure which one to keep or if there's a safe way to remove one without breaking the other. Perhaps it'd be simpler to reinstall and start from scratch? (I'm working on a dev server so no worries about losing anything).
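The wording "ImportError: No module named requests" is what Python 2 prints, while pip3 installed requests under /usr/lib/python3, so the two commands are most likely talking to different interpreters. A small check script can confirm this without reinstalling anything. This is a sketch, with the file name check_env.py and the function names being my own, and it is written to run under both Python 2 and Python 3.

```python
import sys


def module_location(name):
    """Return where this interpreter finds a module, or None if it cannot."""
    try:
        module = __import__(name)
    except ImportError:
        return None
    return getattr(module, "__file__", "(built-in)")


def report(module="requests"):
    """Describe the running interpreter and whether it can import a module."""
    location = module_location(module)
    lines = [
        "interpreter: {}".format(sys.executable),
        "version:     {}".format(sys.version.split()[0]),
        "{}: {}".format(module, location or "NOT FOUND for this interpreter"),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Saving this as check_env.py and running it once with python and once with python3 should show which interpreter can see requests. Both versions can stay installed side by side.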
#7
Try using
python3 main.py -u https://url.url

The tool is based on Python 3, and I recommend installing requests using pip3, as you did.
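A script can also catch this mistake itself with a version guard near the top. The lines below are only a sketch of that idea, not something main.py is known to contain. The function name require_python3 and the message text are hypothetical, and the code avoids Python 3 only syntax so that Python 2 can still parse it and print the hint.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Return an error message when run under Python 2, else None."""
    if version_info[0] < 3:
        return "This tool needs Python 3; run it as: python3 main.py -u <url>"
    return None


# Exit early with a readable hint instead of a confusing ImportError.
message = require_python3()
if message:
    sys.exit(message)
```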
(11-26-2020, 05:18 PM)fitkoh Wrote: This looks like an interesting tool. I'm not familiar with Python, but I can see its value. Unfortunately I wasn't able to get it working (remember, Python newbie).

(11-26-2020, 12:47 PM)LightDestory Wrote: Nice project. It is sad when open source software becomes abandonware. You did a very nice job picking it up and reviving it.
I did a very quick read of the code and you seem to be using the wp-json endpoints. It is a good choice, but why not use the sitemap?

Moreover, you could replace the ifs inside the error handling with a switch. It does the same thing but is more "appropriate" for that situation.

Noted for next update.
The reason I didn't use the sitemap is that a sitemap can often miss posts that are outside the website. This tool aims at everything in the main content that is not a draft or password-protected.
And I just started learning Python. This is just a pet project :3
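To illustrate the "not a draft or password-protected" rule, here is a hedged sketch of a filter over REST API post objects. It assumes the standard status field and the content.protected flag. As far as I know, unauthenticated requests normally return only published items anyway, so this is more of a safety net than a description of what the tool does. The name is_public is made up.

```python
def is_public(post):
    """True for published posts whose content is not password-protected."""
    published = post.get("status", "publish") == "publish"
    protected = post.get("content", {}).get("protected", False)
    return published and not protected


def keep_public(posts):
    """Filter a list of post objects down to the public ones."""
    return [post for post in posts if is_public(post)]
```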
#8
Great project @Manal! It works perfectly. Moreover, it's open source; I took a look at the source code too.

I'm a Python developer myself and glad to see you are starting with Python. You won't regret it.
Sayan Bhattacharyya,

Heartiest thanks to Post4VPS and Virmach for my wonderful VPS 9!

