Getting data from table #125
Comments
I am glad that you find the package useful! I haven't been able to find an API to pull information from tables directly from the wiki API, but you could use BeautifulSoup to parse the HTML directly. Something like:

    from bs4 import BeautifulSoup
    from mediawiki import MediaWiki

    wikipedia = MediaWiki()
    p = wikipedia.page('Andor_(TV_series)')
    soup = BeautifulSoup(p.html, "html.parser")
    episodes = soup.find("table", {"class": "wikiepisodetable"})
    # Do something to parse the table as per the documentation on bs4

I hope this is helpful!
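For readers wondering what "parse the table" might look like in practice, here is a minimal sketch. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra dependencies. The names `TableExtractor` and `table_rows` are hypothetical helpers, not part of pymediawiki or bs4, and the assumption that episode tables carry the `wikiepisodetable` class will not hold on every page.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect cell text, row by row, from tables that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside a matching table (or a table nested in it)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth or self.css_class in classes:
                self.depth += 1
        elif self.depth and tag == "tr":
            self._row = []
        elif self.depth and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1
        elif tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_rows(html, css_class="wikiepisodetable"):
    """Return the rows of every table whose class list contains css_class."""
    parser = TableExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.rows
```

With a page object from the snippet above, `table_rows(p.html)` would return a list of rows. On many pages the episode summary appears to sit in its own full-width row beneath the episode's metadata row, so single-cell rows are a reasonable first place to look, but that layout is an assumption rather than a guarantee.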
Thank you for your reply, I have used beautifulsoup:
This works for the Andor page. However, I have realized that not all pages are laid out the same way, and I was wondering if there is a way to extract the same information in a page-agnostic way: something that, given a series X, returns the text of its episode descriptions.
Sadly, not that I know of, as I haven't been able to find a MediaWiki API that can help with that. I will have to look at the
Is there maybe a way to interact with the database directly? For example, is the wiki's information stored in some sort of SQL or GraphQL database?
Not through this Python package, as it is just a wrapper for the API and doesn't have access to the back-end system, only to what the API provides.
I understand, thank you for your help.
The
Which means that could also be used to parse the text; I still haven't seen a way to pull tables directly from the API.
I will try p.wikitext! However, I still have to find a solution for when the episodes (and the table) are on another page. The problem with the wiki pages is that the format is not uniform.
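As a rough illustration of parsing the raw markup instead of the rendered HTML, the sketch below pulls titles and summaries out of wikitext. It assumes the page writes its episodes with the `{{Episode list}}` template and its `Title` and `ShortSummary` parameters, which is common on English Wikipedia but far from universal. The function names are hypothetical helpers, not part of pymediawiki.

```python
import re


def iter_templates(wikitext, name):
    """Yield the body of each {{name ...}} template, tracking brace depth."""
    pattern = re.compile(r"\{\{\s*" + re.escape(name) + r"\s*(?=[|}\n])", re.IGNORECASE)
    pos = 0
    while True:
        match = pattern.search(wikitext, pos)
        if not match:
            return
        depth, i = 1, match.end()
        while i < len(wikitext) and depth:
            if wikitext.startswith("{{", i):
                depth, i = depth + 1, i + 2
            elif wikitext.startswith("}}", i):
                depth, i = depth - 1, i + 2
            else:
                i += 1
        if depth:  # unbalanced braces: stop rather than guess
            return
        yield wikitext[match.end():i - 2]
        pos = i


def split_params(body):
    """Split a template body on top-level '|' and return its named parameters."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        two = body[i:i + 2]
        if two in ("{{", "[["):
            depth, i = depth + 1, i + 2
            current.append(two)
        elif two in ("}}", "]]"):
            depth, i = depth - 1, i + 2
            current.append(two)
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current, i = [], i + 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    params = {}
    for part in parts:
        key, sep, value = part.partition("=")
        if sep:  # positional (unnamed) parameters are ignored
            params[key.strip()] = value.strip()
    return params


def episode_summaries(wikitext):
    """Return (title, summary) pairs for each Episode list entry with a summary."""
    episodes = []
    for body in iter_templates(wikitext, "Episode list"):
        params = split_params(body)
        if "ShortSummary" in params:
            episodes.append((params.get("Title", ""), params["ShortSummary"]))
    return episodes
```

The returned summaries still contain wiki markup such as `[[links]]` and nested templates, so a further cleaning pass would be needed before using them as plain text. Pages that transclude their episode list from a separate season article will yield nothing here, which is the non-uniformity problem described above.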
Yes, the one drawback is that it isn't always standardized. Good luck!
Yes, it is a pity, since there is so much interesting information in the wiki for model training or building apps. Thank you very much for your help!
Very nice package.
I am trying to write a script that, for a TV series, extracts the content of each season's episodes:
The content does not include the episode text (on the page it sits inside a table), and I have also tried
p.table_of_contents['Episodes']['Season 1 (2022)']
which returns an empty structure
Thank you very much for your help