Flatten OVSP dataset media links #46
Comments
@Edilmo @marianelamin please feel free to modify the action items as you see fit. I have created the first pass based on what we talked about at today's meeting.
Sounds good! I have identified 108 websites to visit. This Colab Notebook shows how the number was computed.
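For readers without access to the notebook, here is a minimal sketch of one way such a count could be produced: collect the distinct host names across all media links. This is not taken from the Colab Notebook; the function name `distinct_domains` and the sample links are hypothetical, and treating `www.example.com` and `example.com` as the same site is an assumption.

```python
from urllib.parse import urlparse


def distinct_domains(links):
    """Return the set of distinct host names found in an iterable of URLs."""
    domains = set()
    for link in links:
        host = urlparse(link.strip()).netloc.lower()
        # Assumption: "www.example.com" and "example.com" are the same site.
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip values that do not parse as absolute URLs
            domains.add(host)
    return domains


# Hypothetical sample data standing in for the dataset's link column.
sample_links = [
    "https://www.example.com/news/1",
    "https://example.com/news/2",
    "https://twitter.com/some_user/status/1",
    "not a url",
]
print(len(distinct_domains(sample_links)))
```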
I have a few questions 😃 Is this something we want to have as part of the library (in src/c4v/data), or rather as a separate one-off script?
An idea of how we can approach this:
A) write a more general
B) Separately, we can write the reducer function, which extracts the link from the row and passes it to a specific scraper (using a map of
C) Besides that, we develop the scrapers that apply to the domains we want to scrape. If we define an interface for them, it'll be easy to integrate them with B)
D) Then, putting it all together, we can batch process the Excel file as a CSV (or other format), passing each individual row to A) and using the reducer from B), writing the output to a new column
Regarding A), the question I have is whether there are libraries that already do this (found this for example).
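To make steps B) and C) concrete, here is a rough sketch of a reducer that looks up a scraper by domain. Everything named here is hypothetical: the `Scraper` callable interface, the `SCRAPERS` table keyed by domain (the original comment does not say what the map is keyed on), the `url` column name, and the placeholder scraper, which returns a stub string instead of fetching a page.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse

# Hypothetical scraper interface: takes a URL, returns the raw text.
Scraper = Callable[[str], str]


def scrape_example_news(url: str) -> str:
    # Placeholder only: a real scraper would download and parse the page.
    return f"<text scraped from {url}>"


# Assumed lookup table for step B): domain -> scraper for that domain.
SCRAPERS: Dict[str, Scraper] = {"example.com": scrape_example_news}


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def reduce_row(row: Dict[str, str], link_column: str = "url") -> Optional[str]:
    """Extract the link from a row and hand it to the matching scraper.

    Returns None when the row has no link or no scraper is registered
    for the link's domain.
    """
    link = row.get(link_column, "")
    scraper = SCRAPERS.get(domain_of(link))
    return scraper(link) if scraper else None


print(reduce_row({"url": "https://www.example.com/a"}))
print(reduce_row({"url": "https://other.org/x"}))
```

With this shape, supporting a new site only means writing one function and adding one entry to the table; the reducer itself does not change.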
Hey guys, let's discuss this. The task is too broad. The idea was to get the current dataset flattened in a simple way and use that to make some decisions. We are not going to create a crawler; that is too complicated a task to start with. We are going to start with scraping, and we are going to align that with the overall public service project, which will get funded with a student dedicated 100%.
Problem Description
Currently, we have a dataset with media links (Twitter or news article). We need to flatten the dataset by adding a new column that contains the raw text from their respective media link.
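A minimal sketch of the flattening step, assuming the dataset has been exported to CSV and has a link column called `url` (a hypothetical name). The new column is called `media_text`, the name used in this issue. The `fetch_text` argument is a stand-in for whatever scraper is eventually written; the demo passes a stub so the example runs without network access.

```python
import csv
import io


def add_media_text(rows, fetch_text, link_column="url"):
    """Yield a copy of each row with an extra 'media_text' field."""
    for row in rows:
        out = dict(row)
        try:
            out["media_text"] = fetch_text(row[link_column])
        except Exception:
            # Assumption: a failed fetch leaves the cell empty rather
            # than aborting the whole batch.
            out["media_text"] = ""
        yield out


def flatten_csv(src, dst, fetch_text, link_column="url"):
    """Copy a CSV from src to dst, appending a 'media_text' column."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + ["media_text"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    for row in add_media_text(reader, fetch_text, link_column):
        writer.writerow(row)


# Demo with in-memory files and a stub in place of a real scraper.
source = io.StringIO("id,url\n1,https://example.com/a\n")
target = io.StringIO()
flatten_csv(source, target, lambda url: "stub text for " + url)
print(target.getvalue())
```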
Proposed Solution
Develop a web crawler that parses the text of the media links and adds it to a new column named media_text.
General web scraper:
Deliverable
py script with web crawler.
Action items (checkboxes)
DataTransformer class. @VKorelsky
media_text column.
elpitazo
Add Positive labels [PSCDD] - elpitazo #48
Further Documentation