
Flatten OVSP dataset media links #46

Open · 3 of 13 tasks
dieko95 opened this issue Mar 6, 2021 · 4 comments

dieko95 commented Mar 6, 2021

Problem Description

Currently, we have a dataset with media links (Twitter posts or news articles). We need to flatten the dataset by adding a new column that contains the raw text from each row's media link.

Proposed Solution

Develop a web crawler that parses the text behind each media link and adds it to a new column named media_text.

General web scraper:

  • Web scraper by clusters:
    • Group publishers by news frequency.
    • Common publishers (high-frequency):
      • Create dedicated methods to extract news articles (see the sketch below).
    • Low-frequency publishers:
      • Half-machine, half-human solution. Explore Selenium for this purpose.
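
For a high-frequency publisher, a dedicated extraction method could be a small requests + BeautifulSoup routine keyed to that publisher's page layout. A minimal sketch (the CSS selector is a placeholder; each dedicated method would use the selector that matches its publisher's HTML):

```python
import requests
from bs4 import BeautifulSoup

def extract_article_text(url: str) -> str:
    """Fetch a news article and return its body text.

    Sketch only: the selector below is a placeholder; each
    high-frequency publisher would get its own dedicated selector.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = soup.select("div.article-body p")  # placeholder selector
    return "\n".join(p.get_text(strip=True) for p in paragraphs)
```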

Deliverable

  • py script with web crawler.

Action items (checkboxes)

  • Identify how many different media links the dataset has (e.g., Twitter, newspapers, etc.).
  • Group publishers by news frequency.
  • Create a scaffold for the more general DataTransformer class. @VKorelsky
    • Develop a web crawler for the different publisher links.
    • Group the different web crawlers into one class.
    • Create the updated dataset with the media_text column.
    • ...
  • Create dedicated reducers for the most frequent webpages (see details in the comments).
  • Research how to best extract text from less frequent links.
    • Use a combination of Selenium, Python, and human text selection to get the text, extracting precise information from unstructured pages in a semi-automated fashion (see the Selenium sketch after this list).
    • ?
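
As a sketch of how the half-machine, half-human flow could work (the workflow is an assumption, not a settled design): Selenium opens the page, a human highlights the article text in the browser, and the script reads the selection back via JavaScript.

```python
from selenium import webdriver

def human_assisted_extract(url: str) -> str:
    """Open a page and return whatever text a human selects in it."""
    driver = webdriver.Chrome()  # assumes a local ChromeDriver setup
    try:
        driver.get(url)
        input("Highlight the article text in the browser, then press Enter... ")
        # Read the human's selection back out of the page.
        return driver.execute_script("return window.getSelection().toString();")
    finally:
        driver.quit()
```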

Further Documentation

dieko95 added the good first issue label Mar 6, 2021

dieko95 commented Mar 6, 2021

@Edilmo @marianelamin please feel free to modify the action items as you see fit. I have created the first pass based on what we talked about at today's meeting.

marianelamin (Collaborator) commented

Great! I have identified 108 websites to visit. This Colab Notebook shows how the number was computed.
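
A sketch of one way to compute such a count (the file and column names here are assumptions, not the notebook's actual code) is to reduce each link to its domain:

```python
from urllib.parse import urlparse

import pandas as pd

df = pd.read_csv("ovsp_dataset.csv")  # placeholder file name

# Reduce each media link to its domain and count the distinct sites.
domains = df["media_link"].dropna().map(lambda url: urlparse(url).netloc)
print(domains.nunique())              # distinct websites to visit
print(domains.value_counts().head())  # most frequent publishers first
```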


VKorelsky commented Mar 8, 2021

I have a few questions 😃

Is this something we want to have as part of the library (in src/c4v/data), or rather as a separate one-off script?
Or a combination of the two? If we have it as a one-off, does the code still go in this repo?

An idea of how we can take this:

A) Write a more general DataTransformer class (in the c4v lib) offering a method that takes in an object representing a row and a reducer function, and returns the value computed by applying the reducer to the row.
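
A minimal sketch of what A) could look like (class and method names are assumptions, not the eventual c4v API):

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
Reducer = Callable[[Row], str]

class DataTransformer:
    """Sketch of idea A): apply a caller-supplied reducer to a row."""

    def transform_row(self, row: Row, reducer: Reducer) -> str:
        # The transformer stays generic; all link/scraper logic
        # lives in the reducer passed in by the caller.
        return reducer(row)
```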

B) Separately, we can write the reducer function, which extracts the link from the row and passes it to a specific scraper (using a map of domain_name to scraper instance, for example), returning the text the scraper extracts.

C) Besides that, we develop the scrapers that apply to the domains we want to scrape. If we define an interface for them, it'll be easy to integrate them with B).
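
Sketching B) and C) together (the Scraper interface, the empty registry, and the media_link key are all assumptions):

```python
from abc import ABC, abstractmethod
from typing import Dict
from urllib.parse import urlparse

class Scraper(ABC):
    """Interface from C): one implementation per publisher domain."""

    @abstractmethod
    def scrape(self, url: str) -> str:
        ...

# Hypothetical registry mapping domain names to scraper instances,
# e.g. {"example.com": ExampleScraper()}.
SCRAPERS: Dict[str, Scraper] = {}

def media_text_reducer(row: dict) -> str:
    """Reducer from B): route the row's link to the matching scraper."""
    url = row["media_link"]  # column name is an assumption
    domain = urlparse(url).netloc
    scraper = SCRAPERS.get(domain)
    return scraper.scrape(url) if scraper else ""
```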

D) Then, putting it all together, we can batch-process the Excel file as a CSV (or other format), passing each individual row to A) with the reducer from B), and writing the output to a new column.
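
And a sketch of D) with pandas, reusing the DataTransformer and reducer sketched above (file names are placeholders):

```python
import pandas as pd

df = pd.read_csv("ovsp_dataset.csv")  # placeholder file name

transformer = DataTransformer()
df["media_text"] = df.apply(
    lambda row: transformer.transform_row(row.to_dict(), media_text_reducer),
    axis=1,
)
df.to_csv("ovsp_dataset_flattened.csv", index=False)
```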

Regarding A) - the question I have is whether there are libraries that already do this? (I found this, for example.)

dieko95 linked a pull request Mar 13, 2021 that will close this issue

Edilmo commented Mar 14, 2021

Hey guys, let's discuss this. The task is too broad.

The idea was to get the current dataset flattened in a simple way, and to use that to make some decisions.

We are not going to create a crawler; that is too complicated a task to start with. We are going to start with scraping, and we are going to align that with the overall public service project, which will be funded with a student dedicated 100%.

marianelamin removed the good first issue label Apr 3, 2021