
Flatten OVSP dataset media links #46

Open · 3 of 13 tasks
dieko95 opened this issue Mar 6, 2021 · 4 comments

dieko95 commented Mar 6, 2021

Problem Description

Currently, we have a dataset with media links (Twitter posts or news articles). We need to flatten the dataset by adding a new column that contains the raw text from each row's media link.

Proposed Solution

Develop a web crawler that parses the text behind each media link and adds it to a new column named media_text.

General web scraper:

  • Web scraper by clusters:
    • Group publishers by news frequency.
    • Common publishers (high-frequency):
      • Create dedicated methods to extract news articles (see the sketch below).
    • Low-frequency publishers:
      • Half-machine, half-human solution. Explore Selenium for this purpose.
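
For a high-frequency publisher, a dedicated extraction method could be a small requests + BeautifulSoup routine keyed to that publisher's page layout. A minimal sketch (the CSS selector is a placeholder; each dedicated method would use the selector that matches its publisher's HTML):

```python
import requests
from bs4 import BeautifulSoup

def extract_article_text(url: str) -> str:
    """Fetch a news article and return its body text.

    Sketch only: the selector below is a placeholder; each
    high-frequency publisher would get its own dedicated selector.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    paragraphs = soup.select("div.article-body p")  # placeholder selector
    return "\n".join(p.get_text(strip=True) for p in paragraphs)
```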

Deliverable

  • py script with web crawler.

Action items (checkboxes)

  • Identify how many different media links the dataset has (e.g., Twitter, newspapers, etc.).
  • Group publishers by news frequency.
  • Create a scaffold for the more general DataTransformer class. @VKorelsky
    • Develop a web crawler for the different publisher links.
    • Group the different web crawlers into one class.
    • Create the updated dataset with the media_text column.
    • ...
  • Create dedicated reducers for the most frequent webpages (see details in the comments).
  • Research how to best extract text from less frequent links.
    • Use a combination of Selenium, Python, and human text selection to get the text, extracting precise information from unstructured pages in a semi-automated fashion (see the Selenium sketch after this list).
    • ?
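
As a sketch of how the half-machine, half-human flow could work (the workflow is an assumption, not a settled design): Selenium opens the page, a human highlights the article text in the browser, and the script reads the selection back via JavaScript.

```python
from selenium import webdriver

def human_assisted_extract(url: str) -> str:
    """Open a page and return whatever text a human selects in it."""
    driver = webdriver.Chrome()  # assumes a local ChromeDriver setup
    try:
        driver.get(url)
        input("Highlight the article text in the browser, then press Enter... ")
        # Read the human's selection back out of the page.
        return driver.execute_script("return window.getSelection().toString();")
    finally:
        driver.quit()
```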

Further Documentation

dieko95 added the good first issue label Mar 6, 2021

dieko95 commented Mar 6, 2021

@Edilmo @marianelamin please feel free to modify the action items as you see fit. I have created the first pass based on what we talked about at today's meeting.

marianelamin (Collaborator) commented

Great! I have identified 108 websites to visit. This Colab Notebook shows how the number was computed.
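
A sketch of one way to compute such a count (the file and column names here are assumptions, not the notebook's actual code) is to reduce each link to its domain:

```python
from urllib.parse import urlparse

import pandas as pd

df = pd.read_csv("ovsp_dataset.csv")  # placeholder file name

# Reduce each media link to its domain and count the distinct sites.
domains = df["media_link"].dropna().map(lambda url: urlparse(url).netloc)
print(domains.nunique())              # distinct websites to visit
print(domains.value_counts().head())  # most frequent publishers first
```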


VKorelsky commented Mar 8, 2021

I have a few questions 😃

Is this something we want to have as part of the library (in src/c4v/data), or rather as a separate one-off script?
Or a combination of the two? If we have it as a one-off, does the code still go in this repo?

An idea of how we can take this:

A) Write a more general DataTransformer class (in the c4v lib) offering a method that takes in an object representing a row and a reducer function, and returns the value computed by applying the reducer to the row.
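
A minimal sketch of what A) could look like (class and method names are assumptions, not the eventual c4v API):

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]
Reducer = Callable[[Row], str]

class DataTransformer:
    """Sketch of idea A): apply a caller-supplied reducer to a row."""

    def transform_row(self, row: Row, reducer: Reducer) -> str:
        # The transformer stays generic; all link/scraper logic
        # lives in the reducer passed in by the caller.
        return reducer(row)
```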

B) Separately, we can write the reducer function, which extracts the link from the row and passes it to a specific scraper (using a map of domain_name to scraper instance, for example), returning the text the scraper extracts.

C) Besides that, we develop the scrapers that apply to the domains we want to scrape. If we define an interface for them, it'll be easy to integrate them with B).
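
Sketching B) and C) together (the Scraper interface, the empty registry, and the media_link key are all assumptions):

```python
from abc import ABC, abstractmethod
from typing import Dict
from urllib.parse import urlparse

class Scraper(ABC):
    """Interface from C): one implementation per publisher domain."""

    @abstractmethod
    def scrape(self, url: str) -> str:
        ...

# Hypothetical registry mapping domain names to scraper instances,
# e.g. {"example.com": ExampleScraper()}.
SCRAPERS: Dict[str, Scraper] = {}

def media_text_reducer(row: dict) -> str:
    """Reducer from B): route the row's link to the matching scraper."""
    url = row["media_link"]  # column name is an assumption
    domain = urlparse(url).netloc
    scraper = SCRAPERS.get(domain)
    return scraper.scrape(url) if scraper else ""
```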

D) Then, putting it all together, we can batch-process the Excel file as a CSV (or other format), passing each individual row to A) with the reducer from B), and writing the output to a new column.
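
And a sketch of D) with pandas, reusing the DataTransformer and reducer sketched above (file names are placeholders):

```python
import pandas as pd

df = pd.read_csv("ovsp_dataset.csv")  # placeholder file name

transformer = DataTransformer()
df["media_text"] = df.apply(
    lambda row: transformer.transform_row(row.to_dict(), media_text_reducer),
    axis=1,
)
df.to_csv("ovsp_dataset_flattened.csv", index=False)
```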

Regarding A) - the question I have is whether there are libraries that already do this? (I found this, for example.)

dieko95 linked a pull request Mar 13, 2021 that will close this issue

Edilmo commented Mar 14, 2021

Hey guys, let's discuss this. The task is too broad.

The idea was to get the current dataset flattened in a simple way, and to use that to make some decisions.

We are not going to create a crawler; that is too complicated a task to start with. We are going to start with scraping, and we are going to align that with the overall public service project, which will be funded with a student dedicated 100%.

marianelamin removed the good first issue label Apr 3, 2021