Define training set schema #49

Open
dieko95 opened this issue Mar 24, 2021 · 2 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@dieko95
Member

dieko95 commented Mar 24, 2021

Problem

We have not yet defined the schema of the flattened dataset that the Hugging Face transformer will consume.

Proposed Solution

Define the schema of the training dataset that will be used to train the Hugging Face transformer.

  • For example:
    • Column names: text, news_title, location, issue, source_type, author, etc.
    • Whether each column is nullable
    • Variable type of each column (varchar, int, float, etc.)
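As a starting point for discussion, the three properties listed above (name, type, nullability) could be captured in code and rendered into a table for the readme. This is only a sketch: the column names come from the example above, but every type and nullable flag below is a placeholder assumption, and `Column`, `SCHEMA`, `schema_to_markdown`, and `validate_row` are hypothetical names, not existing project code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str       # e.g. "varchar", "int", "float"
    nullable: bool


# Column names are taken from the issue; types and nullability are
# placeholder assumptions until the schema is actually agreed on.
SCHEMA = [
    Column("text", "varchar", nullable=False),
    Column("news_title", "varchar", nullable=True),
    Column("location", "varchar", nullable=True),
    Column("issue", "varchar", nullable=True),
    Column("source_type", "varchar", nullable=True),
    Column("author", "varchar", nullable=True),
]


def schema_to_markdown(schema):
    """Render the schema as a markdown table suitable for a readme."""
    lines = ["| Column | Type | Nullable |", "| --- | --- | --- |"]
    for col in schema:
        flag = "yes" if col.nullable else "no"
        lines.append(f"| {col.name} | {col.dtype} | {flag} |")
    return "\n".join(lines)


def validate_row(row, schema):
    """Return a list of problems found in one row (empty list if none)."""
    known = {col.name for col in schema}
    problems = [f"unexpected column: {key}" for key in row if key not in known]
    for col in schema:
        if row.get(col.name) is None and not col.nullable:
            problems.append(f"{col.name} must not be null")
    return problems


if __name__ == "__main__":
    print(schema_to_markdown(SCHEMA))
```

Keeping the schema in one list means the readme table and any row validation are generated from the same definition, so they cannot drift apart.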

Deliverable

  • A readme.md documenting the dataset's schema.
@dieko95 dieko95 self-assigned this Mar 24, 2021
@dieko95 dieko95 added the documentation Improvements or additions to documentation label Mar 24, 2021
@marianelamin
Collaborator

marianelamin commented Apr 20, 2021

From the first PoC with El Pitazo, it looks like we can get:
title, content, date, author, categories, and tags.
It would be good to check whether the same fields can also be extracted from our next sources.

In the meantime, for a VP we are counting on just the content of the post. If the design changes, we will post an update here.
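To make the "content only, everything else optional" idea concrete, here is a rough sketch of flattening one scraped post into a single record. It assumes a post arrives as a plain dict keyed by the field names above and that list fields such as categories and tags are joined with `|`. Both are assumptions for illustration, and `flatten_post` is a hypothetical helper, not part of the scraper.

```python
# Fields seen in the El Pitazo PoC other than content; all treated as optional.
OPTIONAL_FIELDS = ("title", "date", "author", "categories", "tags")


def flatten_post(post):
    """Flatten one scraped post (assumed to be a dict) into a flat record.

    Only "content" is required; missing or empty optional fields become None.
    """
    content = (post.get("content") or "").strip()
    if not content:
        raise ValueError("post has no content")

    record = {"content": content}
    for field in OPTIONAL_FIELDS:
        value = post.get(field)
        # Join list-like fields into one string so the record stays flat.
        if isinstance(value, (list, tuple)):
            value = "|".join(str(item) for item in value)
        record[field] = value if value else None
    return record
```

With this shape, a source that only exposes the post body still produces a valid record, and sources that expose more fields fill in the optional columns.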

@dieko95
Member Author

dieko95 commented Apr 21, 2021

@marianelamin Gotcha! Thanks a lot for the update 🙌


4 participants