LinusTechTip Programming Forum #32

Open
3 tasks
PhungVanDuy opened this issue Sep 30, 2022 · 1 comment
Labels
dataset-request Request for addition of new dataset

Comments

@PhungVanDuy
Collaborator

Title

Dataset URL - LinusTechTip

Does the dataset exist in a scraped format? No

Description

This is a well-known programming forum. A quick scan shows it has more than 10,000 topics dating back to 2013.

Procedure

Tests

Include a dummy_dataset.parquet file to test your code against. This dummy_dataset should include the columns for the data and metadata associated with the dataset, which will then be converted into the final format for language model consumption, along with an example row or rows that you can verify your code correctly collects. In addition to this file, include the unit test that evaluates your code against this dummy_dataset.

Give an example of the columns and data:

col1 col2 ....
row1 row1 ....
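
As a rough illustration of what such a test could look like, here is a minimal sketch. The column names (`thread_id`, `title`, `text`, `url`) and the example rows are hypothetical placeholders, since the issue does not define a schema. Writing the Parquet file is assumed to go through pandas with a Parquet engine such as pyarrow installed.

```python
import unittest

# Hypothetical column names; the issue does not specify a schema.
EXPECTED_COLUMNS = ["thread_id", "title", "text", "url"]


def make_dummy_rows():
    """Return a few hand-written rows to check scraper output against."""
    return [
        {"thread_id": 1, "title": "Example thread",
         "text": "First post body", "url": "https://example.com/topic/1"},
        {"thread_id": 2, "title": "Another thread",
         "text": "Second post body", "url": "https://example.com/topic/2"},
    ]


def write_dummy_parquet(rows, path="dummy_dataset.parquet"):
    """Write the rows to a Parquet file (assumes pandas plus pyarrow)."""
    import pandas as pd  # imported lazily so validation works without it
    pd.DataFrame(rows, columns=EXPECTED_COLUMNS).to_parquet(path, index=False)


def validate_rows(rows):
    """Return a list of problems; an empty list means the rows look fine."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing {missing}")
        elif not str(row["text"]).strip():
            problems.append(f"row {i}: empty text")
    return problems


class DummyDatasetTest(unittest.TestCase):
    def test_rows_match_schema(self):
        self.assertEqual(validate_rows(make_dummy_rows()), [])
```

In a real submission, `validate_rows` would be applied to the scraper's output rather than to the hand-written rows, and the columns would be replaced with whatever the final dataset actually contains.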
@PhungVanDuy PhungVanDuy added the dataset-request Request for addition of new dataset label Sep 30, 2022
@bentrevett

I wrote a quick scraper for this. I'm unsure of the required format, but it writes each page of a thread as one JSON object per line.

https://gist.github.com/bentrevett/274db7de0258bab8adf235045344bed7

There are two types of threads:

  • Standard threads with comments
  • "Question" threads, in which each comment is a "suggestedAnswer", and also provides the "acceptedAnswer" (potentially useful for some form of QA data?)
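
To make the distinction between the two thread types concrete, here is a small sketch of how the JSON-lines output might be split into question-answer pairs. It assumes that `suggestedAnswer` and `acceptedAnswer` appear as top-level keys of each record and that the question and answer bodies live under a `text` key. Neither assumption is confirmed by the issue, and the actual layout produced by the gist may differ.

```python
import json


def thread_type(record):
    """Classify one scraped page as 'question' or 'standard'.

    Assumes question threads carry 'suggestedAnswer' or 'acceptedAnswer'
    as top-level keys (hypothetical layout).
    """
    if "suggestedAnswer" in record or "acceptedAnswer" in record:
        return "question"
    return "standard"


def qa_pairs(lines):
    """Yield (question_text, accepted_answer_text) from JSON-lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if thread_type(record) != "question":
            continue
        accepted = record.get("acceptedAnswer")
        # Only emit a pair when an accepted answer object is present.
        if isinstance(accepted, dict):
            yield record.get("text", ""), accepted.get("text", "")
```

Passing an open file handle to `qa_pairs` works as well as a list of strings, since both iterate line by line.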

@ncoop57 ncoop57 moved this to In Progress in Pile V2 Nov 11, 2022
@ncoop57 ncoop57 added this to Pile V2 Nov 11, 2022
@ncoop57 ncoop57 moved this from In Progress to Todo in Pile V2 Nov 11, 2022