Find and upload Persian News Corpus #1

Open
sehsanm opened this issue Dec 2, 2018 · 9 comments
Comments

@sehsanm
Owner

sehsanm commented Dec 2, 2018

  • Find a Persian news corpus.
  • Define a corpus file standard (to be discussed with the other corpus builders), most probably one sentence per line.
  • Upload the zipped version of the corpus to the S3 bucket (contact @sehsanm for the access details).
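
A minimal sketch of the "one sentence per line, then zip" steps above, assuming sentences end at `.`, `!`, `?` or the Persian question mark `؟`. The agreed file standard is still open, so the function names (`split_sentences`, `write_corpus`, `zip_corpus`) and the splitting rule are hypothetical, and the S3 upload is left out because the bucket details are not public.

```python
import re
import zipfile
from pathlib import Path

# Assumption: a sentence ends at ., !, ? or the Persian question mark,
# followed by whitespace. A real corpus may need a proper Persian tokenizer.
SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_sentences(text):
    """Split raw text into stripped, non-empty sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def write_corpus(documents, txt_path):
    """Write every sentence on its own line (UTF-8); return the sentence count."""
    count = 0
    with open(txt_path, "w", encoding="utf-8") as out:
        for doc in documents:
            for sentence in split_sentences(doc):
                # Collapse inner newlines and repeated spaces so that
                # each sentence occupies exactly one line.
                out.write(" ".join(sentence.split()) + "\n")
                count += 1
    return count


def zip_corpus(txt_path, zip_path):
    """Compress the corpus file into a zip archive ready for upload."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(txt_path, arcname=Path(txt_path).name)
    return zip_path
```

Calling `write_corpus(articles, "corpus.txt")` followed by `zip_corpus("corpus.txt", "corpus.zip")` would produce the archive to upload; the returned sentence count makes it easy to check progress against the size target.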
@abb4s
Collaborator

abb4s commented Dec 3, 2018

I recently found Sketch Engine, a tool for building a corpus from the web. This tutorial explains how to use it: https://www.sketchengine.eu/quick-start-guide/create-your-corpus-lesson-4/ . I have made a sample corpus from Hamshahri news with this tool, which I'll attach here.
ham2_2(1).txt

Alternatively, we can just find an existing corpus such as "Hamshahri" or "irBlog".

@sehsanm
Owner Author

sehsanm commented Dec 3, 2018

Nice tool. I know that someone in our NLP lab has already collected a Persian news corpus. Also note that we should be aiming for more than 10 million sentences. If this tool is able to crawl all of that, let's build a fresh copy.

@abb4s
Collaborator

abb4s commented Dec 3, 2018

It has a 1,000,000-word limit.

@sehsanm
Owner Author

sehsanm commented Dec 4, 2018

If you are starting to work on this, please move it to "In progress".

@zahramajd zahramajd removed their assignment Dec 4, 2018
@maryambiabani maryambiabani self-assigned this Dec 4, 2018
@FullDataAlchemist FullDataAlchemist self-assigned this Dec 8, 2018
@sehsanm
Owner Author

sehsanm commented Dec 19, 2018

Any progress on this, @PoriNiki?

@FullDataAlchemist
Collaborator

FullDataAlchemist commented Dec 19, 2018

Yes. I am trying to get the corpus from "Kanal e Khabar".

@FullDataAlchemist
Collaborator

I have recently been talking with them, but this could take time.

@sehsanm
Owner Author

sehsanm commented Dec 27, 2018

Hello. We are only one Readme.md away from fully completing this task.

@FullDataAlchemist
Collaborator

Hello. On my side, the data in question has been cancelled for now, so I am stepping away from this task.

@FullDataAlchemist FullDataAlchemist removed their assignment Dec 27, 2018

5 participants