Write script to Properly tokenize and normalize all corpora #3

Open
sehsanm opened this issue Dec 3, 2018 · 16 comments

Comments

@sehsanm
Owner

sehsanm commented Dec 3, 2018

Write a script that properly normalizes and tokenizes the corpora we have.

This includes:

  • Normalize Persian/Arabic characters
  • Unify spaces and half-spaces (the zero-width non-joiner, ZWNJ)
  • Use advanced tokenization, meaning that spaces inside a single token are replaced with the zero-width non-joiner

After cleanup, upload the clean corpora back to the S3 bucket so they can be used later.
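
The cleanup steps listed above could be sketched roughly as follows. This is only a minimal illustration, assuming the "non-width space" is the zero-width non-joiner (U+200C). The function names `normalize` and `join_affixes`, the character map, and the tiny affix list are all hypothetical choices for this sketch and are not taken from the project.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian half-space

# Assumed mapping: Arabic code points -> Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0649": "\u06cc",  # alef maksura -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf (keheh)
})

# Affixes written with escapes so the exact code points are unambiguous.
MI = "\u0645\u06cc"  # verbal prefix "mi"
NA = "\u0646"        # negation letter, as in "nemi"
HA = "\u0647\u0627"  # plural suffix "ha"
YE = "\u06cc"        # Persian yeh, for the "haye" / "hayi" forms


def normalize(text: str) -> str:
    """Map Arabic letters to Persian ones and unify spaces and half-spaces."""
    text = text.translate(CHAR_MAP)
    # Turn tabs and exotic horizontal spaces into a plain space (newlines are kept).
    text = re.sub(r"[\t\u00a0\u2000-\u200a\u202f\u3000]", " ", text)
    # Collapse each run of spaces/ZWNJs: a run containing a space becomes one
    # space, a run made only of ZWNJs becomes one ZWNJ.
    return re.sub(
        r"[ \u200c]+",
        lambda m: " " if " " in m.group() else ZWNJ,
        text,
    )


def join_affixes(text: str) -> str:
    """Heuristic: attach a detached prefix or suffix with a ZWNJ instead of a space."""
    # Standalone "mi" / "nemi" followed by a word.
    text = re.sub(rf"(?<!\S)({NA}?{MI}) (?=\S)", r"\1" + ZWNJ, text)
    # Standalone "ha" / "haye" / "hayi" preceded by a word.
    text = re.sub(rf"(?<=\S) ({HA}(?:{YE}{YE}?)?)(?!\S)", ZWNJ + r"\1", text)
    return text
```

The affix rules are deliberately tiny and purely pattern-based, so they will misfire on some words; a real corpus pass would need a fuller list or a dedicated Persian NLP library.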

@sehsanm sehsanm added the CORPUS label Dec 3, 2018
@sehsanm sehsanm added this to the Assignment milestone Dec 3, 2018
@nkm96
Collaborator

nkm96 commented Dec 7, 2018

Hi, how can I take this task?
Also, the invitation link is not working for me :)
My GitHub ID: nkm96
Gmail: [email protected]

@nkm96 nkm96 self-assigned this Dec 8, 2018
@sehsanm
Owner Author

sehsanm commented Dec 8, 2018 via email

@nkm96
Collaborator

nkm96 commented Dec 8, 2018 via email

@sehsanm
Owner Author

sehsanm commented Dec 19, 2018

Any progress on this? What is your plan?

@nkm96
Collaborator

nkm96 commented Dec 22, 2018

I'll do it this weekend, once the speech course finishes. I want to use something like the max term algorithm and improve on it. I tested some training code last weekend, but it did not work.

@sehsanm
Owner Author

sehsanm commented Dec 27, 2018

Hello,
Please keep me informed of the progress.

@nkm96
Collaborator

nkm96 commented Dec 27, 2018

Hello, sure. I'll have the results for you by Saturday.

@sehsanm
Owner Author

sehsanm commented Jan 1, 2019

Hello, how has the progress been?
My suggestion is to use the Step One service.

@nkm96
Collaborator

nkm96 commented Jan 2, 2019 via email

@sehsanm
Owner Author

sehsanm commented Jan 2, 2019 via email

@nkm96
Collaborator

nkm96 commented Jan 3, 2019

For example, it mangles "علی درحال دویدن است" ("Ali is running") and turns the ending into "دویدناست", gluing two words together. In general, the tokenization step breaks text that was already correct.
I wanted to know whether I will lose marks if I cannot complete the task. I was among the last people to pick a task and nobody else wanted this one, so nobody loses out. I also did exercise 5 today to make up for it in terms of the number of exercises submitted.
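
A failure like the one reported above, where the space between two words is deleted outright rather than replaced by a half-space, can be detected automatically by aligning a tool's output against its input. The following is only a sketch under that assumption; `merged_spaces` is a hypothetical helper and not part of any of the tools mentioned in this thread.

```python
ZWNJ = "\u200c"  # zero-width non-joiner, the Persian half-space
SEPARATORS = {" ", ZWNJ}


def merged_spaces(original: str, tokenized: str) -> list[int]:
    """Return indices in `original` of spaces that the tokenizer deleted.

    Turning a space into a ZWNJ (or the reverse) is accepted; a space that
    disappears entirely means two words were glued together.
    """
    merged = []
    i = j = 0
    while i < len(original) and j < len(tokenized):
        a, b = original[i], tokenized[j]
        if a == b or (a in SEPARATORS and b in SEPARATORS):
            i += 1
            j += 1
        elif a in SEPARATORS:
            # Separator present in the input but missing from the output.
            if a == " ":
                merged.append(i)
            i += 1
        elif b in SEPARATORS:
            # Separator inserted by the tokenizer; skip it.
            j += 1
        else:
            raise ValueError("texts differ in more than spacing")
    return merged
```

Running such a check over a sample of each corpus before and after a tool is applied would give a quick signal of whether that tool damages well-formed text.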

@nkm96
Collaborator

nkm96 commented Jan 3, 2019

I also tried jhazm. The raw library itself gave me better results than the C# version and Step One.

@sehsanm
Owner Author

sehsanm commented Jan 6, 2019

Hello,
Are you going to do this task now or not?

@nkm96
Collaborator

nkm96 commented Jan 7, 2019 via email

@sehsanm
Owner Author

sehsanm commented Jan 7, 2019 via email

@nkm96
Collaborator

nkm96 commented Jan 7, 2019 via email

@sehsanm sehsanm assigned kibamin and unassigned nkm96 Jan 8, 2019

3 participants