Categorize "scientific"/"non scientific" medical documents #4
To do this, an expert (medical doctor, coder, etc.) will label top-level domains as scientific, pseudo-scientific, or trivial.
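A minimal sketch of how domain-level labels could be propagated to individual documents, assuming "top-level domain" here means the host a document was crawled from. The `DOMAIN_LABELS` entries and the `label_for_url` helper are hypothetical placeholders, not part of the project.

```python
from urllib.parse import urlparse

# Hypothetical expert labels keyed by domain; the entries are placeholders.
DOMAIN_LABELS = {
    "journal.example.org": "scientific",
    "miracle-cures.example.com": "pseudo-scientific",
    "forum.example.net": "trivial",
}

def label_for_url(url, domain_labels=DOMAIN_LABELS):
    """Return the expert label of the URL's host, or None if unlabeled."""
    host = (urlparse(url).hostname or "").lower()
    # Try the full host first, then successively shorter parent domains.
    parts = host.split(".")
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in domain_labels:
            return domain_labels[candidate]
    return None
```

Documents from unlabeled domains return `None`, so they can be kept aside as the unlabeled pool.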
Use these two sets to build a translation service that turns professional scientific texts into non-scientific texts that patients can easily understand.
Labeled as scientific (to be updated):
Labeled as non-scientific (to be updated):
Maybe we could use something like https://link.springer.com/article/10.1023%2FA%3A1007692713085?LI=true (Text Classification from Labeled and Unlabeled Documents using EM; I haven't read it yet). The idea would be to start with a small labeled set (e.g. part of "scientific") and use a large unlabeled set (e.g. "scientific" + "non-scientific") as leverage to learn a stable "scientific"/"non-scientific" classifier.
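A rough sketch of the general idea (multinomial Naive Bayes refined with EM over unlabeled documents), not a faithful reimplementation of the linked paper. It assumes documents are already tokenized, that class 0 means "scientific" and class 1 "non-scientific", and that a few labeled examples exist for both classes. All function names are illustrative.

```python
import math
from collections import Counter

def train_nb(docs, weights, vocab, n_classes=2):
    """M-step: fit multinomial Naive Bayes from soft class weights (Laplace smoothing)."""
    prior = [1.0] * n_classes
    counts = [Counter() for _ in range(n_classes)]
    for doc, w in zip(docs, weights):
        for c in range(n_classes):
            prior[c] += w[c]
            for tok in doc:
                counts[c][tok] += w[c]
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_lik = []
    for c in range(n_classes):
        denom = sum(counts[c].values()) + len(vocab)
        log_lik.append({t: math.log((counts[c][t] + 1.0) / denom) for t in vocab})
    return log_prior, log_lik

def posterior(doc, model):
    """E-step for one document: P(class | doc); tokens outside the vocabulary are ignored."""
    log_prior, log_lik = model
    scores = [lp + sum(ll[t] for t in doc if t in ll)
              for lp, ll in zip(log_prior, log_lik)]
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]
    z = sum(exp)
    return [e / z for e in exp]

def em_nb(labeled, unlabeled, n_iter=10):
    """labeled: list of (tokens, class_index); unlabeled: list of token lists."""
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    lab_w = [[1.0 if c == y else 0.0 for c in range(2)] for _, y in labeled]
    # Start from the labeled data only, then alternate E- and M-steps.
    model = train_nb(lab_docs, lab_w, vocab)
    for _ in range(n_iter):
        unl_w = [posterior(d, model) for d in unlabeled]
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, vocab)
    return model
```

Words that only occur in unlabeled documents (for example a term that co-occurs with known scientific vocabulary) pick up class evidence through the soft labels. If only "scientific" examples can be labeled, this two-class setup would need to be adapted.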
Create a data set with pairs of synonyms, one being scientific, the other being non-scientific:
Maybe test this out in https://github.com/eonum/medword.
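One possible shape for such a synonym data set, with a naive dictionary-based substitution as a first baseline. The pairs below are illustrative examples rather than project data, and `to_lay_terms` is a hypothetical helper that does not depend on medword.

```python
import re

# Illustrative pairs (scientific term, lay term); not taken from the project.
SYNONYM_PAIRS = [
    ("myocardial infarction", "heart attack"),
    ("hypertension", "high blood pressure"),
    ("pyrexia", "fever"),
]

def to_lay_terms(text, pairs=SYNONYM_PAIRS):
    """Replace each scientific term with its lay synonym (case-insensitive, whole words)."""
    # Longer terms first so multi-word terms win over any shorter overlapping term.
    for sci, lay in sorted(pairs, key=lambda p: -len(p[0])):
        text = re.sub(r"\b" + re.escape(sci) + r"\b", lay, text, flags=re.IGNORECASE)
    return text
```

A plain lookup like this ignores context and inflection, so it is only a starting point for evaluating the pairs.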
In a second step we could extract the "scientific" medical documents from our positive set.