Octopus

TODOs

Text analysis: look at patterns that separate problem from non-problem statements, e.g. frequently occurring bigrams, trigrams and phrases. Interesting library: scattertext
Rule-based matching as a final processing step (after model prediction) to clean up false positives and false negatives. Either regex or spaCy's PhraseMatcher [interactive] is a good option
- Advanced: Dependency Matching working on syntax trees instead of sentence patterns
Hierarchical Clustering: exploratory notebooks that survey the current state of the art (SotA) in unsupervised clustering, try promising libraries or algorithms on Octopus' data and assess whether the approach is feasible
DevOps: hooks, AWS configs, scripts, GitHub Actions and general CI/CD for testing, validation and build workflows
Software 2.0 Infra: set up an active-learning loop for efficient human labeling using prodi.gy, labelstud.io or similar
Bespoke App for Language Model Interpretation à la Markus' Netlens
Clustering and Analysis: use clusteval or hnet, or define a custom cluster-quality metric. Try different approaches (HDBSCAN, UMAP, t-SNE)
Bespoke App for open-source contributors to label data and create regex-like pattern matchers through an easy-to-learn syntax, eliminating or supporting software development and modeling work
Advanced: Automatic Pattern Discovery: given examples of text, find the common patterns underlying subsets of them. This probably involves evolutionary algorithms and good computational linguistics knowledge, and will warrant a stand-alone library. Example: PatternOmatic (doesn't really work)
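
Two of the items above, n-gram analysis and rule-based post-processing, can be prototyped with the standard library before reaching for scattertext or spaCy. The following is a minimal sketch, not the project's implementation: the function names, the tokenizer and both regex rules are hypothetical and only illustrate the idea of overriding a model prediction with hand-written patterns.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Return the lowercase word n-grams of a sentence as tuples."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinctive_ngrams(problems, non_problems, n=2, top=5):
    """N-grams that occur more often in problem than in non-problem statements."""
    pos = Counter(g for s in problems for g in ngrams(s, n))
    neg = Counter(g for s in non_problems for g in ngrams(s, n))
    # Keep only n-grams with a positive count difference.
    diff = {g: c - neg[g] for g, c in pos.items() if c > neg[g]}
    # Largest difference first; ties broken alphabetically for stable output.
    return sorted(diff, key=lambda g: (-diff[g], g))[:top]


# Hypothetical override rules, applied after the model has predicted a label.
FORCE_PROBLEM = re.compile(
    r"\b(remains (unclear|unknown)|is poorly understood)\b", re.I
)
FORCE_NON_PROBLEM = re.compile(r"^\s*(figure|table)\s+\d+", re.I)


def postprocess(text, predicted_is_problem):
    """Correct a model prediction with simple regex rules."""
    if FORCE_NON_PROBLEM.search(text):
        return False  # caption-like text: treat as a false positive
    if FORCE_PROBLEM.search(text):
        return True   # strong problem cue: treat as a false negative
    return predicted_is_problem
```

The n-gram step can suggest candidate phrases, which a person then reviews and turns into rules; the rules here are placeholders and would need to be derived from the actual data.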

DATASETs `/datasets`

Datasets are in a public Google Drive folder: https://drive.google.com/drive/folders/1SN6nHxgW9InLpJhUm7bzirU4wzd7NT0G
problem_statements.csv: processed dataset of 3500+ labeled statements. Rows with a "PMID" entry are biomedical and were human-labeled by a team member; check the "source" column to see where each row comes from. The dataset also includes ~500 problem statements and 1500 non-problem statements from ACL (computational linguistics) papers; source: Identifying problems and solutions in scientific text
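
A quick sanity check of the file might look like the sketch below. It relies only on the "source" and "PMID" columns mentioned above; the function name `summarize` and the local path in the usage comment are assumptions, and any other column names in the real file are not known here.

```python
import csv
from collections import Counter


def summarize(csv_lines):
    """Count rows per `source` value and rows with a non-empty `PMID`.

    `csv_lines` is any iterable of CSV lines, e.g. an open file object.
    """
    per_source = Counter()
    with_pmid = 0
    for row in csv.DictReader(csv_lines):
        per_source[row.get("source") or "unknown"] += 1
        if (row.get("PMID") or "").strip():
            with_pmid += 1  # biomedical, human-labeled row
    return per_source, with_pmid


# Usage (the path is an assumption about where the download was saved):
# with open("datasets/problem_statements.csv", newline="", encoding="utf-8") as f:
#     per_source, with_pmid = summarize(f)
#     print(per_source.most_common(), with_pmid)
```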
