Octopus

TODOs

  • Text analysis: Look for patterns that distinguish problem from non-problem statements, e.g. frequently occurring bigrams, trigrams and phrases. Interesting library: scattertext

  • Rule-based matching as a final processing step (after model prediction) to clean up false positives and false negatives. Regex and spaCy's PhraseMatcher [interactive] are both good options

  • Hierarchical clustering: exploratory notebooks that survey the current SotA in unsupervised clustering and try promising libraries or algorithms on Octopus' data to see whether the approach is feasible

  • DevOps: hooks, AWS configs, scripts, GitHub Actions and general CI/CD for testing, validation and build workflows

  • Software 2.0 infra: Set up an active learning loop for efficient human labeling using prodi.gy, labelstud.io or similar

  • Bespoke app for language model interpretation à la Markus' Netlens

  • Clustering and analysis: use clusteval or hnet, or define a custom cluster-quality metric. Try different approaches (HDBSCAN, UMAP, t-SNE)

  • Bespoke app for open-source contributors to label data and create regex-like pattern matchers through an easy-to-learn syntax, eliminating or supporting software development and modeling work

  • Advanced: Automatic pattern discovery: given examples of text, find the common patterns underlying subsets of them. This probably involves evolutionary algorithms and solid computational linguistics knowledge, and will warrant a stand-alone library. Example: PatternOmatic (doesn't really work)
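
As a rough illustration of the rule-based matching item above, here is a minimal sketch of a regex post-processing step that runs after the model's prediction. The function name `postprocess`, the two rule lists and the patterns themselves are hypothetical and not taken from the Octopus codebase; real rules would presumably come from error analysis on labeled data.

```python
import re

# Hypothetical override rules, for illustration only.
# Text matching one of these is treated as a problem statement
# even if the model said otherwise (fixes false negatives).
FORCE_PROBLEM = [
    re.compile(r"\b(?:unable to|fails? to|struggl\w+ (?:to|with))\b", re.I),
]

# Text matching one of these is treated as a non-problem statement
# even if the model said otherwise (fixes false positives).
FORCE_NOT_PROBLEM = [
    re.compile(r"^\s*(?:thanks|thank you)\b", re.I),
]


def postprocess(text: str, predicted_problem: bool) -> bool:
    """Return the final label after applying the override rules.

    `predicted_problem` is the model's boolean prediction for `text`.
    """
    if predicted_problem and any(p.search(text) for p in FORCE_NOT_PROBLEM):
        return False  # rule overrides a likely false positive
    if not predicted_problem and any(p.search(text) for p in FORCE_PROBLEM):
        return True  # rule overrides a likely false negative
    return predicted_problem  # no rule fired: keep the model's prediction


if __name__ == "__main__":
    print(postprocess("Thanks for the quick reply!", True))
    print(postprocess("Users are unable to export reports.", False))
```

Keeping the rules in plain lists means they can be reviewed and extended without retraining. The same structure should also work with spaCy's PhraseMatcher in place of the compiled regexes.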

DATASETs /datasets