Jupyter notebooks for applying and experimenting with machine learning (ML) and Large Language Models (LLMs) provided by industry leaders such as Cohere, HuggingFace, LangChain, and OpenAI.
- 01. Binary Classification w/ SVM and Transformer-based Embeddings
- 02. Multiclass Classification w/ Random Forest and Transformer-based Embeddings
- 03. Multiclass Classification w/ Cohere-Classify
- 04. OpenAI Functions w/ Langchain and Pydantic
- 05. Named Entity Recognition to Enrich Text
- 06. Clustering and Topic Modeling of arXiv dataset (10k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain
- 07. Transformers Self-Attention
01. Binary Classification w/ SVM and Transformer-based Embeddings

Tags: [binary-classification] [embeddings] [svm] [cohere] [openai] [tfidfvectorizer]
This notebook illustrates how to perform binary text classification with just a few hundred samples. It trains a basic Support Vector Machine on a collection of labeled financial sentences (400 training samples) and compares its accuracy across three feature representations:
- transformer-based embeddings using Cohere.
- transformer-based embeddings using OpenAI.
- frequency-based embeddings using TfidfVectorizer.
```
SVM Binary-Text Classification Accuracy (550 samples)
------------------------------------------------------
w/ Cohere 'embed-english-v3.0':      94.93%
w/ OpenAI 'text-embedding-ada-002':  89.13%
w/ TfidfVectorizer:                  65.22%
```
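A minimal sketch of the Cohere-vs-TF-IDF comparison, assuming a Cohere API key in the environment and pre-split lists (`train_texts`, `train_labels`, `test_texts`, `test_labels` are illustrative names); the OpenAI embedding variant is analogous:

```python
import os

import cohere
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

co = cohere.Client(os.environ["COHERE_API_KEY"])

def embed(texts):
    # Embed v3 models require an input_type; "classification" suits this task.
    return co.embed(
        texts=texts, model="embed-english-v3.0", input_type="classification"
    ).embeddings

# SVM on transformer-based embeddings.
svm_emb = SVC().fit(embed(train_texts), train_labels)
print("Cohere embeddings:",
      accuracy_score(test_labels, svm_emb.predict(embed(test_texts))))

# SVM on frequency-based features, for comparison.
tfidf = TfidfVectorizer()
svm_tf = SVC().fit(tfidf.fit_transform(train_texts), train_labels)
print("TfidfVectorizer:",
      accuracy_score(test_labels, svm_tf.predict(tfidf.transform(test_texts))))
```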
02. Multiclass Classification w/ Random Forest and Transformer-based Embeddings

Tags: [multiclass-classification] [embeddings] [hyperparameter-tuning] [random-forest] [cohere] [countvectorizer]
This notebook illustrates how to train a random-forest model with hyperparameter tuning for multiclass classification. It assesses the performance of combining said random forest with:
- transformer-based embeddings using Cohere.
- a bag-of-words vectorizer using CountVectorizer.

It achieves 88.80% accuracy with approximately 200 training samples per class.
```
Accuracy: 88.80%

              precision    recall  f1-score   support

    Business       0.85      0.82      0.83        55
    Sci/Tech       0.89      0.85      0.87        65
      Sports       0.90      0.93      0.91        69
       World       0.91      0.95      0.93        61

    accuracy                           0.89       250
   macro avg       0.89      0.89      0.89       250
weighted avg       0.89      0.89      0.89       250
```
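A minimal sketch of the CountVectorizer variant with grid-search tuning (the grid values are illustrative, and `train_texts`/`train_labels`/`test_texts`/`test_labels` are assumed pre-split); for the embedding variant, the Cohere `embed` step from notebook 01 replaces the vectorizer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Bag-of-words features.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Illustrative hyperparameter grid, tuned with 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1
)
search.fit(X_train, train_labels)

print(search.best_params_)
print(classification_report(test_labels, search.predict(X_test)))
```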
03. Multiclass Classification w/ Cohere-Classify

Tags: [multiclass-classification] [cohere]
This notebook illustrates how to use Cohere Classify for multiclass classification. It achieves 94.74% accuracy with approximately 200 training samples per class.
```
Accuracy: 94.74%

              precision    recall  f1-score   support

    Business       0.90      0.90      0.90        20
    Sci/Tech       0.96      0.92      0.94        24
      Sports       1.00      0.96      0.98        28
       World       0.92      1.00      0.96        23

    accuracy                           0.95        95
   macro avg       0.94      0.95      0.94        95
weighted avg       0.95      0.95      0.95        95
```
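A minimal sketch of a Cohere Classify call (the example texts are placeholders, and the `ClassifyExample` import path has moved between SDK versions):

```python
import os

import cohere
from cohere import ClassifyExample  # older SDKs: cohere.responses.classify.Example

co = cohere.Client(os.environ["COHERE_API_KEY"])

# The API needs at least two labeled examples per class; the notebook
# supplies roughly 200 per class.
examples = [
    ClassifyExample(text="Stocks rallied after the earnings report.", label="Business"),
    ClassifyExample(text="The central bank held interest rates steady.", label="Business"),
    ClassifyExample(text="The new chip doubles training throughput.", label="Sci/Tech"),
    ClassifyExample(text="Researchers unveiled a quantum error-correction scheme.", label="Sci/Tech"),
    ClassifyExample(text="The striker scored twice in the final.", label="Sports"),
    ClassifyExample(text="The marathon record fell by nearly a minute.", label="Sports"),
    ClassifyExample(text="Leaders met to negotiate the ceasefire.", label="World"),
    ClassifyExample(text="The summit ended without a joint statement.", label="World"),
]

response = co.classify(
    inputs=["Champions League draw announced today"], examples=examples
)
for c in response.classifications:
    print(c.input, "->", c.prediction, f"({c.confidence:.2f})")
```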
04. OpenAI Functions w/ Langchain and Pydantic

Tags: [openai] [langchain] [pydantic] [function-calling] [function-creation]
This notebook demonstrates how to combine LangChain and Pydantic as an abstraction layer to facilitate the process of creating OpenAI functions and handling JSON formatting.
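A minimal sketch of the pattern, assuming a 2023-era LangChain release where `create_structured_output_chain` is available (newer versions expose the same idea via `with_structured_output`); the `Person` schema is illustrative:

```python
from langchain.chains.openai_functions import create_structured_output_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person mentioned in the text."""
    name: str = Field(description="the person's full name")
    age: int = Field(description="the person's age in years")

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = ChatPromptTemplate.from_messages(
    [("system", "Extract the requested information."), ("human", "{input}")]
)

# LangChain converts the Pydantic model into an OpenAI function schema and
# parses the returned JSON arguments back into a validated Person instance.
chain = create_structured_output_chain(Person, llm, prompt)
result = chain.run("Ada Lovelace was 36 years old when she died.")
print(result)  # Person(name='Ada Lovelace', age=36)
```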
05. Named Entity Recognition to Enrich Text

Tags: [openai] [named-entity-recognition] [function-calling] [function-creation] [wikipedia]
Named Entity Recognition (NER) is a Natural Language Processing task that identifies and classifies named entities (NE) into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.
This notebook demonstrates how to carry out NER with OpenAI Chat Completion and function calling to enrich a block of text with links to a knowledge base such as Wikipedia.
This notebook is also available at openai/openai-cookbook/examples/Named_Entity_Recognition_to_enrich_text.ipynb
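A minimal sketch of the approach using the legacy `openai<1.0` ChatCompletion interface (the function schema and the Wikipedia URL construction are illustrative):

```python
import json

import openai

functions = [{
    "name": "enrich_entities",
    "description": "Record the named entities found in the text.",
    "parameters": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Named entities: persons, organizations, locations, ...",
            }
        },
        "required": ["entities"],
    },
}]

text = "Albert Einstein developed the theory of relativity in Berlin."
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Extract the named entities: {text}"}],
    functions=functions,
    function_call={"name": "enrich_entities"},  # force the function call
)
entities = json.loads(resp.choices[0].message.function_call.arguments)["entities"]

# Enrich the original text with markdown links to Wikipedia.
for entity in entities:
    url = "https://en.wikipedia.org/wiki/" + entity.replace(" ", "_")
    text = text.replace(entity, f"[{entity}]({url})")
print(text)
```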
06. Clustering and Topic Modeling of arXiv dataset (10k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain
Tags: [clustering] [cohere] [embeddings] [HDBSCAN] [langchain] [pydantic] [topic-modeling] [openai]
We combine the advanced Cohere and GPT-4 Large Language Models with HDBSCAN, Pydantic, and LangChain for clustering and topic modeling. Our playground is a dataset of 10,000 arXiv research papers from Computational Linguistics (Natural Language Processing) published between 2019 and 2023, enriched with title and abstract embeddings generated with the newest Cohere Embed v3 specifically for the clustering task. To measure clustering and topic-modeling effectiveness, we visualize the outcomes after applying UMAP dimensionality reduction.
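A minimal sketch of the embed-cluster-visualize pipeline (parameter values are illustrative, and `abstracts` stands in for the 10k arXiv abstracts):

```python
import os

import cohere
import hdbscan
import matplotlib.pyplot as plt
import numpy as np
import umap

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Embed v3 accepts a task-specific input_type; "clustering" fits this use
# case. In practice the 10k texts must be embedded in batches (the API
# limits how many texts a single call accepts).
embeddings = np.array(
    co.embed(
        texts=abstracts, model="embed-english-v3.0", input_type="clustering"
    ).embeddings
)

# Density-based clustering; the label -1 marks noise points.
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embeddings)

# UMAP is used only to project the clusters to 2-D for visualization.
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=2, cmap="tab20")
plt.title("HDBSCAN clusters of arXiv abstracts (UMAP projection)")
plt.show()
```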
07. Transformers Self-Attention

Tags: [bertviz] [transformers] [tokenizer] [self-attention]

Transformers have revolutionized the way we approach tasks in NLP. At their core lies self-attention, a mechanism that allows models to weigh the importance of each element in a sequence (token embeddings). This introductory notebook explores the intricacies of self-attention through bertviz visualizations at the model, head, and neuron levels.
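A minimal sketch of a bertviz head view inside a notebook (the model choice and input sentence are illustrative):

```python
from bertviz import head_view
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# output_attentions=True makes the model return per-layer attention weights.
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Interactive visualization of every attention head, layer by layer.
head_view(outputs.attentions, tokens)
```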