Jupyter notebooks for applying and experimenting with machine learning (ML) and Large Language Models (LLMs) provided by industry leaders such as Cohere, HuggingFace, LangChain, and OpenAI.
- 01. Binary Classification w/ SVM and Transformer-based Embeddings
- 02. Multiclass Classification w/ Random Forest and Transformer-based Embeddings
- 03. Multiclass Classification w/ Cohere-Classify
- 04. OpenAI Functions w/ Langchain and Pydantic
- 05. Named Entity Recognition to Enrich Text
- 06. Clustering and Topic Modeling of arXiv dataset (10k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain
- 07. Transformers Self-Attention
01. Binary Classification w/ SVM and Transformer-based Embeddings

Tags: [binary-classification] [embeddings] [svm] [cohere] [openai] [tfidfvectorizer]
This notebook illustrates how to perform binary text classification with just a few hundred samples. It trains a basic Support Vector Machine on a collection of labeled financial sentences (400 training samples) and compares its accuracy across three feature representations:
- transformer-based embeddings using Cohere.
- transformer-based embeddings using OpenAI.
- frequency-based embeddings using TfidfVectorizer.
```
SVM Binary-Text Classification Accuracy (550 samples)
------------------------------------------------------
w/ Cohere 'embed-english-v3.0':      94.93%
w/ OpenAI 'text-embedding-ada-002':  89.13%
w/ TfidfVectorizer:                  65.22%
```
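A minimal sketch of the Cohere-vs-TF-IDF comparison, assuming a Cohere API key in the environment and pre-split lists (`train_texts`, `train_labels`, `test_texts`, `test_labels` are illustrative names); the OpenAI embedding variant is analogous:

```python
import os

import cohere
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

co = cohere.Client(os.environ["COHERE_API_KEY"])

def embed(texts):
    # Embed v3 models require an input_type; "classification" suits this task.
    return co.embed(
        texts=texts, model="embed-english-v3.0", input_type="classification"
    ).embeddings

# SVM on transformer-based embeddings.
svm_emb = SVC().fit(embed(train_texts), train_labels)
print("Cohere embeddings:",
      accuracy_score(test_labels, svm_emb.predict(embed(test_texts))))

# SVM on frequency-based features, for comparison.
tfidf = TfidfVectorizer()
svm_tf = SVC().fit(tfidf.fit_transform(train_texts), train_labels)
print("TfidfVectorizer:",
      accuracy_score(test_labels, svm_tf.predict(tfidf.transform(test_texts))))
```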
02. Multiclass Classification w/ Random Forest and Transformer-based Embeddings

Tags: [multiclass-classification] [embeddings] [hyperparameter-tuning] [random-forest] [cohere] [countvectorizer]
This notebook illustrates how to train a random-forest model with hyperparameter tuning for multiclass classification. It assesses the performance of combining said random forest with:
- transformer-based embeddings using Cohere.
- a bag-of-words vectorizer using CountVectorizer.

It achieves 88.80% accuracy with approximately 200 training samples per class.
```
Accuracy: 88.80%

              precision    recall  f1-score   support

    Business       0.85      0.82      0.83        55
    Sci/Tech       0.89      0.85      0.87        65
      Sports       0.90      0.93      0.91        69
       World       0.91      0.95      0.93        61

    accuracy                           0.89       250
   macro avg       0.89      0.89      0.89       250
weighted avg       0.89      0.89      0.89       250
```
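A minimal sketch of the CountVectorizer variant with grid-search tuning (the grid values are illustrative, and `train_texts`/`train_labels`/`test_texts`/`test_labels` are assumed pre-split); for the embedding variant, the Cohere `embed` step from notebook 01 replaces the vectorizer:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Bag-of-words features.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Illustrative hyperparameter grid, tuned with 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20],
    "min_samples_split": [2, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1
)
search.fit(X_train, train_labels)

print(search.best_params_)
print(classification_report(test_labels, search.predict(X_test)))
```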
03. Multiclass Classification w/ Cohere-Classify

Tags: [multiclass-classification] [cohere]
This notebook illustrates how to use Cohere Classify for multiclass classification. It achieves 94.74% accuracy with approximately 200 training samples per class.
```
Accuracy: 94.74%

              precision    recall  f1-score   support

    Business       0.90      0.90      0.90        20
    Sci/Tech       0.96      0.92      0.94        24
      Sports       1.00      0.96      0.98        28
       World       0.92      1.00      0.96        23

    accuracy                           0.95        95
   macro avg       0.94      0.95      0.94        95
weighted avg       0.95      0.95      0.95        95
```
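A minimal sketch of a Cohere Classify call (the example texts are placeholders, and the `ClassifyExample` import path has moved between SDK versions):

```python
import os

import cohere
from cohere import ClassifyExample  # older SDKs: cohere.responses.classify.Example

co = cohere.Client(os.environ["COHERE_API_KEY"])

# The API needs at least two labeled examples per class; the notebook
# supplies roughly 200 per class.
examples = [
    ClassifyExample(text="Stocks rallied after the earnings report.", label="Business"),
    ClassifyExample(text="The central bank held interest rates steady.", label="Business"),
    ClassifyExample(text="The new chip doubles training throughput.", label="Sci/Tech"),
    ClassifyExample(text="Researchers unveiled a quantum error-correction scheme.", label="Sci/Tech"),
    ClassifyExample(text="The striker scored twice in the final.", label="Sports"),
    ClassifyExample(text="The marathon record fell by nearly a minute.", label="Sports"),
    ClassifyExample(text="Leaders met to negotiate the ceasefire.", label="World"),
    ClassifyExample(text="The summit ended without a joint statement.", label="World"),
]

response = co.classify(
    inputs=["Champions League draw announced today"], examples=examples
)
for c in response.classifications:
    print(c.input, "->", c.prediction, f"({c.confidence:.2f})")
```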
04. OpenAI Functions w/ Langchain and Pydantic

Tags: [openai] [langchain] [pydantic] [function-calling] [function-creation]
This notebook demonstrates how to combine LangChain and Pydantic as an abstraction layer to facilitate the process of creating OpenAI functions and handling JSON formatting.
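A minimal sketch of the pattern, assuming a 2023-era LangChain release where `create_structured_output_chain` is available (newer versions expose the same idea via `with_structured_output`); the `Person` schema is illustrative:

```python
from langchain.chains.openai_functions import create_structured_output_chain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person mentioned in the text."""
    name: str = Field(description="the person's full name")
    age: int = Field(description="the person's age in years")

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
prompt = ChatPromptTemplate.from_messages(
    [("system", "Extract the requested information."), ("human", "{input}")]
)

# LangChain converts the Pydantic model into an OpenAI function schema and
# parses the returned JSON arguments back into a validated Person instance.
chain = create_structured_output_chain(Person, llm, prompt)
result = chain.run("Ada Lovelace was 36 years old when she died.")
print(result)  # Person(name='Ada Lovelace', age=36)
```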
05. Named Entity Recognition to Enrich Text

Tags: [openai] [named-entity-recognition] [function-calling] [function-creation] [wikipedia]
Named Entity Recognition (NER) is a Natural Language Processing task that identifies and classifies named entities (NE) into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.
This notebook demonstrates how to carry out NER with OpenAI Chat Completion and function calling to enrich a block of text with links to a knowledge base such as Wikipedia.
This notebook is also available at openai/openai-cookbook/examples/Named_Entity_Recognition_to_enrich_text.ipynb
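A minimal sketch of the approach using the legacy `openai<1.0` ChatCompletion interface (the function schema and the Wikipedia URL construction are illustrative):

```python
import json

import openai

functions = [{
    "name": "enrich_entities",
    "description": "Record the named entities found in the text.",
    "parameters": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Named entities: persons, organizations, locations, ...",
            }
        },
        "required": ["entities"],
    },
}]

text = "Albert Einstein developed the theory of relativity in Berlin."
resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"Extract the named entities: {text}"}],
    functions=functions,
    function_call={"name": "enrich_entities"},  # force the function call
)
entities = json.loads(resp.choices[0].message.function_call.arguments)["entities"]

# Enrich the original text with markdown links to Wikipedia.
for entity in entities:
    url = "https://en.wikipedia.org/wiki/" + entity.replace(" ", "_")
    text = text.replace(entity, f"[{entity}]({url})")
print(text)
```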
06. Clustering and Topic Modeling of arXiv dataset (10k) w/ Cohere Embedv3 | Pydantic | OpenAI | LangChain
Tags: [clustering] [cohere] [embeddings] [HDBSCAN] [langchain] [pydantic] [topic-modeling] [openai]
We combine the advanced Cohere and GPT-4 Large Language Models with HDBSCAN, Pydantic, and LangChain for clustering and topic modeling. Our playground is a dataset of 10,000 arXiv research papers from Computational Linguistics (Natural Language Processing) published between 2019 and 2023, enriched with title and abstract embeddings generated with the newest Cohere Embed v3 specifically for the clustering task. To measure clustering and topic-modeling effectiveness, we visualize the outcomes after applying UMAP dimensionality reduction.
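A minimal sketch of the embed-cluster-visualize pipeline (parameter values are illustrative, and `abstracts` stands in for the 10k arXiv abstracts):

```python
import os

import cohere
import hdbscan
import matplotlib.pyplot as plt
import numpy as np
import umap

co = cohere.Client(os.environ["COHERE_API_KEY"])

# Embed v3 accepts a task-specific input_type; "clustering" fits this use
# case. In practice the 10k texts must be embedded in batches (the API
# limits how many texts a single call accepts).
embeddings = np.array(
    co.embed(
        texts=abstracts, model="embed-english-v3.0", input_type="clustering"
    ).embeddings
)

# Density-based clustering; the label -1 marks noise points.
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embeddings)

# UMAP is used only to project the clusters to 2-D for visualization.
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=2, cmap="tab20")
plt.title("HDBSCAN clusters of arXiv abstracts (UMAP projection)")
plt.show()
```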
07. Transformers Self-Attention

Tags: [bertviz] [transformers] [tokenizer] [self-attention]

Transformers have revolutionized the way we approach tasks in NLP. At their core lies self-attention, a mechanism that allows models to weigh the importance of each element in a sequence (token embeddings). This introductory notebook explores the intricacies of self-attention through bertviz visualizations at the model, head, and neuron levels.
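A minimal sketch of a bertviz head view inside a notebook (the model choice and input sentence are illustrative):

```python
from bertviz import head_view
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# output_attentions=True makes the model return per-layer attention weights.
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Interactive visualization of every attention head, layer by layer.
head_view(outputs.attentions, tokens)
```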