alasdairforsythe / tokenmonster

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript

tokenizer vocabulary vocabulary-builder tokenize tokenization tokenisation tokenizing text-tokenization vocabulary-generator

Updated Jul 2, 2024
Go

twardoch / split-markdown4gpt

A Python tool for splitting large Markdown files into smaller sections based on a specified token limit. This is particularly useful for processing large Markdown files with GPT models, as it allows the models to handle the data in manageable chunks.

Updated Oct 16, 2024
Python
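
The chunking idea described for split-markdown4gpt can be sketched roughly as follows. This is not the tool's actual implementation: `count_tokens` and `split_markdown` are hypothetical names, and whitespace-separated words stand in for model tokens, which a real GPT tokenizer would count differently.

```python
import re

HEADING = re.compile(r"#{1,6} ")  # ATX-style Markdown heading at line start


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def split_sections(text: str) -> list[str]:
    # Start a new section at every heading line; other lines join the
    # current section. Headings inside fenced code are not special-cased.
    sections: list[str] = []
    for line in text.splitlines(keepends=True):
        if HEADING.match(line) or not sections:
            sections.append(line)
        else:
            sections[-1] += line
    return sections


def split_markdown(text: str, limit: int) -> list[str]:
    # Greedily pack whole sections into chunks of at most `limit` tokens.
    # A single section larger than `limit` is kept whole in its own chunk.
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for section in split_sections(text):
        n = count_tokens(section)
        if current and current_tokens + n > limit:
            chunks.append("".join(current))
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Because sections are never cut, concatenating the chunks reproduces the input exactly; the trade-off is that an oversized section can exceed the limit.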

SayamAlt / Resume-Classification-using-fine-tuned-BERT

Successfully developed a resume classification model that classifies a person's resume into its corresponding job, with a reported accuracy of more than 99%.

nlp exploratory-data-analysis word-embeddings model-evaluation text-preprocessing bert-model text-tokenization fine-tuning-bert

Updated Jan 13, 2023
Jupyter Notebook

markiskorova / Machine-Learning-NLP-Predict-Author

Machine Learning & Natural Language Processing: Reads Classic Novels and Predicts the Author of a Phrase

python machine-learning natural-language-processing tensorflow keras text-vectorization text-tokenization

Updated Sep 23, 2024
Python

victoryosiobe / kingchop

Kingchop ⚔️ is a JavaScript library for tokenizing (chopping) English text. It applies an extensive set of tokenization rules, which you can adjust easily.

nodejs javascript natural-language-processing text-processing sentence-tokenizer text-tokenization word-tokenizer tokenizers paragraph-tokenizer

Updated Jul 19, 2024
JavaScript

Software-Research-Lab / dropsuit-tok

The tok function is a JavaScript and Node.js function that processes object instances and tokenizes text arrays. It returns the number of tokenized words, an array of the tokenized words, and the tokenized words concatenated into a single string. It is part of the open-source DropSuit NLP library, released under the Apache License 2.0.

text-analysis text-processing language-understanding text-tokenization

Updated May 1, 2023
JavaScript

katanabana / Nihotip

Nihotip is a web app that lets users explore Japanese text through interactive tokenization and detailed insights. Built with React and Python, it offers a dynamic way to analyze words and symbols with tooltips for deeper understanding.

react tooltips python nlp language japanese text-analysis webapp japanese-language mecab tokenization japanese-characters wanakana text-tokenization japanese-learning sudachipy jmdictfurigana

Updated Sep 26, 2024
JavaScript

SayamAlt / News-Category-Classification

Successfully developed a news category classification model using fine-tuned BERT that classifies news text into one of four categories: Politics, Business, Technology, or Entertainment.

nlp text-classification exploratory-data-analysis feature-engineering model-evaluation text-cleaning text-preprocessing bert-embeddings text-tokenization fine-tuning-bert

Updated Jan 17, 2023
Jupyter Notebook

cedrickchee / tokenizers

💥Fast State-of-the-Art Tokenizers optimized for Research and Production

nlp natural-language-processing transformers gpt language-model bert natural-language-understanding text-tokenization

Updated Jan 15, 2020
Rust

SayamAlt / Cyberbullying-Classification-using-fine-tuned-DistilBERT

Successfully fine-tuned a pretrained DistilBERT transformer model that classifies social media text into one of four cyberbullying labels (ethnicity/race, gender/sexual, religion, and not cyberbullying) with a reported accuracy of 99%.

natural-language-processing text-classification exploratory-data-analysis data-exploration multiclass-classification cyberbullying-detection text-preprocessing text-tokenization distilbert-model llm fine-tune-bert-tensorflow model-inference model-training-and-evaluation

Updated Jun 10, 2024
Jupyter Notebook

SayamAlt / Global-News-Headlines-Text-Summarization

Successfully built a text summarization model, using Seq2Seq modeling with Luong attention, that produces short, concise summaries of global news headlines.

natural-language-processing text-generation text-summarization attention-mechanism seq2seq-model luong-attention text-tokenization model-inference model-architecture-and-implementation data-exploration-and-preprocessing

Updated May 6, 2024
Jupyter Notebook

SayamAlt / Symptoms-Disease-Text-Classification

Successfully developed a fine-tuned BERT transformer model that classifies symptoms into their corresponding diseases with an accuracy of up to 89%.

natural-language-processing text-classification exploratory-data-analysis multiclass-classification text-preprocessing text-tokenization bert-fine-tuning hugging-face-transformers fine-tune-bert-tensorflow model-inference model-architecture-and-implementation model-training-and-evaluation data-exploration-and-preprocessing

Updated May 6, 2024
Jupyter Notebook

SayamAlt / Financial-News-Sentiment-Analysis

Successfully developed a fine-tuned DistilBERT transformer model that predicts the overall sentiment of a piece of financial news with an accuracy of nearly 81.5%.

natural-language-processing sentiment-analysis multiclass-classification text-preprocessing text-tokenization distilbert-model hugging-face-transformers fine-tune-bert-tensorflow model-inference model-architecture-and-implementation model-training-and-evaluation data-exploration-and-preprocessing

Updated May 6, 2024
Jupyter Notebook

SayamAlt / Customer-Support-Chatbot-using-NLTK

Successfully developed a chatbot model which can provide accurate and concise responses to a wide variety of customer queries regarding the services offered by a particular company as well as general topics.

nlp deep-neural-networks deep-learning nltk chatbots text-tokenization

Updated Mar 29, 2023
Python

SayamAlt / Fake-News-Classification-using-fine-tuned-BERT

Successfully developed a text classification model that predicts whether a given news text is fake by fine-tuning a pretrained BERT transformer model imported from Hugging Face.

deep-learning text-classification data-visualization data-analysis model-evaluation text-preprocessing bert-model bert-embeddings text-tokenization wordcloud-visualization fine-tuning-bert tokenizer-nlp model-training-and-evaluation

Updated Dec 10, 2024
Jupyter Notebook

LokeshKenche / ISP_ChatBot

ISPY is a chatbot designed for ISP customer service. It provides automated responses and assistance for queries such as connection issues, payments, and service requests. Built in Python with libraries such as nltk and newspaper3k, it simulates conversation and handles customer interactions.

machine-learning chatbot nltk cosine-similarity webscraping nlp-machine-learning textanalysis customer-services newspaper3k text-tokenization text-based-chatbot article-parsing

Updated Apr 14, 2024
Jupyter Notebook

SayamAlt / English-to-Spanish-Language-Translation-using-Seq2Seq-and-Attention

Successfully built a Seq2Seq model with attention that translates English to Spanish with a reported accuracy of almost 97%.

natural-language-processing language-translation exploratory-data-analysis text-generation neural-machine-translation attention-model attention-is-all-you-need text-preprocessing luong-attention text-tokenization seq2seq-modeling fine-tuning-bert bert-transformer hugging-face-transformers model-inference model-architecture-and-implementation model-training-and-evaluation

Updated May 6, 2024
Jupyter Notebook

Aalaa4444 / Text_Processing-and-Unique_Word_Extraction_fromHTML

Extract text content from an HTML page, process it, and extract unique words from the processed text. This notebook utilizes various text processing techniques including cleaning, normalization, tokenization, lemmatization or stemming, and stop words removal.

tokenizer text-extraction requests data-extraction beautifulsoup text-processing tokenization stemming lemmatization stopwords-removal text-cleaning text-normalization extract-html text-tokenization text-lemmatization

Updated Apr 5, 2024
Jupyter Notebook
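
The pipeline this notebook describes (extract text from HTML, clean and normalize it, tokenize, drop stop words, collect unique words) might look roughly like the sketch below. It is not the notebook's code: it swaps in the standard-library `html.parser` for the requests and BeautifulSoup stack named in the tags, skips stemming and lemmatization, and uses a tiny illustrative `STOP_WORDS` set; `TextExtractor` and `unique_words` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Illustrative stop-word list only; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def unique_words(html: str) -> list[str]:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Normalize: lowercase, then keep runs of ASCII letters as tokens.
    text = " ".join(parser.parts).lower()
    tokens = re.findall(r"[a-z]+", text)
    # Deduplicate, remove stop words, and return in sorted order.
    return sorted(set(tokens) - STOP_WORDS)
```

The ASCII-only token pattern is a simplification that drops accented and non-Latin words; a Unicode-aware pattern would be needed for other languages.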

text-tokenization

Here are 18 public repositories matching this topic...
