Issue with learning alphanumeric tokens #363

Open
yamika-g opened this issue Nov 18, 2024 · 1 comment

@yamika-g

Hi. I'm using Top2Vec for a project and this is how I have configured the model:


model = Top2Vec(documents=texts_unified,
                min_count=10,
                topic_merge_delta=0.1,
                ngram_vocab=False,
                ngram_vocab_args=None,
                embedding_model='universal-sentence-encoder-large',
                embedding_model_path=None,
                embedding_batch_size=32,
                split_documents=False,
                document_chunker='sequential',
                chunk_length=100,
                max_num_chunks=None,
                chunk_overlap_ratio=0.5,
                chunk_len_coverage_ratio=1.0,
                sentencizer=None,
                speed='learn',
                use_corpus_file=False,
                document_ids=None,
                keep_documents=True,
                workers=None,
                tokenizer=None,
                use_embedding_model_tokenizer=True,
                umap_args=None,
                gpu_umap=False,
                hdbscan_args={'min_cluster_size': 50,
                              'metric': 'euclidean',
                              'cluster_selection_method': 'eom'},
                gpu_hdbscan=False,
                index_topics=False,
                verbose=True)

My issue is that certain alphanumeric words in my corpus, such as 'm24', 'm4', or '1v1', are crucial to my domain; they are jargon terms. However, the model is not learning embeddings for these alphanumeric words, so they are missing from the model vocabulary and are never assigned to any topic. I can't figure out why this is happening.

The issue is not word frequency: those words occur more than 200 times in the corpus.
I've also tried changing the embedding_model, but those words are still not learned. Are they being filtered out internally? I noticed that no token containing digits is learned at all. How can I change this?
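
For reference, this is roughly how I checked the frequencies (a plain lowercase whitespace count, just for illustration; the exact numbers depend on my corpus):

from collections import Counter

# crude term-frequency check over the raw documents, before any Top2Vec preprocessing
counts = Counter(token for doc in texts_unified for token in doc.lower().split())
print(counts['m24'], counts['m4'], counts['1v1'])  # each is well above min_count=10 in my data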

@yamika-g
Author

I figured it out. It's because of Top2Vec's default tokenizer: return simple_preprocess(strip_tags(document), deacc=True). I'll just use my own tokenizer :)
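
For anyone who runs into the same thing: as far as I can tell, gensim's simple_preprocess keeps only purely alphabetic tokens (of length 2 to 15), so anything containing a digit is dropped before the vocabulary is built. Below is a quick way to see that, plus a minimal sketch of the kind of replacement tokenizer I mean (the name, regex, and length limits are my own choice, not anything built into Top2Vec):

import re

from gensim.parsing.preprocessing import strip_tags
from gensim.utils import simple_preprocess
from top2vec import Top2Vec

# The default preprocessing drops digit-containing tokens entirely.
print(simple_preprocess(strip_tags("the m24 and the m4 in a 1v1"), deacc=True))
# 'm24', 'm4', and '1v1' are missing from the output

# Sketch of a tokenizer that keeps alphanumeric jargon.
ALNUM = re.compile(r"[a-z0-9]+")

def alnum_tokenizer(document):
    # Strip HTML tags like the default tokenizer does, lowercase, then keep
    # runs of letters and digits between 2 and 15 characters long.
    return [tok for tok in ALNUM.findall(strip_tags(document).lower())
            if 2 <= len(tok) <= 15]

model = Top2Vec(documents=texts_unified,
                min_count=10,
                embedding_model='universal-sentence-encoder-large',
                tokenizer=alnum_tokenizer,
                verbose=True)

With a tokenizer like this, the digit-containing terms should survive vocabulary building and can then be assigned to topics.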
