Issue with learning alphanumeric tokens #363

Open
yamika-g opened this issue Nov 18, 2024 · 1 comment

@yamika-g

Hi. I'm using Top2Vec for a project and this is how I have configured the model:


model = Top2Vec(documents=texts_unified,
                min_count=10,
                topic_merge_delta=0.1,
                ngram_vocab=False,
                ngram_vocab_args=None,
                embedding_model='universal-sentence-encoder-large',
                embedding_model_path=None,
                embedding_batch_size=32,
                split_documents=False,
                document_chunker='sequential',
                chunk_length=100,
                max_num_chunks=None,
                chunk_overlap_ratio=0.5,
                chunk_len_coverage_ratio=1.0,
                sentencizer=None,
                speed='learn',
                use_corpus_file=False,
                document_ids=None,
                keep_documents=True,
                workers=None,
                tokenizer=None,
                use_embedding_model_tokenizer=True,
                umap_args=None,
                gpu_umap=False,
                hdbscan_args={'min_cluster_size': 50,
                              'metric': 'euclidean',
                              'cluster_selection_method': 'eom'},
                gpu_hdbscan=False,
                index_topics=False,
                verbose=True)

My issue is that certain alphanumeric words in my corpus, such as 'm24', 'm4', or '1v1', are crucial to my domain; they are jargon terms. However, the model is not learning embeddings for these alphanumeric words, so they are missing from the model vocabulary and are never assigned to any topic. I can't figure out why this is happening.

The issue is not word frequency: those words occur more than 200 times in the corpus.
I've also tried changing the embedding_model, but those words are still not learned. Are they being filtered out internally? I noticed that no token containing digits is learned at all. How can I change this?
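
For reference, this is roughly how I checked the frequencies (a plain lowercase whitespace count, just for illustration; the exact numbers depend on my corpus):

from collections import Counter

# crude term-frequency check over the raw documents, before any Top2Vec preprocessing
counts = Counter(token for doc in texts_unified for token in doc.lower().split())
print(counts['m24'], counts['m4'], counts['1v1'])  # each is well above min_count=10 in my data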

@yamika-g
Author

I figured it out. It's because of Top2Vec's default tokenizer: return simple_preprocess(strip_tags(document), deacc=True). I'll just use my own tokenizer :)
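
For anyone who runs into the same thing: as far as I can tell, gensim's simple_preprocess keeps only purely alphabetic tokens (of length 2 to 15), so anything containing a digit is dropped before the vocabulary is built. Below is a quick way to see that, plus a minimal sketch of the kind of replacement tokenizer I mean (the name, regex, and length limits are my own choice, not anything built into Top2Vec):

import re

from gensim.parsing.preprocessing import strip_tags
from gensim.utils import simple_preprocess
from top2vec import Top2Vec

# The default preprocessing drops digit-containing tokens entirely.
print(simple_preprocess(strip_tags("the m24 and the m4 in a 1v1"), deacc=True))
# 'm24', 'm4', and '1v1' are missing from the output

# Sketch of a tokenizer that keeps alphanumeric jargon.
ALNUM = re.compile(r"[a-z0-9]+")

def alnum_tokenizer(document):
    # Strip HTML tags like the default tokenizer does, lowercase, then keep
    # runs of letters and digits between 2 and 15 characters long.
    return [tok for tok in ALNUM.findall(strip_tags(document).lower())
            if 2 <= len(tok) <= 15]

model = Top2Vec(documents=texts_unified,
                min_count=10,
                embedding_model='universal-sentence-encoder-large',
                tokenizer=alnum_tokenizer,
                verbose=True)

With a tokenizer like this, the digit-containing terms should survive vocabulary building and can then be assigned to topics.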
