Below you will find embeddings of different sizes computed from the Spanish Unannotated Corpora.
Links to the embeddings:
- 10 dimensions
  - Vector format (.vec) (122 MB)
  - Binary format (.bin) (209 MB)
- 30 dimensions
  - Vector format (.vec) (348 MB)
  - Binary format (.bin) (579 MB)
- 100 dimensions
  - Vector format (.vec) (1.1 GB)
  - Binary format (.bin) (1.9 GB)
- 300 dimensions
  - Vector format (.vec) (3.4 GB)
  - Binary format (.bin) (5.6 GB)
- 300 dimensions (new)
  - Vector format (.vec) (3.8 GB)
  - Binary format (.bin) (5.9 GB)
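
The `.vec` files use fastText's plain-text layout: a header line with the vocabulary size and the dimension, then one line per word with the word followed by its vector components. A minimal sketch of a reader using only the Python standard library is shown here; the function name `read_vec` and the sample values are illustrative assumptions, not part of this repository.

```python
import io
from typing import Dict, List, Optional, TextIO, Tuple


def read_vec(stream: TextIO, limit: Optional[int] = None) -> Tuple[int, Dict[str, List[float]]]:
    """Parse word vectors from an open .vec text stream.

    Returns (dim, vectors). `limit` caps the number of words read,
    which helps when sampling from the multi-gigabyte files.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        if len(parts) < dim + 1:
            continue  # skip blank or malformed lines
        word = " ".join(parts[:-dim])
        vectors[word] = [float(x) for x in parts[-dim:]]
    return dim, vectors


# Tiny in-memory sample standing in for a real .vec file (made-up values).
sample = io.StringIO("2 3\nhola 0.1 0.2 0.3\nmundo 0.4 0.5 0.6\n")
dim, vectors = read_vec(sample)
print(dim, sorted(vectors))
```

For a downloaded file, the same function can be given a handle opened with `open(path, encoding="utf-8")`, where `path` is wherever the `.vec` file was saved.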
- Implementation: FastText with Skipgram
- Parameters:
  - min subword-ngram = 3
  - max subword-ngram = 6
  - minCount = 5
  - epochs = 20
  - dim = 10, 30, 100, 300, 300
  - all other parameters set as default
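
As a rough illustration of how the parameters listed here map onto the fastText command-line tool, the sketch below builds a `fasttext skipgram` invocation as an argument list. The exact command used to produce these embeddings is not given in this document, so the helper name `fasttext_skipgram_command` and the file names `corpus.txt` and `embeddings-<dim>` are hypothetical.

```python
from typing import List


def fasttext_skipgram_command(corpus_path: str, output_prefix: str, dim: int) -> List[str]:
    """Build a fastText CLI call using the parameters listed above.

    Options not listed are left out, so fastText uses its defaults.
    """
    return [
        "fasttext", "skipgram",
        "-input", corpus_path,      # preprocessed corpus (hypothetical path)
        "-output", output_prefix,   # fastText writes <prefix>.bin and <prefix>.vec
        "-minn", "3",               # min subword-ngram
        "-maxn", "6",               # max subword-ngram
        "-minCount", "5",
        "-epoch", "20",
        "-dim", str(dim),
    ]


# Print one command per embedding size; nothing is executed here.
for dim in (10, 30, 100, 300):
    print(" ".join(fasttext_skipgram_command("corpus.txt", f"embeddings-{dim}", dim)))
```

The list form can be passed to `subprocess.run` once the `fasttext` binary is on the `PATH`.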
- Corpus: Spanish Unannotated Corpora
- Corpus size: 2.6 billion words, or 3 billion words for the new 300-dimension embeddings
- Post-processing: explained in the Embeddings and Corpora repos; it includes tokenization, lowercasing, and removal of listings and URLs.
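
To give a feel for the kind of post-processing named here, the following is a simplified sketch covering only lowercasing, URL removal, and a basic tokenization. It is not the pipeline from the Embeddings and Corpora repos: the regular expressions and the function name `preprocess` are assumptions chosen for illustration, and removal of listings is not covered.

```python
import re
from typing import List

# Assumed patterns for illustration only; the real pipeline may differ.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # runs of word characters, or single punctuation marks


def preprocess(line: str) -> List[str]:
    """Drop URLs, lowercase the text, and split it into tokens."""
    without_urls = URL_RE.sub(" ", line)
    return TOKEN_RE.findall(without_urls.lower())


print(preprocess("Visita https://example.com para más Información, ¡gracias!"))
```

Joining the tokens with spaces yields one line of training text in the form fastText expects.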