
FastText embeddings from SUC

Below are FastText embeddings of several sizes, computed from the Spanish Unannotated Corpora (SUC).

Embeddings

Links to the embeddings:

XS (#dimensions=10, #vectors=1313423):

Vector format (.vec) (122 MB)
Binary format (.bin) (209 MB)

S (#dimensions=30, #vectors=1313423):

Vector format (.vec) (348 MB)
Binary format (.bin) (579 MB)

M (#dimensions=100, #vectors=1313423):

Vector format (.vec) (1.1 GB)
Binary format (.bin) (1.9 GB)

L (#dimensions=300, #vectors=1313423):

Vector format (.vec) (3.4 GB)
Binary format (.bin) (5.6 GB)

new L (#dimensions=300, #vectors=1451827):

Vector format (.vec) (3.8 GB)
Binary format (.bin) (5.9 GB)
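
As a rough illustration of how the .vec files can be read: FastText's text format starts with a header line holding the number of vectors and the dimension, followed by one line per word with the word and its space-separated components. The sketch below parses that layout with the standard library only. The function name `load_vec`, the `limit` argument, and the file name in the usage note are hypothetical and not part of this repository.

```python
import io


def load_vec(lines, limit=None):
    """Parse FastText .vec text: header "<count> <dim>", then "word v1 ... vdim"."""
    it = iter(lines)
    header = next(it).split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in it:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")  # lines may end with a trailing space
        if len(parts) < dim + 1:
            raise ValueError(f"expected {dim} components, got {len(parts) - 1}")
        word = " ".join(parts[:-dim])      # everything before the last dim fields
        vectors[word] = [float(x) for x in parts[-dim:]]
    return count, dim, vectors


# Tiny in-memory sample in the same layout as a real .vec file.
sample = "2 3\nde 0.1 0.2 0.3 \nla -0.5 0.0 1.5 \n"
count, dim, vectors = load_vec(io.StringIO(sample))
```

For a downloaded file, something like `with open("embeddings-l-model.vec", encoding="utf-8") as f: load_vec(f, limit=100000)` should work (the file name is a placeholder); `limit` keeps memory use manageable for the larger models. The .bin files are FastText's own binary model format and are normally opened with the FastText library itself, for example `fasttext.load_model` in the Python bindings.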

Algorithm

Implementation: FastText with Skipgram
Parameters:
- min subword-ngram = 3
- max subword-ngram = 6
- minCount = 5
- epochs = 20
- dim = 10, 30, 100, 300, 300 (for XS, S, M, L, and new L, respectively)
- all other parameters set as default
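
The parameters above can be sketched in code as follows. `char_ngrams` shows what the subword range 3 to 6 means: FastText wraps each word in `<` and `>` and takes every character n-gram in that length range. `skipgram_command` builds a command line for the `fasttext` CLI using its `-minn`, `-maxn`, `-minCount`, `-epoch`, and `-dim` options. Both function names and the corpus and output paths are hypothetical; the exact command used to train these embeddings is not given here, so treat this as an approximation.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of "<word>" for every length from minn to maxn."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def skipgram_command(corpus_path, output_prefix, dim):
    """Argument list for the fasttext CLI with the parameters listed above."""
    return [
        "fasttext", "skipgram",
        "-input", corpus_path,      # hypothetical path to the preprocessed corpus
        "-output", output_prefix,   # hypothetical prefix; fasttext writes .bin and .vec
        "-minn", "3",
        "-maxn", "6",
        "-minCount", "5",
        "-epoch", "20",
        "-dim", str(dim),
    ]


trigrams = char_ngrams("niño", 3, 3)
command = skipgram_command("suc.txt", "embeddings-l", 300)
```

The Python bindings expose the same settings as keyword arguments of `fasttext.train_unsupervised` (`model="skipgram"`, `minn`, `maxn`, `minCount`, `epoch`, `dim`). Note that FastText additionally hashes the n-grams into a fixed number of buckets and keeps the whole word as its own unit, which this sketch does not model.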

Corpus

Spanish Unannotated Corpora
Corpus size: 2.6 billion words for XS, S, M, and L; 3 billion words for the new L (300 dimensions)
Post-processing: described in the Embeddings and Corpora repos; it includes tokenization, lowercasing, and removal of listings and URLs.
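
To match queries against the vocabulary, input text generally needs to be normalized in a similar way before lookup. The sketch below is only an assumption about what such normalization might look like (URL removal, lowercasing, and a simple regex tokenizer); it is not the pipeline from the Embeddings and Corpora repos, and it does not attempt the removal of listings. The names `URL_RE`, `TOKEN_RE`, and `preprocess` are hypothetical.

```python
import re

# Assumed patterns, not taken from the original preprocessing scripts.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # runs of word characters, or single punctuation marks


def preprocess(line):
    """Drop URLs, lowercase, and split into word and punctuation tokens."""
    without_urls = URL_RE.sub(" ", line)
    return TOKEN_RE.findall(without_urls.lower())


tokens = preprocess("Visita https://example.com ¡Hola, Mundo!")
```

For exact reproduction of the vocabulary, the scripts in the referenced repos remain the authoritative source.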