A list of words from the SUBTLEX movie subtitles corpus, sorted by frequency.


subtlex-word-frequencies

List of 74,286 words sorted by frequency of use in spoken English.

The word counts are derived from SUBTLEXus, a corpus of American English subtitles of movies.

Install

npm:

npm install subtlex-word-frequencies

Use

var words = require('subtlex-word-frequencies')

console.log(words.length)

console.log(words.slice(0, 3))

console.log(words.filter(d => d.word.match(/chick/)).slice(0, 5))

Yields:

74286
[
  {word: 'you', count: 2134713},
  {word: 'I', count: 2038529},
  {word: 'the', count: 1501908}
]
[
  {word: 'chicken', count: 3148},
  {word: 'chick', count: 1334},
  {word: 'chicks', count: 742},
  {word: 'chickens', count: 520},
  {word: 'chickenshit', count: 85}
]
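Because the export is a plain array, finding one word's count means scanning it. A minimal sketch of indexing the entries for direct lookups follows. `buildIndex` is a hypothetical helper, not part of the package, and `sample` is a handful of entries copied from the output above so the snippet runs on its own.

```javascript
// Sample entries copied from the output above; in real use, pass the
// array returned by require('subtlex-word-frequencies') instead.
const sample = [
  {word: 'you', count: 2134713},
  {word: 'I', count: 2038529},
  {word: 'the', count: 1501908},
  {word: 'chicken', count: 3148}
]

// Hypothetical helper: map each word to its count for constant-time lookups.
function buildIndex(entries) {
  return new Map(entries.map(entry => [entry.word, entry.count]))
}

const index = buildIndex(sample)

console.log(index.get('chicken')) // 3148
console.log(index.has('You')) // false: keys keep the casing used in the list
```

Building the map once costs a single pass over the list, after which each lookup avoids a full scan.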

API

subtlexWordFrequencies

Array.<Entry> — List of all entries in SUBTLEXus. Each entry has the following properties:

  • word (string) — Unique word (example: git)
  • count (number) — Number of times the word appears in the corpus (example: 101)

word starts with a capital when the word more often starts with an uppercase letter than with a lowercase letter (example: I).
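Since some entries are capitalized, an exact-match lookup for `i` would miss `I`. One way to handle this, sketched here under assumptions, is to fold keys to lowercase when indexing. `buildCaseInsensitiveIndex` and `lookupCount` are hypothetical names, and summing counts when two entries fold to the same key is a design choice of this sketch, not documented package behavior.

```javascript
// Sample entries copied from the usage output; in real use, pass the
// array returned by require('subtlex-word-frequencies') instead.
const entries = [
  {word: 'you', count: 2134713},
  {word: 'I', count: 2038529},
  {word: 'chicken', count: 3148}
]

// Hypothetical helper: index by lowercased word. If two entries fold to
// the same key, their counts are added (an assumption of this sketch).
function buildCaseInsensitiveIndex(list) {
  const index = new Map()
  for (const {word, count} of list) {
    const key = word.toLowerCase()
    index.set(key, (index.get(key) || 0) + count)
  }
  return index
}

// Hypothetical helper: return the count for a word, or 0 if it is absent.
function lookupCount(index, word) {
  return index.get(word.toLowerCase()) || 0
}

const folded = buildCaseInsensitiveIndex(entries)

console.log(lookupCount(folded, 'i')) // 2038529
console.log(lookupCount(folded, 'CHICKEN')) // 3148
console.log(lookupCount(folded, 'missing')) // 0
```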

The entire original corpus consists of 51 million words.
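The corpus size makes it possible to turn a raw count into a relative frequency. The sketch below assumes the rounded figure of 51 million words stated above, so results are approximate. `CORPUS_SIZE` and `perMillion` are hypothetical names, not part of the package.

```javascript
// Rounded corpus size from the README; the exact token count may differ.
const CORPUS_SIZE = 51e6

// Hypothetical helper: occurrences per million words of the corpus.
function perMillion(count, corpusSize = CORPUS_SIZE) {
  return (count / corpusSize) * 1e6
}

// 'chicken' has a count of 3148 in the usage output.
console.log(perMillion(3148).toFixed(2)) // 61.73
```

Per-million values are easier to compare across corpora of different sizes than raw counts.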

License

ISC © Zeke Sikelianos