1) Need to parse the text. Turn each word into an indexed token. Each word
is either masked out for being a very common word or is grouped into
a lemma.
-Have the code go word by word and search the document for all
occurrences of each word.
-Remove all punctuation. For apostrophes, search for 's and remove
it. Since contractions all fall under very common words, they'll all
be removed anyway. There are probably existing stopword lists I can
pull from other code.
-For words that aren't going to be masked, query a Python dictionary
for other forms of the word as well as synonyms. Each of these groups
of words becomes an indexed lemma. (Rough sketch below.)
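
A rough sketch of step 1, assuming NLTK's stopword list stands in for
the common-word mask and WordNet stands in for the "Python dictionary"
of word forms and synonyms; both are assumptions, not decisions:

import re
from collections import defaultdict

from nltk.corpus import stopwords, wordnet as wn  # assumes the nltk corpora are downloaded

STOP = set(stopwords.words("english"))

def tokenize(text):
    # lowercase, strip possessive 's, then keep only alphabetic runs
    text = text.lower()
    text = re.sub(r"'s\b", "", text)
    return re.findall(r"[a-z]+", text)

def index_words(text):
    # map each kept word to the token positions where it occurs;
    # very common words (stopwords) are masked out
    positions = defaultdict(list)
    for i, word in enumerate(tokenize(text)):
        if word not in STOP:
            positions[word].append(i)
    return positions

def lemma_group(word):
    # group a word with its other forms and synonyms via WordNet
    group = {word}
    for synset in wn.synsets(word):
        group.update(name.replace("_", " ") for name in synset.lemma_names())
    return group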
2) Take the word list and find a distance metric between lemmas.
-Can do nearest neighbor for every word, or compute the distance for
every pair of occurrences. The latter can get pretty expensive, so may
only compute a finite number of these. (Sketch below.)
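
One way to read the nearest-neighbor option, as a sketch; the
token-position distance and the averaging are guesses, not settled
choices. It works on the occurrence lists from the step-1 sketch:

from bisect import bisect_left

def nearest_neighbor_distance(pos_a, pos_b):
    # mean distance (in tokens) from each occurrence of lemma A to the
    # nearest occurrence of lemma B; pos_b must be sorted and non-empty
    total = 0
    for p in pos_a:
        i = bisect_left(pos_b, p)
        nearest = min(
            abs(pos_b[j] - p) for j in (i - 1, i) if 0 <= j < len(pos_b)
        )
        total += nearest
    return total / len(pos_a)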
3) Based on the distances, partition the phase space and find the
strongest clusters.
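
A sketch of one possible clustering step: hierarchical (agglomerative)
clustering on a precomputed lemma-lemma distance matrix. SciPy and the
"average" linkage are placeholders here, not commitments:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_lemmas(dist_matrix, threshold):
    # dist_matrix: symmetric n x n array of lemma-lemma distances, zero diagonal
    condensed = squareform(np.asarray(dist_matrix), checks=False)
    tree = linkage(condensed, method="average")
    # lemmas closer than `threshold` end up in the same cluster
    return fcluster(tree, t=threshold, criterion="distance")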
4) Make cuts on the tightness of a cluster by comparing it to randomly
distributed words.
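
A sketch of the random-word comparison; "tightness" is taken here to be
the mean gap between consecutive occurrences, which is an assumption
about what the cut should measure:

import random

def tightness(positions):
    # mean gap between consecutive occurrence positions; smaller = tighter
    positions = sorted(positions)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gaps) / len(gaps)

def random_baseline(n_occurrences, text_length, trials=1000):
    # tightness distribution for the same number of uniformly random
    # positions, to compare a real cluster against
    samples = []
    for _ in range(trials):
        pts = random.sample(range(text_length), n_occurrences)
        samples.append(tightness(pts))
    return samples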
Questions
1) Do we consider particles in the distance metric? What constitutes
nearness of a word or theme to another word or theme? Perhaps test on
a short story.
2) Do we consider distances across chapters or just within a chapter?
Staying within chapters would save a lot of computation time. May
depend on the author. Even Doctor Zhivago, which has very short
chapters, seems to keep things contained within a chapter, especially
since it switches so often between different character perspectives.
3) How does the distance metric contribute to the clustering of words?
The math for this is probably in some of the machine learning
clustering algorithms
4) Where do we make the cut of what's associated with a word or not?
5) Check for multi-word lemmas
-Each word has an affinity to its own cluster of words.
-Want to try distances between individual words before working on clusters.
-How to find problem words with too many meanings? If the lemma set is
too large, it's not a good set to consider. (Quick filter sketched below.)
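
A tiny sketch of that filter, reusing lemma_group from the step-1
sketch; the size cutoff is an arbitrary placeholder:

def is_problem_word(word, max_group_size=20):
    # flag words whose lemma group (forms + synonyms) is too large to be useful
    return len(lemma_group(word)) > max_group_size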
We only want to consider the most common words. Instead of forming
lemmas out of every single word, we should take the top maybe 300 words
and form tight clusters from those.
-Let's first try finding the occurrences of the top 300 words and
looking at how they are clustered. (Sketch after this list.)
-Then take a fairly representative one and see how many clusters we
can group together as a cohesive theme.
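
A sketch of that first experiment, reusing tokenize and STOP from the
step-1 sketch; 300 is just the cutoff floated above, not a fixed choice:

from collections import Counter

def top_word_occurrences(text, n_top=300):
    # token positions of the top-n non-masked words, ready for a first
    # look at how they cluster
    words = tokenize(text)
    kept = [w for w in words if w not in STOP]
    top = {w for w, _ in Counter(kept).most_common(n_top)}
    occurrences = {w: [] for w in top}
    for i, w in enumerate(words):
        if w in top:
            occurrences[w].append(i)
    return occurrences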