Aft analysis #36

Merged Dec 19, 2023: 43 commits into master from aft-analysis, all by j-hagedorn.

Commits:
2647ef0 Create aft_notebook.Rmd (Mar 1, 2022)
59943f8 Merge branch 'master' into aft-analysis (Mar 1, 2022)
fb349b5 Merge branch 'master' into aft-analysis (Mar 1, 2022)
1df7197 Configure py env and document (Mar 19, 2022)
92d3fd6 Add py env (Mar 19, 2022)
008bc45 Update summaries (Mar 19, 2022)
b0f3bf6 Update docs (Mar 21, 2022)
7177676 Create aft_scratchpad.py (Mar 21, 2022)
9826801 Update aft_scratchpad.py (Mar 21, 2022)
06771eb Update py_config.Rmd (Mar 21, 2022)
c02217c Update .Rprofile (Mar 26, 2022)
4a7e253 Update py_config.Rmd (Mar 26, 2022)
e22d698 Create aft_topics.py (Mar 26, 2022)
aac0969 Update aft_scratchpad.py (Mar 26, 2022)
8eef695 Update aft_topics.py (Mar 26, 2022)
f9239e6 Update aft_topics.py (Mar 29, 2022)
bd929d7 Update aft_notebook.Rmd (Mar 30, 2022)
550f205 Add topics viz (Mar 30, 2022)
7dd5ac3 Add NER testing (Mar 30, 2022)
aa31f9b Add ATU df (Apr 1, 2022)
e6f516d Update aft_ner.ipynb (Apr 1, 2022)
4379790 Update aft_topics_viz.ipynb (Apr 1, 2022)
fa4d7fb Update py_config.Rmd (Apr 1, 2022)
8a16906 Update aft_notebook.Rmd (Apr 1, 2022)
11f1242 Create aft_notebook.html (Apr 1, 2022)
add7fcb Update aft_scratchpad.py (Apr 2, 2022)
5dfa7d6 Update aft_notebook.Rmd (Apr 2, 2022)
e44f705 Update aft_notebook.html (Apr 2, 2022)
4ef3eec Update aft_notebook.Rmd (Apr 6, 2022)
3e744da Merge branch 'master' into aft-analysis (May 28, 2022)
9849722 Add notes from Sandor (May 28, 2022)
7ab53b9 Update aft_notebook.Rmd (May 28, 2022)
a5d848f Update aft_notebook.Rmd (May 28, 2022)
84d72e4 Update aft_notebook.html (May 28, 2022)
1184d4e Update aft_notebook.Rmd (Jun 2, 2022)
5be0ed8 Update trilogy.Rproj (Aug 19, 2022)
36c6c0b Create atu_clust.R (Aug 24, 2022)
b349be2 Update aft_notebook.Rmd (Aug 24, 2022)
10b41dd Update py_config.Rmd (Aug 26, 2022)
3967bcd Update atu_clust.R (Aug 26, 2022)
0319312 Update aft_notebook.Rmd (May 14, 2023)
9e4ea95 Merge branch 'master' into aft-analysis (May 14, 2023)
678b89c Move docs (Dec 19, 2023)
Changes from all commits:
1 change: 1 addition & 0 deletions .Rprofile
@@ -0,0 +1 @@
Sys.setenv(RETICULATE_PYTHON = "my_env/Scripts/python.exe")
2,250 changes: 2,250 additions & 0 deletions data/atu.csv

Large diffs are not rendered by default.

@@ -0,0 +1,6 @@
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
@@ -0,0 +1,6 @@
{
"cells": [],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 5
}
301 changes: 301 additions & 0 deletions docs/analyses/aft_analysis/aft_ner.ipynb

Large diffs are not rendered by default.

238 changes: 238 additions & 0 deletions docs/analyses/aft_analysis/aft_notebook.Rmd
@@ -0,0 +1,238 @@
---
title: "Notes for AFT Article"
output: html_document
date: '`r Sys.Date()`'
---

```{r setup, include=FALSE}
library(tidyverse)
knitr::opts_chunk$set(echo = F, warning = F, message = F)
aft <- read_csv("../../data/aft.csv")
tmi <- read_csv("../../data/tmi.csv")
atu <- read_csv("../../data/atu.csv")
```

# Cleaning

Additional cleaning of the datasets is needed:

- Clean up the tale-name regex in `atu`
- Finish the ATU sequences


# Exploratory analysis: What's in the AFT corpus?

## Tale Types

- Proportion of ATU represented by `aft`

```{r unmatched_in_atu, eval=FALSE, include=FALSE}

# Items in AFT whose `atu_id` has no match in the ATU
tst <-
  aft %>%
  anti_join(atu %>% select(chapter:tale_name), by = "atu_id") %>%
  distinct()

```

```{r merge_aft_atu}

# Merge AFT with ATU and filter unmatched items from AFT

aft_in_atu <-
aft %>%
select(atu_id) %>%
mutate(in_aft = T) %>%
full_join(atu %>% select(chapter:tale_name), by = "atu_id") %>%
filter(!is.na(chapter)) %>% # remove items in AFT but not ATU
mutate(in_aft = ifelse(!is.na(in_aft),T,F)) %>%
group_by(atu_id,chapter,division,sub_division,tale_name) %>%
summarize(n_tales = sum(in_aft))

```

### By ATU Chapter/Division

Summary stats by ATU chapter:

```{r}
library(formattable)

aft_in_atu %>%
mutate(with_tale = n_tales > 0) %>%
group_by(chapter) %>%
summarize(
n_types = n_distinct(atu_id),
n_types_with_tale = sum(with_tale),
n_tales = sum(n_tales)
) %>%
mutate(
pct_with_tales = round(n_types_with_tale / n_types * 100, digits = 1),
tales_per = round(n_tales / n_types_with_tale, digits = 1)
) %>%
arrange(desc(tales_per)) %>%
select(-n_types_with_tale) %>%
formattable(
list(
n_tales = color_bar("#1BB6AFFF"),
# n_types_with_tale = color_bar("#EC921DFF"),
n_types = color_bar("#1BB6AFFF"),
pct_with_tales = color_tile("transparent", "#EF562AFF"),
tales_per = color_tile("transparent", "#EF562AFF")
)
)


```

The treemap below shows the nested sets of the ATU into which AFT texts fall, by `chapter`, `division`, and `sub_division`.

```{r}

aft_tree <-
aft_in_atu %>%
group_by(chapter,division) %>%
summarize(
n_types = n_distinct(atu_id),
n_tales = sum(n_tales)
) %>%
filter(!is.na(division)) %>%
rename(parent = chapter, labels = division) %>%
bind_rows(
aft_in_atu %>%
filter(!is.na(division)) %>%
group_by(chapter) %>%
summarize(
n_types = n_distinct(atu_id),
n_tales = sum(n_tales)
) %>%
mutate(parent = "root") %>%
rename(labels = chapter)
) %>%
bind_rows(
aft_in_atu %>%
filter(!is.na(division)) %>%
group_by(division, sub_division) %>%
summarize(
n_types = n_distinct(atu_id),
n_tales = sum(n_tales)
) %>%
rename(parent = division, labels = sub_division)
) # %>%
# bind_rows(
# aft_in_atu %>%
# filter(!is.na(division)) %>%
# group_by(sub_division, tale_name) %>%
# summarize(
# n_types = n_distinct(atu_id),
# n_tales = sum(n_tales)
# ) %>%
# rename(parent = sub_division, labels = tale_name)
# )

plotly::plot_ly(
aft_tree,
type = 'treemap',
labels = ~labels,
parents = ~parent,
values = ~n_tales
)

```

## Textual Content

### Entities

To do:

- Find a well-performing model for NER (cf. https://huggingface.co/flair/ner-english-ontonotes-fast, https://huggingface.co/stanfordnlp/stanza-en); a minimal sketch using the first of these appears below
- Find a way to label the name of a main character (e.g. miller, prince) as more than simply a noun
- Pull out the subject and predicate from each sentence; explore verbs related to subjects
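
A minimal sketch using the first model linked above, assuming the flair package is installed; the example sentence is invented:

```{python, eval=FALSE}
# Sketch with flair's OntoNotes NER tagger (assumes: pip install flair)
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english-ontonotes-fast")
sentence = Sentence("The miller's daughter was brought before the king of Denmark.")
tagger.predict(sentence)
for span in sentence.get_spans("ner"):
    print(span.text, span.tag, span.score)
```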

### Common phrases

Candidate approaches (a minimal sketch follows):

- TextRank
- collocation/word frequency
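
A minimal TextRank sketch, mirroring the approach in `aft_scratchpad.py`; it assumes spaCy, pytextrank, and the `en_core_web_sm` model are installed:

```{python, eval=FALSE}
import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

doc = nlp(r.aft.text[0])  # first tale text, via reticulate's `r` object
for phrase in doc._.phrases[:10]:  # top-ranked phrases
    print(round(phrase.rank, 3), phrase.count, phrase.text)
```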

### Topic modeling

- Define cleaning tasks and stop words to improve topic-model performance; right now the topics are too close together, with a few main clusters that are difficult to distinguish (see the sketch below)
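
One way to inject custom stop words, sketched here with BERTopic as an assumption (the actual topic-model code presumably lives in `aft_topics.py`); the added stop words are placeholders:

```{python, eval=FALSE}
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Placeholder stop words; replace with corpus-specific terms
stops = list(ENGLISH_STOP_WORDS) + ["king", "day", "said"]
topic_model = BERTopic(vectorizer_model=CountVectorizer(stop_words=stops))
topics, probs = topic_model.fit_transform(list(r.aft.text))
topic_model.get_topic_info()
```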

# Motif identification?

Notes/questions from Sándor:

- ATU markup segments the motif abstracts
- TMI defines each motif in 1-2 sentences
- Relate the words in both via co-occurrence (see the sketch below)
- Relate the resulting co-occurrence matrix to AFT types
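
A sketch of the word co-occurrence idea; the column names `motif_text` and `type_text` are hypothetical stand-ins for the TMI motif definitions and ATU type descriptions:

```{python, eval=FALSE}
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = CountVectorizer(stop_words="english")
vec.fit(list(r.tmi.motif_text) + list(r.atu.type_text))  # shared vocabulary

X_tmi = vec.transform(r.tmi.motif_text)  # motifs x terms
X_atu = vec.transform(r.atu.type_text)   # types x terms
sim = cosine_similarity(X_tmi, X_atu)    # word overlap between motifs and types
```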

Each of the three resources provides a fragment of the problem. Is this enough for a solution?

- *TMI*: a list of brief motif names with IDs and defined relationships in up to 6 hierarchical levels.
- *ATU*: a list of tale types with both (a) summary descriptions and (b) motif sequences built from TMI items. Includes `chapter` and `division` groups.
- *AFT*: a selection of representative tale texts annotated with ATU tale types.

Are the genres (`chapter` and `division`) of tale types structurally distinct, in terms of the motif sequences from which they are made?

Can we assume that a tale text annotated as a specific `tale_type` contains one of the variants of motif sequences which are defined in the ATU for that tale type?
If so, how might we discover the occurrence of motifs within unstructured tale texts?
How might we develop a clean, well-annotated set of texts with motif-based markup, given the large number of motifs?
Can real motifs from tale texts be extracted by means of theoretical strings of theoretical motifs?
Is there a way to validate the TMI by automatic means, e.g. in an ML experiment, as one option out of many?


Match between a label (TMI) and an ill-bounded, poorly demarcated text fragment over an AFT set: can we find a transformation which converts the set of segments into the label, by means of abstraction/abstracting? Text summarisation in Python with DL is available. The reverse problem: how to arrive at the text set from the label as a string. This depends on set size and topic composition, possibly a set of particular mixes.
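
A sketch of the segments-to-label direction with an off-the-shelf abstractive summariser; the model choice is illustrative only:

```{python, eval=FALSE}
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
segments = " ".join(r.aft.text[:3])[:3000]  # a small, truncated AFT segment set
print(summarizer(segments, max_length=60, min_length=10)[0]["summary_text"])
```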

Given a label and a set of text segments, the task is to arrive at that label by DL. Which architecture/method yields the best heuristics? Approximate the transformation by back-propagation (?). Consult JEK.

Add MFTL. LRRH. Custom-built for experimentation, for researchers with interest in the intersection of data science and folk tale studies. For work in progress.

For every motif in the string, compare the correlation between the TMI label and the ATU segment content against that between the ATU segment content and the manually marked-up AFT segment set.

Convert the type sample to a robust conceptual equivalent.

- Do you think you could build a 3-d matrix (tensor) from the trilogy, or, respectively, from motifs and types constrained by the AFT? E.g. axis x would be the TMI-based motifs aka concept strings; axis y would be ATU types aka motif strings; and axis z would be the AFT, where some types are exemplified by text sets.

Then we could expose this tensor to all kinds of analysis, including DL by CNN (Johan's favourite), or co-clustering (my bet).
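
A sketch of the proposed tensor's shape; the dimensions and the fill rule are assumptions, not settled design:

```{python, eval=FALSE}
import numpy as np

n_motifs = len(r.tmi)  # x: TMI-based concept strings
n_types = 182          # y: ATU types exemplified in the AFT
n_texts = len(r.aft)   # z: AFT tale texts
T = np.zeros((n_motifs, n_types, n_texts), dtype=np.float32)
# T[i, j, k] might hold evidence (a count, or an embedding similarity)
# that concept string i, within type j, occurs in text k.
```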

As food for thought, consider this as a working hypothesis: "a motif is a multiple co-occurrence of concept strings anchored in the trilogy". Whatever the outcome, negative or positive, the hypothesis can be tested, and we could learn if this definition can be falsified.

Also look at the visuals from co-clustering results; 'multiple cooccurrence' as a Google Scholar (GS) query returns just 150 hits, which sounds quite promising for explaining the idea by references from multiple domains, i.e. methodological cross-pollination.

By concept strings in the TMI I would expect some normalization of word forms to concepts, just like e.g. Propp's characters, actions/functions, situations, etc. There we could perhaps look into ontologies, if they exist; Thierry Declerck's work comes to mind.

The idea was this: multiple, because in a set representing a type, the reason why the type exists is that the 'motif', or its precursors, exists in several non-identical versions. Hence co-occurrence (which could even be upgraded to attention, to stress contextuality contributing to the meaning of X). Concept strings refers to two facts: (a) multiple lexemes / word forms / synonyms / functional equivalents will occur, all of them manifestations of concepts key to the identity of a particular motif; and (b) such verbal concept manifestations will occur in a particular sequence. This should hold for the AFT because it is type-based; for the ATU because it is motif-based; and for the TMI because it is concept-based (or so I hope). E.g. the robustness of TMI motifs could perhaps be increased by merging and adding the vocabularies of subtypes to that of the key entry. After such a parametrized normalization process, documented for posterity while being built, a 3-d tensor should arise.

So to derive them, we should extract pre-grouped word forms to cover concepts, preserving the order in which they appear. Say strike = {kill, hurt, wound, emasculate, castrate, ...} could yield the Verb part of an SVO form; dispatcher = {king, ruler, monarch, father, fairy queen, ...} could be the Subject, and so on. Any ideas are welcome on how one might speed up this part of the building process; a toy sketch follows.
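
A toy sketch of that normalization; the seed table is invented and would need to be grown, e.g. from WordNet synsets or embedding neighbours:

```{python, eval=FALSE}
# Invented seed table mapping word forms to concepts
concept_map = {
    "strike": ["kill", "hurt", "wound", "emasculate", "castrate"],
    "dispatcher": ["king", "ruler", "monarch", "father", "fairy queen"],
}
lemma_to_concept = {w: c for c, words in concept_map.items() for w in words}

def normalize(lemmas):
    """Map word forms to concepts, preserving order of appearance."""
    return [lemma_to_concept.get(l, l) for l in lemmas]

normalize(["king", "wound", "dragon"])  # ['dispatcher', 'strike', 'dragon']
```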

Another way to put it could be this: "The AFT is type-based. The ATU is motif (motif string) based. The TMI is concept string based. So the bottleneck is to extract from the AFT words that constitute concepts in specific sequences to manifest motifs, so that such motif sequences then constitute types." The expectation is that, to comply, word sequences from the AFT should be the same as in the TMI; such word sequences would be set members of a concept string.

Suppose we can convert the motifs inherent in the 182 AFT types to SBERT embeddings. Thereby we get a distribution of located sentence content based on motif definition.

Next we take the same 182 type descriptions aka tale abstracts, segmented by motif numbers into respective chunks. Their SBERT embeddings could then yield a baseline similarity (call it sim 1) between the TMI and ATU embeddings.

In step 3 one would have to screen the texts of type subsets for the (probably variable) size of the optimal chunk whose SBERT value is most similar to sim 1. I don't know how to carry out this screening, but maybe you will get an idea; a sketch of steps 1-3 follows below.

The same logic might work without the TMI step as well, although we would have a more nuanced overview with it included.
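
A sketch of steps 1-3 with sentence-transformers; the model name, the window sizes, and the upstream inputs (`motif_definitions`, `abstract_chunks`, `tale_text`) are all assumptions:

```{python, eval=FALSE}
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 1: embed TMI motif definitions; step 2: embed ATU abstract chunks
tmi_emb = model.encode(motif_definitions)
atu_emb = model.encode(abstract_chunks)
sim1 = util.cos_sim(tmi_emb, atu_emb)  # the TMI-ATU baseline ("sim 1")

# Step 3: slide a window over one tale text and keep the chunk whose
# embedding is most similar to a given ATU chunk embedding
words = tale_text.split()
windows = [" ".join(words[i:i + 50]) for i in range(0, len(words), 25)]
best = util.cos_sim(model.encode(windows), atu_emb[0:1]).argmax()
```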

### Miscellaneous notes

- Nigel Osbourne; X-System
- The unseen species problem
- BERTopic, which we have
- Can we use GPT to come up with dummy variants of the stories, thus increasing the corpus size? How could we create such dummy variants?

# Extract triples from TMI
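
A possible starting point, using spaCy's dependency parse to pull (subject, verb, object) triples from motif definitions; this is a sketch of the approach, not a settled pipeline:

```{python, eval=FALSE}
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_triples(text):
    """Return (subject, verb, object) triples from one motif definition."""
    triples = []
    for sent in nlp(text).sents:
        for tok in sent:
            if tok.pos_ != "VERB":
                continue
            subj = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            obj = [c for c in tok.children if c.dep_ in ("dobj", "attr", "dative")]
            if subj and obj:
                triples.append((subj[0].text, tok.lemma_, obj[0].text))
    return triples

extract_triples("The king banishes his youngest daughter.")
# expected: [('king', 'banish', 'daughter')]
```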
2,591 changes: 2,591 additions & 0 deletions docs/analyses/aft_analysis/aft_notebook.html

Large diffs are not rendered by default.

89 changes: 89 additions & 0 deletions docs/analyses/aft_analysis/aft_scratchpad.py
@@ -0,0 +1,89 @@
# pip install -U spacy pytextrank
# python -m spacy download en_core_web_sm
import spacy
import pytextrank
import pandas as pd

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('textrank')

# Get df from the R environment (via reticulate's `r` object)
df = r["aft"]

doc = nlp(df.text[0])           # parse the first tale
docs = list(nlp.pipe(df.text))  # parse the full corpus

list(doc._.phrases)  # ranked TextRank phrases for the first tale

# One tuple per sentence: (span, span label, L2 norm of the span vector)
[(i, i.label_, i.vector_norm) for i in doc.sents]

# Concatenate multiple texts

# s = ""
# for item in df.text[0:22]:
# s += item
#
# doc2 = nlp(s)

# List of token attributes: https://spacy.io/api/token#attributes

def extract_tokens(doc: spacy.tokens.doc.Doc):
    """Extract tokens and metadata from individual spaCy doc."""
    return [
        (i.text, i.i, i.lemma_, i.ent_type_, i.ent_iob_, i.tag_,
         i.dep_, i.pos_, i.is_stop, i.is_alpha,
         i.is_digit, i.is_punct, i.is_sent_end) for i in doc
    ]

# x = extract_tokens(doc)

def tidy_tokens(docs):
    """Extract tokens and metadata from list of spaCy docs."""

    # Any token attributes added above need to be named here
    cols = [
        "doc_id", "token", "token_order", "lemma",
        "ent_type", "ent_iob", "tag", "dep", "pos", "is_stop",
        "is_alpha", "is_digit", "is_punct", "is_sent_end"
    ]

    meta_df = []
    for ix, doc in enumerate(docs):
        meta = extract_tokens(doc)
        meta = pd.DataFrame(meta)
        meta.columns = cols[1:]
        meta = meta.assign(doc_id = ix).loc[:, cols]
        meta_df.append(meta)

    return pd.concat(meta_df)

# x = tidy_tokens(docs)

def extract_phrases(doc):
    """Extract pytextrank phrases from individual spaCy doc."""
    return [(p.rank, p.count, p.text) for p in doc._.phrases]


def tidy_phrases(docs):
    """Extract pytextrank phrases from list of spaCy docs."""

    # Any phrase attributes added above need to be named here
    cols = [
        "doc_id", "rank", "count", "phrase"
    ]

    meta_df = []
    for ix, doc in enumerate(docs):
        meta = extract_phrases(doc)
        meta = pd.DataFrame(meta)
        meta.columns = cols[1:]
        meta = meta.assign(doc_id = ix).loc[:, cols]
        meta_df.append(meta)

    return pd.concat(meta_df)


# Extractive summary of the first tale via TextRank
tr = doc._.textrank

for sent in tr.summary(limit_phrases=15, limit_sentences=5):
    print(sent)
