Subsetting a speech bundle results in a subcorpus bundle with unexpected subcorpora as the initial separation into speeches is not kept.
Hence the question: What is the most efficient way to create a speech bundle without interjections?
Scenario I: Splitting into speeches, then subsetting by paragraph type
Using `GERMAPARL2` to create a speech bundle seems to work fine. The output is a subcorpus bundle with about 450 thousand subcorpora.

Assumption: I want to omit all interjections from these speeches. I think the logical step would be a subset.
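A minimal sketch of this scenario, given for illustration only. It assumes that the GERMAPARL2 corpus is installed, that speaker names and dates are stored in the structural attributes `speaker_name` and `protocol_date`, and that the paragraph type is stored in `p_type`. These attribute names are assumptions and may differ depending on the corpus version.

```r
library(polmineR)

# Attribute names are assumptions; s_attributes("GERMAPARL2") lists the
# structural attributes that are actually available.
speeches <- as.speeches(
  corpus("GERMAPARL2"),
  s_attribute_name = "speaker_name",
  s_attribute_date = "protocol_date"
)
length(speeches)  # number of subcorpora in the speech bundle

# Scenario I: subset the existing bundle by paragraph type.
speeches_min <- subset(speeches, p_type == "speech")
length(speeches_min)  # number of subcorpora after subsetting
```

Comparing the two `length()` values is one way to check whether the separation into speeches survives the subset.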
Expected output: A subcorpus bundle with the same subcorpora (assuming that there are no speeches which only contain interjections) but without paragraphs which are not of type "speech".
Observed output: A subcorpus bundle with about 4400 subcorpora.
It seems like here there is one subcorpus for each unique speaker, not for each speech.
Scenario II: Subsetting by paragraph type, then splitting into speeches
In contrast, this seems to work.
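The reversed order could look roughly like the following sketch. It assumes the GERMAPARL2 corpus is installed and that the structural attributes are named `p_type`, `speaker_name` and `protocol_date`; these names are assumptions and should be checked against the corpus at hand.

```r
library(polmineR)

# Scenario II: keep only paragraphs of type "speech" first (assumed
# attribute `p_type`), then split the resulting subcorpus into speeches.
gparl_speech <- subset(corpus("GERMAPARL2"), p_type == "speech")

speeches <- as.speeches(
  gparl_speech,
  s_attribute_name = "speaker_name",
  s_attribute_date = "protocol_date"
)
length(speeches)  # number of subcorpora in the resulting bundle
```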
Discussion
Aside from the second approach being very slow, it does not seem obvious to me why the first approach should not work. Is the first scenario supposed to work in the first place? If it should work like this, there might be a bug. If it is not supposed to work like that, then some additional documentation might be useful.
Additional Remarks
The `as.speeches()` method also has a `subset` argument but, as also written in the documentation, this is currently only useful for speaker names (`speaker`) and dates (`date`) and does not work for other structural attributes.

This was tested using `polmineR 0.8.9.9001`.