`decode()` does not work reliably for subcorpora created by `as.speeches()` #292

ChristophLeonhardt · 2024-06-04T14:24:22Z

Summary

as.speeches() creates subcorpora from parliamentary corpora based on the speaker name, the date of the protocol and a gap value.
Each subcorpus (i.e. each speech) is described by a matrix of corpus positions, which can be either
- a single continuous span of tokens, or
- multiple spans of corpus positions, when the speech is interrupted by utterances of other speakers that are shorter than the gap.
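
To make the two shapes concrete, here is a minimal sketch in base R. The corpus positions are made up for illustration and are not taken from GermaParl2, and `expand_cpos()` is a hypothetical helper, not a polmineR function. Each row of the matrix is one span, given by its first and last corpus position.

```r
# Hypothetical corpus positions, for illustration only.
# One continuous span: a speech without interruptions.
cpos_single <- matrix(c(0L, 99L), ncol = 2, byrow = TRUE)

# Three spans: the speaker is interrupted twice by short interjections
# (fewer tokens than `gap`), and the interjections are left out.
cpos_multi <- matrix(
  c(0L, 39L,
    45L, 79L,
    83L, 99L),
  ncol = 2, byrow = TRUE
)

# Hypothetical helper: expand a region matrix into the individual
# corpus positions it covers.
expand_cpos <- function(m) {
  unlist(lapply(seq_len(nrow(m)), function(i) m[i, 1]:m[i, 2]))
}

length(expand_cpos(cpos_single)) # 100 tokens
length(expand_cpos(cpos_multi))  # 40 + 35 + 17 = 92 tokens
```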

At least for nested corpora such as GermaParl2, combining s-attributes while decoding these subcorpora becomes challenging when the gap parameter is not zero. With a non-zero gap, the speaker name can change between the very start and the end of a speech (when the speech is interrupted), so the same speech subcorpus consists of multiple non-consecutive, non-overlapping spans of tokens. The date of the protocol, however, does not change within a speech. The protocol date is therefore represented by the same span of tokens for every span of the speech, and this span ends up being included multiple times. This causes incorrect return values and potential errors.

Example Data

Let's create two example speech subcorpora, one with a gap of 50 tokens and one without any allowed tokens between utterances. This uses the GermaParl2 corpus.

library(polmineR)

germaparl_speech_with_gap <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07") |>
  as.speeches(
    s_attribute_name = "speaker_name",
    s_attribute_date = "protocol_date",
    gap = 50
  ) |>
  _[[1]]

germaparl_speech_no_gap <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07") |>
  as.speeches(
    s_attribute_name = "speaker_name",
    s_attribute_date = "protocol_date",
    gap = 0
  ) |>
  _[[1]]

Issues

Issue 1: Tokens added multiple times in the data.table

Under some circumstances, the attempt to decode the speech with a gap of 50 tokens results in an incorrect data.table:

decode(
  germaparl_speech_with_gap,
  to = "data.table",
  p_attributes = "word",
  s_attributes = "protocol_lp",
  verbose = TRUE
)

The returned data.table contains each token three times.

This is not the case when decoding the speech which was created with a gap of 0:

decode(
  germaparl_speech_no_gap,
  to = "data.table",
  p_attributes = "word",
  s_attributes = "protocol_lp",
  verbose = TRUE
)

The duplication only occurs for "document-level" attributes such as "protocol_lp" or "protocol_date".

Issue 2: Error when multiple document-level attributes are to be decoded

The issue gets more severe if multiple document-level attributes should be decoded at once. In the example above, more than two of these attributes result in an error which essentially states that data.table was about to join a large number of rows.

Based on the observation above, I think that each additional attribute on this level introduces more tokens to the token stream as the "join" of the data.table matches increasingly more corpus positions in the initial token stream object. At some point, this seems to reach a limit which causes the error.
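
The suspected multiplication can be reproduced on toy data. The sketch below uses base R `merge()` rather than polmineR internals, and all table and column contents are made up. It only shows how duplicated keys multiply rows with each additional join. data.table refuses such a join by default once the result has more than `nrow(x) + nrow(i)` rows, unless `allow.cartesian = TRUE` is set, which would fit the error described above.

```r
# Toy token stream: 5 tokens with unique corpus positions.
tokens <- data.frame(cpos = 0:4, word = c("a", "b", "c", "d", "e"))

# Hypothetical lookup tables in which every corpus position occurs three
# times, as if the same protocol region had been resolved once per span.
lp   <- data.frame(cpos = rep(0:4, times = 3), protocol_lp = 1L)
date <- data.frame(cpos = rep(0:4, times = 3), protocol_date = "1949-09-07")

# Join one document-level attribute, then a second one.
one_attr <- merge(tokens, lp, by = "cpos")
two_attr <- merge(one_attr, date, by = "cpos")

nrow(tokens)   # 5
nrow(one_attr) # 15: every token three times
nrow(two_attr) # 45: the duplication compounds
```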

Possible Cause

I assume that the issue essentially is that respective structural attributes are retrieved for each of the token spans described by the cpos slot of the speech subcorpus regardless of whether they ultimately refer to the same "struc".

When you include a gap parameter of sufficient size, there are multiple token spans: each time the speaker changes, a new span of corpus positions is stored, so that the parts of the data uttered by other speakers are excluded. See the difference between a gap of 0 and a gap of 50 tokens:

germaparl_speech_no_gap@cpos # 1 row
germaparl_speech_with_gap@cpos # 3 rows

While these spans of tokens are unique and non-overlapping, during decode() these are essentially resolved to their corresponding struc. While this results in unique "strucs" for structural attributes on the speaker level, on the level of protocols the resulting "strucs" are all the same for each token span in the cpos slot. When these non-unique "strucs" are then ultimately resolved to corpus positions again, the same sequence of corpus positions is returned three times, resolved to their corresponding structural attributes three times and then added to the initial token stream three times.
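
The round trip described above can be mimicked on toy data. This is a sketch under stated assumptions, not the actual decode() implementation: `cpos_to_struc()` and `struc_to_cpos()` are hypothetical stand-ins written in base R, and the region tables are made up.

```r
# Made-up region tables: struc id plus first and last corpus position.
# One protocol covers everything; speaker regions alternate.
protocol_regions <- data.frame(struc = 0L, left = 0L, right = 99L)
speaker_regions <- data.frame(
  struc = 0:4,
  left  = c(0L, 40L, 45L, 80L, 83L),
  right = c(39L, 44L, 79L, 82L, 99L)
)

# Spans of one speech created with a gap: three rows, like the cpos slot.
speech_cpos <- matrix(c(0L, 39L, 45L, 79L, 83L, 99L), ncol = 2, byrow = TRUE)

# Hypothetical helper: the struc that contains a given corpus position.
cpos_to_struc <- function(cpos, regions) {
  vapply(
    cpos,
    function(p) regions$struc[regions$left <= p & p <= regions$right],
    integer(1)
  )
}

# Hypothetical helper: all corpus positions covered by the given strucs.
struc_to_cpos <- function(strucs, regions) {
  unlist(lapply(strucs, function(s) {
    r <- regions[regions$struc == s, ]
    r$left:r$right
  }))
}

# Resolve the start of each span to its struc.
speaker_strucs  <- cpos_to_struc(speech_cpos[, 1], speaker_regions)
protocol_strucs <- cpos_to_struc(speech_cpos[, 1], protocol_regions)

speaker_strucs  # 0 2 4: unique per span
protocol_strucs # 0 0 0: the same struc for every span

# Resolving the strucs back to corpus positions repeats the protocol region.
length(struc_to_cpos(speaker_strucs, speaker_regions))   # 92
length(struc_to_cpos(protocol_strucs, protocol_regions)) # 300
```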

Possible Solution

I did not test this apart from the examples above, but maybe it would suffice to add a unique() to the call which creates the strucs vector in the first place? When it is made unique here, then the same corpus positions cannot be added multiple times later on.
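
The effect of the proposed unique() can be checked on toy data. The sketch below is an assumption about the mechanism and does not touch polmineR code: `struc_to_cpos()` is a hypothetical base R helper, and the region table and strucs vector are made up.

```r
# Made-up protocol region: one struc covering corpus positions 0 to 99.
protocol_regions <- data.frame(struc = 0L, left = 0L, right = 99L)

# Hypothetical helper: all corpus positions covered by the given strucs.
struc_to_cpos <- function(strucs, regions) {
  unlist(lapply(strucs, function(s) {
    r <- regions[regions$struc == s, ]
    r$left:r$right
  }))
}

# One struc per span of the speech; all three spans share one protocol.
strucs <- c(0L, 0L, 0L)

cpos_dup  <- struc_to_cpos(strucs, protocol_regions)
cpos_uniq <- struc_to_cpos(unique(strucs), protocol_regions)

length(cpos_dup)         # 300: each position three times
length(cpos_uniq)        # 100: each position once
anyDuplicated(cpos_uniq) # 0
```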
