Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong data type of NA timestamps depending on order of data sources #269

Open
bockthom opened this issue Sep 19, 2024 · 5 comments
Open

Comments

@bockthom
Copy link
Collaborator

Description

project.data$get.data.cut.to.same.date(data.sources = c("issues", "mails", "commits")) sometimes fails if some of the data sources are empty, ending up in the following error message:

ERROR::The bins parameter, if present, needs to be a vector whose elements represent dates

This error occurs if the issue data is the empty data source, but not if the mail data is the empty data source, which is an unexpected behavior.

After some debugging, I noticed that the error is caused by wrong data types when converting strings to POSIXct dates: Usually, if the date is not a POSIXct object, it is converted to POSIXct when splitting the data:

coronet/util-split.R

Lines 69 to 70 in 24005e4

if (!is.null(bins) && !lubridate::is.POSIXct(bins)) {
dates = parallel::mclapply(unlist(bins), get.date.from.string)

However, this fails if the bins (i.e., the timestamps used for cutting) are not a string but a numeric (such as UNIX timestamps). Therefore, we need to look at where the timestamps for cutting come from: They are extracted in the function extract.timestamps:

coronet/util-data.R

Lines 724 to 758 in 24005e4

extract.timestamps = function(name, source) {
## initialize data structure for timestamp
if (is.null(private$data.timestamps)) {
## let this stay as a sanity-check; this is actually initialized this way when creating the object
private$data.timestamps = data.frame(start = numeric(0), end = numeric(0))
}
## collect minimum and maximum date for data source
## 1) if we have data available
if (nrow(private[[source]]) > 0) {
source.date.min = min(private[[source]][, "date"])
source.date.max = max(private[[source]][, "date"])
}
## NAs otherwise
else {
source.date.min = NA
source.date.max = NA
}
## remove old line if existing
private$data.timestamps = subset(
private$data.timestamps,
!(rownames(private$data.timestamps) == name)
)
## store the data in the timestamp data set
private$data.timestamps = rbind(
private$data.timestamps,
data.frame(
start = source.date.min,
end = source.date.max,
row.names = name
)
)
}

If no data is available for the specified data source, the corresponding timestamps are set to NA. For whatever reason, this converts the data frame used for cutting into a numeric data frame of UNIX timestamps if the first row contains NA values. Otherwise, if there are already POSIXct objects in there, the inserted NA values are interpreted to be POSIXct objects.

Suggested Fix

Instead of setting the timestamps to NA in case of missing data sources, set the timestamps to as.POSIXct(NA). This way, we ensure that the data type is POSIXct even if the NA values are the very first ones that are inserted into the timestamps data frame.

Versions

This affects, at least, coronet version 4.4 in combination with R version 4.4.1. Also earlier coronet versions or R versions may be affected by this bug.

@bockthom bockthom added the bug label Sep 19, 2024
@bockthom bockthom added this to the v4.5 milestone Sep 19, 2024
@bockthom
Copy link
Collaborator Author

I have already implemented a fix for that, see the linked patch:
0001-Ensure-correct-data-type-of-NA-timestamps.patch

However, we also should properly test that, to make sure that the patch properly works in various situations and does not break anything.

@maxloeffler
Copy link
Contributor

Your fix does not break any of the tests in our suite. From your last message I was not sure if you want me to write new tests that specifically test the previously faulty behavior or not since testing it would be a bit hacky with it only working when issue data is empty. Please let me know :)

@bockthom
Copy link
Collaborator Author

Your fix does not break any of the tests in our suite.

Sounds good.

From your last message I was not sure if you want me to write new tests that specifically test the previously faulty behavior or not since testing it would be a bit hacky with it only working when issue data is empty. Please let me know :)

Please write a test / tests to test the previously faulty behavior. It should be easy to achieve: Use the setter for issue data to set empty issue data. Then lock issues. This way, reading the actual issue data again should be prevented, if I am not mistaken.

@maxloeffler
Copy link
Contributor

Small issue here. I tried to reproduce the error you are describing as it very reasonable to me and I want to figure out the best way to test against it. Unfortunately, I was not able to reproduce it at all. The timestamps used in get.data.cut.to.same.date stem from ProjectData$get.data.timestamps(data.sources = data.sources , simple = TRUE). In this function we consistently obtain start-, and endtimes by calling max and min with na.rm = TRUE which disregards NA timestamps as created by empty data sources. Therefore, empty datasources do not interfere with the final timestamps which are later used to split the data by, which is were the error you mention is originating from. Im kind of clueless about this right now.

@bockthom
Copy link
Collaborator Author

Hm, I am pretty sure that this issue resulted from non-existent (a.k.a. empty) data. However, the order of data sources makes a difference in how NA values are treated.

Let's discuss that in our next meeting. I can show you the context in which I encountered the issue, and then we can discuss how we can reproduce this in our tests.

maxloeffler pushed a commit to maxloeffler/coronet that referenced this issue Oct 16, 2024
Up until now, `get.data.cut.to.same.date(data.sources =
c("issues", "mails", "commits"))` failed if some of the first data
source was empty, but not if the second one was empty. The reason was
that `NA` values introduced by empty data sources at the beginning of
the data frame turned the data frame into a data frame of numeric
objects instead of POSIXct objects. If there were already POSIXct
objects in the data frame, this did not happen. To prevent the
timestamps to be interpreted as numeric values, make sure that the `NA`
values are always POSIXct objects.

This fixes se-sic#269.

Signed-off-by: Thomas Bock <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this issue Oct 16, 2024
This test fails without the previous fix by Thomas Bock but does not
fail when the fix is in place.

This works towards fixing se-sic#269.

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this issue Oct 16, 2024
This test fails without the previous fix by Thomas Bock but does not
fail when the fix is in place.

This works towards fixing se-sic#269.

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this issue Oct 17, 2024
This test fails without the previous fix by Thomas Bock but does not
fail when the fix is in place.

This works towards fixing se-sic#269.

Signed-off-by: Maximilian Löffler <[email protected]>
maxloeffler pushed a commit to maxloeffler/coronet that referenced this issue Oct 17, 2024
Up until now, `get.data.cut.to.same.date(data.sources =
c("issues", "mails", "commits"))` failed if some of the first data
source was empty, but not if the second one was empty. The reason was
that `NA` values introduced by empty data sources at the beginning of
the data frame turned the data frame into a data frame of numeric
objects instead of POSIXct objects. If there were already POSIXct
objects in the data frame, this did not happen. To prevent the
timestamps to be interpreted as numeric values, make sure that the `NA`
values are always POSIXct objects.

This fixes se-sic#269.

Signed-off-by: Thomas Bock <[email protected]>
maxloeffler added a commit to maxloeffler/coronet that referenced this issue Oct 17, 2024
This test fails without the previous fix by Thomas Bock but does not
fail when the fix is in place.

This works towards fixing se-sic#269.

Signed-off-by: Maximilian Löffler <[email protected]>
Leo-Send pushed a commit to Leo-Send/coronet that referenced this issue Nov 13, 2024
Up until now, `get.data.cut.to.same.date(data.sources =
c("issues", "mails", "commits"))` failed if some of the first data
source was empty, but not if the second one was empty. The reason was
that `NA` values introduced by empty data sources at the beginning of
the data frame turned the data frame into a data frame of numeric
objects instead of POSIXct objects. If there were already POSIXct
objects in the data frame, this did not happen. To prevent the
timestamps to be interpreted as numeric values, make sure that the `NA`
values are always POSIXct objects.

This fixes se-sic#269.

Signed-off-by: Thomas Bock <[email protected]>
Leo-Send pushed a commit to Leo-Send/coronet that referenced this issue Nov 13, 2024
This test fails without the previous fix by Thomas Bock but does not
fail when the fix is in place.

This works towards fixing se-sic#269.

Signed-off-by: Maximilian Löffler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants