Wrong data type of NA timestamps depending on order of data sources #269

bockthom · 2024-09-19T16:28:37Z

Description

project.data$get.data.cut.to.same.date(data.sources = c("issues", "mails", "commits")) sometimes fails if some of the data sources are empty, ending up in the following error message:

ERROR::The bins parameter, if present, needs to be a vector whose elements represent dates

This error occurs if the issue data is the empty data source, but not if the mail data is the empty data source, which is an unexpected behavior.

After some debugging, I noticed that the error is caused by wrong data types when converting strings to POSIXct dates: Usually, if the date is not a POSIXct object, it is converted to POSIXct when splitting the data:

coronet/util-split.R

Lines 69 to 70 in 24005e4

    
           if (!is.null(bins) && !lubridate::is.POSIXct(bins)) { 
        
               dates = parallel::mclapply(unlist(bins), get.date.from.string)

However, this fails if the bins (i.e., the timestamps used for cutting) are not a string but a numeric (such as UNIX timestamps). Therefore, we need to look at where the timestamps for cutting come from: They are extracted in the function extract.timestamps:

coronet/util-data.R

Lines 724 to 758 in 24005e4

    
           extract.timestamps = function(name, source) { 
        
               ## initialize data structure for timestamp 
        
               if (is.null(private$data.timestamps)) { 
        
                   ## let this stay as a sanity-check; this is actually initialized this way when creating the object 
        
                   private$data.timestamps = data.frame(start = numeric(0), end = numeric(0)) 
        
               } 
        
               ## collect minimum and maximum date for data source 
        
               ## 1) if we have data available 
        
               if (nrow(private[[source]]) > 0) { 
        
                   source.date.min = min(private[[source]][, "date"]) 
        
                   source.date.max = max(private[[source]][, "date"]) 
        
               } 
        
               ## NAs otherwise 
        
               else { 
        
                   source.date.min = NA 
        
                   source.date.max = NA 
        
               } 
        
               ## remove old line if existing 
        
               private$data.timestamps = subset( 
        
                   private$data.timestamps, 
        
                   !(rownames(private$data.timestamps) == name) 
        
               ) 
        
               ## store the data in the timestamp data set 
        
               private$data.timestamps = rbind( 
        
                   private$data.timestamps, 
        
                   data.frame( 
        
                       start = source.date.min, 
        
                       end = source.date.max, 
        
                       row.names = name 
        
                   ) 
        
               ) 
        
           }

If no data is available for the specified data source, the corresponding timestamps are set to NA. For whatever reason, this converts the data frame used for cutting into a numeric data frame of UNIX timestamps if the first row contains NA values. Otherwise, if there are already POSIXct objects in there, the inserted NA values are interpreted to be POSIXct objects.

Suggested Fix

Instead of setting the timestamps to NA in case of missing data sources, set the timestamps to as.POSIXct(NA). This way, we ensure that the data type is POSIXct even if the NA values are the very first ones that are inserted into the timestamps data frame.

Versions

This affects, at least, coronet version 4.4 in combination with R version 4.4.1. Also earlier coronet versions or R versions may be affected by this bug.

The text was updated successfully, but these errors were encountered:

bockthom · 2024-09-19T16:32:09Z

I have already implemented a fix for that, see the linked patch:
0001-Ensure-correct-data-type-of-NA-timestamps.patch

However, we also should properly test that, to make sure that the patch properly works in various situations and does not break anything.

maxloeffler · 2024-10-14T12:43:14Z

Your fix does not break any of the tests in our suite. From your last message I was not sure if you want me to write new tests that specifically test the previously faulty behavior or not since testing it would be a bit hacky with it only working when issue data is empty. Please let me know :)

bockthom · 2024-10-14T12:48:23Z

Your fix does not break any of the tests in our suite.

Sounds good.

From your last message I was not sure if you want me to write new tests that specifically test the previously faulty behavior or not since testing it would be a bit hacky with it only working when issue data is empty. Please let me know :)

Please write a test / tests to test the previously faulty behavior. It should be easy to achieve: Use the setter for issue data to set empty issue data. Then lock issues. This way, reading the actual issue data again should be prevented, if I am not mistaken.

maxloeffler · 2024-10-15T15:41:22Z

Small issue here. I tried to reproduce the error you are describing as it very reasonable to me and I want to figure out the best way to test against it. Unfortunately, I was not able to reproduce it at all. The timestamps used in get.data.cut.to.same.date stem from ProjectData$get.data.timestamps(data.sources = data.sources , simple = TRUE). In this function we consistently obtain start-, and endtimes by calling max and min with na.rm = TRUE which disregards NA timestamps as created by empty data sources. Therefore, empty datasources do not interfere with the final timestamps which are later used to split the data by, which is were the error you mention is originating from. Im kind of clueless about this right now.

bockthom · 2024-10-15T18:04:33Z

Hm, I am pretty sure that this issue resulted from non-existent (a.k.a. empty) data. However, the order of data sources makes a difference in how NA values are treated.

Let's discuss that in our next meeting. I can show you the context in which I encountered the issue, and then we can discuss how we can reproduce this in our tests.

Up until now, `get.data.cut.to.same.date(data.sources = c("issues", "mails", "commits"))` failed if some of the first data source was empty, but not if the second one was empty. The reason was that `NA` values introduced by empty data sources at the beginning of the data frame turned the data frame into a data frame of numeric objects instead of POSIXct objects. If there were already POSIXct objects in the data frame, this did not happen. To prevent the timestamps to be interpreted as numeric values, make sure that the `NA` values are always POSIXct objects. This fixes se-sic#269. Signed-off-by: Thomas Bock <[email protected]>

This test fails without the previous fix by Thomas Bock but does not fail when the fix is in place. This works towards fixing se-sic#269. Signed-off-by: Maximilian Löffler <[email protected]>

Up until now, `get.data.cut.to.same.date(data.sources = c("issues", "mails", "commits"))` failed if some of the first data source was empty, but not if the second one was empty. The reason was that `NA` values introduced by empty data sources at the beginning of the data frame turned the data frame into a data frame of numeric objects instead of POSIXct objects. If there were already POSIXct objects in the data frame, this did not happen. To prevent the timestamps to be interpreted as numeric values, make sure that the `NA` values are always POSIXct objects. This fixes se-sic#269. Signed-off-by: Thomas Bock <[email protected]>

This test fails without the previous fix by Thomas Bock but does not fail when the fix is in place. This works towards fixing se-sic#269. Signed-off-by: Maximilian Löffler <[email protected]>

Up until now, `get.data.cut.to.same.date(data.sources = c("issues", "mails", "commits"))` failed if some of the first data source was empty, but not if the second one was empty. The reason was that `NA` values introduced by empty data sources at the beginning of the data frame turned the data frame into a data frame of numeric objects instead of POSIXct objects. If there were already POSIXct objects in the data frame, this did not happen. To prevent the timestamps to be interpreted as numeric values, make sure that the `NA` values are always POSIXct objects. This fixes se-sic#269. Signed-off-by: Thomas Bock <[email protected]>

This test fails without the previous fix by Thomas Bock but does not fail when the fix is in place. This works towards fixing se-sic#269. Signed-off-by: Maximilian Löffler <[email protected]>

bockthom added the bug label Sep 19, 2024

bockthom added this to the v4.5 milestone Sep 19, 2024

bockthom assigned maxloeffler Oct 9, 2024

maxloeffler mentioned this issue Oct 16, 2024

Fix a bug in 'extract.timestamps' when dealing with an empty data source #270

Merged

7 tasks

bockthom added fixed/has-pr fixed/future-release labels Oct 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong data type of NA timestamps depending on order of data sources #269

Wrong data type of NA timestamps depending on order of data sources #269

bockthom commented Sep 19, 2024

bockthom commented Sep 19, 2024

maxloeffler commented Oct 14, 2024

bockthom commented Oct 14, 2024

maxloeffler commented Oct 15, 2024

bockthom commented Oct 15, 2024

Wrong data type of NA timestamps depending on order of data sources #269

Wrong data type of NA timestamps depending on order of data sources #269

Comments

bockthom commented Sep 19, 2024

Description

Suggested Fix

Versions

bockthom commented Sep 19, 2024

maxloeffler commented Oct 14, 2024

bockthom commented Oct 14, 2024

maxloeffler commented Oct 15, 2024

bockthom commented Oct 15, 2024