Allow to specify split.basis
by stating multiple data sources and start splitting according to the earliest data source
#254
Comments
I have finally come to have a look at this issue and I have gathered some ideas I want to talk about. First of all, it is crucial that we agree on the same definition of what it should mean to split by multiple bases. Splitting boils down to first obtaining all elements of the data source given as the splitting basis, then grouping these elements into bins based either on time or on activity (or starting with predefined bins in the first place), and then grouping all elements (including elements from data sources other than our splitting basis) into the just constructed bins. In your description you say:
This to me sounds like given data sources of However, I do believe, and I'm not sure if that is just what you meant initially, that it would make more sense to consider the elements of
Further, this approach would mean that identifying the earliest element of both data sources would not be necessary anymore. In addition to adapting the validation as you described, the only other change would be when the bin construction is called in ...
if (is.null(bins)) {
    ## get bins based on split.basis
    ## (the call below reads the dates of a single data source via 'split.basis')
    bins = split.get.bins.time.based(data[[split.basis]][["date"]], splitting.length, number.windows)$bins
    bins.labels = head(bins, -1)
    ## logging
    logging::loginfo("Splitting data '%s' into time ranges of %s based on '%s' data.",
                     project.data$get.class.name(), splitting.length, split.basis)
}
... Let me know if I misinterpreted something, and what your opinion on the matter is!
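The idea of pooling the elements of several data sources before constructing the bins could look roughly like the sketch below. This is only an illustration under assumptions: `get.bins.from.multiple.sources` is a hypothetical helper, not a coronet function, the toy `data` list stands in for the project data, and the real `split.get.bins.time.based` may compute its boundaries differently.

```r
## Hypothetical helper (NOT part of coronet): pool the dates of all data
## sources named in 'split.basis' and derive time-based bin boundaries.
get.bins.from.multiple.sources = function(data, split.basis, bin.seconds) {
    ## collect the 'date' columns of all requested data sources
    date.list = unname(lapply(split.basis, function(source) data[[source]][["date"]]))
    dates = do.call(c, date.list)
    start = min(dates)
    ## one second past the latest pooled element, so that it falls into the last bin
    end = max(dates) + 1
    bins = seq(start, end, by = bin.seconds)
    ## close the last (possibly shorter) bin
    if (bins[length(bins)] < end) {
        bins = c(bins, end)
    }
    return(bins)
}

## toy data (assumption, for illustration only)
to.date = function(x) as.POSIXct(x, tz = "UTC")
data = list(
    mails = data.frame(date = to.date(c("2016-07-12 10:00:00", "2016-07-12 10:04:30"))),
    issues = data.frame(date = to.date(c("2016-07-12 09:58:10", "2016-07-12 10:01:00"))),
    commits = data.frame(date = to.date(c("2016-07-12 10:00:20", "2016-07-12 10:07:00")))
)

## bins are built from mails and issues together, in windows of two minutes
bins = get.bins.from.multiple.sources(data, c("mails", "issues"), 120)

## afterwards, elements of any data source can be assigned to these bins;
## an index equal to length(bins) means the element lies after the last bin
commit.bins = findInterval(as.numeric(data[["commits"]][["date"]]), as.numeric(bins))
```

With this approach there is no need to determine which data source starts first, because the minimum is taken over the pooled dates.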
Agreed 😄
Right. This is exactly what I have meant.
I completely agree with you. This would be exactly what we need. However, I am not sure if this is easy to implement given the complex splitting functions we already have in coronet. That's why I have suggested to determine the earliest data source, split according to the earliest data source, and add the excluded tail elements of the other data source(s) as additional bins. If you find an easy way to implement splitting such that we can directly pass multiple data sources and all activities (commits/mails/issue events/etc.) are considered at once to determine the splitting bins, then of course I would prefer such a solution. But I am not sure how many functions need to be immensely changed in order to achieve such a solution. To the best of my knowledge, the only data-related function that considers multiple data sources at once is the function
That's the tricky part. How can we actually consider multiple data sources here? I am not sure whether this is the only code location that needs to be adjusted or whether similar code elsewhere needs additional adjustments.
I completely agree with your interpretation, but I am not sure if it is as easy to implement as you have sketched here. Let's try it...
I implemented a prototype and a first test for it. In the test we have mail data with dates:
Browse[1]> mail.data[["date"]]
[1] "2016-07-12 15:58:40 UTC" "2016-07-12 15:58:50 UTC" "2016-07-12 16:04:40 UTC" "2016-07-12 16:05:37 UTC"
... issue data with dates:
Browse[1]> issue.data[["date"]]
[1] "2016-07-12 15:59:25 UTC" "2016-07-12 15:59:59 UTC" "2016-07-12 16:01:01 UTC" "2016-07-12 16:02:02 UTC" "2016-07-12 16:02:30 UTC" "2016-07-12 16:03:59 UTC"
[7] "2016-07-12 16:06:01 UTC"
... and commit data with dates:
Browse[1]> commit.data[["date"]]
[1] "2016-07-12 15:58:59 UTC" "2016-07-12 16:00:45 UTC" "2016-07-12 16:06:32 UTC"
I then split based on
attr(,"bins")
[1] "2016-07-12 15:58:40 UTC" "2016-07-12 15:59:40 UTC" "2016-07-12 16:00:40 UTC" "2016-07-12 16:01:40 UTC" "2016-07-12 16:02:40 UTC" "2016-07-12 16:03:40 UTC"
[7] "2016-07-12 16:04:40 UTC" "2016-07-12 16:05:40 UTC" "2016-07-12 16:06:02 UTC"
We observe that the start of the first bin corresponds with the earliest
From my quick look, the example looks good! 😄
Description
The splitting function split.data.time.based (and possibly also similar splitting functions) takes a parameter split.basis which defines the data source that is used to calculate the time ranges. Currently, it is only allowed to set one specific data source as a split basis. However, it might also be the case that one wants to split the data based on the earliest of two or more data sources, which is currently not possible.
Motivation
This way, we could automatically determine the appropriate (i.e., earliest data source) split basis from a set of data sources.
Ideas for the Implementation
To enable this, there are three things to change:
1. Change the argument validation of the split.basis parameter, such that it allows multiple values simultaneously but keeps the current default value (which is "commits") if no value is given. Therefore, we need to differentiate between the case that one explicitly sets all data sources (e.g., c("commits", "mails", "issues")) and the case that nothing is specified and just the choices are retrieved from the method signature. This can be done using the hasArg(...) function as the very first statement inside the splitting function, in order to identify whether the caller has set the parameter or not, as in the following snippet:
2. If multiple data sources are set, identify the earliest data source, that is, the data source whose earliest date precedes the earliest dates of all other given data sources. Here it might be helpful to define a helper function get.earliest.data.source(data.sources = c("commits", "mails", "issues")) in the ProjectData class.
3. Use earliest.split.basis as basis for the actual splitting and also take care of missing data at the end (i.e., increase the last range to include later data from other data sources or add missing ranges at the end).
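The first two ideas above (detecting whether split.basis was set explicitly, and picking the earliest data source) might be sketched as follows. This is a rough illustration, not the coronet implementation: `resolve.split.basis` and the toy `data` list are hypothetical, and `get.earliest.data.source` is written here as a standalone function taking the data as a list, whereas the issue proposes it as a method of the ProjectData class. Only `methods::hasArg` and `match.arg` are existing R functions.

```r
## Hypothetical standalone variant of the proposed helper: return the data
## source whose earliest date precedes those of all other given data sources.
get.earliest.data.source = function(data, data.sources = c("commits", "mails", "issues")) {
    earliest.dates = sapply(data.sources, function(source) {
        as.numeric(min(data[[source]][["date"]]))
    })
    return(data.sources[unname(which.min(earliest.dates))])
}

## Hypothetical wrapper showing the argument handling of a splitting function.
resolve.split.basis = function(data, split.basis = c("commits", "mails", "issues")) {
    ## very first statement: did the caller set 'split.basis' explicitly?
    basis.given = methods::hasArg(split.basis)
    if (!basis.given) {
        ## keep the current default if nothing was specified
        return("commits")
    }
    ## validate the explicitly given values, allowing several at once
    split.basis = match.arg(split.basis, choices = c("commits", "mails", "issues"),
                            several.ok = TRUE)
    if (length(split.basis) == 1) {
        return(split.basis)
    }
    return(get.earliest.data.source(data, split.basis))
}

## toy data (assumption, for illustration only)
to.date = function(x) as.POSIXct(x, tz = "UTC")
data = list(
    mails = data.frame(date = to.date(c("2016-07-12 10:00:00", "2016-07-12 10:04:30"))),
    issues = data.frame(date = to.date(c("2016-07-12 09:58:10", "2016-07-12 10:01:00"))),
    commits = data.frame(date = to.date(c("2016-07-12 10:00:20", "2016-07-12 10:07:00")))
)

default.basis = resolve.split.basis(data)
earliest.split.basis = resolve.split.basis(data, c("commits", "mails", "issues"))
```

In this toy data, the issue data starts first, so it would be chosen when all three data sources are passed explicitly, while a call without split.basis still falls back to "commits". Handling the data at the end (the third idea) is not covered by this sketch.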