Align features refactoring (vol. 2) #88
Conversation
hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
     main = "m/z SD distribution")
hist(apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE),
     xlab = "Retention time SD", ylab = "Frequency",
can be moved to plot.R
Do you have any suggestions? I think we would just create an artificial function which takes the same arguments as hist.
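For illustration, a minimal sketch of such a wrapper that could live in plot.R (the function name and signature are hypothetical, not existing project code):

# Hypothetical helper for plot.R; it only wraps hist() so that
# feature.align.R does not call plotting functions directly.
plot_sd_histogram <- function(values, xlab, main) {
  hist(values, xlab = xlab, ylab = "Frequency", main = main)
}

# Corresponding call for the m/z SD histogram above:
# plot_sd_histogram(mz_sd, xlab = "m/z SD", main = "m/z SD distribution")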
R/feature.align.R
Outdated
if (is.null(nrow(pick))) {
    # this is very strange if it can ever happen
    # maybe commas are missing? we want the same as below
    # but also why if there are no rows...
    strengths[pick[6]] <- pick[5]
    return(c(pick[1], pick[2], pick[1], pick[1], strengths))
I guess the problem is that nrow(pick) might be NULL if pick has the wrong type - meaning it is not a matrix or data frame but a plain vector, in which case it behaves differently.
Exactly. That also happens with the nrow check in prof.to.features.
Okay, that makes sense; it just seemed suspicious to me.
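For reference, a quick standalone illustration of the behaviour discussed above (values are made up; this is plain R semantics, not project code):

pick_vector <- c(100.1, 25.3, 100.1, 100.1, 5000, 1)   # a single "row" collapsed to a vector
pick_matrix <- matrix(pick_vector, nrow = 1)

nrow(pick_vector)            # NULL - plain vectors have no dim attribute
nrow(pick_matrix)            # 1
is.null(nrow(pick_vector))   # TRUE, so the branch above is taken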
create_output <- function(sample_grouped, number_of_samples, deviation) {
    return(c(to_attach(sample_grouped, number_of_samples, use = "sum"),
             to_attach(sample_grouped[, c(1, 2, 3, 4, 2, 6)], number_of_samples, use = "median"),
What is the role of the vector with swapped indices here?
I'm not sure if I understand the question. to_attach is used to compute the sum of areas per sample (first call) and the per-sample retention times (second call) - that's why in the second call, column 2 (RTs) is passed in place of column 5 (areas). I want to simplify this final step, and also the outputs of the alignment step, as part of #87.
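For clarity, a tiny standalone example of what the index vector c(1, 2, 3, 4, 2, 6) does (toy matrix; column names are only illustrative, and c3/c4 stand in for whatever the real table holds there):

m <- cbind(mz = c(100.01, 100.02), rt = c(210, 215), c3 = 0, c4 = 0,
           area = c(5100, 4900), sample = c(1, 2))
m[, c(1, 2, 3, 4, 2, 6)]   # column 5 (area) is replaced by a copy of column 2 (rt)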
validate_contents <- function(samples, min_occurrence) {
    # validate whether data is still from at least 'min_occurrence' number of samples
    if (!is.null(nrow(samples))) {
        if (length(unique(samples[, 6])) >= min_occurrence) {
Indexing should rather be done via column name than numeric index.
At this point, samples is just a list of vectors with no column names.
Oh - okay.
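A small illustration of the constraint (toy values; assuming the rows come from plain unnamed numeric vectors, as described above):

samples <- rbind(c(100.05, 12.3, 0, 0, 5000, 1),
                 c(100.06, 12.4, 0, 0, 4800, 2))
colnames(samples)           # NULL - binding unnamed vectors attaches no column names
samples[, 6]                # positional access works
# samples[, "sample_id"]    # would fail with "subscript out of bounds"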
find_optima <- function(data, bandwidth) {
    # Kernel Density Estimation
    den <- density(data, bw = bandwidth)
    # select statistically significant points
    turns <- find.turn.point(den$y)
    return(list(peaks = den$x[turns$pks], valleys = den$x[turns$vlys]))
}
This function was also extracted in many other refactorings.
Let's address this together in #65.
filter_based_on_density <- function(sample, turns, index, i) {
    # select data within lower and upper bound from density estimation
    lower_bound <- max(turns$valleys[turns$valleys < turns$peaks[i]])
    upper_bound <- min(turns$valleys[turns$valleys > turns$peaks[i]])
    selected <- which(sample[, index] > lower_bound & sample[, index] <= upper_bound)
    return(sample[selected, ])
}
Same - this function should be re-used in adaptive.bin and recover.weaker.
Again, should be properly fixed as part of #65.
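As a usage sketch of how the two extracted helpers fit together (toy data; assumes find.turn.point from the package is available, and that column 1 of the toy table is the quantity being clustered):

set.seed(1)
# two m/z clusters around 100.00 and 100.05
mz <- c(rnorm(20, 100.00, 0.001), rnorm(20, 100.05, 0.001))
sample <- cbind(mz, rt = runif(40, 200, 300))

turns <- find_optima(mz, bandwidth = 0.002)
# rows whose m/z (column 1) lies within the density region around the first peak
cluster_1 <- filter_based_on_density(sample, turns, index = 1, i = 1)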
# sort all values by m/z, if equal by rt
ordering <- order(mz_values, rt)
mz_values <- mz_values[ordering]
rt <- rt[ordering]
sample_id <- sample_id[ordering]
Data could be arranged in a tibble or data.frame.
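A sketch of the suggested alternative with a plain data.frame (dplyr::arrange on a tibble would read similarly):

features <- data.frame(mz = mz_values, rt = rt, sample_id = sample_id)
features <- features[order(features$mz, features$rt), ]
# or, with dplyr: features <- dplyr::arrange(features, mz, rt)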
pk.times <- aligned_features[, (5 + number_of_samples):(2 * (4 + number_of_samples))]
mz_sd <- aligned_features[, ncol(aligned_features)]
# select columns: average of m/z, average of rt, min of m/z, max of m/z, sum of areas per sample (the first to_attach call)
aligned_features <- aligned_features[, 1:(4 + number_of_samples)]
The computations for the column numbers could be extracted into variables to clearly indicate what is being used and why.
This will be changed in #87 anyway.
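For illustration, one way the index computations could be named (variable names are only suggestions):

feature_columns <- 1:(4 + number_of_samples)                              # mean m/z, mean rt, min/max m/z, per-sample area sums
rt_columns <- (5 + number_of_samples):(2 * (4 + number_of_samples))       # per-sample retention times
mz_sd_column <- ncol(aligned_features)                                    # last column holds the m/z SD

pk.times <- aligned_features[, rt_columns]
mz_sd <- aligned_features[, mz_sd_column]
aligned_features <- aligned_features[, feature_columns]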
if (do.plot) {
-   hist(mz.sd.rec, xlab = "m/z SD", ylab = "Frequency",
+   hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
        main = "m/z SD distribution")
    hist(apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE),
        xlab = "Retention time SD", ylab = "Frequency",
This could be moved to the plotting code.
Same as in @zargham-ahmad's comment: do you have a suggestion? I think we would just create an artificial function which takes the same arguments as hist.
warning("Automatic tolerance finding failed, 10 ppm was assigned. | ||
May need to manually assign alignment mz tolerance level.") |
This is something we should maybe address: I doubt anyone will see this warning, given that most of our test cases produce dozens of warnings.
Do you think the function should fail with an error? Or we could parametrise the whole function to either use the default value or fail with an error.
No, I don't think an error would be an appropriate solution. I just wanted to point out that apLCMS may produce a meaningful warning, so we should make sure that the warning doesn't get lost in a bunch of deprecation warnings, etc. I think we just have to reduce the number of other warnings like deprecation, imports, and so on. Should be done with #93.
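A sketch of the parametrisation idea mentioned above (the helper name and the fail_on_error argument are hypothetical, and the 1e-05 fallback is just 10 ppm written as a relative tolerance):

# Hypothetical: let the caller choose between a 10 ppm fallback and a hard error
# when automatic tolerance finding fails.
resolve_mz_tolerance <- function(estimated_tol, fail_on_error = FALSE) {
  if (!is.na(estimated_tol)) {
    return(estimated_tol)
  }
  if (fail_on_error) {
    stop("Automatic tolerance finding failed. Please assign the alignment mz tolerance manually.")
  }
  warning("Automatic tolerance finding failed, 10 ppm was assigned. ",
          "May need to manually assign alignment mz tolerance level.")
  return(1e-05)  # assumed fallback value: 10 ppm as a relative tolerance
}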
R/feature.align.R
Outdated
# order them again? should be ordered already...
sample_grouped <- sample_grouped[order(sample_grouped[, 1], sample_grouped[, 2]), ]
Did you try running the tests without this line?
Good point. Without it, the test case fails, so I guess they weren't really ordered...
Next iteration of align.features refactoring, this time focusing on code readability and understandability (second step in #37). Also refactored several functions which are being called, especially those for calculating rt and m/z relative tolerances. After this is merged, we can start with #87.
Closes #37.