Align features refactoring (vol. 2) #88

Merged 9 commits into RECETOX:master on Aug 10, 2022

Conversation

@xtrojak commented on Aug 4, 2022:

This is the next iteration of the align.features refactoring, this time focusing on code readability and comprehensibility (the second step in #37). It also refactors several of the functions being called, especially those for calculating the rt and m/z relative tolerances.

After this is merged, we can start with #87.

Closes #37.

@hechth linked an issue (2 tasks) on Aug 5, 2022 that may be closed by this pull request.
Comment on lines +259 to 262
hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
     main = "m/z SD distribution")
hist(apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE),
     xlab = "Retention time SD", ylab = "Frequency",

@zargham-ahmad:
can be moved to plot.R

Author:
Do you have any suggestions? I think we would just end up creating an artificial function that takes the same arguments as hist.
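For illustration, a minimal sketch of what such a wrapper in plot.R could look like; the name plot_sd_histograms, its arguments, and the second histogram's title are hypothetical, not code from this PR:

# Hypothetical helper for plot.R, wrapping the two histograms drawn after alignment.
plot_sd_histograms <- function(mz_sd, rt_sd) {
  hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
       main = "m/z SD distribution")
  hist(rt_sd, xlab = "Retention time SD", ylab = "Frequency",
       main = "Retention time SD distribution")
}

# The call site would then reduce to something like:
# plot_sd_histograms(mz_sd, apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE))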

Comment on lines 3 to 8
if (is.null(nrow(pick))) {
  # this is very strange if it can ever happen
  # maybe commas are missing? we want the same as below
  # but also why if there are no rows...
  strengths[pick[6]] <- pick[5]
  return(c(pick[1], pick[2], pick[1], pick[1], strengths))

Member:
I guess the problem lies in the fact that nrow(pick) might be NULL if pick has the wrong type - meaning it is not a table but a list or similar - and then it actually behaves differently.

Member:
Exactly. That also happens with the nrow check in prof.to.features.

Author:
Okay, that makes sense; it was just suspicious to me.
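For reference, a minimal illustration of the behaviour discussed above, in plain base R (not code from this PR):

pick_matrix <- matrix(1:12, nrow = 2)  # table-like input: nrow() returns a number
pick_vector <- 1:6                     # a single row collapsed into a plain vector

nrow(pick_matrix)  # 2
nrow(pick_vector)  # NULL, which is exactly what the is.null(nrow(pick)) branch detects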


create_output <- function(sample_grouped, number_of_samples, deviation) {
  return(c(to_attach(sample_grouped, number_of_samples, use = "sum"),
           to_attach(sample_grouped[, c(1, 2, 3, 4, 2, 6)], number_of_samples, use = "median"),

Member:
What is the role of the vector with swapped indices here?

Author:
I'm not sure I understand the question. to_attach is used to compute the sum of areas per sample (first call) and the retention times (second call) - that's why in the second call column 2 (RTs) is passed instead of column 5 (areas). I want to simplify this final step, and also the outputs of the alignment step, as part of #87.
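A toy illustration of the index swap; the matrix and its column names are assumptions for readability, not data from the PR:

m <- cbind(mz = c(100.1, 100.1), rt = c(12.3, 12.5),
           min_mz = c(100.0, 100.0), max_mz = c(100.2, 100.2),
           area = c(5e4, 7e4), sample_id = c(1, 2))

m[, c(1, 2, 3, 4, 5, 6)]  # areas sit in position 5, aggregated by the use = "sum" call
m[, c(1, 2, 3, 4, 2, 6)]  # rt duplicated into position 5, so the use = "median" call summarizes RTs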

validate_contents <- function(samples, min_occurrence) {
  # validate whether data is still from at least 'min_occurrence' number of samples
  if (!is.null(nrow(samples))) {
    if (length(unique(samples[, 6])) >= min_occurrence) {

Member:
Indexing should be done via column name rather than numeric index.

Author:
At this point, samples is just a list of vectors with no column names.

Member:
Oh - okay.

Comment on lines +47 to +53
find_optima <- function(data, bandwidth) {
  # Kernel Density Estimation
  den <- density(data, bw = bandwidth)
  # select statistically significant points
  turns <- find.turn.point(den$y)
  return(list(peaks = den$x[turns$pks], valleys = den$x[turns$vlys]))
}

Member:
This function was also extracted in many other refactorings.

Author:
Let's address this together in #65.

Comment on lines +56 to +62
filter_based_on_density <- function(sample, turns, index, i) {
  # select data within lower and upper bound from density estimation
  lower_bound <- max(turns$valleys[turns$valleys < turns$peaks[i]])
  upper_bound <- min(turns$valleys[turns$valleys > turns$peaks[i]])
  selected <- which(sample[, index] > lower_bound & sample[, index] <= upper_bound)
  return(sample[selected, ])
}

Member:
Same - this function should be re-used in adaptive.bin and recover.weaker.

Author:
Again, should be properly fixed as part of #65.

Comment on lines +138 to +142
# sort all values by m/z, if equal by rt
ordering <- order(mz_values, rt)
mz_values <- mz_values[ordering]
rt <- rt[ordering]
sample_id <- sample_id[ordering]

Member:
Data could be arranged in a tibble or data.frame.
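A hedged sketch of that suggestion using a plain data.frame (a tibble with dplyr::arrange would look similar); the variable name features is illustrative:

features <- data.frame(mz = mz_values, rt = rt, sample_id = sample_id)
# sort by m/z, breaking ties by rt, in a single step
features <- features[order(features$mz, features$rt), ]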

Comment on lines +242 to +245
pk.times <- aligned_features[, (5 + number_of_samples):(2 * (4 + number_of_samples))]
mz_sd <- aligned_features[, ncol(aligned_features)]
# select columns: average of m/z, average of rt, min of m/z, max of m/z, sum of areas per sample (the first to_attach call)
aligned_features <- aligned_features[, 1:(4 + number_of_samples)]

Member:
The computations for the column numbers could be extracted into variables to clearly indicate what is being used and why.

Author:
This will be changed in #87 anyway.
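A hedged sketch of the reviewer's suggestion; the variable names are illustrative, while the column layout is taken from the excerpt above:

n_annotation_cols <- 4                              # average m/z, average rt, min m/z, max m/z
areas_end <- n_annotation_cols + number_of_samples  # annotation columns + per-sample area sums
rt_start <- areas_end + 1
rt_end <- 2 * areas_end                             # second to_attach block: annotation columns + per-sample RTs

pk.times <- aligned_features[, rt_start:rt_end]
mz_sd <- aligned_features[, ncol(aligned_features)]
aligned_features <- aligned_features[, 1:areas_end]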

Comment on lines 258 to 262
if (do.plot) {
-   hist(mz.sd.rec, xlab = "m/z SD", ylab = "Frequency",
+   hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
        main = "m/z SD distribution")
    hist(apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE),
         xlab = "Retention time SD", ylab = "Frequency",

Member:
This could be moved to the plotting code.

Author:
Same as with @zargham-ahmad's comment - do you have a suggestion? I think we would just end up creating an artificial function that takes the same arguments as hist.

Comment on lines +109 to +110
warning("Automatic tolerance finding failed, 10 ppm was assigned.
May need to manually assign alignment mz tolerance level.")

Member (@maximskorik, Aug 9, 2022):
This is something we should maybe address: I doubt anyone will see this warning, given that most of our test cases produce dozens of warnings.

Author:
Do you think the function should fail with an error? Or we could parametrise the whole function to either use the default value or fail with an error.
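A hedged sketch of that parametrisation idea; the function name resolve_mz_tolerance and its arguments are hypothetical, not code from this PR:

resolve_mz_tolerance <- function(estimated_tolerance, fail_on_error = FALSE) {
  if (!is.na(estimated_tolerance)) {
    return(estimated_tolerance)
  }
  if (fail_on_error) {
    stop("Automatic tolerance finding failed. Please assign the alignment mz tolerance manually.")
  }
  warning("Automatic tolerance finding failed, 10 ppm was assigned. May need to manually assign alignment mz tolerance level.")
  return(10 * 1e-6)  # fall back to 10 ppm, expressed as a relative tolerance
}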

Member:
No, I don't think an error would be an appropriate solution. I just wanted to point out that apLCMS may produce a meaningful warning, so we should make sure it doesn't get lost in a bunch of deprecation warnings and the like. I think we just have to reduce the number of other warnings (deprecation, imports, and so on). Should be done with #93.

Comment on lines 156 to 157
# order them again? should be ordered already...
sample_grouped <- sample_grouped[order(sample_grouped[, 1], sample_grouped[, 2]),]

Member:
Did you try to run tests without this line?

Author (@xtrojak, Aug 10, 2022):
Good point - without that line the test case fails, so I guess they weren't really ordered...

@hechth merged commit cf207d7 into RECETOX:master on Aug 10, 2022.
Successfully merging this pull request may close these issues:

refactor feature.align.R
4 participants