Align features refactoring (vol. 2) #88

Merged 9 commits into RECETOX:master on Aug 10, 2022

Conversation

@xtrojak commented on Aug 4, 2022:

This is the next iteration of the align.features refactoring, this time focusing on code readability and comprehensibility (the second step in #37). It also refactors several of the functions being called, especially those for calculating the rt and m/z relative tolerances.

After this is merged, we can start with #87.

Closes #37.

@hechth linked an issue (2 tasks) on Aug 5, 2022 that may be closed by this pull request.
Comment on lines +259 to 262
hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
     main = "m/z SD distribution")
hist(apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE),
     xlab = "Retention time SD", ylab = "Frequency",

@zargham-ahmad:
can be moved to plot.R

Author:
Do you have any suggestions? I think we would just end up creating an artificial function that takes the same arguments as hist.
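For illustration, a minimal sketch of what such a wrapper in plot.R could look like; the name plot_sd_histograms, its arguments, and the second histogram's title are hypothetical, not code from this PR:

# Hypothetical helper for plot.R, wrapping the two histograms drawn after alignment.
plot_sd_histograms <- function(mz_sd, rt_sd) {
  hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
       main = "m/z SD distribution")
  hist(rt_sd, xlab = "Retention time SD", ylab = "Frequency",
       main = "Retention time SD distribution")
}

# The call site would then reduce to something like:
# plot_sd_histograms(mz_sd, apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE))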

Comment on lines 3 to 8
if (is.null(nrow(pick))) {
  # this is very strange if it can ever happen
  # maybe commas are missing? we want the same as below
  # but also why if there are no rows...
  strengths[pick[6]] <- pick[5]
  return(c(pick[1], pick[2], pick[1], pick[1], strengths))

Member:
I guess the problem lies in the fact that nrow(pick) might be NULL if pick has the wrong type - meaning it is not a table but a list or similar - and then it actually behaves differently.

Member:
Exactly. That also happens with the nrow check in prof.to.features.

Author:
Okay, that makes sense; it was just suspicious to me.
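For reference, a minimal illustration of the behaviour discussed above, in plain base R (not code from this PR):

pick_matrix <- matrix(1:12, nrow = 2)  # table-like input: nrow() returns a number
pick_vector <- 1:6                     # a single row collapsed into a plain vector

nrow(pick_matrix)  # 2
nrow(pick_vector)  # NULL, which is exactly what the is.null(nrow(pick)) branch detects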


create_output <- function(sample_grouped, number_of_samples, deviation) {
  return(c(to_attach(sample_grouped, number_of_samples, use = "sum"),
           to_attach(sample_grouped[, c(1, 2, 3, 4, 2, 6)], number_of_samples, use = "median"),

Member:
What is the role of the vector with swapped indices here?

Author:
I'm not sure I understand the question. to_attach is used to compute the sum of areas per sample (first call) and the retention times (second call) - that's why in the second call column 2 (RTs) is passed instead of column 5 (areas). I want to simplify this final step, and also the outputs of the alignment step, as part of #87.
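A toy illustration of the index swap; the matrix and its column names are assumptions for readability, not data from the PR:

m <- cbind(mz = c(100.1, 100.1), rt = c(12.3, 12.5),
           min_mz = c(100.0, 100.0), max_mz = c(100.2, 100.2),
           area = c(5e4, 7e4), sample_id = c(1, 2))

m[, c(1, 2, 3, 4, 5, 6)]  # areas sit in position 5, aggregated by the use = "sum" call
m[, c(1, 2, 3, 4, 2, 6)]  # rt duplicated into position 5, so the use = "median" call summarizes RTs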

validate_contents <- function(samples, min_occurrence) {
  # validate whether data is still from at least 'min_occurrence' number of samples
  if (!is.null(nrow(samples))) {
    if (length(unique(samples[, 6])) >= min_occurrence) {

Member:
Indexing should be done via column name rather than numeric index.

Author:
At this point, samples is just a list of vectors with no column names.

Member:
Oh - okay.

Comment on lines +47 to +53
find_optima <- function(data, bandwidth) {
  # Kernel Density Estimation
  den <- density(data, bw = bandwidth)
  # select statistically significant points
  turns <- find.turn.point(den$y)
  return(list(peaks = den$x[turns$pks], valleys = den$x[turns$vlys]))
}

Member:
This function was also extracted in many other refactorings.

Author:
Let's address this together in #65.

Comment on lines +56 to +62
filter_based_on_density <- function(sample, turns, index, i) {
  # select data within lower and upper bound from density estimation
  lower_bound <- max(turns$valleys[turns$valleys < turns$peaks[i]])
  upper_bound <- min(turns$valleys[turns$valleys > turns$peaks[i]])
  selected <- which(sample[, index] > lower_bound & sample[, index] <= upper_bound)
  return(sample[selected, ])
}

Member:
Same - this function should be re-used in adaptive.bin and recover.weaker.

Author:
Again, should be properly fixed as part of #65.

Comment on lines +138 to +142
# sort all values by m/z, if equal by rt
ordering <- order(mz_values, rt)
mz_values <- mz_values[ordering]
rt <- rt[ordering]
sample_id <- sample_id[ordering]

Member:
Data could be arranged in a tibble or data.frame.
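A hedged sketch of that suggestion using a plain data.frame (a tibble with dplyr::arrange would look similar); the variable name features is illustrative:

features <- data.frame(mz = mz_values, rt = rt, sample_id = sample_id)
# sort by m/z, breaking ties by rt, in a single step
features <- features[order(features$mz, features$rt), ]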

Comment on lines +242 to +245
pk.times <- aligned_features[, (5 + number_of_samples):(2 * (4 + number_of_samples))]
mz_sd <- aligned_features[, ncol(aligned_features)]
# select columns: average of m/z, average of rt, min of m/z, max of m/z, sum of areas per sample (the first to_attach call)
aligned_features <- aligned_features[, 1:(4 + number_of_samples)]

Member:
The computations for the column numbers could be extracted into variables to clearly indicate what is being used and why.

Author:
This will be changed in #87 anyway.
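A hedged sketch of the reviewer's suggestion; the variable names are illustrative, while the column layout is taken from the excerpt above:

n_annotation_cols <- 4                              # average m/z, average rt, min m/z, max m/z
areas_end <- n_annotation_cols + number_of_samples  # annotation columns + per-sample area sums
rt_start <- areas_end + 1
rt_end <- 2 * areas_end                             # second to_attach block: annotation columns + per-sample RTs

pk.times <- aligned_features[, rt_start:rt_end]
mz_sd <- aligned_features[, ncol(aligned_features)]
aligned_features <- aligned_features[, 1:areas_end]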

Comment on lines 258 to 262
if (do.plot) {
-   hist(mz.sd.rec, xlab = "m/z SD", ylab = "Frequency",
+   hist(mz_sd, xlab = "m/z SD", ylab = "Frequency",
        main = "m/z SD distribution")
    hist(apply(pk.times[, -1:-4], 1, sd, na.rm = TRUE),
         xlab = "Retention time SD", ylab = "Frequency",

Member:
This could be moved to the plotting code.

Author:
Same as with @zargham-ahmad's comment - do you have a suggestion? I think we would just end up creating an artificial function that takes the same arguments as hist.

Comment on lines +109 to +110
warning("Automatic tolerance finding failed, 10 ppm was assigned.
May need to manually assign alignment mz tolerance level.")

Member (@maximskorik, Aug 9, 2022):
This is something we should maybe address: I doubt anyone will see this warning, given that most of our test cases produce dozens of warnings.

Author:
Do you think the function should fail with an error? Or we could parametrise the whole function to either use the default value or fail with an error.
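A hedged sketch of that parametrisation idea; the function name resolve_mz_tolerance and its arguments are hypothetical, not code from this PR:

resolve_mz_tolerance <- function(estimated_tolerance, fail_on_error = FALSE) {
  if (!is.na(estimated_tolerance)) {
    return(estimated_tolerance)
  }
  if (fail_on_error) {
    stop("Automatic tolerance finding failed. Please assign the alignment mz tolerance manually.")
  }
  warning("Automatic tolerance finding failed, 10 ppm was assigned. May need to manually assign alignment mz tolerance level.")
  return(10 * 1e-6)  # fall back to 10 ppm, expressed as a relative tolerance
}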

Member:
No, I don't think an error would be an appropriate solution. I just wanted to point out that apLCMS may produce a meaningful warning, so we should make sure it doesn't get lost in a bunch of deprecation warnings and the like. I think we just have to reduce the number of other warnings (deprecation, imports, and so on). Should be done with #93.

Comment on lines 156 to 157
# order them again? should be ordered already...
sample_grouped <- sample_grouped[order(sample_grouped[, 1], sample_grouped[, 2]),]

Member:
Did you try to run tests without this line?

Author (@xtrojak, Aug 10, 2022):
Good point - without that line the test case fails, so I guess they weren't really ordered...

@hechth merged commit cf207d7 into RECETOX:master on Aug 10, 2022.
Successfully merging this pull request may close these issues:

refactor feature.align.R
4 participants