function to find top 10 recalibrant series #44

Merged: 28 commits, Sep 5, 2024

Conversation

KristinaGomoryova
Collaborator

This PR implements a function to find the 10 most suitable recalibrant series and resolves #22.

Member

@hechth left a comment

Good first step. Please see the comments on what should be changed.

R/FindRecalSeries.R (outdated)
Comment on lines 37 to 59
tolerance <- 100
global_min <- min(df$Min.Mass.Range) + tolerance
global_max <- max(df$Max.Mass.Range) - tolerance

# Create all combinations of ions
iter <- combinations(nrow(df), 5, v = 1:nrow(df))

# Helper dataframe with information which combinations do cover range
coversRange <- data.frame(iter, coversRange = 0)

# Check if the combinations cover the whole data range
for (i in 1:nrow(iter)) {
  comb <- iter[i, ]
  subset <- df[comb, ]
  local_min <- min(subset$Min.Mass.Range)
  local_max <- max(subset$Max.Mass.Range)
  if (local_min <= global_min & local_max >= global_max) {
    coversRange$coversRange[i] <- 1
  }
}

# Subset only those, which cover whole range
coversRangeTrue <- coversRange[coversRange$coversRange == 1, ]
Member

This should also be its own function, which creates the combinations and filters them based on the coverage criteria. The tall peaks every 100 m/z should also be included as an optional criterion, so a boolean parameter should be added.
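A minimal sketch of what such a function could look like, under stated assumptions: the function name, its arguments, and the `Tall.Peak` column (taken here to hold the m/z of each series' tallest peak) are hypothetical and not part of the PR. The tall-peak criterion is interpreted as "every bin of `bin_width` m/z inside the global range contains at least one tall peak", which may differ from what the reviewer has in mind. Base R's `utils::combn` stands in for `combinations()`.

```r
# Sketch only: names and the Tall.Peak column are assumptions, not the PR's API.
filter_combinations <- function(df, size = 5, tolerance = 100,
                                require_tall_peaks = FALSE, bin_width = 100) {
  global_min <- min(df$Min.Mass.Range) + tolerance
  global_max <- max(df$Max.Mass.Range) - tolerance

  # One row per combination of row indices into df
  combs <- t(utils::combn(nrow(df), size))

  keep <- apply(combs, 1, function(idx) {
    sub <- df[idx, , drop = FALSE]
    covers <- min(sub$Min.Mass.Range) <= global_min &&
      max(sub$Max.Mass.Range) >= global_max
    if (!covers || !require_tall_peaks) {
      return(covers)
    }
    # Optional criterion: each bin of bin_width m/z holds at least one tall peak
    n_bins <- ceiling((global_max - global_min) / bin_width)
    bins <- floor((sub$Tall.Peak - global_min) / bin_width) + 1
    all(seq_len(n_bins) %in% bins)
  })

  # Matrix of the combinations that pass the selected criteria
  combs[keep, , drop = FALSE]
}
```

Calling `filter_combinations(df, require_tall_peaks = TRUE)` would then switch on the optional criterion, while the default keeps the current behaviour of checking the range only.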

R/FindRecalSeries.R (outdated)
Comment on lines 86 to 91
for (i in 1:nrow(coversRangeTrue)) {
  comb <- iter[i, ]
  subset <- df[comb, ]
  comb_score <- score_combination(subset)
  scores <- append(scores, list(comb_score))
}
Member

This section can be parallelized into a single function call.

Comment on lines 93 to 94
# Append all scored combinations into a dataframe
scores_df <- do.call(rbind, lapply(scores, as.data.frame))
Member

This can be part of the combine statement of the parallel for loop.
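One possible shape for the parallel scoring and the row binding, sketched with base R's `parallel` package. The `score_combination()` body below is a hypothetical stand-in (the PR's real scoring is not shown here), and `score_all`, `Series`, and `Abundance` are assumed names. With the `foreach` package, the `rbind` step would instead go into its `.combine` argument, which may be what the comment refers to.

```r
library(parallel)

# Hypothetical stand-in for the PR's score_combination(); the real scoring differs.
score_combination <- function(subset) {
  data.frame(
    series = paste(subset$Series, collapse = ";"),
    total_abundance = sum(subset$Abundance)
  )
}

# Score every combination (rows of combs index rows of df) and bind the
# per-combination results into one data frame.
score_all <- function(df, combs, cores = 1L) {
  scores <- mclapply(
    seq_len(nrow(combs)),
    function(i) score_combination(df[combs[i, ], , drop = FALSE]),
    mc.cores = cores
  )
  do.call(rbind, scores)
}
```

`mclapply` relies on forking, so `cores` greater than 1 only works on Unix-like systems; on Windows a cluster-based call such as `parLapply` would be needed instead.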

Comment on lines 51 to 55
local_min <- min(subset$Min.Mass.Range)
local_max <- max(subset$Max.Mass.Range)
if (local_min <= global_min & local_max >= global_max) {
  coversRange$coversRange[i] <- 1
}
Member

This first rough check should be removed and replaced with the more detailed check using the actual coverage percentage, which is currently part of the score calculation.
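To illustrate why the rough check is weaker than a coverage percentage, here is a sketch under an assumed definition: coverage is the share of the global m/z window covered by the union of the series' mass ranges. The function name and this definition are assumptions; the PR's own calculation inside the scoring may differ.

```r
# Sketch only: an assumed definition of coverage, not the PR's implementation.
coverage_percent <- function(subset, global_min, global_max) {
  # Clip each series range to the global window and drop empty ranges
  starts <- pmax(subset$Min.Mass.Range, global_min)
  ends <- pmin(subset$Max.Mass.Range, global_max)
  valid <- starts < ends
  ord <- order(starts[valid])
  starts <- starts[valid][ord]
  ends <- ends[valid][ord]

  covered <- 0
  reach <- global_min  # right edge of the region counted so far
  for (k in seq_along(starts)) {
    if (ends[k] > reach) {
      # Count only the part of this range that extends past the current reach
      covered <- covered + ends[k] - max(starts[k], reach)
      reach <- ends[k]
    }
  }
  100 * covered / (global_max - global_min)
}
```

For two series spanning 100 to 250 and 300 to 600 with a global window of 200 to 600, the min/max check passes, but this function returns 87.5 because of the gap between 250 and 300, so a `coverage_percent > 90` filter would reject the combination.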

Comment on lines 96 to 103
# Filter for the 10 top scoring series
finalSeries <- scores_df %>%
  filter(coverage_percent > 90) %>%
  rowwise() %>%
  mutate(sum_score = sum(total_abundance, total_series_length, peak_proximity, peak_distance_proximity, coverage_percent)) %>%
  arrange(desc(sum_score)) %>%
  filter(!duplicated(series)) %>%
  head(10)
Member

This should be moved into its own function to choose the best series, and the coverage percent should be part of the combination filtering.
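A sketch of how the selection step might be split out, assuming the coverage filter has already been applied upstream. The function name `choose_best_series` and the `n` argument are hypothetical; the score column names are the ones used in the snippet under review. Base R is used here instead of the dplyr pipeline so the example has no package dependencies.

```r
# Sketch only: function and argument names are hypothetical.
choose_best_series <- function(scores_df, n = 10) {
  score_cols <- c("total_abundance", "total_series_length", "peak_proximity",
                  "peak_distance_proximity", "coverage_percent")

  # Row-wise sum of the individual scores
  scores_df$sum_score <- rowSums(scores_df[, score_cols, drop = FALSE])

  # Highest total first, keeping only the best-scoring row per series
  ranked <- scores_df[order(scores_df$sum_score, decreasing = TRUE), , drop = FALSE]
  ranked <- ranked[!duplicated(ranked$series), , drop = FALSE]

  # n = NULL returns every ranked series instead of a fixed number
  if (is.null(n)) ranked else head(ranked, n)
}
```

Keeping `n` as a parameter leaves the caller free to ask for all ranked series or a different cutoff than 10.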

head(10)

# Return the top scoring series
return(finalSeries)
Member

This should be kept as the main output. Maybe we can also do something with the scoring, such as creating a report?

mutate(sum_score = sum(total_abundance, total_series_length, peak_proximity, peak_distance_proximity, coverage_percent)) %>%
arrange(desc(sum_score)) %>%
filter(!duplicated(series)) %>%
head(10)
Member

Filling the result up to 10 series should be removed or made optional.

R/FindRecalSeries.R (outdated)
Comment on lines 20 to 22
#' @param df An output from RecalList, containing recalibrant CH2 series.
#' @return A dataframe of 10 best-scoring series.

Member

Missing descriptions for the other parameters.

@hechth merged commit e08ee5f into RECETOX:master on Sep 5, 2024
3 checks passed
Development

Successfully merging this pull request may close these issues.

Investigate the calibration functions for which series to use