Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

function to find top 10 recalibrant series #44

Merged
merged 28 commits into from
Sep 5, 2024
Merged
Changes from 2 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
9503c51
function to find top 10 recalibrant series
KristinaGomoryova Aug 22, 2024
4502399
typos corrected
KristinaGomoryova Aug 23, 2024
52978ba
filter_input as a function
KristinaGomoryova Sep 2, 2024
99a1a37
description of the filter_input function added
KristinaGomoryova Sep 2, 2024
7ae9894
description of filter_input function added
KristinaGomoryova Sep 2, 2024
bc75cf5
compute_scores moved out of top-level
KristinaGomoryova Sep 2, 2024
eb6893e
number of combinations and tolerance as arguments
KristinaGomoryova Sep 2, 2024
49dba75
abundance_score_threshold and peak_distance_threshold as global argum…
KristinaGomoryova Sep 2, 2024
f65564b
filter_input changed to filter_recal_series
KristinaGomoryova Sep 2, 2024
60898f7
test for compute_coverage
KristinaGomoryova Sep 2, 2024
4368a58
test and test data for compute_coverage function
KristinaGomoryova Sep 3, 2024
368abe0
gtools added as dependency
KristinaGomoryova Sep 3, 2024
4c23182
tests for the compute_combinations and compute_subsets
KristinaGomoryova Sep 3, 2024
8a9445d
test for selecting final series
KristinaGomoryova Sep 3, 2024
7497e64
test data
KristinaGomoryova Sep 3, 2024
204450f
code refactored
KristinaGomoryova Sep 3, 2024
5f37bdd
test for filtering the input recal list
KristinaGomoryova Sep 4, 2024
f388877
functions documented
KristinaGomoryova Sep 4, 2024
a0c72c8
test for computing final scores
KristinaGomoryova Sep 4, 2024
6da3fca
test for compute_scores
KristinaGomoryova Sep 4, 2024
00fd06b
linting done
KristinaGomoryova Sep 4, 2024
066d94a
linting done
KristinaGomoryova Sep 4, 2024
495a0b4
fixed filling
hechth Sep 5, 2024
e1b7755
type annotations, slice_head
KristinaGomoryova Sep 5, 2024
3617285
Merge branch 'master' into findRecalSeries
hechth Sep 5, 2024
54613e2
arrange instead of order
KristinaGomoryova Sep 5, 2024
dcd4051
find_final_series works only partially
KristinaGomoryova Sep 5, 2024
0380c83
upsated test output to ungrouped df
hechth Sep 5, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
107 changes: 107 additions & 0 deletions R/FindRecalSeries.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
#' Attempts to find most suitable series for recalibration.
#'
#' This function takes on input the CH2 homologous recalibration series, which are provided by the RecalList function #' and tries to find the most suitable series combination for recalibration based on the following criteria:
#' 1) Series should cover the full mass spectral range
#' 2) Series should be optimally long and combined have a “Tall Peak” at least every 100 m/z.
#' 3) Abundance score: the higher, the better
#' 4) Peak score: the closer to 0, the better
#' 5) Peak Distance: the closer to 1, the better
#' 6) Series Score: the closer to this value, the better
#'
#' The recal function can take up to 10 series - due to the size of the search space when looking for combinations of 10
#' elements, a pre-filtering is done: only the series which have Abundance score > 100 are considered and the one #'
#' having Peak Distance < 2.
#' Combinations of 5 series are assembled, scores are computed for other metrics (in case of Peak proximity and Peak
#' distance, an inverted score is computed) and these are summed. Finally, top 10 unique series having the highest
#' score are outputted.
#'
#' Warning: this step is in general computationally demanding, for ~30 series it took around 30 min.
#'
#' @param df An output from RecalList, containing recalibrant CH2 series.
#' @return A dataframe of 10 best-scoring series.

findSeries <- function(df) {

# Arrange the data
df <- df %>%
separate(col = Mass.Range, into = c('Min.Mass.Range', 'Max.Mass.Range'), sep = "-") %>%
mutate(Min.Mass.Range = as.numeric(Min.Mass.Range),
Max.Mass.Range = as.numeric(Max.Mass.Range)) %>%
mutate(Series.Length = Max.Mass.Range - Min.Mass.Range) %>%
filter(Abundance.Score > 100) %>%
filter(Peak.Distance < 2)
KristinaGomoryova marked this conversation as resolved.
Show resolved Hide resolved
KristinaGomoryova marked this conversation as resolved.
Show resolved Hide resolved

# Compute the global minimum and maximum (range of a dataset)
# We need to add some tolerance, because there is low chance full 100% would be covered

tolerance <- 100
global_min <- min(df$Min.Mass.Range) + tolerance
global_max <- max(df$Max.Mass.Range) - tolerance

# Create all combinations of ions
iter <- combinations(nrow(df), 5, v = 1:nrow(df))

# Helper dataframe with information which combinations do cover range
coversRange <- data.frame(iter, coversRange = 0)

# Check if the combinations cover the whole data range
for (i in 1:nrow(iter)) {
comb <- iter[i, ]
subset <- df[comb, ]
local_min <- min(subset$Min.Mass.Range)
local_max <- max(subset$Max.Mass.Range)
if (local_min <= global_min & local_max >= global_max) {
coversRange$coversRange[i] <- 1
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This first rough check should actually be removed and replaced with the finer detailed check and the actual coverage percentage which is currently part of the score calculation.

}

# Subset only those, which cover whole range
coversRangeTrue <- coversRange[coversRange$coversRange == 1, ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be its own function which creates the combinations and filters them based on the coverage criteria. The tall peaks every 100mz should also be included as a optional criterion - so a boolean parameter should be added.


# Compute the scores
score_combination <- function(combination) {
series <- paste0(combination$Series)
total_abundance <- sum(combination$Abundance.Score)
total_series_length <- sum(combination$Series.Length)
peak_proximity <- sum(1/(combination$Peak.Score))
peak_distance_proximity <- sum(1/(combination$Peak.Distance - 1))
coverage <- sum(combination$Max.Mass.Range - pmax(combination$Min.Mass.Range, lag(combination$Max.Mass.Range, default = 0)))
coverage_percent <- coverage/((global_max+100) - (global_min-100))*100

return(list(
total_abundance = total_abundance,
total_series_length = total_series_length,
peak_proximity = peak_proximity,
peak_distance_proximity = peak_distance_proximity,
series = series,
coverage = coverage,
coverage_percent = coverage_percent
))
}
KristinaGomoryova marked this conversation as resolved.
Show resolved Hide resolved

# Create empty list for scored combinations
scores <- list()

# Iterate over combinations and score them
for (i in 1:nrow(coversRangeTrue)) {
comb <- iter[i, ]
subset <- df[comb, ]
comb_score <- score_combination(subset)
scores <- append(scores, list(comb_score))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section can be parallelized single function call


# Append all scored combinations into a dataframe
scores_df <- do.call(rbind, lapply(scores, as.data.frame))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be part of the combination statement of the parallel for loop


# Filter for the 10 top scoring series
finalSeries <- scores_df %>%
filter(coverage_percent > 90) %>%
rowwise() %>%
mutate(sum_score = sum(total_abundance, total_series_length, peak_proximity, peak_distance_proximity, coverage_percent)) %>%
arrange(desc(sum_score)) %>%
filter(!duplicated(series)) %>%
head(10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be moved into its own function to choose the best series and the coverage percent should be part of the combination filtering.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filling in to 10 should be removed or actually made optional.


# Return the top scoring series
return(finalSeries)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be kept as the main output, maybe we can also do something with the scoring and create some report or so?

}
Loading