Mann whitney test (#518)
* evidence import

* tests

* filter proteins

* filter samples

* Select Peptides by group or sample for coverage visualisation (#410)

* untested working, needs cleanup

* linter

* remove duplicate method (merge screw-up)

* add doc strings, fix first few tests (miss. args)

* metadata df now optional param + adjusted asserts, fixed test

* formatting

* conditional metadata input for protein graphs

* import logging levels, formatting

* add grouping tests for peptides, add error handling

* spelling, formatting

* remove unused import, format

* fix error msgs in tests

* fix line length

* Convert has_metadata to property

* Use logger instead of print statement

* An attempt at consistent text-casing

* formatting

---------

Co-authored-by: henninggaertner <[email protected]>

* tests for filter_proteins with peptide file

* tests for filter_proteins with peptide file, better mock dataframe

* output types

* added testing that only filtered Proteins are removed from peptides_df

* tests for filter_samples with peptide file

* extract test that peptide filtering matches protein filtering into Method

* refactor: shortened tests by integrating extracted method and mock data

* exclude peptides

* tests

* minor fixes in test

* minor fixes in test

* fix test that peptide filtering matches protein filtering Method

* fix test that peptide filtering matches protein filtering Method

* fix test that peptide filtering matches protein filtering Method

* fix test that peptide filtering matches protein filtering Method

* match mock protein dataframe to mock peptide dataframe

* extract test that peptide filtering matches protein filtering into Method

* implementation

* tests

* tests improve error logging

* tests git testing

* peptide import added missing column

* implements form, form_mapping, method and the actual functionality (untested)

* merge peptide filtering and outlier_detection

* merge peptide transformation

* implement passing of peptide data between preprocessing steps

* complete implement passing of peptide data between preprocessing steps

* implement overhead for new step evidence import

* enable selecting multiple proteins of interest and implements tests

* Fix EvidenceImport class (now the right method is called)

* implement option to choose protein from list of significant proteins

* implement option to let the system choose the most expressed protein

* include messages for peptide filtering

* remove selecting multiple proteins and add substring search

* add cleaning of protein groups to evidence import

* implement suggested changes: remove default values and correct grammar mistake

* make preprocessing work without peptide_df

* clean up

* add cleaning of protein groups to evidence import

* rename

* clean Protein IDs for peptide_import and name intensity column Intensity

* implement requested changes: make peptide_df input optional for all preprocessing steps, remove peptide_df input in preprocessing steps, where it is not used

* implement requested changes: make peptide_df input optional for all preprocessing steps, remove peptide_df input in preprocessing steps, where it is not used

* unify into one method

* cleanup old names

* cleanup names

* cleanup names

* setup form and method

* small refactor: utilize fill_helper better

* implemented method

* cleanup rename and docstring

* tests and named sample column from "0" to "Sample"

* implemented slow version

* enable selecting multiple proteins

* complete merge

* rename file and fix selecting multiple proteins

* improve efficiency of methods ptms_per_sample and ptms_per_protein_and_sample

* added test for ptms_per_protein_and_sample

* complete merge (fix tests after merge)

* cleanup

* implement differential expression with mann-whitney-test on ptm data

* implement differential expression with mann-whitney-test on protein data

* refactor input and output handling for Mann-Whitney-test on ptm data

* add additional parameter from user and docstrings

* fix typo

* add test for mann whitney on intensity data

* add test for mann whitney on ptm data

* add mann whitney to volcano plot form

* add missing value to output_keys of mann whitney test

* Implement requested changes

---------

Co-authored-by: janni.roebbecke <[email protected]>
JanniRoebbecke and janni.roebbecke authored Aug 14, 2024
1 parent 81e30c0 commit e6c29e2
Showing 5 changed files with 277 additions and 46 deletions.
115 changes: 102 additions & 13 deletions protzilla/data_analysis/differential_expression_mann_whitney.py
@@ -16,8 +16,34 @@ def mann_whitney_test_on_intensity_data(
group2: str,
log_base: str = None,
alpha=0.05,
- multiple_testing_correction_method: str = "",
+ multiple_testing_correction_method: str = "Benjamini-Hochberg",
+ p_value_calculation_method: str = "auto"
) -> dict:
"""
Perform Mann-Whitney U test on all proteins in the given intensity data frame.
:param intensity_df: A protein dataframe in typical PROTzilla long format.
:param metadata_df: The metadata data frame containing the grouping information.
:param grouping: The column name in the metadata data frame that contains the grouping information,
that should be used.
:param group1: The name of the first group for the Mann-Whitney U test.
:param group2: The name of the second group for the Mann-Whitney U test.
:param log_base: The base of the logarithm that was used to transform the data.
:param alpha: The significance level for the test.
:param multiple_testing_correction_method: The method for multiple testing correction.
:param p_value_calculation_method: The method for p-value calculation.
:return: a dict containing
- a df differentially_expressed_proteins_df in long format containing all test results
- a df significant_proteins_df, containing the proteins of differentially_expressed_proteins_df,
that are significant after multiple testing correction
- a df corrected_p_values, containing the p_values after application of multiple testing correction
- a df log2_fold_change, containing the log2 fold changes per protein
- a df u_statistic_df, containing the u-statistic per protein
- a float corrected_alpha, containing the alpha value after application of multiple testing correction
(depending on the selected multiple testing correction method corrected_alpha may be equal to alpha)
- a list messages (optional), containing messages for the user
"""
wide_df = long_to_wide(intensity_df)

outputs = mann_whitney_test_on_columns(
@@ -30,6 +56,7 @@ def mann_whitney_test_on_intensity_data(
alpha=alpha,
multiple_testing_correction_method=multiple_testing_correction_method,
columns_name="Protein ID",
p_value_calculation_method=p_value_calculation_method
)
differentially_expressed_proteins_df = pd.merge(intensity_df, outputs["differential_expressed_columns_df"], on="Protein ID", how="left")
differentially_expressed_proteins_df = differentially_expressed_proteins_df.loc[
@@ -50,6 +77,65 @@ def mann_whitney_test_on_intensity_data(
messages=outputs["messages"],
)

def mann_whitney_test_on_ptm_data(
ptm_df: pd.DataFrame,
metadata_df: pd.DataFrame,
grouping: str,
group1: str,
group2: str,
alpha=0.05,
multiple_testing_correction_method: str = "Benjamini-Hochberg",
p_value_calculation_method: str = "auto"
) -> dict:
"""
Perform Mann-Whitney U test on all PTMs in the given PTM data frame.
:param ptm_df: The data frame containing the PTM data in columns and a
"Sample" column that can be mapped to the metadata, to assign the groups.
:param metadata_df: The metadata data frame containing the grouping information.
:param grouping: The column name in the metadata data frame that contains the grouping information,
that should be used.
:param group1: The name of the first group for the Mann-Whitney U test.
:param group2: The name of the second group for the Mann-Whitney U test.
:param alpha: The significance level for the test.
:param multiple_testing_correction_method: The method for multiple testing correction.
:param p_value_calculation_method: The method for p-value calculation.
:return: a dict containing
- a df differentially_expressed_ptm_df in wide format containing all test results
- a df significant_ptm_df, containing the PTMs of differentially_expressed_ptm_df,
that are significant after multiple testing correction
- a df corrected_p_values, containing the p_values after application of multiple testing correction,
- a df log2_fold_change, containing the log2 fold changes per column,
- a df u_statistic_df, containing the U statistic per PTM,
- a float corrected_alpha, containing the alpha value after application of multiple testing correction (depending on the selected multiple testing correction method corrected_alpha may be equal to alpha),
- a list messages, containing messages for the user
"""
output = mann_whitney_test_on_columns(
df=ptm_df,
metadata_df=metadata_df,
grouping=grouping,
group1=group1,
group2=group2,
log_base=None,
alpha=alpha,
multiple_testing_correction_method=multiple_testing_correction_method,
columns_name="PTM",
p_value_calculation_method=p_value_calculation_method
)

return dict(
differentially_expressed_ptm_df=output["differential_expressed_columns_df"],
significant_ptm_df=output["significant_columns_df"],
corrected_p_values_df=output["corrected_p_values_df"],
u_statistic_df=output["u_statistic_df"],
log2_fold_change_df=output["log2_fold_change_df"],
corrected_alpha=output["corrected_alpha"],
messages=output["messages"],
)
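
Both entry points now default to `"Benjamini-Hochberg"` for multiple testing correction. For orientation, the textbook Benjamini-Hochberg adjustment can be sketched in plain Python as below. This is an illustrative stand-in, not the helper PROTzilla actually calls, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # stay monotone: each one is capped by the adjusted value above it.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A column would then count as significant when its adjusted p-value is at most `alpha`; in that formulation `alpha` itself stays unchanged.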


def mann_whitney_test_on_columns(
df: pd.DataFrame,
metadata_df: pd.DataFrame,
@@ -58,25 +144,28 @@ def mann_whitney_test_on_columns(
group2: str,
log_base: str = None,
alpha=0.05,
- multiple_testing_correction_method: str = "",
+ multiple_testing_correction_method: str = "Benjamini-Hochberg",
  columns_name: str = "Protein ID",
+ p_value_calculation_method: str = "auto"
) -> dict:
"""
Perform Mann-Whitney U test on all columns of the data frame.
- @param df: The data frame containing the data in columns and a
+ :param df: The data frame containing the data in columns and a
  "Sample" column that can be mapped to the metadata, to assign the groups.
- @param metadata_df: The metadata data frame containing the grouping information.
- @param grouping: The column name in the metadata data frame that contains the grouping information,
+ :param metadata_df: The metadata data frame containing the grouping information.
+ :param grouping: The column name in the metadata data frame that contains the grouping information,
  that should be used.
- @param group1: The name of the first group for the Mann-Whitney U test.
- @param group2: The name of the second group for the Mann-Whitney U test.
- @param log_base: The base of the logarithm that was used to transform the data.
- @param alpha: The significance level for the test.
- @param multiple_testing_correction_method: The method for multiple testing correction.
+ :param group1: The name of the first group for the Mann-Whitney U test.
+ :param group2: The name of the second group for the Mann-Whitney U test.
+ :param log_base: The base of the logarithm that was used to transform the data.
+ :param alpha: The significance level for the test.
+ :param multiple_testing_correction_method: The method for multiple testing correction.
+ :param columns_name: The semantics of the column names. This is used to name the columns in the output data frames.
+ :param p_value_calculation_method: The method for p-value calculation.
:return: a dict containing
- - a df differentially_expressed_column_df in wide format containing the t-test results
+ - a df differentially_expressed_column_df in wide format containing the test results
- a df significant_columns_df, containing the columns of differentially_expressed_column_df,
that are significant after multiple testing correction
- a df corrected_p_values, containing the p_values after application of multiple testing correction,
@@ -104,7 +193,7 @@ def mann_whitney_test_on_columns(
for column in data_columns:
group1_data = df_with_groups[df_with_groups[grouping] == group1][column]
group2_data = df_with_groups[df_with_groups[grouping] == group2][column]
- u_statistic, p_value = stats.mannwhitneyu(group1_data, group2_data, alternative="two-sided")
+ u_statistic, p_value = stats.mannwhitneyu(group1_data, group2_data, alternative="two-sided", method=p_value_calculation_method)

if not np.isnan(p_value):
log2_fold_change = (
@@ -139,7 +228,7 @@ def mann_whitney_test_on_columns(
)
u_statistic_df = pd.DataFrame(
list(zip(valid_columns, u_statistics)),
- columns=[columns_name, "t_statistic"],
+ columns=[columns_name, "u_statistic"],
)

combined_df = pd.DataFrame(
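
The new `p_value_calculation_method` argument is forwarded to `stats.mannwhitneyu` as `method`, for which SciPy documents `"auto"`, `"exact"` and `"asymptotic"`. As a rough illustration of what the asymptotic variant computes, here is a standard-library sketch using midranks, a tie-corrected variance and a continuity correction. It is not SciPy's implementation; the function name `mann_whitney_u` and the choice to return NaN when the variance is zero are assumptions made for this example, and both samples are assumed to be non-empty.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(list(x) + list(y))

    # Assign midranks: tied values share the average of the ranks they occupy.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t  # contribution of this tie group to the variance correction
        i = j

    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1

    mean_u = n1 * n2 / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        # All observations are identical; this sketch reports NaN in that case.
        return u1, float("nan")

    # Continuity-corrected z-score of the larger U, then a two-sided tail.
    z = (max(u1, u2) - mean_u - 0.5) / math.sqrt(variance)
    p_value = min(1.0, math.erfc(z / math.sqrt(2)))  # erfc(z / sqrt(2)) = 2 * P(Z >= z)
    return u1, p_value
```

For two fully separated groups of three observations each, this approximation gives a p-value of about 0.08, whereas the exact permutation distribution gives 0.1, which is one reason the choice of method matters for small samples.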
26 changes: 9 additions & 17 deletions protzilla/methods/data_analysis.py
@@ -9,9 +9,7 @@
from protzilla.data_analysis.differential_expression_anova import anova
from protzilla.data_analysis.differential_expression_linear_model import linear_model
from protzilla.data_analysis.differential_expression_mann_whitney import (
- mann_whitney_test_on_columns,
- mann_whitney_test_on_intensity_data,
- )
+ mann_whitney_test_on_intensity_data, mann_whitney_test_on_ptm_data)
from protzilla.data_analysis.differential_expression_t_test import t_test
from protzilla.data_analysis.dimension_reduction import t_sne, umap
from protzilla.data_analysis.model_evaluation import evaluate_classification_model
@@ -175,11 +173,13 @@ class DifferentialExpressionMannWhitneyOnIntensity(DataAnalysisStep):
"group2",
"alpha",
"multiple_testing_correction_method",
"p_value_calculation_method",
]
output_keys = [
"differentially_expressed_proteins_df",
"significant_proteins_df",
"corrected_p_values_df",
"u_statistic_df",
"log2_fold_change_df",
"corrected_alpha",
]
@@ -209,40 +209,32 @@ class DifferentialExpressionMannWhitneyOnPTM(DataAnalysisStep):
)

input_keys = [
- "df",
+ "ptm_df",
"metadata_df",
"grouping",
"group1",
"group2",
"alpha",
"multiple_testing_correction_method",
- "columns_name",
+ "p_value_calculation_method",
]
output_keys = [
"differentially_expressed_ptm_df",
"significant_ptm_df",
"corrected_p_values_df",
"u_statistic_df",
"log2_fold_change_df",
"corrected_alpha",
]

def method(self, inputs: dict) -> dict:
- return mann_whitney_test_on_columns(**inputs)
+ return mann_whitney_test_on_ptm_data(**inputs)

def insert_dataframes(self, steps: StepManager, inputs) -> dict:
- inputs["df"] = steps.get_step_output(Step, "ptm_df", inputs["ptm_df"])
- inputs["columns_name"] = "PTM"
+ inputs["ptm_df"] = steps.get_step_output(Step, "ptm_df", inputs["ptm_df"])
inputs["metadata_df"] = steps.metadata_df
inputs["log_base"] = steps.get_step_input(TransformationLog, "log_base")
return inputs

- def handle_outputs(self, outputs: dict) -> None:
- outputs["differentially_expressed_ptm_df"] = outputs.pop(
- "differential_expressed_columns_df", None
- )
- outputs["significant_ptm_df"] = outputs.pop("significant_columns_df", None)
- super().handle_outputs(outputs)



class PlotVolcano(PlotStep):
display_name = "Volcano Plot"
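
The change to `DifferentialExpressionMannWhitneyOnPTM` moves the output-key renaming out of the step's `handle_outputs` and into a dedicated wrapper function, so the step can call the wrapper directly with `**inputs`. A minimal sketch of that pattern follows. The stub `generic_test_on_columns`, the wrapper name `run_on_ptm_data` and the placeholder return values are hypothetical and only stand in for the real PROTzilla functions.

```python
def generic_test_on_columns(df, columns_name="Protein ID"):
    """Hypothetical stand-in for a generic per-column test that returns generic keys."""
    return {
        "differential_expressed_columns_df": [(columns_name, name) for name in df],
        "significant_columns_df": [],
        "messages": [],
    }


def run_on_ptm_data(ptm_df):
    """Domain-specific wrapper: fixes columns_name and renames the output keys."""
    output = generic_test_on_columns(ptm_df, columns_name="PTM")
    return dict(
        differentially_expressed_ptm_df=output["differential_expressed_columns_df"],
        significant_ptm_df=output["significant_columns_df"],
        messages=output["messages"],
    )
```

Keeping the renaming in one function means callers outside the step machinery receive the same PTM-specific keys as the step does.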
2 changes: 1 addition & 1 deletion protzilla/utilities/transform_dfs.py
@@ -3,7 +3,7 @@
from protzilla.utilities import default_intensity_column


- def long_to_wide(intensity_df: pd.DataFrame, value_name: str = None):
+ def long_to_wide(intensity_df: pd.DataFrame, value_name: str | None = None):
"""
This function transforms the dataframe to a wide format that
can be more easily handled by packages such as sklearn.