diff --git a/README.md b/README.md index 56e1ef7..d374edb 100644 --- a/README.md +++ b/README.md @@ -147,12 +147,15 @@ $$ --> | **edge_list** | ***list***
A list of lists corresponding to a prior network involving predictors (nodes) and relationships among them (edges):
[[source1, target1, weight1], ..., [sourceZ, targetZ, weightZ]]. Here, weight1, ..., weightZ are optional. Nodes found in the `edge_list` are referred to as *network nodes* | | **beta_net** | ***float, default = 1***
Regularization parameter for network penalization: $\beta_{net} \geq 0$. | | **alpha_lasso** | ***float, default = 0.01***
A numerical regularization parameter for the lasso term ($\alpha_{lasso} \geq 0$), needed if `model_type = Lasso` (with `LassoCV`, NetREm determines it via cross-validation). Larger values typically reduce the number of final predictors in the model. | -| **default_edge_weight** | ***float, default = 0.1***
Default edge weight ($w$) assigned to any edge with missing weight | -| **degree_threshold** | ***float, default = 0.5***
Edges with weight $w$ > degree_threshold are counted as 1 towards the node degree $d$ | +| **default_edge_weight** | ***float, default = 0.01***
Default edge weight ($w$) assigned to any edge with missing weight | +| **edge_vals_for_d** | ***boolean, default = True***
If True, the node degree $d$ is calculated by summing the weights of the edges connected to each node | +| **w_transform_for_d** | ***string, default = "none"***
Other options are "sqrt", "square", "avg". Here, "none" means we add the weights of the edges that are connected to the node to get the node degree. These other options represent transformations that can be done on this sum to yield various other node degree values $d$ | +| **degree_threshold** | ***float, default = 0.5***
If *edge_vals_for_d* is False, then edges with weight $w$ > degree_threshold each count as 1 towards the node degree $d$ | | **gene_expression_nodes** | ***list, default = []***
A list of predictors (e.g. TFs) to use, typically found as columns in the training gene expression data $X_{train}$.
Any `gene_expression_nodes` not found in the `edge_list` are added internally to the prior network `edge_list` using pairwise `default_edge_weight`. Specifying `gene_expression_nodes` is *optional* but may boost the speed of training and fitting NetREm models (by adjusting the network prior at the beginning). Thus, if the gene expression data ($X$) is available, it is recommended to input `gene_expression_nodes`. Otherwise, NetREm automatically determines `gene_expression_nodes` when fitting the model with $X_{train}$ gene expression data (when *fit(X,y)* is called), but needs time to recalibrate the network prior based on the $X_{train}$ nodes and the value set for `overlapped_nodes_only`. | | **overlapped_nodes_only** | ***boolean, default = False***
This determines if NetREm should focus only on nodes common to the *network nodes* (from `edge_list`) and the gene expression data (based on `gene_expression_nodes`). Here, *network nodes* not found in the gene expression data are always removed. Priority is given to `gene_expression_nodes` since those have gene expression values that are used by the regression.
• If `overlapped_nodes_only = False`, the predictors will come from `gene_expression_nodes`, even if those are not found in the network `edge_list`. Some predictors may lack relationships in the prior network.
• If `overlapped_nodes_only = True`, the predictors used must be common nodes: *network nodes* that are also found in `gene_expression_nodes`.
See [overlapped_nodes_only.pdf](https://github.com/SaniyaKhullar/NetREm/blob/main/user_guide/overlapped_nodes_only.pdf) for hands-on examples. | -| **standardize_X** | ***boolean, default = True***
This determines if NetREm should standardize $X$: subtracting the mean of $X$ and dividing by the standard deviation of $X$ using the training data.
| -| **center_y** | ***boolean, default = True***
This determines if NetREm should center $y$: subtracting the mean of $y$ based on the training data
| +| **standardize_X** | ***boolean, default = True***
This determines if NetREm should standardize each predictor column of $X$: subtracting its mean and dividing by its standard deviation, computed from the training data.
| +| **standardize_y** | ***boolean, default = True***
This determines if NetREm should standardize $y$: subtracting the mean of $y$ and dividing by the standard deviation of $y$ using the training data.
| +| **center_y** | ***boolean, default = False***
This determines if NetREm should center $y$: subtracting the mean of $y$ computed from the training data.
| | **y_intercept** | ***boolean, default = False***
This is the `fit_intercept` parameter found in the [Lasso](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) and [LassoCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html) classes in sklearn.
• If `y_intercept = True`, the model will be fit with a y-intercept term included.
• If `y_intercept = False`, the model will be fit with no y-intercept term. | | **view_network** | ***boolean, default = False***
• If `view_network = True`, then NetREm outputs visualizations of the prior graph network. Recommended for small networks (instead of for dense hairballs)
• If `view_network = False`, then NetREm saves time by not outputting visuals of the network. | | **model_type** | ***{'Lasso', 'LassoCV'}, default = 'Lasso'***
• Lasso: user specifies value of $\alpha_{lasso}$
• LassoCV: NetREm performs cross-validation (CV) on training data to determine optimal $\alpha_{lasso}$ | diff --git a/code/DemoDataBuilderXandY.py b/code/DemoDataBuilderXandY.py index bb12db7..981d8c5 100644 --- a/code/DemoDataBuilderXandY.py +++ b/code/DemoDataBuilderXandY.py @@ -6,6 +6,8 @@ import numpy as np from sklearn.model_selection import train_test_split import plotly.express as px +from scipy.stats import zscore + class DemoDataBuilderXandY: """:) Please note that this class focuses on building Y data based on a normal distribution (specified mean and standard deviation). M is the # of samples we want to generate. Thus, Y is a vector with M elements. @@ -17,6 +19,7 @@ class DemoDataBuilderXandY: _parameter_constraints = { "test_data_percent": (0, 100), + "sparsity_factor_perc": (0, 100), "mu": (0, None), "std_dev": (0, None), "num_iters_to_generate_X": (1, None), @@ -44,6 +47,7 @@ def __init__(self, **kwargs): self.orthogonal_X_bool = True # False adjustment made on 9/20 self.ortho_scalar = 10 self.tol = 1e-2 + self.sparsity_factor_perc = 0 self.view_input_correlations_plot = False # reading in user inputs self.__dict__.update(kwargs) @@ -56,8 +60,9 @@ def __init__(self, **kwargs): raise ValueError(f":( Please note ye are missing information for these keys: {missing_keys}") self.M = self.num_samples_M self.N = self.get_N() - self.y = self.generate_Y() - self.X = self.generate_X() + generated_data_dict = self.generate_data() + self.y = generated_data_dict["y"]# self.generate_Y() + self.X = generated_data_dict["X"] #self.generate_X() self.same_train_and_test_data_bool = self.same_train_test_data if self.same_train_and_test_data_bool: self.testing_size = 1 @@ -151,19 +156,27 @@ def view_y_test_df(self): y_test_df = pd.DataFrame(self.y_test, columns = ["y"]) return y_test_df + def combine_X_and_y_train_and_test_data(self): - X_p1 = self.X_train_df + # Reset indices to ensure alignment + X_p1 = self.X_train_df.reset_index(drop=True) + X_p2 = self.X_test_df.reset_index(drop=True) + y_p1 = self.y_train_df.reset_index(drop=True) + y_p2 = self.y_test_df.reset_index(drop=True) + + # Add 'info' columns X_p1["info"] = "training" - X_p2 = self.X_test_df X_p2["info"] = "testing" - X_combined = pd.concat([X_p1, X_p2]).drop_duplicates() - y_p1 = self.y_train_df y_p1["info"] = "training" - y_p2 = self.y_test_df y_p2["info"] = "testing" + + # Concatenate dataframes + X_combined = pd.concat([X_p1, X_p2]).drop_duplicates() y_combined = pd.concat([y_p1, y_p2]).drop_duplicates() - combining_df = X_combined - combining_df["y"] = y_combined["y"] + + # Merge X_combined and y_combined on the index + combining_df = X_combined.merge(y_combined[['y', 'info']], left_index=True, right_index=True, how='left') + return combining_df def return_correlations_dataframe(self): @@ -174,11 +187,92 @@ def return_correlations_dataframe(self): corr_df["data"] = "correlations" return corr_df - def generate_Y(self): - seed_val = self.rng_seed - rng = np.random.default_rng(seed=seed_val) - y = rng.normal(self.mu, self.std_dev, self.M) - return y + + # def generate_Y(self): + # #seed_val = self.rng_seed + # #rng = np.random.default_rng(seed=seed_val) + # np.random.seed(seed_val) + # y = np.random.randn(self.M) + # y = apply_sparsity(y, self.sparsity_factor_perc) + # # y = rng.normal(self.mu, self.std_dev, self.M) + # # if self.sparsity_factor_perc > 0: + # # sparsity_mask = np.random.uniform(0, 1, self.M) < self.sparsity_factor_perc / 100.0 + # # y[sparsity_mask] = 0 + # return y + + def generate_data(self): + 
np.random.seed(self.rng_seed) + y = np.random.randn(self.M) + y = apply_sparsity(y, self.sparsity_factor_perc) + + + np.random.seed(self.randSeed) # For reproducibility + sparsity_factor_perc_list = [self.sparsity_factor_perc] * len(self.corrVals) + + predictors = [] + + for i, (corr, sparsity_factor) in tqdm(enumerate(zip(self.corrVals, sparsity_factor_perc_list)), desc = ":) Generating Data for Predictors"): + # Generate X initially + x = np.random.randn(self.M) * np.sqrt(1 - corr**2) + y * corr + + # Adjust X to ensure the correlation is within tolerance + x = adjust_data_to_correlation(x, y, corr, self.tol, self.num_iters_to_generate_X) + + # Apply sparsity + x = apply_sparsity(x, sparsity_factor) + + predictors.append(x) + + X = np.column_stack(predictors) + results_dict = {} + results_dict["X"] = X#pd.DataFrame(X, columns=[f'X{i+1}' for i in range(len(self.corrVals))]).values + results_dict["y"] = y + + return results_dict + + + def generate_X_old(self): + np.random.seed(self.rng_seed) + y_std = (self.y - np.mean(self.y)) / np.std(self.y) # Standardized Y for correlation calculation + + X = np.random.normal(0, 1, (self.M, self.N)) # Initialize X + + # Apply sparsity to X + if self.sparsity_factor_perc > 0: + sparsity_mask = np.random.uniform(0, 1, X.shape) < self.sparsity_factor_perc / 100.0 + X[sparsity_mask] = 0 + + for i in tqdm(range(self.N), desc="Adjusting predictors for desired correlations"): + desired_corr = self.corrVals[i] + + for attempt in range(1000): # Limit attempts to find a match within tolerance + # Adjust X[:, i] to attempt achieving the desired correlation + temp_Xi = np.random.normal(0, 1, self.M) + if self.sparsity_factor_perc > 0: + temp_sparsity_mask = np.random.uniform(0, 1, self.M) < self.sparsity_factor_perc / 100.0 + temp_Xi[temp_sparsity_mask] = 0 + temp_Xi_std = (temp_Xi - np.mean(temp_Xi)) / np.std(temp_Xi) + + actual_corr = np.corrcoef(y_std, temp_Xi_std)[0, 1] + + if abs(actual_corr - desired_corr) < self.tol: + X[:, i] = temp_Xi + break + + # Optionally orthogonalize X + if self.orthogonal_X_bool: + X, _ = np.linalg.qr(X) + + return X + + + # def generate_Y(self): + # # Generate Y with optional sparsity + # y = np.random.normal(self.mu, self.std_dev, self.M) + # if self.sparsity_factor_perc > 0: + # sparsity_mask = np.random.uniform(0, 1, self.M) < self.sparsity_factor_perc / 100.0 + # y[sparsity_mask] = 0 + # return y # Check if Q is orthogonal using the is_orthogonal function def is_orthogonal(matrix): @@ -198,604 +292,49 @@ def is_orthogonal(matrix): # Check if the product is equal to the identity matrix return np.allclose(matrix_matrix_T, np.eye(matrix.shape[0])) -# # Define the modified generate_X function -# def generate_X(self): -# """Generates a design matrix X with the given correlations while introducing noise and dependencies. -# Parameters: -# orthogonal (bool): Whether to generate an orthogonal matrix (default=False). - -# Returns: -# numpy.ndarray: The design matrix X. 
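# --- Editor's note: a minimal, illustrative sketch (not part of this patch) of how the new
# generate_data() pipeline builds each predictor: mix noise with y to get roughly the target
# Pearson correlation, nudge it within tolerance, then sparsify. It assumes the module-level
# helpers apply_sparsity() and adjust_data_to_correlation() added at the bottom of this file;
# the values (M = 100 samples, corr = 0.7, 20% sparsity) are hypothetical.
#
#     import numpy as np
#     M, corr = 100, 0.7
#     np.random.seed(0)
#     y = np.random.randn(M)                                    # toy response
#     x = np.random.randn(M) * np.sqrt(1 - corr**2) + y * corr  # roughly corr-correlated with y
#     x = adjust_data_to_correlation(x, y, corr, tolerance=1e-3, max_iterations=10000)
#     x = apply_sparsity(x, 20)                                 # zero out the smallest 20% of |x|
#     print(np.corrcoef(x, y)[0, 1])                            # ~0.7 (sparsifying last may shift it slightly)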
-# """ -# orthogonal = self.orthogonal_X_bool -# scalar = self.ortho_scalar -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N # len(corrVals) -# numIterations = self.num_iters_to_generate_X -# correlations = self.corrVals -# corrVals = [correlations[0]] + correlations - -# # Step 1: Generate Initial X -# e = np.random.normal(0, 1, (n, numTFs + 1)) -# X = np.copy(e) -# X[:, 0] = y * np.sqrt(1.0 - corrVals[0]**2) / np.sqrt(1.0 - np.corrcoef(y, X[:,0])[0,1]**2) -# for j in range(numIterations): -# for i in range(1, numTFs + 1): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y - -# # Step 2: Add Noise -# noise_scale = 0.1 # You can adjust this value -# X += np.random.normal(0, noise_scale, X.shape) - -# # Step 3: Introduce Inter-dependencies -# # Make the second predictor a combination of the first and third predictors -# X[:, 1] += 0.3 * X[:, 0] + 0.7 * X[:, 2] - -# # Step 4: Adjust for Correlations -# for j in range(numIterations): -# for i in range(1, numTFs + 1): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y - -# if orthogonal: -# # Compute the QR decomposition of X and take only the Q matrix -# Q = np.linalg.qr(X)[0] -# Q = scalar * Q -# return Q[:, 1:] -# else: -# # Return the X matrix without orthogonalization -# return X[:, 1:] - -# # # Display the modified function to ensure it looks okay -# # print(generate_X_modified) - -# def generate_X(self): -# """Generates a design matrix X with the given correlations and introduces an interaction term. - -# Returns: -# numpy.ndarray: The design matrix X. -# """ -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N # Number of predictors -# numIterations = self.num_iters_to_generate_X -# corrVals = self.corrVals - -# # Step 1: Generate Initial X based on the specified correlations with Y -# e = np.random.normal(0, 1, (n, numTFs)) -# X = np.copy(e) -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y - -# # Step 2: Introduce Interaction Term into Y -# interaction_term = X[:, 3] * X[:, 4] -# self.y = y + 0.5 * interaction_term # Adjust the coefficient as needed - -# # Step 3: Re-adjust for specified correlations with Y -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(self.y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y - -# return X - - - - # Define the modified generate_X function to highlight the benefits of network-regularized regression -# def generate_X(self): -# """Generates a design matrix X to highlight the benefits of network-regularized regression. - -# Returns: -# numpy.ndarray: The design matrix X. 
-# """ -# np.random.seed(self.randSeed) -# n = len(self.y) -# numTFs = self.N # Number of predictors -# numIterations = self.num_iters_to_generate_X -# corrVals = self.corrVals - -# # Step 1: Generate Initial X based on the specified correlations with Y -# e = np.random.normal(0, 1, (n, numTFs)) -# X = np.copy(e) -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(self.y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y - -# # Step 2: Weaken X2 and X4 as predictors by introducing interactions in Y -# interaction_term = 0.3 * (X[:, 0] * X[:, 1]) + 0.3 * (X[:, 3] * X[:, 4]) # Interaction terms -# self.y = self.y + interaction_term # Update Y - -# # Step 3: Strengthen network edges by making X1 and X2, and X4 and X5 highly correlated -# X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1] # X1 and X2 -# X[:, 3] = 0.7 * X[:, 4] + 0.3 * X[:, 3] # X4 and X5 - -# # Step 4: Re-adjust for specified correlations with Y -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(self.y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y - -# return X -# def generate_X(self): -# """Generates a design matrix X with the given correlations and introduces specified network edges and interactions. - -# Returns: -# numpy.ndarray: The design matrix X. -# """ -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N # Number of predictors -# numIterations = self.num_iters_to_generate_X -# corrVals = self.corrVals - -# # Step 1: Generate Initial X based on the specified correlations with Y -# e = np.random.normal(0, 1, (n, numTFs)) -# X = np.copy(e) -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y - -# # Step 2: Weaken X2 and X4 as predictors by introducing interactions in Y -# self.y = y + 0.3 * (X[:, 1] * X[:, 0]) + 0.3 * (X[:, 3] * X[:, 4]) # Adjust the coefficients as needed - -# # Step 3: Strengthen network edges by making X1 and X2, and X4 and X5 highly correlated -# X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1] # X1 and X2 -# X[:, 3] = 0.7 * X[:, 4] + 0.3 * X[:, 3] # X4 and X5 - -# # Step 4: Re-adjust for specified correlations with Y -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(self.y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y - -# return X -# def generate_X(self): -# """Generates a design matrix X with given correlations and introduces inter-predictor correlations. - -# Returns: -# numpy.ndarray: The design matrix X. 
-# """ -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N # Number of predictors -# numIterations = self.num_iters_to_generate_X -# corrVals = self.corrVals - -# # Step 1: Generate Initial X based on the specified correlations with Y -# e = np.random.normal(0, 1, (n, numTFs)) -# X = np.copy(e) -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y - -# # Step 2: Introduce Inter-predictor Correlations -# # Make X1 and X2 highly correlated -# X[:, 0] = 0.5 * X[:, 0] + 0.5 * X[:, 1] -# # Make X4 and X5 highly correlated -# X[:, 3] = 0.525 * X[:, 3] + 0.475 * X[:, 4] - -# # Step 3: Re-adjust for specified correlations with Y -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y - -# return X - -# def generate_X(self, tol=1e-4): -# orthogonal = self.orthogonal_X_bool -# scalar = self.ortho_scalar -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N - -# # Initialize X with standard normal distribution -# X = np.random.normal(0, 1, (n, numTFs)) - -# for i in range(numTFs): -# desired_corr = self.corrVals[i] - -# while True: -# # Create a new predictor as a linear combination of original predictor and y -# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] - -# # Standardize the predictor to have mean 0 and variance 1 -# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - -# # Calculate the actual correlation -# actual_corr = np.corrcoef(y, X[:, i])[0, 1] - -# # Calculate the difference between the actual and desired correlations -# diff = abs(actual_corr - desired_corr) - -# if diff < tol: -# break - -# # Orthogonalize the predictors to make them independent of each other -# Q, _ = np.linalg.qr(X) - -# if orthogonal: -# # Scale the orthogonalized predictors -# Q = scalar * Q -# return Q -# else: -# # Return the orthogonalized predictors without scaling -# return Q - -# def generate_X(self): -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N -# tol = self.tol - -# # Initialize X with standard normal distribution (vectorized) -# X = np.random.normal(0, 1, (n, numTFs)) - -# # Standardize y for correlation calculation -# y_std = (y - np.mean(y)) / np.std(y) - -# for i in tqdm(range(numTFs), desc="Generating predictors"): -# desired_corr = self.corrVals[i] - -# while True: -# # Orthogonalize Xi against all previous predictors -# for j in range(i): -# coef = np.dot(X[:, i], X[:, j]) / np.dot(X[:, j], X[:, j]) -# X[:, i] -= coef * X[:, j] - -# # Create and standardize new predictor (vectorized) -# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] -# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - -# # Calculate actual correlation (vectorized) -# actual_corr = np.dot(y_std, X[:, i]) / n - -# # Check if actual correlation is close enough to desired correlation -# if abs(actual_corr - desired_corr) < tol: -# break - -# # Orthogonalize X to reduce inter-predictor correlation (if required) -# if self.orthogonal_X_bool: -# X, _ = np.linalg.qr(X) - -# return X - - def generate_X(self): - np.random.seed(self.randSeed) - y = self.y - n = len(y) - numTFs = self.N - tol = self.tol - # Initialize X with standard normal distribution (vectorized) - X = np.random.normal(0, 1, (n, numTFs)) + def generate_X_old(self): + np.random.seed(self.rng_seed) + y_std = (self.y - np.mean(self.y)) / 
np.std(self.y) # Standardized Y for correlation calculation + + X = np.random.normal(0, 1, (self.M, self.N)) # Initialize X - # Standardize y for correlation calculation - y_std = (y - np.mean(y)) / np.std(y) + # Apply sparsity to X + if self.sparsity_factor_perc > 0: + sparsity_mask = np.random.uniform(0, 1, X.shape) < self.sparsity_factor_perc / 100.0 + X[sparsity_mask] = 0 - for i in tqdm(range(numTFs), desc="Generating predictors"): + for i in tqdm(range(self.N), desc="Adjusting predictors for desired correlations"): desired_corr = self.corrVals[i] - while True: - # Create and standardize new predictor (vectorized) - X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] - X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + for attempt in range(1000): # Limit attempts to find a match within tolerance + # Adjust X[:, i] to attempt achieving the desired correlation + temp_Xi = np.random.normal(0, 1, self.M) + if self.sparsity_factor_perc > 0: + temp_sparsity_mask = np.random.uniform(0, 1, self.M) < self.sparsity_factor_perc / 100.0 + temp_Xi[temp_sparsity_mask] = 0 + temp_Xi_std = (temp_Xi - np.mean(temp_Xi)) / np.std(temp_Xi) - # Calculate actual correlation (vectorized) - actual_corr = np.dot(y_std, X[:, i]) / n + actual_corr = np.corrcoef(y_std, temp_Xi_std)[0, 1] - # Check if actual correlation is close enough to desired correlation - if abs(actual_corr - desired_corr) < tol: + if abs(actual_corr - desired_corr) < self.tol: + X[:, i] = temp_Xi break - # Orthogonalize X to reduce inter-predictor correlation (if required) + # Optionally orthogonalize X if self.orthogonal_X_bool: X, _ = np.linalg.qr(X) return X - def generate_X7(self): - orthogonal = self.orthogonal_X_bool - scalar = self.ortho_scalar - np.random.seed(self.randSeed) - y = self.y - n = len(y) - numTFs = self.N - tol = self.tol - - # Initialize X with standard normal distribution - X = np.random.normal(0, 1, (n, numTFs)) - - desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " - for i in tqdm(range(numTFs), desc=desc_name): - desired_corr = self.corrVals[i] - - while True: - # Create a new predictor as a linear combination of original predictor and y - X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] - - # Standardize the predictor to have mean 0 and variance 1 - X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - - # Calculate the actual correlation - actual_corr = np.corrcoef(y, X[:, i])[0, 1] - - # Calculate the difference between the actual and desired correlations - diff = abs(actual_corr - desired_corr) - - if diff < tol: - break - - # Step 2: Orthogonalize the predictors to remove inter-predictor correlation - X_ortho, _ = np.linalg.qr(X) - - # Step 3: Scale each orthogonalized predictor to match the desired correlation with y - for i in tqdm(range(numTFs), desc="Rescaling orthogonalized predictors"): - desired_corr = self.corrVals[i] - - while True: - # Scale the orthogonalized predictor - X_ortho[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X_ortho[:, i] - - # Standardize the predictor - X_ortho[:, i] = (X_ortho[:, i] - np.mean(X_ortho[:, i])) / np.std(X_ortho[:, i]) - - # Calculate the actual correlation - actual_corr = np.corrcoef(y, X_ortho[:, i])[0, 1] - - # Calculate the difference between the actual and desired correlations - diff = abs(actual_corr - desired_corr) - - if diff < tol: - break - - if orthogonal: - # Compute the QR decomposition of X and take only the Q matrix - Q = 
np.linalg.qr(X_ortho)[0] - Q = scalar * Q - return Q - else: - # Return the X matrix without orthogonalization - return X_ortho - - - def generate_X5(self): - orthogonal = self.orthogonal_X_bool - scalar = self.ortho_scalar - np.random.seed(self.randSeed) - y = self.y - n = len(y) - numTFs = self.N - tol = self.tol - jitter = 0.05 # Noise level to reduce correlation between predictors - - # Initialize X with standard normal distribution - X = np.random.normal(0, 1, (n, numTFs)) - - desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " - for i in tqdm(range(numTFs), desc=desc_name): - desired_corr = self.corrVals[i] - - while True: - # Create a new predictor as a linear combination of original predictor and y - X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] - - # Add a small amount of noise to reduce correlation with other predictors - X[:, i] += jitter * np.random.normal(0, 1, n) - - # Standardize the predictor to have mean 0 and variance 1 - X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - - # Calculate the actual correlation - actual_corr = np.corrcoef(y, X[:, i])[0, 1] - - # Calculate the difference between the actual and desired correlations - diff = abs(actual_corr - desired_corr) - - if diff < tol: - break - - if orthogonal: - # Compute the QR decomposition of X and take only the Q matrix - Q = np.linalg.qr(X)[0] - Q = scalar * Q - return Q - else: - # Return the X matrix without orthogonalization - return X - - def generate_X3(self): - orthogonal = self.orthogonal_X_bool - scalar = self.ortho_scalar - np.random.seed(self.randSeed) - y = self.y - n = len(y) - numTFs = self.N - tol = self.tol - # Initialize X with standard normal distribution - X = np.random.normal(0, 1, (n, numTFs)) - desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " - for i in tqdm(range(numTFs), desc=desc_name): - desired_corr = self.corrVals[i] - - while True: - # Create a new predictor as a linear combination of original predictor and y - X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] - - # Standardize the predictor to have mean 0 and variance 1 - X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - - # Calculate the actual correlation - actual_corr = np.corrcoef(y, X[:, i])[0, 1] - - # Calculate the difference between the actual and desired correlations - diff = abs(actual_corr - desired_corr) - - if diff < tol: - break - - if orthogonal: - # Compute the QR decomposition of X and take only the Q matrix - Q = np.linalg.qr(X)[0] - Q = scalar * Q - return Q - else: - # Return the X matrix without orthogonalization - return X - - # Define the function for generating synthetic data with specific correlations and standard normal predictors - def generate_X1(self): - orthogonal = self.orthogonal_X_bool - scalar = self.ortho_scalar - np.random.seed(self.randSeed) - y = self.y - n = len(y) - numTFs = self.N - - # Initialize X with standard normal distribution - X = np.random.normal(0, 1, (n, numTFs)) - - # Adjust X to achieve the desired correlations with y - for i in range(numTFs): - corr = self.corrVals[i] - # Create a new predictor as a linear combination of original predictor and y - X[:, i] = corr * y + np.sqrt(1 - corr ** 2) * X[:, i] - - # Standardize the predictor to have mean 0 and variance 1 - X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - - if orthogonal: - # Compute the QR decomposition of X and take only the Q matrix - Q = 
np.linalg.qr(X)[0] - Q = scalar * Q - return Q - else: - # Return the X matrix without orthogonalization - return X -# def generate_X(self): -# orthogonal = self.orthogonal_X_bool -# scalar = self.ortho_scalar -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N -# numIterations = self.num_iters_to_generate_X -# correlations = self.corrVals -# corrVals = [correlations[0]] + correlations - -# # Initialize X with standard normal distribution -# X = np.random.normal(0, 1, (n, numTFs)) - -# for j in range(numIterations): -# for i in range(numTFs): -# corr = np.corrcoef(y, X[:, i])[0, 1] -# X[:, i] = X[:, i] + (corrVals[i] - corr) * y -# # Standardize the predictor -# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - -# if orthogonal: -# # Compute the QR decomposition of X and take only the Q matrix -# Q = np.linalg.qr(X)[0] -# Q = scalar * Q -# return Q -# else: -# # Return the X matrix without orthogonalization -# return X - - -# def generate_X(self): -# orthogonal = self.orthogonal_X_bool -# scalar = self.ortho_scalar -# np.random.seed(self.randSeed) -# y = self.y -# n = len(y) -# numTFs = self.N -# tol=self.tol -# # Initialize X with standard normal distribution -# X = np.random.normal(0, 1, (n, numTFs)) -# numIterations = self.num_iters_to_generate_X -# for iter_count in range(numIterations): -# max_diff = 0 # Initialize maximum difference between actual and desired correlations for this iteration -# for i in range(numTFs): -# desired_corr = self.corrVals[i] - -# # Create a new predictor as a linear combination of original predictor and y -# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] - -# # Standardize the predictor to have mean 0 and variance 1 -# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) - -# # Calculate the actual correlation -# actual_corr = np.corrcoef(y, X[:, i])[0, 1] - -# # Calculate the difference between the actual and desired correlations -# diff = abs(actual_corr - desired_corr) -# max_diff = max(max_diff, diff) - -# # If the maximum difference between actual and desired correlations is below the tolerance, break the loop -# if max_diff < tol: -# break - -# if orthogonal: -# # Compute the QR decomposition of X and take only the Q matrix -# Q = np.linalg.qr(X)[0] -# Q = scalar * Q -# return Q -# else: -# # Return the X matrix without orthogonalization -# return X - - def generate_X_old(self): - """Generates a design matrix X with the given correlations. - Parameters: - orthogonal (bool): Whether to generate an orthogonal matrix (default=False). - - Returns: - numpy.ndarray: The design matrix X. 
- """ - orthogonal = self.orthogonal_X_bool - scalar = self.ortho_scalar - np.random.seed(self.randSeed) - y = self.y - n = len(y) - numTFs = self.N # len(corrVals) - numIterations = self.num_iters_to_generate_X - correlations = self.corrVals - corrVals = [correlations[0]] + correlations - e = np.random.normal(0, 1, (n, numTFs + 1)) - X = np.copy(e) - X[:, 0] = y * np.sqrt(1.0 - corrVals[0]**2) / np.sqrt(1.0 - np.corrcoef(y, X[:,0])[0,1]**2) - for j in range(numIterations): - for i in range(1, numTFs + 1): - corr = np.corrcoef(y, X[:, i])[0, 1] - X[:, i] = X[:, i] + (corrVals[i] - corr) * y - - if orthogonal: - # Compute the QR decomposition of X and take only the Q matrix - Q = np.linalg.qr(X)[0] - Q = scalar * Q - return Q[:, 1:] - else: - # Return the X matrix without orthogonalization - return X[:, 1:] - + def generate_training_and_testing_data(self): same_train_and_test_data_bool = self.same_train_and_test_data_bool X = self.X y = self.y + self.original_X = self.X + self.original_y = self.y + if same_train_and_test_data_bool == False: # different training and testing datasets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = self.testing_size) if self.verbose: @@ -837,6 +376,7 @@ def actual_vs_expected_corrs_DefensiveProgramming_all_groups(self, X, y, X_train testing_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X_test, y_test, corrVals, tf_names_list, same_train_and_test_data_bool, "Testing") combined_correlations_df = pd.concat([overall_corrs_df, training_corrs_df, testing_corrs_df]).drop_duplicates() + combined_correlations_df["sparsity_factor_perc"] = self.sparsity_factor_perc return combined_correlations_df def compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(self, X_matrix, y, corrVals, @@ -888,7 +428,9 @@ def generate_dummy_data(corrVals, train_data_percent = 70, mu = 0, std_dev = 1, - iters_to_generate_X = 100, + tol = 1e-3, + iters_to_generate_X = 10000, + sparsity_factor_perc = 0, orthogonal_X = False, ortho_scalar = 10, view_input_corrs_plot = False, @@ -904,9 +446,10 @@ def generate_dummy_data(corrVals, print(f":) same_train_test_data = {same_train_test_data}") demo_dict = { "test_data_percent": 100 - train_data_percent, - "mu": mu, "std_dev": std_dev, + "mu": mu, "std_dev": std_dev, "tol":tol, "num_iters_to_generate_X": iters_to_generate_X, "same_train_test_data": same_train_test_data, + "sparsity_factor_perc":sparsity_factor_perc, "rng_seed": rand_seed_y, #2023, # for Y "randSeed": rand_seed_x, #123, # for X "ortho_scalar": ortho_scalar, @@ -916,3 +459,26 @@ def generate_dummy_data(corrVals, "corrVals": corrVals, "verbose":verbose} dummy_data = DemoDataBuilderXandY(**demo_dict) # return dummy_data + + +def apply_sparsity(data, sparsity_factor_perc): + sparsity_threshold = np.percentile(np.abs(data), sparsity_factor_perc) + data[np.abs(data) < sparsity_threshold] = 0 + return data + + +def adjust_data_to_correlation(x, y, desired_corr, tolerance=0.001, max_iterations=10000): + actual_corr = np.corrcoef(x, y)[0, 1] + iterations = 0 + while np.abs(actual_corr - desired_corr) >= tolerance and iterations < max_iterations: + # Adjust x slightly to move correlation closer to the desired value + adjustment = np.random.randn(*x.shape) * 0.01 + x_temp = x + adjustment + temp_corr = np.corrcoef(x_temp, y)[0, 1] + + # If the adjustment improves the correlation, accept the change + if np.abs(temp_corr - desired_corr) < np.abs(actual_corr - desired_corr): + x = x_temp + actual_corr = 
temp_corr + iterations += 1 + return x \ No newline at end of file diff --git a/code/ElasticNetREm.py b/code/ElasticNetREm.py new file mode 100644 index 0000000..0abf09f --- /dev/null +++ b/code/ElasticNetREm.py @@ -0,0 +1,881 @@ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. +import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model, preprocessing # 9/19 +from sklearn.linear_model import Lasso, ElasticNetCV, LinearRegression, ElasticNetCV, ElasticNet, Ridge +from numpy.typing import ArrayLike +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 +# from packages_needed import * +import essential_functions as ef +import error_metrics as em # why to do import +#import Netrem_model_builder as nm +import DemoDataBuilderXandY as demo +import PriorGraphNetwork as graph +import netrem_evaluation_functions as nm_eval +import networkx as nx +from tqdm.auto import tqdm +import copy +""" +Optimization for +(1 / (2 * M)) * ||y - Xc||^2_2 + (beta / (2 * N^2)) * c'Ac + alpha * ||c||_1 +Which is converted to lasso +(1 / (2 * M)) * ||y_tilde - X_tilde @ c||^2_2 + alpha * ||c||_1 +where M = n_samples and N is the dimension of c. +Check compute_X_tilde_y_tilde() to see how we make sure above normalization is applied using Lasso of sklearn +""" + +class ElasticNetREmModel(BaseEstimator, RegressorMixin): + """ :) Please note that this class focuses on building a Gene Regulatory Network (GRN) from gene expression data for Transcription Factors (TFs), gene expression data for the target gene (TG), and a prior biological network (W). This class performs Network-penalized regression :) """ + _parameter_constraints = { + "alpha_enet": (0, None), + "beta_net": (0, None), + "num_cv_folds": (0, None), + "y_intercept": [False, True], + "use_network": [True, False], + "max_enet_iterations": (1, None), + "l1_ratio_en": (0, None), + "model_type": ["ElasticNet", "ElasticNetCV", "Linear"], + "tolerance": (0, None), + "num_jobs": (1, 1e10), + "enet_selection": ["cyclic", "random"], + "enet_cv_eps": (0, None), + "enet_cv_n_alphas": (1, None), + "standardize_X": [True, False], + "standardize_y": [True, False], + "center_y": [True, False] + } + + def __init__(self, **kwargs): + self.info = "NetREm Model" + self.verbose = False + self.overlapped_nodes_only = False # restrict the nodes to only being those found in the network? 
overlapped_nodes_only + self.num_cv_folds = 5 # for cross-validation models + self.num_jobs = -1 # for ElasticNetCV or LinearRegression (here, -1 is the max possible for CPU) + self.all_pos_coefs = False # for coefficients + self.model_type = "ElasticNet" + self.standardize_X = True + self.standardize_y = True + self.center_y = False + self.use_network = True + self.y_intercept = False + self.max_enet_iterations = 10000 + self.view_network = False + self.l1_ratio_en = 0.5 + self.model_info = "unfitted_model :(" + self.target_gene_y = "Unknown :(" + self.tolerance = 1e-4 + self.enet_selection = "cyclic" # default in sklearn + self.enet_cv_eps = 1e-3 # default in sklearn + self.enet_cv_n_alphas = 100 # default in sklearn + self.enet_cv_alphas = None # default in sklearn + self.beta_net = kwargs.get('beta_net', 1) + self.__dict__.update(kwargs) + required_keys = ["network", "beta_net"] + if self.model_type == "ElasticNet": + self.alpha_enet = kwargs.get('alpha_enet', 0.01) + self.optimal_alpha = "User-specified optimal alpha elasticnet: " + str(self.alpha_enet) + required_keys += ["alpha_enet"] + elif self.model_type == "ElasticNetCV": + self.alpha_enet = "ElasticNetCV finds optimal alpha" + self.optimal_alpha = "Since ElasticNetCV is model_type, please fit model using X and y data to find optimal_alpha." + else: # model_type == "Linear": + self.alpha_enet = "No alpha needed" + self.optimal_alpha = "No alpha needed" # + missing_keys = [key for key in required_keys if key not in self.__dict__] # check that all required keys are present: + if missing_keys: + raise ValueError(f":( Please note ye are missing information for these keys: {missing_keys}") + if self.use_network: + prior_network = self.network + self.prior_network = prior_network + self.preprocessed_network = prior_network.preprocessed_network + self.network_params = prior_network.param_lists + self.network_nodes_list = prior_network.final_nodes # tf_names_list + self.kwargs = kwargs + self._apply_parameter_constraints() # ensuring that the parameter constraints are met + + + def __repr__(self): + args = [f"{k}={v}" for k, v in self.__dict__.items() if k != 'param_grid' and k in self.kwargs] + return f"{self.__class__.__name__}({', '.join(args)})" + + + def check_overlaps_work(self): + final_set = set(self.final_nodes) + network_set = set(self.network_nodes_list) + if self.tg_is_tf: + return False + return network_set != final_set + + + def standardize_X_data(self, X_df): # if the user opts to + """ :) If the user opts to standardize the X data (so that predictors have a mean of 0 + and a standard deviation of 1), then this method will be run, which uses the preprocessing + package StandardScalar() functionality. """ + if self.standardize_X: + # Transform both the training and test data + X_scaled = self.scaler_X.transform(X_df) + X_scaled_df = pd.DataFrame(X_scaled, columns=X_df.columns) + return X_scaled_df + else: + return X_df + + + def standardize_y_data(self, y_df): # if the user opts to + """ :) If the user opts to standardize the y data (so that the TG will have a mean of 0 + and a standard deviation of 1), then this method will be run, which uses the preprocessing + package StandardScalar() functionality. 
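        Note: self.scaler_y is fit on the training y data only (in updating_network_and_X_during_fitting)
        and is reused here for any new data, so that test data does not leak into the scaling.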
""" + if self.standardize_y: + # Transform both the training and test data + y_scaled = self.scaler_y.transform(y_df) + y_scaled_df = pd.DataFrame(y_scaled, columns=y_df.columns) + return y_scaled_df + else: + return y_df + + + def center_y_data(self, y_df): # if the user opts to + """ :) If the user opts to center the response y data: + subtracting its mean from each observation.""" + if self.center_y: + # Center the response + y_train_centered = y_df - self.mean_y_train + return y_train_centered + else: + return y_df + + + def updating_network_and_X_during_fitting(self, X, y): + # updated one :) + """ Update the prior network information and the + X input data (training) during the fitting of the model. It determines if the common predictors + should be used (based on if overlapped_nodes_only is True) or if all of the X input data should be used. """ + X_df = X.sort_index(axis=1) # sorting the X dataframe by columns. (rows are samples) + + #X_df = X.sort_index(axis=0).sort_index(axis=1) # sorting the X dataframe by rows and columns. + #self.X_df = X_df + self.target_gene_y = y.columns[0] + + if self.standardize_X: # we will standardize X then + if self.verbose: + print(":) Standardizing the X data") + self.old_X_df = X_df + self.scaler_X = preprocessing.StandardScaler().fit(X_df) # Fit the scaler to the training data only + # this self.scalar will be utilized for the testing data to prevent data leakage and to ensure generalization :) + self.X_df = self.standardize_X_data(X_df) + X = self.X_df # overwriting and updating the X df + else: + self.X_df = X_df + X = self.X_df + + self.mean_y_train = np.mean(y) # the average y value + if self.center_y: # we will center y then + if self.verbose: + print(":) centering the y data") + # Assuming y_train and y_test are your training and test labels + self.old_y = y + y = self.center_y_data(y) + + if self.standardize_y: # we will standardize y then + if self.verbose: + print(":) Standardizing the y data") + self.old_y_df = y + self.scaler_y = preprocessing.StandardScaler().fit(y) # Fit the scaler to the training data only + # this self.scalar will be utilized for the testing data to prevent data leakage and to ensure generalization :) + self.y_df = self.standardize_y_data(y) + y = self.y_df # overwriting and updating the y df + else: + self.y_df = y #= self.y_df + + tg_name = y.columns.tolist()[0] + self.tg_is_tf = False + if tg_name in X_df.columns.tolist(): + X_df = X_df.drop(columns = [tg_name]) + self.tg_is_tf = True # 1/31/24 + #self.X_df = X_df # 1/31/24 + #X = self.X_df # 1/31/24 + tg_is_tf = self.tg_is_tf + #gene_expression_nodes = list(set(X_df.columns.tolist()) - tg_name) # these are already sorted + gene_expression_nodes = sorted(X_df.columns.tolist()) # these will be sorted + ppi_net_nodes = set(self.network_nodes_list) # set(self.network_nodes_list) - tg_name + common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) + + if not common_nodes: # may be possible that the X dataframe needs to be transposed if provided incorrectly + print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") + X_df = X_df.transpose() + gene_expression_nodes = sorted(X_df.columns.tolist()) + common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) + + self.gene_expression_nodes = gene_expression_nodes + self.common_nodes = sorted(common_nodes) + gene_expression_nodes = sorted(gene_expression_nodes) # 10/22 + self.final_nodes = gene_expression_nodes + if 
self.overlapped_nodes_only: + self.final_nodes = common_nodes + elif self.preprocessed_network: + self.final_nodes = self.prior_network.final_nodes + else: + self.final_nodes = gene_expression_nodes + self.final_nodes = sorted(self.final_nodes) # 10/22 + if tg_is_tf: # 1/31/24 + self.final_nodes.remove(tg_name) + + final_nodes_set = set(self.final_nodes) + ppi_nodes_to_remove = list(ppi_net_nodes - final_nodes_set) + if tg_is_tf: # 1/31/24 + ppi_nodes_to_remove = list(set(ppi_nodes_to_remove) + set(tg_name)) + + self.gexpr_nodes_added = list(set(gene_expression_nodes) - final_nodes_set) + self.gexpr_nodes_to_add_for_net = list(set(gene_expression_nodes) - set(common_nodes)) + + if self.verbose: + if ppi_nodes_to_remove: + print(f"Please note that we remove {len(ppi_nodes_to_remove)} nodes found in the input network that are not found in the input gene expression data (X) :)") + print(ppi_nodes_to_remove) + else: + print(f":) Please note that all {len(common_nodes)} nodes found in the network are also found in the input gene expression data (X) :)") + self.filter_network_bool = self.check_overlaps_work() + filter_network_bool = self.filter_network_bool #self.check_overlaps_work(X_df) + + + if filter_network_bool: + print("Please note that we need to update the network information") + self.updating_network_A_matrix_given_X() # updating the A matrix given the gene expression data X + if self.view_network: + ef.draw_arrow() + self.view_W_network = self.view_W_network() + else: + self.A_df = self.network.A_df + self.A = self.network.A + self.nodes = self.A_df.columns.tolist() + + self.network_params = self.prior_network.param_lists + self.network_info = "fitted_network" + self.M = y.shape[0] + self.N = len(self.final_nodes) # pre-processing: + self.X_train = self.preprocess_X_df(X) + self.y_train = self.preprocess_y_df(y) + return self + + + def organize_B_interaction_list(self): # TF-TF interactions to output :) + self.B_train = self.compute_B_matrix(self.X_train) + self.B_interaction_df = pd.DataFrame(self.B_train, index = self.final_nodes, columns = self.final_nodes) + return self + + + def fit(self, X, y): # fits a model Function used for model training + + tg_name = y.columns.tolist()[0] + tg_is_tf = False + if tg_name in X.columns.tolist(): + if verbose: + print(f":) dropping TG {tg_name} from list of TF predictors!") + X = X.drop(columns = [tg_name]) + tg_is_tf = True # 1/31/24 + self.tg_is_tf = tg_is_tf + self.tg_name = tg_name + self.updating_network_and_X_during_fitting(X, y) + self.organize_B_interaction_list() + self.B_train_times_M = self.compute_B_matrix_times_M(self.X_train) + self.X_tilda_train, self.y_tilda_train = self.compute_X_tilde_y_tilde(self.B_train_times_M, self.X_train, + self.y_train) + # learning latent embedding values for X and y, respectively. 
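        # Editor's note (documentation only): compute_X_tilde_y_tilde rewrites the network-penalized
        # objective
        #     (1/(2M)) * ||y - X c||^2_2 + (beta_net/(2 N^2)) * c' A c + alpha * ||c||_1
        # as a standard Lasso/ElasticNet problem on transformed data. With
        #     B = X'X + (2 * beta_net * M / N^2) * A          (see compute_B_matrix_times_M)
        # and the SVD B = U diag(s) U' (B is symmetric), it sets
        #     X_tilde = diag(sqrt(s)) @ U'   and   y_tilde = diag(1/sqrt(s)) @ U' @ X' @ y,
        # so that X_tilde' X_tilde = B and X_tilde' y_tilde = X' y; the final sqrt(N/M) rescaling
        # corrects for sklearn normalizing by the number of rows of X_tilde (N) rather than by the
        # number of samples (M).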
+ self.X_training_to_use, self.y_training_to_use = self.X_tilda_train, self.y_tilda_train + self.regr = self.return_fit_ml_model(self.X_training_to_use, self.y_training_to_use) + ml_model = self.regr + self.final_alpha = self.alpha_enet + if self.model_type == "ElasticNetCV": + self.final_alpha = ml_model.alpha_ + self.optimal_alpha = "Cross-Validation optimal alpha elasticnet: " + str(self.final_alpha) + if self.verbose: + print(self.optimal_alpha) + self.coef = ml_model.coef_ # Please Get the coefficients + self.coef[self.coef == -0.0] = 0 + if self.y_intercept: + self.intercept = ml_model.intercept_ + self.predY_tilda_train = ml_model.predict(self.X_training_to_use) # training data + self.mse_tilda_train = self.calculate_mean_square_error(self.y_training_to_use, self.predY_tilda_train) # Calculate MSE + self.predY_train = ml_model.predict(self.X_train) # training data + # training metrics: + self.mse_train = self.calculate_mean_square_error(self.y_train, self.predY_train) # Calculate MSE + self.nmse_train = self.calculate_nmse(self.y_train, self.predY_train) # Calculate NMSE + self.snr_train = self.calculate_snr(self.y_train, self.predY_train) # Calculate SNR (Signal to Noise Ratio) + self.psnr_train = self.calculate_psnr(self.y_train, self.predY_train) # Calculate PSNR (Peak Signal to Noise Ratio) + + if self.y_intercept: + coeff_terms = [self.intercept] + list(self.coef) + index_names = ["y_intercept"] + self.nodes + self.model_coef_df = pd.DataFrame(coeff_terms, index = index_names).transpose() + else: + coeff_terms = ["None"] + list(self.coef) + index_names = ["y_intercept"] + self.nodes + self.model_coef_df = pd.DataFrame(coeff_terms, index = index_names).transpose() + self.model_info = "fitted_model :)" + selected_row = self.model_coef_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + if len(selected_cols) == 0: + self.model_nonzero_coef_df = None + self.num_final_predictors = 0 + else: + self.model_nonzero_coef_df = self.model_coef_df[selected_cols] + if len(selected_cols) > 1: # and self.model_type != "Linear": + self.netrem_model_predictor_results(y) + self.num_final_predictors = len(selected_cols) + if "y_intercept" in selected_cols: + self.num_final_predictors = self.num_final_predictors - 1 + return self + + + def netrem_model_predictor_results(self, y): # olders + """ :) Please note that this function by Saniya works on a netrem model and returns information about the predictors + such as their Pearson correlations with y, their rankings as well. + It returns: sorted_df, final_corr_vs_coef_df, combined_df """ + abs_df = self.model_nonzero_coef_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce').abs() + if abs_df.shape[0] == 1: + abs_df = pd.DataFrame([abs_df.squeeze()]) + sorted_series = abs_df.squeeze().sort_values(ascending=False) + sorted_df = pd.DataFrame(sorted_series) # convert the sorted series back to a DataFrame + sorted_df['Rank'] = range(1, len(sorted_df) + 1) # add a column for the rank + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + self.sorted_coef_df = sorted_df # print the sorted DataFrame + tg = y.columns.tolist()[0] + corr = pd.DataFrame(self.X_df.corrwith(y[tg])).transpose() + corr["info"] = "corr (r) with y: " + tg + all_df = self.model_coef_df + all_df = all_df.iloc[:, 1:] + all_df["info"] = "network regression coeff. 
with y: " + tg + all_df = pd.concat([all_df, corr]) + all_df["input_data"] = "X_train" + sorting = self.sorted_coef_df[["Rank"]].transpose().drop(columns = ["y_intercept"]) + sorting = sorting.reset_index().drop(columns = ["index"]) + sorting["info"] = "Absolute Value NetREm Coefficient Ranking" + sorting["input_data"] = "X_train" + all_df = pd.concat([all_df, sorting]) + self.corr_vs_coef_df = all_df + self.final_corr_vs_coef_df = self.corr_vs_coef_df[["info", "input_data"] + self.model_nonzero_coef_df.columns.tolist()[1:]] + + netrem_model_df = self.model_nonzero_coef_df.transpose() + netrem_model_df.columns = ["coef"] + netrem_model_df["TF"] = netrem_model_df.index.tolist() + netrem_model_df["TG"] = tg + if self.y_intercept: + netrem_model_df["info"] = "netrem_with_intercept" + else: + netrem_model_df["info"] = "netrem_no_intercept" + netrem_model_df["train_mse"] = self.mse_train + ## Oct 28 + netrem_model_df["train_nmse"] = self.nmse_train + netrem_model_df["train_snr"] = self.snr_train + netrem_model_df["train_psnr"] = self.psnr_train + ## end of Oct 28 + + if self.model_type != "Linear": + netrem_model_df["beta_net"] = self.beta_net + if self.model_type == "ElasticNetCV": + netrem_model_df["alpha_enetCV"] = self.optimal_alpha + else: + netrem_model_df["alpha_enet"] = self.alpha_enet + if netrem_model_df.shape[0] > 1: + self.combined_df = pd.merge(netrem_model_df, self.sorted_coef_df) + self.combined_df["final_model_TFs"] = max(self.sorted_coef_df["Rank"]) - 1 + else: + self.combined_df = netrem_model_df + self.combined_df["TFs_input_to_model"] = len(self.final_nodes) + self.combined_df["original_TFs_in_X"] = len(self.gene_expression_nodes) + self.combined_df["standardized_X"] = self.standardize_X + self.combined_df["standardized_y"] = self.standardize_y + self.combined_df["centered_y"] = self.center_y + return self + + + def view_W_network(self): + roundedW = np.round(self.W, decimals=4) + wMat = ef.view_matrix_as_dataframe(roundedW, column_names_list=self.final_nodes, row_names_list=self.final_nodes) + w_edgeList = wMat.stack().reset_index() + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns={"level_0": "source", "level_1": "target", 0: "weight"}) + w_edgeList = w_edgeList[w_edgeList["weight"] != 0] + + G = nx.from_pandas_edgelist(w_edgeList, source="source", target="target", edge_attr="weight") + pos = nx.spring_layout(G) + weights_list = [G.edges[e]['weight'] * self.prior_network.edge_weight_scaling for e in G.edges] + + fig, ax = plt.subplots() + + if not self.overlapped_nodes_only: + nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) + if nodes_to_add: + print(f":) {len(nodes_to_add)} new nodes added to network based on gene expression data {nodes_to_add}") + node_color_map = { + node: self.prior_network.added_node_color_name if node in nodes_to_add else self.prior_network.node_color_name + for node in G.nodes + } + nx.draw(G, pos, node_color=node_color_map.values(), edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + + labels = {e: G.edges[e]['weight'] for e in G.edges} + return nx.draw_networkx_edge_labels(G, pos, 
edge_labels=labels, ax=ax) + + + def compute_B_matrix_times_M(self, X): + """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term + see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html + The optimization objective for Lasso is: + (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 where M = n_sample + Calculations""" + XtX = X.T @ X + beta_L2 = self.beta_net + N_squared = self.N * self.N + part_2 = 2.0 * float(beta_L2) * self.M / (N_squared) * self.A + B = XtX + part_2 + return B + + + def compute_B_matrix(self, X): + """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term + see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html + The optimization objective for Lasso is: + (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 + where M = n_sample + Outputting for user """ + return self.compute_B_matrix_times_M(X) / self.M + + + def compute_X_tilde_y_tilde(self, B, X, y): + """Compute X_tilde, y_tilde such that X_tilde.T @ X_tilde = B, y_tilde.T @ X_tilde = y.T @ X """ + U, s, _Vh = np.linalg.svd(B, hermitian=True) # B = U @ np.diag(s) @ _Vh + if (cond := s[0] / s[-1]) > 1e10: + print(f'Large conditional number of B matrix: {cond: .2f}') + S_sqrt = ef.DiagonalLinearOperator(np.sqrt(s)) + S_inv_sqrt = ef.DiagonalLinearOperator(1 / np.sqrt(s)) + X_tilde = S_sqrt @ U.T + y_tilde = (y @ X @ U @ S_inv_sqrt).T + # assert(np.allclose(y.T @ X, y_tilde.T @ X_tilde)) + # assert(np.allclose(B, X_tilde.T @ X_tilde)) + # scale: we normalize by 1/M, but sklearn.linear_model.Lasso normalize by 1/N because X_tilde is N*N matrix, + # so Lasso thinks the number of sample is N instead of M, to use lasso solve our desired problem, correct the scale + scale = np.sqrt(self.N / self.M) + X_tilde *= scale + y_tilde *= scale + return X_tilde, y_tilde + + + def predict_y_from_y_tilda(self, X, X_tilda, pred_y_tilda): + + X = self.preprocess_X_df(X) + # Transposing the matrix before inverting + X_transpose_inv = np.linalg.inv(X.T) + + # Efficiently compute pred_y by considering the dimensions of matrices + pred_y = np.dot(np.dot(X_transpose_inv, X_tilda.T), pred_y_tilda) + + return pred_y + + + def _apply_parameter_constraints(self): + constraints = {**ElasticNetREmModel._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif value not in constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + + def calculate_mean_square_error(self, actual_values, predicted_values): + difference = (actual_values - predicted_values)# Please note that this function by Saniya calculates the Mean Square Error (MSE) + squared_diff = difference ** 2 # square of the difference + mean_squared_diff = np.mean(squared_diff) + return mean_squared_diff + + + def predict(self, X_test): + X_test = X_test[self.final_nodes] # Oct 28 + + if self.standardize_X: + self.X_test_standardized = self.standardize_X_data(X_test) + X_test = self.preprocess_X_df(self.X_test_standardized) + else: + X_test = self.preprocess_X_df(X_test) # X_test + #X_test = 
X_test[self.X_df.columns] # Oct 28 + return self.regr.predict(X_test) + + + def test_mse(self, X_test, y_test): + X_test = X_test.sort_index(axis=1) # 9/20 + if self.standardize_X: + self.X_test_standardized = self.standardize_X_data(X_test) + X_test = self.preprocess_X_df(self.X_test_standardized) + else: + X_test = self.preprocess_X_df(X_test) # X_test + #X_test = X_test[self.X_df.columns] # # Oct 28 + if self.center_y: + y_test = self.center_y_data(y_test) + if self.standardize_y: + self.y_test_standardized = self.standardize_y_data(y_test) + y_test = self.preprocess_y_df(self.y_test_standardized) + else: + y_test = self.preprocess_y_df(y_test) # X_test + + predY_test = self.regr.predict(X_test) # training data + mse_test = self.calculate_mean_square_error(y_test, predY_test) # Calculate MSE + return mse_test #mse_test + + ## October 28: :) + def calculate_nmse(self, actual_values, predicted_values):#(self, X_test, y_test): + nmse_test = em.nmse(actual_values, predicted_values) #(y_test, predY_test) # Calculate MSE + return nmse_test + + + def calculate_snr(self, actual_values, predicted_values):#(self, X_test, y_test): + snr_test = em.snr(actual_values, predicted_values) #(y_test, predY_test) # Calculate MSE + return snr_test + + + def calculate_psnr(self, actual_values, predicted_values):#(self, X_test, y_test): + psnr_test = em.psnr(actual_values, predicted_values) #(y_test, predY_test) # Calculate MSE + return psnr_test + +## end of Oct 28 + + + def get_params(self, deep=True): + params_dict = {"info":self.info, "alpha_enet": self.alpha_enet, "beta_net": self.beta_net, + "y_intercept": self.y_intercept, "model_type":self.model_type, + "standardize_X":self.standardize_X, + "center_y":self.center_y, + "max_enet_iterations":self.max_enet_iterations, + "network":self.network, "verbose":self.verbose, + "all_pos_coefs":self.all_pos_coefs, "model_info":self.model_info, + "target_gene_y":self.target_gene_y} + if self.model_type == "ElasticNetCV": + params_dict["num_cv_folds"] = self.num_cv_folds + params_dict["num_jobs"] = self.num_jobs + params_dict["alpha_enet"] = "ElasticNetCV finds optimal alpha" + params_dict["enet_cv_eps"] = self.enet_cv_eps + params_dict["enet_cv_n_alphas"] = self.enet_cv_n_alphas + params_dict["enet_cv_alphas"] = self.enet_cv_alphas + params_dict["optimal_alpha"] = self.optimal_alpha + elif self.model_type == "Linear": + params_dict["alpha_enet"] = "No alpha needed" + params_dict["num_jobs"] = self.num_jobs + if self.model_type != "Linear": + params_dict["tolerance"] = self.tolerance + params_dict["enet_selection"] = self.enet_selection + params_dict["l1_ratio"] = self.l1_ratio_en + if not deep: + return params_dict + else: + return copy.deepcopy(params_dict) + + + def set_params(self, **params): + """ Sets the value of any parameters in this estimator + Parameters: **params: Dictionary of parameter names mapped to their values + Returns: self: Returns an instance of self """ + if not params: + return self + for key, value in params.items(): + if key not in self.get_params(): + raise ValueError(f'Invalid parameter {key} for estimator {self.__class__.__name__}') + setattr(self, key, value) + return self + + + def __deepcopy__(self, memo): + cls = self.__class__ + result = cls.__new__(cls) + memo[id(self)] = result + for k, v in self.__dict__.items(): + setattr(result, k, deepcopy(v, memo)) + result.optimal_alpha = self.optimal_alpha + return result + + + def clone(self): + return deepcopy(self) + + + def score(self, X, y, zero_coef_penalty=10): + if isinstance(X, 
pd.DataFrame): + X = self.preprocess_X_df(X) # X_test + if isinstance(y, pd.DataFrame): + y = self.preprocess_y_df(y) + + # Make predictions using the predict method of your custom estimator + y_pred = self.predict(X) + + # Handle cases where predictions are exactly zero + y_pred[y_pred == 0] = 1e-10 + + # Calculate the normalized mean squared error between the true and predicted values + nmse_ = (y - y_pred)**2 + nmse_[y_pred == 1e-10] *= zero_coef_penalty + nmse_ = nmse_.mean() / (y**2).mean() + + if nmse_ == 0: + #return float(1e1000) # Return positive infinity if nmse_ is zero + + return float("inf") # Return positive infinity if nmse_ is zero + else: + return -nmse_ + + + def updating_network_A_matrix_given_X(self) -> np.ndarray: + """ When we call the fit method, this function is used to help us update the network information. + Here, we can generate updated W matrix, updated D matrix, and updated V matrix. + Then, those updated derived matrices are used to calculate the A matrix. + """ + network = self.network + final_nodes = self.final_nodes + W_df = network.W_df.copy() # updating the W matrix + + #1/31/24: + if self.tg_is_tf: + W_df = W_df.drop(columns = [self.tg_name]) + + # Simplified addition of new nodes + if self.gexpr_nodes_added: + for node in self.gexpr_nodes_added: + W_df[node] = np.nan + W_df.loc[node] = np.nan + + # Consolidated indexing and reindexing operations + W_df = W_df.reindex(index=final_nodes, columns=final_nodes) + + # Handle missing values + W_df.fillna(value=self.prior_network.default_edge_weight, inplace=True) + np.fill_diagonal(W_df.values, 0) + + N = len(final_nodes) + self.N = N + W = W_df.values + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + self.W = W + self.W_df = W_df + + # Check for symmetric matrix + if not ef.check_symmetric(W): + print(":( W matrix is NOT symmetric") + + # Update V matrix + self.V = N * np.eye(N) - np.ones(N) + + # Update D matrix + if not network.edge_values_for_degree: + W_bool = (W > network.threshold_for_degree) + d = np.float64(W_bool.sum(axis=0) - W_bool.diagonal()) + else: + if network.w_transform_for_d == "sqrt": + W_to_use = np.sqrt(W) + elif network.w_transform_for_d == "square": + W_to_use = W ** 2 + else: + W_to_use = W + d = W_to_use.diagonal() * (self.N - 1) + + # Handle pseudocount and self loops + d += network.pseudocount_for_degree + if network.consider_self_loops: + d += 1 + + d_inv_sqrt = 1 / np.sqrt(d) + self.D = ef.DiagonalLinearOperator(d_inv_sqrt) + + # Update inv_sqrt_degree_df + self.inv_sqrt_degree_df = pd.DataFrame({ + "TF": self.final_nodes, + "degree_D": self.D * np.ones(self.N) + }) + + Amat = self.D @ (self.V * W) @ self.D + A_df = pd.DataFrame(Amat, columns=final_nodes, index=final_nodes, dtype=np.float64) + + # Handle nodes based on `overlapped_nodes_only` + gene_expression_nodes = self.gene_expression_nodes + nodes_to_add = list(set(gene_expression_nodes) - set(final_nodes)) + self.nodes_to_add = nodes_to_add + if not self.overlapped_nodes_only: + for name in nodes_to_add: + A_df[name] = 0 + A_df.loc[name] = 0 + A_df = A_df.reindex(columns=sorted(gene_expression_nodes), index=sorted(gene_expression_nodes)) + else: + if len(nodes_to_add) == 1: + print(f"Please note that we remove 1 node {nodes_to_add[0]} found in the input gene expression data (X) that is not found in the input network :)") + elif len(nodes_to_add) > 1: + print(f":) Since overlapped_nodes_only = True, please note that we remove {len(nodes_to_add)} gene expression nodes that are not found in the input 
network.") + print(nodes_to_add) + A_df = A_df.sort_index(axis=0).sort_index(axis=1) + if graph.is_positive_semi_definite(A_df) == False: + print(":( Error! A is NOT positive semi-definite! There exist some negative eigenvalues for A! :(") + self.A_df = A_df + self.A = A_df.values + self.nodes = A_df.columns.tolist() + self.tf_names_list = self.nodes + return self + + def preprocess_X_df(self, X): + if isinstance(X, pd.DataFrame): + X_df = X + column_names_list = list(X_df.columns) + if self.tg_name in column_names_list: + X_df = X_df.drop(columns = [self.tg_name]) + + overlap_num = len(ef.intersection(column_names_list, self.final_nodes)) + if overlap_num == 0: + print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") + X_df = X_df.transpose() + column_names_list = list(X_df.columns) + overlap_num = len(ef.intersection(column_names_list, self.common_nodes)) + gene_names_list = self.final_nodes # so that this matches the order of columns in A matrix as well + X_df = X_df.loc[:, X_df.columns.isin(gene_names_list)] # filtering the X_df as needed based on the columns + X_df = X_df.reindex(columns=gene_names_list)# Reorder columns of dataframe to match order in `column_order` + X = np.array(X_df.values.tolist()) + return X + + + def preprocess_y_df(self, y): + if isinstance(y, pd.DataFrame): + y = y.values.flatten() + return y + + + def return_Linear_ML_model(self, X, y): + regr = LinearRegression(fit_intercept = self.y_intercept, + positive = self.all_pos_coefs, + n_jobs = self.num_jobs) + regr.fit(X, y) + return regr + + + def return_ElasticNet_ML_model(self, X, y): + regr = ElasticNet(alpha = self.alpha_enet, fit_intercept = self.y_intercept, + max_iter = self.max_enet_iterations, tol = self.tolerance, + selection = self.enet_selection, l1_ratio = self.l1_ratio_en, + positive = self.all_pos_coefs) + regr.fit(X, y) + return regr + + + def return_ElasticNetCV_ML_model(self, X, y): + regr = ElasticNetCV(cv = self.num_cv_folds, random_state = 0, + fit_intercept = self.y_intercept, + max_iter = self.max_enet_iterations, + n_jobs = self.num_jobs, + tol = self.tolerance, + l1_ratio = self.l1_ratio_en, + selection = self.enet_selection, + positive = self.all_pos_coefs, + eps = self.enet_cv_eps, + n_alphas = self.enet_cv_n_alphas, + alphas = self.enet_cv_alphas) + regr.fit(X, y) + return regr + + + def return_fit_ml_model(self, X, y): + if self.model_type == "Linear": + model_to_return = self.return_Linear_ML_model(X, y) + elif self.model_type == "ElasticNet": + model_to_return = self.return_ElasticNet_ML_model(X, y) + elif self.model_type == "ElasticNetCV": + model_to_return = self.return_ElasticNetCV_ML_model(X, y) + return model_to_return + + +def elasticnetrem(edge_list, beta_net = 1, alpha_enet = 0.01, default_edge_weight = 0.1, + degree_threshold = 0.5, gene_expression_nodes = [], overlapped_nodes_only = False, + y_intercept = False, standardize_X = True, standardize_y = True, center_y = False, view_network = False, + model_type = "ElasticNet", enet_selection = "cyclic", all_pos_coefs = False, tolerance = 1e-4, maxit = 10000, + l1_ratio_en = 0.5, num_jobs = -1, num_cv_folds = 5, enet_cv_eps = 1e-3, + enet_cv_n_alphas = 100, # default in sklearn + enet_cv_alphas = None, # default in sklearn + verbose = False, + hide_warnings = True): + degree_pseudocount = 1e-3 + if hide_warnings: + warnings.filterwarnings("ignore") + default_beta = False + default_alpha = False + if beta_net == 1: + print("using beta_net default of", 1) + 
default_beta = True + if alpha_enet == 0.01: + if model_type != "ElasticNetCV": + print("using alpha_enet default of", 0.01) + default_alpha = True + edge_vals_for_d = False + self_loops = False + w_transform_for_d = "none" + + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": edge_vals_for_d, + "consider_self_loops":self_loops, + "pseudocount_for_degree":degree_pseudocount, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":w_transform_for_d, + "threshold_for_degree": degree_threshold, + "verbose":verbose, + "view_network":view_network} + netty = graph.PriorGraphNetwork(**prior_graph_dict) # uses the network to get features like the A matrix. + greg_dict = {"network": netty, + "model_type": model_type, + "use_network":True, + "standardize_X":standardize_X, + "standardize_y":standardize_y, + "center_y":center_y, + "y_intercept":y_intercept, + "overlapped_nodes_only":overlapped_nodes_only, + "max_enet_iterations":maxit, + "all_pos_coefs":all_pos_coefs, + "view_network":view_network, + "l1_ratio_en":l1_ratio_en, + "verbose":verbose} + if default_alpha == False: + greg_dict["alpha_enet"] = alpha_enet + if default_beta == False: + greg_dict["beta_net"] = beta_net + if model_type != "Linear": + greg_dict["tolerance"] = tolerance + greg_dict["enet_selection"] = enet_selection + if model_type != "ElasticNet": + greg_dict["num_jobs"] = num_jobs + if model_type == "ElasticNetCV": + greg_dict["num_cv_folds"] = num_cv_folds + greg_dict["enet_cv_eps"] = enet_cv_eps + greg_dict["enet_cv_n_alphas"] = enet_cv_n_alphas + greg_dict["enet_cv_alphas"] = enet_cv_alphas + greggy = ElasticNetREmModel(**greg_dict) + return greggy \ No newline at end of file diff --git a/code/Netrem_model_builder.py b/code/Netrem_model_builder.py index 1e13f6d..3ba8e69 100644 --- a/code/Netrem_model_builder.py +++ b/code/Netrem_model_builder.py @@ -1,4 +1,6 @@ +# February 22, 2024 import pandas as pd +import polars as pl import numpy as np import matplotlib.pyplot as plt import random @@ -8,6 +10,11 @@ import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. 
import networkx as nx import scipy +import math +import shap +from pecanpy.graph import SparseGraph, DenseGraph # https://pecanpy.readthedocs.io/en/latest/pecanpy.html#pecanpy.cli.main +from pecanpy import pecanpy as node2vec +from sklearn.metrics.pairwise import cosine_similarity from scipy.linalg import svd as robust_svd from sklearn.model_selection import KFold, train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score from sklearn.decomposition import TruncatedSVD @@ -22,6 +29,8 @@ from skopt import gp_minimize, space from scipy.sparse.linalg.interface import LinearOperator import warnings +from sklearn.preprocessing import PolynomialFeatures +from sklearn.ensemble import RandomForestRegressor from sklearn.exceptions import ConvergenceWarning printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) rng_seed = 2023 # random seed for reproducibility @@ -29,17 +38,20 @@ # from packages_needed import * import essential_functions as ef import error_metrics as em # why to do import -#import Netrem_model_builder as nm import DemoDataBuilderXandY as demo import PriorGraphNetwork as graph import netrem_evaluation_functions as nm_eval -import matplotlib.pyplot as plt -import pandas as pd -import numpy as np -import networkx as nx -from sklearn.linear_model import LinearRegression, Lasso, LassoCV from tqdm.auto import tqdm -import copy +from pecanpy import pecanpy as node2vec +from pecanpy.graph import SparseGraph, DenseGraph # https://pecanpy.readthedocs.io/en/latest/pecanpy.html#pecanpy.cli.main +from sklearn.metrics.pairwise import cosine_similarity +from sklearn.preprocessing import StandardScaler +from gensim.models import Word2Vec +import subprocess +#import dask.dataframe as dd + +os.environ['PYTHONHASHSEED'] = '0' + """ Optimization for (1 / (2 * M)) * ||y - Xc||^2_2 + (beta / (2 * N^2)) * c'Ac + alpha * ||c||_1 @@ -65,10 +77,13 @@ class NetREmModel(BaseEstimator, RegressorMixin): "lassocv_eps": (0, None), "lassocv_n_alphas": (1, None), "standardize_X": [True, False], + "standardize_y": [True, False], "center_y": [True, False] } + os.environ['PYTHONHASHSEED'] = '0' def __init__(self, **kwargs): + self.info = "NetREm Model" self.verbose = False self.overlapped_nodes_only = False # restrict the nodes to only being those found in the network? 
overlapped_nodes_only @@ -77,7 +92,8 @@ def __init__(self, **kwargs): self.all_pos_coefs = False # for coefficients self.model_type = "Lasso" self.standardize_X = True - self.center_y = True + self.standardize_y = True + self.center_y = False self.use_network = True self.y_intercept = False self.max_lasso_iterations = 10000 @@ -90,8 +106,18 @@ def __init__(self, **kwargs): self.lassocv_n_alphas = 100 # default in sklearn self.lassocv_alphas = None # default in sklearn self.beta_net = kwargs.get('beta_net', 1) + self.small_sparse_graph = True, + self.dimensions: int = 128 + self.walk_length: int = 10 + self.num_walks: int = 10 + self.p: float = 1 + self.q: float = 1 #0.5 + self.workers: int = -1 + self.epochs: int = 1 + ######################################## + self.__dict__.update(kwargs) - required_keys = ["network", "beta_net"] + required_keys = ["network", "beta_net"]#, "gamma_net"] if self.model_type == "Lasso": self.alpha_lasso = kwargs.get('alpha_lasso', 0.01) self.optimal_alpha = "User-specified optimal alpha lasso: " + str(self.alpha_lasso) @@ -99,18 +125,29 @@ def __init__(self, **kwargs): elif self.model_type == "LassoCV": self.alpha_lasso = "LassoCV finds optimal alpha" self.optimal_alpha = "Since LassoCV is model_type, please fit model using X and y data to find optimal_alpha." - else: # model_type == "Linear": - self.alpha_lasso = "No alpha needed" - self.optimal_alpha = "No alpha needed" # + else: # linear regression + self.alpha_lasso = 0#"No alpha needed" + self.optimal_alpha = 0#"No alpha needed" # missing_keys = [key for key in required_keys if key not in self.__dict__] # check that all required keys are present: if missing_keys: raise ValueError(f":( Please note ye are missing information for these keys: {missing_keys}") if self.use_network: prior_network = self.network self.prior_network = prior_network + edge_list = self.network.W_df# Feb 12, 2024 + + edge_list = ( + edge_list + .reset_index() + .melt(id_vars=["index"], var_name="TF2", value_name="weight_W") + .rename(columns={"index": "TF1"}) + .query("TF1 != TF2") + ) + self.edge_list = edge_list.values.tolist() + ######################################################################### self.preprocessed_network = prior_network.preprocessed_network - self.network_params = prior_network.param_lists self.network_nodes_list = prior_network.final_nodes # tf_names_list + self.default_edge_weight = prior_network.default_edge_weight self.kwargs = kwargs self._apply_parameter_constraints() # ensuring that the parameter constraints are met @@ -121,9 +158,13 @@ def __repr__(self): def check_overlaps_work(self): - final_set = set(self.final_nodes) - network_set = set(self.network_nodes_list) - return network_set != final_set + #final_set = self.final_nodes_set + #network_set = self.ppi_net_nodes # set(self.network_nodes_list) + if self.tg_is_tf: + return False + if self.tg_name in self.final_nodes: + return False + return self.ppi_net_nodes != self.final_nodes_set def standardize_X_data(self, X_df): # if the user opts to @@ -131,13 +172,29 @@ def standardize_X_data(self, X_df): # if the user opts to and a standard deviation of 1), then this method will be run, which uses the preprocessing package StandardScalar() functionality. 
""" if self.standardize_X: + # if is_standardized(X_df): + # return X_df # Transform both the training and test data - X_scaled = self.scaler.transform(X_df) + X_scaled = self.scaler_X.transform(X_df) X_scaled_df = pd.DataFrame(X_scaled, columns=X_df.columns) return X_scaled_df else: return X_df + + def standardize_y_data(self, y_df): # if the user opts to + """ :) If the user opts to standardize the y data (so that the TG will have a mean of 0 + and a standard deviation of 1), then this method will be run, which uses the preprocessing + package StandardScalar() functionality. """ + if self.standardize_y: + # Transform both the training and test data + y_scaled = self.scaler_y.transform(y_df) + y_scaled_df = pd.DataFrame(y_scaled, columns=y_df.columns) + return y_scaled_df + else: + return y_df + + def center_y_data(self, y_df): # if the user opts to """ :) If the user opts to center the response y data: subtracting its mean from each observation.""" @@ -148,44 +205,51 @@ def center_y_data(self, y_df): # if the user opts to else: return y_df + def updating_network_and_X_during_fitting(self, X, y): # updated one :) """ Update the prior network information and the X input data (training) during the fitting of the model. It determines if the common predictors should be used (based on if overlapped_nodes_only is True) or if all of the X input data should be used. """ X_df = X.sort_index(axis=1) # sorting the X dataframe by columns. (rows are samples) - - #X_df = X.sort_index(axis=0).sort_index(axis=1) # sorting the X dataframe by rows and columns. - #self.X_df = X_df self.target_gene_y = y.columns[0] - + tg_name = self.tg_name if self.standardize_X: # we will standardize X then if self.verbose: print(":) Standardizing the X data") self.old_X_df = X_df - self.scaler = preprocessing.StandardScaler().fit(X_df) # Fit the scaler to the training data only + self.scaler_X = preprocessing.StandardScaler().fit(X_df) # Fit the scaler to the training data only # this self.scalar will be utilized for the testing data to prevent data leakage and to ensure generalization :) self.X_df = self.standardize_X_data(X_df) X = self.X_df # overwriting and updating the X df else: self.X_df = X_df + X = self.X_df - self.mean_y_train = np.mean(y) # the average y value if self.center_y: # we will center y then + self.mean_y_train = np.mean(y) # the average y value if self.verbose: print(":) centering the y data") # Assuming y_train and y_test are your training and test labels self.old_y = y y = self.center_y_data(y) - #gene_expression_nodes = X_df.columns.tolist() # these are already sorted - tg_name = y.columns.tolist()[0] - if tg_name in X_df.columns.tolist(): - X_df = X_df.drop(columns = [tg_name]) - - #gene_expression_nodes = list(set(X_df.columns.tolist()) - tg_name) # these are already sorted - gene_expression_nodes = sorted(X_df.columns.tolist()) # these will be sorted - ppi_net_nodes = set(self.network_nodes_list) # set(self.network_nodes_list) - tg_name + if self.standardize_y: # we will standardize y then + if self.verbose: + print(":) Standardizing the y data") + self.old_y_df = y + self.scaler_y = preprocessing.StandardScaler().fit(y) # Fit the scaler to the training data only + # this self.scalar will be utilized for the testing data to prevent data leakage and to ensure generalization :) + self.y_df = self.standardize_y_data(y) + y = self.y_df # overwriting and updating the y df + else: + self.y_df = y + if self.tg_name in X_df.columns.tolist(): + X_df.drop(columns = [tg_name], inplace = True) + 
self.tg_is_tf = True # 1/31/24 + gene_expression_nodes = X_df.columns.tolist() #sorted(X_df.columns.tolist()) # these will be sorted + self.ppi_net_nodes = set(self.network_nodes_list) + ppi_net_nodes = self.ppi_net_nodes common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) if not common_nodes: # may be possible that the X dataframe needs to be transposed if provided incorrectly @@ -196,17 +260,23 @@ def updating_network_and_X_during_fitting(self, X, y): self.gene_expression_nodes = gene_expression_nodes self.common_nodes = sorted(common_nodes) - gene_expression_nodes = sorted(gene_expression_nodes) # 10/22 self.final_nodes = gene_expression_nodes if self.overlapped_nodes_only: - self.final_nodes = common_nodes + self.final_nodes = self.common_nodes elif self.preprocessed_network: - self.final_nodes = self.prior_network.final_nodes + self.final_nodes = sorted(self.prior_network.final_nodes) else: self.final_nodes = gene_expression_nodes - self.final_nodes = sorted(self.final_nodes) # 10/22 - final_nodes_set = set(self.final_nodes) + self.final_nodes_set = set(self.final_nodes) + final_nodes_set = self.final_nodes_set ppi_nodes_to_remove = list(ppi_net_nodes - final_nodes_set) + self.ppi_nodes_to_remove = ppi_nodes_to_remove + if self.tg_name in self.final_nodes: + self.tg_is_tf = True + filter_network_bool = True + self.filter_network_bool = filter_network_bool + self.final_nodes.remove(self.tg_name) + ppi_nodes_to_remove = list(set(ppi_nodes_to_remove).union(set(self.tg_name))) self.gexpr_nodes_added = list(set(gene_expression_nodes) - final_nodes_set) self.gexpr_nodes_to_add_for_net = list(set(gene_expression_nodes) - set(common_nodes)) @@ -216,9 +286,11 @@ def updating_network_and_X_during_fitting(self, X, y): print(ppi_nodes_to_remove) else: print(f":) Please note that all {len(common_nodes)} nodes found in the network are also found in the input gene expression data (X) :)") - - filter_network_bool = self.filter_network_bool = self.check_overlaps_work() #self.check_overlaps_work(X_df) - if filter_network_bool: + self.filter_network_bool = self.check_overlaps_work() + + if self.tg_is_tf: + self.filter_network_bool = True + if self.filter_network_bool: print("Please note that we need to update the network information") self.updating_network_A_matrix_given_X() # updating the A matrix given the gene expression data X if self.view_network: @@ -228,28 +300,52 @@ def updating_network_and_X_during_fitting(self, X, y): self.A_df = self.network.A_df self.A = self.network.A self.nodes = self.A_df.columns.tolist() - - self.network_params = self.prior_network.param_lists self.network_info = "fitted_network" self.M = y.shape[0] self.N = len(self.final_nodes) # pre-processing: - self.X_train = self.preprocess_X_df(X) - self.y_train = self.preprocess_y_df(y) + self.X_train = self.preprocess_X_df(X) # dataframe to array + self.y_train = self.preprocess_y_df(y) # dataframe to array return self - def organize_B_interaction_list(self): # TF-TF interactions to output :) - self.B_train = self.compute_B_matrix(self.X_train) - self.B_interaction_df = pd.DataFrame(self.B_train, index = self.final_nodes, columns = self.final_nodes) - return self + final_tfs = self.model_nonzero_coef_df + final_tfs = final_tfs.drop(columns = ["y_intercept"]).columns.tolist() + if len(final_tfs) == 0: + self.coord_score_df = pd.DataFrame() + else: + X_tilda_train_df = self.X_tilda_train_df + c_df = self.model_coef_df + c_df = c_df.drop(columns = ["y_intercept"]) + coeff_vector = c_df.iloc[0].values + cos_sim = 
cosine_similarity(X_tilda_train_df.T) # Transpose DataFrame to calculate column-wise similarity + cos_sim_df = pd.DataFrame(cos_sim, index = c_df.columns, columns = c_df.columns) + coeff_matrix = np.outer(coeff_vector, coeff_vector) + sign_matrix = np.sign(coeff_matrix).astype(int) + coord_matrix = abs(cos_sim_df) * sign_matrix + result = coord_matrix.loc[final_tfs, final_tfs] + np.fill_diagonal(result.values, 0) + max_other = np.max(np.abs(result)).max() + coord_matrix = 100.0*result/max_other + self.coord_score_df = coord_matrix + self.TF_interaction_df = self.coord_score_df + return self def fit(self, X, y): # fits a model Function used for model training + tg_name = y.columns.tolist()[0] + self.tg_is_tf = False + if tg_name in X.columns.tolist(): + if self.verbose: + print(f":) dropping TG {tg_name} from list of TF predictors!") + X.drop(columns = [tg_name], inplace = True) + self.tg_is_tf = True # 1/31/24 + self.tg_name = tg_name self.updating_network_and_X_during_fitting(X, y) - self.organize_B_interaction_list() - self.B_train_times_M = self.compute_B_matrix_times_M(self.X_train) - self.X_tilda_train, self.y_tilda_train = self.compute_X_tilde_y_tilde(self.B_train_times_M, self.X_train, + self.E_train = self.compute_E_matrix(self.X_train) + self.X_tilda_train, self.y_tilda_train = self.compute_X_tilde_y_tilde(self.E_train, self.X_train, self.y_train) + self.standardize_X_tilde_y_tilde() + # learning latent embedding values for X and y, respectively. self.X_training_to_use, self.y_training_to_use = self.X_tilda_train, self.y_tilda_train self.regr = self.return_fit_ml_model(self.X_training_to_use, self.y_training_to_use) ml_model = self.regr @@ -263,10 +359,12 @@ def fit(self, X, y): # fits a model Function used for model training self.coef[self.coef == -0.0] = 0 if self.y_intercept: self.intercept = ml_model.intercept_ - self.predY_tilda_train = ml_model.predict(self.X_training_to_use) # training data - self.mse_tilda_train = self.calculate_mean_square_error(self.y_training_to_use, self.predY_tilda_train) # Calculate MSE - self.predY_train = ml_model.predict(self.X_train) # training data - self.mse_train = self.calculate_mean_square_error(self.y_train, self.predY_train) # Calculate MSE + self.predY_train = ml_model.predict(self.X_train) # training data + # training metrics: + self.mse_train = self.calculate_mean_square_error(self.y_train, self.predY_train) # Calculate MSE + self.nmse_train = self.calculate_nmse(self.y_train, self.predY_train) # Calculate NMSE + self.snr_train = self.calculate_snr(self.y_train, self.predY_train) # Calculate SNR (Signal to Noise Ratio) + self.psnr_train = self.calculate_psnr(self.y_train, self.predY_train) # Calculate PSNR (Peak Signal to Noise Ratio) if self.y_intercept: coeff_terms = [self.intercept] + list(self.coef) index_names = ["y_intercept"] + self.nodes @@ -288,6 +386,7 @@ def fit(self, X, y): # fits a model Function used for model training self.num_final_predictors = len(selected_cols) if "y_intercept" in selected_cols: self.num_final_predictors = self.num_final_predictors - 1 + self.organize_B_interaction_list() return self @@ -304,7 +403,7 @@ def netrem_model_predictor_results(self, y): # olders sorted_df['TF'] = sorted_df.index sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) self.sorted_coef_df = sorted_df # print the sorted DataFrame - tg = y.columns.tolist()[0] + tg = self.tg_name corr = pd.DataFrame(self.X_df.corrwith(y[tg])).transpose() corr["info"] = "corr (r) with y: " + tg all_df = self.model_coef_df @@ -319,7 
+418,6 @@ def netrem_model_predictor_results(self, y): # olders all_df = pd.concat([all_df, sorting]) self.corr_vs_coef_df = all_df self.final_corr_vs_coef_df = self.corr_vs_coef_df[["info", "input_data"] + self.model_nonzero_coef_df.columns.tolist()[1:]] - netrem_model_df = self.model_nonzero_coef_df.transpose() netrem_model_df.columns = ["coef"] netrem_model_df["TF"] = netrem_model_df.index.tolist() @@ -329,6 +427,9 @@ def netrem_model_predictor_results(self, y): # olders else: netrem_model_df["info"] = "netrem_no_intercept" netrem_model_df["train_mse"] = self.mse_train + netrem_model_df["train_nmse"] = self.nmse_train + netrem_model_df["train_snr"] = self.snr_train + netrem_model_df["train_psnr"] = self.psnr_train if self.model_type != "Linear": netrem_model_df["beta_net"] = self.beta_net if self.model_type == "LassoCV": @@ -343,9 +444,11 @@ def netrem_model_predictor_results(self, y): # olders self.combined_df["TFs_input_to_model"] = len(self.final_nodes) self.combined_df["original_TFs_in_X"] = len(self.gene_expression_nodes) self.combined_df["standardized_X"] = self.standardize_X + self.combined_df["standardized_y"] = self.standardize_y self.combined_df["centered_y"] = self.center_y return self + def view_W_network(self): roundedW = np.round(self.W, decimals=4) wMat = ef.view_matrix_as_dataframe(roundedW, column_names_list=self.final_nodes, row_names_list=self.final_nodes) @@ -353,13 +456,10 @@ def view_W_network(self): w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] w_edgeList = w_edgeList.rename(columns={"level_0": "source", "level_1": "target", 0: "weight"}) w_edgeList = w_edgeList[w_edgeList["weight"] != 0] - G = nx.from_pandas_edgelist(w_edgeList, source="source", target="target", edge_attr="weight") pos = nx.spring_layout(G) weights_list = [G.edges[e]['weight'] * self.prior_network.edge_weight_scaling for e in G.edges] - fig, ax = plt.subplots() - if not self.overlapped_nodes_only: nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) if nodes_to_add: @@ -373,12 +473,11 @@ def view_W_network(self): nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) else: nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) - labels = {e: G.edges[e]['weight'] for e in G.edges} return nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, ax=ax) - def compute_B_matrix_times_M(self, X): + def compute_E_matrix(self, X): """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html The optimization objective for Lasso is: @@ -386,51 +485,62 @@ def compute_B_matrix_times_M(self, X): Calculations""" XtX = X.T @ X beta_L2 = self.beta_net - N_squared = self.N * self.N - part_2 = 2.0 * float(beta_L2) * self.M / (N_squared) * self.A - B = XtX + part_2 + part_1 = XtX/self.M + part_2 = float(beta_L2) * self.A + B = part_1 + part_2 + self.B_df = pd.DataFrame(B, index = self.final_nodes, columns = self.final_nodes) # please fix so it is self.E_df + self.E_part_XtX = pd.DataFrame(part_1, index = self.final_nodes, columns = self.final_nodes) + self.E_part_netReg = pd.DataFrame(part_2, index = self.final_nodes, columns = self.final_nodes) return B - def compute_B_matrix(self, X): - """ M is N_sample, because ||y - Xc||^2 need 
to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term - see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html - The optimization objective for Lasso is: - (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 - where M = n_sample - Outputting for user """ - return self.compute_B_matrix_times_M(X) / self.M - - def compute_X_tilde_y_tilde(self, B, X, y): """Compute X_tilde, y_tilde such that X_tilde.T @ X_tilde = B, y_tilde.T @ X_tilde = y.T @ X """ U, s, _Vh = np.linalg.svd(B, hermitian=True) # B = U @ np.diag(s) @ _Vh + self.U = U + self.s = s + self.B = B if (cond := s[0] / s[-1]) > 1e10: print(f'Large conditional number of B matrix: {cond: .2f}') S_sqrt = ef.DiagonalLinearOperator(np.sqrt(s)) S_inv_sqrt = ef.DiagonalLinearOperator(1 / np.sqrt(s)) X_tilde = S_sqrt @ U.T + svd_problem_bool = np.isnan(U.T).any() # we may have problems here + if svd_problem_bool: + B_df = self.B_df + updated_B_df = B_df.dropna(how='all') + B = updated_B_df.values + U, s, _Vh = np.linalg.svd(B, hermitian=True) # B = U @ np.diag(s) @ _Vh + if (cond := s[0] / s[-1]) > 1e10: + print(f'Large conditional number of B matrix: {cond: .2f}') + S_sqrt = ef.DiagonalLinearOperator(np.sqrt(s)) + S_inv_sqrt = ef.DiagonalLinearOperator(1 / np.sqrt(s)) + X_tilde = S_sqrt @ U.T + self.revised_B_train = B y_tilde = (y @ X @ U @ S_inv_sqrt).T # assert(np.allclose(y.T @ X, y_tilde.T @ X_tilde)) # assert(np.allclose(B, X_tilde.T @ X_tilde)) # scale: we normalize by 1/M, but sklearn.linear_model.Lasso normalize by 1/N because X_tilde is N*N matrix, # so Lasso thinks the number of sample is N instead of M, to use lasso solve our desired problem, correct the scale - scale = np.sqrt(self.N / self.M) - X_tilde *= scale + scale = np.sqrt(self.N)/ self.M + X_tilde *= np.sqrt(self.N) y_tilde *= scale return X_tilde, y_tilde - def predict_y_from_y_tilda(self, X, X_tilda, pred_y_tilda): - - X = self.preprocess_X_df(X) - # Transposing the matrix before inverting - X_transpose_inv = np.linalg.inv(X.T) - - # Efficiently compute pred_y by considering the dimensions of matrices - pred_y = np.dot(np.dot(X_transpose_inv, X_tilda.T), pred_y_tilda) - - return pred_y + def standardize_X_tilde_y_tilde(self): + """Compute X_tilde, y_tilde such that X_tilde.T @ X_tilde = B, y_tilde.T @ X_tilde = y.T @ X """ + self.X_tilda_train_df = pd.DataFrame(self.X_tilda_train, index = self.final_nodes, columns = self.final_nodes) + scaler = StandardScaler() + self.X_tilda_train_standardized_df = pd.DataFrame(scaler.fit_transform(self.X_tilda_train), + columns=self.final_nodes, index = self.final_nodes) + scaler = StandardScaler() + # Assuming y_tilda_train is your 1D array + y_tilda_train_reshaped = self.y_tilda_train.reshape(-1, 1) + # Then you use the reshaped array with StandardScaler + self.y_tilda_train_standardized_df = pd.DataFrame(scaler.fit_transform(y_tilda_train_reshaped)) + self.standardized_X_tilda_train = self.X_tilda_train_standardized_df.values + self.standardized_y_tilda_train = self.y_tilda_train_standardized_df.T.values[0] def _apply_parameter_constraints(self): @@ -455,6 +565,7 @@ def calculate_mean_square_error(self, actual_values, predicted_values): def predict(self, X_test): + X_test = X_test[self.final_nodes] # Oct 28 if self.standardize_X: self.X_test_standardized = self.standardize_X_data(X_test) X_test = self.preprocess_X_df(self.X_test_standardized) @@ -465,20 +576,48 @@ def predict(self, X_test): def test_mse(self, X_test, y_test): X_test = X_test.sort_index(axis=1) # 9/20 + if 
self.tg_is_tf: # 3/30/24 + X_test = X_test.drop(columns = [self.tg_name]) + if self.standardize_X: - self.X_test_standardized = self.standardize_X_data(X_test) + if is_standardized(X_test) == False: + self.X_test_standardized = self.standardize_X_data(X_test) + else: + self.X_test_standardized = X_test X_test = self.preprocess_X_df(self.X_test_standardized) else: X_test = self.preprocess_X_df(X_test) # X_test if self.center_y: y_test = self.center_y_data(y_test) - #X_test = self.preprocess_X_df(X_test) # X_test - y_test = self.preprocess_y_df(y_test) + if self.standardize_y: + if is_standardized(y_test) == False: + self.y_test_standardized = self.standardize_y_data(y_test) + else: + self.y_test_standardized = y_test + y_test = self.preprocess_y_df(self.y_test_standardized) + else: + y_test = self.preprocess_y_df(y_test) # X_test + predY_test = self.regr.predict(X_test) # training data mse_test = self.calculate_mean_square_error(y_test, predY_test) # Calculate MSE - return mse_test #mse_test + return mse_test + + ## October 28: :) + def calculate_nmse(self, actual_values, predicted_values):#(self, X_test, y_test): + nmse_test = em.nmse(actual_values, predicted_values) #(y_test, predY_test) # Calculate MSE + return nmse_test + + def calculate_snr(self, actual_values, predicted_values):#(self, X_test, y_test): + snr_test = em.snr(actual_values, predicted_values) #(y_test, predY_test) # Calculate MSE + return snr_test + + def calculate_psnr(self, actual_values, predicted_values):#(self, X_test, y_test): + psnr_test = em.psnr(actual_values, predicted_values) #(y_test, predY_test) # Calculate MSE + return psnr_test +## end of Oct 28 + def get_params(self, deep=True): params_dict = {"info":self.info, "alpha_lasso": self.alpha_lasso, "beta_net": self.beta_net, "y_intercept": self.y_intercept, "model_type":self.model_type, @@ -497,7 +636,7 @@ def get_params(self, deep=True): params_dict["lassocv_alphas"] = self.lassocv_alphas params_dict["optimal_alpha"] = self.optimal_alpha elif self.model_type == "Linear": - params_dict["alpha_lasso"] = "No alpha needed" + params_dict["alpha_lasso"] = 0 #"No alpha needed" params_dict["num_jobs"] = self.num_jobs if self.model_type != "Linear": params_dict["tolerance"] = self.tolerance @@ -543,45 +682,52 @@ def score(self, X, y, zero_coef_penalty=10): # Make predictions using the predict method of your custom estimator y_pred = self.predict(X) - # Handle cases where predictions are exactly zero y_pred[y_pred == 0] = 1e-10 - # Calculate the normalized mean squared error between the true and predicted values nmse_ = (y - y_pred)**2 nmse_[y_pred == 1e-10] *= zero_coef_penalty nmse_ = nmse_.mean() / (y**2).mean() - if nmse_ == 0: - #return float(1e1000) # Return positive infinity if nmse_ is zero - return float("inf") # Return positive infinity if nmse_ is zero else: return -nmse_ - + def updating_network_A_matrix_given_X(self) -> np.ndarray: """ When we call the fit method, this function is used to help us update the network information. Here, we can generate updated W matrix, updated D matrix, and updated V matrix. Then, those updated derived matrices are used to calculate the A matrix. 
""" + #print("updating_network_A_matrix_given_X") network = self.network + #print("here we go") + + if self.tg_name in self.final_nodes: + self.tg_is_tf = True + self.final_nodes.remove(self.tg_name) final_nodes = self.final_nodes + W_df = network.W_df.copy() # updating the W matrix + + if self.tg_is_tf: #1/31/24: + W_df = W_df.drop(columns = [self.tg_name], index = [self.tg_name]) + + if len(self.ppi_nodes_to_remove) > 0: + W_df = W_df.drop(index=self.ppi_nodes_to_remove, columns=self.ppi_nodes_to_remove) + default_edge_weight = self.prior_network.default_edge_weight + + if len(self.gexpr_nodes_to_add_for_net) > 0: # Simplified addition of new nodes + for node in self.gexpr_nodes_to_add_for_net: #netrem_chosen_demo.gexpr_nodes_added: + W_df[node] = default_edge_weight + W_df.loc[node] = default_edge_weight - # Simplified addition of new nodes - if self.gexpr_nodes_added: - for node in self.gexpr_nodes_added: - W_df[node] = np.nan - W_df.loc[node] = np.nan # Consolidated indexing and reindexing operations W_df = W_df.reindex(index=final_nodes, columns=final_nodes) # Handle missing values - W_df.fillna(value=self.prior_network.default_edge_weight, inplace=True) np.fill_diagonal(W_df.values, 0) - N = len(final_nodes) self.N = N W = W_df.values @@ -589,10 +735,16 @@ def updating_network_A_matrix_given_X(self) -> np.ndarray: self.W = W self.W_df = W_df - - # Check for symmetric matrix - if not ef.check_symmetric(W): - print(":( W matrix is NOT symmetric") + # Feb 12, 2024 + edge_list = ( + W_df + .reset_index() + .melt(id_vars=["index"], var_name="TF2", value_name="weight_W") + .rename(columns={"index": "TF1"}) + .query("TF1 != TF2") + ) + self.edge_list = edge_list.values.tolist() + ########################################################################################## # Update V matrix self.V = N * np.eye(N) - np.ones(N) @@ -608,14 +760,27 @@ def updating_network_A_matrix_given_X(self) -> np.ndarray: W_to_use = W ** 2 else: W_to_use = W - d = W_to_use.diagonal() * (self.N - 1) + if network.w_transform_for_d == "avg": # added on 2/8/24 + d = W_to_use.diagonal() + else: # summing up the values + d = W_to_use.diagonal() * (self.N - 1) # Handle pseudocount and self loops d += network.pseudocount_for_degree + if network.consider_self_loops: d += 1 d_inv_sqrt = 1 / np.sqrt(d) + # 2/5/24 + self.node_degree_df = pd.DataFrame(d, index = self.final_nodes, columns = ["d_i"]) + self.D_df = pd.DataFrame(np.diag(d_inv_sqrt)) + annotated_D = self.D_df + annotated_D.columns = self.final_nodes + annotated_D.index = self.final_nodes + self.D_df = annotated_D + ###### + self.D = ef.DiagonalLinearOperator(d_inv_sqrt) # Update inv_sqrt_degree_df @@ -625,11 +790,10 @@ def updating_network_A_matrix_given_X(self) -> np.ndarray: }) Amat = self.D @ (self.V * W) @ self.D - A_df = pd.DataFrame(Amat, columns=final_nodes, index=final_nodes, dtype=np.float64) - + A_df = pd.DataFrame(Amat, columns=final_nodes, index=final_nodes, dtype=np.float32) # Handle nodes based on `overlapped_nodes_only` gene_expression_nodes = self.gene_expression_nodes - nodes_to_add = list(set(gene_expression_nodes) - set(final_nodes)) + nodes_to_add = list(set(self.gene_expression_nodes ) - set(final_nodes)) self.nodes_to_add = nodes_to_add if not self.overlapped_nodes_only: for name in nodes_to_add: @@ -645,33 +809,26 @@ def updating_network_A_matrix_given_X(self) -> np.ndarray: A_df = A_df.sort_index(axis=0).sort_index(axis=1) self.A_df = A_df + # if graph.is_positive_semi_definite(A_df) == False: + # print(":( Error! 
A is NOT positive semi-definite! There exist some negative eigenvalues for A! :(") self.A = A_df.values self.nodes = A_df.columns.tolist() self.tf_names_list = self.nodes return self - def preprocess_X_df(self, X): - if isinstance(X, pd.DataFrame): - X_df = X - column_names_list = list(X_df.columns) - overlap_num = len(ef.intersection(column_names_list, self.final_nodes)) - if overlap_num == 0: - print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") - X_df = X_df.transpose() - column_names_list = list(X_df.columns) - overlap_num = len(ef.intersection(column_names_list, self.common_nodes)) - gene_names_list = self.final_nodes # so that this matches the order of columns in A matrix as well - X_df = X_df.loc[:, X_df.columns.isin(gene_names_list)] # filtering the X_df as needed based on the columns - X_df = X_df.reindex(columns=gene_names_list)# Reorder columns of dataframe to match order in `column_order` - X = np.array(X_df.values.tolist()) - return X + def preprocess_X_df(self, X_df): + if self.tg_name in X_df.columns: + X_df.drop(columns=[self.tg_name], inplace=True) + + # Ensure X_df contains only final_nodes. + X_df = X_df.loc[:, self.final_nodes] if self.final_nodes else X_df + return X_df.values + def preprocess_y_df(self, y): - if isinstance(y, pd.DataFrame): - y = y.values.flatten() - return y - + return y.values.flatten() if isinstance(y, pd.DataFrame) else y + def return_Linear_ML_model(self, X, y): regr = LinearRegression(fit_intercept = self.y_intercept, @@ -691,6 +848,7 @@ def return_Lasso_ML_model(self, X, y): def return_LassoCV_ML_model(self, X, y): + self.num_cv_folds = min(X.shape[0], self.num_cv_folds) # April 2024 regr = LassoCV(cv = self.num_cv_folds, random_state = 0, fit_intercept = self.y_intercept, max_iter = self.max_lasso_iterations, @@ -715,31 +873,29 @@ def return_fit_ml_model(self, X, y): return model_to_return -def netrem(edge_list, beta_net = 1, alpha_lasso = 0.01, default_edge_weight = 0.1, - degree_threshold = 0.5, gene_expression_nodes = [], overlapped_nodes_only = False, - y_intercept = False, standardize_X = True, center_y = True, view_network = False, +def netrem(edge_list, beta_net = 1, alpha_lasso = 0.01, default_edge_weight = 0.01, + edge_vals_for_d = True, w_transform_for_d = "none", degree_threshold = 0.5, + gene_expression_nodes = [], overlapped_nodes_only = False, + y_intercept = False, standardize_X = True, standardize_y = True, center_y = False, view_network = False, model_type = "Lasso", lasso_selection = "cyclic", all_pos_coefs = False, tolerance = 1e-4, maxit = 10000, - num_jobs = -1, num_cv_folds = 5, lassocv_eps = 1e-3, + num_jobs = -1, num_cv_folds = 5, lassocv_eps = 1e-3, lassocv_n_alphas = 100, # default in sklearn lassocv_alphas = None, # default in sklearn - verbose = False, - hide_warnings = True): - degree_pseudocount = 1e-3 + verbose = False, degree_pseudocount = 0, + hide_warnings = True):#, gamma_net = 0,): + os.environ['PYTHONHASHSEED'] = '0' if hide_warnings: warnings.filterwarnings("ignore") default_beta = False default_alpha = False if beta_net == 1: - print("using beta_net default of", 1) + print(":) netrem (may have prior knowledge): using beta_net default of", 1) default_beta = True if alpha_lasso == 0.01: if model_type != "LassoCV": - print("using alpha_lasso default of", 0.01) + print(":) netrem (may have prior knowledge): using alpha_lasso default of", 0.01) default_alpha = True - edge_vals_for_d = False - self_loops = False - w_transform_for_d = "none" 
- + self_loops = False prior_graph_dict = {"edge_list": edge_list, "gene_expression_nodes":gene_expression_nodes, "edge_values_for_degree": edge_vals_for_d, @@ -755,7 +911,9 @@ def netrem(edge_list, beta_net = 1, alpha_lasso = 0.01, default_edge_weight = 0. "model_type": model_type, "use_network":True, "standardize_X":standardize_X, + "standardize_y":standardize_y, "center_y":center_y, + #"gamma_net":gamma_net, "y_intercept":y_intercept, "overlapped_nodes_only":overlapped_nodes_only, "max_lasso_iterations":maxit, @@ -791,7 +949,8 @@ def netremCV(edge_list, X, y, gene_expression_nodes = [], overlapped_nodes_only: bool = False, standardize_X: bool = True, - center_y: bool = True, + standardize_y: bool = True, + center_y: bool = False, y_intercept: bool = False, model_type = "Lasso", lasso_selection = "cyclic", @@ -882,6 +1041,7 @@ def netremCV(edge_list, X, y, network=prior_network, overlapped_nodes_only=overlapped_nodes_only, standardize_X = standardize_X, + standardize_y = standardize_y, center_y = center_y, y_intercept = y_intercept, max_lasso_iterations = maxit, @@ -922,6 +1082,7 @@ def netremCV(edge_list, X, y, model_type="Lasso", network=prior_network, standardize_X = standardize_X, + standardize_y = standardize_y, center_y = center_y, overlapped_nodes_only=overlapped_nodes_only, y_intercept = y_intercept, @@ -1006,32 +1167,310 @@ def netremCV(edge_list, X, y, return newest_netrem -def organize_B_interaction_network(netrem_model): - B_interaction_df = netrem_model.B_interaction_df - num_TFs = B_interaction_df.shape[0] - B_interaction_df = B_interaction_df.reset_index().melt(id_vars='index', var_name='TF2', value_name='B_train_weight') - B_interaction_df = B_interaction_df.rename(columns = {"index":"TF1"}) - B_interaction_df = B_interaction_df[B_interaction_df["TF1"] != B_interaction_df["TF2"]] - B_interaction_df = B_interaction_df.sort_values(by = ['B_train_weight'], ascending = False) - B_interaction_df["sign"] = np.where((B_interaction_df.B_train_weight > 0), ":)", ":(") - B_interaction_df["potential_interaction"] = np.where((B_interaction_df.B_train_weight > 0), ":(", - ":( competitive (-)") - B_interaction_df["absVal_B"] = abs(B_interaction_df["B_train_weight"]) - B_interaction_df["info"] = "B matrix of TF-TF interactions" - B_interaction_df["candidate_TFs_N"] = num_TFs - B_interaction_df["target_gene_y"] = netrem_model.target_gene_y - B_interaction_df["num_final_predictors"] = netrem_model.num_final_predictors - B_interaction_df["model_type"] = netrem_model.model_type - B_interaction_df["beta_net"] = netrem_model.beta_net - B_interaction_df["X_standardized"] = netrem_model.standardize_X - B_interaction_df["gene_data"] = "training gene expression data" +def organize_predictor_interaction_network(netrem_model): + if "model_nonzero_coef_df" not in vars(netrem_model).keys(): + print(":( No NetREm model was built") + return None + TF_interaction_df = netrem_model.TF_interaction_df + if "model_type" in TF_interaction_df.columns.tolist(): + TF_interaction_df = TF_interaction_df.drop(columns = ["model_type"]) + num_TFs = TF_interaction_df.shape[0] + TF_interaction_df = netrem_model.TF_coord_scores_pairwise_df.drop(columns = ["absVal_coord_score"]) + TF_interaction_df = TF_interaction_df.rename(columns = {"coordination_score":"coord_score_cs"}) + + TF_interaction_df["sign"] = np.where((TF_interaction_df.coord_score_cs > 0), ":)", ":(") + TF_interaction_df["potential_interaction"] = np.where((TF_interaction_df.coord_score_cs > 0), ":) cooperative (+)", + ":( competitive (-)") + 
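+    # Note: TF-TF pairs with a positive coordination score (coord_score_cs > 0) are labeled cooperative (+); non-positive scores are labeled competitive (-)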
TF_interaction_df["absVal_coordScore"] = abs(TF_interaction_df["coord_score_cs"]) + TF_interaction_df["model_type"] = netrem_model.model_type + TF_interaction_df["info"] = "matrix of TF-TF interactions" + TF_interaction_df["candidate_TFs_N"] = num_TFs + TF_interaction_df["target_gene_y"] = netrem_model.target_gene_y + if 'num_final_predictors' in vars(netrem_model).keys(): + TF_interaction_df["num_final_predictors"] = netrem_model.num_final_predictors + else: + TF_interaction_df["num_final_predictors"] = "No final model :(" + TF_interaction_df["model_type"] = netrem_model.model_type + TF_interaction_df["beta_net"] = netrem_model.beta_net + TF_interaction_df["X_standardized"] = netrem_model.standardize_X + TF_interaction_df["y_standardized"] = netrem_model.standardize_y + + TF_interaction_df["gene_data"] = "training gene expression data" # Step 1: Please Sort the DataFrame - B_interaction_df = B_interaction_df.sort_values('absVal_B', ascending=False) + TF_interaction_df = TF_interaction_df.sort_values('absVal_coordScore', ascending=False) # Step 2: Get the rank - B_interaction_df['rank'] = B_interaction_df['absVal_B'].rank(method='min', ascending=False) + TF_interaction_df['cs_magnitude_rank'] = TF_interaction_df['absVal_coordScore'].rank(method='min', ascending=False) # Step 3: Calculate the percentile - B_interaction_df['percentile'] = (1 - (B_interaction_df['rank'] / B_interaction_df['absVal_B'].count())) * 100 - return B_interaction_df \ No newline at end of file + TF_interaction_df['cs_magnitude_percentile'] = (1 - (TF_interaction_df['cs_magnitude_rank'] / TF_interaction_df['absVal_coordScore'].count())) * 100 + + TF_interaction_df["TF_TF"] = TF_interaction_df["TF1"] + "_" + TF_interaction_df["TF2"] + + standardized_data = True + if "old_X_df" in vars(netrem_model).keys(): + standardized_X_corr_mat = netrem_model.X_df.corr() + original_X_corr_mat = netrem_model.old_X_df.corr() + else: + original_X_corr_mat = netrem_model.X_df.corr() + standardized_data = False + # Melting the DataFrame into a 3-column edge list + original_X_corr_df = original_X_corr_mat.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="original_corr") + original_X_corr_df.rename(columns={"index": "TF1"}, inplace=True) + original_X_corr_df = original_X_corr_df[original_X_corr_df["TF1"] != original_X_corr_df["TF2"]] + original_X_corr_df["TF_TF"] = original_X_corr_df["TF1"] + "_" + original_X_corr_df["TF2"] + # Display the first few rows to verify the format + + if standardized_data: + # Melting the DataFrame into a 3-column edge list + standardized_X_corr_df = standardized_X_corr_mat.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="standardized_corr") + standardized_X_corr_df.rename(columns={"index": "TF1"}, inplace=True) + standardized_X_corr_df = standardized_X_corr_df[standardized_X_corr_df["TF1"] != standardized_X_corr_df["TF2"]] + standardized_X_corr_df["TF_TF"] = standardized_X_corr_df["TF1"] + "_" + standardized_X_corr_df["TF2"] + standardized_X_corr_df.drop(columns = ["TF1", "TF2"], inplace = True) + original_X_corr_df = pd.merge(original_X_corr_df, standardized_X_corr_df).drop(columns = ["TF1", "TF2"]) + + + TF_interaction_df = pd.merge(TF_interaction_df, original_X_corr_df) + + default_edge_w = netrem_model.network.default_edge_weight + if "W_df" in vars(netrem_model.network).keys(): + W_df = netrem_model.network.W_df + else: + W_df = netrem_model.W_df + + ppi_net_df = W_df.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="PPI_score") + 
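+    # Melt the prior-network weight matrix W into a long edge list (one row per TF pair) so each pair's PPI_score can be merged with the coordination scores; pairs whose PPI_score equals the default edge weight are flagged as novel links below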
ppi_net_df.rename(columns={"index": "TF1"}, inplace=True) + + ppi_net_df["novel_link"] = np.where((ppi_net_df.PPI_score == default_edge_w), "yes", "no") + ppi_net_df["TF_TF"] = ppi_net_df["TF1"] + "_" + ppi_net_df["TF2"] + ppi_net_df = ppi_net_df[ppi_net_df["TF1"] != ppi_net_df["TF2"]] # 42849 rows × 3 columns + ppi_net_df = ppi_net_df.drop(columns = ["TF1", "TF2"]) + + TF_interaction_df = pd.merge(TF_interaction_df, ppi_net_df) + TF_interaction_df["absVal_diff_cs_and_originalCorr"] = abs(TF_interaction_df["coord_score_cs"] - TF_interaction_df["standardized_corr"]) + + TF_interaction_df['c_1'] = TF_interaction_df['TF1'].apply(lookup_coef, args=(netrem_model,)) + TF_interaction_df['c_2'] = TF_interaction_df['TF2'].apply(lookup_coef, args=(netrem_model,)) + return TF_interaction_df + + +def min_max_normalize(data, new_min=0.001, new_max=1): + """ + Scale data to a new range [new_min, new_max]. + + Parameters: + - data: array-like, original data. + - new_min: float, the minimum value of the scaled data. + - new_max: float, the maximum value of the scaled data. + + Returns: + - Array of normalized data. + """ + X_min = data.min() + X_max = data.max() + + # Apply the min-max normalization formula adjusted for the new range + normalized_data = new_min + ((data - X_min) * (new_max - new_min)) / (X_max - X_min) + return normalized_data + + +def netrem_info_breakdown_df(netrem_model): + + part1 = netrem_model.E_part_XtX + part1_df = part1.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="part1_XtX/M") + part1_df.rename(columns={"index": "TF1"}, inplace=True) + part1_df = part1_df[part1_df["TF1"] != part1_df["TF2"]] + part1_df["TF1_TF2"] = part1_df["TF1"] + "_" + part1_df["TF2"] + if "W_df" in vars(netrem_model).keys(): + W_part = netrem_model.W_df + else: + W_part = netrem_model.network.W_df + W_part = W_part.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="weight_W") + W_part.rename(columns={"index": "TF1"}, inplace=True) + W_part = W_part[W_part["TF1"] != W_part["TF2"]] + W_part["TF1_TF2"] = W_part["TF1"] + "_" + W_part["TF2"] + W_part.drop(columns = ["TF1", "TF2"], inplace = True) + if "A_df" in vars(netrem_model).keys(): + A_part = netrem_model.A_df + else: + A_part = netrem_model.network.A_df + A_part = A_part.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="A") + A_part.rename(columns={"index": "TF1"}, inplace=True) + A_part = A_part[A_part["TF1"] != A_part["TF2"]] + A_part["TF1_TF2"] = A_part["TF1"] + "_" + A_part["TF2"] + A_part = A_part.drop(columns = ["TF1", "TF2"]) + part2 = netrem_model.E_part_netReg + part2_df = part2.reset_index().melt(id_vars=["index"], var_name="TF2", value_name="part2_betaA") + part2_df.rename(columns={"index": "TF1"}, inplace=True) + part2_df = part2_df[part2_df["TF1"] != part2_df["TF2"]] + part2_df["beta_net"] = netrem_model.beta_net + part2_df["TF1_TF2"] = part2_df["TF1"] + "_" + part2_df["TF2"] + part2_df = part2_df.drop(columns = ["TF1", "TF2"]) + + if "node_degree_df" in vars(netrem_model).keys(): + node_degrees_df = netrem_model.node_degree_df + else: + node_degrees_df = netrem_model.network.node_degree_df + + node_degrees_df = node_degrees_df.reset_index() + + TF1_degree_df = node_degrees_df.rename(columns = {"index":"TF1", "d_i":"deg_TF1"}) + TF2_degree_df = node_degrees_df.rename(columns = {"index":"TF2", "d_i":"deg_TF2"}) + + + main_parts_df = pd.merge(part1_df, W_part) + main_parts_df = pd.merge(main_parts_df, TF1_degree_df) + main_parts_df = pd.merge(main_parts_df, TF2_degree_df) + main_parts_df = 
pd.merge(main_parts_df, A_part) + main_parts_df = pd.merge(main_parts_df, part2_df) + + main_parts_df["B_score"] = main_parts_df["part1_XtX/M"] + main_parts_df["part2_betaA"] + main_parts_df["part1/part2"] = abs(main_parts_df["part1_XtX/M"]/main_parts_df["part2_betaA"]) + main_parts_df["abs_total"] = abs(main_parts_df["part1_XtX/M"]) + abs(main_parts_df["part2_betaA"]) + + main_parts_df["perc_XtX_part"] = abs(main_parts_df["part1_XtX/M"])/(main_parts_df["abs_total"]) * 100.0 + main_parts_df["perc_betaA_part"] = abs(main_parts_df["part2_betaA"])/(main_parts_df["abs_total"]) * 100.0 + main_parts_df = main_parts_df.drop(columns = ["abs_total", "TF1_TF2"]) + main_parts_df["abs_diff_in_perc"] = abs(main_parts_df["perc_betaA_part"] - main_parts_df["perc_XtX_part"]) + main_parts_df["TF1_TF2"] = main_parts_df["TF1"] + "_" + main_parts_df["TF2"] + return main_parts_df + + + +def multiply_frobenius_norm(norm, matrix, ignore_main_diag=False): + """ + Multiply the matrix by its Frobenius norm. If ignore_main_diag is True, + multiply only the off-diagonal elements by the Frobenius norm of the off-diagonal elements. + This function now supports receiving a Pandas DataFrame as the matrix. + + Parameters: + - norm: The norm to multiply the matrix by. + - matrix: The matrix (as a NumPy array or Pandas DataFrame) to be multiplied. + - ignore_main_diag (bool): Determines whether the main diagonal should be ignored. + + Returns: + - The matrix after multiplication, in the same format as the input (NumPy array or DataFrame). + """ + # Convert DataFrame to NumPy array if necessary + if isinstance(matrix, pd.DataFrame): + matrix_np = matrix.values + was_dataframe = True + else: + matrix_np = matrix + was_dataframe = False + + if ignore_main_diag: + # Create a mask for the off-diagonal elements + mask = np.ones_like(matrix_np, dtype=bool) + np.fill_diagonal(mask, False) + # Multiply only the off-diagonal elements by the norm + matrix_np[mask] *= norm + else: + # Multiply the entire matrix by the norm + matrix_np *= norm + + # Convert back to DataFrame if the original input was a DataFrame + if was_dataframe: + return pd.DataFrame(matrix_np, index=matrix.index, columns=matrix.columns) + else: + return matrix_np + + + +def is_standardized(df, tol=1e-4): + """ + Check if the given DataFrame is standardized (mean ~ 0 and std deviation ~ 1) for all columns. + + Parameters: + df (pd.DataFrame): The DataFrame to check. + tol (float): Tolerance for the mean and standard deviation checks. + + Returns: + bool: True if the DataFrame is standardized, False otherwise. 
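+
+    Example (a minimal illustrative sketch; the small demo_df below is hypothetical):
+        >>> demo_df = pd.DataFrame({"TF1": [1.0, 2.0, 3.0], "TF2": [4.0, 6.0, 8.0]})
+        >>> is_standardized(demo_df)   # False: raw columns do not have mean ~ 0 and std ~ 1
+        >>> is_standardized((demo_df - demo_df.mean()) / demo_df.std())   # True after column-wise scaling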
+ """ + # Calculate means and standard deviations for all columns at once + means = df.mean() + stds = df.std() + + # Check if all means are within the tolerance of 0 and all stds are within the tolerance of 1 + return np.all(np.abs(means) < tol) and np.all(np.abs(stds - 1) < tol) + + + +def return_TF_coord_scores_df(netrem_model): + TF_coord_scores_df = ( + netrem_model.TF_interaction_df + .reset_index() + .melt(id_vars='index', var_name='node_2', value_name='coordination_score') + .rename(columns={"index": "node_1"}) + ) + + # Filter out rows where node_1 equals node_2 and calculate absVal_coord_score in one step + TF_coord_scores_df = TF_coord_scores_df[TF_coord_scores_df["node_1"] != TF_coord_scores_df["node_2"]].assign(absVal_coord_score=lambda x: abs(x.coordination_score)) + + # Sort values based on absVal_coord_score without reassigning the dataframe + TF_coord_scores_pairwise_df = TF_coord_scores_df.sort_values(by="absVal_coord_score", ascending=False) + return TF_coord_scores_pairwise_df + + + +def simprem(prior_network, beta_net = 1, alpha_lasso = 0.01, overlapped_nodes_only = False, + y_intercept = False, standardize_X = True, standardize_y = True, center_y = False, view_network = False, + model_type = "Lasso", lasso_selection = "cyclic", all_pos_coefs = False, tolerance = 1e-4, maxit = 10000, + num_jobs = -1, num_cv_folds = 5, lassocv_eps = 1e-3, + lassocv_n_alphas = 100, # default in sklearn + lassocv_alphas = None, # default in sklearn + verbose = False, + hide_warnings = True): + """ + Please note this is :) Simpler NetREm when we do not have prior gene regulatory knowledge and all Target Genes (TGs) + in the cell-type have the same set of N* candidate TFs. + + Please note that to obtain prior_network, ye can directly use: graph.build_prior_network(edge_list = ...) function. 
+ That is, prior_network = graph.build_prior_network() + """ + + os.environ['PYTHONHASHSEED'] = '0' + if hide_warnings: + warnings.filterwarnings("ignore") + default_beta = False + default_alpha = False + if beta_net == 1: + print(":) simprem (no prior knowledge): using beta_net default of", 1) + default_beta = True + if alpha_lasso == 0.01: + if model_type != "LassoCV": + print(":) simprem (no prior knowledge): using alpha_lasso default of", 0.01) + default_alpha = True + + greg_dict = {"network": prior_network, + "model_type": model_type, + "use_network":True, + "standardize_X":standardize_X, + "standardize_y":standardize_y, + "center_y":center_y, + #"gamma_net":gamma_net, + "y_intercept":y_intercept, + "overlapped_nodes_only":overlapped_nodes_only, + "max_lasso_iterations":maxit, + "all_pos_coefs":all_pos_coefs, + "view_network":view_network, + "verbose":verbose} + if default_alpha == False: + greg_dict["alpha_lasso"] = alpha_lasso + if default_beta == False: + greg_dict["beta_net"] = beta_net + if model_type != "Linear": + greg_dict["tolerance"] = tolerance + greg_dict["lasso_selection"] = lasso_selection + if model_type != "Lasso": + greg_dict["num_jobs"] = num_jobs + if model_type == "LassoCV": + greg_dict["num_cv_folds"] = num_cv_folds + greg_dict["lassocv_eps"] = lassocv_eps + greg_dict["lassocv_n_alphas"] = lassocv_n_alphas + greg_dict["lassocv_alphas"] = lassocv_alphas + greggy = NetREmModel(**greg_dict) + return greggy \ No newline at end of file diff --git a/code/PriorGraphNetwork.py b/code/PriorGraphNetwork.py index 29d474d..14ba6a7 100644 --- a/code/PriorGraphNetwork.py +++ b/code/PriorGraphNetwork.py @@ -37,6 +37,7 @@ import essential_functions as ef import error_metrics as em import DemoDataBuilderXandY as demo +import itertools import math @@ -51,6 +52,11 @@ # Utility functions printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +import os +import subprocess + +# Set the PYTHONHASHSEED environment variable +os.environ['PYTHONHASHSEED'] = '0' class PriorGraphNetwork: """:) Please note that this class focuses on incorporating information from a prior network (in our case, @@ -61,9 +67,9 @@ class PriorGraphNetwork: an embedding, find the cosine similarity, and then use the node-node similarity values for our network. 
Ultimately, this class builds the W matrix (for the prior network weights to be used for our network regularization penalty), the D matrix (of degrees), and the V matrix (custom for our approach).""" - + os.environ['PYTHONHASHSEED'] = '0' _parameter_constraints = { - "w_transform_for_d": ["none", "sqrt", "square"], + "w_transform_for_d": ["none", "sqrt", "square", "avg"], "degree_pseudocount": (0, None), "default_edge_weight": (0, None), "threshold_for_degree": (0, None), @@ -71,7 +77,7 @@ class PriorGraphNetwork: "verbose":[True, False]} def __init__(self, **kwargs): # define default values for constants - + self.edge_values_for_degree = False # we instead consider a threshold by default (for counting edges into our degrees) self.consider_self_loops = False # no self loops considered self.verbose = True # printing out statements @@ -121,12 +127,21 @@ def __init__(self, **kwargs): # define default values for constants self.preprocessed_network = True self.gene_expression_nodes.sort() gene_expression_nodes = self.gene_expression_nodes - self.final_nodes = gene_expression_nodes + self.final_nodes = gene_expression_nodes # sorted common_nodes = ef.intersection(self.network_nodes, self.gene_expression_nodes) common_nodes.sort() self.common_nodes = common_nodes self.gex_nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) self.network_nodes_to_remove = list(set(self.network_nodes) - set(self.common_nodes)) + + if len(self.original_edge_list) == 0: # April 23 2024: we ensure if this is empty we add the missing pairwise nodes + pairwise_combinations = list(itertools.combinations(self.final_nodes, 2)) + # Create a DataFrame with columns 'TF1', 'TF2', and 'score' + mini_ppi_df = pd.DataFrame(pairwise_combinations, columns=['TF1', 'TF2']) + # Assign a constant score of 0.01 to each pair + mini_ppi_df['score'] = 0.01 + mini_ppi_edge_list = mini_ppi_df.values.tolist() + self.original_edge_list = mini_ppi_edge_list # filtering the edge_list: self.edge_list = [edge for edge in self.original_edge_list if edge[0] in gene_expression_nodes and edge[1] in gene_expression_nodes] else: @@ -137,19 +152,19 @@ def __init__(self, **kwargs): # define default values for constants self.nodes = self.final_nodes self.N = len(self.tf_names_list) self.V = self.create_V_matrix() - if self.undirected_graph_bool: - self.directed=False - self.undirected_edge_list_to_matrix() - self.W_original = self.W + #if self.undirected_graph_bool: + self.directed=False + self.undirected_edge_list_to_matrix() + self.W_original = self.W #self.edge_df = self.undirected_edge_list_updated().drop_duplicates() - else: - self.directed=True - self.W_original = self.directed_node2vec_similarity(self.edge_list, self.dimensions, - self.walk_length, self.num_walks, - self.p, self.q, self.workers, - self.window, self.min_count, self.batch_words) - self.W = self.generate_symmetric_weight_matrix() - self.W_df = pd.DataFrame(self.W, columns = self.nodes, index = self.nodes) + # else: + # self.directed=True + # self.W_original = self.directed_node2vec_similarity(self.edge_list, self.dimensions, + # self.walk_length, self.num_walks, + # self.p, self.q, self.workers, + # self.window, self.min_count, self.batch_words) + # self.W = self.generate_symmetric_weight_matrix() + # self.W_df = pd.DataFrame(self.W, columns = self.nodes, index = self.nodes) if self.view_network: self.view_W_network = self.view_W_network() else: @@ -160,11 +175,18 @@ def __init__(self, **kwargs): # define default values for constants degree_df = 
pd.DataFrame(self.final_nodes, columns = ["TF"]) degree_df["degree_D"] = self.D * np.ones(self.N) self.inv_sqrt_degree_df = degree_df ######## - self.edge_list_from_W = self.return_W_edge_list() + #self.edge_list_from_W = self.return_W_edge_list() self.A = self.create_A_matrix() - self.A_df = pd.DataFrame(self.A, columns = self.nodes, index = self.nodes, dtype=np.float64) - self.param_lists = self.full_lists() - self.param_df = pd.DataFrame(self.full_lists(), columns = ["parameter", "data type", "description", "value", "class"]) + self.A_df = pd.DataFrame(self.A, columns = self.nodes, index = self.nodes, dtype=np.float32) + # added on 2/5/24 + annotated_D = self.D_df + annotated_D.columns = self.nodes + annotated_D.index = self.nodes + self.D_df = annotated_D + self.node_degree_df = pd.DataFrame(self.degree_vector, index = self.nodes, columns = ["d_i"]) + + #self.param_lists = self.full_lists() + #self.param_df = pd.DataFrame(self.full_lists(), columns = ["parameter", "data type", "description", "value", "class"]) self.node_status_df = self.find_node_status_df() self._apply_parameter_constraints() @@ -203,6 +225,7 @@ def find_node_status_df(self): def network_nodes_from_edge_list(self): edge_list = self.edge_list network_nodes = list({node for edge in edge_list for node in edge[:2]}) + #print(network_nodes) network_nodes.sort() return network_nodes @@ -220,81 +243,397 @@ def _apply_parameter_constraints(self): setattr(self, key, constraints[key][0]) return self - - def create_V_matrix(self): - V = self.N * np.eye(self.N) - np.ones(self.N) - return V - - - - # Optimized functions + # def preprocess_edge_list(self): + # processed_edge_list = [] + # default_edge_weight = self.default_edge_weight + + # for sublst in self.edge_list: + # if len(sublst) == 2: + # processed_edge_list.append(sublst + [default_edge_weight]) + # else: + # processed_edge_list.append(sublst) + + # return processed_edge_list def preprocess_edge_list(self): - processed_edge_list = [] default_edge_weight = self.default_edge_weight + processed_edge_list = [] for sublst in self.edge_list: if len(sublst) == 2: - processed_edge_list.append(sublst + [default_edge_weight]) - else: - processed_edge_list.append(sublst) - + sublst.append(default_edge_weight) + processed_edge_list.append(sublst) + #self.processed_edge_list = processed_edge_list return processed_edge_list - - def undirected_edge_list_to_matrix(self): + + + # def undirected_edge_list_to_matrix(self): + # all_nodes = self.final_nodes + # #default_edge_weight = self.default_edge_weight + # N = len(all_nodes) + # self.N = N + # weight_df = pd.DataFrame(self.edge_list, columns=['source', 'target', 'weight']) + # # self.gex_nodes_to_add + # # Pivot the DataFrame to get a matrix representation + # weight_df = weight_df.pivot(index='source', columns='target', values='weight').fillna(self.default_edge_weight) + # weight_df = weight_df.add(weight_df.T, fill_value=0) + # weight_df = weight_df.reindex(index=all_nodes, columns=all_nodes, fill_value=self.default_edge_weight) + + # #.loc[all_nodes, all_nodes] + # # Make the DataFrame symmetric + # # Convert DataFrame to NumPy array + # W = weight_df.values + # np.fill_diagonal(W, 0) + # np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + # self.W = W + # self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + # return self + + + # def undirected_edge_list_to_matrix(self): + # all_nodes = self.final_nodes + # edge_list = self.preprocess_edge_list() + # #default_edge_weight = 
self.default_edge_weight + # N = len(all_nodes) + # self.N = N + # weight_df = pd.DataFrame(edge_list, columns=['source', 'target', 'weight']) + + # # Pivot the DataFrame to get a matrix representation + # weight_df = weight_df.pivot(index='source', columns='target', values='weight').fillna(0) + # weight_df = weight_df.add(weight_df.T, fill_value=0).loc[all_nodes, all_nodes] + + # # Make the DataFrame symmetric + # #weight_df = weight_df.add(weight_df.T, fill_value=0).sort_index(axis=0).sort_index(axis=1) + # #weight_df = weight_df.fillna(default_edge_weight) + # # Convert DataFrame to NumPy array + # W = weight_df.values + # np.fill_diagonal(W, 0) + # np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + # self.W = W + # self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + # return self + def undirected_edge_list_to_matrix_old(self): all_nodes = self.final_nodes - edge_list = self.preprocess_edge_list() - default_edge_weight = self.default_edge_weight + edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target', 'weight']).fillna(self.default_edge_weight).values + #edge_list = self.preprocess_edge_list() + #default_edge_weight = self.default_edge_weight N = len(all_nodes) self.N = N - weight_df = np.full((N, N), default_edge_weight) + weight_mat = np.full((N, N), self.default_edge_weight) # so even gexpr_nodes added will be included here with defaults # Create a mapping from node to index node_to_idx = {node: idx for idx, node in enumerate(all_nodes)} - for edge in tqdm(edge_list) if self.verbose else edge_list: + for edge in edge_list: try: - source, target, *weight = edge - weight = weight[0] if weight else default_edge_weight - weight = np.nan_to_num(weight, nan=default_edge_weight) + source, target, weight = edge source_idx, target_idx = node_to_idx[source], node_to_idx[target] - weight_df[source_idx, target_idx] = weight - weight_df[target_idx, source_idx] = weight + weight_mat[source_idx, target_idx] = weight + weight_mat[target_idx, source_idx] = weight except ValueError as e: print(f"An error occurred: {e}") continue - np.fill_diagonal(weight_df, 0) - W = weight_df + np.fill_diagonal(weight_mat, 0) + W = weight_mat np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) - if not ef.check_symmetric(W): - print(":( W matrix is NOT symmetric") + self.W = W + self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + return self + + def undirected_edge_list_to_matrix_older(self): + all_nodes = self.final_nodes + try: + edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target', 'weight']).fillna(self.default_edge_weight).values + except: + edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target']) + edge_list["weight"] = self.default_edge_weight + #edge_list = self.preprocess_edge_list() + #default_edge_weight = self.default_edge_weight + N = len(all_nodes) + self.N = N + weight_mat = np.full((N, N), self.default_edge_weight) + # Create a mapping from node to index + node_to_idx = {node: idx for idx, node in enumerate(self.final_nodes)} + + # Preprocess to vectorize edge assignments + edges = np.array([(node_to_idx[source], node_to_idx[target], weight) for source, target, weight in edge_list]) + source_indices, target_indices, weights = edges[:, 0].astype(int), edges[:, 1].astype(int), edges[:, 2] + + # Assign weights using advanced indexing + weight_mat[source_indices, target_indices] = weights + weight_mat[target_indices, source_indices] = weights + + 
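+        # Diagonal convention used throughout this class: zero the diagonal first, then set each diagonal
+        # entry to the mean of that node's off-diagonal edge weights (column sum / (N - 1)). For the default
+        # w_transform_for_d, this diagonal is later rescaled by (N - 1) to recover summed edge weights for node degrees.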
np.fill_diagonal(weight_mat, 0) + W = weight_mat + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + self.W = W + self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + return self + + + def undirected_edge_list_to_matrix(self): + all_nodes = self.final_nodes + edge_list = pd.DataFrame(self.edge_list) + + try: + edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target', 'weight']).fillna(self.default_edge_weight).values + except: + edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target']) + edge_list["weight"] = self.default_edge_weight + edge_list = edge_list.values + + #edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target', 'weight']).fillna(self.default_edge_weight).values + #edge_list = self.preprocess_edge_list() + #default_edge_weight = self.default_edge_weight + N = len(all_nodes) + self.N = N + weight_mat = np.full((N, N), self.default_edge_weight) + # Create a mapping from node to index + node_to_idx = {node: idx for idx, node in enumerate(self.final_nodes)} + + # Preprocess to vectorize edge assignments + edges = np.array([(node_to_idx[source], node_to_idx[target], weight) for source, target, weight in edge_list]) + source_indices, target_indices, weights = edges[:, 0].astype(int), edges[:, 1].astype(int), edges[:, 2] + + # Assign weights using advanced indexing + weight_mat[source_indices, target_indices] = weights + weight_mat[target_indices, source_indices] = weights + + np.fill_diagonal(weight_mat, 0) + W = weight_mat + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) self.W = W self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) return self - def generate_symmetric_weight_matrix(self) -> np.ndarray: - """generate symmetric W matrix. W matrix (Symmetric --> W = W_Transpose). - Note: each diagonal element is the summation of other non-diagonal elements in the same row divided by (N-1) - 2023.02.14_Xiang. TODO: add parameter descriptions""" - W = self.W_original - np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (self.N - 1)) - symmetric_W = ef.check_symmetric(W) - if symmetric_W == False: - print(":( W matrix is NOT symmetric") - return None - return W + def undirected_edge_list_to_matrix_older(self): + all_nodes = self.final_nodes + edge_list = pd.DataFrame(self.edge_list, columns=['source', 'target', 'weight']).fillna(self.default_edge_weight).values + #edge_list = self.preprocess_edge_list() + #default_edge_weight = self.default_edge_weight + N = len(all_nodes) + + self.N = N + weight_mat = np.full((N, N), self.default_edge_weight, dtype=np.float64) + + # Create a mapping from node to index + node_to_idx = {node: idx for idx, node in enumerate(self.final_nodes)} + + # Convert edge_list to indices and weights + indices = np.vectorize(node_to_idx.get)(self.edge_list[:, :2]).astype(int) + weights = edge_list[:, 2].astype(np.float64) + + # Assign weights to the matrix + weight_mat[indices[:, 0], indices[:, 1]] = weights + weight_mat[indices[:, 1], indices[:, 0]] = weights # Make the matrix symmetric + + # Adjust the diagonal if necessary + np.fill_diagonal(weight_mat, 0) + W = weight_mat + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + # please check if W matrix is symmetric or not + # symmetric_W = ef.check_symmetric(W) + # if symmetric_W == False: # defensive programming + # print(":( W matrix is NOT symmetric. 
We will use the max value for each TF-TF pair.") + # W_symmetric = np.maximum(W, W) + # self.unsymmetric_orig_W = W# to keep track of this + # W = W_symmetric + # # np.maximum(W, W.T) computes the element-wise maximum between W and W.T, thus ensuring that for every pair + # # (i,j) and (j,i), the larger value is selected. + + self.W = W + self.W_df = pd.DataFrame(W, columns=self.final_nodes, index=self.final_nodes) + return self + # def undirected_edge_list_to_matrix(self): + # all_nodes = self.final_nodes + # edge_list = self.preprocess_edge_list() + # default_edge_weight = self.default_edge_weight + # N = len(all_nodes) + # self.N = N + # weight_df = np.full((N, N), default_edge_weight) + + # # Create a mapping from node to index + # node_to_idx = {node: idx for idx, node in enumerate(all_nodes)} + + # for edge in tqdm(edge_list) if self.verbose else edge_list: + # try: + # #source, target, weight = edge + # source, target, *weight = edge + # weight = weight[0] if weight else default_edge_weight + # weight = np.nan_to_num(weight, nan=default_edge_weight) + # source_idx, target_idx = node_to_idx[source], node_to_idx[target] + # weight_df[source_idx, target_idx] = weight + # weight_df[target_idx, source_idx] = weight + # except ValueError as e: + # print(f"An error occurred: {e}") + # continue + + # np.fill_diagonal(weight_df, 0) + # W = weight_df + # np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + # # if not ef.check_symmetric(W): + # # print(":( W matrix is NOT symmetric") + + # self.W = W + # self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + # return self + + def create_V_matrix(self): + V = self.N * np.eye(self.N) - np.ones(self.N) + return V - def return_W_edge_list(self): - wMat = ef.view_matrix_as_dataframe(self.W, column_names_list = self.tf_names_list, row_names_list = self.tf_names_list) - w_edgeList = wMat.stack().reset_index() - w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] - w_edgeList = w_edgeList.rename(columns = {"level_0":"source", "level_1":"target", 0:"weight"}) - w_edgeList = w_edgeList.sort_values(by = ["weight"], ascending = False) - return w_edgeList + # def preprocess_edge_listOLD(self): + # default_edge_weight = self.default_edge_weight + # return [sublst if len(sublst) == 3 else sublst + [default_edge_weight] for sublst in self.edge_list] + + # def undirected_edge_list_to_matrixOLD(self): + # edge_list = self.preprocess_edge_list() + # N = len(self.final_nodes) + # weight_matrix = np.full((N, N), np.inf, dtype=float) # Use np.inf to better represent unconnected nodes initially + + # # Create a mapping from node to index + # node_to_idx = {node: idx for idx, node in enumerate(self.final_nodes)} + + # # Efficient matrix update + # for source, target, weight in edge_list: + # idx1, idx2 = node_to_idx[source], node_to_idx[target] + # weight_matrix[idx1, idx2] = weight_matrix[idx2, idx1] = weight + + # np.fill_diagonal(weight_matrix, 0) + # # Adjust diagonal based on average weights, only if necessary + # # np.fill_diagonal(weight_matrix, (weight_matrix.sum(axis=0) - np.diag(weight_matrix)) / (N - 1)) + + # self.W = weight_matrix + # self.W_df = pd.DataFrame(weight_matrix, columns=self.final_nodes, index=self.final_nodes, dtype=np.float64) + + # return self + # # Optimized functions + + # def preprocess_edge_list(self): + # # Assuming edge_list is a list of tuples or lists (source, target, [weight]) + # processed_edge_list = [ + # edge if len(edge) == 3 else (*edge, 
self.default_edge_weight) + # for edge in self.edge_list + # ] + # return processed_edge_list + # # def preprocess_edge_list(self): + # # processed_edge_list = [] + # # default_edge_weight = self.default_edge_weight + + # # for sublst in self.edge_list: + # # if len(sublst) == 2: + # # processed_edge_list.append(sublst + [default_edge_weight]) + # # else: + # # processed_edge_list.append(sublst) + + # # return processed_edge_list + # # def preprocess_edge_list(self): + # # # Convert to numpy array for efficient processing + # # edge_array = np.array(self.edge_list, dtype=object) + + # # # Identify edges without a weight + # # no_weight_mask = np.array([len(edge) == 2 for edge in edge_array]) + + # # # Initialize weights with default value + # # weights = np.full(edge_array.shape[0], self.default_edge_weight, dtype=float) + + # # # Update weights where provided + # # weights[~no_weight_mask] = np.array(edge_array[~no_weight_mask, 2], dtype=float) + + # # # Create processed edge list with weights included + # # processed_edge_list = np.hstack([edge_array[:, :2], weights[:, None]]) + + # # return processed_edge_list + + # def undirected_edge_list_to_matrix(self): + # edge_list = self.preprocess_edge_list() + # N = len(self.final_nodes) + # self.N = N + # weight_matrix = np.full((N, N), self.default_edge_weight, dtype=float) + + # # Create a mapping from node to index + # node_to_idx = {node: idx for idx, node in enumerate(self.final_nodes)} + + # # Efficient matrix update + # for edge in tqdm(edge_list, disable=not self.verbose): + # source_idx = node_to_idx[edge[0]] + # target_idx = node_to_idx[edge[1]] + # weight = edge[2] + # weight_matrix[source_idx, target_idx] = weight_matrix[target_idx, source_idx] = weight + + # np.fill_diagonal(weight_matrix, 0) + # W = weight_matrix + # np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + # # Check for symmetry could be as simple as this, but it's generally unnecessary with this construction + # # as the matrix is built to be symmetric by design. + + # self.W = W + # self.W_df = pd.DataFrame(W, columns=self.final_nodes, index=self.final_nodes, dtype=np.float64) + # return self + # def undirected_edge_list_to_matrix(self): + # all_nodes = self.final_nodes + # edge_list = self.preprocess_edge_list() + # default_edge_weight = self.default_edge_weight + # N = len(all_nodes) + # self.N = N + # weight_df = np.full((N, N), default_edge_weight) + + # # Create a mapping from node to index + # node_to_idx = {node: idx for idx, node in enumerate(all_nodes)} + + # for edge in tqdm(edge_list) if self.verbose else edge_list: + # try: + # source, target, *weight = edge + # weight = weight[0] if weight else default_edge_weight + # weight = np.nan_to_num(weight, nan=default_edge_weight) + # source_idx, target_idx = node_to_idx[source], node_to_idx[target] + # weight_df[source_idx, target_idx] = weight + # weight_df[target_idx, source_idx] = weight + # except ValueError as e: + # print(f"An error occurred: {e}") + # continue + + # np.fill_diagonal(weight_df, 0) + # W = weight_df + # np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + # if not ef.check_symmetric(W): + # print(":( W matrix is NOT symmetric") + + # self.W = W + # self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + # return self + + + # def generate_symmetric_weight_matrix(self) -> np.ndarray: + # """generate symmetric W matrix. W matrix (Symmetric --> W = W_Transpose). 
+ # Note: each diagonal element is the summation of other non-diagonal elements in the same row divided by (N-1) + # 2023.02.14_Xiang. TODO: add parameter descriptions""" + # W = self.W_original + # np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (self.N - 1)) + # symmetric_W = ef.check_symmetric(W) + # if symmetric_W == False: + # print(":( W matrix is NOT symmetric") + # return None + # return W + + + # def return_W_edge_list(self): + # wMat = ef.view_matrix_as_dataframe(self.W, column_names_list = self.tf_names_list, row_names_list = self.tf_names_list) + # w_edgeList = wMat.stack().reset_index() + # w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + # w_edgeList = w_edgeList.rename(columns = {"level_0":"source", "level_1":"target", 0:"weight"}) + # w_edgeList = w_edgeList.sort_values(by = ["weight"], ascending = False) + # return w_edgeList def view_W_network(self): @@ -335,7 +674,11 @@ def generate_degree_vector_from_weight_matrix(self) -> np.ndarray: W_to_use = self.W ** 2 else: W_to_use = self.W - d = W_to_use.diagonal() * (self.N - 1) # summing the edge weights + if self.w_transform_for_d == "avg": # added on 2/8/24 + d = W_to_use.diagonal() + else: + d = W_to_use.diagonal() * (self.N - 1) # summing the edge weights + d = np.copy(d) d += self.pseudocount_for_degree if self.consider_self_loops: d += 1 # we also add in a self-loop :) @@ -358,8 +701,10 @@ def generate_degree_matrix_from_weight_matrix(self): # D matrix where the entries are 1/sqrt(d). Here, d is a vector corresponding to the degree of each matrix""" # we see that the D matrix is higher for nodes that are singletons, a much higher value because it is not connected d = self.degree_vector + d_inv_sqrt = 1 / np.sqrt(d) # D = np.diag(d_inv_sqrt) # full matrix D, only suitable for small scale. Use DiagonalLinearOperator instead. + self.D_df = pd.DataFrame(np.diag(d_inv_sqrt)) D = ef.DiagonalLinearOperator(d_inv_sqrt) return D @@ -372,12 +717,29 @@ def create_A_matrix(self): # A matrix # Please note that this function by Saniya creates the A matrix, which is: # (D_transpose) %*% (V*W) %*% (D) """ + A = self.D @ (self.V * self.W) @ self.D - approxSame = ef.check_symmetric(A) # please see if A is symmetric - if approxSame: + approxSame = ef.check_symmetric(A) # please see if A is symmetric + + if approxSame == False: + if ef.check_symmetric(self.W) == False: # defensive programming + print(":( W matrix is NOT symmetric. We will use the max value for each TF-TF pair.") + W_symmetric = np.maximum(self.W, self.W.T) + self.unsymmetric_orig_W = self.W # to keep track of this + self.W = W_symmetric + # np.maximum(W, W.T) computes the element-wise maximum between W and W.T, thus ensuring that for every pair + # (i,j) and (j,i), the larger value is selected. + A = self.D @ (self.V * self.W) @ self.D # please retry + approxSame = ef.check_symmetric(A) # please see if A is symmetric + + posSemiDef = is_positive_semi_definite(A) + if approxSame and posSemiDef: return A else: - print(f":( False. A is NOT a symmetric matrix.") + if approxSame == False: + print(f":( False. A is NOT a symmetric matrix.") + if posSemiDef == False: + print(f":( False. 
A is NOT positive semi-definite.") print(A) return False @@ -421,14 +783,39 @@ def full_lists(self): return full_lists -def build_prior_network(edge_list, gene_expression_nodes = [], default_edge_weight = 0.1, - degree_threshold = 0.5, - degree_pseudocount = 1e-3, - view_network = True, - verbose = True): - edge_vals_for_d = False +# def build_prior_network(edge_list, gene_expression_nodes = [], default_edge_weight = 0.1, +# degree_threshold = 0.5, edge_vals_for_d = False, w_transform_for_d = "none", +# degree_pseudocount = 1e-3, +# view_network = True, +# verbose = True): + +# self_loops = False + +# prior_graph_dict = {"edge_list": edge_list, +# "gene_expression_nodes":gene_expression_nodes, +# "edge_values_for_degree": edge_vals_for_d, +# "consider_self_loops":self_loops, +# "pseudocount_for_degree":degree_pseudocount, +# "default_edge_weight": default_edge_weight, +# "w_transform_for_d":w_transform_for_d, +# "threshold_for_degree": degree_threshold, +# "view_network": view_network, +# "verbose":verbose} +# if verbose: +# print("building prior network:") +# print("prior graph network used") +# netty = PriorGraphNetwork(**prior_graph_dict) # uses the network to get features like the A matrix. #################### +# return netty + + +def build_prior_network(edge_list, gene_expression_nodes = [], default_edge_weight = 0.01, + degree_threshold = 0.5, edge_vals_for_d = True, w_transform_for_d = "none", + degree_pseudocount = 0, + view_network = False, + verbose = False): + self_loops = False - w_transform_for_d = "none" + prior_graph_dict = {"edge_list": edge_list, "gene_expression_nodes":gene_expression_nodes, "edge_values_for_degree": edge_vals_for_d, @@ -446,6 +833,7 @@ def build_prior_network(edge_list, gene_expression_nodes = [], default_edge_weig return netty + def directed_node2vec_similarity(edge_list: List[Tuple[int, int, float]], dimensions: int = 64, walk_length: int = 30, @@ -544,4 +932,13 @@ def directed_node2vec_similarity(edge_list: List[Tuple[int, int, float]], results_dict["NetREm_edgelist"] = similarity_df.values.tolist() print(results_dict.keys()) - return results_dict \ No newline at end of file + return results_dict + + +def is_positive_semi_definite(matrix, tol=1e-10): + # Calculate the eigenvalues of the matrix + eigenvalues = np.linalg.eigvals(matrix) + # Check if all eigenvalues are non-negative + return np.all(eigenvalues >= -tol) + + diff --git a/code/error_metrics.py b/code/error_metrics.py index d4ca876..a243980 100644 --- a/code/error_metrics.py +++ b/code/error_metrics.py @@ -196,6 +196,7 @@ def psnr_custom_score(y_true, y_pred): psnrVal = psnr(y_true, y_pred) return psnrVal + # Create a custom scorer object using make_scorer mse_custom_scorer = make_scorer(mse_custom_score) nmse_custom_scorer = make_scorer(nmse_custom_score) @@ -223,10 +224,27 @@ def generate_model_metrics_for_baselines_df(X_train, y_train, X_test, y_test, mo selected_row = model_df.iloc[0] selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 model_df = model_df[selected_cols] + # Process the DataFrame df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') - sorted_series = df.abs().squeeze().sort_values(ascending=False) + # Check if df is not empty and has more than one row/column + # Check the structure of df + if isinstance(df, pd.DataFrame): + if df.size > 1: + # If df is a DataFrame with multiple elements + sorted_series = df.abs().squeeze().sort_values(ascending=False) + sorted_df = pd.DataFrame(sorted_series).reset_index() + 
else: + # If df is a single value DataFrame + sorted_df = pd.DataFrame([df.abs().values[0]], columns=['Value']) + elif isinstance(df, pd.Series): + # If df is a Series + sorted_series = df.abs().sort_values(ascending=False) + sorted_df = pd.DataFrame(sorted_series).reset_index() + else: + # If df is neither a DataFrame nor a Series (e.g., a scalar) + sorted_df = pd.DataFrame([df.abs()], columns=['Value']) + # convert the sorted series back to a DataFrame - sorted_df = pd.DataFrame(sorted_series) # add a column for the rank sorted_df['Rank'] = range(1, len(sorted_df) + 1) sorted_df['TF'] = sorted_df.index @@ -259,6 +277,7 @@ def generate_model_metrics_for_baselines_df(X_train, y_train, X_test, y_test, mo sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + sorted_df return sorted_df diff --git a/code/netrem_evaluation_functions.py b/code/netrem_evaluation_functions.py index b99b41d..ff0d723 100644 --- a/code/netrem_evaluation_functions.py +++ b/code/netrem_evaluation_functions.py @@ -30,10 +30,14 @@ from error_metrics import * from DemoDataBuilderXandY import * from PriorGraphNetwork import * -from Netrem_model_builder import * +#from Netrem_model_builder import * from sklearn.linear_model import ElasticNetCV, LinearRegression, LassoCV, RidgeCV from skopt import gp_minimize, space from skopt.utils import use_named_args +#import Netrem_model_builder as nm +from sklearn import linear_model, preprocessing # 9/19 +import NetremmerFinal2024 as nm +from NetremmerFinal2024 import * class BayesianObjective_Lasso: def __init__(self, X, y, cv_folds, model, scorer="mse", print_network=False): @@ -147,90 +151,6 @@ def optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train results_dict["result"] = result return results_dict -# class BayesianObjective_Lasso: -# def __init__(self, X, y, cv_folds, model, scorer = "mse", print_network = False): -# self.X = X -# self.y = y -# self.cv_folds = cv_folds -# model.view_network = print_network -# self.model = model -# self.scorer_obj = 'neg_mean_squared_error' # the default -# if scorer == "mse": -# self.scorer_obj = mse_custom_scorer -# elif scorer == "nmse": -# self.scorer_obj = nmse_custom_scorer -# elif scorer == "snr": -# self.scorer_obj = snr_custom_scorer -# elif scorer == "psnr": -# self.scorer_obj = psnr_custom_scorer - - -# def __call__(self, params): - -# alpha_lasso, beta_network = params -# #network = PriorGraphNetwork(edge_list = edge_list) -# netrem_model = self.model -# #print(netrem_model.get_params()) -# netrem_model.alpha_lasso = alpha_lasso -# netrem_model.beta_network = beta_network -# #netrem_model.view_network = self.view_network -# score = -cross_val_score(netrem_model, self.X, self.y, cv=self.cv_folds, scoring=self.scorer_obj).mean() -# return score - - -# def optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train, -# beta_net_min = 0.001, -# beta_net_max = 10, -# alpha_lasso_min = 0.0001, -# alpha_lasso_max = 0.1, -# num_grid_values = 100, -# gridSearchCV_folds = 5, -# scorer = "mse", -# verbose = False): -# if verbose: -# print(f":) Please note we are running Bayesian optimization (via skopt Python package) for parameter hunting for beta_network and alpha_lasso with model evaluation scorer: {scorer} :)") -# print("we use gp_minimize here for hyperparameter tuning") -# print(f":) Please note this is a start-to-finish optimizer for NetREm (Network 
regression embeddings reveal cell-type protein-protein interactions for gene regulation)") -# from skopt import gp_minimize, space -# model_type = netrem_model.model_type -# # param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), -# # space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] - -# if model_type == "LassoCV": -# print("please note that we can only do this for Lasso model not for LassoCV :(") -# print("Thus, we will alter the model_type to make it Lasso") -# netrem_model.model_type = "Lasso" - -# param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), -# space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] -# objective = BayesianObjective_Lasso(X_train, y_train, cv_folds = gridSearchCV_folds, model = netrem_model, scorer = scorer) - -# # Perform Bayesian optimization -# result = gp_minimize(objective, param_space, n_calls=num_grid_values, random_state=123) -# results_dict = {} -# optimal_model = netrem_model -# if verbose: -# print(":) ######################################################################\n") -# print(f":) Please note the optimal model based on Bayesian optimization found: ") - -# bayesian_alpha = result.x[0] -# bayesian_beta = result.x[1] -# optimal_model.alpha_lasso = bayesian_alpha -# optimal_model.beta_network = bayesian_beta -# results_dict["bayesian_alpha"] = bayesian_alpha -# print(f"alpha_lasso = {bayesian_alpha} ; beta_network = {bayesian_beta}") -# if verbose: -# print(":) ######################################################################\n") -# print("Fitting the model using these optimal hyperparameters for beta_net and alpha_lasso...") -# dict_ex = optimal_model.get_params() -# optimal_model = NetREmModel(**dict_ex) -# optimal_model.fit(X_train, y_train) -# print(optimal_model.get_params()) -# results_dict["optimal_model"] = optimal_model -# results_dict["bayesian_beta"] = bayesian_beta -# results_dict["result"] = result -# return results_dict - def optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_model, X_train, y_train, num_grid_values, num_cv_jobs = -1): beta_max = 0.5 * np.max(np.abs(X_train.T.dot(y_train))) @@ -299,7 +219,7 @@ def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV tfs_for_tg = scgrnom_step2_df[scgrnom_step2_df["TG"] == focus_gene]["TF"].tolist() tfs_for_tg.sort() - tfs_for_tg = intersection(tfs_for_tg, tfs) + tfs_for_tg = ef.intersection(tfs_for_tg, tfs) len(tfs_for_tg) low_TFs_bool = False @@ -361,8 +281,6 @@ def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV tfs_added_list += js_minier[js_minier["TF1"] == tfs_to_use_list[tf_num]].head(3)["TF2"].tolist() tfs_added_list.sort() - - #################################### if verbose: print(len(tfs_added_list)) @@ -408,7 +326,6 @@ def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV verbose = verbose, gene_expression_nodes = key_genes, view_network = view_network) - model_comparison_df1 = pd.DataFrame() model_comparison_df2 = pd.DataFrame() bayes_optimizer_bool = False @@ -488,8 +405,6 @@ def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV model_comparison_df2["approach"] = "gridSearchCV" griddy_optimizer_bool = True - # except: - # print(":( gridsearchCV optimizer is not working") both_approaches_bool = False if bayes_optimizer_bool and griddy_optimizer_bool: combined_model_compare_df = 
pd.concat([model_comparison_df1, model_comparison_df2]) @@ -529,7 +444,7 @@ def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV return combined_model_compare_df -def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, y_intercept, verbose = False): +def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, y_intercept, standardize_X=True, standardize_y = True, center_y=False,verbose = False): if verbose: print(f"{model_name} results :) for fitting y_intercept = {y_intercept}") @@ -542,6 +457,39 @@ def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, regr = LassoCV(cv=5, fit_intercept = y_intercept) elif model_name == "RidgeCV": regr = RidgeCV(cv=5, fit_intercept = y_intercept) + + + if standardize_X: + scaler_x = preprocessing.StandardScaler().fit(X_train) # Fit the scaler to the training data only + + # Transform both the training and test data + X_scaled_train = scaler_x.transform(X_train) + X_train = pd.DataFrame(X_scaled_train, columns=X_train.columns) + + X_scaled_test = scaler_x.transform(X_test) + X_test = pd.DataFrame(X_scaled_test, columns=X_test.columns) + + + if standardize_y: + scaler_y = preprocessing.StandardScaler().fit(y_train) # Fit the scaler to the training data only + + # Transform both the training and test data + y_scaled_train = scaler_y.transform(y_train) + y_train = pd.DataFrame(y_scaled_train, columns=y_train.columns) + + y_scaled_test = scaler_y.transform(y_test) + y_test = pd.DataFrame(y_scaled_test, columns=y_test.columns) + + + if center_y: + mean_y_train = np.mean(y_train) # the average y value + y_train = y_train - mean_y_train + y_test = y_test - mean_y_train + + if tg in X_train.columns: # March 4, 2024 + X_train = X_train.drop(columns = [tg]) + X_test = X_test.drop(columns = [tg]) + regr.fit(X_train, y_train) if model_name in ["RidgeCV", "LinearRegression"]: model_df = pd.DataFrame(regr.coef_) @@ -555,16 +503,14 @@ def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, model_df = model_df[selected_cols] df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') sorted_series = df.abs().squeeze().sort_values(ascending=False) + original_coef_df = df.T.reset_index().rename(columns = {"index":"TF", 0:"coef"}) + # convert the sorted series back to a DataFrame sorted_df = pd.DataFrame(sorted_series) # add a column for the rank sorted_df['Rank'] = range(1, len(sorted_df) + 1) sorted_df['TF'] = sorted_df.index sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) - # tfs = sorted_df["TF"].tolist() - # if tf_name not in tfs: - # sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() - # sorted_df.columns = ["Rank", "TF"] sorted_df["Info"] = model_name if y_intercept: sorted_df["y_intercept"] = "True :)" @@ -586,9 +532,76 @@ def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, sorted_df["test_snr"] = em.snr(y_test.values.flatten(), predY_test) sorted_df["train_psnr"] = em.psnr(y_train.values.flatten(), predY_train) sorted_df["test_psnr"] = em.psnr(y_test.values.flatten(), predY_test) + sorted_df["standardize_X"] = standardize_X + sorted_df["center_y"] = center_y + sorted_df["TG"] = tg sorted_df = sorted_df.reset_index().drop(columns = ["index"]) - sorted_df + sorted_df = pd.merge(sorted_df, original_coef_df) + except: + return pd.DataFrame() + return sorted_df + + +def baseline_metrics_functionOLD(X_train, y_train, X_test, y_test, tg, model_name, y_intercept, verbose = False): + + 
if verbose: + print(f"{model_name} results :) for fitting y_intercept = {y_intercept}") + try: + if model_name == "ElasticNetCV": + regr = ElasticNetCV(cv=5, random_state=0, fit_intercept = y_intercept) + elif model_name == "LinearRegression": + regr = LinearRegression(fit_intercept = y_intercept) + elif model_name == "LassoCV": + regr = LassoCV(cv=5, fit_intercept = y_intercept) + elif model_name == "RidgeCV": + regr = RidgeCV(cv=5, fit_intercept = y_intercept) + regr.fit(X_train, y_train) + if model_name in ["RidgeCV", "LinearRegression"]: + model_df = pd.DataFrame(regr.coef_) + else: + model_df = pd.DataFrame(regr.coef_).transpose() + if verbose: + print(model_df) + model_df.columns = X_train.columns.tolist() + selected_row = model_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + model_df = model_df[selected_cols] + df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') + sorted_series = df.abs().squeeze().sort_values(ascending=False) + original_coef_df = df.T.reset_index().rename(columns = {"index":"TF", 0:"coef"}) + + # convert the sorted series back to a DataFrame + sorted_df = pd.DataFrame(sorted_series) + # add a column for the rank + sorted_df['Rank'] = range(1, len(sorted_df) + 1) + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + sorted_df["Info"] = model_name + if y_intercept: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["final_model_TFs"] = model_df.shape[1] + sorted_df["TFs_input_to_model"] = X_train.shape[1] + sorted_df["original_TFs_in_X"] = X_train.shape[1] + + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + train_mse = em.mse(y_train.values.flatten(), predY_train) + test_mse = em.mse(y_test.values.flatten(), predY_test) + sorted_df["train_mse"] = train_mse + sorted_df["test_mse"] = test_mse + sorted_df["train_nmse"] = em.nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = em.nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = em.snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = em.snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = em.psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = em.psnr(y_test.values.flatten(), predY_test) + sorted_df["TG"] = tg + sorted_df = sorted_df.reset_index().drop(columns = ["index"]) + + sorted_df = pd.merge(sorted_df, original_coef_df) except: return pd.DataFrame() return sorted_df \ No newline at end of file diff --git a/code/previous_version/DemoDataBuilderXandY.py b/code/previous_version/DemoDataBuilderXandY.py new file mode 100644 index 0000000..bb12db7 --- /dev/null +++ b/code/previous_version/DemoDataBuilderXandY.py @@ -0,0 +1,918 @@ +# DemoDataBuilder Class: :) +from packages_needed import * +import pandas as pd +import numpy as np +from tqdm.auto import tqdm +import numpy as np +from sklearn.model_selection import train_test_split +import plotly.express as px +class DemoDataBuilderXandY: + """:) Please note that this class focuses on building Y data based on a normal distribution (specified mean + and standard deviation). M is the # of samples we want to generate. Thus, Y is a vector with M elements. + Then, this class returns X for a set of N predictors (each with M # of samples) based on a list of N correlation + values. 
For instance, if N = 5 predictors (the Transcription Factors (TFs)), we have [X1, X2, X3, X4, X5], + and a respective list of correlation values: [cor(X1, Y), cor(X2, Y), cor(X3, Y), cor(X4, Y), cor(X5, Y)]. + Then, this class will generate X, a matrix of those 5 predictors (based on similar distribution as Y) + with these respective correlations.""" + + _parameter_constraints = { + "test_data_percent": (0, 100), + "mu": (0, None), + "std_dev": (0, None), + "num_iters_to_generate_X": (1, None), + "same_train_test_data": [False, True], + "rng_seed": (0, None), + "randSeed": (0, None), + "ortho_scalar": (1, None), + "orthogonal_X_bool": [True, False], + "view_input_correlations_plot": [False, True], + "num_samples_M": (1, None), + "corrVals": list + } + + def __init__(self, **kwargs): + + # define default values for constants + self.same_train_test_data = False + self.test_data_percent = 30 + self.mu = 0 + self.verbose = True + self.std_dev = 1 + self.num_iters_to_generate_X = 100 + self.rng_seed = 2023 # for Y + self.randSeed = 123 # for X + self.orthogonal_X_bool = True # False adjustment made on 9/20 + self.ortho_scalar = 10 + self.tol = 1e-2 + self.view_input_correlations_plot = False + # reading in user inputs + self.__dict__.update(kwargs) + ##################### other user parameters being loaded and checked + self.same_train_and_test_data_bool = self.same_train_test_data + # check that all required keys are present: + required_keys = ["corrVals", "num_samples_M"] + missing_keys = [key for key in required_keys if key not in self.__dict__] + if missing_keys: + raise ValueError(f":( Please note ye are missing information for these keys: {missing_keys}") + self.M = self.num_samples_M + self.N = self.get_N() + self.y = self.generate_Y() + self.X = self.generate_X() + self.same_train_and_test_data_bool = self.same_train_test_data + if self.same_train_and_test_data_bool: + self.testing_size = 1 + else: + self.testing_size = (self.test_data_percent/100.0) + self.data_sets = self.generate_training_and_testing_data() # [X_train, X_test, y_train, y_test] + self.X_train = self.data_sets[0] + self.X_test = self.data_sets[1] + self.y_train = self.data_sets[2] + self.y_test = self.data_sets[3] + + self.tf_names_list = self.get_tf_names_list() + self.corr_df = self.return_correlations_dataframe() + self.combined_correlations_df = self.get_combined_correlations_df() + if self.view_input_correlations_plot: + self.view_input_correlations = self.view_input_correlations() + self._apply_parameter_constraints() + self.X_train_df = self.view_X_train_df() + self.y_train_df = self.view_y_train_df() + self.X_test_df = self.view_X_test_df() + self.y_test_df = self.view_y_test_df() + self.X_df = self.view_original_X_df() + self.y_df = self.view_original_y_df() + self.combined_train_test_x_and_y_df = self.combine_X_and_y_train_and_test_data() + + def _apply_parameter_constraints(self): + constraints = {**DemoDataBuilderXandY._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif key == "corrVals": # special case for corrVals + if not isinstance(value, list): + setattr(self, key, constraints[key]) + elif value not in 
constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + def get_tf_names_list(self): + tf_names_list = [] + for i in range(0, self.N): + term = "TF" + str(i+1) + tf_names_list.append(term) + return tf_names_list + + # getter method + def get_N(self): + N = len(self.corrVals) + return N + + def get_X_train(self): + return self.data_sets[0] #X_train + + def get_y_train(self): + return self.data_sets[2] # y_train + + def get_X_test(self): + return self.data_sets[1] + + def get_y_test(self): + return self.data_sets[3] + + def view_original_X_df(self): + import pandas as pd + X_df = pd.DataFrame(self.X, columns = self.tf_names_list) + return X_df + + def view_original_y_df(self): + import pandas as pd + y_df = pd.DataFrame(self.y, columns = ["y"]) + return y_df + + def view_X_train_df(self): + import pandas as pd + X_train_df = pd.DataFrame(self.X_train, columns = self.tf_names_list) + return X_train_df + + def view_y_train_df(self): + import pandas as pd + y_train_df = pd.DataFrame(self.y_train, columns = ["y"]) + return y_train_df + + def view_X_test_df(self): + X_test_df = pd.DataFrame(self.X_test, columns = self.tf_names_list) + return X_test_df + + def view_y_test_df(self): + y_test_df = pd.DataFrame(self.y_test, columns = ["y"]) + return y_test_df + + def combine_X_and_y_train_and_test_data(self): + X_p1 = self.X_train_df + X_p1["info"] = "training" + X_p2 = self.X_test_df + X_p2["info"] = "testing" + X_combined = pd.concat([X_p1, X_p2]).drop_duplicates() + y_p1 = self.y_train_df + y_p1["info"] = "training" + y_p2 = self.y_test_df + y_p2["info"] = "testing" + y_combined = pd.concat([y_p1, y_p2]).drop_duplicates() + combining_df = X_combined + combining_df["y"] = y_combined["y"] + return combining_df + + def return_correlations_dataframe(self): + corr_info = ["expected_correlations"] * self.N + corr_df = pd.DataFrame(corr_info, columns = ["info"]) + corr_df["TF"] = self.tf_names_list + corr_df["value"] = self.corrVals + corr_df["data"] = "correlations" + return corr_df + + def generate_Y(self): + seed_val = self.rng_seed + rng = np.random.default_rng(seed=seed_val) + y = rng.normal(self.mu, self.std_dev, self.M) + return y + + # Check if Q is orthogonal using the is_orthogonal function + def is_orthogonal(matrix): + """ + Checks if a given matrix is orthogonal. + Parameters: + matrix (numpy.ndarray): The matrix to check + Returns: + bool: True if the matrix is orthogonal, False otherwise. + """ + # Compute the transpose of the matrix + matrix_T = matrix.T + + # Compute the product of the matrix and its transpose + matrix_matrix_T = np.dot(matrix, matrix_T) + + # Check if the product is equal to the identity matrix + return np.allclose(matrix_matrix_T, np.eye(matrix.shape[0])) + +# # Define the modified generate_X function +# def generate_X(self): +# """Generates a design matrix X with the given correlations while introducing noise and dependencies. +# Parameters: +# orthogonal (bool): Whether to generate an orthogonal matrix (default=False). + +# Returns: +# numpy.ndarray: The design matrix X. 
+# """ +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # len(corrVals) +# numIterations = self.num_iters_to_generate_X +# correlations = self.corrVals +# corrVals = [correlations[0]] + correlations + +# # Step 1: Generate Initial X +# e = np.random.normal(0, 1, (n, numTFs + 1)) +# X = np.copy(e) +# X[:, 0] = y * np.sqrt(1.0 - corrVals[0]**2) / np.sqrt(1.0 - np.corrcoef(y, X[:,0])[0,1]**2) +# for j in range(numIterations): +# for i in range(1, numTFs + 1): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Add Noise +# noise_scale = 0.1 # You can adjust this value +# X += np.random.normal(0, noise_scale, X.shape) + +# # Step 3: Introduce Inter-dependencies +# # Make the second predictor a combination of the first and third predictors +# X[:, 1] += 0.3 * X[:, 0] + 0.7 * X[:, 2] + +# # Step 4: Adjust for Correlations +# for j in range(numIterations): +# for i in range(1, numTFs + 1): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# if orthogonal: +# # Compute the QR decomposition of X and take only the Q matrix +# Q = np.linalg.qr(X)[0] +# Q = scalar * Q +# return Q[:, 1:] +# else: +# # Return the X matrix without orthogonalization +# return X[:, 1:] + +# # # Display the modified function to ensure it looks okay +# # print(generate_X_modified) + +# def generate_X(self): +# """Generates a design matrix X with the given correlations and introduces an interaction term. + +# Returns: +# numpy.ndarray: The design matrix X. +# """ +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Introduce Interaction Term into Y +# interaction_term = X[:, 3] * X[:, 4] +# self.y = y + 0.5 * interaction_term # Adjust the coefficient as needed + +# # Step 3: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# return X + + + + # Define the modified generate_X function to highlight the benefits of network-regularized regression +# def generate_X(self): +# """Generates a design matrix X to highlight the benefits of network-regularized regression. + +# Returns: +# numpy.ndarray: The design matrix X. 
+# """ +# np.random.seed(self.randSeed) +# n = len(self.y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# # Step 2: Weaken X2 and X4 as predictors by introducing interactions in Y +# interaction_term = 0.3 * (X[:, 0] * X[:, 1]) + 0.3 * (X[:, 3] * X[:, 4]) # Interaction terms +# self.y = self.y + interaction_term # Update Y + +# # Step 3: Strengthen network edges by making X1 and X2, and X4 and X5 highly correlated +# X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1] # X1 and X2 +# X[:, 3] = 0.7 * X[:, 4] + 0.3 * X[:, 3] # X4 and X5 + +# # Step 4: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# return X +# def generate_X(self): +# """Generates a design matrix X with the given correlations and introduces specified network edges and interactions. + +# Returns: +# numpy.ndarray: The design matrix X. +# """ +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Weaken X2 and X4 as predictors by introducing interactions in Y +# self.y = y + 0.3 * (X[:, 1] * X[:, 0]) + 0.3 * (X[:, 3] * X[:, 4]) # Adjust the coefficients as needed + +# # Step 3: Strengthen network edges by making X1 and X2, and X4 and X5 highly correlated +# X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1] # X1 and X2 +# X[:, 3] = 0.7 * X[:, 4] + 0.3 * X[:, 3] # X4 and X5 + +# # Step 4: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(self.y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * self.y + +# return X +# def generate_X(self): +# """Generates a design matrix X with given correlations and introduces inter-predictor correlations. + +# Returns: +# numpy.ndarray: The design matrix X. 
+# """ +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N # Number of predictors +# numIterations = self.num_iters_to_generate_X +# corrVals = self.corrVals + +# # Step 1: Generate Initial X based on the specified correlations with Y +# e = np.random.normal(0, 1, (n, numTFs)) +# X = np.copy(e) +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# # Step 2: Introduce Inter-predictor Correlations +# # Make X1 and X2 highly correlated +# X[:, 0] = 0.5 * X[:, 0] + 0.5 * X[:, 1] +# # Make X4 and X5 highly correlated +# X[:, 3] = 0.525 * X[:, 3] + 0.475 * X[:, 4] + +# # Step 3: Re-adjust for specified correlations with Y +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y + +# return X + +# def generate_X(self, tol=1e-4): +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N + +# # Initialize X with standard normal distribution +# X = np.random.normal(0, 1, (n, numTFs)) + +# for i in range(numTFs): +# desired_corr = self.corrVals[i] + +# while True: +# # Create a new predictor as a linear combination of original predictor and y +# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + +# # Standardize the predictor to have mean 0 and variance 1 +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# # Calculate the actual correlation +# actual_corr = np.corrcoef(y, X[:, i])[0, 1] + +# # Calculate the difference between the actual and desired correlations +# diff = abs(actual_corr - desired_corr) + +# if diff < tol: +# break + +# # Orthogonalize the predictors to make them independent of each other +# Q, _ = np.linalg.qr(X) + +# if orthogonal: +# # Scale the orthogonalized predictors +# Q = scalar * Q +# return Q +# else: +# # Return the orthogonalized predictors without scaling +# return Q + +# def generate_X(self): +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N +# tol = self.tol + +# # Initialize X with standard normal distribution (vectorized) +# X = np.random.normal(0, 1, (n, numTFs)) + +# # Standardize y for correlation calculation +# y_std = (y - np.mean(y)) / np.std(y) + +# for i in tqdm(range(numTFs), desc="Generating predictors"): +# desired_corr = self.corrVals[i] + +# while True: +# # Orthogonalize Xi against all previous predictors +# for j in range(i): +# coef = np.dot(X[:, i], X[:, j]) / np.dot(X[:, j], X[:, j]) +# X[:, i] -= coef * X[:, j] + +# # Create and standardize new predictor (vectorized) +# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# # Calculate actual correlation (vectorized) +# actual_corr = np.dot(y_std, X[:, i]) / n + +# # Check if actual correlation is close enough to desired correlation +# if abs(actual_corr - desired_corr) < tol: +# break + +# # Orthogonalize X to reduce inter-predictor correlation (if required) +# if self.orthogonal_X_bool: +# X, _ = np.linalg.qr(X) + +# return X + + def generate_X(self): + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + + # Initialize X with standard normal distribution (vectorized) + X = np.random.normal(0, 1, (n, numTFs)) + + # Standardize y for correlation calculation + y_std = (y - np.mean(y)) / np.std(y) + + for i in 
tqdm(range(numTFs), desc="Generating predictors"): + desired_corr = self.corrVals[i] + + while True: + # Create and standardize new predictor (vectorized) + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate actual correlation (vectorized) + actual_corr = np.dot(y_std, X[:, i]) / n + + # Check if actual correlation is close enough to desired correlation + if abs(actual_corr - desired_corr) < tol: + break + + # Orthogonalize X to reduce inter-predictor correlation (if required) + if self.orthogonal_X_bool: + X, _ = np.linalg.qr(X) + + return X + def generate_X7(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + + desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " + for i in tqdm(range(numTFs), desc=desc_name): + desired_corr = self.corrVals[i] + + while True: + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + # Step 2: Orthogonalize the predictors to remove inter-predictor correlation + X_ortho, _ = np.linalg.qr(X) + + # Step 3: Scale each orthogonalized predictor to match the desired correlation with y + for i in tqdm(range(numTFs), desc="Rescaling orthogonalized predictors"): + desired_corr = self.corrVals[i] + + while True: + # Scale the orthogonalized predictor + X_ortho[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X_ortho[:, i] + + # Standardize the predictor + X_ortho[:, i] = (X_ortho[:, i] - np.mean(X_ortho[:, i])) / np.std(X_ortho[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X_ortho[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X_ortho)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X_ortho + + + def generate_X5(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + jitter = 0.05 # Noise level to reduce correlation between predictors + + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + + desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " + for i in tqdm(range(numTFs), desc=desc_name): + desired_corr = self.corrVals[i] + + while True: + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + + # Add a small amount of noise to reduce correlation with other predictors + X[:, i] += jitter * np.random.normal(0, 1, n) + + # Standardize the predictor to have mean 0 and variance 1 + X[:, 
i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X + + def generate_X3(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + tol = self.tol + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + desc_name = "Generating data for " + str(numTFs) + " Predictors with tolerance of " + str(tol) + " :) " + for i in tqdm(range(numTFs), desc=desc_name): + desired_corr = self.corrVals[i] + + while True: + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + # Calculate the actual correlation + actual_corr = np.corrcoef(y, X[:, i])[0, 1] + + # Calculate the difference between the actual and desired correlations + diff = abs(actual_corr - desired_corr) + + if diff < tol: + break + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X + + # Define the function for generating synthetic data with specific correlations and standard normal predictors + def generate_X1(self): + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N + + # Initialize X with standard normal distribution + X = np.random.normal(0, 1, (n, numTFs)) + + # Adjust X to achieve the desired correlations with y + for i in range(numTFs): + corr = self.corrVals[i] + # Create a new predictor as a linear combination of original predictor and y + X[:, i] = corr * y + np.sqrt(1 - corr ** 2) * X[:, i] + + # Standardize the predictor to have mean 0 and variance 1 + X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q + else: + # Return the X matrix without orthogonalization + return X +# def generate_X(self): +# orthogonal = self.orthogonal_X_bool +# scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N +# numIterations = self.num_iters_to_generate_X +# correlations = self.corrVals +# corrVals = [correlations[0]] + correlations + +# # Initialize X with standard normal distribution +# X = np.random.normal(0, 1, (n, numTFs)) + +# for j in range(numIterations): +# for i in range(numTFs): +# corr = np.corrcoef(y, X[:, i])[0, 1] +# X[:, i] = X[:, i] + (corrVals[i] - corr) * y +# # Standardize the predictor +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# if orthogonal: +# # Compute the QR decomposition of X and take only the Q matrix +# Q = np.linalg.qr(X)[0] +# Q = scalar * Q +# return Q +# else: +# # Return the X matrix without orthogonalization +# return X + + +# def generate_X(self): +# orthogonal = self.orthogonal_X_bool +# 
scalar = self.ortho_scalar +# np.random.seed(self.randSeed) +# y = self.y +# n = len(y) +# numTFs = self.N +# tol=self.tol +# # Initialize X with standard normal distribution +# X = np.random.normal(0, 1, (n, numTFs)) +# numIterations = self.num_iters_to_generate_X +# for iter_count in range(numIterations): +# max_diff = 0 # Initialize maximum difference between actual and desired correlations for this iteration +# for i in range(numTFs): +# desired_corr = self.corrVals[i] + +# # Create a new predictor as a linear combination of original predictor and y +# X[:, i] = desired_corr * y + np.sqrt(1 - desired_corr ** 2) * X[:, i] + +# # Standardize the predictor to have mean 0 and variance 1 +# X[:, i] = (X[:, i] - np.mean(X[:, i])) / np.std(X[:, i]) + +# # Calculate the actual correlation +# actual_corr = np.corrcoef(y, X[:, i])[0, 1] + +# # Calculate the difference between the actual and desired correlations +# diff = abs(actual_corr - desired_corr) +# max_diff = max(max_diff, diff) + +# # If the maximum difference between actual and desired correlations is below the tolerance, break the loop +# if max_diff < tol: +# break + +# if orthogonal: +# # Compute the QR decomposition of X and take only the Q matrix +# Q = np.linalg.qr(X)[0] +# Q = scalar * Q +# return Q +# else: +# # Return the X matrix without orthogonalization +# return X + + def generate_X_old(self): + """Generates a design matrix X with the given correlations. + Parameters: + orthogonal (bool): Whether to generate an orthogonal matrix (default=False). + + Returns: + numpy.ndarray: The design matrix X. + """ + orthogonal = self.orthogonal_X_bool + scalar = self.ortho_scalar + np.random.seed(self.randSeed) + y = self.y + n = len(y) + numTFs = self.N # len(corrVals) + numIterations = self.num_iters_to_generate_X + correlations = self.corrVals + corrVals = [correlations[0]] + correlations + e = np.random.normal(0, 1, (n, numTFs + 1)) + X = np.copy(e) + X[:, 0] = y * np.sqrt(1.0 - corrVals[0]**2) / np.sqrt(1.0 - np.corrcoef(y, X[:,0])[0,1]**2) + for j in range(numIterations): + for i in range(1, numTFs + 1): + corr = np.corrcoef(y, X[:, i])[0, 1] + X[:, i] = X[:, i] + (corrVals[i] - corr) * y + + if orthogonal: + # Compute the QR decomposition of X and take only the Q matrix + Q = np.linalg.qr(X)[0] + Q = scalar * Q + return Q[:, 1:] + else: + # Return the X matrix without orthogonalization + return X[:, 1:] + + + def generate_training_and_testing_data(self): + same_train_and_test_data_bool = self.same_train_and_test_data_bool + X = self.X + y = self.y + if same_train_and_test_data_bool == False: # different training and testing datasets + X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = self.testing_size) + if self.verbose: + print(f"Please note that since we hold out {self.testing_size * 100.0}% of our {self.M} samples for testing, we have:") + print(f"X_train = {X_train.shape[0]} rows (samples) and {X_train.shape[1]} columns (N = {self.N} predictors) for training.") + print(f"X_test = {X_test.shape[0]} rows (samples) and {X_test.shape[1]} columns (N = {self.N} predictors) for testing.") + print(f"y_train = {y_train.shape[0]} corresponding rows (samples) for training.") + print(f"y_test = {y_test.shape[0]} corresponding rows (samples) for testing.") + else: # training and testing datasets are the same :) + X_train, X_test, y_train, y_test = X, X, y, y + y_train = y + y_test = y_train + X_test = X_train + if self.verbose: + print(f"Please note that since we use the same data for training and for testing :) of our 
{self.M} samples. Thus, we have:") + print(f"X_train = X_test = {X_train.shape[0]} rows (samples) and {X_train.shape[1]} columns (N = {self.N} predictors) for training and for testing") + print(f"y_train = y_test = {y_train.shape[0]} corresponding rows (samples) for training and for testing.") + return [X_train, X_test, y_train, y_test] + + + def get_combined_correlations_df(self): + combined_correlations_df = self.actual_vs_expected_corrs_DefensiveProgramming_all_groups(self.X, self.y, + self.X_train, + self.y_train, + self.X_test, + self.y_test, + self.corrVals, + self.tf_names_list, + self.same_train_and_test_data_bool) + return combined_correlations_df + + def actual_vs_expected_corrs_DefensiveProgramming_all_groups(self, X, y, X_train, y_train, X_test, y_test, + corrVals, tf_names_list, + same_train_and_test_data_bool): + overall_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X, y, corrVals, + tf_names_list, same_train_and_test_data_bool, "Overall") + training_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X_train, y_train, corrVals, + tf_names_list, same_train_and_test_data_bool, "Training") + testing_corrs_df = self.compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(X_test, y_test, corrVals, + tf_names_list, same_train_and_test_data_bool, "Testing") + combined_correlations_df = pd.concat([overall_corrs_df, training_corrs_df, testing_corrs_df]).drop_duplicates() + return combined_correlations_df + + def compare_actual_and_expected_correlations_DefensiveProgramming_one_data_group(self, X_matrix, y, corrVals, + predictor_names_list, + same_train_and_test_data_boolean, + data_type): + # please note that this function by Saniya ensures that the actual and expected correlations are close + # so that the simulation has the x-y correlations we were hoping for in corrVals + updatedDF = pd.DataFrame(X_matrix)#.shape + actualCorrsList = [] + for i in tqdm(range(0, len(corrVals))): + expectedCor = corrVals[i] + actualCor = np.corrcoef(updatedDF[i], y)[0][1] + difference = abs(expectedCor - actualCor) + predictor_name = predictor_names_list[i] + actualCorrsList.append([i, predictor_name, expectedCor, actualCor, difference]) + comparisonDF = pd.DataFrame(actualCorrsList, columns = ["i", "predictor", "expected_corr_with_Y", "actual_corr", "difference"]) + comparisonDF["X_group"] = data_type + num_samples = X_matrix.shape[0] + if same_train_and_test_data_boolean: + comparisonDF["num_samples"] = "same " + str(num_samples) + else: + comparisonDF["num_samples"] = "unique " + str(num_samples) + return comparisonDF + + # Visualizing Functions :) + def view_input_correlations(self): + corr_val_df = pd.DataFrame(self.corrVals, columns = ["correlation"])#.transpose() + corr_val_df.index = self.tf_names_list + corr_val_df["TF"] = self.tf_names_list + fig = px.bar(corr_val_df, x='TF', y='correlation', title = "Input Correlations for Dummy Example", barmode='group') + fig.show() + return fig + + + def view_train_vs_test_data_for_predictor(self, predictor_name): + combined_train_test_x_and_y_df = self.combined_train_test_x_and_y_df + combined_correlations_df = self.combined_correlations_df + print(combined_correlations_df[combined_correlations_df["predictor"] == predictor_name][["predictor", "actual_corr", "X_group", "num_samples"]]) + title_name = title = "Training Versus Testing Data Points for Predictor: " + predictor_name + fig = px.scatter(combined_train_test_x_and_y_df, x=predictor_name, 
y="y", color = "info", + title = title_name) + #fig.show() + return fig + + +def generate_dummy_data(corrVals, + num_samples_M = 10000, + train_data_percent = 70, + mu = 0, + std_dev = 1, + iters_to_generate_X = 100, + orthogonal_X = False, + ortho_scalar = 10, + view_input_corrs_plot = False, + verbose = True, rand_seed_x = 123, rand_seed_y = 2023): + + # the defaults + same_train_test_data = False + test_data_percent = 100 - train_data_percent + if train_data_percent == 100: # since all of the data is used for training, + # then the training and testing data will be the same :) + same_train_test_data = True + test_data_percent = 100 + print(f":) same_train_test_data = {same_train_test_data}") + demo_dict = { + "test_data_percent": 100 - train_data_percent, + "mu": mu, "std_dev": std_dev, + "num_iters_to_generate_X": iters_to_generate_X, + "same_train_test_data": same_train_test_data, + "rng_seed": rand_seed_y, #2023, # for Y + "randSeed": rand_seed_x, #123, # for X + "ortho_scalar": ortho_scalar, + "orthogonal_X_bool": orthogonal_X, + "view_input_correlations_plot": view_input_corrs_plot, + "num_samples_M": num_samples_M, + "corrVals": corrVals, "verbose":verbose} + dummy_data = DemoDataBuilderXandY(**demo_dict) # + return dummy_data diff --git a/code/previous_version/Netrem_model_builder.py b/code/previous_version/Netrem_model_builder.py new file mode 100644 index 0000000..1e13f6d --- /dev/null +++ b/code/previous_version/Netrem_model_builder.py @@ -0,0 +1,1037 @@ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. +import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model, preprocessing # 9/19 +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 +# from packages_needed import * +import essential_functions as ef +import error_metrics as em # why to do import +#import Netrem_model_builder as nm +import DemoDataBuilderXandY as demo +import PriorGraphNetwork as graph +import netrem_evaluation_functions as nm_eval +import matplotlib.pyplot as plt +import pandas as pd +import numpy as np +import networkx as nx +from sklearn.linear_model import LinearRegression, Lasso, LassoCV +from tqdm.auto import tqdm +import copy +""" +Optimization for +(1 / (2 * M)) * ||y - Xc||^2_2 + (beta / (2 * N^2)) * c'Ac + alpha * ||c||_1 +Which is converted to lasso +(1 / (2 * M)) * ||y_tilde - X_tilde @ c||^2_2 + alpha * ||c||_1 +where M = n_samples and N is the dimension of c. 
+Check compute_X_tilde_y_tilde() to see how we make sure above normalization is applied using Lasso of sklearn +""" + +class NetREmModel(BaseEstimator, RegressorMixin): + """ :) Please note that this class focuses on building a Gene Regulatory Network (GRN) from gene expression data for Transcription Factors (TFs), gene expression data for the target gene (TG), and a prior biological network (W). This class performs Network-penalized regression :) """ + _parameter_constraints = { + "alpha_lasso": (0, None), + "beta_net": (0, None), + "num_cv_folds": (0, None), + "y_intercept": [False, True], + "use_network": [True, False], + "max_lasso_iterations": (1, None), + "model_type": ["Lasso", "LassoCV", "Linear"], + "tolerance": (0, None), + "num_jobs": (1, 1e10), + "lasso_selection": ["cyclic", "random"], + "lassocv_eps": (0, None), + "lassocv_n_alphas": (1, None), + "standardize_X": [True, False], + "center_y": [True, False] + } + + def __init__(self, **kwargs): + self.info = "NetREm Model" + self.verbose = False + self.overlapped_nodes_only = False # restrict the nodes to only being those found in the network? overlapped_nodes_only + self.num_cv_folds = 5 # for cross-validation models + self.num_jobs = -1 # for LassoCV or LinearRegression (here, -1 is the max possible for CPU) + self.all_pos_coefs = False # for coefficients + self.model_type = "Lasso" + self.standardize_X = True + self.center_y = True + self.use_network = True + self.y_intercept = False + self.max_lasso_iterations = 10000 + self.view_network = False + self.model_info = "unfitted_model :(" + self.target_gene_y = "Unknown :(" + self.tolerance = 1e-4 + self.lasso_selection = "cyclic" # default in sklearn + self.lassocv_eps = 1e-3 # default in sklearn + self.lassocv_n_alphas = 100 # default in sklearn + self.lassocv_alphas = None # default in sklearn + self.beta_net = kwargs.get('beta_net', 1) + self.__dict__.update(kwargs) + required_keys = ["network", "beta_net"] + if self.model_type == "Lasso": + self.alpha_lasso = kwargs.get('alpha_lasso', 0.01) + self.optimal_alpha = "User-specified optimal alpha lasso: " + str(self.alpha_lasso) + required_keys += ["alpha_lasso"] + elif self.model_type == "LassoCV": + self.alpha_lasso = "LassoCV finds optimal alpha" + self.optimal_alpha = "Since LassoCV is model_type, please fit model using X and y data to find optimal_alpha." 
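+            # Note: for LassoCV, the concrete optimal alpha is only known once fit(X, y)
+            # has run; fit() then reads it from the underlying sklearn LassoCV model
+            # (regr.alpha_) and records it in self.final_alpha and self.optimal_alpha.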
+ else: # model_type == "Linear": + self.alpha_lasso = "No alpha needed" + self.optimal_alpha = "No alpha needed" # + missing_keys = [key for key in required_keys if key not in self.__dict__] # check that all required keys are present: + if missing_keys: + raise ValueError(f":( Please note ye are missing information for these keys: {missing_keys}") + if self.use_network: + prior_network = self.network + self.prior_network = prior_network + self.preprocessed_network = prior_network.preprocessed_network + self.network_params = prior_network.param_lists + self.network_nodes_list = prior_network.final_nodes # tf_names_list + self.kwargs = kwargs + self._apply_parameter_constraints() # ensuring that the parameter constraints are met + + + def __repr__(self): + args = [f"{k}={v}" for k, v in self.__dict__.items() if k != 'param_grid' and k in self.kwargs] + return f"{self.__class__.__name__}({', '.join(args)})" + + + def check_overlaps_work(self): + final_set = set(self.final_nodes) + network_set = set(self.network_nodes_list) + return network_set != final_set + + + def standardize_X_data(self, X_df): # if the user opts to + """ :) If the user opts to standardize the X data (so that predictors have a mean of 0 + and a standard deviation of 1), then this method will be run, which uses the preprocessing + package StandardScalar() functionality. """ + if self.standardize_X: + # Transform both the training and test data + X_scaled = self.scaler.transform(X_df) + X_scaled_df = pd.DataFrame(X_scaled, columns=X_df.columns) + return X_scaled_df + else: + return X_df + + def center_y_data(self, y_df): # if the user opts to + """ :) If the user opts to center the response y data: + subtracting its mean from each observation.""" + if self.center_y: + # Center the response + y_train_centered = y_df - self.mean_y_train + return y_train_centered + else: + return y_df + + def updating_network_and_X_during_fitting(self, X, y): + # updated one :) + """ Update the prior network information and the + X input data (training) during the fitting of the model. It determines if the common predictors + should be used (based on if overlapped_nodes_only is True) or if all of the X input data should be used. """ + X_df = X.sort_index(axis=1) # sorting the X dataframe by columns. (rows are samples) + + #X_df = X.sort_index(axis=0).sort_index(axis=1) # sorting the X dataframe by rows and columns. 
+ #self.X_df = X_df + self.target_gene_y = y.columns[0] + + if self.standardize_X: # we will standardize X then + if self.verbose: + print(":) Standardizing the X data") + self.old_X_df = X_df + self.scaler = preprocessing.StandardScaler().fit(X_df) # Fit the scaler to the training data only + # this self.scalar will be utilized for the testing data to prevent data leakage and to ensure generalization :) + self.X_df = self.standardize_X_data(X_df) + X = self.X_df # overwriting and updating the X df + else: + self.X_df = X_df + + self.mean_y_train = np.mean(y) # the average y value + if self.center_y: # we will center y then + if self.verbose: + print(":) centering the y data") + # Assuming y_train and y_test are your training and test labels + self.old_y = y + y = self.center_y_data(y) + + #gene_expression_nodes = X_df.columns.tolist() # these are already sorted + tg_name = y.columns.tolist()[0] + if tg_name in X_df.columns.tolist(): + X_df = X_df.drop(columns = [tg_name]) + + #gene_expression_nodes = list(set(X_df.columns.tolist()) - tg_name) # these are already sorted + gene_expression_nodes = sorted(X_df.columns.tolist()) # these will be sorted + ppi_net_nodes = set(self.network_nodes_list) # set(self.network_nodes_list) - tg_name + common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) + + if not common_nodes: # may be possible that the X dataframe needs to be transposed if provided incorrectly + print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") + X_df = X_df.transpose() + gene_expression_nodes = sorted(X_df.columns.tolist()) + common_nodes = list(ppi_net_nodes.intersection(gene_expression_nodes)) + + self.gene_expression_nodes = gene_expression_nodes + self.common_nodes = sorted(common_nodes) + gene_expression_nodes = sorted(gene_expression_nodes) # 10/22 + self.final_nodes = gene_expression_nodes + if self.overlapped_nodes_only: + self.final_nodes = common_nodes + elif self.preprocessed_network: + self.final_nodes = self.prior_network.final_nodes + else: + self.final_nodes = gene_expression_nodes + self.final_nodes = sorted(self.final_nodes) # 10/22 + final_nodes_set = set(self.final_nodes) + ppi_nodes_to_remove = list(ppi_net_nodes - final_nodes_set) + self.gexpr_nodes_added = list(set(gene_expression_nodes) - final_nodes_set) + self.gexpr_nodes_to_add_for_net = list(set(gene_expression_nodes) - set(common_nodes)) + + if self.verbose: + if ppi_nodes_to_remove: + print(f"Please note that we remove {len(ppi_nodes_to_remove)} nodes found in the input network that are not found in the input gene expression data (X) :)") + print(ppi_nodes_to_remove) + else: + print(f":) Please note that all {len(common_nodes)} nodes found in the network are also found in the input gene expression data (X) :)") + + filter_network_bool = self.filter_network_bool = self.check_overlaps_work() #self.check_overlaps_work(X_df) + if filter_network_bool: + print("Please note that we need to update the network information") + self.updating_network_A_matrix_given_X() # updating the A matrix given the gene expression data X + if self.view_network: + ef.draw_arrow() + self.view_W_network = self.view_W_network() + else: + self.A_df = self.network.A_df + self.A = self.network.A + self.nodes = self.A_df.columns.tolist() + + self.network_params = self.prior_network.param_lists + self.network_info = "fitted_network" + self.M = y.shape[0] + self.N = len(self.final_nodes) # pre-processing: + self.X_train = self.preprocess_X_df(X) + 
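+        # Note: preprocess_X_df (above) filters and reorders the X columns to match
+        # self.final_nodes -- the same ordering used for the A matrix -- and returns a
+        # NumPy array; preprocess_y_df (below) flattens y to a 1-D NumPy array.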
self.y_train = self.preprocess_y_df(y) + return self + + + def organize_B_interaction_list(self): # TF-TF interactions to output :) + self.B_train = self.compute_B_matrix(self.X_train) + self.B_interaction_df = pd.DataFrame(self.B_train, index = self.final_nodes, columns = self.final_nodes) + return self + + + def fit(self, X, y): # fits a model Function used for model training + self.updating_network_and_X_during_fitting(X, y) + self.organize_B_interaction_list() + self.B_train_times_M = self.compute_B_matrix_times_M(self.X_train) + self.X_tilda_train, self.y_tilda_train = self.compute_X_tilde_y_tilde(self.B_train_times_M, self.X_train, + self.y_train) + self.X_training_to_use, self.y_training_to_use = self.X_tilda_train, self.y_tilda_train + self.regr = self.return_fit_ml_model(self.X_training_to_use, self.y_training_to_use) + ml_model = self.regr + self.final_alpha = self.alpha_lasso + if self.model_type == "LassoCV": + self.final_alpha = ml_model.alpha_ + self.optimal_alpha = "Cross-Validation optimal alpha lasso: " + str(self.final_alpha) + if self.verbose: + print(self.optimal_alpha) + self.coef = ml_model.coef_ # Please Get the coefficients + self.coef[self.coef == -0.0] = 0 + if self.y_intercept: + self.intercept = ml_model.intercept_ + self.predY_tilda_train = ml_model.predict(self.X_training_to_use) # training data + self.mse_tilda_train = self.calculate_mean_square_error(self.y_training_to_use, self.predY_tilda_train) # Calculate MSE + self.predY_train = ml_model.predict(self.X_train) # training data + self.mse_train = self.calculate_mean_square_error(self.y_train, self.predY_train) # Calculate MSE + if self.y_intercept: + coeff_terms = [self.intercept] + list(self.coef) + index_names = ["y_intercept"] + self.nodes + self.model_coef_df = pd.DataFrame(coeff_terms, index = index_names).transpose() + else: + coeff_terms = ["None"] + list(self.coef) + index_names = ["y_intercept"] + self.nodes + self.model_coef_df = pd.DataFrame(coeff_terms, index = index_names).transpose() + self.model_info = "fitted_model :)" + selected_row = self.model_coef_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + if len(selected_cols) == 0: + self.model_nonzero_coef_df = None + self.num_final_predictors = 0 + else: + self.model_nonzero_coef_df = self.model_coef_df[selected_cols] + if len(selected_cols) > 1: # and self.model_type != "Linear": + self.netrem_model_predictor_results(y) + self.num_final_predictors = len(selected_cols) + if "y_intercept" in selected_cols: + self.num_final_predictors = self.num_final_predictors - 1 + return self + + + def netrem_model_predictor_results(self, y): # olders + """ :) Please note that this function by Saniya works on a netrem model and returns information about the predictors + such as their Pearson correlations with y, their rankings as well. 
+ It returns: sorted_df, final_corr_vs_coef_df, combined_df """ + abs_df = self.model_nonzero_coef_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce').abs() + if abs_df.shape[0] == 1: + abs_df = pd.DataFrame([abs_df.squeeze()]) + sorted_series = abs_df.squeeze().sort_values(ascending=False) + sorted_df = pd.DataFrame(sorted_series) # convert the sorted series back to a DataFrame + sorted_df['Rank'] = range(1, len(sorted_df) + 1) # add a column for the rank + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + self.sorted_coef_df = sorted_df # print the sorted DataFrame + tg = y.columns.tolist()[0] + corr = pd.DataFrame(self.X_df.corrwith(y[tg])).transpose() + corr["info"] = "corr (r) with y: " + tg + all_df = self.model_coef_df + all_df = all_df.iloc[:, 1:] + all_df["info"] = "network regression coeff. with y: " + tg + all_df = pd.concat([all_df, corr]) + all_df["input_data"] = "X_train" + sorting = self.sorted_coef_df[["Rank"]].transpose().drop(columns = ["y_intercept"]) + sorting = sorting.reset_index().drop(columns = ["index"]) + sorting["info"] = "Absolute Value NetREm Coefficient Ranking" + sorting["input_data"] = "X_train" + all_df = pd.concat([all_df, sorting]) + self.corr_vs_coef_df = all_df + self.final_corr_vs_coef_df = self.corr_vs_coef_df[["info", "input_data"] + self.model_nonzero_coef_df.columns.tolist()[1:]] + + netrem_model_df = self.model_nonzero_coef_df.transpose() + netrem_model_df.columns = ["coef"] + netrem_model_df["TF"] = netrem_model_df.index.tolist() + netrem_model_df["TG"] = tg + if self.y_intercept: + netrem_model_df["info"] = "netrem_with_intercept" + else: + netrem_model_df["info"] = "netrem_no_intercept" + netrem_model_df["train_mse"] = self.mse_train + if self.model_type != "Linear": + netrem_model_df["beta_net"] = self.beta_net + if self.model_type == "LassoCV": + netrem_model_df["alpha_lassoCV"] = self.optimal_alpha + else: + netrem_model_df["alpha_lasso"] = self.alpha_lasso + if netrem_model_df.shape[0] > 1: + self.combined_df = pd.merge(netrem_model_df, self.sorted_coef_df) + self.combined_df["final_model_TFs"] = max(self.sorted_coef_df["Rank"]) - 1 + else: + self.combined_df = netrem_model_df + self.combined_df["TFs_input_to_model"] = len(self.final_nodes) + self.combined_df["original_TFs_in_X"] = len(self.gene_expression_nodes) + self.combined_df["standardized_X"] = self.standardize_X + self.combined_df["centered_y"] = self.center_y + return self + + def view_W_network(self): + roundedW = np.round(self.W, decimals=4) + wMat = ef.view_matrix_as_dataframe(roundedW, column_names_list=self.final_nodes, row_names_list=self.final_nodes) + w_edgeList = wMat.stack().reset_index() + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns={"level_0": "source", "level_1": "target", 0: "weight"}) + w_edgeList = w_edgeList[w_edgeList["weight"] != 0] + + G = nx.from_pandas_edgelist(w_edgeList, source="source", target="target", edge_attr="weight") + pos = nx.spring_layout(G) + weights_list = [G.edges[e]['weight'] * self.prior_network.edge_weight_scaling for e in G.edges] + + fig, ax = plt.subplots() + + if not self.overlapped_nodes_only: + nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) + if nodes_to_add: + print(f":) {len(nodes_to_add)} new nodes added to network based on gene expression data {nodes_to_add}") + node_color_map = { + node: self.prior_network.added_node_color_name if node in nodes_to_add else 
self.prior_network.node_color_name + for node in G.nodes + } + nx.draw(G, pos, node_color=node_color_map.values(), edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.prior_network.node_color_name, edge_color=self.prior_network.edge_color_name, with_labels=True, width=weights_list, ax=ax) + + labels = {e: G.edges[e]['weight'] for e in G.edges} + return nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, ax=ax) + + + def compute_B_matrix_times_M(self, X): + """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term + see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html + The optimization objective for Lasso is: + (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 where M = n_sample + Calculations""" + XtX = X.T @ X + beta_L2 = self.beta_net + N_squared = self.N * self.N + part_2 = 2.0 * float(beta_L2) * self.M / (N_squared) * self.A + B = XtX + part_2 + return B + + + def compute_B_matrix(self, X): + """ M is N_sample, because ||y - Xc||^2 need to be normalized by 1/n_sample, but not the 2 * beta_L2 * c'Ac term + see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html + The optimization objective for Lasso is: + (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1 + where M = n_sample + Outputting for user """ + return self.compute_B_matrix_times_M(X) / self.M + + + def compute_X_tilde_y_tilde(self, B, X, y): + """Compute X_tilde, y_tilde such that X_tilde.T @ X_tilde = B, y_tilde.T @ X_tilde = y.T @ X """ + U, s, _Vh = np.linalg.svd(B, hermitian=True) # B = U @ np.diag(s) @ _Vh + if (cond := s[0] / s[-1]) > 1e10: + print(f'Large conditional number of B matrix: {cond: .2f}') + S_sqrt = ef.DiagonalLinearOperator(np.sqrt(s)) + S_inv_sqrt = ef.DiagonalLinearOperator(1 / np.sqrt(s)) + X_tilde = S_sqrt @ U.T + y_tilde = (y @ X @ U @ S_inv_sqrt).T + # assert(np.allclose(y.T @ X, y_tilde.T @ X_tilde)) + # assert(np.allclose(B, X_tilde.T @ X_tilde)) + # scale: we normalize by 1/M, but sklearn.linear_model.Lasso normalize by 1/N because X_tilde is N*N matrix, + # so Lasso thinks the number of sample is N instead of M, to use lasso solve our desired problem, correct the scale + scale = np.sqrt(self.N / self.M) + X_tilde *= scale + y_tilde *= scale + return X_tilde, y_tilde + + + def predict_y_from_y_tilda(self, X, X_tilda, pred_y_tilda): + + X = self.preprocess_X_df(X) + # Transposing the matrix before inverting + X_transpose_inv = np.linalg.inv(X.T) + + # Efficiently compute pred_y by considering the dimensions of matrices + pred_y = np.dot(np.dot(X_transpose_inv, X_tilda.T), pred_y_tilda) + + return pred_y + + + def _apply_parameter_constraints(self): + constraints = {**NetREmModel._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif value not in constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + + def 
calculate_mean_square_error(self, actual_values, predicted_values): + difference = (actual_values - predicted_values)# Please note that this function by Saniya calculates the Mean Square Error (MSE) + squared_diff = difference ** 2 # square of the difference + mean_squared_diff = np.mean(squared_diff) + return mean_squared_diff + + + def predict(self, X_test): + if self.standardize_X: + self.X_test_standardized = self.standardize_X_data(X_test) + X_test = self.preprocess_X_df(self.X_test_standardized) + else: + X_test = self.preprocess_X_df(X_test) # X_test + return self.regr.predict(X_test) + + + def test_mse(self, X_test, y_test): + X_test = X_test.sort_index(axis=1) # 9/20 + if self.standardize_X: + self.X_test_standardized = self.standardize_X_data(X_test) + X_test = self.preprocess_X_df(self.X_test_standardized) + else: + X_test = self.preprocess_X_df(X_test) # X_test + if self.center_y: + y_test = self.center_y_data(y_test) + #X_test = self.preprocess_X_df(X_test) # X_test + y_test = self.preprocess_y_df(y_test) + predY_test = self.regr.predict(X_test) # training data + mse_test = self.calculate_mean_square_error(y_test, predY_test) # Calculate MSE + return mse_test #mse_test + + + def get_params(self, deep=True): + params_dict = {"info":self.info, "alpha_lasso": self.alpha_lasso, "beta_net": self.beta_net, + "y_intercept": self.y_intercept, "model_type":self.model_type, + "standardize_X":self.standardize_X, + "center_y":self.center_y, + "max_lasso_iterations":self.max_lasso_iterations, + "network":self.network, "verbose":self.verbose, + "all_pos_coefs":self.all_pos_coefs, "model_info":self.model_info, + "target_gene_y":self.target_gene_y} + if self.model_type == "LassoCV": + params_dict["num_cv_folds"] = self.num_cv_folds + params_dict["num_jobs"] = self.num_jobs + params_dict["alpha_lasso"] = "LassoCV finds optimal alpha" + params_dict["lassocv_eps"] = self.lassocv_eps + params_dict["lassocv_n_alphas"] = self.lassocv_n_alphas + params_dict["lassocv_alphas"] = self.lassocv_alphas + params_dict["optimal_alpha"] = self.optimal_alpha + elif self.model_type == "Linear": + params_dict["alpha_lasso"] = "No alpha needed" + params_dict["num_jobs"] = self.num_jobs + if self.model_type != "Linear": + params_dict["tolerance"] = self.tolerance + params_dict["lasso_selection"] = self.lasso_selection + if not deep: + return params_dict + else: + return copy.deepcopy(params_dict) + + + def set_params(self, **params): + """ Sets the value of any parameters in this estimator + Parameters: **params: Dictionary of parameter names mapped to their values + Returns: self: Returns an instance of self """ + if not params: + return self + for key, value in params.items(): + if key not in self.get_params(): + raise ValueError(f'Invalid parameter {key} for estimator {self.__class__.__name__}') + setattr(self, key, value) + return self + + + def __deepcopy__(self, memo): + cls = self.__class__ + result = cls.__new__(cls) + memo[id(self)] = result + for k, v in self.__dict__.items(): + setattr(result, k, deepcopy(v, memo)) + result.optimal_alpha = self.optimal_alpha + return result + + + def clone(self): + return deepcopy(self) + + + def score(self, X, y, zero_coef_penalty=10): + if isinstance(X, pd.DataFrame): + X = self.preprocess_X_df(X) # X_test + if isinstance(y, pd.DataFrame): + y = self.preprocess_y_df(y) + + # Make predictions using the predict method of your custom estimator + y_pred = self.predict(X) + + # Handle cases where predictions are exactly zero + y_pred[y_pred == 0] = 1e-10 + + # Calculate 
the normalized mean squared error between the true and predicted values + nmse_ = (y - y_pred)**2 + nmse_[y_pred == 1e-10] *= zero_coef_penalty + nmse_ = nmse_.mean() / (y**2).mean() + + if nmse_ == 0: + #return float(1e1000) # Return positive infinity if nmse_ is zero + + return float("inf") # Return positive infinity if nmse_ is zero + else: + return -nmse_ + + + def updating_network_A_matrix_given_X(self) -> np.ndarray: + """ When we call the fit method, this function is used to help us update the network information. + Here, we can generate updated W matrix, updated D matrix, and updated V matrix. + Then, those updated derived matrices are used to calculate the A matrix. + """ + network = self.network + final_nodes = self.final_nodes + W_df = network.W_df.copy() # updating the W matrix + + # Simplified addition of new nodes + if self.gexpr_nodes_added: + for node in self.gexpr_nodes_added: + W_df[node] = np.nan + W_df.loc[node] = np.nan + + # Consolidated indexing and reindexing operations + W_df = W_df.reindex(index=final_nodes, columns=final_nodes) + + # Handle missing values + W_df.fillna(value=self.prior_network.default_edge_weight, inplace=True) + np.fill_diagonal(W_df.values, 0) + + N = len(final_nodes) + self.N = N + W = W_df.values + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + self.W = W + self.W_df = W_df + + # Check for symmetric matrix + if not ef.check_symmetric(W): + print(":( W matrix is NOT symmetric") + + # Update V matrix + self.V = N * np.eye(N) - np.ones(N) + + # Update D matrix + if not network.edge_values_for_degree: + W_bool = (W > network.threshold_for_degree) + d = np.float64(W_bool.sum(axis=0) - W_bool.diagonal()) + else: + if network.w_transform_for_d == "sqrt": + W_to_use = np.sqrt(W) + elif network.w_transform_for_d == "square": + W_to_use = W ** 2 + else: + W_to_use = W + d = W_to_use.diagonal() * (self.N - 1) + + # Handle pseudocount and self loops + d += network.pseudocount_for_degree + if network.consider_self_loops: + d += 1 + + d_inv_sqrt = 1 / np.sqrt(d) + self.D = ef.DiagonalLinearOperator(d_inv_sqrt) + + # Update inv_sqrt_degree_df + self.inv_sqrt_degree_df = pd.DataFrame({ + "TF": self.final_nodes, + "degree_D": self.D * np.ones(self.N) + }) + + Amat = self.D @ (self.V * W) @ self.D + A_df = pd.DataFrame(Amat, columns=final_nodes, index=final_nodes, dtype=np.float64) + + # Handle nodes based on `overlapped_nodes_only` + gene_expression_nodes = self.gene_expression_nodes + nodes_to_add = list(set(gene_expression_nodes) - set(final_nodes)) + self.nodes_to_add = nodes_to_add + if not self.overlapped_nodes_only: + for name in nodes_to_add: + A_df[name] = 0 + A_df.loc[name] = 0 + A_df = A_df.reindex(columns=sorted(gene_expression_nodes), index=sorted(gene_expression_nodes)) + else: + if len(nodes_to_add) == 1: + print(f"Please note that we remove 1 node {nodes_to_add[0]} found in the input gene expression data (X) that is not found in the input network :)") + elif len(nodes_to_add) > 1: + print(f":) Since overlapped_nodes_only = True, please note that we remove {len(nodes_to_add)} gene expression nodes that are not found in the input network.") + print(nodes_to_add) + A_df = A_df.sort_index(axis=0).sort_index(axis=1) + + self.A_df = A_df + self.A = A_df.values + self.nodes = A_df.columns.tolist() + self.tf_names_list = self.nodes + return self + + def preprocess_X_df(self, X): + if isinstance(X, pd.DataFrame): + X_df = X + column_names_list = list(X_df.columns) + overlap_num = len(ef.intersection(column_names_list, 
self.final_nodes)) + if overlap_num == 0: + print("Please note: we are flipping X dataframe around so that the rows are samples and the columns are gene/TF names :)") + X_df = X_df.transpose() + column_names_list = list(X_df.columns) + overlap_num = len(ef.intersection(column_names_list, self.common_nodes)) + gene_names_list = self.final_nodes # so that this matches the order of columns in A matrix as well + X_df = X_df.loc[:, X_df.columns.isin(gene_names_list)] # filtering the X_df as needed based on the columns + X_df = X_df.reindex(columns=gene_names_list)# Reorder columns of dataframe to match order in `column_order` + X = np.array(X_df.values.tolist()) + return X + + + def preprocess_y_df(self, y): + if isinstance(y, pd.DataFrame): + y = y.values.flatten() + return y + + + def return_Linear_ML_model(self, X, y): + regr = LinearRegression(fit_intercept = self.y_intercept, + positive = self.all_pos_coefs, + n_jobs = self.num_jobs) + regr.fit(X, y) + return regr + + + def return_Lasso_ML_model(self, X, y): + regr = Lasso(alpha = self.alpha_lasso, fit_intercept = self.y_intercept, + max_iter = self.max_lasso_iterations, tol = self.tolerance, + selection = self.lasso_selection, + positive = self.all_pos_coefs) + regr.fit(X, y) + return regr + + + def return_LassoCV_ML_model(self, X, y): + regr = LassoCV(cv = self.num_cv_folds, random_state = 0, + fit_intercept = self.y_intercept, + max_iter = self.max_lasso_iterations, + n_jobs = self.num_jobs, + tol = self.tolerance, + selection = self.lasso_selection, + positive = self.all_pos_coefs, + eps = self.lassocv_eps, + n_alphas = self.lassocv_n_alphas, + alphas = self.lassocv_alphas) + regr.fit(X, y) + return regr + + + def return_fit_ml_model(self, X, y): + if self.model_type == "Linear": + model_to_return = self.return_Linear_ML_model(X, y) + elif self.model_type == "Lasso": + model_to_return = self.return_Lasso_ML_model(X, y) + elif self.model_type == "LassoCV": + model_to_return = self.return_LassoCV_ML_model(X, y) + return model_to_return + + +def netrem(edge_list, beta_net = 1, alpha_lasso = 0.01, default_edge_weight = 0.1, + degree_threshold = 0.5, gene_expression_nodes = [], overlapped_nodes_only = False, + y_intercept = False, standardize_X = True, center_y = True, view_network = False, + model_type = "Lasso", lasso_selection = "cyclic", all_pos_coefs = False, tolerance = 1e-4, maxit = 10000, + num_jobs = -1, num_cv_folds = 5, lassocv_eps = 1e-3, + lassocv_n_alphas = 100, # default in sklearn + lassocv_alphas = None, # default in sklearn + verbose = False, + hide_warnings = True): + degree_pseudocount = 1e-3 + if hide_warnings: + warnings.filterwarnings("ignore") + default_beta = False + default_alpha = False + if beta_net == 1: + print("using beta_net default of", 1) + default_beta = True + if alpha_lasso == 0.01: + if model_type != "LassoCV": + print("using alpha_lasso default of", 0.01) + default_alpha = True + edge_vals_for_d = False + self_loops = False + w_transform_for_d = "none" + + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": edge_vals_for_d, + "consider_self_loops":self_loops, + "pseudocount_for_degree":degree_pseudocount, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":w_transform_for_d, + "threshold_for_degree": degree_threshold, + "verbose":verbose, + "view_network":view_network} + netty = graph.PriorGraphNetwork(**prior_graph_dict) # uses the network to get features like the A matrix. 
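+    # netty now holds the processed prior network (W matrix, node degrees, and the
+    # derived A matrix). Below, only settings the user actually changed are forwarded
+    # to NetREmModel: alpha_lasso / beta_net are passed only when they differ from the
+    # printed defaults, and model-type-specific options (tolerance, lasso_selection,
+    # num_jobs, LassoCV settings) are added only where they apply.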
+ greg_dict = {"network": netty, + "model_type": model_type, + "use_network":True, + "standardize_X":standardize_X, + "center_y":center_y, + "y_intercept":y_intercept, + "overlapped_nodes_only":overlapped_nodes_only, + "max_lasso_iterations":maxit, + "all_pos_coefs":all_pos_coefs, + "view_network":view_network, + "verbose":verbose} + if default_alpha == False: + greg_dict["alpha_lasso"] = alpha_lasso + if default_beta == False: + greg_dict["beta_net"] = beta_net + if model_type != "Linear": + greg_dict["tolerance"] = tolerance + greg_dict["lasso_selection"] = lasso_selection + if model_type != "Lasso": + greg_dict["num_jobs"] = num_jobs + if model_type == "LassoCV": + greg_dict["num_cv_folds"] = num_cv_folds + greg_dict["lassocv_eps"] = lassocv_eps + greg_dict["lassocv_n_alphas"] = lassocv_n_alphas + greg_dict["lassocv_alphas"] = lassocv_alphas + greggy = NetREmModel(**greg_dict) + return greggy + + +def netremCV(edge_list, X, y, + num_beta: int = 10, + extra_beta_list = [0.25, 0.5, 0.75, 1], # additional beta to try out + num_alpha: int = 10, + max_beta: float = 200, # max_beta used to help prevent explosion of beta_net values + reduced_cv_search: bool = False, # should we do a reduced search (Randomized Search) or a GridSearch? + default_edge_weight: float = 0.1, + degree_threshold: float = 0.5, + gene_expression_nodes = [], + overlapped_nodes_only: bool = False, + standardize_X: bool = True, + center_y: bool = True, + y_intercept: bool = False, + model_type = "Lasso", + lasso_selection = "cyclic", + all_pos_coefs: bool = False, + tolerance: float = 1e-4, + maxit: int = 10000, + num_jobs: int = -1, + num_cv_folds: int = 5, + lassocv_eps: float = 1e-3, + lassocv_n_alphas: int = 100, # default in sklearn + lassocv_alphas = None, # default in sklearn + verbose = False, + searchVerbosity: int = 2, + show_warnings: bool = False): + + X_train = X + y_train = y + if show_warnings == False: + warnings.filterwarnings('ignore') + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": False, + "consider_self_loops":False, + "pseudocount_for_degree":1e-3, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":"none", + "threshold_for_degree": degree_threshold, + "verbose":verbose, + "view_network":False} + + prior_network = graph.PriorGraphNetwork(**prior_graph_dict) + + # generate the beta grid: + if isinstance(X_train, pd.DataFrame): + X_df = X_train + gene_names_list = list(X_df.columns) + if overlapped_nodes_only: + nodes_list = prior_network.nodes#self.nodes + common_nodes = ef.intersection(gene_names_list, nodes_list) + common_nodes.sort() + + X_df = X_df.loc[:, X_df.columns.isin(common_nodes)] + # Reorder columns of dataframe to match order in `column_order` + X_df = X_df.reindex(columns=common_nodes) + else: + X_df = X_df.reindex(columns=gene_names_list) + + X_train_np = X_df.copy() + y_train_np = y_train.copy() + if standardize_X: + if verbose: + print("standardizing X :)") + scaler = preprocessing.StandardScaler().fit(X_df) + X_train_np = scaler.transform(X_df) + else: + X_train_np = np.array(X_df.values.tolist()) + if isinstance(y_train, pd.DataFrame): + y_train_np = y_train_np.values.flatten() + beta_max = 0.5 * np.max(np.abs(X_train_np.T.dot(y_train_np))) + beta_min = 0.01 * beta_max + + var_X = np.var(X_train_np) + var_y = np.var(y_train_np) + if beta_max > max_beta: # max_beta used to prevent explosion of beta_net values + if verbose: + print(":) using variance to define beta_net values") + beta_max = 0.5 * 
np.max(np.abs(var_X * var_y)) * 100 + beta_min = 0.01 * beta_max + if verbose: + print(f"beta_min = {beta_min} and beta_max = {beta_max}") + beta_grid = np.logspace(np.log10(beta_max), np.log10(beta_min), num=num_beta) + if extra_beta_list != None: + if len(extra_beta_list) > 0: + for add_beta in extra_beta_list: # we add additional beta based on user-defined list + beta_grid = np.append(add_beta, beta_grid) + + + beta_alpha_grid_dict = {"beta_network_vals": [], "alpha_lasso_vals": []} + # generating the alpha-values that are corresponding + try: + with tqdm(beta_grid, desc=":) Generating beta_net and alpha_lasso pairs") as pbar: + for beta in pbar: + if verbose: + print("beta_network:", beta) + # please fix it so it reflects what we want more... like the proper defaults + netremCV_demo = NetREmModel(beta_net=beta, + model_type="LassoCV", + network=prior_network, + overlapped_nodes_only=overlapped_nodes_only, + standardize_X = standardize_X, + center_y = center_y, + y_intercept = y_intercept, + max_lasso_iterations = maxit, + all_pos_coefs = all_pos_coefs, + tolerance = tolerance, + lasso_selection = lasso_selection, + num_cv_folds = num_cv_folds, + #num_jobs = num_jobs, + lassocv_eps = lassocv_eps, + lassocv_n_alphas = lassocv_n_alphas, + lassocv_alphas = lassocv_alphas) + if lassocv_alphas != None: + netremCV_demo.lassocv_alphas = lassocv_alphas + + # Fit the model and compute alpha_max and alpha_min + netremCV_demo.fit(X_train, y_train) + X_tilda_train = netremCV_demo.X_tilda_train + y_tilda_train = netremCV_demo.y_tilda_train + alpha_max = 0.5 * np.max(np.abs(X_tilda_train.T.dot(y_tilda_train))) + alpha_min = 0.01 * alpha_max + if verbose: + print(f"alpha_min = {alpha_min} and alpha_max = {alpha_max}") + + # Generate alpha_grid based on alpha_max and alpha_min + optimal_alpha = netremCV_demo.regr.alpha_ + # take the cross-validation alpha and apply as the best alpha as well for this beta_net + beta_alpha_grid_dict["beta_network_vals"].append(beta) + beta_alpha_grid_dict["alpha_lasso_vals"].append(optimal_alpha) + # we also utilize the other alphas we have constructed dynamically and will find the best alpha among those + alpha_grid = np.logspace(np.log10(alpha_min), np.log10(alpha_max), num=num_alpha) + + # Find the best alpha using cross-validation + best_alpha = None + best_score = float('-inf') + for alpha in alpha_grid: + netremCV_demo = NetREmModel(beta_net=beta, + alpha_lasso = alpha, + model_type="Lasso", + network=prior_network, + standardize_X = standardize_X, + center_y = center_y, + overlapped_nodes_only=overlapped_nodes_only, + y_intercept = y_intercept, + max_lasso_iterations = maxit, + all_pos_coefs = all_pos_coefs, + tolerance = tolerance, + lasso_selection = lasso_selection) + scores = cross_val_score(netremCV_demo, X_train, y_train, cv=num_cv_folds, scoring = "neg_mean_squared_error") # You can change cv to your specific cross-validation strategy + mean_score = np.mean(scores) + if mean_score > best_score: + best_score = mean_score + best_alpha = alpha + + # Append the beta and best_alpha to the dictionary + beta_alpha_grid_dict["beta_network_vals"].append(beta) + beta_alpha_grid_dict["alpha_lasso_vals"].append(best_alpha) + + except Exception as e: + print(f"An error occurred: {e}") + if verbose: + print("finished generate_alpha_beta_pairs") + print(beta_alpha_grid_dict) + print(f"Length of beta_alpha_grid_dict: {len(beta_alpha_grid_dict['beta_network_vals'])}") + + param_grid = [{"alpha_lasso": [alpha_las], "beta_net": [beta_net]} + for alpha_las, beta_net in 
zip(beta_alpha_grid_dict["alpha_lasso_vals"], + beta_alpha_grid_dict["beta_network_vals"])] + if verbose: + print(":) Performing NetREmCV with both beta_network and alpha_lasso as UNKNOWN.") + initial_greg = NetREmModel(network=prior_network, + y_intercept = y_intercept, + standardize_X = standardize_X, + center_y = center_y, + max_lasso_iterations=maxit, + all_pos_coefs=all_pos_coefs, + lasso_selection = lasso_selection, + tolerance = tolerance, + view_network=False, + overlapped_nodes_only=overlapped_nodes_only) + pbar = tqdm(total=len(param_grid)) # Assuming we're trying 9 combinations of parameters + + if reduced_cv_search: + # Run RandomizedSearchCV + if verbose: + print(f":) since reduced_cv_search = {reduced_cv_search}, we perform RandomizedSearchCV on a reduced search space") + grid_search= RandomizedSearchCV(initial_greg, + param_grid, + n_iter=num_alpha, + cv=num_cv_folds, + scoring = "neg_mean_squared_error", + #scoring=make_scorer(custom_mse, greater_is_better=False), + verbose=searchVerbosity) + else: + # Run GridSearchCV + grid_search = GridSearchCV(initial_greg, param_grid=param_grid, cv=num_cv_folds, + scoring = "neg_mean_squared_error", + #scoring=make_scorer(custom_mse, greater_is_better=False), + verbose = searchVerbosity) + grid_search.fit(X_train, y_train) + + # Extract and display the best hyperparameters + best_params = grid_search.best_params_ + optimal_alpha = best_params["alpha_lasso"] + optimal_beta = best_params["beta_net"] + print(f":) NetREmCV found that the optimal alpha_lasso = {optimal_alpha} and optimal beta_net = {optimal_beta}") + + newest_netrem = NetREmModel(alpha_lasso = optimal_alpha, + beta_net = optimal_beta, + network = prior_network, + y_intercept = y_intercept, + standardize_X = standardize_X, + center_y = center_y, + max_lasso_iterations=maxit, + all_pos_coefs=all_pos_coefs, + lasso_selection = lasso_selection, + tolerance = tolerance, + view_network=False, + overlapped_nodes_only=overlapped_nodes_only) + newest_netrem.fit(X_train, y_train) + train_mse = newest_netrem.test_mse(X_train, y_train) + print(f":) Please note that the training Mean Square Error (MSE) from this fitted NetREm model is {train_mse}") + return newest_netrem + + +def organize_B_interaction_network(netrem_model): + B_interaction_df = netrem_model.B_interaction_df + num_TFs = B_interaction_df.shape[0] + B_interaction_df = B_interaction_df.reset_index().melt(id_vars='index', var_name='TF2', value_name='B_train_weight') + B_interaction_df = B_interaction_df.rename(columns = {"index":"TF1"}) + B_interaction_df = B_interaction_df[B_interaction_df["TF1"] != B_interaction_df["TF2"]] + B_interaction_df = B_interaction_df.sort_values(by = ['B_train_weight'], ascending = False) + B_interaction_df["sign"] = np.where((B_interaction_df.B_train_weight > 0), ":)", ":(") + B_interaction_df["potential_interaction"] = np.where((B_interaction_df.B_train_weight > 0), ":(", + ":( competitive (-)") + B_interaction_df["absVal_B"] = abs(B_interaction_df["B_train_weight"]) + B_interaction_df["info"] = "B matrix of TF-TF interactions" + B_interaction_df["candidate_TFs_N"] = num_TFs + B_interaction_df["target_gene_y"] = netrem_model.target_gene_y + B_interaction_df["num_final_predictors"] = netrem_model.num_final_predictors + B_interaction_df["model_type"] = netrem_model.model_type + B_interaction_df["beta_net"] = netrem_model.beta_net + B_interaction_df["X_standardized"] = netrem_model.standardize_X + B_interaction_df["gene_data"] = "training gene expression data" + + # Step 1: Please Sort the 
DataFrame + B_interaction_df = B_interaction_df.sort_values('absVal_B', ascending=False) + + # Step 2: Get the rank + B_interaction_df['rank'] = B_interaction_df['absVal_B'].rank(method='min', ascending=False) + + # Step 3: Calculate the percentile + B_interaction_df['percentile'] = (1 - (B_interaction_df['rank'] / B_interaction_df['absVal_B'].count())) * 100 + return B_interaction_df \ No newline at end of file diff --git a/code/previous_version/PriorGraphNetwork.py b/code/previous_version/PriorGraphNetwork.py new file mode 100644 index 0000000..29d474d --- /dev/null +++ b/code/previous_version/PriorGraphNetwork.py @@ -0,0 +1,547 @@ +# PriorGraphNetwork Class: :) +# Standard libraries +import os +import sys +import random +import copy +import warnings + +# Third-party libraries +import pandas as pd +import numpy as np +import networkx as nx +import scipy +import matplotlib.pyplot as plt +import plotly.express as px +from tqdm import tqdm +import jinja2 + +# Scikit-learn imports +from sklearn import linear_model +from sklearn.metrics import make_scorer +from sklearn.exceptions import ConvergenceWarning +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge + +# Scipy imports +from scipy.linalg import svd as robust_svd +from scipy.sparse.linalg.interface import LinearOperator + +# Type hinting +from typing import Optional, List, Tuple +from numpy.typing import ArrayLike + +# Custom module imports +import essential_functions as ef +import error_metrics as em +import DemoDataBuilderXandY as demo + + +import math +from sklearn.metrics.pairwise import cosine_similarity +from node2vec import Node2Vec + + +# Constants +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + +# Utility functions +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) + + +class PriorGraphNetwork: + """:) Please note that this class focuses on incorporating information from a prior network (in our case, + a biological network of some sort). The input would be an edge list with: source, target, weight. If no + weights are given then a weight of 1 will be automatically assumed. + If the prior network is NOT symmetric (most likely directed): + please note we can use graph embedding techniques like weighted node2vec (on the directed graph) to generate + an embedding, find the cosine similarity, and then use the node-node similarity values for our network. + Ultimately, this class builds the W matrix (for the prior network weights to be used for our network + regularization penalty), the D matrix (of degrees), and the V matrix (custom for our approach).""" + + _parameter_constraints = { + "w_transform_for_d": ["none", "sqrt", "square"], + "degree_pseudocount": (0, None), + "default_edge_weight": (0, None), + "threshold_for_degree": (0, None), + "view_network":[True, False], + "verbose":[True, False]} + + def __init__(self, **kwargs): # define default values for constants + + self.edge_values_for_degree = False # we instead consider a threshold by default (for counting edges into our degrees) + self.consider_self_loops = False # no self loops considered + self.verbose = True # printing out statements + self.pseudocount_for_degree = 1e-3 # to ensure that we do not have any 0 degrees for any node in our matrix. 
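+        # Note: this pseudocount keeps every node degree strictly positive, since
+        # generate_degree_matrix_from_weight_matrix() later builds D as a diagonal operator
+        # of 1/sqrt(d); a node with degree 0 would otherwise cause a division by zero.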
+ self.undirected_graph_bool = True # by default we assume the input network is undirected and symmetric :) + self.default_edge_weight = 0.1 # if an edge is missing weight information + # these are the nodes we may wish to include. If these are provided, then we may utilize these in our model. + self.gene_expression_nodes = [] # default if we use edge weights for degree: + # if edge_values_for_degree is True: we can use the edge weight values to get the degrees. + self.w_transform_for_d = "none" + #self.square_root_weights_for_degree = False # take the square root of the edge weights for the degree calculations + #self.squaring_weights_for_degree = False # square the edge weights for the degree calculations + # default if we use a threshold for the degree: + self.threshold_for_degree = 0.5 + self.view_network = False + #################### +# self.dimensions = 64 +# self.walk_length = 30 +# self.num_walks = 200 +# self.p = 1 +# self.q = 0.5 +# self.workers = 4 +# self.window = 10 +# self.min_count = 1 +# self.batch_words = 4 + self.node_color_name = "yellow" + self.added_node_color_name = "lightblue" + self.edge_color_name = "red" + self.edge_weight_scaling = 5 + self.debug = False + #################### + self.preprocessed_network = False # is the network preprocessed with the final gene expression nodes + self.__dict__.update(kwargs) # overwrite with any user arguments :) + required_keys = ["edge_list"] # if consider_self_loops is true, we add 1 to degree value for each node, + # check that all required keys are present: + missing_keys = [key for key in required_keys if key not in self.__dict__] + if missing_keys: + raise ValueError(f":( Please note since edge_values_for_degree = {self.edge_values_for_degree} ye are missing information for these keys: {missing_keys}") + self.network_nodes = self.network_nodes_from_edge_list() + # other defined results: + # added Aug 30th: + if isinstance(self.edge_list, pd.DataFrame): + print(":( Please input edgelist as a list of lists instead of a dataframe. 
For your edge_df, try: edge_df.values.tolist()") + #self.edge_list = self.edge_list.values.tolist() + self.original_edge_list = self.edge_list + if len(self.gene_expression_nodes) > 0: # is not None: + self.preprocessed_network = True + self.gene_expression_nodes.sort() + gene_expression_nodes = self.gene_expression_nodes + self.final_nodes = gene_expression_nodes + common_nodes = ef.intersection(self.network_nodes, self.gene_expression_nodes) + common_nodes.sort() + self.common_nodes = common_nodes + self.gex_nodes_to_add = list(set(self.gene_expression_nodes) - set(self.common_nodes)) + self.network_nodes_to_remove = list(set(self.network_nodes) - set(self.common_nodes)) + # filtering the edge_list: + self.edge_list = [edge for edge in self.original_edge_list if edge[0] in gene_expression_nodes and edge[1] in gene_expression_nodes] + else: + self.final_nodes = self.network_nodes + if self.verbose: + print(self.final_nodes) + self.tf_names_list = self.final_nodes + self.nodes = self.final_nodes + self.N = len(self.tf_names_list) + self.V = self.create_V_matrix() + if self.undirected_graph_bool: + self.directed=False + self.undirected_edge_list_to_matrix() + self.W_original = self.W + #self.edge_df = self.undirected_edge_list_updated().drop_duplicates() + else: + self.directed=True + self.W_original = self.directed_node2vec_similarity(self.edge_list, self.dimensions, + self.walk_length, self.num_walks, + self.p, self.q, self.workers, + self.window, self.min_count, self.batch_words) + self.W = self.generate_symmetric_weight_matrix() + self.W_df = pd.DataFrame(self.W, columns = self.nodes, index = self.nodes) + if self.view_network: + self.view_W_network = self.view_W_network() + else: + self.view_W_network = None + self.degree_vector = self.generate_degree_vector_from_weight_matrix() + self.D = self.generate_degree_matrix_from_weight_matrix() + # added on April 26, 2023 + degree_df = pd.DataFrame(self.final_nodes, columns = ["TF"]) + degree_df["degree_D"] = self.D * np.ones(self.N) + self.inv_sqrt_degree_df = degree_df ######## + self.edge_list_from_W = self.return_W_edge_list() + self.A = self.create_A_matrix() + self.A_df = pd.DataFrame(self.A, columns = self.nodes, index = self.nodes, dtype=np.float64) + self.param_lists = self.full_lists() + self.param_df = pd.DataFrame(self.full_lists(), columns = ["parameter", "data type", "description", "value", "class"]) + self.node_status_df = self.find_node_status_df() + self._apply_parameter_constraints() + + + def find_node_status_df(self): + """ Returns the node status """ + preprocessed_result = "No :(" + if self.preprocessed_network: + preprocessed_result = "Yes :)" + if self.preprocessed_network: + common_df = pd.DataFrame(self.common_nodes, columns = ["node"]) + common_df["preprocessed"] = preprocessed_result + common_df["status"] = "keep :)" + common_df["info"] = "Common Node (Network and Gene Expression)" + full_df = common_df + if len(self.gex_nodes_to_add) > 0: + gex_add_df = pd.DataFrame(self.gex_nodes_to_add, columns = ["node"]) + gex_add_df["preprocessed"] = preprocessed_result + gex_add_df["status"] = "keep :)" + gex_add_df["info"] = "Gene Expression Node Only" + full_df = pd.concat([common_df, gex_add_df]) + if len(self.network_nodes_to_remove) > 0: + net_remove_df = pd.DataFrame(self.network_nodes_to_remove, columns = ["node"]) + net_remove_df["preprocessed"] = preprocessed_result + net_remove_df["status"] = "remove :(" + net_remove_df["info"] = "Network Node Only" + full_df = pd.concat([full_df, net_remove_df]) + else: + full_df 
= pd.DataFrame(self.network_nodes, columns = ["node"]) + full_df["preprocessed"] = preprocessed_result + full_df["status"] = 'unknown :|' + full_df["info"] = "Original Network Node" + return full_df + + + def network_nodes_from_edge_list(self): + edge_list = self.edge_list + network_nodes = list({node for edge in edge_list for node in edge[:2]}) + network_nodes.sort() + return network_nodes + + + def _apply_parameter_constraints(self): + constraints = {**PriorGraphNetwork._parameter_constraints} + for key, value in self.__dict__.items(): + if key in constraints: + if isinstance(constraints[key], tuple): + if isinstance(constraints[key][0], type) and not isinstance(value, constraints[key][0]): + setattr(self, key, constraints[key][0]) + elif constraints[key][1] is not None and isinstance(constraints[key][1], type) and not isinstance(value, constraints[key][1]): + setattr(self, key, constraints[key][1]) + elif value not in constraints[key]: + setattr(self, key, constraints[key][0]) + return self + + + def create_V_matrix(self): + V = self.N * np.eye(self.N) - np.ones(self.N) + return V + + + + # Optimized functions + def preprocess_edge_list(self): + processed_edge_list = [] + default_edge_weight = self.default_edge_weight + + for sublst in self.edge_list: + if len(sublst) == 2: + processed_edge_list.append(sublst + [default_edge_weight]) + else: + processed_edge_list.append(sublst) + + return processed_edge_list + + def undirected_edge_list_to_matrix(self): + all_nodes = self.final_nodes + edge_list = self.preprocess_edge_list() + default_edge_weight = self.default_edge_weight + N = len(all_nodes) + self.N = N + weight_df = np.full((N, N), default_edge_weight) + + # Create a mapping from node to index + node_to_idx = {node: idx for idx, node in enumerate(all_nodes)} + + for edge in tqdm(edge_list) if self.verbose else edge_list: + try: + source, target, *weight = edge + weight = weight[0] if weight else default_edge_weight + weight = np.nan_to_num(weight, nan=default_edge_weight) + source_idx, target_idx = node_to_idx[source], node_to_idx[target] + weight_df[source_idx, target_idx] = weight + weight_df[target_idx, source_idx] = weight + except ValueError as e: + print(f"An error occurred: {e}") + continue + + np.fill_diagonal(weight_df, 0) + W = weight_df + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (N - 1)) + + if not ef.check_symmetric(W): + print(":( W matrix is NOT symmetric") + + self.W = W + self.W_df = pd.DataFrame(W, columns=all_nodes, index=self.final_nodes, dtype=np.float64) + return self + + + def generate_symmetric_weight_matrix(self) -> np.ndarray: + """generate symmetric W matrix. W matrix (Symmetric --> W = W_Transpose). + Note: each diagonal element is the summation of other non-diagonal elements in the same row divided by (N-1) + 2023.02.14_Xiang. 
TODO: add parameter descriptions""" + W = self.W_original + np.fill_diagonal(W, (W.sum(axis=0) - W.diagonal()) / (self.N - 1)) + symmetric_W = ef.check_symmetric(W) + if symmetric_W == False: + print(":( W matrix is NOT symmetric") + return None + return W + + + def return_W_edge_list(self): + wMat = ef.view_matrix_as_dataframe(self.W, column_names_list = self.tf_names_list, row_names_list = self.tf_names_list) + w_edgeList = wMat.stack().reset_index() + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns = {"level_0":"source", "level_1":"target", 0:"weight"}) + w_edgeList = w_edgeList.sort_values(by = ["weight"], ascending = False) + return w_edgeList + + + def view_W_network(self): + roundedW = np.round(self.W, decimals=4) + wMat = ef.view_matrix_as_dataframe(roundedW, column_names_list=self.nodes, row_names_list=self.nodes) + w_edgeList = wMat.stack().reset_index() + # https://stackoverflow.com/questions/48218455/how-to-create-an-edge-list-dataframe-from-a-adjacency-matrix-in-python + w_edgeList = w_edgeList[w_edgeList["level_0"] != w_edgeList["level_1"]] + w_edgeList = w_edgeList.rename(columns={"level_0": "source", "level_1": "target", 0: "weight"}) + w_edgeList = w_edgeList[w_edgeList["weight"] != 0] + + G = nx.from_pandas_edgelist(w_edgeList, source="source", target="target", edge_attr="weight") + pos = nx.spring_layout(G) + weights_list = [G.edges[e]['weight'] * self.edge_weight_scaling for e in G.edges] + fig, ax = plt.subplots() + if self.preprocessed_network and len(self.gex_nodes_to_add) > 0: + new_nodes = self.gex_nodes_to_add + print("new nodes:", new_nodes) + node_color_map = {node: self.added_node_color_name if node in new_nodes else self.node_color_name for node in G.nodes} + nx.draw(G, pos, node_color=node_color_map.values(), edge_color=self.edge_color_name, with_labels=True, width=weights_list, ax=ax) + else: + nx.draw(G, pos, node_color=self.node_color_name, edge_color=self.edge_color_name, with_labels=True, width=weights_list, ax=ax) + + labels = {e: G.edges[e]['weight'] for e in G.edges} + return nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, ax=ax) + + + def generate_degree_vector_from_weight_matrix(self) -> np.ndarray: + """generate d degree vector. 
2023.02.14_Xiang TODO: add parameter descriptions + """ + if self.edge_values_for_degree == False: + W_bool = (self.W > self.threshold_for_degree) + d = np.float64(W_bool.sum(axis=0) - W_bool.diagonal()) + else: + if self.w_transform_for_d == "sqrt": #self.square_root_weights_for_degree: # taking the square root of the weights for the edges + W_to_use = np.sqrt(self.W) + elif self.w_transform_for_d == "square": # self.squaring_weights_for_degree: + W_to_use = self.W ** 2 + else: + W_to_use = self.W + d = W_to_use.diagonal() * (self.N - 1) # summing the edge weights + d += self.pseudocount_for_degree + if self.consider_self_loops: + d += 1 # we also add in a self-loop :) + # otherwise, we can just use this threshold for the degree + if self.verbose: + print(":) Please note: we are generating the prior network:") + if self.edge_values_for_degree: + print(":) Please note that we use the sum of the edge weight values to get the degree for a given node.") + else: + print(f":) Please note that we count the number of edges with weight > {self.threshold_for_degree} to get the degree for a given node.") + if self.consider_self_loops: + print(f":) Please note that since consider_self_loops = {self.consider_self_loops} we also add 1 to the degree for each node (as a self-loop).") + print(f":) We also add {self.pseudocount_for_degree} as a pseudocount to our degree value for each node.") + print() # + return d + + + def generate_degree_matrix_from_weight_matrix(self): # D matrix + """:) Please note that this function returns the D matrix as a diagonal matrix + where the entries are 1/sqrt(d). Here, d is a vector corresponding to the degree of each matrix""" + # we see that the D matrix is higher for nodes that are singletons, a much higher value because it is not connected + d = self.degree_vector + d_inv_sqrt = 1 / np.sqrt(d) + # D = np.diag(d_inv_sqrt) # full matrix D, only suitable for small scale. Use DiagonalLinearOperator instead. + D = ef.DiagonalLinearOperator(d_inv_sqrt) + return D + + + def create_A_matrix(self): # A matrix + """ Please note that this function by Saniya creates the A matrix, which is: + :) here: %*% refers to matrix multiplication + and * refers to element-wise multiplication (for 2 dataframes with same exact dimensions, + component-wise multiplication) + # Please note that this function by Saniya creates the A matrix, which is: + # (D_transpose) %*% (V*W) %*% (D) + """ + A = self.D @ (self.V * self.W) @ self.D + approxSame = ef.check_symmetric(A) # please see if A is symmetric + if approxSame: + return A + else: + print(f":( False. 
A is NOT a symmetric matrix.") + print(A) + return False + + + def full_lists(self): + # network arguments used: + # argument, description, our value + full_lists = [] + term_to_add_last = "PriorGraphNetwork" + row1 = ["default_edge_w", ">= 0", "edge weight for any edge with missing weight info", self.default_edge_weight, term_to_add_last] + row2 = ["self_loops", "boolean", "add 1 to the degree for each node (based on self-loops)?", self.consider_self_loops, term_to_add_last] + + full_lists.append(row1) + full_lists.append(row2) + if self.pseudocount_for_degree != 0: + row3 = ["d_pseudocount", ">= 0", + "to ensure that no nodes have 0 degree value in D matrix", + self.pseudocount_for_degree, term_to_add_last] + full_lists.append(row3) + if self.edge_values_for_degree: + row_to_add = ["edge_vals_for_d", "boolean", + "if True, we use the edge weight values to derive our degrees for matrix D", True, term_to_add_last] + full_lists.append(row_to_add)# arguments to add in: + if self.w_transform_for_d == "sqrt": # take the square root of the edge weights for the degree calculations + row_to_add = ["w_transform_for_d: sqrt", "string", + "for each edge, we use the square root of the edge weight values to derive our degrees for matrix D", self.w_transform_for_d, term_to_add_last] + full_lists.append(row_to_add) + if self.w_transform_for_d == "square": # square the edge weights for the degree calculations + row_to_add = ["w_transform_for_d: square", "string", + "for each edge, we square the edge weight values to derive our degrees for matrix D", self.w_transform_for_d, term_to_add_last] + full_lists.append(row_to_add) + else: # default if we use a threshold for the degree: + row_to_add = ["edge_vals_for_d", "boolean", + "if False, we use a threshold instead to derive our degrees for matrix D", False, term_to_add_last] + full_lists.append(row_to_add) + self.threshold_for_degree = 0.5 # edge weights > this threshold are counted as 1 for the degree + to_add_text = "edge weights > " + str(self.threshold_for_degree) + " are counted as 1 for the degree" + row_to_add = ["thresh_for_d", ">= 0", + to_add_text, self.threshold_for_degree, term_to_add_last] + full_lists.append(row_to_add) + return full_lists + + +def build_prior_network(edge_list, gene_expression_nodes = [], default_edge_weight = 0.1, + degree_threshold = 0.5, + degree_pseudocount = 1e-3, + view_network = True, + verbose = True): + edge_vals_for_d = False + self_loops = False + w_transform_for_d = "none" + prior_graph_dict = {"edge_list": edge_list, + "gene_expression_nodes":gene_expression_nodes, + "edge_values_for_degree": edge_vals_for_d, + "consider_self_loops":self_loops, + "pseudocount_for_degree":degree_pseudocount, + "default_edge_weight": default_edge_weight, + "w_transform_for_d":w_transform_for_d, + "threshold_for_degree": degree_threshold, + "view_network": view_network, + "verbose":verbose} + if verbose: + print("building prior network:") + print("prior graph network used") + netty = PriorGraphNetwork(**prior_graph_dict) # uses the network to get features like the A matrix. 
#################### + return netty + + +def directed_node2vec_similarity(edge_list: List[Tuple[int, int, float]], + dimensions: int = 64, + walk_length: int = 30, + num_walks: int = 200, + p: float = 1, q: float = 0.5, + workers: int = 4, window: int = 10, + min_count: int = 1, + batch_words: int = 4) -> np.ndarray: + print("directed_node2vec_similarity") + """ Given an edge list and node2vec parameters, returns a scaled similarity matrix for the node embeddings generated + by training a node2vec model on the directed graph defined by the edge list. + + Parameters: + ----------- + edge_list: List[List[int, int, float]] + A list of lists representing the edges of a directed graph. Each edge should be a list of three values: + [source_node, target_node, edge_weight]. If no edge weight is specified, it is assumed to be 1.0. + + dimensions: int, optional (default=64) + The dimensionality of the node embeddings. + + walk_length: int, optional (default=30) + The length of each random walk during the node2vec training process. + + num_walks: int, optional (default=200) + The number of random walks to generate for each node during the node2vec training process. + + p: float, optional (default=1) + The return parameter for the node2vec algorithm. + + q: float, optional (default=0.5) + The in-out parameter for the node2vec algorithm. + + workers: int, optional (default=4) + The number of worker threads to use during the node2vec training process. + + window: int, optional (default=10) + The size of the window for the skip-gram model during training. + + min_count: int, optional (default=1) + The minimum count for a word in the training data to be included in the model. + + batch_words: int, optional (default=4) + The number of words in each batch during training. + + Returns: + -------- + scaled_similarity_matrix: np.ndarray + A scaled (0-1 range) cosine similarity matrix for the node embeddings generated by training a node2vec model + on the directed graph defined by the edge list. 
+ """ + print("Creating directed graph from edge list") + directed_graph = nx.DiGraph() + for edge in edge_list: + source, target = edge[:2] + weight = edge[2] if len(edge) == 3 else 1.0 + directed_graph.add_edge(source, target, weight=weight) + + # Extract unique node names from the graph + node_names = list(directed_graph.nodes) + + print("Initializing the Node2Vec model") + model = Node2Vec(directed_graph, dimensions=dimensions, walk_length=walk_length, + num_walks=num_walks, p=p, q=q, workers=workers) + + print("Training the model") + model = model.fit(window=window, min_count=min_count, batch_words=batch_words) + + print("Getting node embeddings") + node_embeddings = np.array([model.wv[node] for node in node_names]) + + print("Calculating cosine similarity matrix") + similarity_matrix = cosine_similarity(node_embeddings) + + print("Scaling similarity matrix to 0-1 range") + scaled_similarity_matrix = (similarity_matrix + 1) / 2 + + # Create a DataFrame with rows and columns labeled as node names + similarity_matrix = pd.DataFrame(scaled_similarity_matrix, index=node_names, columns=node_names) + print(f":) First 5 entries of the symmetric similarity matrix for {similarity_matrix.shape[0]} nodes.") + print(similarity_matrix.iloc[0:5, 0:5]) + + similarity_df = similarity_matrix.reset_index().melt(id_vars='index', var_name='TF2', value_name='cosine_similarity') + #similarity_df = similarity_df[similarity_df['index'] < similarity_df['TF2']] + similarity_df = similarity_df.rename(columns = {"index":"node_1", "TF2":"node_2"}) + similarity_df = similarity_df[similarity_df["node_1"] != similarity_df["node_2"]] + results_dict = {} + print("\n :) ######################################################## \n") + print(":) Please note that we return a dictionary with 3 keys based on Node2Vec and cosine similarity computations:") + print("1. similarity_matrix: the cosine similarity matrix for the nodes in the original directed graph") + results_dict["similarity_matrix"] = similarity_matrix + print("2. similarity_df: simplified dataframe of the cosine similarity values from the similarity_matrix.") + + results_dict["similarity_df"] = similarity_df + print("3. NetREm_edgelist: an edge_list that is based on similarity_df that is ready to be input for NetREm.") + + results_dict["NetREm_edgelist"] = similarity_df.values.tolist() + print(results_dict.keys()) + return results_dict \ No newline at end of file diff --git a/code/previous_version/error_metrics.py b/code/previous_version/error_metrics.py new file mode 100644 index 0000000..d4ca876 --- /dev/null +++ b/code/previous_version/error_metrics.py @@ -0,0 +1,474 @@ +# Error_Metrics.py :) +import pandas as pd +import numpy as np +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. 
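+# Summary of the metric conventions implemented below (REF = ground truth, X = prediction):
+#   mse(REF, X)  = mean((X - REF)**2)
+#   nmse(REF, X) = mean((X - REF)**2) / mean(REF**2)
+#   snr(REF, X)  = 10 * log10(mean(REF**2) / mean((X - REF)**2))   # in dB; higher is better
+#   psnr(REF, X) = 10 * log10(max(REF)**2 / mean((X - REF)**2))    # in dB; higher is better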
+import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +# from skopt import gp_minimize, space +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +import matplotlib.pyplot as plt +from numpy.typing import ArrayLike +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + +def calculate_mean_square_error(actual_values, predicted_values): + # Please note that this function by Saniya calculates the Mean Square Error (MSE) + difference = (actual_values - predicted_values) + squared_diff = difference ** 2 # square of the difference + mean_squared_diff = np.mean(squared_diff) + return mean_squared_diff + + +def mse(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None) -> np.float: + """Compute mean square error between array with a reference array - + If REF or X is complex, compute mse(REF.real, X.real) + 1j * mse(REF.imag, X.imag) + + Parameters + ---------- + REF: + ground truth, or reference array, e.g. shape=(n_sample, n_target) for machine learning + X: + result array to compare with reference, e.g. shape=(n_sample, n_target) for machine learning + axis: + Axis along which the comparison is computed. Default to None to compute the comparison + of the flattened array. + + Returns + ------- + mse_: + normalized mean square error + + Examples + ------- + mse(REF, X, axis=0) compute the comparision along n_sample dimension for machine learning + regression application where shape=(n_sample, n_target) + """ + + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + return ((X - REF)**2).mean(axis=axis) + else: + return mse(REF.real, X.real, axis) + 1j * mse(REF.imag, X.imag, axis) + + +def nmse(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None) -> np.float: + """Compute normalized mean square error between array with a reference array - + If REF or X is complex, compute nmse(REF.real, X.real) + 1j * nmse(REF.imag, X.imag) + + Parameters + ---------- + REF: + ground truth, or reference array, e.g. shape=(n_sample, n_target) for machine learning + X: + result array to compare with reference, e.g. shape=(n_sample, n_target) for machine learning + axis: + Axis along which the comparison is computed. Default to None to compute the comparison + of the flattened array. + + Returns + ------- + nmse_: + normalized mean square error + + Examples + ------- + nmse(REF, X, axis=0) compute the comparision along n_sample dimension for machine learning + regression application where shape=(n_sample, n_target) + """ + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + return ((X - REF)**2).mean(axis=axis) / (REF**2).mean(axis=axis) + else: + return nmse(REF.real, X.real, axis) + 1j * nmse(REF.imag, X.imag, axis) + + +def snr(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None) -> np.float64: + """Compare an array with a reference array - compute signal to noise ration in dB. 
+ If REF or X is complex, compute snr(REF.real, X.real) + 1j * snr(REF.imag, X.imag) + + Parameters + ---------- + REF: + ground truth, or reference array, e.g. shape=(n_sample, n_target) for machine learning + X: + result array to compare with reference, e.g. shape=(n_sample, n_target) for machine learning + axis: + Axis along which the comparison is computed. The default is to compute the comparison + of the flattened array. + + Returns + ------- + snr_: + signal to noise ration in dB + + Examples + ------- + snr(REF, X, axis=0) compute the comparision along n_sample dimension for machine learning + regression application where shape=(n_sample, n_target) + """ + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + return 10 * np.log10((REF**2).mean(axis=axis) / ((X - REF)**2).mean(axis=axis)) + else: + return snr(REF.real, X.real, axis) + 1j * snr(REF.imag, X.imag, axis) + + +def psnr(REF: np.ndarray, X: np.ndarray, axis: Optional[int] = None, max_: Optional[np.float64] = None) -> np.float64: + """See snr, TODO: copy and modify docstring from snr + """ + if (not np.iscomplexobj(REF)) and (not np.iscomplexobj(X)): + if max_ is None: + max_ = REF.max() + return 10 * np.log10(max_**2 / ((X - REF)**2).mean(axis=axis)) # change from REF.max() to 255 + else: + return psnr(REF.real, X.real, axis, max_) + 1j * psnr(REF.imag, X.imag, axis, max_) + + +def nmse_custom_score(y_true, y_pred): + """ + Calculates the negative normalized mean squared error (MSE) between the true and predicted values. + """ + import numpy as np + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + nmseVal = nmse(y_true, y_pred) + return -nmseVal + + +def mse_custom_score(y_true, y_pred): + """ + Calculates the negative normalized mean squared error (MSE) between the true and predicted values. 
+ default: greater_is_better, so we set negative mseVal to find the smallest mse + """ + import numpy as np + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + mseVal = mse(y_true, y_pred) + return -mseVal + + +def snr_custom_score(y_true, y_pred): + """ + Higher the SNR the better + """ + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + snrVal = snr(y_true, y_pred) + return snrVal + + +def psnr_custom_score(y_true, y_pred): + """ + Higher the psnr, the better + """ + if isinstance(y_true, pd.DataFrame): + y_true = y_true.values.flatten() + if isinstance(y_pred, pd.DataFrame): + y_pred = y_pred.values.flatten() + if not any(y_pred): # if all predicted coefficients are 0 + return -np.inf # return a high negative score + psnrVal = psnr(y_true, y_pred) + return psnrVal + +# Create a custom scorer object using make_scorer +mse_custom_scorer = make_scorer(mse_custom_score) +nmse_custom_scorer = make_scorer(nmse_custom_score) +snr_custom_scorer = make_scorer(snr_custom_score) +psnr_custom_scorer = make_scorer(psnr_custom_score) + + +def generate_model_metrics_for_baselines_df(X_train, y_train, X_test, y_test, model_name = "ElasticNetCV", y_intercept = False, tf_name = "SOX10"): + from sklearn.linear_model import ElasticNetCV, LinearRegression, LassoCV, RidgeCV + print(f"{model_name} results :) for fitting y_intercept = {y_intercept}") + if model_name == "ElasticNetCV": + regr = ElasticNetCV(cv=5, random_state=0, fit_intercept = y_intercept) + elif model_name == "LinearRegression": + regr = LinearRegression(fit_intercept = y_intercept) + elif model_name == "LassoCV": + regr = LassoCV(cv=5, fit_intercept = y_intercept) + elif model_name == "RidgeCV": + regr = RidgeCV(cv=5, fit_intercept = y_intercept) + regr.fit(X_train, y_train) + if model_name in ["RidgeCV", "LinearRegression"]: + model_df = pd.DataFrame(regr.coef_) + else: + model_df = pd.DataFrame(regr.coef_).transpose() + model_df.columns = X_train.columns.tolist() + selected_row = model_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + model_df = model_df[selected_cols] + df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') + sorted_series = df.abs().squeeze().sort_values(ascending=False) + # convert the sorted series back to a DataFrame + sorted_df = pd.DataFrame(sorted_series) + # add a column for the rank + sorted_df['Rank'] = range(1, len(sorted_df) + 1) + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + tfs = sorted_df["TF"].tolist() + if tf_name not in tfs: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + sorted_df["Info"] = model_name + if y_intercept: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = model_df.shape[1] + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + train_mse = mse(y_train.values.flatten(), predY_train) + test_mse = mse(y_test.values.flatten(), predY_test) + train_nmse = nmse(y_train.values.flatten(), predY_train) + test_nmse = 
nmse(y_test.values.flatten(), predY_test) + sorted_df["train_mse"] = train_mse + sorted_df["test_mse"] = test_mse + sorted_df["train_nmse"] = train_nmse + sorted_df["test_nmse"] = test_nmse + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + sorted_df["train_nmse"] = nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + return sorted_df + + +def generate_model_metrics_for_netrem_model_object(netrem_model, y_intercept_fit, X_train, y_train, X_test, y_test, filtered_results = False, tf_name = "SOX10", focus_gene = "y"): + if netrem_model.model_nonzero_coef_df.shape[1] == 1: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + if netrem_model.model_type == "LassoCV": + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)" #+ netrem_info# + str(netrem_model.optimal_alpha) + ")" + else: + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; a = " + netrem_model.alpha_lasso + ")"# : " + netrem_info# + str(netrem_model.optimal_alpha) + ")" + + if y_intercept_fit: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = 0 + else: + sorted_df = netrem_model.sorted_coef_df[netrem_model.sorted_coef_df["TF"] == tf_name] + tfs = sorted_df["TF"].tolist() + tf_netrem_found = True + if tf_name not in tfs: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)"# + str(netrem_model.optimal_alpha) + ")" + sorted_df["num_TFs"] = netrem_model.model_nonzero_coef_df.drop(columns = ["y_intercept"]).shape[1] + predY_train = netrem_model.predict(X_train) + predY_test = netrem_model.predict(X_test) + sorted_df["train_mse"] = mse(y_train.values.flatten(), predY_train) + sorted_df["test_mse"] = mse(y_test.values.flatten(), predY_test) + sorted_df["train_nmse"] = nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + sorted_df_netrem = sorted_df + netrem_dict = {"sorted_df_netrem":sorted_df_netrem, "tf_netrem_found":tf_netrem_found} + return netrem_dict + + +def metrics_for_netrem_models_versus_other_models(netrem_with_intercept, netrem_no_intercept, X_train, y_train, X_test, y_test, filtered_results = False, tf_name = "SOX10", target_gene = "y"): + """ :) This is similar to function metrics_for_netrem_versus_other_models() except it focuses on 2 types of NetREm models: + 1. with y-intercept fitted + 2. with no y-intercept fitted + :) Please note: + MSE (Mean Squared Error) and NMSE (Normalized Mean Squared Error) are both measures of the average difference between the predicted and actual values, where lower values indicate better performance. 
+ + PSNR (Peak Signal-to-Noise Ratio) and SNR (Signal-to-Noise Ratio) are both measures of the ratio between the maximum possible signal power and the power of the noise, where higher values indicate better performance. + + However, the specific metrics that are most relevant to a particular machine learning problem can vary depending on the application and the specific goals of the model. So, it's important to consider the context and objectives of each project when selecting evaluation metrics. + """ + focus_gene = target_gene + netrem_intercept_bool = True + netrem_no_intercept_bool = True + if netrem_with_intercept is None: + netrem_intercept_bool = False + tf_netrem_found_with_intercept = False + if netrem_no_intercept is None: + netrem_no_intercept_bool = False + tf_netrem_found_no_intercept = False + + if netrem_with_intercept: + netrem_with_intercept_sorted_dict = generate_model_metrics_for_netrem_model_object(netrem_with_intercept, True, X_train, y_train, X_test, y_test, filtered_results, tf_name, focus_gene) + netrem_with_intercept_sorted_df = netrem_with_intercept_sorted_dict["sorted_df_netrem"] + netrem_with_intercept_sorted_df["y_intercept"] = "True :)" + tf_netrem_found_with_intercept = netrem_with_intercept_sorted_dict["tf_netrem_found"] + + if netrem_no_intercept_bool: + netrem_no_intercept_sorted_dict = generate_model_metrics_for_netrem_model_object(netrem_no_intercept, False, X_train, y_train, X_test, y_test, filtered_results, tf_name, focus_gene) + netrem_no_intercept_sorted_df = netrem_no_intercept_sorted_dict["sorted_df_netrem"] + netrem_no_intercept_sorted_df["y_intercept"] = "False :(" + tf_netrem_found_no_intercept = netrem_no_intercept_sorted_dict["tf_netrem_found"] + + + sorted_df_elasticcv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = False, tf_name = tf_name) + sorted_df_lassocv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = False, tf_name = tf_name) + sorted_df_ridgecv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = False, tf_name = tf_name) + sorted_df_linear = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = False, tf_name = tf_name) + sorted_df_elasticcv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = True, tf_name = tf_name) + sorted_df_lassocv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = True, tf_name = tf_name) + sorted_df_ridgecv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = True, tf_name = tf_name) + sorted_df_linear2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = True, tf_name = tf_name) + + if netrem_no_intercept_bool: + sorty_combo = pd.concat([netrem_no_intercept_sorted_df, sorted_df_elasticcv, sorted_df_ridgecv, sorted_df_lassocv, sorted_df_linear]) + else: + sorty_combo = 
pd.concat([sorted_df_elasticcv, sorted_df_ridgecv, sorted_df_lassocv, sorted_df_linear]) + if netrem_intercept_bool: + sorty_combo = pd.concat([sorty_combo, netrem_with_intercept_sorted_df, sorted_df_elasticcv2, sorted_df_ridgecv2, sorted_df_lassocv2, sorted_df_linear2]) + else: + sorty_combo = pd.concat([sorty_combo, sorted_df_elasticcv2, sorted_df_ridgecv2, sorted_df_lassocv2, sorted_df_linear2]) + sorty_combo = sorty_combo[sorty_combo["TF"] == tf_name] + sorty_combo["TG"] = focus_gene + sorty_combo = sorty_combo.reset_index().drop(columns = ["index"]) + if 'AbsoluteVal_coefficient' not in sorty_combo.columns.tolist(): + sorty_combo['AbsoluteVal_coefficient'] = pd.Series([float('nan')]*len(sorty_combo)) + + sorty_combo = sorty_combo[['AbsoluteVal_coefficient', 'Rank', 'TF', 'Info', 'y_intercept', 'num_TFs', 'TG', 'train_mse', + 'test_mse', 'train_nmse', 'test_nmse', 'train_snr', 'test_snr', + 'train_psnr', 'test_psnr']] + aaa = sorty_combo + aaa['rank_mse_train'] = aaa['train_mse'].rank(ascending=True).astype(int) + aaa['rank_mse_test'] = aaa['test_mse'].rank(ascending=True).astype(int) + aaa['rank_nmse_train'] = aaa['train_nmse'].rank(ascending=True).astype(int) + aaa['rank_nmse_test'] = aaa['test_nmse'].rank(ascending=True).astype(int) + + aaa['rank_snr_train'] = aaa['train_snr'].rank(ascending=False).astype(int) + aaa['rank_snr_test'] = aaa['test_snr'].rank(ascending=False).astype(int) + aaa['rank_psnr_train'] = aaa['train_psnr'].rank(ascending=False).astype(int) + aaa['rank_psnr_test'] = aaa['test_psnr'].rank(ascending=False).astype(int) + aaa["total_metrics_rank"] = aaa['rank_mse_train'] + aaa['rank_mse_test'] + aaa['rank_nmse_train'] + aaa['rank_nmse_test'] + aaa["total_metrics_rank"] += aaa['rank_snr_train'] + aaa['rank_snr_test'] + aaa['rank_psnr_train'] + aaa['rank_psnr_test'] + sorty_combo = aaa + + reduced_results_df = sorty_combo[sorty_combo["Rank"] != "N/A"] + reduced_results_df = reduced_results_df.sort_values(by = ["Rank"]) + + + if tf_netrem_found_with_intercept: + print(netrem_with_intercept.final_corr_vs_coef_df[["info"] + [tf_name]]) + elif tf_netrem_found_no_intercept: + print(netrem_no_intercept.final_corr_vs_coef_df[["info"] + [tf_name]]) + if filtered_results: + return reduced_results_df + else: + return sorty_combo + + +def metrics_for_netrem_versus_other_models(netrem_model, X_train, y_train, X_test, y_test, filtered_results = False, tf_name = "SOX10", target_gene = "y"): + """ :) Please note: + MSE (Mean Squared Error) and NMSE (Normalized Mean Squared Error) are both measures of the average difference between the predicted and actual values, where lower values indicate better performance. + + PSNR (Peak Signal-to-Noise Ratio) and SNR (Signal-to-Noise Ratio) are both measures of the ratio between the maximum possible signal power and the power of the noise, where higher values indicate better performance. + + However, the specific metrics that are most relevant to a particular machine learning problem can vary depending on the application and the specific goals of the model. So, it's important to consider the context and objectives of each project when selecting evaluation metrics. 
+ """ + focus_gene = target_gene + if netrem_model.model_nonzero_coef_df.shape[1] == 1: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)"# + str(netrem_model.optimal_alpha) + ")" + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = 0 + else: + sorted_df = netrem_model.sorted_coef_df[netrem_model.sorted_coef_df["TF"] == tf_name] + tfs = sorted_df["TF"].tolist() + tf_netrem_found = True + if tf_name not in tfs: + sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + sorted_df.columns = ["Rank", "TF"] + tf_netrem_found = False + sorted_df["Info"] = "NetREm (b = " + str(netrem_model.beta_network) + "; LassoCV)"# + str(netrem_model.optimal_alpha) + ")" + sorted_df["y_intercept"] = "False :(" + sorted_df["num_TFs"] = netrem_model.model_nonzero_coef_df.drop(columns = ["y_intercept"]).shape[1] + predY_train = netrem_model.predict(X_train) + predY_test = netrem_model.predict(X_test) + sorted_df["train_mse"] = mse(y_train.values.flatten(), predY_train) + sorted_df["test_mse"] = mse(y_test.values.flatten(), predY_test) + sorted_df["train_nmse"] = nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = psnr(y_test.values.flatten(), predY_test) + sorted_df_netrem = sorted_df + + sorted_df_elasticcv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = False, tf_name = tf_name) + sorted_df_lassocv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = False, tf_name = tf_name) + sorted_df_ridgecv = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = False, tf_name = tf_name) + sorted_df_linear = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = False, tf_name = tf_name) + sorted_df_elasticcv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "ElasticNetCV", y_intercept = True, tf_name = tf_name) + sorted_df_lassocv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LassoCV", y_intercept = True, tf_name = tf_name) + sorted_df_ridgecv2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "RidgeCV", y_intercept = True, tf_name = tf_name) + sorted_df_linear2 = generate_model_metrics_for_baselines_df(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, model_name = "LinearRegression", y_intercept = True, tf_name = tf_name) + + sorty_combo = pd.concat([sorted_df_netrem, sorted_df_elasticcv, sorted_df_ridgecv, sorted_df_lassocv, sorted_df_linear]) + sorty_combo = pd.concat([sorty_combo, sorted_df_elasticcv2, sorted_df_ridgecv2, sorted_df_lassocv2, sorted_df_linear2]) + sorty_combo = sorty_combo[sorty_combo["TF"] 
== tf_name] + sorty_combo["TG"] = focus_gene + sorty_combo = sorty_combo.reset_index().drop(columns = ["index"]) + if 'AbsoluteVal_coefficient' not in sorty_combo.columns.tolist(): + sorty_combo['AbsoluteVal_coefficient'] = pd.Series([float('nan')]*len(sorty_combo)) + + sorty_combo = sorty_combo[['AbsoluteVal_coefficient', 'Rank', 'TF', 'Info', 'y_intercept', 'num_TFs', 'TG', 'train_mse', + 'test_mse', 'train_nmse', 'test_nmse', 'train_snr', 'test_snr', + 'train_psnr', 'test_psnr']] + + aaa = sorty_combo + aaa['rank_mse_train'] = aaa['train_mse'].rank(ascending=True).astype(int) + aaa['rank_mse_test'] = aaa['test_mse'].rank(ascending=True).astype(int) + aaa['rank_nmse_train'] = aaa['train_nmse'].rank(ascending=True).astype(int) + aaa['rank_nmse_test'] = aaa['test_nmse'].rank(ascending=True).astype(int) + + aaa['rank_snr_train'] = aaa['train_snr'].rank(ascending=False).astype(int) + aaa['rank_snr_test'] = aaa['test_snr'].rank(ascending=False).astype(int) + aaa['rank_psnr_train'] = aaa['train_psnr'].rank(ascending=False).astype(int) + aaa['rank_psnr_test'] = aaa['test_psnr'].rank(ascending=False).astype(int) + aaa["total_metrics_rank"] = aaa['rank_mse_train'] + aaa['rank_mse_test'] + aaa['rank_nmse_train'] + aaa['rank_nmse_test'] + aaa["total_metrics_rank"] += aaa['rank_snr_train'] + aaa['rank_snr_test'] + aaa['rank_psnr_train'] + aaa['rank_psnr_test'] + sorty_combo = aaa + + reduced_results_df = sorty_combo[sorty_combo["Rank"] != "N/A"] + reduced_results_df = reduced_results_df.sort_values(by = ["Rank"]) + if tf_netrem_found: + print(netrem_model.final_corr_vs_coef_df[["info"] + [tf_name]]) + if filtered_results: + return reduced_results_df + else: + return sorty_combo \ No newline at end of file diff --git a/code/previous_version/essential_functions.py b/code/previous_version/essential_functions.py new file mode 100644 index 0000000..ebe7587 --- /dev/null +++ b/code/previous_version/essential_functions.py @@ -0,0 +1,123 @@ +# Essential_functions.py: :) +import pandas as pd +import numpy as np +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. 
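+# Quick usage sketch (hypothetical inputs) for the small helpers defined below:
+#   intersection(["TF1", "TF2"], ["TF2", "TF3"])            -> ["TF2"]
+#   check_symmetric(np.array([[1.0, 2.0], [2.0, 1.0]]))     -> True
+#   normalize_data_zero_to_one(np.array([2.0, 4.0, 6.0]))   -> array([0. , 0.5, 1. ])
+#   DiagonalLinearOperator(d) acts like np.diag(d) without materializing the full matrix:
+#   D @ A scales the rows of A by d, and A @ D scales the columns (see the class docstring).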
+import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +# from skopt import gp_minimize, space +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +import matplotlib.pyplot as plt +from numpy.typing import ArrayLike +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + +# Python program to illustrate the intersection +# of two lists in most simple way +def intersection(lst1, lst2): + lst3 = [value for value in lst1 if value in lst2] + return lst3 + + +def view_matrix_as_dataframe(matrix, column_names_list = [], row_names_list = []): + # :) Please note this function by Saniya returns a dataframe representation of the numpy matrix + # optional are the names of the columns and names of the rows (indices) + matDF = pd.DataFrame(matrix) + if len(column_names_list) == matDF.shape[1]: + matDF.columns = column_names_list + if len(row_names_list) == matDF.shape[0]: + matDF.index = row_names_list + return matDF + + +def check_symmetric(a, rtol=1e-05, atol=1e-08): + # https://stackoverflow.com/questions/42908334/checking-if-a-matrix-is-symmetric-in-numpy + # Please note that this function checks if a matrix is symmetric in Python + # for square matrices (same # of rows and columns), there is a possiblity they may be symmetric + # returns True if the matrix is symmetric (matrix = matrix_tranpose) + # returns False if the matrix is NOT symmetric + return np.allclose(a, a.T, rtol=rtol, atol=atol) + + +class DiagonalLinearOperator(LinearOperator): + """Construct a diagonal matrix as a linear operator instead a full numerical matirx np.diag(d). + This saves memory and computation time which is especially useful when d is huge. + D.T = D + For 2d matrix A: + D @ A = d[:, np.newwaxis]* A # scales rows of A + A @ D = A * d[np.newaxis, :] # scales cols of A + For 1d vector v: + D @ v = d * v + v @ D = v * d + NOTE: Coding just for fun: using a numerical matrix or a sparse matrix maybe just fine for network regularization. + By Xiang Huang + """ + def __init__(self, d): + """d is a 1d vector of dimension N""" + N = len(d) + self.d = d + super().__init__(dtype=None, shape=(N, N)) + + def _transpose(self): + return self + + def _matvec(self, v): + return self.d * v + + def _matmat(self, A): + return self.d[:, np.newaxis] * A + + def __rmatmul__(self, x): + """Implmentation of A @ D, and x @ D + We could implment __matmul__ in a similar way without inheriting LinearOperator + Because we inherit from LinearOperator, we can implment _matvec, and _matmat instead. 
+ """ + if x.ndim == 2: + return x * self.d[np.newaxis, :] + elif x.ndim == 1: + return x * self.d + else: + raise ValueError(f'Array should be 1d or 2d, but it is {x.ndim}d') + # Generally A @ D will call A.__matmul__(D) which raises a ValueError and not a NotImplemented + # We need to set __array_priority__ to high value higher than 0 (np.array) and 10.1 (scipy.sparse.csr_matrix) + # https://github.com/numpy/numpy/issues/8155 + # https://stackoverflow.com/questions/40252765/overriding-other-rmul-with-your-classs-mul + __array_priority__ = 1000 + + +def normalize_data_zero_to_one(data): + # https://stackoverflow.com/questions/18380419/normalization-to-bring-in-the-range-of-0-1 + return (data - np.min(data)) / (np.max(data) - np.min(data)) + + +def draw_arrow(direction = "down", color = "blue"): + x = [0.5, 0.5] + if direction == "down": + # Define the coordinates for the arrow + y = [0.9, 0.1] + else: # up-arrow + y = [0.1, 0.9] + fig, ax = plt.subplots(figsize=(2,2)) + # Plot the arrow using Matplotlib + plt.arrow(x[0], y[0], x[1]-x[0], y[1]-y[0], head_width=0.05, head_length=0.1, fc=color, ec=color) + # Set the x and y limits to adjust the plot size + plt.xlim(0, 1) + plt.ylim(0, 1) + plt.axis('off') # Hide the axis labels + plt.show() # Show the plot \ No newline at end of file diff --git a/code/previous_version/netrem_evaluation_functions.py b/code/previous_version/netrem_evaluation_functions.py new file mode 100644 index 0000000..b99b41d --- /dev/null +++ b/code/previous_version/netrem_evaluation_functions.py @@ -0,0 +1,594 @@ +import pandas as pd +import numpy as np +import matplotlib.pyplot as plt +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. 
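+# Hypothetical usage sketch for the Bayesian hyperparameter tuner defined below
+# (argument values shown are simply the function's defaults, not recommendations):
+#   results = optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train,
+#                                                           beta_net_min=0.5, beta_net_max=1000,
+#                                                           alpha_lasso_min=0.0001, alpha_lasso_max=0.1,
+#                                                           num_grid_values=100, cv_folds=5, scorer="mse")
+#   tuned_model = results["optimal_model"]   # NetREmModel refit with the selected alpha_lasso / beta_network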
+import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 +# from packages_needed import * +from error_metrics import * +from DemoDataBuilderXandY import * +from PriorGraphNetwork import * +from Netrem_model_builder import * +from sklearn.linear_model import ElasticNetCV, LinearRegression, LassoCV, RidgeCV +from skopt import gp_minimize, space +from skopt.utils import use_named_args + +class BayesianObjective_Lasso: + def __init__(self, X, y, cv_folds, model, scorer="mse", print_network=False): + self.X = X + self.y = y + self.cv_folds = cv_folds + model.view_network = print_network + self.model = model + self.scorer_obj = 'neg_mean_squared_error' # the default + if scorer == "mse": + self.scorer_obj = em.mse_custom_scorer + elif scorer == "nmse": + self.scorer_obj = em.nmse_custom_scorer + elif scorer == "snr": + self.scorer_obj = em.snr_custom_scorer + elif scorer == "psnr": + self.scorer_obj = em.psnr_custom_scorer + + def __call__(self, params): + try: + alpha_lasso, beta_network = params + # print(f"Testing with alpha_lasso = {alpha_lasso}, beta_network = {beta_network}") + + netrem_model = self.model + #print(netrem_model.get_params()) + netrem_model.alpha_lasso = alpha_lasso + netrem_model.beta_network = beta_network + + cv_scores = cross_val_score(netrem_model, self.X, self.y, cv=self.cv_folds, scoring=self.scorer_obj) + + # Check for infinite values + if np.any(np.isinf(cv_scores)): + # print("Cross-validation scores contain infinite values.") + #return np.inf + return 1e100 # Replace infinite score with large finite value + + # Debugging: Print the individual cross-validation scores + # print(f"Individual cross-validation scores: {cv_scores}") + + score = -cv_scores.mean() + # print(f"Score with alpha_lasso = {alpha_lasso}, beta_network = {beta_network} is {score}") + + #if np.isinf(score): + #print("Score is infinite!") + + return score + + except Exception as e: + #print(f"An exception occurred: {e}") + #return np.inf # Return a high "bad" value to indicate failure + return 1e100 # Replace infinite score with large finite value + + +# Define a callback function to update the progress bar +def progress_bar_callback(res): + progress_bar.update(1) + +def optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train, + beta_net_min = 0.5, + beta_net_max = 1000, + alpha_lasso_min = 0.0001, + alpha_lasso_max = 0.1, + num_grid_values = 100, + cv_folds = 5, + scorer = "mse", + verbose = False): + + print(":) Please note that we are running: optimal_netrem_model_via_bayesian_param_tuner") + if verbose: + print(f":) Please note we are running Bayesian optimization (via skopt Python package) for parameter hunting for 
beta_network and alpha_lasso with model evaluation scorer: {scorer} :)") + print("we use gp_minimize here for hyperparameter tuning") + print(f":) Please note this is a start-to-finish optimizer for NetREm (Network regression embeddings reveal cell-type protein-protein interactions for gene regulation)") + + + model_type = netrem_model.model_type + if model_type == "LassoCV": + print("please note that we can only do this for Lasso model not for LassoCV :(") + print("Thus, we will alter the model_type to make it Lasso") + netrem_model.model_type = "Lasso" + + param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), + space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] + objective = BayesianObjective_Lasso(X_train, y_train, cv_folds = cv_folds, model = netrem_model, scorer = scorer) + + + # Perform Bayesian optimization + result = gp_minimize(objective, param_space, n_calls=num_grid_values, random_state=123) + + results_dict = {} + optimal_model = netrem_model + if verbose: + print(":) ######################################################################\n") + print(f":) Please note the optimal model based on Bayesian optimization found: ") + + bayesian_alpha = result.x[0] + bayesian_beta = result.x[1] + optimal_model.alpha_lasso = bayesian_alpha + optimal_model.beta_network = bayesian_beta + results_dict["bayesian_alpha"] = bayesian_alpha + print(f"alpha_lasso = {bayesian_alpha} ; beta_network = {bayesian_beta}") + if verbose: + print(":) ######################################################################\n") + print("Fitting the model using these optimal hyperparameters for beta_net and alpha_lasso...") + dict_ex = optimal_model.get_params() + optimal_model = nm.NetREmModel(**dict_ex) + optimal_model.fit(X_train, y_train) + print(optimal_model.get_params()) + results_dict["optimal_model"] = optimal_model + results_dict["bayesian_beta"] = bayesian_beta + results_dict["bayesian_alpha"] = bayesian_alpha + results_dict["result"] = result + return results_dict + +# class BayesianObjective_Lasso: +# def __init__(self, X, y, cv_folds, model, scorer = "mse", print_network = False): +# self.X = X +# self.y = y +# self.cv_folds = cv_folds +# model.view_network = print_network +# self.model = model +# self.scorer_obj = 'neg_mean_squared_error' # the default +# if scorer == "mse": +# self.scorer_obj = mse_custom_scorer +# elif scorer == "nmse": +# self.scorer_obj = nmse_custom_scorer +# elif scorer == "snr": +# self.scorer_obj = snr_custom_scorer +# elif scorer == "psnr": +# self.scorer_obj = psnr_custom_scorer + + +# def __call__(self, params): + +# alpha_lasso, beta_network = params +# #network = PriorGraphNetwork(edge_list = edge_list) +# netrem_model = self.model +# #print(netrem_model.get_params()) +# netrem_model.alpha_lasso = alpha_lasso +# netrem_model.beta_network = beta_network +# #netrem_model.view_network = self.view_network +# score = -cross_val_score(netrem_model, self.X, self.y, cv=self.cv_folds, scoring=self.scorer_obj).mean() +# return score + + +# def optimal_netrem_model_via_bayesian_param_tuner(netrem_model, X_train, y_train, +# beta_net_min = 0.001, +# beta_net_max = 10, +# alpha_lasso_min = 0.0001, +# alpha_lasso_max = 0.1, +# num_grid_values = 100, +# gridSearchCV_folds = 5, +# scorer = "mse", +# verbose = False): +# if verbose: +# print(f":) Please note we are running Bayesian optimization (via skopt Python package) for parameter hunting for beta_network and alpha_lasso with model evaluation scorer: 
{scorer} :)") +# print("we use gp_minimize here for hyperparameter tuning") +# print(f":) Please note this is a start-to-finish optimizer for NetREm (Network regression embeddings reveal cell-type protein-protein interactions for gene regulation)") +# from skopt import gp_minimize, space +# model_type = netrem_model.model_type +# # param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), +# # space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] + +# if model_type == "LassoCV": +# print("please note that we can only do this for Lasso model not for LassoCV :(") +# print("Thus, we will alter the model_type to make it Lasso") +# netrem_model.model_type = "Lasso" + +# param_space = [space.Real(alpha_lasso_min, alpha_lasso_max, name='alpha_lasso', prior='log-uniform'), +# space.Real(beta_net_min, beta_net_max, name='beta_network', prior='log-uniform')] +# objective = BayesianObjective_Lasso(X_train, y_train, cv_folds = gridSearchCV_folds, model = netrem_model, scorer = scorer) + +# # Perform Bayesian optimization +# result = gp_minimize(objective, param_space, n_calls=num_grid_values, random_state=123) +# results_dict = {} +# optimal_model = netrem_model +# if verbose: +# print(":) ######################################################################\n") +# print(f":) Please note the optimal model based on Bayesian optimization found: ") + +# bayesian_alpha = result.x[0] +# bayesian_beta = result.x[1] +# optimal_model.alpha_lasso = bayesian_alpha +# optimal_model.beta_network = bayesian_beta +# results_dict["bayesian_alpha"] = bayesian_alpha +# print(f"alpha_lasso = {bayesian_alpha} ; beta_network = {bayesian_beta}") +# if verbose: +# print(":) ######################################################################\n") +# print("Fitting the model using these optimal hyperparameters for beta_net and alpha_lasso...") +# dict_ex = optimal_model.get_params() +# optimal_model = NetREmModel(**dict_ex) +# optimal_model.fit(X_train, y_train) +# print(optimal_model.get_params()) +# results_dict["optimal_model"] = optimal_model +# results_dict["bayesian_beta"] = bayesian_beta +# results_dict["result"] = result +# return results_dict + + +def optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_model, X_train, y_train, num_grid_values, num_cv_jobs = -1): + beta_max = 0.5 * np.max(np.abs(X_train.T.dot(y_train))) + beta_min = 0.01 * beta_max + beta_grid = np.logspace(np.log10(beta_max), np.log10(beta_min), num=num_grid_values) + import copy + alpha_grid = [] + initial_gregCV = netrem_model + original_dict = copy.deepcopy(netrem_model.get_params()) + original_model = NetREmModel(**netrem_model.get_params()) + initial_gregCV.model_type = "LassoCV" + #print(initial_gregCV.get_params()) + for beta in beta_grid: + gregCV_demo = initial_gregCV + gregCV_demo.beta_network = beta + gregCV_demo.fit(X_train, y_train) + optimal_alpha = gregCV_demo.regr.alpha_ + alpha_grid.append(optimal_alpha) + + beta_alpha_grid_dict = {} + beta_alpha_grid_dict["beta_network_vals"] = beta_grid + beta_alpha_grid_dict["alpha_lasso_vals"] = alpha_grid #np.array(alpha_grid) + param_grid = [] + for i in tqdm(range(0, len(beta_alpha_grid_dict["beta_network_vals"]))): + beta_net = beta_alpha_grid_dict["beta_network_vals"][i] + alpha_las = beta_alpha_grid_dict["alpha_lasso_vals"][i] + param_grid.append({"alpha_lasso": [alpha_las], "beta_network": [beta_net]}) + grid_search = GridSearchCV(original_model, param_grid = param_grid, cv=gridSearchCV_folds, + n_jobs = 
num_cv_jobs, scoring='neg_mean_squared_error') + grid_search.fit(X_train, y_train) + # Get the best hyperparameters + best_params = grid_search.best_params_ + optimal_alpha = best_params["alpha_lasso"] + optimal_beta = best_params["beta_network"] + if isinstance(optimal_alpha, np.ndarray): + optimal_alpha = optimal_alpha[0] + if isinstance(optimal_beta, np.ndarray): + optimal_beta = optimal_beta[0] + print(f":) NetREmModelCV found that the optimal alpha_lasso = {optimal_alpha} and optimal beta_network = {optimal_beta}") + update_NetREmModel = NetREmModel(**original_dict) + update_NetREmModel.beta_network = optimal_beta + update_NetREmModel.alpha_lasso = optimal_alpha + update_NetREmModel = NetREmModel(**update_NetREmModel.get_params()) + update_NetREmModel.fit(X_train, y_train) + return update_NetREmModel + + +def model_comparison_metrics_for_target_gene_with_BayesianOpt_andOr_GridSearchCV_ForNetREm(gene_num, target_genes_list, + X_train_all, X_test_all, y_train_all, y_test_all, + scgrnom_step2_df, tfs, expression_percentile, tf_df, + js_mini, ppi_edge_list, num_tfs_family, gene_expression_genes, tf_name = "SOX10", + beta_net_min = 0.001, + beta_net_max = 10, + alpha_lasso_min = 0.0001, + alpha_lasso_max = 0.1, + num_grid_values = 100, + gridSearchCV_folds = 5, + scorer = "mse", view_network = False, verbose = False, num_cv_jobs = -1): + + focus_gene = target_genes_list[gene_num] # here, this is tough 9, 10 + print(f"Please note that our focus gene (Target gene (TG) y) is: {focus_gene}") + + y_train = y_train_all[[focus_gene]] + y_test = y_test_all[[focus_gene]] + + tfs_for_tg = scgrnom_step2_df[scgrnom_step2_df["TG"] == focus_gene]["TF"].tolist() + tfs_for_tg.sort() + + tfs_for_tg = intersection(tfs_for_tg, tfs) + len(tfs_for_tg) + + low_TFs_bool = False + if len(tfs_for_tg) < 5: + print(":( uh-oh!") + low_TFs_bool = True + if verbose: + print(len(tfs_for_use)) + # adding genes from the same family to the set of TFs (based on co-binding from Step 2) + tf_families_to_add = list(set(tf_df[tf_df["gene"].isin(tfs_for_tg)]["TF_Family"])) + gene_expression_avg = np.mean(X_train_all, axis=0) + + expression_threshold = np.percentile(gene_expression_avg, expression_percentile) + if verbose: + print(f":) Please note that based on the training X data, we find that the {expression_percentile}%ile average gene expression level is: {expression_threshold}") #expression_threshold + gene_expression_avg_df = pd.DataFrame(gene_expression_avg, columns = ["avg_expression"]) + gene_expression_avg_df["gene"] = gene_expression_avg_df.index + genes_above_threshold_df = gene_expression_avg_df[gene_expression_avg_df["avg_expression"] >= expression_threshold] + info_tf_family_expression_df = pd.merge(tf_df, gene_expression_avg_df, how = "inner") + info_tf_family_expression_df = info_tf_family_expression_df.sort_values(by = ["avg_expression"], ascending = False) + info_tf_family_expression_df = info_tf_family_expression_df.sort_values(by = ["TF_Family"]) + mini_info_tf_family_express_df = info_tf_family_expression_df[info_tf_family_expression_df["TF_Family"].isin(tf_families_to_add)] + # sort dataframe by 'TF_Family' and 'avg_expression' in descending order + df_sorted = mini_info_tf_family_express_df.sort_values(['TF_Family', 'avg_expression'], ascending=False) + # select the row with the highest 'avg_expression' for each 'TF_Family' + df_result = df_sorted.groupby('TF_Family').first().reset_index() + + ######################################################################## + df_sorty = 
info_tf_family_expression_df[info_tf_family_expression_df["gene"].isin(genes_above_threshold_df["gene"].tolist())] + # sort dataframe by 'TF_Family' and 'avg_expression' in descending order + df_sorted1 = df_sorty.sort_values(['TF_Family', 'avg_expression'], ascending=False) + # select the top 2 rows for each 'TF_Family' + if low_TFs_bool: + num_to_use_TFs = num_tfs_family + 1 + df_result1 = df_sorted1.groupby('TF_Family').head(n=num_to_use_TFs).reset_index(drop=True) + else: + df_result1 = df_sorted1.groupby('TF_Family').head(n=num_tfs_family).reset_index(drop=True) + if verbose: + print(df_result1) + tfs_to_use_list = df_result["gene"].tolist() + tfs_to_use_list.sort() + if verbose: + print(f" :) tfs_to_use_list = {tfs_to_use_list}") + + tfs_for_use = list(set(tfs_to_use_list + df_result1["gene"].tolist())) + tfs_for_use.sort() + + ########################################################################## + js_minier = js_mini[js_mini["TF1"].isin(tfs_for_use)] + js_minier = js_minier[js_minier["TF2"].isin(tfs_for_use)] + + # for each tf from scgrnom step 2, we add the top 3 TFs based on the cobind matrix + tfs_added_list = [] + for i in tqdm(range(0, len(tfs_to_use_list))): + tf_num = i#in tfs_for_tg: + if low_TFs_bool: + tfs_added_list += js_minier[js_minier["TF1"] == tfs_to_use_list[tf_num]].head(9)["TF2"].tolist() + else: + tfs_added_list += js_minier[js_minier["TF1"] == tfs_to_use_list[tf_num]].head(3)["TF2"].tolist() + + tfs_added_list.sort() + + + #################################### + if verbose: + print(len(tfs_added_list)) + print(tfs_added_list) + combo_tfs = list(set(tfs_to_use_list+tfs_added_list)) + if verbose: + print(len(combo_tfs)) + print(combo_tfs) + tf_columns = intersection(combo_tfs, gene_expression_genes) + tf_columns = list(set(tf_columns)) + tf_columns.sort() + if verbose: + print(":) # of TFs: ", len(tf_columns)) + print(tf_columns) + + if focus_gene in tf_columns: + tf_columns.remove(focus_gene) + key_genes = tf_columns + + ######################### :) We are filtering the input PPI matrix based on the + # final TFs (key_genes) to help us save time: + filtered_ppi_edge_list = [] + for edge in ppi_edge_list: + if edge[0] in key_genes and edge[1] in key_genes: + filtered_ppi_edge_list.append(edge) + + if verbose: + print(filtered_ppi_edge_list) + + X_train = X_train_all[tf_columns] + X_test = X_test_all[tf_columns] + if verbose: + print("X_train dimensions: ", X_train.shape) + print("X_test dimensions: ", X_test.shape) + + netrem_no_intercept = netrem(edge_list = filtered_ppi_edge_list, + gene_expression_nodes = key_genes, + verbose = verbose, + view_network = view_network) + + netrem_with_intercept = netrem(edge_list = filtered_ppi_edge_list, + y_intercept = True, + verbose = verbose, + gene_expression_nodes = key_genes, + view_network = view_network) + + model_comparison_df1 = pd.DataFrame() + model_comparison_df2 = pd.DataFrame() + bayes_optimizer_bool = False + griddy_optimizer_bool = False + + ##################################################################################### + no_intercept = False + with_intercept = False + try: + optimal_netrem_no_intercept = optimal_netrem_model_via_bayesian_param_tuner(netrem_no_intercept, X_train, y_train, + beta_net_min, + beta_net_max, + alpha_lasso_min, + alpha_lasso_max, + num_grid_values, + gridSearchCV_folds, + scorer, + verbose) + #optimal_netrem_no_intercept = optimal_netrem_model_via_bayesian_param_tuner(netrem_no_intercept, X_train, y_train, verbose = verbose) + optimal_netrem_no_intercept = 
optimal_netrem_no_intercept["optimal_model"] + no_intercept = True + except: + print(":( Bayesian optimizer is not working for no y-intercept") + optimal_netrem_no_intercept = None + + try: + optimal_netrem_with_intercept = optimal_netrem_model_via_bayesian_param_tuner(netrem_with_intercept, X_train, y_train, + beta_net_min, + beta_net_max, + alpha_lasso_min, + alpha_lasso_max, + num_grid_values, + gridSearchCV_folds, + scorer, + verbose) + + optimal_netrem_with_intercept = optimal_netrem_with_intercept["optimal_model"] + with_intercept = True + + except: + print(":( Bayesian optimizer is not working for y-intercept") + optimal_netrem_with_intercept = None + + if no_intercept or with_intercept: + model_comparison_df1 = metrics_for_netrem_models_versus_other_models(netrem_with_intercept = optimal_netrem_with_intercept, netrem_no_intercept = optimal_netrem_no_intercept, + X_train = X_train, y_train = y_train, + X_test = X_test, y_test = y_test, filtered_results = False, + tf_name = tf_name, target_gene = focus_gene) + model_comparison_df1["approach"] = "bayes_optimizer" + bayes_optimizer_bool = True + + ##################################################################################### + no_intercept = False + with_intercept = False + try: + griddy_netrem_no_intercept = optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_no_intercept, X_train, y_train, + num_grid_values, num_cv_jobs) + + no_intercept = True + except: + print(":( gridsearchCV is not working for no y-intercept") + griddy_netrem_no_intercept = None + + try: + griddy_netrem_with_intercept = optimal_netrem_model_via_gridsearchCV_param_tuner(netrem_with_intercept, X_train, y_train, + num_grid_values, num_cv_jobs) + with_intercept = True + except: + print(":( gridsearchCV is not working for y-intercept") + griddy_netrem_with_intercept = None + + if no_intercept or with_intercept: + model_comparison_df2 = metrics_for_netrem_models_versus_other_models(netrem_with_intercept = griddy_netrem_with_intercept, netrem_no_intercept = griddy_netrem_no_intercept, + X_train = X_train, y_train = y_train, + X_test = X_test, y_test = y_test, filtered_results = False, + tf_name = tf_name, target_gene = focus_gene) + + model_comparison_df2["approach"] = "gridSearchCV" + griddy_optimizer_bool = True + # except: + # print(":( gridsearchCV optimizer is not working") + both_approaches_bool = False + if bayes_optimizer_bool and griddy_optimizer_bool: + combined_model_compare_df = pd.concat([model_comparison_df1, model_comparison_df2]) + both_approaches_bool = True + elif bayes_optimizer_bool: + combined_model_compare_df = pd.concat([model_comparison_df1]) + else: + combined_model_compare_df = pd.concat([model_comparison_df2]) + + if both_approaches_bool: + res3 = combined_model_compare_df + res3["combo_key"] = res3["Info"] + "_" + res3["y_intercept"] + "_" + res3["Rank"].astype(str) + "_" + res3["num_TFs"].astype(str) + # Count the number of occurrences of each combo_key + combo_key_counts = res3.groupby('combo_key').size() + + # Create a boolean mask for the combo_keys that appear more than once + combo_key_mask = combo_key_counts > 1 + + # Update the approach column for the combo_keys that appear more than once + res3.loc[res3['combo_key'].isin(combo_key_counts[combo_key_mask].index), 'approach'] = 'both' + aaa = res3 + + aaa['rank_mse_train'] = aaa['train_mse'].rank(ascending=True).astype(int) + aaa['rank_mse_test'] = aaa['test_mse'].rank(ascending=True).astype(int) + aaa['rank_nmse_train'] = aaa['train_nmse'].rank(ascending=True).astype(int) + 
aaa['rank_nmse_test'] = aaa['test_nmse'].rank(ascending=True).astype(int) + + aaa['rank_snr_train'] = aaa['train_snr'].rank(ascending=False).astype(int) + aaa['rank_snr_test'] = aaa['test_snr'].rank(ascending=False).astype(int) + aaa['rank_psnr_train'] = aaa['train_psnr'].rank(ascending=False).astype(int) + aaa['rank_psnr_test'] = aaa['test_psnr'].rank(ascending=False).astype(int) + aaa["total_metrics_rank"] = aaa['rank_mse_train'] + aaa['rank_mse_test'] + aaa['rank_nmse_train'] + aaa['rank_nmse_test'] + aaa["total_metrics_rank"] += aaa['rank_snr_train'] + aaa['rank_snr_test'] + aaa['rank_psnr_train'] + aaa['rank_psnr_test'] + aaa = aaa.drop_duplicates() + combined_model_compare_df = aaa + combined_model_compare_df = combined_model_compare_df.drop(columns = ["combo_key"]) + return combined_model_compare_df + + +def baseline_metrics_function(X_train, y_train, X_test, y_test, tg, model_name, y_intercept, verbose = False): + + if verbose: + print(f"{model_name} results :) for fitting y_intercept = {y_intercept}") + try: + if model_name == "ElasticNetCV": + regr = ElasticNetCV(cv=5, random_state=0, fit_intercept = y_intercept) + elif model_name == "LinearRegression": + regr = LinearRegression(fit_intercept = y_intercept) + elif model_name == "LassoCV": + regr = LassoCV(cv=5, fit_intercept = y_intercept) + elif model_name == "RidgeCV": + regr = RidgeCV(cv=5, fit_intercept = y_intercept) + regr.fit(X_train, y_train) + if model_name in ["RidgeCV", "LinearRegression"]: + model_df = pd.DataFrame(regr.coef_) + else: + model_df = pd.DataFrame(regr.coef_).transpose() + if verbose: + print(model_df) + model_df.columns = X_train.columns.tolist() + selected_row = model_df.iloc[0] + selected_cols = selected_row[selected_row != 0].index # Filter out the columns with value 0 + model_df = model_df[selected_cols] + df = model_df.replace("None", np.nan).apply(pd.to_numeric, errors='coerce') + sorted_series = df.abs().squeeze().sort_values(ascending=False) + # convert the sorted series back to a DataFrame + sorted_df = pd.DataFrame(sorted_series) + # add a column for the rank + sorted_df['Rank'] = range(1, len(sorted_df) + 1) + sorted_df['TF'] = sorted_df.index + sorted_df = sorted_df.rename(columns = {0:"AbsoluteVal_coefficient"}) + # tfs = sorted_df["TF"].tolist() + # if tf_name not in tfs: + # sorted_df = pd.DataFrame(["N/A", tf_name]).transpose() + # sorted_df.columns = ["Rank", "TF"] + sorted_df["Info"] = model_name + if y_intercept: + sorted_df["y_intercept"] = "True :)" + else: + sorted_df["y_intercept"] = "False :(" + sorted_df["final_model_TFs"] = model_df.shape[1] + sorted_df["TFs_input_to_model"] = X_train.shape[1] + sorted_df["original_TFs_in_X"] = X_train.shape[1] + + predY_train = regr.predict(X_train) + predY_test = regr.predict(X_test) + train_mse = em.mse(y_train.values.flatten(), predY_train) + test_mse = em.mse(y_test.values.flatten(), predY_test) + sorted_df["train_mse"] = train_mse + sorted_df["test_mse"] = test_mse + sorted_df["train_nmse"] = em.nmse(y_train.values.flatten(), predY_train) + sorted_df["test_nmse"] = em.nmse(y_test.values.flatten(), predY_test) + sorted_df["train_snr"] = em.snr(y_train.values.flatten(), predY_train) + sorted_df["test_snr"] = em.snr(y_test.values.flatten(), predY_test) + sorted_df["train_psnr"] = em.psnr(y_train.values.flatten(), predY_train) + sorted_df["test_psnr"] = em.psnr(y_test.values.flatten(), predY_test) + sorted_df["TG"] = tg + sorted_df = sorted_df.reset_index().drop(columns = ["index"]) + sorted_df + except: + return pd.DataFrame() + return 
sorted_df \ No newline at end of file diff --git a/code/previous_version/packages_needed.py b/code/previous_version/packages_needed.py new file mode 100644 index 0000000..b4d319e --- /dev/null +++ b/code/previous_version/packages_needed.py @@ -0,0 +1,38 @@ +import pandas as pd +import numpy as np +import random +import copy +from tqdm import tqdm +import os +import sys # https://www.dev2qa.com/how-to-run-python-script-py-file-in-jupyter-notebook-ipynb-file-and-ipython/#:~:text=How%20To%20Run%20Python%20Script%20.py%20File%20In,2.%20Invoke%20Python%20Script%20File%20From%20Ipython%20Command-Line. +import networkx as nx +import scipy +from scipy.linalg import svd as robust_svd +from sklearn.model_selection import KFold, train_test_split, GridSearchCV, cross_val_score +from sklearn.decomposition import TruncatedSVD +from sklearn import linear_model +from sklearn.linear_model import Lasso, LassoCV, LinearRegression, ElasticNetCV, Ridge +from numpy.typing import ArrayLike +from skopt import gp_minimize, space +from typing import Optional, List, Tuple +from sklearn.metrics import make_scorer +import plotly.express as px +from sklearn.base import RegressorMixin, ClassifierMixin, BaseEstimator +import matplotlib.pyplot as plt +from numpy.typing import ArrayLike +from scipy.sparse.linalg.interface import LinearOperator +import warnings +from sklearn.exceptions import ConvergenceWarning +printdf = lambda *args, **kwargs: print(pd.DataFrame(*args, **kwargs)) +rng_seed = 2023 # random seed for reproducibility +randSeed = 123 + + +""" +Optimization for +(1 / (2 * M)) * ||y - Xc||^2_2 + (beta / (2 * N^2)) * c'Ac + alpha * ||c||_1 +Which is converted to lasso +(1 / (2 * M)) * ||y_tilde - X_tilde @ c||^2_2 + alpha * ||c||_1 +where M = n_samples and N is the dimension of c. +Check compute_X_tilde_y_tilde() to see how we make sure above normalization is applied using Lasso of sklearn +""" \ No newline at end of file
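The docstring above describes how NetREm folds the network penalty into a standard Lasso problem through an augmented design matrix and response (`X_tilde`, `y_tilde`). The snippet below is a minimal, hypothetical sketch of that conversion; it is *not* the repository's `compute_X_tilde_y_tilde()` implementation. It assumes `A` is the symmetric positive semi-definite network penalty matrix built from the prior network, and it rescales the augmented system so that sklearn's `Lasso`, which averages the squared error over the augmented row count `M + N`, still reproduces the `1 / (2 * M)` normalization in the objective.

```python
# Hypothetical sketch (not the repository's compute_X_tilde_y_tilde): one way to rewrite
# (1/(2M))||y - Xc||^2 + (beta/(2N^2)) c'Ac + alpha ||c||_1  as a plain Lasso problem.
import numpy as np
from sklearn.linear_model import Lasso

def network_lasso_sketch(X, y, A, beta_network=1.0, alpha_lasso=0.01):
    M, N = X.shape                                 # M samples, N predictors
    # Factor the (assumed symmetric PSD) penalty matrix A as B.T @ B,
    # here via its eigendecomposition; a Cholesky factor would also work.
    eigvals, eigvecs = np.linalg.eigh(A)
    eigvals = np.clip(eigvals, 0.0, None)          # guard against tiny negative round-off
    B = np.diag(np.sqrt(eigvals)) @ eigvecs.T      # so that B.T @ B == A

    # Append (sqrt(beta * M) / N) * B below X and zeros below y, so that
    # (1/(2M))||y_tilde - X_tilde c||^2 == (1/(2M))||y - Xc||^2 + (beta/(2N^2)) c'Ac
    scale = np.sqrt(beta_network * M) / N
    X_tilde = np.vstack([X, scale * B])
    y_tilde = np.concatenate([y, np.zeros(N)])

    # sklearn's Lasso divides the squared error by the *augmented* sample count (M + N),
    # so rescale both sides to recover the 1/(2M) normalization used above.
    c = np.sqrt((M + N) / M)
    lasso = Lasso(alpha=alpha_lasso, fit_intercept=False)
    lasso.fit(c * X_tilde, c * y_tilde)
    return lasso.coef_
```

The rescaling factor `sqrt((M + N) / M)` is only needed because sklearn normalizes by the number of rows of whatever matrix it is handed; without it, the effective `beta_network` and `alpha_lasso` would be shrunk by `M / (M + N)`.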
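Relatedly, below is a hedged usage sketch for the Bayesian hyperparameter tuner added in `netrem_evaluation_functions.py` earlier in this diff. The toy `edge_list`, the generated training data, and the `netrem()` builder arguments are illustrative assumptions (and the modules are assumed importable from the `previous_version` folder); the tuner's signature and the keys of its returned dictionary (`optimal_model`, `bayesian_alpha`, `bayesian_beta`, `result`) follow the code above.

```python
# Illustrative usage of optimal_netrem_model_via_bayesian_param_tuner (toy data, assumed imports).
import numpy as np
import pandas as pd
from Netrem_model_builder import netrem                      # builder used elsewhere in this diff
from netrem_evaluation_functions import optimal_netrem_model_via_bayesian_param_tuner

# Toy prior network among 3 TFs and synthetic training data (illustrative only).
edge_list = [["TF1", "TF2", 0.9], ["TF2", "TF3", 0.5], ["TF1", "TF3", 0.4]]
rng = np.random.default_rng(2023)
X_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["TF1", "TF2", "TF3"])
y_train = pd.DataFrame({"TG1": X_train @ np.array([1.0, 0.5, 0.0])
                               + rng.normal(scale=0.1, size=100)})

base_model = netrem(edge_list=edge_list, gene_expression_nodes=["TF1", "TF2", "TF3"])

results = optimal_netrem_model_via_bayesian_param_tuner(
    base_model, X_train, y_train,
    beta_net_min=0.5, beta_net_max=1000,
    alpha_lasso_min=0.0001, alpha_lasso_max=0.1,
    num_grid_values=30, cv_folds=5, scorer="mse")

tuned_model = results["optimal_model"]        # NetREmModel refit with the tuned hyperparameters
print(results["bayesian_alpha"], results["bayesian_beta"])
```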