- Supervised Approaches
- Training Model
- Train / Test / Validation
- Unsupervised Approaches
- Loss Function in Unsupervised
- Create a Decision Tree
- Gini
- Gain
- Classification Error
- Entropy
- Information Gain
- Gain Ratio
- Code for Decision Trees
- Confusion Matrix
Regression: Learn a line/curve (the model) using training data consisting of input-output pairs, and use it to predict the outputs for new inputs.
SVM (Support Vector Machines): SVM models can perform non-linear regression/classification by mapping their inputs into high-dimensional feature spaces.
Classification: Learn to separate different classes (the model) using training data consisting of input-output pairs, and use it to identify the labels for new inputs.
Ensembles: Ensemble methods are machine learning techniques that combine several models in order to produce a better-performing model.
Our training data comes in pairs of inputs (x,y)
D={(x1,y1),...,(xn,yn)}
xi: input vector of the ith sample (feature vector)
yi: label of the ith sample
D: Training dataset
The goal of supervised learning is to develop a model h:
h(xi)≈yi for all (xi,yi)∈D
Training Set: The model learns patterns and relationships within the training set. It is the data on which the model is trained to make predictions.
Testing Set: Once the model is trained, it is evaluated on the testing set to assess its performance and generalization to new, unseen data. This set helps to estimate how well the model is likely to perform on new, real-world data.
Validation Set: The validation set is another independent subset used during the training phase to fine-tune the model and avoid overfitting.
Best practice:
Train : 70%-80%
Validation : 5%-10%
Test: 10%-25%
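A minimal scikit-learn sketch of carving out the three subsets with two successive splits, assuming a feature table X and labels y (hypothetical names):
from sklearn.model_selection import train_test_split

# First split: hold out 30% of the data for validation + test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
# Second split: 1/3 of the held-out data for validation (~10%), 2/3 for test (~20%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=2/3, random_state=42)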
Clustering: Learn the grouping structure for a given set of unlabeled inputs
Association rule: Association rule mining is a rule-based machine learning method for discovering interesting relations between variables in transactional databases.
Example: basket analysis, where the goal is to uncover associations between items frequently purchased together.
Apriori Algorithm: The Apriori algorithm is a widely used algorithm for mining association rules. It works by iteratively discovering frequent itemsets (sets of items that occur together frequently) and generating association rules based on these itemsets.
Rule: X => Y
X: antecedent (or left-hand side): the items that are observed
Y: consequent (or right-hand side): the items that are expected or likely to be present when the conditions in the antecedent are met
A, B => C: it suggests that when both items A and B are present (antecedent), there is a likelihood that item C will also be present (consequent).
{Milk, Bread} => {Eggs}: customers who buy both milk and bread are likely to buy eggs as well.
Support: Support measures the frequency of occurrence of a particular combination of items in a dataset. High support values indicate that the itemset is common in the dataset.
Support = frq(X,Y)/N
frq(X, Y): This is the count of transactions where the itemset (X, Y) is present.
N: This represents the total number of transactions or instances in the dataset.
Support({Milk,Bread})= Number of transactions containing both Milk and Bread/Total number of transactions in the dataset
Confidence: Confidence measures the likelihood that an association rule holds true. It is the conditional probability of finding the consequent (Y) given the antecedent (X). High confidence indicates a strong association between the antecedent and consequent.
Confidence = frq(X,Y)/frq(X)
frq(X, Y): This is the count of transactions where both the antecedent (X) and the consequent (Y) are present.
frq(X): This is the count of transactions where the antecedent (X) is present.
Confidence({Milk, Bread}⇒{Eggs}) = Number of transactions containing Milk, Bread, and Eggs/Number of transactions containing Milk and Bread
Lift: Lift measures the strength of association between an antecedent and consequent, taking into account the support of both itemsets. A lift greater than 1 indicates that the presence of the antecedent increases the likelihood of the consequent.
Lift = Support(X,Y)/[Support(X)*Support(Y)]
Support(X, Y): This is the support of the itemset containing both X and Y
Support(X): This is the support of the antecedent X
Support(Y): This is the support of the consequent Y
The lift formula essentially compares the observed co-occurrence of X and Y (Support(X, Y)) to what would be expected if X and Y were independent events (Support(X) * Support(Y))
Lift = 1: X and Y are independent.
Lift > 1: There is a positive association between X and Y (X and Y are more likely to occur together than expected).
Lift < 1: There is a negative association between X and Y (X and Y are less likely to occur together than expected).
Lift({Milk, Bread}⇒{Eggs}) = Support({Milk, Bread, Eggs}) / (Support({Milk, Bread}) × Support({Eggs}))
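These three metrics can be computed directly on a small list of transactions; a minimal sketch with a hypothetical toy basket (not from the slides):
# Toy transactions (hypothetical)
transactions = [
    {'Milk', 'Bread', 'Eggs'},
    {'Milk', 'Bread'},
    {'Milk', 'Eggs'},
    {'Bread', 'Eggs'},
    {'Milk', 'Bread', 'Eggs'},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / N

X, Y = {'Milk', 'Bread'}, {'Eggs'}
supp_xy = support(X | Y)                         # Support(X, Y)
confidence = supp_xy / support(X)                # Confidence(X => Y)
lift = supp_xy / (support(X) * support(Y))       # Lift(X => Y)
print(supp_xy, confidence, lift)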
Unsupervised learning is about modeling the world: learning structure from unlabeled data.
K-Means Clustering: In k-means clustering, the goal is to partition a dataset into k clusters, where each data point belongs to the cluster with the nearest centroid.
Loss Function: The loss function in k-means is typically the sum of squared distances between each data point and its assigned cluster centroid. The objective is to minimize this sum.
Loss = Σ_{i=1}^{N} ||x_i − c_{j(i)}||^2
MSE (Mean Squared Error) Loss = Loss / N
N: the number of data points
x_i: a data point
c_{j(i)}: the centroid of the cluster to which x_i is assigned
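A minimal numpy sketch of this loss, assuming a few toy points and two fixed centroids (illustrative values):
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])   # toy data points
centroids = np.array([[1.0, 1.5], [8.5, 9.0]])                   # two hypothetical centroids

# Assign each point to its nearest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)   # shape (N, k)
assign = dists.argmin(axis=1)

# Sum of squared distances to the assigned centroids (the k-means loss / inertia)
loss = ((X - centroids[assign]) ** 2).sum()
mse = loss / len(X)
print("Loss:", loss, "MSE:", mse)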
Common data quality problems:
- Incomplete: e.g., occupation=“ ”
- Noisy: e.g., Salary=“-10”
- Inconsistent: e.g., Age=“42” but Birthday=“03/07/1997”
Data Cleaning
- Fill in missing values
- Smooth noisy data
- Identify or remove outliers
- Remove duplicates
- Resolve inconsistencies and discrepancies
Data Transformation
- Normalization
- Discretization
Data Reduction
- Dimensionality reduction
- Numerosity reduction
Data Integration
- Combining data from multiple sources into a unified dataset.
MICE (Multiple Imputation by Chained Equations):
Chained: implies a sequential process where each variable with missing data is imputed one at a time, cycling through the variables.
Equations: a separate imputation model is fit for each variable being imputed; the type of model depends on the nature of the variable (e.g., logistic regression for binary variables, linear regression for continuous variables).
Univariate Imputation
In univariate imputation, each missing value in a dataset is imputed (filled in) based on information from the same variable.
- Mean/Median/Mode Imputation: Missing values are replaced with the mean, median, or mode of the observed values in the same variable. This is simple and often effective but can distort the distribution of the data and underestimate the variability.
- Random Sampling: Missing values are replaced with a value drawn randomly from the observed values of the same variable. This maintains the distribution but doesn't use any other information that might be helpful.
- Constant Value: All missing values are filled in with a constant value, such as zero. This is a basic approach and is rarely used unless there is a strong justification.
Example:
Student | Age | Test Score |
---|---|---|
A | 14 | 85 |
B | 13 | Missing |
C | 14 | 90 |
D | 13 | 75 |
E | 14 | Missing |
Mean Test Score = (85 + 90 + 75) / 3 = 83.33
Student | Age | Test Score |
---|---|---|
A | 14 | 85 |
B | 13 | 83.33 |
C | 14 | 90 |
D | 13 | 75 |
E | 14 | 83.33 |
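The same mean imputation can be reproduced in pandas; a minimal sketch of the table above:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Student': ['A', 'B', 'C', 'D', 'E'],
    'Age': [14, 13, 14, 13, 14],
    'Test Score': [85, np.nan, 90, 75, np.nan],
})
# Replace missing scores with the mean of the observed scores (83.33)
df['Test Score'] = df['Test Score'].fillna(df['Test Score'].mean())
print(df)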
Multivariate Imputation
Multivariate imputation considers the relationships between different variables in the dataset when imputing missing values.
- Multiple Imputation: It involves creating multiple complete datasets by imputing the missing values multiple times. Statistical models (like regression models) are used, considering the relationships among the variables. The results from these multiple datasets are then combined to give a final estimate. This method is useful as it also estimates the uncertainty due to missing data.
Age | Experience | Salary |
---|---|---|
25 | | 50 |
27 | 3 | |
29 | 5 | 110 |
31 | 7 | 140 |
33 | 9 | 170 |
| 11 | 200 |
Step 1: Impute all missing values with the mean
29 = (25+27+29+31+33)/5
7 = (3+5+7+9+11)/5
134 = (50+110+140+170+200)/5
Age | Experience | Salary |
---|---|---|
25 | 7 | 50 |
27 | 3 | 134 |
29 | 5 | 110 |
31 | 7 | 140 |
33 | 9 | 170 |
29 | 11 | 200 |
Step 2: Remove the imputed 'Age' value (set it back to missing)
Step 3: Use linear regression (with Experience and Salary as features) to estimate the missing Age; the predicted Age is 36.2532
# LinearRegression
from sklearn.linear_model import LinearRegression
import numpy as np
# Example data
X = np.array([[7, 50], [3, 134], [5, 110], [7, 140], [9, 170]]) # Experience and Salary
y = np.array([25, 27, 29, 31, 33]) # Age
# Create linear regression model
model = LinearRegression()
model.fit(X, y)
# Predict the missing age
predicted_age = model.predict([[11, 200]]) # Experience = 11, Salary = 200
print("Predicted Age:", predicted_age[0])
# Predicted Age: 36.25316455696203
Age | Experience | Salary |
---|---|---|
25 | 7 | 50 |
27 | 3 | 134 |
29 | 5 | 110 |
31 | 7 | 140 |
33 | 9 | 170 |
36.2532 | 11 | 200 |
Step 4: Remove the imputed 'Experience' value and use linear regression to estimate the missing Experience; the predicted Experience is 1.8538
Step 5: Remove the imputed 'Salary' value and use linear regression to estimate the missing Salary; the predicted Salary is 72.7748. Iteration 1 is done.
Age | Experience | Salary |
---|---|---|
25 | 1.8538 | 50 |
27 | 3 | 72.7748 |
29 | 5 | 110 |
31 | 7 | 140 |
33 | 9 | 170 |
36.2532 | 11 | 200 |
Step 6: Subtract the previously imputed values (mean imputation) from the iteration-1 values to see how much the imputations changed.
Difference (iteration 1 − mean imputation):
Age | Experience | Salary |
---|---|---|
0 | -5.1462 | 0 |
0 | 0 | -61.2252 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
7.2532 | 0 | 0 |
Iteration 2: repeat the three regressions starting from the iteration-1 values.
Iteration 2 values:
Age | Experience | Salary |
---|---|---|
25 | 0.9172 | 50 |
27 | 3 | 80.7385 |
29 | 5 | 110 |
31 | 7 | 140 |
33 | 9 | 170 |
34.8732 | 11 | 200 |
Difference from iteration 1:
Age | Experience | Salary |
---|---|---|
0 | 0.9366 | 0 |
0 | 0 | 7.9637 |
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
1.38 | 0 | 0 |
# LinearRegression
from sklearn.linear_model import LinearRegression
import numpy as np
# Example data
X = np.array([[1.8538, 50], [3, 72.7748], [5, 110], [7, 140], [9, 170]]) # Experience and Salary
y = np.array([25, 27, 29, 31, 33]) # Age
# Create linear regression model
model = LinearRegression()
model.fit(X, y)
# Predict the missing age
predicted_age = model.predict([[11, 200]]) # Experience = 11, Salary = 200
print("Predicted Age:", predicted_age[0])
# Predicted Age: 34.87326219387428
Iterations 3, 4, ...: repeat the same cycle of removing and re-estimating each imputed value until the imputed values stop changing (the differences approach zero).
# MICE Imputation
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Read data
input_dataframe = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/Microbiology_Dataset.csv")
print(input_dataframe)
# MICE
imputer = IterativeImputer(max_iter=10, random_state=0) # Imputer Initialization
imputed_dataset = imputer.fit_transform(input_dataframe) # Perform imputation
imputed_dataframe = pd.DataFrame(imputed_dataset, columns=input_dataframe.columns) # Converts the numpy array into pandas DataFrame
print(imputed_dataframe)
Equal-Width Binning:
Divides the data range into equal-width intervals. For example, grouping ages into bins like 0-10, 11-20, 21-30, etc.
Equal-Frequency Binning:
Divides the data into intervals containing approximately the same number of data points. This method can be more robust to data distribution variations.
Robustness:
In statistical and mathematical contexts, a robust statistic or method is one that is not heavily influenced by outliers or extreme values. It can provide reliable results even when the data deviates from the expected distribution or contains anomalies.
# Binning
imputed_dataframe['MIC_bin'] = pd.qcut(imputed_dataframe['MIC'], q=3) # Divides the data into 3 quantile-based bins
imputed_dataframe['MIC'] = pd.Series([interval.mid for interval in imputed_dataframe['MIC_bin']]) # Extracting Midpoints of Bins
print(imputed_dataframe)
# Original Data Binned Category Midpoints
# 0 128.000 (1.0, 256.0] 128.5000
# 1 0.125 (0.001, 0.125] 0.0630
# 2 0.064 (0.001, 0.125] 0.0630
# 3 1.000 (0.125, 1.0] 0.5625
# 4 3.000 (1.0, 256.0] 128.5000
# ... ... ... ...
# 1717 0.190 (0.125, 1.0] 0.5625
# 1718 0.190 (0.125, 1.0] 0.5625
# 1719 0.190 (0.125, 1.0] 0.5625
# 1720 0.750 (0.125, 1.0] 0.5625
# 1721 0.190 (0.125, 1.0] 0.5625
# [1722 rows x 3 columns]
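For comparison, equal-width binning can be sketched with pd.cut on a small hypothetical array, with equal-frequency binning (pd.qcut) shown alongside:
import numpy as np
import pandas as pd

ages = np.array([3, 7, 12, 18, 25, 33, 41, 56, 64, 79])
width_bins = pd.cut(ages, bins=4)    # 4 bins of equal width over the age range
freq_bins = pd.qcut(ages, q=4)       # 4 bins with roughly equal numbers of points
print(pd.DataFrame({'Age': ages, 'Equal-Width': width_bins, 'Equal-Frequency': freq_bins}))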
# Binning
import numpy as np
import pandas as pd
# Your data
data = np.array([2, 2, 2, 4, 4, 3, 1, 4, 2, 1, 3, 4, 1, 1, 4, 7, 4, 1, 1, 2, 4, 3, 4, 3, 3, 2, 5, 2, 3, 2, 3, 4, 2, 10, 4, 4, 6, 3, 3, 1, 1, 2, 1, 3, 2, 4, 5, 2, 4, 3, 2, 3, 4, 3, 1, 1, 6, 3, 6, 5, 7, 2, 1, 1, 6, 5, 1, 1, 1, 2, 2, 1, 2, 2, 4, 4, 1, 5, 7, 2, 1, 2, 1, 5, 3, 1, 1, 2, 3, 3, 5, 4, 4, 6, 1, 4, 4, 1, 3, 4, 4, 5, 4, 4, 1, 1, 3, 1, 2, 1, 3, 7, 2, 1, 1, 3, 3, 6, 1, 6, 2, 3, 7, 1])
# Perform quantile-based binning
bins = pd.qcut(data, q=3)
# Create a Pandas Series with the midpoints of the bins
midpoints_series = pd.Series([interval.mid for interval in bins])
# Create a DataFrame to visualize the original data, binned categories, and midpoints
df = pd.DataFrame({'Original Data': data, 'Binned Category': bins, 'Midpoints': midpoints_series})
# Display the DataFrame
print(df)
# Original Data Binned Category Midpoints
# 0 2 (0.999, 2.0] 1.4995
# 1 2 (0.999, 2.0] 1.4995
# 2 2 (0.999, 2.0] 1.4995
# 3 4 (2.0, 4.0] 3.0000
# 4 4 (2.0, 4.0] 3.0000
# .. ... ... ...
# 119 6 (4.0, 10.0] 7.0000
# 120 2 (0.999, 2.0] 1.4995
# 121 3 (2.0, 4.0] 3.0000
# 122 7 (4.0, 10.0] 7.0000
# 123 1 (0.999, 2.0] 1.4995
# [124 rows x 3 columns]
- A decision tree is a hierarchical classification model that uses a tree structure and can be used to support decisions.
- Each internal node represents a test on one attribute (feature).
- Each branch from a node represents a possible outcome of the test.
- Each leaf node represents a class label.
- When a decision tree classifies things into categories, it is called a Classification Tree.
- When a decision tree predicts numeric values, it is called a Regression Tree.
Loves Popcorn | Loves Soda | Age | Loves Cool As Ice |
---|---|---|---|
Yes | Yes | 7 | No |
Yes | No | 12 | No |
No | Yes | 18 | Yes |
No | Yes | 35 | Yes |
Yes | Yes | 38 | Yes |
Yes | No | 50 | No |
No | No | 83 | No |
Loves Popcorn (True) -> 1 Loves Cool As Ice (True) and 3 Loves Cool As Ice (False)
Loves Popcorn (False) -> 2 Loves Cool As Ice (True) and 1 Loves Cool As Ice (False)
Gini Impurity for the True leaf = 1 - (1/4)^2 - (3/4)^2 = 0.375
Gini Impurity for the False leaf = 1 - (2/3)^2 - (1/3)^2 = 0.444
Total Impurity for Loves Popcorn = 0.375*(4/7) + 0.444*(3/7) = 0.405
Loves Soda (True) -> 3 Loves Cool As Ice (True) and 1 Loves Cool As Ice (False)
Loves Soda (False) -> 0 Loves Cool As Ice (True) and 3 Loves Cool As Ice (False)
Likewise Total Impurity for Loves Soda = 0.214
Loves Soda does a better job of predicting who will and will not love Cool As Ice.
Calculate the Gini impurity for the candidate age thresholds 9.5, 15, 26.5, 36.5, 44, 66.5 (the midpoints between adjacent sorted ages)
Age < 9.5 (True) -> 0 Loves Cool As Ice (True) and 1 Loves Cool As Ice (False)
Age < 9.5 (False) -> 3 Loves Cool As Ice (True) and 3 Loves Cool As Ice (False)
Gini Impurity for the True leaf = 1 - (0/1)^2 - (1/1)^2 = 0
Gini Impurity for the False leaf = 1 - (3/6)^2 - (3/6)^2 = 0.5
Total Gini Impurity for Age < 9.5 = 0*(1/7) + 0.5*(6/7) = 0.429
Likewise
Total Gini Impurity for Age < 9.5 = 0.429
Total Gini Impurity for Age < 15 = 0.343
Total Gini Impurity for Age < 26.5 = 0.476
Total Gini Impurity for Age < 36.5 = 0.476
Total Gini Impurity for Age < 44 = 0.343
Total Gini Impurity for Age < 66.5 = 0.429
The two candidate thresholds 15 and 44 have the lowest impurity; either could be used, and we pick 15 here.
Gini Impurity for Loves Soda = 0.214
Gini Impurity for Age < 15 = 0.343
Gini Impurity for Loves Popcorn = 0.405
So we put Loves Soda at the top of the tree
Gini = 1 − Σ_{i=1}^{n} p_i^2
p_i: the proportion of items labeled with class i in the set
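This formula can be wrapped in a small helper to verify the leaf impurities computed above (a minimal sketch, not from the slides):
def gini(counts):
    # Gini impurity of a node given its class counts
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Loves Popcorn split: leaves (1 yes, 3 no) and (2 yes, 1 no)
total_popcorn = gini([1, 3]) * 4/7 + gini([2, 1]) * 3/7   # ~0.405
# Loves Soda split: leaves (3 yes, 1 no) and (0 yes, 3 no)
total_soda = gini([3, 1]) * 4/7 + gini([0, 3]) * 3/7      # ~0.214
print(total_popcorn, total_soda)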
Parent | |
---|---|
C1 | 6 |
C2 | 6 |
Gini = 1-(6/12)^2-(6/12)^2 = 0.5
N1 | N2 | |
---|---|---|
C1 | 5 | 2 |
C2 | 1 | 4 |
Gini(N1) = 1-(5/6)^2-(1/6)^2 = 0.278
Gini(N2) = 1-(2/6)^2-(4/6)^2 = 0.444
Gini(Children) = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
Gain = P – M
P: impurity before the split
M: impurity after the split (weighted impurity of the children)
Gain = 0.500 – 0.361 = 0.139
Classification Error = 1−pMax
pMax: proportion of the most common class in the node
- Maximum (1 - 1/Number_of_classes) when records are equally distributed among all classes, implying least interesting information
- Minimum (0) when all records belong to one class, implying most interesting information
C1 | 0 |
C2 | 6 |
P(C1) = 0/6 = 0
P(C2) = 6/6 = 1
Classification Error = 1 - max(0, 1) = 1 - 1 = 0
C1 | 2 |
C2 | 4 |
P(C1) = 2/6
P(C2) = 4/6
Classification Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
Entropy = − Σ_{i=1}^{k} p_i * log2(p_i)
k: number of classes
p_i: proportion of instances belonging to class i in the node
– Maximum (log2(number of classes)) when records are equally distributed among all classes, implying least information
– Minimum (0.0) when all records belong to one class, implying most information
C1 | 0 |
C2 | 6 |
P(C1) = 0/6 = 0
P(C2) = 6/6 = 1
Entropy = − 0*log2(0) − 1*log2(1) = 0 (using the convention 0*log2(0) = 0)
C1 | 2 |
C2 | 4 |
P(C1) = 2/6
P(C2) = 4/6
Entropy = - (2/6)*log2(2/6) - (4/6)*log2(4/6) = 0.92
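The same entropy values can be checked with a short helper (a minimal sketch):
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -(p * np.log2(p)).sum()

print(entropy([0, 6]))   # 0.0
print(entropy([2, 4]))   # ~0.918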
Comparing Impurity Measures
Entropy and Gini are more sensitive to changes in the node probabilities than the misclassification error rate.
Information Gain = Entropy(parent) − Σ_{i=1}^{k} (n_i/n) * Entropy(partition i)
n_i: number of records in partition i; n: total number of records in the parent node
Information gain has the disadvantage that it prefers attributes with a large number of values, which split the data into many small, pure subsets; this leads to overfitting the training dataset.
Gain Ratio (Quinlan's Gain Ratio)
– Adjusts Information Gain by the entropy of the partitioning (SplitINFO): Gain Ratio = Information Gain / SplitINFO, where SplitINFO = − Σ_{i=1}^{k} (n_i/n) * log2(n_i/n), as in the sketch below
– A large number of small partitions is penalized
– Designed to overcome the disadvantage of Information Gain
– Used in the C4.5 algorithm
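A minimal sketch computing Information Gain and Gain Ratio for the earlier parent/children example (class counts C1/C2 = 6/6 in the parent, 5/1 and 2/4 in the children):
import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float) / sum(counts)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

parent = [6, 6]                      # class counts in the parent node
children = [[5, 1], [2, 4]]          # class counts in each partition
n = sum(parent)
sizes = [sum(c) for c in children]

info_gain = entropy(parent) - sum(s / n * entropy(c) for s, c in zip(sizes, children))
split_info = -sum(s / n * np.log2(s / n) for s in sizes)   # entropy of the partitioning (SplitINFO)
gain_ratio = info_gain / split_info
print(info_gain, split_info, gain_ratio)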
Hyperparameters can be used as constraints in decision trees:
– max_depth: maximum depth of the decision tree.
– min_samples_split: the minimum number of samples required to split an internal node.
– min_samples_leaf: the minimum number of samples required to be at a leaf node.
– min_impurity_decrease: the minimum decrease in impurity required to make a split.
Cost complexity pruning:
– The technique adds a complexity penalty to the impurity.
– It is parameterized by the cost complexity parameter, ccp_alpha.
– Greater values of ccp_alpha increase the number of nodes pruned; when ccp_alpha is set to zero, no pruning is performed and the tree tends to overfit.
# Predict Resistance (as label) by Patient_Age, Bacteria, and Antimicrobial (as features)
# Decision Trees
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
#Input Dataset
org_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/DS_Dataset.csv")
clean_df = prepare_data(org_df) # prepare_data(org_df) is not defined here; assumed to clean/encode the raw dataset
#Define features to predict Resistance label
label_df = clean_df.loc[:,clean_df.columns == 'MIC_Interpretation_resistant']
feat_df = clean_df.loc[:,clean_df.columns != 'MIC_Interpretation_resistant']
#Separate test and train data
train_feat, temp_feat, train_label, temp_label = train_test_split(feat_df, label_df, test_size=0.28, random_state=42)
test_feat, val_feat, test_label, val_label = train_test_split(temp_feat, temp_label, test_size=(20/28), random_state=42)
max_depth_thr = 30 # max_depth threshold for the decision tree
min_samples_leaf_thr = 5 # min_samples_leaf threshold
min_impurity_thr = 0.001 # min_impurity_decrease threshold
ccp_thr = 0.0001 # ccp_alpha threshold
#Create a model using Hyper-parameters
treemodel= tree.DecisionTreeClassifier(criterion="gini",
min_impurity_decrease=min_impurity_thr,
max_depth=max_depth_thr,
min_samples_leaf=min_samples_leaf_thr,
ccp_alpha=ccp_thr)
#Train the model
treemodel.fit(train_feat, train_label)
Accuracy = (True Positive + True Negative)/(True Positive + True Negative + False Positive + False Negative)
Sensitivity = True Positive/(True Positive + False Negative)
Specificity = True Negative/(True Negative + False Positive)
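A sketch of computing these three metrics for the tree above, assuming the treemodel, test_feat, and test_label variables from the previous snippet and a binary label:
from sklearn.metrics import confusion_matrix

test_pred = treemodel.predict(test_feat)
tn, fp, fn, tp = confusion_matrix(test_label, test_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print("Accuracy:", accuracy, "Sensitivity:", sensitivity, "Specificity:", specificity)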
- K-means clustering partitions the data into a pre-specified number of clusters (k).
- Hierarchical clustering tells you, pairwise, which two items are most similar, building up a hierarchy of clusters.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram
# Input Dataset
org_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/market_ds.csv")
train_feat = prepare_data(org_df) # prepare_data (feature cleaning/encoding) assumed defined elsewhere
# Elbow method: compute KMeans inertia for k = 1 to 10
inertias = []
for i in range(1, 11): # Test 1 to 10 clusters
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(train_feat)
    inertias.append(kmeans.inertia_)
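# (Sketch, not in the original notes) Plot inertia against k and look for the
# "elbow" to choose the number of clusters; assumes the matplotlib import above.
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()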
# Kmeans
model = KMeans(n_clusters=2)
model.fit(train_feat)
# Filter rows based on cluster
first_cluster = train_feat.loc[model.labels_ == 0,:]
second_cluster = train_feat.loc[model.labels_ == 1,:]
# AGNES (Agglomerative Nesting)
linkage_data = linkage(train_feat, method='single', metric='euclidean')
dendrogram(linkage_data, truncate_mode = 'level' ,p=5)
plt.show()
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
#Input Dataset
org_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/hw4_train.csv")
test_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/hw4_test.csv")
#Define features and outcome for Regression
outcome_df = org_df.loc[:,org_df.columns == 'BloodPressure']
feat_df = org_df.loc[:,org_df.columns.isin(['Pregnancies','Glucose','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age','Outcome'])]
#Separate test and train data
train_x,test_x,train_y,test_y = train_test_split(feat_df,outcome_df,test_size=0.25)
#Create a multiple Reg model
model = LinearRegression()
model.fit(train_x, train_y)
test_pred_y = model.predict(test_x)
r_sq = model.score(test_x, test_y)
print ('R2 =',r_sq ) # statistical measure of how well the regression predictions approximate the real data points
# Predict 'BloodPressure' in hw4_test.csv
regression_features = test_df.columns.drop(['BloodPressure'])
X_test_reg = test_df[regression_features]
test_df['BloodPressure'] = model.predict(X_test_reg)
print(test_df)
K Nearest Neighbor
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
# Load dataset
org_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/hw4_train.csv")
label_df = org_df.loc[:,org_df.columns == 'Outcome']
feat_df = org_df.loc[:,org_df.columns != 'Outcome']
# Separate test and train data
train_x, test_x, train_y, test_y = train_test_split(feat_df, label_df, test_size=0.25, random_state=42)
# Initialize lists to store the metrics
accuracies = []
sensitivities = []
specificities = []
k_values = range(1, 20)
# Train 19 KNN models with k from 1 to 19
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_x, train_y.values.ravel())
    test_pred_y = knn.predict(test_x)
    # Calculate confusion matrix and extract TP, TN, FP, FN
    cf = confusion_matrix(test_y, test_pred_y)
    TN, FP, FN, TP = cf.ravel()
    # Calculate accuracy, sensitivity (recall), and specificity
    accuracy = accuracy_score(test_y, test_pred_y)
    sensitivity = recall_score(test_y, test_pred_y)
    specificity = TN / (TN + FP)
    # Append metrics to their respective lists
    accuracies.append(accuracy)
    sensitivities.append(sensitivity)
    specificities.append(specificity)
# Identify the best k based on highest accuracy (or other criteria)
best_k_acc = k_values[accuracies.index(max(accuracies))]
print(f"Best k based on highest accuracy: {best_k_acc}")
print(f"Accuracy: {max(accuracies)}, Sensitivity: {sensitivities[accuracies.index(max(accuracies))]}, Specificity: {specificities[accuracies.index(max(accuracies))]}")
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import fpgrowth,apriori,association_rules
#Input Dataset
org_df = pd.read_csv("amr_horse_ds.csv")
org_df= pd.get_dummies(org_df.loc[:,org_df.columns!='Age'])
#Extract Association Rules
frequent_patterns_df = fpgrowth(org_df, min_support=0.1,use_colnames=True)
rules_df = association_rules(frequent_patterns_df, metric = "confidence", min_threshold = 0.9)
high_lift_rules_df = rules_df[rules_df['lift'] > 1.5]
#Save Association Rules
high_lift_rules_df.to_csv('arules.csv')
#Visualize Association Rules
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(rules_df['support'], rules_df['confidence'], rules_df['lift'], marker="*")
ax.set_xlabel('support')
ax.set_ylabel('confidence')
ax.set_zlabel('lift')
plt.show()
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score
# Load Dataset
org_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Dataset/diabetes.csv")
# Define features and label
label_df = org_df['Outcome']
feat_df = org_df.drop('Outcome', axis=1)
# Initialize models with different estimators
rf_3 = RandomForestClassifier(n_estimators=3)
rf_50 = RandomForestClassifier(n_estimators=50)
ad_3 = AdaBoostClassifier(n_estimators=3)
ad_50 = AdaBoostClassifier(n_estimators=50)
# Setup K-Fold
k_folds = KFold(n_splits=5)
# Calculate cross-validation scores
scores_rf_3 = cross_val_score(rf_3, feat_df, label_df, cv=k_folds)
scores_rf_50 = cross_val_score(rf_50, feat_df, label_df, cv=k_folds)
scores_ad_3 = cross_val_score(ad_3, feat_df, label_df, cv=k_folds)
scores_ad_50 = cross_val_score(ad_50, feat_df, label_df, cv=k_folds)
# Print scores and their means
print(f"RF 3 Scores: {scores_rf_3}, Mean: {scores_rf_3.mean()}")
print(f"RF 50 Scores: {scores_rf_50}, Mean: {scores_rf_50.mean()}")
print(f"Adaboost 3 Scores: {scores_ad_3}, Mean: {scores_ad_3.mean()}")
print(f"Adaboost 50 Scores: {scores_ad_50}, Mean: {scores_ad_50.mean()}")
- Objective: Classify handwritten characters (e.g., distinguishing 'X' from 'Y') using pixel data.
- Input Representation: Each grayscale pixel is represented by a value between 0 (black) and 255 (white).
- Complexity: The large number of weights makes training difficult, prone to overfitting, and resource-intensive.
- Pattern Detection: Fully connected layers are inefficient for detecting smaller patterns within images.
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Input
#Input Dataset
org_df = pd.read_csv("/Users/zhangxijing/MasterNEU/INFO6105DataScienceEngineeringMethodsandTools/Datasets/diabetes.csv")
#Labels and Features
label_df = org_df.loc[:,org_df.columns == 'Outcome']
feat_df = org_df.loc[:,org_df.columns != 'Outcome']
#Normalize Features
feat_df = (feat_df - feat_df.mean()) / feat_df.std()
#Split Train and Test Data
x_train, x_test, y_train, y_test = train_test_split(feat_df, label_df, test_size = 0.3)
#Create NN Model
nn = Sequential()
nn.add(Input(shape=(8,)))
nn.add(Dense(units=5, activation='relu'))
nn.add(Dense(units=1, activation='sigmoid'))
# Set Optimizer, Loss Function, and metric
nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
#Train NN Model
nn.fit(x_train, y_train, epochs=100)
print(nn.summary())
#Accuracy of Model on Test data
loss,accuracy = nn.evaluate(x_test,y_test)
print('accuracy=',accuracy,' , loss=',loss)
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense, Input
- 'pandas' is a powerful library for data manipulation and analysis. It provides data structures like DataFrame and Series, which are essential for handling structured data efficiently.
- 'train_test_split' from sklearn: For splitting the dataset into training and testing sets.
- 'Keras' is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, Theano, or CNTK.
- The 'Sequential' class is a linear stack of layers. The concept of a "linear stack of layers" refers to the architecture of a neural network where layers are added one after another in a single, sequential order.
- The 'Dense' layer is a fully connected layer, meaning every neuron in the layer is connected to every neuron in the previous layer.
- The 'Input' layer is used to define the shape of the input data. It is usually the first layer in a model, specifying the expected shape of the input tensor (8 features in this case).
# Labels and Features
label_df = org_df.loc[:, org_df.columns == 'Outcome']
feat_df = org_df.loc[:, org_df.columns != 'Outcome']
- 'label_df' contains the target variable (Outcome).
- 'feat_df' contains all the feature variables (all columns except Outcome).
# Split Train and Test Data
x_train, x_test, y_train, y_test = train_test_split(feat_df, label_df, test_size=0.3)
- Splits the dataset into training (70%) and testing (30%) sets.
# Create NN Model
nn = Sequential()
nn.add(Input(shape=(8,)))
nn.add(Dense(units=5, activation='relu'))
nn.add(Dense(units=1, activation='sigmoid'))
- Defines a sequential neural network model.
- Adds an input layer with 8 input features (the number of features in the dataset).
- Adds a hidden layer with 5 neurons and ReLU activation.
- ReLU(x)=max(0,x)
- Adds an output layer with 1 neuron and sigmoid activation (suitable for binary classification).
- Sigmoid(x)= 1/(1+e^-x)
# Set Optimizer, Loss Function, and metric
nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
- Compiles the model with the Adam optimizer, binary cross-entropy loss function (appropriate for binary classification), and accuracy as the evaluation metric.
# Train NN Model
nn.fit(x_train, y_train, epochs=100)
print(nn.summary())
- Trains the model on the training data for 100 epochs.
# Accuracy of Model on Test data
loss, accuracy = nn.evaluate(x_test, y_test)
print('accuracy=', accuracy, ' , loss=', loss)
- Evaluates the model's performance on the test data.
- Function: Use filters (kernels) to detect specific patterns in different parts of the image.
- Efficiency: Reduce the number of parameters compared to fully connected layers, making training less computationally expensive.
- Activation Function: ReLU is preferred to introduce non-linearity and deactivate non-pattern nodes.
- Structure: Consist of multiple convolutional layers for feature extraction followed by fully connected layers for classification.
- Feature Extraction: Each layer extracts different features, from low-level (edges) to high-level (object parts).
- Pooling: Reduces dimensionality and computation, adds translation invariance.
- Backward Propagation: Optimizes weights by minimizing classification error using gradient descent.
- Flattening: Converts the final convolutional layer into a 1-dimensional array for input into a fully connected network.
- Pattern Detection: Demonstrated with examples of filters identifying specific patterns in images.
- Pooling Types: Max pooling and average pooling explained, with max pooling being the most common.
- Filters (Kernels): Extract features from input images.
- Activation Functions: Introduce non-linearity into the model.
- Pooling Layers: Reduce the spatial dimensions of the feature maps.
- Training Techniques: Use of gradient descent and backpropagation to optimize model performance.
CNNs provide a scalable and efficient approach to image classification by leveraging convolutional layers for feature extraction and fully connected layers for classification. Proper training techniques ensure that the network learns to accurately classify images based on high-level features.
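A minimal Keras sketch of such an architecture, assuming 28x28 grayscale inputs and 10 output classes (these shapes are illustrative, not from the notes):
from keras.models import Sequential
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

cnn = Sequential()
cnn.add(Input(shape=(28, 28, 1)))                                    # 28x28 grayscale image
cnn.add(Conv2D(filters=16, kernel_size=(3, 3), activation='relu'))   # convolutional feature extraction
cnn.add(MaxPooling2D(pool_size=(2, 2)))                              # downsample; adds translation invariance
cnn.add(Conv2D(filters=32, kernel_size=(3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())                                                   # flatten feature maps to a 1-D vector
cnn.add(Dense(units=64, activation='relu'))
cnn.add(Dense(units=10, activation='softmax'))                       # one output per class
cnn.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print(cnn.summary())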