Machine Learning and Modeling section added to 'Data Science'. #771

Merged
merged 2 commits into from
Sep 20, 2024
38 changes: 37 additions & 1 deletion Data Science/readme.md
@@ -20,10 +20,46 @@ EDA involves using statistical and visualization techniques to understand the da

Feature engineering is the process of creating new features or modifying existing ones to improve model performance.

## Machine Learning and Modeling

Data scientists build predictive models using machine learning algorithms. This involves splitting the data into training and testing sets, selecting a model, training it, and evaluating it.
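The workflow above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`split_train_test`, `accuracy`), the toy "hours studied" data, and the threshold models are all hypothetical, and a real project would more likely rely on an established library.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    correct = sum(1 for features, label in rows if model(features) == label)
    return correct / len(rows)

# Hypothetical toy data: (hours_studied,) -> passed the exam?
data = [((hours,), hours >= 5) for hours in range(20)]
train, test = split_train_test(data)

# Model selection and training in miniature: each candidate is a threshold
# rule, and the one that scores best on the training set is kept.
candidates = [lambda features, t=t: features[0] >= t for t in range(20)]
best_model = max(candidates, key=lambda model: accuracy(model, train))

# Evaluation: score the chosen model on data it has not seen.
print("train accuracy:", accuracy(best_model, train))
print("test accuracy:", accuracy(best_model, test))
```

Keeping the test rows out of model selection is the point of the split: the test score is then an estimate of how the model behaves on new data.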

### Machine Learning Models

There are two main types of models used for analyzing patterns and predicting behavior in data: supervised learning models and unsupervised learning models.

Supervised models are trained on data that comes with associated labels. Unsupervised models, on the other hand, work with unlabeled data and instead rely on methods that find patterns in groups of data, distinguishing elements by their differences and commonalities.

Within supervised learning, regression (predicting a numerical value) and classification (predicting a category) learn behaviors from a training dataset, are tested on a smaller held-out dataset, and are then used to predict previously unknown values.

#### KNN - K-Nearest Neighbors

The **KNN** model is one of the most straightforward supervised learning models, although it is relatively inefficient at prediction time because each query is compared against the stored training examples. It uses *proximity* to classify a data point. Given training data labeled with a class or numerical value, the algorithm calculates the distance from the queried data point to the training examples and identifies the **k** closest ones. It then returns the *plurality* label of the k closest examples (or, for numerical targets, typically their average).
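A minimal sketch of the classification case, assuming Euclidean distance and a small hypothetical two-dimensional dataset (the function name `knn_predict` and the data are illustrative, not part of any library):

```python
import math
from collections import Counter

def knn_predict(training_data, query, k=3):
    """Return the plurality label among the k training points closest to query."""
    # Sort every labeled example by its Euclidean distance to the query point.
    by_distance = sorted(training_data, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled points: (features, label)
training_data = [
    ((1.0, 1.0), "red"), ((1.5, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 5.5), "blue"), ((6.5, 7.0), "blue"),
]

print(knn_predict(training_data, (2.0, 2.0)))  # red
print(knn_predict(training_data, (6.0, 5.0)))  # blue
```

Note that there is no training step beyond storing the data; all of the work happens at query time, which is where the inefficiency comes from.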

#### Decision Trees

A **decision tree** is another supervised machine learning algorithm, built from nodes that test dataset *feature values*. It works by repeatedly splitting the data into subsets based on those feature values, branching downward into a tree-like structure. Each internal node represents a decision point, while each branch that stems from a node represents one outcome of that decision.

A node with no branches holds the final outcome and is referred to as a *leaf node*; the node with no branches leading to it is the *root node* and is the starting decision. While decision trees can handle both numerical and categorical data and are fairly easy to visualize, they are prone to overfitting when not regularized properly, for example by limiting the tree's depth or pruning it.
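To make the idea of a decision point concrete, the sketch below searches for the single best split for a root node. It assumes numerical features and uses Gini impurity as the split criterion, which is one common choice rather than the only one; the function names and the weather-style data are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels agree, larger when they are mixed."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((count / total) ** 2 for count in counts.values())

def best_split(rows, labels):
    """Return (impurity, feature_index, threshold) for the purest two-way split."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send data down both branches
            # Weighted average impurity of the two child nodes.
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Hypothetical data: (temperature, humidity) -> decision
rows = [(30, 85), (18, 90), (21, 70), (29, 65), (24, 95), (20, 60)]
labels = ["stay", "stay", "play", "play", "stay", "play"]

impurity, feature, threshold = best_split(rows, labels)
print(f"root node: feature {feature} <= {threshold} (impurity {impurity})")
```

A full tree would apply `best_split` recursively to each subset until the subsets are pure or a depth limit is reached; that depth limit is one of the regularization knobs mentioned above.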

#### Linear Regression

**Linear regression** is one of the most fundamental methods used for modeling a relationship between two or more variables, attempting to express that relationship as a linear equation.

This method determines the coefficients that minimize the *sum of squared differences* between the observed values and the values predicted by the model. It is very handy for interpreting trends within data and ultimately making predictions.
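For a single input variable, the minimizing coefficients have a well-known closed form, sketched below. The helper names and the sample points are hypothetical, and the example covers only the one-variable case.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_error(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical observations that roughly follow y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("sum of squared errors:", sum_squared_error(xs, ys, slope, intercept))
print("prediction for x = 6:", slope * 6 + intercept)
```

Nudging either coefficient away from the fitted value increases the sum of squared errors, which is what it means for the fit to be the least-squares solution.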

#### Clustering

Another relatively straightforward technique is **clustering**, particularly the *k-means method*. Like KNN, it relies on the distance between data points, but it is an unsupervised machine learning model: it groups data into k clusters based on shared characteristics, without any previously established labels or categories.

Other clustering algorithms, such as hierarchical clustering and DBSCAN, pursue the same goal but differ in how they measure similarity between data points and form groups.
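The k-means method can be sketched as a loop that alternates between assigning points to their nearest cluster center and moving each center to the mean of its points. The version below is a simplified illustration: the function name, the fixed iteration count, and the choice to pass in the starting centers explicitly are assumptions made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and updating centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]

# Assumed initialization: one starting centroid taken from each end of the list.
centroids, clusters = k_means(points, [points[0], points[3]])
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels appear anywhere in the input; the grouping comes entirely from the distances between points.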

#### Neural Networks

Among the more commonly used methods today are deep learning **neural network models**. These machine learning models, inspired by the function and structure of the brain, consist of interconnected layers of nodes, each with its own associated weights. Data passes through these layers and is transformed by mathematical functions along the way, which lets the network eventually extract complex patterns from datasets and make predictions.

These algorithms work particularly well for *large and high-dimensional data*, as found in image recognition, speech recognition, and language processing. A method called *backpropagation* computes how much each weight contributed to the error of the network's predictions so that the weights can be adjusted accordingly. With enough iterations and training, neural networks can reach very high accuracy in their predictions.
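The sketch below shows the forward pass and backpropagation on a deliberately tiny network: two inputs, one hidden layer of two nodes, and one output, trained on the logical OR function. The architecture, sigmoid activation, squared-error loss, learning rate, and all names are assumptions chosen to keep the example short; real networks are far larger and are normally built with a dedicated framework.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Pass one input through the hidden layer and then the output node."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
              for weights, b in zip(w_hidden, b_hidden)]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output

def mean_squared_error(data, params):
    return sum((forward(x, *params)[1] - y) ** 2 for x, y in data) / len(data)

def train(data, epochs=2000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
    b_hidden = [0.0, 0.0]
    w_out = [rng.uniform(-1, 1) for _ in range(2)]
    b_out = 0.0
    initial_loss = mean_squared_error(data, (w_hidden, b_hidden, w_out, b_out))
    for _ in range(epochs):
        for x, y in data:
            hidden, output = forward(x, w_hidden, b_hidden, w_out, b_out)
            # Backpropagation: error signal at the output (gradient of
            # 0.5 * (output - y)**2 through the sigmoid) ...
            delta_out = (output - y) * output * (1 - output)
            # ... pushed back through the output weights to each hidden node.
            delta_hidden = [delta_out * w_out[j] * hidden[j] * (1 - hidden[j])
                            for j in range(2)]
            # Gradient descent: move every weight against its gradient.
            for j in range(2):
                w_out[j] -= learning_rate * delta_out * hidden[j]
                b_hidden[j] -= learning_rate * delta_hidden[j]
                for i in range(2):
                    w_hidden[j][i] -= learning_rate * delta_hidden[j] * x[i]
            b_out -= learning_rate * delta_out
    params = (w_hidden, b_hidden, w_out, b_out)
    return params, initial_loss, mean_squared_error(data, params)

# Logical OR as a toy dataset: (inputs, target)
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
params, initial_loss, final_loss = train(or_data)
print("loss before training:", initial_loss)
print("loss after training:", final_loss)
for x, y in or_data:
    print(x, "target", y, "prediction", forward(x, *params)[1])
```

The same two ideas, a forward pass through layers and error signals propagated backward to update weights, carry over to much deeper networks.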

## Data Visualization

Data visualization is crucial for communicating insights effectively. It uses charts, graphs, and plots to represent data visually. Common types of data visualizations include: