Update data-analysis.md #28

Open
wants to merge 1 commit into
base: master
4 changes: 3 additions & 1 deletion data-analysis.md
@@ -2,10 +2,12 @@

#### 1. (Given a Dataset) Analyze this dataset and tell me what you can learn from it.
#### 2. What is R2? What are some other metrics that could be better than R2 and why?
- For a linear regression with an intercept, R2 is the square of the correlation between the observed target variable and the predicted target variable
- it is a goodness-of-fit measure: variance explained by the regression / total variance
- the more predictors you add the higher R^2 becomes.
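- the more predictors you add, the higher R^2 becomes on the training data (it never decreases). It is therefore always biased toward models with more features
- hence use adjusted R^2, which adjusts for the degrees of freedom
- or error metrics computed on held-out data (e.g. validation or test MSE), since training error also keeps improving as predictors are added
- Akaike information criterion (AIC), which penalizes the model for having more parameters. When comparing models fitted to the same data, a larger value indicates a worse trade-off between fit and complexity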
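
A minimal sketch of the points above, assuming NumPy and synthetic data: it fits ordinary least squares with and without five pure-noise predictors and compares R^2, adjusted R^2, and a Gaussian AIC. The helper names (`fit_ols`, `r2_score`, `adjusted_r2`, `gaussian_aic`) are hypothetical, and the AIC here is computed only up to an additive constant, counting the intercept and slopes as parameters (one common convention among several).

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns predictions and coefficient count."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta, A.shape[1]

def r2_score(y, y_hat):
    # 1 - residual sum of squares / total sum of squares
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(r2, n, p):
    # p = number of predictors, excluding the intercept
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def gaussian_aic(y, y_hat, k):
    # AIC up to an additive constant under Gaussian errors: n * ln(RSS / n) + 2k
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

# Synthetic data (an assumption for illustration): y depends on one predictor only.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)
noise = rng.normal(size=(n, 5))  # predictors unrelated to y

X_small = x
X_big = np.column_stack([x, noise])

results = {}
for name, X in [("small", X_small), ("big", X_big)]:
    y_hat, k = fit_ols(X, y)
    r2 = r2_score(y, y_hat)
    results[name] = {
        "r2": r2,
        "adj_r2": adjusted_r2(r2, n, X.shape[1]),
        "aic": gaussian_aic(y, y_hat, k),
    }
    print(name, results[name])
```

The R^2 of the larger model can never be lower than that of the nested smaller model on the training data, whereas adjusted R^2 and AIC may move in either direction because they charge for the extra parameters.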
#### 3. What is the curse of dimensionality?
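- High dimensionality makes clustering hard, because with many dimensions all points end up "far away" from one another and distances become less informative.
- For example, to cover a fixed fraction of the volume of the data, we need to span an increasingly wide range of each variable as the number of variables grows. With data spread uniformly over a unit hypercube, covering 10% of the volume takes about 79% of each variable's range in 10 dimensions.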
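
Both effects can be illustrated with a short sketch, assuming NumPy and points drawn uniformly from a unit hypercube (an assumption made only for this example). The hypothetical helper `edge_length` gives the side of a sub-cube that holds a given fraction of the volume, and `relative_contrast` measures how much the farthest point differs from the nearest one relative to the nearest distance.

```python
import numpy as np

def edge_length(fraction, d):
    """Side length of a sub-cube covering `fraction` of a unit hypercube in d dimensions."""
    return fraction ** (1.0 / d)

def relative_contrast(d, n_points=500, seed=0):
    """(max distance - min distance) / min distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Range of each variable needed to capture 10% of the volume.
for d in (1, 2, 10, 100):
    print(d, round(edge_length(0.1, d), 3))

# Distance contrast for increasing dimensionality.
contrasts = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(d, c)
```

As the dimension grows, the required edge length approaches 1 and the relative contrast shrinks, which is one way to see why nearest-neighbour and clustering methods struggle in high dimensions.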