READING NOTES

Normalization

Still reading
Points out drawbacks of 3'/5' control probes (saturation) and affyslope (background hybridization).
"The screening of nearly three thousand public available GeneChip array data suggests that there is noticeable degradation effect in the majority data files and that 2% of the files were even so severely degraded that their worth was questionable." (p.2)

Super famous paper
Proposes three normalization techniques and implements them in library(affy). Also points out drawbacks of two traditional normalization approaches.
- proposals
  1. Cyclic loess
     - based on the MA plot [Dudoit et al. 2002]
       - an MA plot is useful for comparing expression values between two samples
     - pairwise, hence time-consuming
     - loess (LOcal regrESSion)
  2. Contrast-based method
     - faster than cyclic loess
  3. Quantile normalization
     - based on the QQ-plot idea
     - simpler than the other two methods
     - Is the assumption that probe intensities have the same distribution on every array valid?
- traditional approaches
  1. Scaling method (Affymetrix's approach). This is readable
  2. Non-linear method ([Schadt et al. 2001, 2002] approach)
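
To make the quantile normalization idea concrete for myself, here is a minimal sketch in plain Python. It is not the library(affy) implementation: the function name `quantile_normalize` and the list-of-lists layout (one list of probe intensities per array) are my own assumptions, and ties are broken by probe position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution (hypothetical helper).

    arrays: list of equal-length lists, one list of probe intensities per array.
    """
    n_arrays = len(arrays)
    n_probes = len(arrays[0])

    # Sort each array's intensities independently.
    sorted_arrays = [sorted(a) for a in arrays]

    # Mean across arrays at each rank: the shared reference distribution.
    rank_means = [
        sum(a[r] for a in sorted_arrays) / n_arrays for r in range(n_probes)
    ]

    normalized = []
    for a in arrays:
        # Probe indices ordered by intensity (ties broken by position).
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Replace each value by the reference value at its rank.
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


chips = [[2.0, 4.0, 6.0], [30.0, 10.0, 20.0]]
normalized_chips = quantile_normalize(chips)
```

After this step every array holds the same set of values and only the ordering of probes within each array is kept, which is exactly why the "same distribution" assumption matters.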
"Baseline array" is the array having "the median of the median intensities"
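
A rough sketch of how the baseline choice could combine with the scaling method, under my own simplifying assumptions: `pick_baseline` and `scale_to_baseline` are hypothetical names, and a plain mean is used for the scaling factor, which may differ from what Affymetrix or the paper actually use.

```python
from statistics import mean, median


def pick_baseline(arrays):
    """Index of the array whose median intensity is the median of all medians."""
    medians = [median(a) for a in arrays]
    target = median(medians)
    # With an even number of arrays the median of medians may not belong to
    # any single array, so take the array whose median is closest to it.
    return min(range(len(arrays)), key=lambda i: abs(medians[i] - target))


def scale_to_baseline(arrays):
    """Scale every array so its mean intensity matches the baseline's mean."""
    base_mean = mean(arrays[pick_baseline(arrays)])
    return [[x * base_mean / mean(a) for x in a] for a in arrays]


chips = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [2.0, 4.0, 6.0]]
baseline_index = pick_baseline(chips)
scaled_chips = scale_to_baseline(chips)
```

Each array gets a single multiplicative factor, so the method is linear; that is the contrast with the non-linear and quantile approaches.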
Next paper:
- Hartemink 2001: comprehensive list of sources of obscuring variation
Questions:
- What is "MAS 5.0 Statistical algorithm" by Affymetrix?
  - seems to contain scaling method

Proposes IRON as an alternative to the MAS, quantile, and dChip normalization methods
- Implemented as a C program routine here
"Batch effects are as real as any biological signal, and are indistinguishable from biological signal without post-normalization interpretation of experiment-related metadata. As such, they are not suitable for removal by chip normalization methods."
"probeset summarization (an issue unique to Affymetrix arrays, where multiple probes for the same transcript are condensed into a single representative signal)"

"Sometimes, a probe has insufficient information to be mapped to any GeneID, and we recommend omitting these from further analysis (Step 19). "
"Simply averaging the expression profiles before proceeding is not desirable either, as different probe sequences have different binding affinity, giving rise to the problem of different measurement scales"

Permutation importance is found to be more reliable than Gini importance.
"The permutation variable importance measure – also referred to as the mean decrease in accuracy – reflects the average decrease in accuracy when destroying the association between a variable and the response by permuting the values of the variable."
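
A minimal sketch of the quoted definition, to check my understanding. Assumptions: `predict` is any already-fitted classifier taking one row, the data are held-out rows rather than the out-of-bag samples a random forest would use, and the names `accuracy` and `permutation_importance` are my own, not from any library.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean decrease in accuracy after shuffling each variable in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j to break its association with the response.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data: variable 0 determines the label, variable 1 is unrelated.
X = [[i / 10, ((i * 7) % 10) / 10] for i in range(10)]
y = [int(row[0] > 0.5) for row in X]
scores = permutation_importance(lambda row: int(row[0] > 0.5), X, y)
```

In this toy setup the classifier ignores variable 1, so its score is exactly zero, while variable 0 gets a positive score. With a real model the uninformative variables scatter around zero instead.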
Random forest importance scores have no natural cutoff, unlike hypothesis testing, where a significance level provides one.
"It is clear that predictor variables whose importance score is negative or zero are likely to have no predictive ability. However, for the predictor variables with positive importance score it is difficult to say which importance scores are large enough so that it is unlikely that these have occurred by chance."
Random forests can deal with nonlinear effects and interactions. See Boulesteix et al. 2015