v2.0.1 - overhauled KNN error estimator

IftachSadeh · Feb 10, 2015 · 3b676ac
1 parent 94c52d8
commit 3b676ac
Showing 29 changed files with 888 additions and 697 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,18 @@
-# ANNZ 2.0.0 (26/2/2015)
+# Changelog
+
+## ANNZ 2.0.1 (10/2/2015)
+
+The following changes were made:
+
+- **Modified the way in which the KNN error estimator works:**
+In the previous version, the errors were generated by looping, for each object, over the *n* near-neighbors in the training dataset. For a given object, this loop over all neighbors was repeated for each of the MLMs.
+In the revised version, MLM response values for the entire training dataset are estimated once; this is done before the loop on the objects begins, with the results stored in a dedicated tree (see `ANNZ::createTreeErrKNN()`). This tree is then read in during the loop over the objects for which the errors are generated. In this implementation, the KNN neighbor search is done once for all MLMs, and the errors are estimated simultaneously for all of them. This avoids both the unnecessary repeated calculation of MLM outputs and the redundant searches for the *n* near-neighbors of the same object.
+
+- **Name of evaluation subdirectory:**
+Added the variable `evalDirPostfix`, which allows one to modify the name of the evaluation subdirectory. Different input files can now be evaluated simultaneously, without overwriting previous results. The example scripts have been modified accordingly.
+
+- Various small modifications.
+
+## ANNZ 2.0.0 (26/1/2015)
 
 First version (v2.0.0) of the new implementation of the machine learning code, ANNz.
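
The restructuring described in the 2.0.1 entry can be illustrated with a short Python sketch. This is not the ANNZ implementation (which is C++ and uses the TMVA kNN module); the function name `knn_errors_all_mlms`, the brute-force neighbor search, and the choice of an RMS of "response minus target" over the neighbors as the error measure are all assumptions made for illustration. The sketch only shows the ordering of operations: responses for the training set are computed once, each object triggers a single neighbor search, and the errors for all MLMs are derived from that one set of neighbors.

```python
import numpy as np

def knn_errors_all_mlms(train_inputs, train_targets, train_responses, eval_inputs, n_near):
    """Illustrative only; the names and the RMS error definition are assumptions.

    train_inputs    : (n_train, n_vars) input parameters of the training objects
    train_targets   : (n_train,) true regression values
    train_responses : (n_train, n_mlms) MLM outputs, computed once beforehand
    eval_inputs     : (n_eval, n_vars) objects for which errors are wanted
    """
    # residuals for every training object and every MLM, computed once
    residuals = train_responses - train_targets[:, None]
    errors = np.empty((eval_inputs.shape[0], train_responses.shape[1]))
    for i, obj in enumerate(eval_inputs):
        # a single neighbor search per object (brute force here), shared by all MLMs
        dist2 = np.sum((train_inputs - obj) ** 2, axis=1)
        near = np.argsort(dist2)[:n_near]
        # errors for all MLMs at once, from the same set of neighbors
        errors[i] = np.sqrt(np.mean(residuals[near] ** 2, axis=0))
    return errors

# mock data: two MLMs, one with small scatter and one with larger scatter
rng = np.random.default_rng(0)
train_inputs = rng.normal(size=(200, 3))
train_targets = rng.uniform(0.0, 1.0, size=200)
train_responses = np.column_stack([
    train_targets + rng.normal(scale=0.01, size=200),
    train_targets + rng.normal(scale=0.10, size=200),
])
eval_inputs = rng.normal(size=(5, 3))

errs = knn_errors_all_mlms(train_inputs, train_targets, train_responses, eval_inputs, n_near=50)
print(errs.shape)  # one row per evaluated object, one column per MLM
```

In the old ordering, the neighbor search and the MLM evaluation of the neighbors would both sit inside a loop over MLMs; here neither is repeated per MLM.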
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# ANNZ 2.0.0
+# ANNZ 2.0.1
 
 ## Introduction
 ANNZ uses both regression and classification techniques for estimation of single-value photo-z (or any regression problem) solutions and PDFs. In addition it is suitable for classification problems, such as star/galaxy classification.
@@ -120,7 +120,7 @@ python scripts/annz_singleReg_quick.py --make --clean
 
 The various example scripts includes comments about the different variables which the user needs to set. Each operational mode has a *quickstart* dedicated script as well as an *advanced* script. The latter include more job-options as well as more detailed documentation.
 
-For each of the following, please use follow the four respective steps (generation, training, optimization/verification, evaluation) in sequence. For instance, for single regression, do:
+For each of the following, please follow the four respective steps (generation, training, optimization/verification, evaluation) in sequence. For instance, for single regression, do:
 ```bash
 python scripts/annz_singleReg_quick.py --singleRegression --genInputTrees
 python scripts/annz_singleReg_quick.py --singleRegression --train
@@ -332,7 +332,7 @@ There are two ways to define the PDF bins:
 
   1. The PDF is defined within `nPDFbins` equal-width bins between `minValZ` and `maxValZ` (the minimal and maximal defined values of the regression target). The `nPDFbins`, `minValZ` and `maxValZ` variables are mandatory settings for ANNZ, as defined in the example scripts.
 
-  2. A specific set of bins of arbitrary with may defined by setting the variable, `userPdfBins` (in which case `nPDFbins` is ignored). The only constrain is that the first and last bin edges be within the range defined by `minValZ` and `maxValZ`. This can e.g., be
+  2. A specific set of bins of arbitrary width may be defined by setting the variable, `userPdfBins` (in which case `nPDFbins` is ignored). The only constraint is that the first and last bin edges be within the range defined by `minValZ` and `maxValZ`. This can e.g., be
   ```python
   glob.annz["userPdfBins"] = "0.05;0.1;0.2;0.24;0.3;0.52;0.6;0.7;0.8"
   ```
@@ -422,12 +422,24 @@ A few notes:
 
   - It is possible to train/optimize MLMs using specific cuts and/or weights, based on any mathematical expression which uses the variables defined in the input dataset (not limited to the variables used for the training). The relevant variables are `userCuts_train`, `userCuts_valid`, `userWeights_train` and `userWeights_valid`. See the advanced scripts for use-examples.
 
-  - The syntax for math expressions is defined using the ROOT conventions (see e.g., [TMath](https://root.cern.ch/root/html/TMath.html) and [TFormula](https://root.cern.ch/root/html/TFormula.html)). Acceptable expressions may for instance be the following ridiculous choice:
+  - The syntax for math expressions is defined using the ROOT conventions (see e.g., [TMath](https://root.cern.ch/root/html/TMath.html) and [TFormula](https://root.cern.ch/root/html/TFormula.html)). Acceptable expressions may for instance include the following ridiculous choice:
   ```python
   glob.annz["userCuts_train"]    = "(MAG_R > 22)/MAG_R + (MAG_R <= 22)*1"
   glob.annz["userCuts_valid"]    = "pow(MAG_G,3) + exp(MAG_R)*MAG_I/20. + abs(sin(MAG_Z))"
   ```
 
+  - By default, the output of evaluation is written to a subdirectory named `eval` in the output directory. An output file may e.g., be `output/test_randReg_quick/regres/optim/eval/ANNZ_randomReg_0000.csv`. It is possible to set the `evalDirPostfix` variable in order to change this. For instance, setting
+  ```python
+  glob.annz["evalDirPostfix"] = "cat0"
+  ```
+  will produce the same output file at `output/test_randReg_quick/regres/optim/eval_cat0/ANNZ_randomReg_0000.csv`. This may be used in order to run the evaluation on multiple input files simultaneously without overwriting previous results.
+
+  - There are several parameters used to tune PDFs in randomized regression. Here are a couple of principal examples:
+
+    - **`minPdfWeight` -** may be used to set a minimal weight for an MLM in the PDF. For instance, setting `minPdfWeight=0.05` will ensure that each MLM will have at least 5% relative significance in the PDF. That is, in this case, no more than 20 MLMs will be used for the PDF.
+
+    - **`max_sigma68_PDF`, `max_bias_PDF`, `max_frac68_PDF` -** may be set to put a threshold on the maximal value of the scatter (`max_sigma68_PDF`), bias (`max_bias_PDF`) or outlier-fraction (`max_frac68_PDF`) of an MLM that may be included in the PDF. For instance, setting `max_sigma68_PDF = 0.05` will ensure that any MLM which has scatter higher than `0.05` will not be included in the PDF.
+
   - By default, a progress bar is drawn during training. If one is writing the output to a log file, the progress bar is important to avoid, as it will cause the size of the log file to become very large. One can either add `--isBatch` while running the example scripts, or set in `generalSettings.py` (or elsewhere),
   ```python
   glob.annz["isBatch"] = True
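
The arithmetic behind the `minPdfWeight` and `max_sigma68_PDF` notes in the README hunk above can be sketched as follows. The helper names, the metric dictionary keys, the MLM labels, and the use of the absolute value for the bias comparison are hypothetical; the actual selection logic inside ANNZ is not shown in this commit and may differ.

```python
import math

def max_mlms_for_min_weight(min_pdf_weight):
    # Relative weights sum to 1, so at most floor(1/w) MLMs can each hold
    # a weight of at least w (small tolerance guards against float rounding).
    return math.floor(1.0 / min_pdf_weight + 1e-9)

def passes_pdf_thresholds(metrics, max_sigma68=None, max_bias=None, max_frac68=None):
    # An MLM is rejected if any metric exceeds a threshold which has been set.
    # Comparing |bias| to the threshold is an assumption of this sketch.
    if max_sigma68 is not None and metrics["sigma68"] > max_sigma68:
        return False
    if max_bias is not None and abs(metrics["bias"]) > max_bias:
        return False
    if max_frac68 is not None and metrics["frac68"] > max_frac68:
        return False
    return True

# hypothetical per-MLM performance metrics
mlms = {
    "mlm_a": {"sigma68": 0.03, "bias": 0.001, "frac68": 0.02},
    "mlm_b": {"sigma68": 0.07, "bias": 0.002, "frac68": 0.04},
}
kept = [name for name, m in mlms.items() if passes_pdf_thresholds(m, max_sigma68=0.05)]

print(max_mlms_for_min_weight(0.05))
print(kept)
```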

diff --git a/examples/scripts/annz_binCls_advanced.py b/examples/scripts/annz_binCls_advanced.py
@@ -469,7 +469,10 @@
       # inAsciiVars - list of parameters in the input files (doesnt need to be exactly the same as in doGenInputTrees, but must contain all
       #               of the parameers which were used for training)
       glob.annz["inAsciiVars"]    = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z"
-
+      # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+      #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+      glob.annz["evalDirPostfix"] = ""
+
       # run ANNZ with the current settings
       runANNZ()
 

diff --git a/examples/scripts/annz_binCls_quick.py b/examples/scripts/annz_binCls_quick.py
@@ -136,7 +136,10 @@
       # inAsciiVars - list of parameters in the input files (doesnt need to be exactly the same as in doGenInputTrees, but must contain all
       #               of the parameers which were used for training)
       glob.annz["inAsciiVars"]    = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z"
-
+      # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+      #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+      glob.annz["evalDirPostfix"] = ""
+
       # run ANNZ with the current settings
       runANNZ()
 

diff --git a/examples/scripts/annz_rndCls_advanced.py b/examples/scripts/annz_rndCls_advanced.py
@@ -282,6 +282,9 @@
     #               of the parameers which were used for training)
     glob.annz["inAsciiVars"]  = "C:class; F:z; UL:objid; F:psfMag_r; F:fiberMag_r; F:modelMag_r; F:petroMag_r; F:petroRad_r; F:petroR50_r; " \
                               + " F:petroR90_r; F:lnLStar_r; F:lnLExp_r; F:lnLDeV_r; F:mE1_r; F:mE2_r; F:mRrCc_r; I:type_r; I:type"
+    # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+    #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+    glob.annz["evalDirPostfix"] = ""
 
     # ===========================================================================================================
     # MLMsToStore - 

diff --git a/examples/scripts/annz_rndCls_quick.py b/examples/scripts/annz_rndCls_quick.py
@@ -107,6 +107,9 @@
     #               of the parameers which were used for training)
     glob.annz["inAsciiVars"]  = "C:class; F:z; UL:objid; F:psfMag_r; F:fiberMag_r; F:modelMag_r; F:petroMag_r; F:petroRad_r; F:petroR50_r; " \
                               + " F:petroR90_r; F:lnLStar_r; F:lnLExp_r; F:lnLDeV_r; F:mE1_r; F:mE2_r; F:mRrCc_r; I:type_r; I:type"
+    # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+    #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+    glob.annz["evalDirPostfix"] = ""
 
     # run ANNZ with the current settings
     runANNZ()

diff --git a/examples/scripts/annz_rndReg_advanced.py b/examples/scripts/annz_rndReg_advanced.py
@@ -381,6 +381,9 @@
     # inAsciiVars - list of parameters in the input files (doesnt need to be exactly the same as in doGenInputTrees, but must contain all
     #               of the parameers which were used for training)
     glob.annz["inAsciiVars"]    = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z"
+    # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+    #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+    glob.annz["evalDirPostfix"] = "nFile0"
 
     # run ANNZ with the current settings
     runANNZ()
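
As an illustration of how `evalDirPostfix` keeps results apart, the sketch below builds a distinct evaluation directory per input file, following the naming shown in the README hunk (`eval` by default, `eval_cat0` for the postfix `cat0`). The helper `eval_subdir`, the input file names, and the `nFile<n>` numbering scheme are hypothetical; the sketch does not call ANNZ, it only shows which directory each run would be expected to write to.

```python
def eval_subdir(postfix=""):
    # README: the default subdirectory is "eval"; with the postfix "cat0" it is "eval_cat0"
    return "eval" if postfix == "" else "eval_" + postfix

base_dir = "output/test_randReg_quick/regres/optim"
input_files = ["catalog_0.csv", "catalog_1.csv"]  # hypothetical input files

plan = {}
for n_file, file_name in enumerate(input_files):
    postfix = "nFile%d" % n_file  # the value one would assign to glob.annz["evalDirPostfix"]
    plan[file_name] = base_dir + "/" + eval_subdir(postfix)

for file_name, out_dir in plan.items():
    print(file_name, "->", out_dir)
```

Since every input file maps to its own subdirectory, the evaluations can run at the same time without one overwriting the output of another.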

diff --git a/examples/scripts/annz_rndReg_quick.py b/examples/scripts/annz_rndReg_quick.py
@@ -113,6 +113,9 @@
     # inAsciiVars - list of parameters in the input files (doesnt need to be exactly the same as in doGenInputTrees, but must contain all
     #               of the parameers which were used for training)
     glob.annz["inAsciiVars"]    = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z"
+    # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+    #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+    glob.annz["evalDirPostfix"] = ""
 
     # run ANNZ with the current settings
     runANNZ()

diff --git a/examples/scripts/annz_singleCls_quick.py b/examples/scripts/annz_singleCls_quick.py
@@ -107,6 +107,9 @@
     #               of the parameers which were used for training)
     glob.annz["inAsciiVars"]  = "C:class; F:z; UL:objid; F:psfMag_r; F:fiberMag_r; F:modelMag_r; F:petroMag_r; F:petroRad_r; F:petroR50_r; " \
                               + " F:petroR90_r; F:lnLStar_r; F:lnLExp_r; F:lnLDeV_r; F:mE1_r; F:mE2_r; F:mRrCc_r; I:type_r; I:type"
+    # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+    #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+    glob.annz["evalDirPostfix"] = ""
 
     # run ANNZ with the current settings
     runANNZ()

diff --git a/examples/scripts/annz_singleReg_quick.py b/examples/scripts/annz_singleReg_quick.py
@@ -103,6 +103,9 @@
   # inAsciiVars - list of parameters in the input files (doesnt need to be exactly the same as in doGenInputTrees, but must contain all
   #               of the parameers which were used for training)
   glob.annz["inAsciiVars"]    = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z"
+  # evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
+  #                  (can be used to prevent multiple evaluations of different input files from overwriting each other)
+  glob.annz["evalDirPostfix"] = ""
 
   # run ANNZ with the current settings
   runANNZ()

diff --git a/examples/scripts/generalSettings.py b/examples/scripts/generalSettings.py
@@ -71,15 +71,8 @@ def generalSettings():
   # the uncertainty on the input parameters to the MLM-estimator. See getRegClsErrINP()
   # glob.annz["nErrINP"] = -1
 
-  # kNNErrMaxDifZ,nkNNErrMin -
-  #   - improve KNN error calculation by only considering neighbours which are "close" (in the input parameter
-  #     space) to the target object (for which the error is calculated).
-  #   - if the nth neighbour has a regression value which is different by kNNErrMaxDifZ compared to the original
-  #     target object, and at least nkNNErrMin neighbours have already been consudered, then the nth neighbour is
-  #     ignored in the KNN error calculation
-  # -----------------------------------------------------------------------------------------------------------
-  # glob.annz["kNNErrMaxDifZ"] = -1
-  # glob.annz["nkNNErrMin"]    = 50
+  # maximal number of objects in a tree/output ascii file
+  # glob.annz["nObjectsToWrite"] = 1e6
 
   return
 

diff --git a/examples/scripts/helperFuncs.py b/examples/scripts/helperFuncs.py
@@ -102,8 +102,8 @@ def initParse():
   glob.annz["doRegression"]     = glob.annz["doSingleReg"] or glob.annz["doRandomReg"] or glob.annz["doBinnedCls"]
   glob.annz["doClassification"] = glob.annz["doSingleCls"] or glob.annz["doRandomCls"]
 
-  glob.annz["maxNobj"]          = int(floor(glob.pars["maxNobj"]))    # limit number of used objects - used for debugging
-  glob.annz["trainIndex"]       = int(floor(glob.pars["trainIndex"])) # used for python batch-job submision
+  glob.annz["maxNobj"]          = int(floor(glob.pars["maxNobj"])) # limit number of used objects - used for debugging
+  glob.annz["trainIndex"]       = glob.pars["trainIndex"]          # used for python batch-job submission
 
   glob.annz["doFitsToAscii"]    = glob.pars["fitsToAscii"]
   glob.annz["doAsciiToFits"]    = glob.pars["asciiToFits"]

diff --git a/include/ANNZ.hpp b/include/ANNZ.hpp
@@ -79,6 +79,8 @@ class ANNZ : public BaseClass {
   TString  getTagPdfAvgName(int nPdfNow = -1, TString type = "");
   TString  getTagBestMLMname(TString MLMname = "");
   int      getTagNow(TString MLMname);
+  TString  getErrKNNname(int nMLMnow = -1);
+  int      getErrKNNtagNow(TString errKNNname);
   TString  getKeyWord(TString MLMname, TString sequence, TString key);
   void     loadOptsMLM();
   void     setNominalParams(int nMLMnow, TString inputVariables, TString inputVarErrors);
@@ -104,9 +106,9 @@ class ANNZ : public BaseClass {
   // -----------------------------------------------------------------------------------------------------------
   // ANNZ_TMVA.cpp :
   // -----------------------------------------------------------------------------------------------------------
-  void              prepFactory(int nANNZnow = -1, TMVA::Factory * factory = NULL);
+  void              prepFactory(int nMLMnow = -1, TMVA::Factory * factory = NULL);
   void              doFactoryTrain(TMVA::Factory * factory);
-  void              clearReaders();
+  void              clearReaders(Log::LOGtypes logLevel = Log::DEBUG_1);
   void              loadReaders(map <TString,bool> & mlmSkipNow);
   double            getReader(VarMaps * var = NULL, ANNZ_readType readType = ANNZ_readType::NUN, bool forceUpdate = false, int nMLMnow = -1);
   void              setupTypesTMVA();
@@ -116,14 +118,14 @@ class ANNZ : public BaseClass {
   // -----------------------------------------------------------------------------------------------------------
   // ANNZ_err.cpp :
   // -----------------------------------------------------------------------------------------------------------
-  void     setupKdTreeKNN(TChain * aChain, TCut cutsAll, int nANNZnow, TFile *& knnErrOutFile, TMVA::Factory *& knnErrFactory,
-                          TMVA::kNN::ModulekNN *& knnErrModule, TCut cutsSig = "", TCut cutsBck = "",
-                          TString wgtReg = "1", TString wgtSig = "1", TString wgtBck = "1");
+  void     createTreeErrKNN(int nMLMnow);
+  void     setupKdTreeKNN(TChain * aChainKnn, TFile *& knnErrOutFile, TMVA::Factory *& knnErrFactory, TMVA::kNN::ModulekNN *& knnErrModule,
+                          vector <int> & trgIndexV, int nMLMnow, TCut cutsAll, TString wgtAll);
   void     cleanupKdTreeKNN(TFile *& knnErrOutFile, TMVA::Factory *& knnErrFactory, bool verb = false);
-  double   getRegClsErrKNN(VarMaps * var = NULL, ANNZ_readType readType = ANNZ_readType::NUN,
-                           int nMLMnow = -1, TMVA::kNN::ModulekNN * knnErrModule = NULL, vector <double> * zErrV = NULL);
-  double   getRegClsErrINP(VarMaps * var = NULL, ANNZ_readType readType = ANNZ_readType::NUN,
-                           int nMLMnow = -1, UInt_t * seedP = NULL, vector <double> * zErrV = NULL);
+  void     getRegClsErrKNN(VarMaps * var, TMVA::kNN::ModulekNN * knnErrModule, vector <int> & trgIndexV,
+                           vector <int> & nMLMv, bool isREG, vector < vector <double> > & zErrV);
+
+  double   getRegClsErrINP(VarMaps * var, bool isREG, int nMLMnow, UInt_t * seedP = NULL, vector <double> * zErrV = NULL);
 
   // -----------------------------------------------------------------------------------------------------------
   // ANNZ_loopRegCls.cpp :
@@ -159,7 +161,7 @@ class ANNZ : public BaseClass {
   // private variables
   // ===========================================================================================================
   vector < Double_t >                   zClos_binE, zClos_binC, zPlot_binE, zPlot_binC, zPDF_binE, zPDF_binC, zBinCls_binE, zBinCls_binC;
-  vector < TString >                    mlmTagName, mlmTagWeight, mlmTagClsVal, mlmTagIndex, inputVariableV;
+  vector < TString >                    mlmTagName, mlmTagWeight, mlmTagClsVal, mlmTagIndex, mlmTagErrKNN, inputVariableV;
   vector < map <TString,TString> >      mlmTagErr;
   vector < vector <TString> >           pdfBinNames, inErrTag;
   vector < map < TString,TString> >     pdfAvgNames;

diff --git a/include/Utils.hpp b/include/Utils.hpp
@@ -118,7 +118,7 @@ class Utils {
 
   void    checkPathPrefix(TString pathName = "");
   void    validDirExists(TString dirName = "", bool verbose = false);
-  bool    validFileExists(TString fileName = "", bool verif = true, bool verbose = false);
+  bool    validFileExists(TString fileName = "", bool verif = true);
   void    resetDirectory(TString OutDirName = "", bool verbose = false, bool copyCode = false);
   void    checkCmndSafety(TString cmnd = "", bool verbose = false);
   void    safeRM(TString cmnd = "", bool verbose = false);
@@ -130,7 +130,7 @@ class Utils {
   bool    isSameWeightExpr(TString wgt0, TString wgt1);
 
   int     getNlinesAsciiFile(TString fileName, bool checkNonEmpty = true);
-  int     getNlinesAsciiFile(vector<TString> & fileNameV, bool checkNonEmpty = true);
+  int     getNlinesAsciiFile(vector<TString> & fileNameV, bool checkNonEmpty = true, vector <int> * nLineV = NULL);
 
   void    findObjPatternInCurrentDir(vector <TString> & patternV, vector <TString> & matchedObjV, TString clasType = "");
   void    getSortedArray(double * data, double *& sortedData);

diff --git a/include/commonInclude.hpp b/include/commonInclude.hpp
@@ -213,8 +213,8 @@ namespace Log {
       exit(1); \
     } } while(false)
 
-    #define LINE_FILL(charFill,len) \
-      std::setfill(charFill)<<std::setw(len)<<""<<std::setfill(' ')
+  #define LINE_FILL(charFill,len) \
+    std::setfill(charFill)<<std::setw(len)<<""<<std::setfill(' ')
 
 #endif // __MY_DEFINES__
 

diff --git a/src/ANNZ.cpp b/src/ANNZ.cpp
@@ -42,7 +42,7 @@ ANNZ::~ANNZ() {
   mlmTagIndex.clear();  mlmSkip.clear();         pdfBinNames.clear();    pdfAvgNames.clear();
   trainTimeM.clear();   inputVariableV.clear();  inErrTag.clear();       readerInptIndexV.clear();
   zPDF_binE.clear();    zPDF_binC.clear();       zPlot_binE.clear();     zPlot_binC.clear();
-  inNamesVar.clear();   inNamesErr.clear();      userWgtsM.clear();
+  inNamesVar.clear();   inNamesErr.clear();      userWgtsM.clear();      mlmTagErrKNN.clear();
   zClos_binE.clear();   zClos_binC.clear();      zBinCls_binE.clear();   zBinCls_binC.clear();
   typeMLM.clear();      allANNZtypes.clear();    typeToNameMLM.clear();  nameToTypeMLM.clear();
   bestMLMname.clear();