v2.0.1 - overhauled KNN error estimator
IftachSadeh committed Feb 10, 2015
1 parent 94c52d8 commit 3b676ac
Showing 29 changed files with 888 additions and 697 deletions.
17 changes: 16 additions & 1 deletion CHANGELOG.md
@@ -1,3 +1,18 @@
# ANNZ 2.0.0 (26/2/2015)
# Changelog

## ANNZ 2.0.1 (10/2/2015)

The following changes were made:

- **Modified the way in which the KNN error estimator works:**
In the previous version, the errors were generated by looping, for each object, over the *n* near-neighbors in the training dataset. For a given object, this was repeated for all neighbors and for each of the MLMs.
In the revised version, MLM response values for the entire training dataset are estimated once; this is done before the loop over the objects begins, with the results stored in a dedicated tree (see `ANNZ::createTreeErrKNN()`). This tree is then read in during the loop over the objects for which the errors are generated. In this implementation, the KNN neighbor search is done once for all MLMs, and the errors are estimated simultaneously for all of them. This avoids both the unnecessary repeated calculation of MLM outputs and the redundant searches for the *n* near-neighbors of the same object.

- **Name of evaluation subdirectory:**
Added the variable `evalDirPostfix`, which allows the name of the evaluation subdirectory to be modified. Different input files can now be evaluated simultaneously, without overwriting previous results. The example scripts have been modified accordingly.

- Various small modifications.
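
The revised KNN error flow described in the 2.0.1 notes can be illustrated with a minimal Python sketch. It is not the ANNZ implementation: the function names, the brute-force neighbor search and the use of an RMS of the residuals as the error measure are all assumptions made for illustration. It only shows the structure of the change, namely that MLM responses are computed once for the training set and that a single neighbor search serves all MLMs.

```python
import math

def precompute_responses(mlms, train_inputs):
    """Evaluate every MLM once on every training object (before the object loop)."""
    return [[mlm(x) for x in train_inputs] for mlm in mlms]

def nearest_neighbors(point, train_inputs, n_near):
    """Brute-force search for the indices of the n_near closest training objects."""
    order = sorted(range(len(train_inputs)),
                   key=lambda i: math.dist(point, train_inputs[i]))
    return order[:n_near]

def knn_errors(point, train_inputs, train_targets, responses, n_near):
    """One neighbor search, followed by an error estimate for every MLM."""
    neighbors = nearest_neighbors(point, train_inputs, n_near)
    errors = []
    for mlm_responses in responses:
        # residual of the stored MLM response with respect to the true target
        residuals = [mlm_responses[i] - train_targets[i] for i in neighbors]
        # assumption: RMS of the residuals stands in for the actual error measure
        errors.append(math.sqrt(sum(r * r for r in residuals) / len(residuals)))
    return errors

# hypothetical toy data: one input variable, two "MLMs" given as plain functions
train_inputs = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_targets = [0.0, 1.0, 2.0, 3.0]
mlms = [lambda x: x[0], lambda x: x[0] + 0.5]

responses = precompute_responses(mlms, train_inputs)
errors = knn_errors((1.2,), train_inputs, train_targets, responses, n_near=2)
```

In this toy case the first function reproduces the target exactly and the second is offset by a constant, so the two error estimates differ even though they share the same neighbor list.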

## ANNZ 2.0.0 (26/1/2015)

First version (v2.0.0) of the new implementation of the machine learning code, ANNz.
20 changes: 16 additions & 4 deletions README.md
@@ -1,4 +1,4 @@
# ANNZ 2.0.0
# ANNZ 2.0.1

## Introduction
ANNZ uses both regression and classification techniques for estimation of single-value photo-z (or any regression problem) solutions and PDFs. In addition it is suitable for classification problems, such as star/galaxy classification.
@@ -120,7 +120,7 @@ python scripts/annz_singleReg_quick.py --make --clean

The various example scripts include comments about the different variables which the user needs to set. Each operational mode has a dedicated *quickstart* script as well as an *advanced* script. The latter include more job-options as well as more detailed documentation.

For each of the following, please use follow the four respective steps (generation, training, optimization/verification, evaluation) in sequence. For instance, for single regression, do:
For each of the following, please follow the four respective steps (generation, training, optimization/verification, evaluation) in sequence. For instance, for single regression, do:
```bash
python scripts/annz_singleReg_quick.py --singleRegression --genInputTrees
python scripts/annz_singleReg_quick.py --singleRegression --train
@@ -332,7 +332,7 @@ There are two ways to define the PDF bins:

1. The PDF is defined within `nPDFbins` equal-width bins between `minValZ` and `maxValZ` (the minimal and maximal defined values of the regression target). The `nPDFbins`, `minValZ` and `maxValZ` variables are mandatory settings for ANNZ, as defined in the example scripts.

2. A specific set of bins of arbitrary with may defined by setting the variable, `userPdfBins` (in which case `nPDFbins` is ignored). The only constrain is that the first and last bin edges be within the range defined by `minValZ` and `maxValZ`. This can e.g., be
2. A specific set of bins of arbitrary width may be defined by setting the variable `userPdfBins` (in which case `nPDFbins` is ignored). The only constraint is that the first and last bin edges be within the range defined by `minValZ` and `maxValZ`. This can, e.g., be
```python
glob.annz["userPdfBins"] = "0.05;0.1;0.2;0.24;0.3;0.52;0.6;0.7;0.8"
```
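
The constraint on `userPdfBins` can be checked before a job is submitted. The following is a minimal sketch under stated assumptions: `parse_user_pdf_bins` is a hypothetical helper (not part of ANNZ), the requirement that edges be strictly increasing is an assumption, and the values used for `minValZ` and `maxValZ` are illustrative only.

```python
def parse_user_pdf_bins(user_pdf_bins, min_val_z, max_val_z):
    """Split a ';'-separated list of bin edges and check it against the allowed range."""
    edges = [float(token) for token in user_pdf_bins.split(";") if token.strip()]
    if len(edges) < 2:
        raise ValueError("at least two bin edges are needed to define a bin")
    # assumption: edges are expected in strictly increasing order
    if any(lo >= hi for lo, hi in zip(edges, edges[1:])):
        raise ValueError("bin edges must be strictly increasing")
    # constraint from the README: first and last edges within [minValZ, maxValZ]
    if edges[0] < min_val_z or edges[-1] > max_val_z:
        raise ValueError("bin edges must lie within [minValZ, maxValZ]")
    return edges

# the example string from the README, with illustrative range limits
edges = parse_user_pdf_bins("0.05;0.1;0.2;0.24;0.3;0.52;0.6;0.7;0.8", 0.0, 0.8)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
```

Nine edges define eight bins of unequal width, which is the case that the equal-width `nPDFbins` option cannot express.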
@@ -422,12 +422,24 @@ A few notes:

- It is possible to train/optimize MLMs using specific cuts and/or weights, based on any mathematical expression which uses the variables defined in the input dataset (not limited to the variables used for the training). The relevant variables are `userCuts_train`, `userCuts_valid`, `userWeights_train` and `userWeights_valid`. See the advanced scripts for use-examples.

- The syntax for math expressions is defined using the ROOT conventions (see e.g., [TMath](https://root.cern.ch/root/html/TMath.html) and [TFormula](https://root.cern.ch/root/html/TFormula.html)). Acceptable expressions may for instance be the following ridiculous choice:
- The syntax for math expressions is defined using the ROOT conventions (see e.g., [TMath](https://root.cern.ch/root/html/TMath.html) and [TFormula](https://root.cern.ch/root/html/TFormula.html)). Acceptable expressions may for instance include the following ridiculous choice:
```python
glob.annz["userCuts_train"] = "(MAG_R > 22)/MAG_R + (MAG_R <= 22)*1"
glob.annz["userCuts_valid"] = "pow(MAG_G,3) + exp(MAG_R)*MAG_I/20. + abs(sin(MAG_Z))"
```

- By default, the output of evaluation is written to a subdirectory named `eval` in the output directory. An output file may e.g., be `output/test_randReg_quick/regres/optim/eval/ANNZ_randomReg_0000.csv`. It is possible to set the `evalDirPostfix` variable in order to change this. For instance, setting
```python
glob.annz["evalDirPostfix"] = "cat0"
```
will produce the same output file at `output/test_randReg_quick/regres/optim/eval_cat0/ANNZ_randomReg_0000.csv`. This may be used in order to run the evaluation on multiple input files simultaneously without overwriting previous results.

- There are several parameters used to tune PDFs in randomized regression. Here are a couple of principal examples:

- **`minPdfWeight` -** may be used to set a minimal weight for an MLM in the PDF. For instance, setting `minPdfWeight=0.05` will ensure that each MLM has at least 5% relative significance in the PDF. That is, in this case, no more than 20 MLMs will be used for the PDF.

- **`max_sigma68_PDF`, `max_bias_PDF`, `max_frac68_PDF` -** may be set to put a threshold on the maximal value of the scatter (`max_sigma68_PDF`), bias (`max_bias_PDF`) or outlier-fraction (`max_frac68_PDF`) of an MLM which may be included in the PDF. For instance, setting `max_sigma68_PDF = 0.05` will ensure that any MLM which has a scatter higher than `0.05` is not included in the PDF.

- By default, a progress bar is drawn during training. If the output is written to a log file, it is important to disable the progress bar, as it would otherwise cause the log file to become very large. One can either add `--isBatch` while running the example scripts, or set in `generalSettings.py` (or elsewhere),
```python
glob.annz["isBatch"] = True
5 changes: 4 additions & 1 deletion examples/scripts/annz_binCls_advanced.py
@@ -469,7 +469,10 @@
# inAsciiVars - list of parameters in the input files (doesn't need to be exactly the same as in doGenInputTrees, but must contain all
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z"

# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# run ANNZ with the current settings
runANNZ()
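
The effect of the `evalDirPostfix` setting on the output location can be sketched as follows. This is an illustration only: `eval_dir` is a hypothetical helper, and the naming rule (`eval` plus an underscore and the postfix) is inferred from the README example in this commit rather than taken from the ANNZ source.

```python
import posixpath

def eval_dir(optim_dir, eval_dir_postfix=""):
    """Return the evaluation directory for a given (possibly empty) postfix."""
    # assumption: an empty postfix keeps the default name, "eval"
    name = "eval" if not eval_dir_postfix else "eval_" + eval_dir_postfix
    return posixpath.join(optim_dir, name)

# directory taken from the README example
optim_dir = "output/test_randReg_quick/regres/optim"
default_dir = eval_dir(optim_dir)
cat0_dir = eval_dir(optim_dir, "cat0")
```

Giving each input file its own postfix keeps the results of simultaneous evaluation jobs in separate directories.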

5 changes: 4 additions & 1 deletion examples/scripts/annz_binCls_quick.py
@@ -136,7 +136,10 @@
# inAsciiVars - list of parameters in the input files (doesn't need to be exactly the same as in doGenInputTrees, but must contain all
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z"

# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# run ANNZ with the current settings
runANNZ()
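
The comment on `inAsciiVars` states that the list must contain all of the parameters which were used for training. A minimal sketch of such a check is given below. Both helpers are hypothetical and not part of ANNZ, and the `TYPE:name` reading of each entry is inferred from the strings used in the example scripts.

```python
def parse_ascii_vars(in_ascii_vars):
    """Split a ';'-separated list of 'TYPE:name' entries into (type, name) pairs."""
    pairs = []
    for token in in_ascii_vars.split(";"):
        token = token.strip()
        if not token:
            continue
        type_code, _, name = token.partition(":")
        pairs.append((type_code.strip(), name.strip()))
    return pairs

def missing_training_vars(in_ascii_vars, training_vars):
    """Return the training variables which are absent from the inAsciiVars list."""
    names = {name for _, name in parse_ascii_vars(in_ascii_vars)}
    return [var for var in training_vars if var not in names]

# string taken from the example script
in_ascii_vars = ("F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;"
                 "F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z")
pairs = parse_ascii_vars(in_ascii_vars)

# hypothetical training variables; MAG_Y is deliberately not in the list
missing = missing_training_vars(in_ascii_vars, ["MAG_U", "MAG_G", "MAG_Y"])
```

An empty result means that every training variable is available in the input files.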

3 changes: 3 additions & 0 deletions examples/scripts/annz_rndCls_advanced.py
@@ -282,6 +282,9 @@
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "C:class; F:z; UL:objid; F:psfMag_r; F:fiberMag_r; F:modelMag_r; F:petroMag_r; F:petroRad_r; F:petroR50_r; " \
+ " F:petroR90_r; F:lnLStar_r; F:lnLExp_r; F:lnLDeV_r; F:mE1_r; F:mE2_r; F:mRrCc_r; I:type_r; I:type"
# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# ===========================================================================================================
# MLMsToStore -
3 changes: 3 additions & 0 deletions examples/scripts/annz_rndCls_quick.py
@@ -107,6 +107,9 @@
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "C:class; F:z; UL:objid; F:psfMag_r; F:fiberMag_r; F:modelMag_r; F:petroMag_r; F:petroRad_r; F:petroR50_r; " \
+ " F:petroR90_r; F:lnLStar_r; F:lnLExp_r; F:lnLDeV_r; F:mE1_r; F:mE2_r; F:mRrCc_r; I:type_r; I:type"
# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# run ANNZ with the current settings
runANNZ()
3 changes: 3 additions & 0 deletions examples/scripts/annz_rndReg_advanced.py
@@ -381,6 +381,9 @@
# inAsciiVars - list of parameters in the input files (doesn't need to be exactly the same as in doGenInputTrees, but must contain all
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z"
# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = "nFile0"

# run ANNZ with the current settings
runANNZ()
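
For randomized regression, the mode this script drives, the README notes in this commit describe `minPdfWeight` and `max_sigma68_PDF` as ways to limit which MLMs enter a PDF. The arithmetic behind both can be sketched as below. The helper names, the MLM names, the metric values and the use of `None` for an unset threshold are all hypothetical; the selection logic inside ANNZ may differ.

```python
def max_mlms_in_pdf(min_pdf_weight):
    """Largest number of MLMs which can each hold at least min_pdf_weight of the PDF."""
    # small guard against floating-point round-off before truncating
    return int(1.0 / min_pdf_weight + 1e-9)

def passes_thresholds(metrics, thresholds):
    """True if no metric exceeds its threshold; None marks an unset threshold."""
    # assumption: metrics are compared by absolute value
    return all(limit is None or abs(metrics[name]) <= limit
               for name, limit in thresholds.items())

# a minimal weight of 5% per MLM leaves room for at most 20 MLMs
n_max = max_mlms_in_pdf(0.05)

# hypothetical per-MLM metrics, for illustration only
mlm_metrics = {
    "ANNZ_0": {"sigma68": 0.03, "bias": 0.001, "frac68": 0.02},
    "ANNZ_1": {"sigma68": 0.07, "bias": 0.002, "frac68": 0.03},
}
thresholds = {"sigma68": 0.05, "bias": None, "frac68": None}
accepted = [name for name, metrics in mlm_metrics.items()
            if passes_thresholds(metrics, thresholds)]
```

Here only the MLM with a scatter below `0.05` is accepted, which mirrors the `max_sigma68_PDF = 0.05` example in the README.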
3 changes: 3 additions & 0 deletions examples/scripts/annz_rndReg_quick.py
@@ -113,6 +113,9 @@
# inAsciiVars - list of parameters in the input files (doesn't need to be exactly the same as in doGenInputTrees, but must contain all
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z"
# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# run ANNZ with the current settings
runANNZ()
3 changes: 3 additions & 0 deletions examples/scripts/annz_singleCls_quick.py
@@ -107,6 +107,9 @@
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "C:class; F:z; UL:objid; F:psfMag_r; F:fiberMag_r; F:modelMag_r; F:petroMag_r; F:petroRad_r; F:petroR50_r; " \
+ " F:petroR90_r; F:lnLStar_r; F:lnLExp_r; F:lnLDeV_r; F:mE1_r; F:mE2_r; F:mRrCc_r; I:type_r; I:type"
# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# run ANNZ with the current settings
runANNZ()
3 changes: 3 additions & 0 deletions examples/scripts/annz_singleReg_quick.py
@@ -103,6 +103,9 @@
# inAsciiVars - list of parameters in the input files (doesn't need to be exactly the same as in doGenInputTrees, but must contain all
# of the parameters which were used for training)
glob.annz["inAsciiVars"] = "F:MAG_U;F:MAGERR_U;F:MAG_G;F:MAGERR_G;F:MAG_R;F:MAGERR_R;F:MAG_I;F:MAGERR_I;F:MAG_Z;F:MAGERR_Z;D:Z"
# evalDirPostfix - if not empty, this string will be added to the name of the evaluation directory
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = ""

# run ANNZ with the current settings
runANNZ()
11 changes: 2 additions & 9 deletions examples/scripts/generalSettings.py
@@ -71,15 +71,8 @@ def generalSettings():
# the uncertainty on the input parameters to the MLM-estimator. See getRegClsErrINP()
# glob.annz["nErrINP"] = -1

# kNNErrMaxDifZ,nkNNErrMin -
# - improve KNN error calculation by only considering neighbours which are "close" (in the input parameter
# space) to the target object (for which the error is calculated).
# - if the nth neighbour has a regression value which is different by kNNErrMaxDifZ compared to the original
# target object, and at least nkNNErrMin neighbours have already been consudered, then the nth neighbour is
# ignored in the KNN error calculation
# -----------------------------------------------------------------------------------------------------------
# glob.annz["kNNErrMaxDifZ"] = -1
# glob.annz["nkNNErrMin"] = 50
# maximal number of objects in a tree/output ascii file
# glob.annz["nObjectsToWrite"] = 1e6

return

4 changes: 2 additions & 2 deletions examples/scripts/helperFuncs.py
@@ -102,8 +102,8 @@ def initParse():
glob.annz["doRegression"] = glob.annz["doSingleReg"] or glob.annz["doRandomReg"] or glob.annz["doBinnedCls"]
glob.annz["doClassification"] = glob.annz["doSingleCls"] or glob.annz["doRandomCls"]

glob.annz["maxNobj"] = int(floor(glob.pars["maxNobj"])) # limit number of used objects - used for debugging
glob.annz["trainIndex"] = int(floor(glob.pars["trainIndex"])) # used for python batch-job submision
glob.annz["maxNobj"] = int(floor(glob.pars["maxNobj"])) # limit number of used objects - used for debugging
glob.annz["trainIndex"] = glob.pars["trainIndex"] # used for python batch-job submision

glob.annz["doFitsToAscii"] = glob.pars["fitsToAscii"]
glob.annz["doAsciiToFits"] = glob.pars["asciiToFits"]
22 changes: 12 additions & 10 deletions include/ANNZ.hpp
@@ -79,6 +79,8 @@ class ANNZ : public BaseClass {
TString getTagPdfAvgName(int nPdfNow = -1, TString type = "");
TString getTagBestMLMname(TString MLMname = "");
int getTagNow(TString MLMname);
TString getErrKNNname(int nMLMnow = -1);
int getErrKNNtagNow(TString errKNNname);
TString getKeyWord(TString MLMname, TString sequence, TString key);
void loadOptsMLM();
void setNominalParams(int nMLMnow, TString inputVariables, TString inputVarErrors);
@@ -104,9 +106,9 @@ class ANNZ : public BaseClass {
// -----------------------------------------------------------------------------------------------------------
// ANNZ_TMVA.cpp :
// -----------------------------------------------------------------------------------------------------------
void prepFactory(int nANNZnow = -1, TMVA::Factory * factory = NULL);
void prepFactory(int nMLMnow = -1, TMVA::Factory * factory = NULL);
void doFactoryTrain(TMVA::Factory * factory);
void clearReaders();
void clearReaders(Log::LOGtypes logLevel = Log::DEBUG_1);
void loadReaders(map <TString,bool> & mlmSkipNow);
double getReader(VarMaps * var = NULL, ANNZ_readType readType = ANNZ_readType::NUN, bool forceUpdate = false, int nMLMnow = -1);
void setupTypesTMVA();
@@ -116,14 +118,14 @@ class ANNZ : public BaseClass {
// -----------------------------------------------------------------------------------------------------------
// ANNZ_err.cpp :
// -----------------------------------------------------------------------------------------------------------
void setupKdTreeKNN(TChain * aChain, TCut cutsAll, int nANNZnow, TFile *& knnErrOutFile, TMVA::Factory *& knnErrFactory,
TMVA::kNN::ModulekNN *& knnErrModule, TCut cutsSig = "", TCut cutsBck = "",
TString wgtReg = "1", TString wgtSig = "1", TString wgtBck = "1");
void createTreeErrKNN(int nMLMnow);
void setupKdTreeKNN(TChain * aChainKnn, TFile *& knnErrOutFile, TMVA::Factory *& knnErrFactory, TMVA::kNN::ModulekNN *& knnErrModule,
vector <int> & trgIndexV, int nMLMnow, TCut cutsAll, TString wgtAll);
void cleanupKdTreeKNN(TFile *& knnErrOutFile, TMVA::Factory *& knnErrFactory, bool verb = false);
double getRegClsErrKNN(VarMaps * var = NULL, ANNZ_readType readType = ANNZ_readType::NUN,
int nMLMnow = -1, TMVA::kNN::ModulekNN * knnErrModule = NULL, vector <double> * zErrV = NULL);
double getRegClsErrINP(VarMaps * var = NULL, ANNZ_readType readType = ANNZ_readType::NUN,
int nMLMnow = -1, UInt_t * seedP = NULL, vector <double> * zErrV = NULL);
void getRegClsErrKNN(VarMaps * var, TMVA::kNN::ModulekNN * knnErrModule, vector <int> & trgIndexV,
vector <int> & nMLMv, bool isREG, vector < vector <double> > & zErrV);

double getRegClsErrINP(VarMaps * var, bool isREG, int nMLMnow, UInt_t * seedP = NULL, vector <double> * zErrV = NULL);

// -----------------------------------------------------------------------------------------------------------
// ANNZ_loopRegCls.cpp :
@@ -159,7 +161,7 @@ class ANNZ : public BaseClass {
// private variables
// ===========================================================================================================
vector < Double_t > zClos_binE, zClos_binC, zPlot_binE, zPlot_binC, zPDF_binE, zPDF_binC, zBinCls_binE, zBinCls_binC;
vector < TString > mlmTagName, mlmTagWeight, mlmTagClsVal, mlmTagIndex, inputVariableV;
vector < TString > mlmTagName, mlmTagWeight, mlmTagClsVal, mlmTagIndex, mlmTagErrKNN, inputVariableV;
vector < map <TString,TString> > mlmTagErr;
vector < vector <TString> > pdfBinNames, inErrTag;
vector < map < TString,TString> > pdfAvgNames;
4 changes: 2 additions & 2 deletions include/Utils.hpp
@@ -118,7 +118,7 @@ class Utils {

void checkPathPrefix(TString pathName = "");
void validDirExists(TString dirName = "", bool verbose = false);
bool validFileExists(TString fileName = "", bool verif = true, bool verbose = false);
bool validFileExists(TString fileName = "", bool verif = true);
void resetDirectory(TString OutDirName = "", bool verbose = false, bool copyCode = false);
void checkCmndSafety(TString cmnd = "", bool verbose = false);
void safeRM(TString cmnd = "", bool verbose = false);
@@ -130,7 +130,7 @@ class Utils {
bool isSameWeightExpr(TString wgt0, TString wgt1);

int getNlinesAsciiFile(TString fileName, bool checkNonEmpty = true);
int getNlinesAsciiFile(vector<TString> & fileNameV, bool checkNonEmpty = true);
int getNlinesAsciiFile(vector<TString> & fileNameV, bool checkNonEmpty = true, vector <int> * nLineV = NULL);

void findObjPatternInCurrentDir(vector <TString> & patternV, vector <TString> & matchedObjV, TString clasType = "");
void getSortedArray(double * data, double *& sortedData);
4 changes: 2 additions & 2 deletions include/commonInclude.hpp
@@ -213,8 +213,8 @@ namespace Log {
exit(1); \
} } while(false)

#define LINE_FILL(charFill,len) \
std::setfill(charFill)<<std::setw(len)<<""<<std::setfill(' ')
#define LINE_FILL(charFill,len) \
std::setfill(charFill)<<std::setw(len)<<""<<std::setfill(' ')

#endif // __MY_DEFINES__

2 changes: 1 addition & 1 deletion src/ANNZ.cpp
@@ -42,7 +42,7 @@ ANNZ::~ANNZ() {
mlmTagIndex.clear(); mlmSkip.clear(); pdfBinNames.clear(); pdfAvgNames.clear();
trainTimeM.clear(); inputVariableV.clear(); inErrTag.clear(); readerInptIndexV.clear();
zPDF_binE.clear(); zPDF_binC.clear(); zPlot_binE.clear(); zPlot_binC.clear();
inNamesVar.clear(); inNamesErr.clear(); userWgtsM.clear();
inNamesVar.clear(); inNamesErr.clear(); userWgtsM.clear(); mlmTagErrKNN.clear();
zClos_binE.clear(); zClos_binC.clear(); zBinCls_binE.clear(); zBinCls_binC.clear();
typeMLM.clear(); allANNZtypes.clear(); typeToNameMLM.clear(); nameToTypeMLM.clear();
bestMLMname.clear();