ANNZ 2.0.4
**Modified the function, `CatFormat::addWgtKNNtoTree()`, and added
`CatFormat::asciiToFullTree_wgtKNN()`:** The purpose of the new
features is to add an output variable, denoted `inTrainFlag`, to the
output of the evaluation. The new output indicates whether the
corresponding object is "compatible" with objects from the training
dataset. The compatibility is estimated by comparing the density of
training-dataset objects in the vicinity of the evaluated object. If
the evaluated object belongs to an area of parameter-space which is
not represented in the training dataset, we get `inTrainFlag = 0`. In
this case, the output of the training is probably unreliable.
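Conceptually, the density comparison behind `inTrainFlag` can be sketched as follows. This is a minimal, illustrative Python/NumPy toy, not the actual C++/ROOT implementation in `CatFormat`; the function name, the median-based normalisation, and the default thresholds are assumptions made for illustration only:

```python
import numpy as np

def in_train_flag(train, evaluated, n_neighbors=20, max_rel_ratio=0.1):
    """Toy version of the inTrainFlag idea: compare the radius needed to
    enclose n_neighbors training objects around each evaluated object with
    the typical such radius inside the training sample itself (summarised
    here by the median). Sparse neighborhoods -> flag 0 (unreliable)."""
    def knn_radius(points, ref, k):
        # Euclidean distance from every point to every reference object
        d = np.linalg.norm(points[:, None, :] - ref[None, :, :], axis=2)
        d.sort(axis=1)
        return d[:, k]

    # radius enclosing the n_neighbors nearest training objects of each
    # evaluated object (index k-1, since indexing is zero-based)
    r_eval = knn_radius(evaluated, train, n_neighbors - 1)
    # typical radius among the training objects themselves
    # (index n_neighbors skips the zero self-distance)
    r_typ = np.median(knn_radius(train, train, n_neighbors))
    # a small density ratio means the region is poorly covered by training
    return (r_typ / np.maximum(r_eval, 1e-12) >= max_rel_ratio).astype(int)
```

A ratio close to 1 means the evaluated object sits in a region about as densely populated as the bulk of the training sample; in ANNZ itself the corresponding criterion is steered by `minNobjInVol_inTrain` and `maxRelRatioInRef_inTrain`.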
IftachSadeh committed Mar 18, 2015
1 parent d0ac2fc commit f3e0a0b
Showing 11 changed files with 478 additions and 203 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

<!-- ## Master version -->

## ANNZ 2.0.4 (19/3/2015)

- **Modified the function, `CatFormat::addWgtKNNtoTree()`, and added `CatFormat::asciiToFullTree_wgtKNN()`:** The purpose of the new features is to add an output variable, denoted `inTrainFlag`, to the output of the evaluation. The new output indicates whether the corresponding object is "compatible" with objects from the training dataset. The compatibility is estimated by comparing the density of training-dataset objects in the vicinity of the evaluated object. If the evaluated object belongs to an area of parameter-space which is not represented in the training dataset, we get `inTrainFlag = 0`. In this case, the output of the training is probably unreliable.

- Other minor modifications.

## ANNZ 2.0.3 (25/2/2015)

- **Added *MultiClass* support to binned classification:** The new option is controlled by setting the `doMultiCls` flag. In this mode, multiple background samples can be trained simultaneously against the signal. In the context of binned classification, this means that each classification bin acts as an independent sample during the training.
8 changes: 7 additions & 1 deletion README.md
@@ -1,4 +1,4 @@
# ANNZ 2.0.3
# ANNZ 2.0.4

## Introduction
ANNZ uses both regression and classification techniques for estimation of single-value photo-z (or any regression problem) solutions and PDFs. In addition, it is suitable for classification problems, such as star/galaxy classification.
@@ -367,6 +367,12 @@ which in this example, adds the U-band magnitude and the error on the I-band mag

The directory, `output/test_randReg_quick/regres/eval/` (for the `scripts/annz_rndReg_quick.py` example), contains the output ascii and ROOT tree files, respectively, `ANNZ_randomReg_0000.csv` and `ANNZ_tree_randomReg_00002.root`. These have a format similar to the one described above.

In addition to the above-mentioned variables, the parameter `inTrainFlag` is included in the output, provided the user sets:
```python
glob.annz["addInTrainFlag"] = True
```
(See `scripts/annz_rndReg_advanced.py`.) This output indicates whether an evaluated object is "compatible" with corresponding objects from the training dataset. The compatibility is estimated by comparing the density of objects in the training dataset in the vicinity of the evaluated object. If the evaluated object belongs to an area of parameter-space which is not represented in the training dataset, we get `inTrainFlag = 0`. In this case, the output of the training is probably unreliable. The calculation is performed using a KNN approach, similar to the algorithm used for the `glob.annz["useWgtKNN"] = True` calculation.
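Downstream, the flag can be used to mask out unreliable estimates when reading the evaluation output. A minimal sketch, assuming a plain comma-separated file with a header row; the column names here are illustrative, so check the header of your own output file:

```python
import csv

def reliable_rows(csv_path):
    """Keep only evaluated objects flagged as compatible with the
    training sample (inTrainFlag > 0); the remaining rows carry
    estimates that are probably unreliable."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        return [row for row in reader if float(row["inTrainFlag"]) > 0]
```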

### Single regression

The outputs of single regression are similar to those of randomized regression. In this case, the *best* MLM is actually the only MLM, and no PDF solutions are created. For instance, using `scripts/annz_singleReg_quick.py`, the performance plots will be found at `output/test_singleReg_quick/regres/optim/eval/plots/` and the output ascii file would be found at `output/test_singleReg_quick/regres/optim/eval/ANNZ_singleReg_0000.csv`. The latter would nominally include the variables:
25 changes: 25 additions & 0 deletions examples/scripts/annz_rndReg_advanced.py
@@ -385,6 +385,31 @@
# (can be used to prevent multiple evaluation of different input files from overwriting each other)
glob.annz["evalDirPostfix"] = "nFile0"

# -----------------------------------------------------------------------------------------------------------
# addInTrainFlag, minNobjInVol_inTrain, maxRelRatioInRef_inTrain -
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# - addInTrainFlag - calculate for each object which is evaluated, if it is "close" in the
# input-parameter space to the training dataset. The result is written as part of the evaluation
# output, as an additional parameter named "inTrainFlag". The value of "inTrainFlag"
# is zero if the object is not "close" to the training objects (therefore probably has unreliable result).
# The calculation is performed using a KNN approach, similar to the algorithm used for
# the [glob.annz["useWgtKNN"] = True] calculation.
# - minNobjInVol_inTrain - The number of reference objects in the reference dataset which are used in the calculation.
# - maxRelRatioInRef_inTrain - A number in the range [0,1] - the minimal threshold of the relative difference between
#                              distances in the inTrainFlag calculation for accepting an object - should be a positive number smaller than 0.5.
# - ...._inTrain - The rest of the parameters ending with "_inTrain" have a similar role as
# their "_wgtKNN" counterparts, which are used with [glob.annz["useWgtKNN"] = True]. These are:
# - "outAsciiVars_inTrain", "weightInp_inTrain", "cutInp_inTrain",
# "cutRef_inTrain", "sampleFracInp_inTrain" and "sampleFracRef_inTrain"
# -----------------------------------------------------------------------------------------------------------
addInTrainFlag = False
if addInTrainFlag:
glob.annz["addInTrainFlag"] = True
glob.annz["minNobjInVol_inTrain"] = 100
glob.annz["maxRelRatioInRef_inTrain"] = 0.1
glob.annz["weightVarNames_inTrain"] = "MAG_U;MAG_G;MAG_R;MAG_I;MAG_Z"
# glob.annz["weightRef_inTrain"] = "(MAG_Z<20.5 && MAG_R<22 && MAG_U<24)" # cut the reference sample, just to have some difference...

# run ANNZ with the current settings
runANNZ()

1 change: 1 addition & 0 deletions include/CatFormat.hpp
@@ -55,6 +55,7 @@ class CatFormat : public BaseClass {
void asciiToSplitTree(TString inAsciiFiles, TString inAsciiVars);
void asciiToFullTree(TString inAsciiFiles, TString inAsciiVars, TString treeNamePostfix = "");
void asciiToSplitTree_wgtKNN(TString inAsciiFiles, TString inAsciiVars, TString inAsciiFiles_wgtKNN, TString inAsciiVars_wgtKNN);
void asciiToFullTree_wgtKNN(TString inAsciiFiles, TString inAsciiVars, TString treeNamePostfix);
void parseInputVars(VarMaps * var, TString inAsciiVars, vector <TString> & inVarNames, vector <TString> & inVarTypes);
bool inputLineToVars(TString line, VarMaps * var, vector <TString> & inVarNames, vector <TString> & inVarTypes);
void setSplitVars(VarMaps * var, TRandom * rnd, map <TString,int> & intMap);
8 changes: 4 additions & 4 deletions src/ANNZ_loopCls.cpp
@@ -296,7 +296,7 @@ void ANNZ::optimCls() {
TGraphErrors * grph = new TGraphErrors(int(graph_X.size()),&graph_X[0], &graph_Y[0],&graph_Xerr[0], &graph_Yerr[0]);

grph->SetName(TString::Format((TString)"compPure_%d"+"_clasOptimize"+typeName+"_%d",nCompPureMgNow,nPlotSbSepNow));
grph->SetTitle(TString::Format((TString)"#%d, S_{s/b} ("+getTagName(nMLMnow)+","+typeToNameMLM[typeMLM[nMLMnow]]+") = %1.2e",nSbSepIndexNow,sbSepFrac));
grph->SetTitle(TString::Format((TString)"ranked as #%d, S_{s/b} ("+getTagName(nMLMnow)+","+typeToNameMLM[typeMLM[nMLMnow]]+") = %1.2e",nSbSepIndexNow+1,sbSepFrac));
grph->GetXaxis()->SetTitle("Completeness"); grph->GetYaxis()->SetTitle("Purity");
compPureMgV[typeName][nCompPureMgNow]->Add(grph);
}
@@ -325,9 +325,9 @@

normFactor = his1M[sigBckName][nMLMnow]->Integral(); if(normFactor>0) his1M[sigBckName][nMLMnow]->Scale(1/normFactor,"width");

his1M[sigBckName][nMLMnow]->SetTitle( TString::Format( (TString)"#%d, "+sigBckTitle+" ("+MLMname+","
his1M[sigBckName][nMLMnow]->SetTitle( TString::Format( (TString)"ranked as #%d, "+sigBckTitle+" ("+MLMname+","
+typeToNameMLM[typeMLM[nMLMnow]]+") - S_{s/b} = %1.2e"
,nSbSepIndexNow,nSbSepValNow ) );
,nSbSepIndexNow+1,nSbSepValNow ) );
}
}

@@ -632,7 +632,7 @@ void ANNZ::doEvalCls() {

// create the chain for the loop
// -----------------------------------------------------------------------------------------------------------
TString inTreeName = (TString)glob->GetOptC("treeName")+"_eval";
TString inTreeName = (TString)glob->GetOptC("treeName")+glob->GetOptC("evalTreePostfix");
TString inFileName = (TString)glob->GetOptC("outDirNameFull")+inTreeName+"*.root";

// prepare the chain and input variables. Set cuts to match the TMVAs
2 changes: 1 addition & 1 deletion src/ANNZ_loopReg.cpp
@@ -1756,7 +1756,7 @@ void ANNZ::doEvalReg(TChain * inChain, TString outDirName, vector <TString> * s
// -----------------------------------------------------------------------------------------------------------
// create the chain for the loop, or assign the input chain
// -----------------------------------------------------------------------------------------------------------
TString inTreeName = (TString)treeName+"_eval";
TString inTreeName = (TString)treeName+glob->GetOptC("evalTreePostfix");
TString inFileName = (TString)outDirNameFull+inTreeName+"*.root";

// prepare the chain and input variables. Set cuts to match the TMVAs
4 changes: 2 additions & 2 deletions src/ANNZ_loopRegCls.cpp
@@ -700,7 +700,7 @@ void ANNZ::makeTreeRegClsOneMLM(int nMLMnow) {
if(trainCut != "") cutExprs += (TString)" && ("+trainCut+")";

int nEvtPass = aChainOut->Draw(drawExprs,cutExprs);
if(nEvtPass > 0) his_all = (TH1F*)gDirectory->Get(hisName);
if(nEvtPass > 0) his_all = (TH1F*)gDirectory->Get(hisName); his_all->BufferEmpty();
}
if(!his_all) continue;

@@ -719,7 +719,7 @@ void ANNZ::makeTreeRegClsOneMLM(int nMLMnow) {
int nEvtPass = aChainOut->Draw(drawExprs,cutExprs);

if(nEvtPass > 0) {
his1_sb->SetDirectory(0); // allowed only after the chain fills the histogram
his1_sb->SetDirectory(0); his1_sb->BufferEmpty(); // allowed only after the chain fills the histogram
if(nSigBckNow == 0) his1_sig = his1_sb;
else his1_bck = his1_sb;
}
2 changes: 1 addition & 1 deletion src/ANNZ_utils.cpp
@@ -690,7 +690,7 @@ void ANNZ::loadOptsMLM() {
aLOG(Log::DEBUG_1) <<coutWhiteOnBlack<<coutYellow<<" - starting ANNZ::loadOptsMLM() ... "<<coutDef<<endl;

int nMLMs = glob->GetOptI("nMLMs");
TString weightKNN = glob->GetOptC("baseName_weightKNN");
TString weightKNN = glob->GetOptC("baseName_wgtKNN");

inNamesVar.resize(nMLMs); inNamesErr.resize(nMLMs);

7 changes: 4 additions & 3 deletions src/CatFormat_asciiToTree.cpp
@@ -61,7 +61,7 @@ void CatFormat::asciiToFullTree(TString inAsciiFiles, TString inAsciiVars, TStri
TString treeName = glob->GetOptC("treeName")+treeNamePostfix;
TString origFileName = glob->GetOptC("origFileName");
TString indexName = glob->GetOptC("indexName");
TString weightName = glob->GetOptC("baseName_weightKNN");
TString weightName = glob->GetOptC("baseName_wgtKNN");

map <TString,int> intMap;
vector <TString> inFileNameV, inVarNames, inVarTypes;
@@ -200,7 +200,7 @@ void CatFormat::asciiToSplitTree(TString inAsciiFiles, TString inAsciiVars) {
TString indexName = glob->GetOptC("indexName");
TString splitName = glob->GetOptC("splitName");
TString testValidType = glob->GetOptC("testValidType");
TString weightName = glob->GetOptC("baseName_weightKNN");
TString weightName = glob->GetOptC("baseName_wgtKNN");
bool doPlots = glob->GetOptB("doPlots");
TString plotExt = glob->GetOptC("printPlotExtension");
TString outDirNameFull = glob->GetOptC("outDirNameFull");
@@ -427,7 +427,8 @@ void CatFormat::asciiToSplitTree(TString inAsciiFiles, TString inAsciiVars) {
TCanvas * tmpCnvs = new TCanvas("tmpCnvs","tmpCnvs");
aChain->Draw(drawExprs,""); DELNULL(tmpCnvs);

TH1 * his1 = (TH1F*)gDirectory->Get(hisName); his1->SetDirectory(0); his1->SetTitle(branchNameV[nBranchNow]); assert(dynamic_cast<TH1F*>(his1));
TH1 * his1 = (TH1F*)gDirectory->Get(hisName); assert(dynamic_cast<TH1F*>(his1));
his1->SetDirectory(0); his1->BufferEmpty(); his1->SetTitle(branchNameV[nBranchNow]);

outputs->optClear();
outputs->draw->NewOptC("drawOpt" , "HIST");
