Chopping is a technique to make the most of a limited dataset for TMVA training. In this approach the data are chopped into several categories, and for each category i the TMVA is trained using the remaining N-1 categories; the trained TMVA is then applied to the events from category i.
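Schematically the partitioning looks as follows (a minimal pure-Python sketch for illustration only; the actual training and application are handled by the Trainer and Reader described below, and the event list and N are just placeholders):

```python
N      = 7                        ## number of categories  (placeholder value)
events = list ( range ( 100 ) )   ## stand-in for the input events
for i in range ( N ) :            ## loop over the categories
    training    = [ e for e in events if e % N != i ] ## train on the other N-1 categories
    application = [ e for e in events if e % N == i ] ## apply the trained TMVA to category i
```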
Training with chopping is fairly trivial. First one needs to define the number of distinct categories and the function that classifies the events into training categories, e.g. for TMVA training:
```python
tSignal = ... ## signal     TTree/TChain
tBkg    = ... ## background TTree/TChain

## book the TMVA trainer
from Ostap.TMVAChopper import Trainer
trainer = Trainer (
    N               = N                 , ## ATTENTION! N is the number of categories
    category        = "137*evt+813*run" , ## ATTENTION! the classification function
    chop_signal     = False             , ## chop the signal?     (default)
    chop_background = True              , ## chop the background? (default)
    ... )
```
All other arguments of Trainer are the same as for the regular TMVA trainer. The arguments chop_signal and chop_background define which sample (or both) is to be chopped. The argument category describes the integer-valued function used for the classification of events; internally the trainer constructs the classification function as category%N.
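For illustration, the category index of a given event is just the value of the classification function taken modulo N (a standalone sketch with made-up values for evt, run and N):

```python
N   = 11                                 ## number of categories (example value)
evt = 17                                 ## event number         (example value)
run = 42                                 ## run number           (example value)
category = ( 137 * evt + 813 * run ) % N ## category index in [0,N)
print ( 'event goes to category %s' % category )
```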
{% discussion "How to choose chopping parameters?" %}
For efficient usage of events the number of categories should be rather large, N>>2. For a given number of categories N, the fraction of events used for TMVA training is (N-1)/N, therefore with large N the events are used more efficiently. On the other hand, for large N the training time is proportional to N, while the training results should be more or less independent of N, which makes the usage of N>100 pointless. Therefore one gets 2<<N<100; in practice it is convenient to choose 10<N<20.
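As a quick numerical illustration of this trade-off (plain Python, just evaluating the fraction (N-1)/N):

```python
for N in ( 2 , 5 , 10 , 20 , 100 ) :   ## a few representative choices of N
    print ( 'N = %3d : fraction of events used for training = %.2f' % ( N , ( N - 1.0 ) / N ) )
```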
The ideal classification function must be independent of the properties of signal and/or background. It should be pseudorandom and provide an almost uniform population of the categories. This is very easy to achieve using an expression of the form (Na*a+Nb*b+Nc*c+...+Nz*z)%N, where a, b, ..., z are some integer-valued variables from the input TTree/TChain (event number, run number, GPS time in nanoseconds, number of tracks in the event, number of hits in the SPD, etc.), and Na, Nb, ..., Nz are prime numbers that are large enough (Na>>N, Nb>>N, ..., Nz>>N). With such a construction, choosing N to be also a prime number, one almost guarantees that the events are randomly distributed into the N categories.
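The almost uniform population is easy to verify with a small self-contained check (an illustration with hypothetical prime coefficients and randomly generated pseudo-events, not part of the Ostap API):

```python
import random
N       = 11                                ## prime number of categories (example)
Na , Nb = 1009 , 2003                       ## large prime coefficients   (example)
counts  = [ 0 ] * N
for _ in range ( 100000 ) :                 ## generate random pseudo-events
    evt = random.randint ( 0 , 10**6 )      ## "event number"
    run = random.randint ( 0 , 10**4 )      ## "run number"
    counts [ ( Na * evt + Nb * run ) % N ] += 1
print ( counts )                            ## populations should be nearly uniform
```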
The category population can be checked using a set of control histograms:
```python
bc = trainer.background_categories
sc = trainer.signal_categories
bc[0].Draw() ## show the population of background categories
bc[1].Draw() ## the same with different binning
sc[0].Draw() ## show the population of signal categories
sc[1].Draw() ## the same with different binning
```
{% enddiscussion %}
Again one needs to define the classification function for the input data. Clearly this function should match the one used in the training:
```python
category = lambda s : int ( s.evt * 137 + 813 * s.run ) % N ## the classification function

from Ostap.TMVAChopper import Reader ## ATTENTION
reader = Reader (
    N            = N        , ## number of categories
    categoryfunc = category , ## the category function
    ... )
```
All other arguments of Reader are the same as for the regular TMVA reader. The created reader is used in exactly the same way as in the no-chopping case:
```python
tree = ...                              ## the tree
mlp  = reader['MLP']                    ## get one method
for i in tree :                         ## loop over the entries
    print 'MLP value is %s' % mlp ( i ) ## get the value
```
{% discussion "For tests and debug" %} For test and debug purposes one can use it also as a function:
```python
v1, v2, v3 = ...
mlp = reader['MLP']    ## get one method
for i in range ( N ) : ## loop over the categories
    print 'MLP value for category %s is %s' % ( i , mlp ( i , v1 , v2 , v3 ) )
```
One can even get the difference between the responses for different categories; clearly the spread of the values should be small enough:
```python
v1, v2, v3 = ...
mean = mlp.mean ( v1 , v2 , v3 ) ## get the mean value over the different categories
stat = mlp.stat ( v1 , v2 , v3 ) ## get the statistics (mean, rms, min/max, ...) of the responses
```
{% enddiscussion %}