Presets

Presets provide pre-defined loss functions and strategies, which can be referenced by name in the distillation configuration.

  • textbrewer.presets.ADAPTOR_KEYS (List)

    Keys in the dict returned by the adaptor (see the adaptor sketch after this list):

    • 'logits', 'logits_mask', 'losses', 'inputs_mask', 'labels', 'hidden', 'attention'
  • textbrewer.presets.KD_LOSS_MAP (Dict)

    Available kd_loss types

    • 'mse' : mean squared error
    • 'ce': cross-entropy loss
  • textbrewer.presets.PROJ_MAP (Dict)

    Projections used to match different dimensions of student and teacher intermediate features:

    • 'linear' : linear layer, no activation
    • 'relu' : linear layer with ReLU activation
    • 'tanh' : linear layer with Tanh activation
  • textbrewer.presets.MATCH_LOSS_MAP (Dict)

    Intermediate feature matching loss functions.

    • Includes 'attention_mse_sum', 'attention_mse', 'attention_ce_mean', 'attention_ce', 'hidden_mse', 'cos', 'pkd', 'fsp', 'nst'. See Intermediate Loss for details.
  • textbrewer.presets.WEIGHT_SCHEDULER (Dict)

    Schedulers used to dynamically adjust the kd_loss weight and the hard_label_loss weight.

    • 'linear_decay' : decay from 1 to 0
    • 'linear_growth' : grow from 0 to 1
  • textbrewer.presets.TEMPERATURE_SCHEDULER (DynamicDict)

    Used to dynamically adjust distillation temperature.

    Unlike the other modules, the 'flsw' and 'cwsm' schedulers require extra parameters, which are passed in a list after the scheduler name, for example:

    #flsw
    distill_config = DistillationConfig(
        temperature_scheduler = ['flsw', 1, 1],  # beta = 1, gamma = 1
        ...)

    #cwsm
    distill_config = DistillationConfig(
        temperature_scheduler = ['cwsm', 1],  # beta = 1
        ...)
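
For reference, here is a minimal adaptor sketch returning a dict with some of the ADAPTOR_KEYS listed above. The batch layout and model output fields are hypothetical (assuming a HuggingFace-style model that exposes logits, hidden states, and attentions):

    def simple_adaptor(batch, model_outputs):
        # 'logits' drives the KD loss; 'hidden' and 'attention' enable
        # intermediate feature matching; 'inputs_mask' lets the
        # intermediate losses ignore padding positions.
        return {'logits': model_outputs.logits,
                'hidden': model_outputs.hidden_states,
                'attention': model_outputs.attentions,
                'inputs_mask': batch['attention_mask'],
                'labels': batch['labels']}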

Customization

If the pre-defined modules do not satisfy your requirements, you can add your own modules to the dicts above, for example:

MATCH_LOSS_MAP['my_L1_loss'] = my_L1_loss
WEIGHT_SCHEDULER['my_weight_scheduler'] = my_weight_scheduler
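
For illustration, a rough sketch of what such custom modules might look like, assuming match losses follow the built-in signature (student feature, teacher feature, optional mask) and weight schedulers map the training progress in [0, 1] to a weight:

    import torch.nn.functional as F

    def my_L1_loss(state_S, state_T, mask=None):
        # mean absolute error between student and teacher features
        # (mask handling omitted in this sketch)
        return F.l1_loss(state_S, state_T)

    def my_weight_scheduler(progress):
        # progress runs from 0 to 1 over training: hold the weight at 1
        # for the first half, then decay linearly to 0
        return 1.0 if progress < 0.5 else 2.0 * (1.0 - progress)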

Usage in DistillationConfig:

distill_config = DistillationConfig(
    kd_loss_weight_scheduler = 'my_weight_scheduler',
    intermediate_matches = [{'layer_T':0, 'layer_S':0, 'feature':'hidden', 'loss':'my_L1_loss', 'weight':1}],
    ...)
  

See the source code for more details (these will be explained more fully in the next version of the documentation).

Intermediate Loss

attention_mse

  • Takes in two matrices with the shape (batch_size, num_heads, len, len) and computes the mse loss between the two matrices.
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.
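
A rough sketch of this computation (the exact masking and averaging in the implementation may differ; attention_mse_sum follows the same masking pattern):

    import torch.nn.functional as F

    def attention_mse_sketch(attn_S, attn_T, mask=None):
        # attn_S, attn_T: (batch_size, num_heads, len, len)
        if mask is None:
            return F.mse_loss(attn_S, attn_T)
        # build a (batch_size, 1, len, len) validity mask and average the
        # squared error over the valid positions of every head
        m = mask.float().unsqueeze(1)        # (batch_size, 1, len)
        m = m.unsqueeze(2) * m.unsqueeze(3)  # (batch_size, 1, len, len)
        valid = m.sum() * attn_S.size(1)     # valid entries across all heads
        return (((attn_S - attn_T) ** 2) * m).sum() / valid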

attention_mse_sum

  • Takes in two matrices with the shape (batch_size, len, len) and computes the mse loss between them; if the shape is (batch_size, num_heads, len, len), sums along the num_heads dimension and then computes the mse loss between the two matrices.
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.

attention_ce

  • Takes in two matrices with the shape (batch_size, num_heads, len, len), applies softmax on dim=-1, and computes the cross-entropy loss between the two matrices.
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.
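
Roughly (mask handling omitted; attention_ce_mean applies the same cross-entropy after averaging over heads):

    import torch.nn.functional as F

    def attention_ce_sketch(attn_S, attn_T):
        # attn_S, attn_T: (batch_size, num_heads, len, len) attention scores
        probs_T = F.softmax(attn_T, dim=-1)          # teacher distributions
        logprobs_S = F.log_softmax(attn_S, dim=-1)   # student log-distributions
        return -(probs_T * logprobs_S).sum(dim=-1).mean()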

attention_ce_mean

  • Takes in two matrices. If the shape is (batch_size, len, len), computes the cross-entropy loss between the two matrices; if the shape is (batch_size, num_heads, len, len), averages over the num_heads dimension and then computes the cross-entropy loss between the two matrices.
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.

hidden_mse

  • Takes in two matrices with the shape (batch_size, len, hidden_size), computes mse loss between the two matrices.
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.
  • If the hidden sizes of student and teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
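
A rough sketch (the exact masking and averaging in the implementation may differ):

    import torch.nn.functional as F

    def hidden_mse_sketch(hidden_S, hidden_T, mask=None):
        # hidden_S, hidden_T: (batch_size, len, hidden_size)
        if mask is None:
            return F.mse_loss(hidden_S, hidden_T)
        m = mask.float().unsqueeze(-1)       # (batch_size, len, 1)
        valid = m.sum() * hidden_S.size(-1)  # number of unmasked entries
        return (((hidden_S - hidden_T) ** 2) * m).sum() / valid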

cos

  • Takes in two matrices with the shape (batch_size, len, hidden_size), computes their cosine similarity loss.
  • From DistilBERT
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.
  • If the hidden sizes of student and teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
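
Roughly, in the spirit of DistilBERT's cosine loss (mask handling omitted):

    import torch.nn.functional as F

    def cos_sketch(hidden_S, hidden_T):
        # flatten to (batch_size * len, hidden_size) and pull each pair of
        # position vectors towards cosine similarity 1
        S = hidden_S.reshape(-1, hidden_S.size(-1))
        T = hidden_T.reshape(-1, hidden_T.size(-1))
        target = S.new_ones(S.size(0))
        return F.cosine_embedding_loss(S, T, target)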

pkd

  • Takes in two matrices with the shape (batch_size, len, hidden_size) and computes the mse loss between the normalized vectors at position 0 of the len dimension (e.g., the [CLS] vectors).
  • From Patient Knowledge Distillation for BERT Model Compression
  • If the hidden sizes of student and teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
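
A rough sketch of the idea:

    import torch.nn.functional as F

    def pkd_sketch(hidden_S, hidden_T):
        # take the position-0 vectors (e.g. [CLS]): (batch_size, hidden_size)
        cls_S = F.normalize(hidden_S[:, 0], p=2, dim=-1)
        cls_T = F.normalize(hidden_T[:, 0], p=2, dim=-1)
        # squared distance between normalized vectors, averaged over the batch
        return ((cls_S - cls_T) ** 2).sum(dim=-1).mean()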

nst (mmd)

  • Takes in two lists of matrices A and B; each list contains two matrices with the shape (batch_size, len, hidden_size). The hidden_size of the matrices in A need not match that of B. Computes the mse loss between the similarity matrix of the two matrices in A and that of the two matrices in B (both of shape (batch_size, len, len)).
  • See: Like What You Like: Knowledge Distill via Neuron Selectivity Transfer
  • If the inputs_mask is provided, masks the positions where inputs_mask==0.
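
Roughly (normalization details and masking omitted):

    import torch
    import torch.nn.functional as F

    def nst_sketch(states_S, states_T):
        # states_S, states_T: lists of two tensors, each
        # (batch_size, len, hidden_size); student and teacher
        # hidden sizes may differ
        S0, S1 = states_S
        T0, T1 = states_T
        sim_S = torch.bmm(S0, S1.transpose(1, 2))  # (batch_size, len, len)
        sim_T = torch.bmm(T0, T1.transpose(1, 2))
        return F.mse_loss(sim_S, sim_T)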

fsp

  • Takes in two lists of matrices A and B; each list contains two matrices with the shape (batch_size, len, hidden_size). Computes the similarity matrix of the two matrices in A and that of the two matrices in B (both of shape (batch_size, hidden_size, hidden_size)), then computes the mse loss between those two similarity matrices.

  • See: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning

  • If the inputs_mask is provided, masks the positions where inputs_mask==0.

  • If the hidden sizes of student and teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions, for example:

      intermediate_matches = [
          {'layer_T':[0,0], 'layer_S':[0,0], 'feature':'hidden', 'loss':'fsp', 'weight':1, 'proj':['linear',384,768]},
          ...]
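
For intuition, a rough sketch of the FSP computation (normalization details and masking omitted; after projection, the student and teacher similarity matrices have the same shape):

    import torch
    import torch.nn.functional as F

    def fsp_sketch(states_S, states_T):
        # states_S, states_T: lists of two tensors, each
        # (batch_size, len, hidden_size)
        S0, S1 = states_S
        T0, T1 = states_T
        # Gram matrices over the len dimension:
        # (batch_size, hidden_size, hidden_size)
        gram_S = torch.bmm(S0.transpose(1, 2), S1) / S0.size(1)
        gram_T = torch.bmm(T0.transpose(1, 2), T1) / T0.size(1)
        return F.mse_loss(gram_S, gram_T)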