Rasa NLU Evaluation Results Comparison

This repository contains code to compare multiple sets of Rasa NLU evaluation results. It can be used locally or as a GitHub Action.

You can find more information about Rasa NLU evaluation in the Rasa Open Source docs.

Use as a GitHub Action

This GitHub Action compares NLU evaluation results by running the command python -m compare_nlu_results with the input arguments provided to it.

Basic usage:

...
  steps:
  - name: Compare NLU Results
    uses: RasaHQ/[email protected]
    with:
      nlu_result_files: results1/intent_report.json="old" results2/intent_report.json="new"

Action Output

This GitHub Action returns no output parameters. Instead, it writes two files:

A JSON report of all result sets combined, written to the filepath specified by the input json_outfile.
A formatted HTML table of the compared results, written to the filepath specified by the input html_outfile.

For example, the table might look like this:

metric	precision			recall
result_set	old	new	(new - old)	old	new	(new - old)
entity
micro avg	0.994698	0.998004	0.003306	0.999334	0.998668	-0.000666
macro avg	0.997904	0.998714	0.00081	0.998967	0.994012	-0.004955
weighted avg	0.994733	0.99802	0.003287	0.999334	0.998668	-0.000666
product	0.989286	0.998198	0.008912	1.0	1.0	0.0
language	1.0	1.0	0.0	1.0	0.996633	-0.003367
company	1.0	1.0	0.0	0.988636	1.0	0.011364
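
The (new - old) columns are per-metric differences against the first-listed result set. The following is a minimal sketch of that arithmetic, using two rows copied from the table above. The helper name diff_reports and the nested-dict layout are assumptions made for illustration, not the package's actual implementation.

```python
# Illustrative only: values copied from the "product" and "company" rows above.
old = {
    "product": {"precision": 0.989286, "recall": 1.0},
    "company": {"precision": 1.0, "recall": 0.988636},
}
new = {
    "product": {"precision": 0.998198, "recall": 1.0},
    "company": {"precision": 1.0, "recall": 1.0},
}


def diff_reports(base, other, ndigits=6):
    """Return (other - base) for every metric of every label present in both.

    Hypothetical helper; rounding to six digits mirrors the table's precision.
    """
    return {
        label: {
            metric: round(other[label][metric] - value, ndigits)
            for metric, value in metrics.items()
            if metric in other[label]
        }
        for label, metrics in base.items()
        if label in other
    }


diffs = diff_reports(old, new)
for label, metrics in diffs.items():
    print(label, metrics)
```

A positive value means the metric improved in the second result set, and a negative value means it regressed.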

Input arguments

You can set the following options using with in the step running this action. The nlu_result_files argument is required.

Input	Description	Default
`nlu_result_files`	The Rasa NLU evaluation report files that should be compared and the labels to associate with each of them. For example: `intent_report.json=stable second_intent_report.json=incoming`. The report from which diffs should be calculated should be listed first. All results must be of the same type (e.g. intent classification, entity extraction). Labels for files should be unique. Do not put spaces before or after the = sign. Label values with spaces should be put in double quotes. For example: `previous_results/DIETClassifier_report.json="Previous Stable Results" current_results/DIETClassifier_report.json="New Results"`
`json_outfile`	File to which to write combined json report (contents of all result files).	combined_results.json
`html_outfile`	File to which to write HTML table. File will be overwritten unless `append_table` is specified.	formatted_compared_results.html
`table_title`	Title of HTML table.	Compared NLU Evaluation Results
`label_name`	Type of labels predicted in the provided NLU result files e.g. 'intent', 'entity', 'retrieval intent'.	label
`metrics_to_diff`	Space-separated list of numeric metrics to consider when determining changes across result sets, e.g. `support f1-score`.	All numeric metrics found in input reports
`metrics_to_display`	Space-separated list of metrics to display in the resulting HTML table, e.g. `support f1-score confused_with`.	All metrics found in input reports
`metric_to_sort_by`	Metric to sort by (descending) in the resulting HTML table.	`support`
`display_only_diff`	Display only labels (e.g. intents or entities) where there is a difference in at least one of the `metrics_to_diff` between the first listed result set and the other result set(s). Set to `true` to use.
`append_table`	Whether to append the comparison table to the html output file, instead of overwriting it. If not specified, html_outfile will be overwritten. Set to `true` to use.
`style_table`	Whether to add CSS style tags to the HTML table to highlight changed values. Not compatible with GitHub Markdown format. Set to `true` to use.
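
The nlu_result_files format (file=label pairs separated by spaces, with double quotes around labels that contain spaces) follows shell-style quoting. The sketch below shows one way such a string could be split and checked against the rules above, which can be handy for validating a value before putting it in a workflow. The function parse_result_files is hypothetical and is not part of compare_nlu_results; the package may parse its arguments differently.

```python
import shlex


def parse_result_files(spec):
    """Split a 'path=label path="label with spaces"' string into (path, label) pairs.

    Hypothetical helper for illustration only; it assumes the path itself
    contains no '=' character.
    """
    pairs = []
    for token in shlex.split(spec):  # shlex honours the double quotes around labels
        path, sep, label = token.partition("=")
        if not sep or not path or not label:
            # Also triggered by spaces around '=', which split the pair into pieces.
            raise ValueError(f"expected path=label, got {token!r}")
        pairs.append((path, label))

    labels = [label for _, label in pairs]
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    return pairs


example = (
    'previous_results/DIETClassifier_report.json="Previous Stable Results" '
    'current_results/DIETClassifier_report.json="New Results"'
)
for path, label in parse_result_files(example):
    print(f"{label}: {path}")
```

The first pair returned corresponds to the report from which diffs are calculated.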

Example Usage

You can use this GitHub Action in a CI/CD pipeline for a Rasa assistant that, for example:

Runs NLU cross-validation
Refers to previous stable results (e.g. downloaded from a remote storage bucket; for demonstration purposes, the example below assumes the results are already in the repo)
Runs this action to compare the output of incoming cross-validation results to the previous stable results
Posts the HTML table as a comment to the pull request to more easily review changes

For example:

on:
  pull_request: {}

jobs:
  run_cross_validation:
    runs-on: ubuntu-latest
    name: Cross-validate
    steps:
    - name: Setup python
      uses: actions/setup-python@v1
      with:
        python-version: '3.8'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt

    - name: Run cross-validation
      run: |
        rasa test nlu --cross-validation

    - name: Compare Intent Results
      uses: RasaHQ/[email protected]
      with:
        nlu_result_files: last_stable_results/intent_report.json="Stable" results/intent_report.json="Incoming"
        table_title: Intent Classification Results
        json_outfile: results/compared_intent_classification.json
        html_outfile: results/compared_results.html
        display_only_diff: false
        label_name: intent
        metrics_to_display: support f1-score
        metrics_to_diff: support f1-score
        metric_to_sort_by: support

    - name: Compare Entity Results
      uses: RasaHQ/[email protected]
      with:
        nlu_result_files: last_stable_results/DIETClassifier_report.json="Stable" results/DIETClassifier_report.json="Incoming"
        table_title: Entity Extraction Results
        json_outfile: results/compared_DIETClassifier.json
        html_outfile: results/compared_results.html
        append_table: true
        display_only_diff: true
        label_name: entity
        metrics_to_display: support f1-score precision recall
        metrics_to_diff: precision recall
        metric_to_sort_by: recall

    - name: Post cross-val comparison to PR
      uses: amn41/comment-on-pr@comment-file-contents
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      with:
        msg: results/compared_results.html

Local Use

To compare NLU evaluation results locally, run, for example:

python -m compare_nlu_results --nlu_result_files results/intent_report.json=Base new_results/intent_report.json=New

See python -m compare_nlu_results --help for all options; the descriptions can also be found in the input arguments section.

You can also use the package in a Python script to load, compare and further analyse results:

from compare_nlu_results.results import (
    EvaluationResult,
    EvaluationResultSet
)

# view just a result set
old_results = EvaluationResult(json_report_filepath="tests/data/results/intent_report.json", name="old")
print(old_results.df)

# combine two result sets
new_results = EvaluationResult(json_report_filepath="tests/data/second_results/intent_report.json", name="new")
combined_results = EvaluationResultSet(result_sets=[old_results, new_results], label_name="intents")
print(combined_results.df)

# See differences between result sets
print(combined_results.get_diffs_between_sets())
