codeCAI is a conversational assistant that enables analysts to specify data analyses in natural language, which are then translated into executable Python code statements.
To realize an assistant capable of interpreting analytical instructions, we use a Transformer-based language model with an adapted tree encoding scheme and a restrictive grammar model that learns to map natural-language specifications to a tree-based representation of the output code. With this syntax-driven approach, we aim to enable the language model to capture the hierarchical relationships in syntax trees representing Python code fragments.
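To illustrate the tree-based representation mentioned above, the following sketch uses Python's built-in `ast` module to parse a small code fragment into its abstract syntax tree. This is illustrative only (the fragment is a made-up example), not codeCAI's actual encoding pipeline:

```python
import ast

# A small code fragment of the kind the assistant generates
fragment = "result = data.mean()"

# Parse it into an abstract syntax tree (AST)
tree = ast.parse(fragment)

# The dump exposes the hierarchical structure that a tree
# encoding scheme is designed to capture
print(ast.dump(tree.body[0]))
```

The nested `Assign(targets=[...], value=Call(...))` structure in the output is exactly the kind of hierarchy that a purely sequential token encoding flattens away.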
We have implemented a RASA-based dialogue system prototype, which we have integrated into the JupyterLab environment. Using this natural-language interface, data analyses can be specified at a high level of abstraction and are then automatically mapped to executable program instructions that are generated in a Jupyter Notebook.
We performed the evaluation on two tasks, namely semantic parsing and code generation, using the ZIH HPC cluster. In both cases, the goal is to generate formal meaning representations from natural language input, that is, lambda-calculus expressions or Python code.
Experimental results show that on one benchmark the tree encoding performs better than the sequential encoding used by the original Transformer architecture. To test whether the tree-encoded Transformer learns to predict the AST structure correctly, we examined exact match accuracy as well as token- and sequence-level precision and recall. The analysis of correctly predicted prefixes shows that string literals have a significant impact on prediction quality and that longer sequences are more difficult to predict. We also found that tree encoding yields an improvement of up to 3.0% over sequential encoding when string literals are excluded.
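As an illustration of the reported metrics, the following minimal sketch computes exact match accuracy and token-level precision/recall on hypothetical token sequences. It is not the evaluation code used in the experiments:

```python
from collections import Counter

def token_metrics(predicted, reference):
    """Token-level precision/recall via multiset overlap (illustrative)."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

def exact_match(pred_seqs, ref_seqs):
    """Fraction of predicted sequences that match the reference exactly."""
    matches = sum(p == r for p, r in zip(pred_seqs, ref_seqs))
    return matches / len(ref_seqs)

# Hypothetical token sequences (made up for illustration)
pred = ["x", "=", "np", ".", "mean", "(", "a", ")"]
ref  = ["x", "=", "np", ".", "mean", "(", "arr", ")"]

print(token_metrics(pred, ref))    # (0.875, 0.875)
print(exact_match([pred], [ref]))  # 0.0
```

The example shows why the two views differ: a single wrong token leaves token-level precision and recall high while dropping exact match accuracy to zero.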
demo.mp4
- Python 3.8 (earlier versions may work as well)
- Conda (Miniconda3 is sufficient)
- Create and activate Python environment

```shell
conda create -n pyenv_rasa python=3.8
conda activate pyenv_rasa
```

Alternatively, you can also use virtualenv (e.g. if you would like to use system site_packages during code inference):

```shell
# if virtualenv is not installed: pip3 install --user virtualenv (after conda deactivate)
conda deactivate
python3 -m virtualenv ~/pyenv_rasa
source ~/pyenv_rasa/bin/activate  # use this everywhere instead of "conda activate pyenv_rasa" below
```

- Install Rasa

```shell
pip3 install rasa==2.2.8 rasa-sdk==2.2.0 tensorflow==2.3.4 protobuf==3.14.0
```
- Clone repository (if not done yet)

```shell
git clone https://github.com/SmartDataAnalytics/codeCAI.git
cd codeCAI
```
- Install nl2code model
```shell
pip3 install -e nl2codemodel/
```
- Install missing and incorrectly versioned dependencies
```shell
pip3 install sentencepiece==0.1.95 torchmetrics==0.5.1
```
- Download Rasa model
```shell
mkdir -p rasa/models
wget -P rasa/models --content-disposition 'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=20210726-155823_kmeans.tar.gz'
```
- Download NL2Code vocabulary, grammar graph and checkpoint
```shell
mkdir -p nl2codemodel/models
wget -P nl2codemodel/models --content-disposition \
  'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=usecase-nd_vocab_src.model' \
  'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=usecase-nd_grammargraph.gpickle' \
  'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=last.ckpt'
```
- Test Rasa installation

```shell
cd rasa
export NL2CODE_CONF=$PWD/nl2code.yaml
rasa run actions -vv
```

In a separate terminal (under the `rasa` working directory):

```shell
conda activate pyenv_rasa
rasa shell -m models/20210726-155823_kmeans.tar.gz
```

You can use the examples in the `rasa/examples` directory, e.g. `k_means_clustering.csv` (or `.json`), for conversing with the assistant.
- Create and activate Python environment using conda

```shell
cd codecai_jupyter_nli
conda env create -f environment.yml
conda activate codecai_jupyter_nli
```
- Install dependencies
```shell
pip3 install -e .
```
- Build the extension

```shell
jlpm run build
```

(or `jlpm run watch` for development, to automatically rebuild on changes)
- Activate Python environment
```shell
conda activate pyenv_rasa
```
- Change to the Rasa project directory
```shell
cd rasa
```
- Run Rasa server (adjust port 8888 if it is taken by an application other than JupyterLab)

```shell
rasa run --enable-api --debug -m models/***.tar.gz --cors ["localhost:8888"]
```
- Run Rasa action server (in a separate console window, in the `rasa` directory)

```shell
conda activate pyenv_rasa
cd rasa
export NL2CODE_CONF=$PWD/nl2code.yaml
rasa run actions -vv
```
- Activate Python environment using conda
```shell
conda activate codecai_jupyter_nli
```
- Change to Jupyter base directory (creating it if necessary)

```shell
mkdir -p ~/jupyter_root
cd ~/jupyter_root
```
- Run JupyterLab
```shell
jupyter lab
```
- Open dialog assistant
  - Open (or create) a Jupyter notebook
  - Press Ctrl+Shift+C to open the Command Palette (or select View → Activate Command Palette)
  - Type "nli"
  - Select "Show codeCAI NLI"
- Use dialog assistant
  - Type an instruction (e.g. "Hi" or "Which ML methods do you know?") in the text field at the bottom of the "codeCAI NLI" side panel on the right. Working examples are found in the `rasa/examples` directory, e.g. `k_means_clustering.csv` (or `.json`).
  - Press Return or click the "Send" button
In the ScaDS.AI Living Lab lecture, we presented an overview of state-of-the-art language models for program synthesis, introduced some basic characteristics of these models, and discussed several of their limitations. One possible research direction that could help alleviate these limitations is the inclusion of structural knowledge; we briefly introduced our attempt in this direction:
- codeCAI Poster - Generating Code from Natural Language
- codeCAI Paper (OpenReview) - Transformer with Tree-order Encoding for Neural Program Generation
Distributed under the MIT License. See LICENSE.txt for more information.
This work was supported by the German Federal Ministry of Education and Research (BMBF, 01IS18026A-D) by funding the competence center for Big Data and AI "ScaDS.AI Dresden/Leipzig". The authors gratefully acknowledge the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden.