codeCAI is a conversational assistant that enables analysts to specify data analyses in natural language, which are then translated into executable Python code statements.
To realize an assistant capable of interpreting analytical instructions, we use a Transformer-based language model with an adapted tree encoding scheme and a restrictive grammar model that learns to map natural-language specifications to a tree-based representation of the output code. With this syntax-driven approach, we aim to enable the language model to capture the hierarchical relationships in syntax trees representing Python code fragments.
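To illustrate the tree-based representation mentioned above, the following sketch uses Python's built-in `ast` module to parse a small code fragment into its abstract syntax tree. This is illustrative only (the fragment is a made-up example), not codeCAI's actual encoding pipeline:

```python
import ast

# A small code fragment of the kind the assistant generates
fragment = "result = data.mean()"

# Parse it into an abstract syntax tree (AST)
tree = ast.parse(fragment)

# The dump exposes the hierarchical structure that a tree
# encoding scheme is designed to capture
print(ast.dump(tree.body[0]))
```

The nested `Assign(targets=[...], value=Call(...))` structure in the output is exactly the kind of hierarchy that a purely sequential token encoding flattens away.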
We have implemented a RASA-based dialogue system prototype, which we have integrated into the JupyterLab environment. Using this natural-language interface, data analyses can be specified at a high level of abstraction and are then automatically mapped to executable program instructions that are generated in a Jupyter Notebook.
We performed the evaluation on two tasks, namely semantic parsing and code generation, using the ZIH HPC cluster. In both cases, the goal is to generate formal meaning representations from natural language input, that is, lambda-calculus expressions or Python code.
Experimental results show that on one benchmark the tree encoding performs better than the sequential encoding used by the original Transformer architecture. To test whether the tree-encoded Transformer learns to predict the AST structure correctly, we examined exact match accuracy as well as token- and sequence-level precision and recall. The analysis of correctly predicted prefixes shows that string literals have a significant impact on prediction quality and that longer sequences are more difficult to predict. We also found that tree encoding yields an improvement of up to 3.0% over sequential encoding when string literals are excluded.
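As an illustration of the reported metrics, the following minimal sketch computes exact match accuracy and token-level precision/recall on hypothetical token sequences. It is not the evaluation code used in the experiments:

```python
from collections import Counter

def token_metrics(predicted, reference):
    """Token-level precision/recall via multiset overlap (illustrative)."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

def exact_match(pred_seqs, ref_seqs):
    """Fraction of predicted sequences that match the reference exactly."""
    matches = sum(p == r for p, r in zip(pred_seqs, ref_seqs))
    return matches / len(ref_seqs)

# Hypothetical token sequences (made up for illustration)
pred = ["x", "=", "np", ".", "mean", "(", "a", ")"]
ref  = ["x", "=", "np", ".", "mean", "(", "arr", ")"]

print(token_metrics(pred, ref))    # (0.875, 0.875)
print(exact_match([pred], [ref]))  # 0.0
```

The example shows why the two views differ: a single wrong token leaves token-level precision and recall high while dropping exact match accuracy to zero.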
demo.mp4
- Python 3.8 (earlier versions may work as well)
- Conda (Miniconda3 is sufficient)
- Create and activate Python environment

```shell
conda create -n pyenv_rasa python=3.8
conda activate pyenv_rasa
```

Alternatively, you can also use virtualenv (e.g. if you would like to use system site_packages during code inference):

```shell
# if virtualenv is not installed: pip3 install --user virtualenv (after conda deactivate)
conda deactivate
python3 -m virtualenv ~/pyenv_rasa
source ~/pyenv_rasa/bin/activate  # use this everywhere instead of "conda activate pyenv_rasa" below
```

- Install Rasa

```shell
pip3 install rasa==2.2.8 rasa-sdk==2.2.0 tensorflow==2.3.4 protobuf==3.14.0
```
- Clone repository (if not done yet)

```shell
git clone https://github.com/SmartDataAnalytics/codeCAI.git
cd codeCAI
```
- Install nl2code model
```shell
pip3 install -e nl2codemodel/
```
- Install missing and incorrectly versioned dependencies
```shell
pip3 install sentencepiece==0.1.95 torchmetrics==0.5.1
```
- Download Rasa model
```shell
mkdir -p rasa/models
wget -P rasa/models --content-disposition 'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=20210726-155823_kmeans.tar.gz'
```
- Download NL2Code vocabulary, grammar graph and checkpoint
```shell
mkdir -p nl2codemodel/models
wget -P nl2codemodel/models --content-disposition \
  'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=usecase-nd_vocab_src.model' \
  'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=usecase-nd_grammargraph.gpickle' \
  'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=last.ckpt'
```
- Test Rasa installation

```shell
cd rasa
export NL2CODE_CONF=$PWD/nl2code.yaml
rasa run actions -vv
```

In a separate terminal (under the `rasa` working directory):

```shell
conda activate pyenv_rasa
rasa shell -m models/20210726-155823_kmeans.tar.gz
```

You can use the examples in the `rasa/examples` directory, e.g. `k_means_clustering.csv` (or `.json`), for conversing with the assistant.
- Create and activate Python environment using conda

```shell
cd codecai_jupyter_nli
conda env create -f environment.yml
conda activate codecai_jupyter_nli
```
- Install dependencies
```shell
pip3 install -e .
```
- Build the extension

```shell
jlpm run build
```

(or `jlpm run watch` for development, to automatically rebuild on changes)
- Activate Python environment
```shell
conda activate pyenv_rasa
```
- Change to the Rasa project directory
```shell
cd rasa
```
- Run Rasa server (adjust port 8888 if it is taken by an application other than JupyterLab)

```shell
rasa run --enable-api --debug -m models/***.tar.gz --cors ["localhost:8888"]
```
- Run Rasa action server (in a separate console window, in the `rasa` directory)

```shell
conda activate pyenv_rasa
cd rasa
export NL2CODE_CONF=$PWD/nl2code.yaml
rasa run actions -vv
```
- Activate Python environment using conda
```shell
conda activate codecai_jupyter_nli
```
- Change to Jupyter base directory (creating it if necessary)

```shell
mkdir -p ~/jupyter_root
cd ~/jupyter_root
```
- Run JupyterLab
```shell
jupyter lab
```
- Open dialog assistant
  - Open (or create) a Jupyter notebook
  - Press Ctrl+Shift+C to open the Command Palette (or select View → Activate Command Palette)
  - Type "nli"
  - Select "Show codeCAI NLI"
- Use dialog assistant
  - Type an instruction (e.g. "Hi" or "Which ML methods do you know?") in the text field at the bottom of the "codeCAI NLI" side panel on the right. Working examples are found in the `rasa/examples` directory, e.g. `k_means_clustering.csv` (or `.json`).
  - Press Return or click the "Send" button
In the ScaDS.AI Living Lab lecture, we presented an overview of state-of-the-art language models for program synthesis, introduced some basic characteristics of these models, and discussed several of their limitations. One possible research direction that could help alleviate these limitations is the inclusion of structural knowledge; we briefly introduced our attempt in this direction:
- codeCAI Poster - Generating Code from Natural Language
- codeCAI Paper (OpenReview) - Transformer with Tree-order Encoding for Neural Program Generation
Distributed under the MIT License. See LICENSE.txt for more information.
This work was supported by the German Federal Ministry of Education and Research (BMBF, 01IS18026A-D) by funding the competence center for Big Data and AI "ScaDS.AI Dresden/Leipzig". The authors gratefully acknowledge the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden.