
Isotropy in the Contextual Embedding Space: Clusters and Manifolds

This repo contains the code, poster, and slides for the paper: X. Cai, J. Huang, Y. Bian, and K. Church, "Isotropy in the Contextual Embedding Space: Clusters and Manifolds", ICLR 2021.

The poster and slides are in the poster/ folder.


Supplementary Code

Set up using venv

python3 -m venv venv
source venv/bin/activate
python -m pip install -r requirements.txt

[Optional]: install FAISS to compute Local Intrinsic Dimension (LID)

To install FAISS, please refer to https://github.com/facebookresearch/faiss
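
For a quick CPU-only setup, the community wheels on PyPI are usually sufficient (an assumption; the official instructions in the FAISS repository are authoritative):

python -m pip install faiss-cpu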

Minimum effort to run the code

This script generates a 3D plot of the embeddings for the tokens ["the","first","man"]:

bash run_example.sh

Generate embeddings:

For example, to generate token embeddings from BERT's 3rd layer on the wiki2 dataset, simply run the following:

source venv/bin/activate
python gen_embeds.py bert wiki2 3 --save_file bert.layer.3.dict

The code writes the generated files to:

./embeds/[dataset]/[model].layer.[layerID].dict
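
Per the --save_file help below, these files are Python pickles. A quick sanity check of a generated file (a minimal sketch, assuming the top-level object supports len()):

python -c "import pickle; d = pickle.load(open('embeds/wiki2/bert.layer.3.dict', 'rb')); print(type(d), len(d))"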

The arguments include:

usage: gen_embeds.py model dataset layer

positional arguments:
  model                 model: gpt, gpt2, bert, dist (distilbert), xlm
  dataset               dataset: wiki2, ptb, wiki103, or other customized datapath
  layer                 layer id

optional arguments:
  -h, --help            show this help message and exit
  --save_file SAVE_FILE
                        save pickle file name
  --log_file LOG_FILE   log file name
  --datapath DATAPATH   customized datapath
  --batch_size BATCH_SIZE
                        batch size, default=1
  --bptt_len BPTT_LEN   tokens length, default=512
  --sample SAMPLE       [beta], uniform with probability=beta
  --no_cuda             disable gpu
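
For instance, a GPT-2 run with a larger batch size and uniform subsampling might look like this (the flag values are illustrative; the flags themselves come from the help above):

python gen_embeds.py gpt2 ptb 6 --save_file gpt2.layer.6.dict --batch_size 8 --sample 0.1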

Perform comprehensive analysis

After obtaining the embedding dict files in the previous step, we can perform a comprehensive analysis by passing task arguments:

python gen_embeds.py embeds/wiki2/bert.layer.3.dict [tasks]

Tasks include the following:

  1. Compute the averaged inter-cosine similarity. For each type/word, sample 1 embedding instance.
     --inter_cos --maxl 1
  2. Compute the averaged intra-cosine similarity.
     --intra_cos
  3. Perform clustering and report the mean-shifted, as well as clustered, inter- and intra-cosine similarities (see the example after this list).
     --cluster --cluster_cos --center
  4. Draw 2D, 3D, or frequency heatmap figures.
     --draw [2d]/[3d]/[freq]
  5. Draw tokens in 3D plots. Specify the tokens to draw using --draw_token. The code evaluates the string as a list, so please use the following format.
     --draw token --draw_token "['the','first','man','&']"
  6. Compute LID using either Euclidean or cosine distance.
     --lid --lid_metric [l2]/[cos]
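
For example, to run clustering with mean-centering on the BERT layer-3 embeddings generated above (following the invocation pattern shown earlier):

python gen_embeds.py embeds/wiki2/bert.layer.3.dict --cluster --cluster_cos --center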

Other settings include:

  1. Dimension reduction and sampling. Please refer to -h for details of the following flags.
     --embed [embed_dimension] --maxl [sample_method]
  2. Center shifting, i.e., subtracting the mean.
     --center
  3. Zoom into the two distinct clusters that exist in the GPT-2 embedding space (example below).
     --zoomgpt2 [left]/[right] --draw 3d
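
For example, to zoom into the left GPT-2 cluster in a 3D plot (the dict path is illustrative):

python gen_embeds.py embeds/wiki2/gpt2.layer.3.dict --zoomgpt2 left --draw 3d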

Get ELMo embeddings

We use the AllenNLP package for ELMo. To obtain the embeddings, first prepare the input data as a raw text file, then run:

source venv/bin/activate
python -m pip install allennlp==1.0.0rc3
allennlp elmo /path/to/dataset_text.txt /tmp/output.hdf5 --all
python elmo.py /tmp/output.hdf5 /path/to/dataset_text.txt [dataset_name] [layer_id]

Note that the latest allennlp versions no longer provide the "elmo" subcommand, so please use an older version, e.g., 1.0.0rc3.
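
To sanity-check the HDF5 file before running elmo.py, you can list its contents with h5py (a minimal sketch, assuming h5py is installed; with --all, allennlp typically stores one entry per input sentence, so inspect your own file to confirm):

python -c "import h5py; f = h5py.File('/tmp/output.hdf5', 'r'); print(list(f.keys())[:5])"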
