jonsterling/stlc-type-inference

dependencies

The LaTeX files have dependencies that are tracked as git submodules. Please run the following before attempting to build anything:

git submodule update --init --recursive

hacking

To run or hack on this, ensure that you have SML/NJ installed. Then, open an SML REPL:

rlwrap sml

Here's an example session:

Standard ML of New Jersey...
- CM.make "src/stlc.cm";
- Test.test ();

This should print the following:

(lam nil) ==> (-> #1 unit)
(lam (var 0)) ==> (-> #4 #4)
(lam (lam (lam (var 2)))) ==> (-> #7 (-> #9 (-> #11 #7)))

There is no parser yet, but I can write one if we need it.
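For orientation, the constructors in the output above (lam, var, nil, unit, ->, and the metavariables #n), together with the dataset section below, suggest roughly the following SML datatypes. This is only a sketch; the actual declarations in src/ may differ:

datatype term = NIL | VAR of int | LAM of term | AP of term * term
datatype ty   = UNIT | AR of ty * ty | META of int

(* Read "(lam (var 0)) ==> (-> #4 #4)" as: the identity function is
   assigned the type a -> a for a fresh metavariable a (here numbered 4). *)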

editors?

I like to edit SML code using Visual Studio Code, because we have made a very nice SML mode with mostly correct highlighting (a herculean feat!) and integration with SML/NJ for type error reporting. (This is what the sml.json file in the root directory of this repository is for.)

dataset

The file dataset contains 1,031,993 randomly generated term/type pairs. You can obtain it by extracting dataset.xz using xz -d dataset.xz. The first column is the term. The second column is its type. The two columns are separated by a single space. Example data points include:

(\0)(\(\\0)(0)) (1>(1>1))
(\0)(\(\0)((\\*)(0))) (1>(1>1))
* 1
(\*)(\\\\\*) 1
(\*)(\\\\3) 1

I tried to minimise syntax to simplify things for our neural network. The syntax used for terms is

`NIL`        = *
`LAM t`      = \`t`
`AP (t1,t2)` = (`t1`)(`t2`)

The syntax used for types is

`UNIT`       = 1
`AR (t1,t2)` = (`t1`>`t2`)
`META v`     = 1

I replaced the type variables META v by UNIT to give us the simplest possible types. Again, our initial goal is to get the machine-learning model to handle the simplest type inferences first, and then generalise.
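To make that concrete, here is a sketch of a serialiser for this syntax; the function names are mine, not the repository's:

(* Same term/ty datatypes as in the sketch above. *)
datatype term = NIL | VAR of int | LAM of term | AP of term * term
datatype ty   = UNIT | AR of ty * ty | META of int

(* Print terms and types in the dataset's compact syntax. *)
fun showTerm NIL           = "*"
  | showTerm (VAR i)       = Int.toString i
  | showTerm (LAM t)       = "\\" ^ showTerm t
  | showTerm (AP (t1, t2)) = "(" ^ showTerm t1 ^ ")(" ^ showTerm t2 ^ ")"

fun showTy UNIT        = "1"
  | showTy (AR (a, b)) = "(" ^ showTy a ^ ">" ^ showTy b ^ ")"
  | showTy (META _)    = "1"   (* metavariables printed as UNIT, as described above *)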

The longest term is 308 characters long:

(\(\\(\\*)((\*)(*)))((\(\(\(\\\\\\*)(\3))(\\4))((\(\\*)(\\\\\\\\\\\\10))((\\\\\\\\4)(\\\\\\\\*))))(\(\\\(\\\*)(*))((\\\\\\\\7)(\\\\\\\\4)))))((\(\\(\(\\\*)(\(\\1)(\\*)))(\(\\\\5)(\\\\1)))((\(\\\\\\\\\\\*)(\\*))((\\(\\\\\\\\\\*)(\\\\\\\\\\\\5))((\\(\\\\0)(\\\\\\\\\\\\*))(\(\*)(\\\\\\\\\\\\\\15))))))((\(\\\\0)((\\*)((\*)(\\\\\\\\3))))((\(\\*)((\\\\0)(\\\\\\\\\\\\\\7)))((\\1)((\\\\\\\\\\\\\3)(\\\\\\\\*))))))

The longest type is 198 characters long:

(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>(1>1)))))))))))))))))))))))))))))))))))))))))))))))))

One can straightforwardly embed these into R^n (say, R^400 and R^300, respectively) by taking the vectors of ASCII character codes, and padding with zeros as needed.
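A minimal sketch of that embedding in SML (the function name and the use of a real vector are my own choices, not part of the repository):

(* Embed a term or type string into R^n: take each character's ASCII
   code and pad with zeros up to the fixed length n. *)
fun embed (n : int) (s : string) : real vector =
  Vector.tabulate (n, fn i =>
    if i < size s
    then real (Char.ord (String.sub (s, i)))
    else 0.0)

(* e.g. embed 400 "*" is a length-400 vector whose first entry is 42.0 *)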

Generating the dataset

First, compile the random lambda term generator using mlton. In the commands below, change the suffix x86-linux to match the file that SML/NJ generates on your platform:

$ make -C cm2mlb
$ cd src
$ sml @SMLload=../cm2mlb/cm2mlb.x86-linux rando.cm > rando.mlb
$ mlton rando.mlb

Then to generate N random lambda terms, go

$ ./rando N

For example,

 $ ./rando 3
 \(\\0)((\(\(\\2)(*))(\1))((\\\\\*)(\\\*))) (1>(1>1))
 \\(\(\3)(\\\*))(\\(\\\6)(\\\\\\\3)) (1>(1>1))
 \(\1)((\\\\2)(\(\\\\\6)((\\1)(\*)))) (1>1)

Because these won't necessarily be unique, you'll probably want to go

$ ./rando N | sort -u

Note that this imposes an ordering on the terms; you can pipe the output through shuf or sort -R to shuffle it again. The code in rando.sml uses hardcoded probabilities to decide which kind of term to generate at each step. Edit these probabilities to change the complexity of the generated terms.
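For a sense of what that means, here is a hedged sketch of a probability-driven generator in the same spirit; the probabilities, names, and use of the SML/NJ Library's Random structure are illustrative, and rando.sml itself may be organised differently:

(* Same term datatype as in the sketches above. *)
datatype term = NIL | VAR of int | LAM of term | AP of term * term

val rng = Random.rand (13, 37)

(* depth counts the enclosing lambdas, so de Bruijn indices stay in scope. *)
fun gen depth =
  let val p = Random.randReal rng in
    if p < 0.3 then NIL
    else if p < 0.5 andalso depth > 0
      then VAR (Random.randRange (0, depth - 1) rng)
    else if p < 0.8 then LAM (gen (depth + 1))
    else AP (gen depth, gen depth)
  end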

To preprocess the terms and types for input to neural networks, we did:

$ cut -d' ' -f1 dataset | sed -e 's/./& /g' -e 's/\([0-9]\) \([0-9]\)/\1\2/g' -e 's/ $//g' | xz -9 - > terms.xz
$ cut -d' ' -f2 dataset | sed -e 's/./& /g' -e 's/ $//g' | xz -9 - > types.xz

The term on line n of terms then has its type on line n of types.
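The sed pipeline inserts a space after every character and then re-joins adjacent digits, so that a multi-digit de Bruijn index (such as the 10 or 15 in the long term above) stays a single token. Here is a sketch of the same tokenisation in SML (the names are mine):

(* One token per character, except that consecutive digits are kept
   together as a single multi-digit de Bruijn index. *)
fun tokenize (s : string) : string list =
  let
    fun flush ("", toks) = toks
      | flush (cur, toks) = cur :: toks
    fun go ([], cur, toks) = rev (flush (cur, toks))
      | go (c :: cs, cur, toks) =
          if Char.isDigit c
          then go (cs, cur ^ String.str c, toks)
          else go (cs, "", String.str c :: flush (cur, toks))
  in
    go (String.explode s, "", [])
  end

(* String.concatWith " " (tokenize line) reproduces one preprocessed line. *)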
