ocaml-torch faces several challenges, including:
- binding to thousands of functions
- avoiding any minor memory leaks in these functions
- quickly cleaning up the memory allocations of tensors when OCaml is done with them
To solve this, we use two steps of code generation. In this diagram, solid arrows represent the code-generation DAG and dashed arrows represent the code-dependency DAG:
At a high level,

- `Declarations.yaml` contains the function signatures for the whole Torch C++ API.
- Custom binding generation reads all the declarations and, whenever possible, generates
  - glue code for crossing between C and C++ (the generated C/C++ API),
  - glue code for using the (yet-to-be-generated) OCaml foreign functions in OCaml (the generated OCaml wrapper),
  - and `ctypes` bindings.
- Stub generation uses the `ctypes` library, reading the bindings and generating C and OCaml stubs. These are just glue code to handle the C/OCaml FFI. Note that we have some manually written C++ functions and bindings that get generated stubs.
- There are an extremely small number of manually written stubs (just one as of this writing) that `ctypes` cannot handle.
- A combination of the generated OCaml wrapper and the manually written wrapper provides an actually usable OCaml API. These are further built upon in the main library (not pictured).
A large part of this complexity is driven by memory management. It is challenging to write manual FFI stubs without memory leaks or race conditions. We use `ctypes` to make sure we get this right for the vast majority of functions. Although it requires a second code-generation step, this spares us from reinventing stub generation.
We ensure that tensors are freed when OCaml garbage collects them. To do this, each tensor is equipped with a custom finalizer. This could be done on either the C++ or the OCaml side; however, the API for informing OCaml of a tensor's true size in memory (the custom block API) only exists on the C++ side. Without it, OCaml would not know when to garbage collect CPU tensors and would easily run out of memory.
Each C++ `torch::Tensor` is essentially an intrusive pointer to a `TensorImpl` that stores the real information about the tensor. Each `TensorImpl` contains a reference count, and whenever an intrusive pointer is created or destroyed, that reference count changes. We can't pass these `torch::Tensor`s directly to OCaml, though, so instead we work with `TensorImpl *`s and use the `release`/`reclaim(_copy)` Torch API for intrusive pointers. The finalizer on each garbage-collected OCaml tensor just does a `reclaim`, allowing the refcount to decrement to 0 when the resulting `torch::Tensor` goes out of scope.
Note that OCaml is unaware of GPU memory usage. GPU users may need to manually garbage collect.
One wrinkle in this setup is that ctypes cannot handle custom blocks. Since we want the bulk of our stubs to be generated by ctypes, we draw a distinction between `raw_tensor`s and `gc_tensor`s.
|                    | raw tensor     | GC tensor      |
|--------------------|----------------|----------------|
| has finalizer?     | no             | yes            |
| GC knows its size? | no             | yes            |
| FFI input for C?   | no             | yes            |
| FFI output from C? | yes            | no             |
| ctypes type        | `void ptr`     | `void ptr`     |
| C++ type           | `TensorImpl *` | `TensorImpl *` |
The only way to convert from a `raw_tensor` to a `gc_tensor` is with the hand-written, non-ctypes function `with_tensor_gc`. It is used copiously in the generated OCaml wrapper code to ensure we only surface GC tensors to the user.
The lifecycle of each tensor looks like this:
- Some wrapper function `let t = Tensor.foo ()` gets invoked, which makes its way into C++.
- C++ returns a `raw_tensor` that goes through a regular ctypes stub and makes its way back to the OCaml `Tensor.foo` call.
- Still in `Tensor.foo`, `with_tensor_gc` gets invoked. This goes back into C++ and copies the pointer (but not the data) of the tensor to a new custom block with a known off-heap size. It does not attach a finalizer using the block's `custom_ops`. Instead, the block is returned to OCaml, where the finalizer is attached using `Gc.finalise`. The custom block and the raw tensor address are combined to create a new `gc_tensor` fat pointer, which keeps the custom block alive as long as the tensor is accessible in OCaml.
- Now `let () = Tensor.bar t` gets invoked. This goes through the usual ctypes stubs, since `t` looks just like a regular `void ptr` to ctypes.
- Eventually `t` gets garbage collected. If the managed custom block has no other references, it runs the finalizer and frees the tensor's data.
This is a lot of indirection. The memory of each tensor (raw or GC) looks like this:
```
          block 1                  block 2 (raw)
       ------------------         -----------------
root -> | ctypes fat ptr | -----> |    void *     | -----> torch::TensorImpl -> storage
       ------------------    |    -----------------                  ^
                             |                                       |
                             |     block 3 (managed)                 |
                             |    --------------------               |
                             ---> |      void *      | --------------|
                                  --------------------
```
Here's what each thing in the chain does:
- block 1: allows ctypes to manage the memory of blocks 2 and 3
- block 2: points to the off-heap memory and may be used as a `voidp`
- block 3: also points to the off-heap memory, but is an opaque block with an attached OCaml finalizer that decrements the tensor refcount
- `torch::TensorImpl`: holds metadata about the tensor's data type, size, etc., as well as a pointer/reference to its heap-allocated data (storage)
- storage: the actual numerical data of the tensor