
Proposal: Static types #309

Draft · wants to merge 17 commits into dev
Conversation

@bentsherman bentsherman commented May 1, 2024

This PR is a showcase of the proposed syntax for static types in Nextflow.

While I started with the goal of simply adding type annotations and type checking, I realized that many aspects of the language needed to be re-thought in order to provide a consistent developer experience. Some of these things can be done now, but I suspect they will be more difficult without static types, so I have tried to show them in their "best form" in this PR.

Changes

  • Type annotations. The following declarations can be annotated with a type:

    • workflow params/outputs
    • workflow takes/emits
    • process inputs/outputs
    • function parameters/return
    • local variables (generally not needed)

    Nextflow will use these type annotations to infer the type of every value in the workflow and make sure they are valid.

    The main built-in types are:

    • int, float, boolean, String: primitive types
    • Path: file or directory
    • List<E>, Set<E>, Bag<E>: collections with various constraints on ordering and uniqueness
    • Map<K,V>: map of key-value pairs
    • Channel<E>: channel (i.e. queue channel)
  • User-defined types. Types can be composed in several ways to facilitate domain modeling:

    • tuples: any number of values can be grouped into a tuple, e.g. (1, 'hello', true) has type (int, String, boolean)
    • records: a combination of named values, e.g. a sample is a meta map AND some files:
      record Sample { meta: Map ; files: List<Path> }
    • enums: a union of named values, e.g. a shirt size can be small or medium or large:
      enum TshirtSize { Small, Medium, Large }
    • optionals: any type can be suffixed with ? to denote that it can be null (e.g. String?), otherwise it should never be null
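    As an illustrative sketch (the type and field names here are made up for this example), these constructs could be combined to model a sequencing sample in the proposed syntax:

      record FastqSample {
        meta: Map                  // arbitrary sample metadata
        fastq_1: Path              // forward reads
        fastq_2: Path?             // reverse reads, null for single-end data
        strandedness: Strandedness
      }

      enum Strandedness { Forward, Reverse, Unstranded }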
  • Only use params in top-level workflow. Params are not known outside the entry workflow. Pass params into processes and workflows as explicit inputs instead.

  • Replace ext config with params and process inputs. The ext directive is untyped, which prevents static type checking. Pass ext settings as process inputs instead. If an ext setting needs to be configurable, expose a param for it.

  • Replace publishDir with workflow outputs. Publish channels to output targets in the entry workflow instead. Declare an index file for the output target instead of creating a samplesheet in the workflow. Bundle related data in a single channel (e.g. metadata, fastq, and md5 files for each sample). Don't try to anticipate the needs of downstream pipelines, just publish a comprehensive index file that can be filtered/renamed as needed.
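    As a rough sketch of how this could look (the target and field names are hypothetical and the exact syntax is not final), an output target with an index file might be declared as:

      output {
        samples {
          path 'samples'           // how published files are organized
          index {
            path 'samples.csv'     // serialized view of the channel
          }
        }
      }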

  • Use eval output, topic channels to collect tool versions. Send tool versions to a topic instead of emitting them as process outputs. Use the eval() or env() function to emit the tool version. Declare an index file (see versions output target) instead of rendering the YAML directly.

  • Define pipeline params in the main script. Each param has a type. Complex types can be composed from collections, records, and enums. Rather than specifying a particular input format for input files, simply specify a type and Nextflow will use the type to generate a schema and load from various sources. Config params are defined separately in the main config.

  • Replace def with let, var, and fn. These keywords better distinguish functions from (mutable/immutable) variables. Not a crucial change but IMO it makes the code easier to read.

  • Replace pipe operator | with |>. This pipe operator better conveys the directional nature of a pipe and doesn't overload an existing operator (bitwise or). It can apply any "callable" (workflow / process / function) to any value:

    // x, y, z can be any value
    // f can be any callable
    x |> f == f(x)
    x |> f(y, z) == f(x, y, z)
  • Queue channels are just channels and value channels are just values. The Channel type always refers to a queue channel. Value channels can be used like any value, e.g. in an if statement:

    // convert a list into a channel and back into a list
    vals = 1..10 |> Channel.of |> collect
    // `vals` is just a value, so just use it!
    if( vals.size() > 2 )
      println 'more than two!'
  • Processes are just functions. Processes cannot be called directly with channels. Instead, call a process with values to execute it once, or call it in an operator to execute it for each value in the source channel:

    // execute FASTQC in parallel on each input file
    Channel.fromPath( "inputs/*.fastq" )
      |> map(FASTQC) // short for { fastq -> FASTQC(fastq) }
    
    // execute ACCUMULATE sequentially on each input file
    // (replaces experimental recursion)
    Channel.fromPath( "inputs/*.txt" )
      |> reduce { result, file -> ACCUMULATE( result, file ) }
  • Simple operator library. Many operators can be removed in favor of equivalent functions or similar operators. All in all, the operator library can be reduced to the following core operators:

    • collect: collect channel elements into a collection (i.e. bag)
    • cross: cross product of two channels
    • filter: filter a channel based on a condition
    • flatMap: perform a scatter
    • groupBy: perform a gather
    • joinBy: relational join of two channels (i.e. horizontal)
    • map: transform a channel
    • mix: concatenate multiple channels (i.e. vertical)
    • reduce: accumulate each channel element into a single value
    • scan: like reduce but emit each intermediate value
    • subscribe: invoke a function for each channel element
    • view: print each channel element
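    To illustrate how these core operators compose (a sketch only; the sample-id extraction and the ALIGN process are hypothetical), a scatter/gather over samples might look like:

      Channel.fromPath('reads/*.fastq')
        |> map { fastq -> (fastq.baseName.tokenize('_')[0], fastq) }  // (sampleId, fastq)
        |> groupBy { (id, fastq) -> id }                              // gather by sample
        |> map { (id, fastqs) -> ALIGN(id, fastqs) }                  // one task per sample
        |> view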

Benefits

  • Well-defined workflow inputs. Workflow inputs are explicitly defined alongside the entry workflow as a set of params. Each param has a type, and complex params can be loaded transparently from any source (file, database, API, etc) as long as the runtime supports it. The JSON schema of a param can be inferred from the param's type.

  • Well-defined workflow outputs. Workflow outputs are explicitly defined as a set of "output targets". Each target has a type and can create an index file, which is essentially a serialization of a channel to external storage (file, database, API, etc), and each target defines how its published files are organized (e.g. in a directory tree). The JSON schema of an output target can be inferred from the target's type.

  • Make pipeline import-able. Separating the "core" workflow (i.e. SRA) from params and publishing makes it easy to import the pipeline into larger pipelines. See https://github.com/bentsherman/fetchngs2rnaseq for a more complete example.

  • Simpler dataflow logic. Since value channels are just values to the user, they can be used with native constructs like an if statement, rather than only callbacks. Since processes are called in the operator closure, their inputs can be different from the source channel structure, and their outputs can be different from the output channel structure. The amount of boilerplate, both in the workflow logic and process definition, is significantly reduced.

  • Simpler operator library. With a minimal set of operators, users can easily determine which operator to use based on their needs. The operators listed above are statically typed, pertain only to generic stream operations, and work with any type of value, not just tuples and lists.

  • Simpler process inputs/outputs. Process inputs and outputs are defined similarly to workflow inputs and outputs, rather than a custom set of type qualifiers. Files are automatically staged/unstaged based on their type declaration. Inputs can have default values. Thanks to the simplified dataflow logic described above, tuples are generally not needed.
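    A sketch of what a process with typed inputs and outputs might look like under this proposal (the exact output syntax shown here is illustrative, not final):

      process FASTQC {
        input:
        fastq: Path
        args: String = ''          // inputs can have default values

        output:
        html: Path = file('*.html')
        zip: Path = file('*.zip')

        script:
        """
        fastqc $args $fastq
        """
      }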

Extra Notes

This proposed syntax will be enabled by the following internal improvements:

  • New script/config parser, which enables us to evolve the Nextflow language into whatever we want, without being constrained by Groovy syntax (though it still must compile to Groovy AST).

  • Static analysis, which can infer the type of every value based on the declared types of pipeline/workflow/process inputs, and infer whether to handle an async value (i.e. value channel) with a callback or an await.

  • Automatic generation of JSON schemas for workflow params and outputs based on their types. Preserves support for external tools like Seqera Platform, and lays the groundwork to transparently support different connectors (CSV/JSON file, HTTP API, SQL database, etc).

Comment on lines 114 to 117
|> map { meta ->
  def sample = new Sample( meta, meta.fastq_aspera.tokenize(';').take(2).collect( name -> file(name) ) )
  ASPERA_CLI ( sample, 'era-fasp', aspera_cli_args )
} // fastq: Channel<Sample>, md5: Channel<Sample>
@bentsherman (Author):
@mahesh-panchal to your point about dynamic args, I think we can do even better in DSL3:

|> map { meta ->
  def sample = new Sample( /* ... */ )
  ASPERA_CLI ( sample, 'era-fasp', "${meta.key}" )
}

Because we call the process in a map operator explicitly (currently it is implied), we can control how the process is invoked for each task within the operator closure, instead of passing multiple queue channels.

Member:
Yes, this specifically is much better. Treating processes like functions is soooo much better, since there's no implicit transformation going on with all the singleton/queue channel stuff. It has to be formed and then mapped. And tuples disappear too, except in channels (?).

Actually, I think what's worrying me about this syntax is the mixing of input types. An input could be a channel (e.g. MULTIQC_MAPPINGS_CONFIG ( mappings ) lower down) or it could be an input set (e.g. this dynamically defined Sample). This is already confusing to newcomers: we commonly see people trying to use channels inside map, branch, etc.
I guess one could explain the second option as passing dynamically defined singleton channels.

@bentsherman (Author):
In this proposal, there are no "value" channels, only queue channels and regular values. So MULTIQC_MAPPINGS_CONFIG ( mappings ) is no different, because mappings is just a value. It may be an async value, and Nextflow might represent it as a value channel under the hood, but to the user it should be indistinguishable from a regular value.

In other words, you cannot call a process with a channel, only with values.

@bentsherman bentsherman changed the title DSL2+ / DSL3 preview DSL2+ / DSL3 proof-of-concept May 1, 2024
@bentsherman bentsherman changed the title DSL2+ / DSL3 proof-of-concept Preview: DSL2+ (and beyond) May 1, 2024
Comment on lines 56 to 75
//
// MODULE: Get SRA run information for public database ids
//
|> map { id ->
  SRA_IDS_TO_RUNINFO ( id, ena_metadata_fields )
} // Channel<Path>
//
// MODULE: Parse SRA run information, create file containing FTP links and read into workflow as [ meta, [reads] ]
//
|> map(SRA_RUNINFO_TO_FTP) // Channel<Path>
|> set { runinfo_ftp } // Channel<Path>
|> flatMap { tsv ->
  splitCsv(tsv, header:true, sep:'\t')
} // Channel<Map>
|> map { meta ->
  meta + [single_end: meta.single_end.toBoolean()]
} // Channel<Map>
|> unique // Channel<Map>
|> set { sra_metadata } // Channel<Map>

Contributor:
This actually might have the biggest impact. Piping has never been more popular, especially with biologists, because of R. Being able to write the pipeline in this functional way might help express the mental model of channels. Sorry, too much language there, but I like this a lot.

@bentsherman (Author):
Indeed. A lot of moving pieces have to come together to make this work. Does it make sense to you how I am calling the process like a function in the map operator?

Member:
It took a second to process, but this is soooo much better. This deals with what I wanted much better than how I initially thought about it, coming from the current syntax.

Member:
what is |> used for?

Member:
ooh, I see, sorry

@mahesh-panchal (Member) left a comment:
I really like the pipe replacement: |> is easier to see and provides some visual directionality, helping readability.

Something extra: How about also shifting:

workflow {
    workflow.onComplete {
    }
}

to

workflow {
    onStart:
    ...

    take:
    ...

    main:
    ...
    
    onComplete:
    ...
}

There are some things I really like here, but I have reservations about other stuff, like how channels are conflated with their values, and process outputs.

Comment on lines 23 to 25
topic:
[ task.process, 'sratools', eval("fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+'") ] >> 'versions'
[ task.process, 'pigz', eval("pigz --version 2>&1 | sed 's/pigz //g'") ] >> 'versions'
Member:
I kind of like this, but I dislike the name topic. I don't feel like the word communicates what its function is.

It would also be nice if we could supply a regex to validate what is returned by the eval, for some fast-fail behavior when there's extra stuff being emitted. Where would one define a global variable pattern? E.g.

def SOFTWARE_VERSION = /\d+.../
def SHASUM = /\w{16}/ 

Or maybe this should be a class? like you can filter { Number }.

@bentsherman (Author):
topic is a term from stream processing, used to collect related events from many different sources. In this case we are sending the tool version info to a custom "versions" topic, then the workflow reads from that topic to build the versions YAML file.

eval is just a function defined in the output / topic scope, so you could wrap it in a custom validation function:

def validate( pattern, text ) {
  // ...
}

// ...
  topic:
  validate( /foo/, eval('...') ) >> 'versions'

@mahesh-panchal (Member):
How come there are new keywords (let, fn, etc)? What's the difference to def?

@bentsherman (Author):

Just another idea to consider. With a formal grammar, we don't have to adhere so closely to Groovy; we can make whatever syntax we want, as long as it can be translated to Groovy AST. So as a demonstration, I have replaced def with more specific keywords: fn for function definition, let for a variable that can't be reassigned, and var for a variable that can be reassigned (essentially def vs final in Groovy).

Notice I also changed how types are specified: <name>: <type> instead of <type> <name>, which I like personally because it emphasizes the semantic name over the type, which is optional.
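A small sketch of how these keywords and the name-then-type annotations could read together (illustrative only, not a confirmed final syntax):

fn mean(values: List<float>): float {
  var total = 0.0            // mutable, can be reassigned
  for( v in values )
    total += v
  let n = values.size()      // immutable
  return total / n
}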

@mahesh-panchal (Member):

Is it going to be problematic if people combine Groovy and this grammar? For example, in exec: blocks.

@bentsherman (Author):

It would apply to all Nextflow code, including exec: blocks

@mahesh-panchal (Member):

I understood it would apply to all, but my question was really whether people could mix grammars, and if so, what would happen, e.g.

exec:
let some_var = do_stuff
def another_thing = do_other_stuff

@bentsherman (Author):

We would either drop def in the next DSL version (a hard cut-off) or support it temporarily with a compiler warning

@bentsherman bentsherman changed the title Preview: DSL2+ (and beyond) Proposal: Beyond DSL2 May 21, 2024
Signed-off-by: Ben Sherman <[email protected]>
@bentsherman bentsherman changed the title Proposal: Beyond DSL2 Proposal: Static types Nov 2, 2024