Some improvements to config docs
Leo Selker committed Mar 27, 2023
1 parent 66227c5 commit e549889
Showing 2 changed files with 41 additions and 16 deletions.
51 changes: 38 additions & 13 deletions quickstart/config.md
@@ -1,22 +1,41 @@

# Configuration file

The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code. <br>
The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code. It is written in [YAML](https://yaml.org/).<br>

The first parameters to set are those related to spark:
## Spark parameters
Spark parameters are set under the `spark` heading. The syntax for specifying Spark parameters derives from [Spark's own property names](https://spark.apache.org/docs/latest/configuration.html#available-properties). For example, to configure the parameter `spark.app.name` in the Cider config, we'd use
```
spark:
  app:
    name: "my_first_cider_app"
```

Here is a more complete example config. It's not meant as an endorsement of these specific config values; optimal choices vary greatly based on your environment and use case.

```
-spark:
-  app_name: "mm"
-  files:
-    max_partition_bytes: 67108864
+spark:
+  app:
+    name: "my_first_cider_app"
+  master: "local[*]"
+  sql:
+    shuffle:
+      partitions: 144
   driver:
-    memory: "8g" // driver memory
-    max_result_size: "2g" // maximum result size when collecting to driver
-  loglevel: "ERROR"
+    memory: "8G"
+    maxResultSize: "2G"
+    supervise: true
+  executor:
+    memory: "8G"
+  rpc:
+    askTimeout: "600s"
+  loglevel: "WARN"
+  logConf: true
```
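
The relationship between the nested `spark` block and Spark's dotted property names can be illustrated with a short Python sketch. This is not Cider's actual loading code; `flatten_config` is a hypothetical helper, and the dictionary stands in for an already-parsed YAML file.

```python
from typing import Any, Dict


def flatten_config(section: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested config keys into dotted names (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    for key, value in section.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested headings, extending the dotted prefix.
            flat.update(flatten_config(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Stand-in for the parsed config file (values taken from the example above).
config = {
    "spark": {
        "app": {"name": "my_first_cider_app"},
        "driver": {"memory": "8G", "maxResultSize": "2G"},
    }
}

spark_properties = flatten_config(config)
for name, value in spark_properties.items():
    print(name, "=", value)
```

Each leaf value ends up under a key such as `spark.driver.memory`, which is the form Spark's own documentation uses.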

Next we specify folder and file locations. File subpaths are given relative to a "parent" directory: Either the `input_data` directory or the `working` directory (file subpaths should not have leading slashes). The locations of the parent directories must be specified with absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It will use the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and then the ml module reads features back in from that directory. At present, file names/sub-paths written programmatically under the `working` directory are hard-coded and can't be specified in config.
## File and folder locations

Under the `path` heading, we specify folder and file locations. File subpaths are given relative to a "parent" directory: either the `input_data` directory or the `working` directory (if you'd rather specify an absolute path, use a leading slash). The locations of the parent directories must be specified as absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It uses the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and the ML module then reads them back in (from that same directory, unless a different one is specified as input). At present, file names and subpaths written programmatically under the `working` directory are hard-coded and can't be specified in the config.

```
path:
@@ -42,7 +61,9 @@ path:
directory_path: "/Users/example/Documents/GD/cider/working_directory/"
```
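
As a rough illustration of the path rules described above, the following Python sketch resolves a file subpath against its parent directory. It is an assumption about how such resolution could work, not Cider's implementation; `resolve_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def resolve_path(parent_dir: str, subpath: str) -> PurePosixPath:
    """Resolve a config subpath against its parent directory (hypothetical helper)."""
    parent = PurePosixPath(parent_dir)
    if not parent.is_absolute():
        raise ValueError(f"parent directory must be an absolute path: {parent_dir}")
    # Joining with an absolute subpath discards the parent, so a leading
    # slash on the subpath acts as an override of the parent directory.
    return parent / subpath


# Hypothetical locations, for illustration only.
relative = resolve_path("/data/cider/input_data", "cdr.csv")
absolute = resolve_path("/data/cider/input_data", "/elsewhere/cdr.csv")
print(relative)
print(absolute)
```

A subpath without a leading slash is placed under the parent directory, while one with a leading slash is used as given.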

The featurizer module expects certain column and column names, and we can define them in the following section of the config file:
## Column names

Cider expects certain columns to be present, and we can specify their names under the `col_names` heading (this is not a complete list):

```
col_names:
@@ -81,7 +102,9 @@ col_names:
geo: "tower_id"
```
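
One way to catch a mismatch between the configured column names and a dataset is to compare them before running any module. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, and apart from `geo: "tower_id"`, the keys and the dataset header are made up.

```python
from typing import Dict, Iterable, List


def missing_columns(col_names: Dict[str, str], dataset_columns: Iterable[str]) -> List[str]:
    """Return configured column names that the dataset lacks (hypothetical helper)."""
    available = set(dataset_columns)
    return sorted(name for name in col_names.values() if name not in available)


# "geo" comes from the config excerpt above; "caller_id" is a hypothetical entry.
col_names = {"geo": "tower_id", "caller_id": "caller_id"}
# Hypothetical dataset header.
header = ["caller_id", "timestamp"]

missing = missing_columns(col_names, header)
print(missing)
```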

We also have to set a few parameters that will affect the behaviour of some modules:
## Miscellaneous parameters

We also have to set a few parameters that will affect the behaviour of some modules, under the `params` heading:

```
params:
@@ -104,7 +127,9 @@ params:
opt_in_default: false # if true, opt-in is the default, i.e. all users give their consent unless they opt out
```
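
The effect of `opt_in_default` can be sketched in Python as follows. Only the `true` case is described in the config comment; the behaviour for `false` (consent only for users who explicitly opted in) is an assumption, and `has_consent` is a hypothetical function rather than part of Cider.

```python
from typing import Set


def has_consent(user_id: str, opt_in_default: bool,
                opted_in: Set[str], opted_out: Set[str]) -> bool:
    """Decide whether a user's data may be used (hypothetical helper)."""
    if opt_in_default:
        # Everyone consents by default unless they explicitly opted out.
        return user_id not in opted_out
    # Assumed behaviour: without the default, only explicit opt-ins count.
    return user_id in opted_in


opted_in = {"alice"}
opted_out = {"bob"}

for user in ("alice", "bob", "carol"):
    print(user,
          has_consent(user, True, opted_in, opted_out),
          has_consent(user, False, opted_in, opted_out))
```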

Finally, we can set the hyper-parameters that will be tested during a grid-search performed by the ML module:
## ML tuning parameters

Under the `hyperparams` heading, we set the hyper-parameters that will be tested during a grid-search performed by the ML module:

```
hyperparams:
6 changes: 3 additions & 3 deletions quickstart/installation.md
@@ -2,8 +2,8 @@

This project uses Poetry to install and manage dependencies and virtual environments. Follow the [instructions](https://python-poetry.org/docs/) to install Poetry.

From the root directory `poetry update` followed by `poetry install` - this will establish a venv with all the needed dependencies.
From the root directory, run `poetry update` followed by `poetry install`; this will establish a virtual environment with all the needed dependencies.

Once your venv is made you can use `poetry run [command]` to run a single CLI command inside the venv.
Once your virtual environment is made you can use `poetry run [command]` to run a single CLI command inside the virtual environment.

You can use `poetry shell` to enter into the venv.
You can use `poetry shell` to enter into the virtual environment.
