diff --git a/quickstart/config.md b/quickstart/config.md
index c6dee20..323a42d 100644
--- a/quickstart/config.md
+++ b/quickstart/config.md
@@ -1,22 +1,41 @@
# Configuration file
-The configuration file - example at `configs/config.yml` - is used to store all relevant configurations, like paths to the datasets and spark parameters. It should be appropriately edited before executing the code.
+The configuration file (see the example at `configs/config.yml`) stores all relevant settings, such as paths to the datasets and Spark parameters. Edit it as needed before running the code. It is written in [YAML](https://yaml.org/).
-The first parameters to set are those related to spark:
+## Spark parameters
+Spark parameters are set under the `spark` heading. The syntax for specifying them derives from [Spark's own property names](https://spark.apache.org/docs/latest/configuration.html#available-properties): each dot-separated segment of a property name becomes one level of nesting. For example, to configure the parameter `spark.app.name` in the Cider config, we'd use:
+```
+spark:
+  app:
+    name: "my_first_cider_app"
+```
+
+Here is a more complete example config. It's not meant as an endorsement of these specific config values; optimal choices vary greatly based on your environment and use case.
```
-spark:
-  app_name: "mm"
-  files:
-    max_partition_bytes: 67108864
+spark:
+  app:
+    name: "my_first_cider_app"
+  master: "local[*]"
+  sql:
+    shuffle:
+      partitions: 144
   driver:
-    memory: "8g" // driver memory
-    max_result_size: "2g" // maximum result size when collecting to driver
-  loglevel: "ERROR"
+    memory: "8G"
+    maxResultSize: "2G"
+    supervise: true
+  executor:
+    memory: "8G"
+  rpc:
+    askTimeout: "600s"
+  loglevel: "WARN"
+  logConf: true
```
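
To illustrate how the nested keys under `spark` correspond to Spark's dotted property names, here is a minimal sketch in Python. It is not Cider's actual loading code: the helper `flatten_spark_config` is a hypothetical name introduced for this example, and the dictionary stands in for what a YAML parser would return for part of the config above.

```python
from typing import Any, Dict


def flatten_spark_config(node: Dict[str, Any], prefix: str = "spark") -> Dict[str, Any]:
    """Flatten nested config keys into dotted property names (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    for key, value in node.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # A nested mapping adds one more dotted segment to the name.
            flat.update(flatten_spark_config(value, name))
        else:
            flat[name] = value
    return flat


# Stand-in for the parsed contents of the `spark` heading (assumed shape).
spark_section = {
    "app": {"name": "my_first_cider_app"},
    "master": "local[*]",
    "sql": {"shuffle": {"partitions": 144}},
    "driver": {"memory": "8G", "maxResultSize": "2G"},
}

properties = flatten_spark_config(spark_section)
for name in sorted(properties):
    print(f"{name} = {properties[name]}")
```

Reading the result back, `sql` → `shuffle` → `partitions` becomes `spark.sql.shuffle.partitions`, which is the name to look up in the Spark documentation.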
-Next we specify folder and file locations. File subpaths are given relative to a "parent" directory: Either the `input_data` directory or the `working` directory (file subpaths should not have leading slashes). The locations of the parent directories must be specified with absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It will use the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and then the ml module reads features back in from that directory. At present, file names/sub-paths written programmatically under the `working` directory are hard-coded and can't be specified in config.
+## File and folder locations
+
+Under the `path` heading, we specify folder and file locations. File subpaths are given relative to a "parent" directory: either the `input_data` directory or the `working` directory. To specify an absolute path instead, start it with a leading slash. The parent directories themselves must be given as absolute paths (with leading slashes). Cider will not modify files under the `input_data` directory. It uses the `working` directory for program outputs, some of which may act as inputs for later steps. For example, the featurizer writes features to the `working` directory, and the ML module then reads them back in (from that same directory, unless a different one is specified as input). At present, the file names and subpaths that Cider writes under the `working` directory are hard-coded and can't be set in the config.
```
path:
@@ -42,7 +61,9 @@ path:
directory_path: "/Users/example/Documents/GD/cider/working_directory/"
```
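
The relative-versus-absolute rule for file subpaths can be sketched as follows. This is an illustration under stated assumptions rather than Cider's implementation: `resolve_path` is a hypothetical helper, and the subpaths passed to it are made-up examples. It relies on `pathlib`'s documented behaviour that joining an absolute path onto a parent discards the parent.

```python
from pathlib import PurePosixPath


def resolve_path(parent_dir: str, subpath: str) -> PurePosixPath:
    """Resolve a configured subpath against its parent directory (hypothetical helper)."""
    parent = PurePosixPath(parent_dir)
    if not parent.is_absolute():
        raise ValueError(f"parent directory must be an absolute path: {parent_dir}")
    # If `subpath` starts with a slash, the join returns it unchanged;
    # otherwise it is appended to the parent directory.
    return parent / subpath


working_dir = "/Users/example/Documents/GD/cider/working_directory/"

# Hypothetical subpaths, for illustration only.
relative = resolve_path(working_dir, "outputs/features.csv")
absolute = resolve_path(working_dir, "/data/elsewhere/features.csv")

print(relative)
print(absolute)
```

The first call yields a location inside the working directory, while the second ignores the parent entirely because the subpath has a leading slash.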
-The featurizer module expects certain column and column names, and we can define them in the following section of the config file:
+## Column names
+
+Cider expects certain columns to be present, and we can specify their names under the `col_names` heading (this is not a complete list):
```
col_names:
@@ -81,7 +102,9 @@ col_names:
geo: "tower_id"
```
-We also have to set a few parameters that will affect the behaviour of some modules:
+## Miscellaneous parameters
+
+We also have to set a few parameters that will affect the behaviour of some modules, under the `params` heading:
```
params:
@@ -104,7 +127,9 @@ params:
opt_in_default: false // if true opt-in is set as default, i.e. all users give their consent unless they opt-out
```
-Finally, we can set the hyper-parameters that will be tested during a grid-search performed by the ML module:
+## ML tuning parameters
+
+Under the `hyperparams` heading, we set the hyperparameter values that the ML module will test during its grid search:
```
hyperparams:
diff --git a/quickstart/installation.md b/quickstart/installation.md
index 5835daa..928d111 100644
--- a/quickstart/installation.md
+++ b/quickstart/installation.md
@@ -2,8 +2,8 @@
To install, and manage dependencies and virtual environments this project uses Poetry. Follow the [instructions](https://python-poetry.org/docs/) to install Poetry.
-From the root directory `poetry update` followed by `poetry install` - this will establish a venv with all the needed dependencies.
+From the root directory, run `poetry update` followed by `poetry install`. This creates a virtual environment with all the needed dependencies.
-Once your venv is made you can use `poetry run [command]` to run a single CLI command inside the venv.
+Once your virtual environment has been created, you can use `poetry run [command]` to run a single CLI command inside it.
-You can use `poetry shell` to enter into the venv.
\ No newline at end of file
+You can use `poetry shell` to enter the virtual environment.
\ No newline at end of file