0.2.0 (#18)

## Features
- major overhaul of `dataset.sql` so that each appended activity is joined and aggregated in an independent CTE, which are then combined in a final CTE
- clarified project variables usage
- large reorganization of the folders and yml
tnightengale · Mar 16, 2023 · b7501ad
1 parent b8841a0
commit b7501ad

Showing 43 changed files with 369 additions and 235 deletions.
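
The commit message above describes the new shape of `dataset.sql`: one independent CTE per appended activity, each joined to the primary activity and aggregated, then combined in a final CTE. The following is a minimal sketch of that shape only, not the SQL the macro actually compiles. The table name `activity_stream`, the CTE names, the toy rows, and the use of `min(ts)` to stand in for a "first after" relationship are all assumptions made for illustration, and SQLite is used purely so the example runs on its own.

```python
import sqlite3

# Toy activity stream (hypothetical data); column names follow the V2 spec.
rows = [
    ("a1", "c1", "signed up", 1),
    ("a2", "c1", "visit page", 2),
    ("a3", "c1", "visit page", 3),
    ("a4", "c1", "bought something", 4),
    ("a5", "c2", "signed up", 5),
    ("a6", "c2", "visit page", 6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table activity_stream "
    "(activity_id text, customer text, activity text, ts integer)"
)
conn.executemany("insert into activity_stream values (?, ?, ?, ?)", rows)

# Assumed structure: one CTE per appended activity, combined at the end.
query = """
with
primary_activity as (
    select activity_id, customer, ts
    from activity_stream
    where activity = 'signed up'
),
-- Appended activity 1: joined to the primary activity and aggregated alone.
first_after_visit as (
    select p.activity_id, min(a.ts) as first_after_visit_ts
    from primary_activity p
    left join activity_stream a
        on a.customer = p.customer
        and a.activity = 'visit page'
        and a.ts > p.ts
    group by p.activity_id
),
-- Appended activity 2: same pattern, independent of the CTE above.
first_after_purchase as (
    select p.activity_id, min(a.ts) as first_after_purchase_ts
    from primary_activity p
    left join activity_stream a
        on a.customer = p.customer
        and a.activity = 'bought something'
        and a.ts > p.ts
    group by p.activity_id
),
-- Final CTE: one row per primary activity, with every appended column.
final as (
    select
        p.activity_id,
        p.customer,
        p.ts,
        v.first_after_visit_ts,
        b.first_after_purchase_ts
    from primary_activity p
    left join first_after_visit v on v.activity_id = p.activity_id
    left join first_after_purchase b on b.activity_id = p.activity_id
)
select * from final order by activity_id
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because each appended CTE is aggregated back to one row per primary `activity_id` before the final join, adding another appended activity in this sketch cannot multiply the rows contributed by the others.
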
diff --git a/README.md b/README.md
@@ -12,11 +12,10 @@ modelling framework, based on the
 - [Install](#install)
 - [Usage](#usage)
   - [Create a Dataset](#create-a-dataset)
-  - [Configure Columns](#configure-columns)
-    - [Required Columns](#required-columns)
-    - [Mapping Column Names](#mapping-column-names)
-    - [Included Dataset Columns](#included-dataset-columns)
-    - [Configure Appended Activity Column Names](#configure-appended-activity-column-names)
+  - [Required Columns](#required-columns)
+- [Vars](#vars)
+  - [Column Mappings (optional)](#column-mappings-optional)
+  - [Included Columns (optional)](#included-columns-optional)
 - [Macros](#macros)
   - [Dataset (source)](#dataset-source)
   - [Activity (source)](#activity-source)
@@ -59,17 +58,19 @@ Include in `packages.yml`:
 
 ```yaml
 packages:
-  - git: "https://github.com/tnightengale/dbt-activity-schema"
-    revision: 0.1.0
+  - package: tnightengale/dbt_activity_schema
+    version: 0.2.0
 ```
 For latest release, see
 https://github.com/tnightengale/dbt-activity-schema/releases.
 
 ## Usage
 
 ### Create a Dataset
-Use the [dataset macro](#dataset-source) with the appropriate arguments to
-derive a Dataset by self-joining the Activity Stream model in your project. The
+Use the [dataset macro](#dataset-source) to self-join an Activity Stream using
+[relationships](#relationships).
+
+The
 [dataset macro](#dataset-source) will compile based on the provided [activity
 macros](#activity-source) and the [relationship macros](#relationships). It
 can then be nested in a CTE in a dbt-Core model. Eg:
@@ -80,7 +81,7 @@ with
 
 dataset_cte as (
     {{ dbt_activity_schema.dataset(
-        activity_stream_ref = ref("example__activity_stream"),
+        activity_stream = ref("example__activity_stream"),
 
         primary_activity = dbt_activity_schema.activity(
             dbt_activity_schema.all_ever(), "bought something"),
@@ -98,17 +99,12 @@ select * from dataset_cte
 
 ```
 > Note: This package does not contain macros to create the Activity Stream
-> model. It derives Dataset models on top of an existing Activity Stream model.
+> model. It generates the SQL to self-join an existing Activity Stream model.
 
-### Configure Columns
+### Required Columns
 This package conforms to the [Activity Schema V2
 Specification](https://github.com/ActivitySchema/ActivitySchema/blob/main/2.0.md#entity-table)
-and, by default, it expects the columns in that spec to exist in the Activity
-Stream model.
-
-#### Required Columns
-In order for critical joins in the [dataset macro](#dataset-source) to work as
-expected, the following columns must exist:
+and requires the following columns to function:
   - **`activity`**: A string or ID that identifies the action or fact
     attributable to the `customer`.
   - **`customer`**: The UUID of the entity or customer. Must be used across
@@ -120,18 +116,24 @@ expected, the following columns must exist:
   - **`activity_occurrence`**: The running count of the activity per customer.
     Create using a rank window function, partitioned by activity and customer.
 
-#### Mapping Column Names
-If the required columns exist conceptually under different names, they can be
-aliased using the nested `activity_schema_v2_column_mappings` project var. Eg:
+## Vars
+This package can be configured with the following project variables. All project
+vars can be scoped globally or to the `dbt_activity_schema` package.
+
+### Column Mappings (optional)
+The `column_mappings` project variable can be used to alias columns in Activity
+Stream. If the [required columns](#required-columns) exist conceptually under
+different names, they can be mapped to their names in the [V2
+Specification](https://github.com/ActivitySchema/ActivitySchema/blob/main/2.0.md#entity-table).
+Eg:
 
 ```yml
 # dbt_project.yml
-
 ...
 
 vars:
   dbt_activity_schema:
-    activity_schema_v2_column_mappings:
+    column_mappings:
       # Activity Stream with required column names that
       # differ from the V2 spec, mapped from their spec name.
       customer: entity_uuid
@@ -140,19 +142,18 @@ vars:
 ...
 ```
 
-#### Included Dataset Columns
-The set of columns that are included in the compiled SQL of the [dataset
-macro](#dataset-source) can be configured using the nested
-`default_dataset_columns` project var. Eg:
+### Included Columns (optional)
+The `included_columns` project variable can be set to indicate the default
+columns to be included in each [activity](#activity-source) passed to
+[dataset](#dataset-source). Eg:
 ```yml
 # dbt_project.yml
-
 ...
 
 vars:
   dbt_activity_schema:
     # List columns from the Activity Schema to include in the Dataset
-    default_dataset_columns:
+    included_columns:
       - activity_id
       - entity_uuid
       - activity_occurred_at
@@ -161,27 +162,23 @@ vars:
 ...
 ```
 
-These defaults can be overridden using the `override_columns` argument in the
-[activity macro](#activity-source).
+If it is not set, all the columns from the [V2
+Specification](https://github.com/ActivitySchema/ActivitySchema/blob/main/2.0.md#entity-table)
+will be included, based on the [columns macro](./macros/utils/columns.sql).
 
-#### Configure Appended Activity Column Names
-The naming convention of the columns, in the activities passed to the
-`appended_activities` argument can be configured by overriding the
-[generate_appended_column_alias](./macros/utils/generate_appended_column_alias.sql)
-macro. See the dbt docs on [overriding package
-macros](https://docs.getdbt.com/reference/dbt-jinja-functions/dispatch#overriding-package-macros)
-for more details.
+These defaults can be overridden on a per-activity basis by passing a list of column names to the `included_columns` argument in the
+[activity macro](#activity-source).
 
 ## Macros
 
 ### Dataset ([source](macros/dataset.sql))
 Generate the SQL for self-joining the Activity Stream.
 
 **args:**
-- **`activity_stream_ref (required)`** :
-  [ref](https://docs.getdbt.com/reference/dbt-jinja-functions/ref)
+- **`activity_stream (required)`** :
+  [ref](https://docs.getdbt.com/reference/dbt-jinja-functions/ref) | str
 
-  The dbt `ref()` that points to the activity stream model.
+  The dbt `ref()` or a CTE name that contains the [required columns](#required-columns).
 
 - **`primary_activity (required)`** : [activity](#activity-source)
 
@@ -209,9 +206,9 @@ dataset.
   The string identifier of the activity in the Activity Stream. Should match the
   value in the `activity`  column.
 
-- **`override_columns (optional)`** : List [ str ]
+- **`included_columns (optional)`** : List [ str ]
 
-  List of columns to include for the activity. Setting this Overrides the
+  List of columns to include for the activity. Setting this overrides the
   defaults configured by the `default_dataset_columns` project var.
 
 - **`additional_join_condition (optional)`** : str
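
The Required Columns text in the README hunk above says `activity_occurrence` is created "using a rank window function, partitioned by activity and customer". A minimal sketch of that derivation follows. The `events` table, its rows, and the `order by ts` inside the window are assumptions for illustration (the README does not state the ordering column), and the example assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events (customer text, activity text, ts integer)")
conn.executemany(
    "insert into events values (?, ?, ?)",
    [
        ("c1", "visit page", 1),
        ("c1", "bought something", 2),
        ("c1", "visit page", 3),
        ("c2", "visit page", 5),
    ],
)

# Running count of each activity per customer, ordered by timestamp (assumed).
occurrences = conn.execute(
    """
    select
        customer,
        activity,
        ts,
        rank() over (
            partition by activity, customer
            order by ts
        ) as activity_occurrence
    from events
    order by customer, activity, ts
    """
).fetchall()

for row in occurrences:
    print(row)
```

Note that `rank()` assigns the same value to rows with identical timestamps within a partition, so a stream with tied `ts` values may need a tiebreaker column in the window ordering.
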

diff --git a/dbt_project.yml b/dbt_project.yml
@@ -1,9 +1,9 @@
 
 # Project name.
 name: 'dbt_activity_schema'
-version: '0.1.1'
+version: '0.2.0'
 config-version: 2
-require-dbt-version: [">=1.3.0"]
+require-dbt-version: [">=1.3.0", "<2.0.0"]
 
 # The "profile" dbt uses for this project.
 profile: 'dbt_activity_schema'
@@ -18,5 +18,19 @@ snapshot-paths: ["snapshots"]
 
 target-path: "target"
 clean-targets:
-    - "target"
-    - "dbt_modules"
+  - "target"
+  - "dbt_modules"
+
+vars:
+  included_columns:
+    - activity_id
+    - ts
+    - customer
+    - anonymous_customer_id
+    - activity
+    - activity_occurrence
+    - activity_repeated_at
+    - feature_json
+    - revenue_impact
+    - link
+  column_mappings: {}
diff --git a/integration_tests/dbt_project.yml b/integration_tests/dbt_project.yml
@@ -21,12 +21,12 @@ models:
 
 vars:
   dbt_activity_schema:
-    default_dataset_columns:
+    included_columns:
       - activity_id
       - entity_uuid
       - ts
       - revenue_impact
-    activity_schema_v2_column_mappings:
+    column_mappings:
       customer: entity_uuid
       anonymous_customer_id: anonymous_entity_uuid
 

diff --git a/integration_tests/models/first_after/dataset__first_after_3.sql b/integration_tests/models/first_after/dataset__first_after_3.sql
@@ -3,13 +3,24 @@
         ref("input__first_after"),
         dbt_activity_schema.activity(
             dbt_activity_schema.all_ever(),
-            "signed up"
+            "signed up",
+            [
+                "activity_id",
+                "entity_uuid",
+                "ts",
+                "revenue_impact",
+                "feature_json"
+            ]
         ),
         [
             dbt_activity_schema.activity(
                 dbt_activity_schema.first_after(),
                 "visit page",
-                ["feature_json", "activity_occurrence", "ts"],
+                [
+                    "feature_json",
+                    "activity_occurrence",
+                    "ts"
+                ],
                 additional_join_condition="
                 json_extract({primary}.feature_json, 'type')
                 = json_extract({appended}.feature_json, 'type')

diff --git a/integration_tests/models/first_after/first_after.yml b/integration_tests/models/first_after/first_after.yml
@@ -0,0 +1,18 @@
+version: 2
+
+models:
+
+  - name: dataset__first_after_1
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__first_after_1")
+
+  - name: dataset__first_after_2
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__first_after_2")
+
+  - name: dataset__first_after_3
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__first_after_3")
diff --git a/...ls/first_before/dataset__first_before.sql b/.../first_before/dataset__first_before_1.sql
diff --git a/integration_tests/models/first_before/first_before.yml b/integration_tests/models/first_before/first_before.yml
@@ -0,0 +1,8 @@
+version: 2
+
+models:
+
+  - name: dataset__first_before_1
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__first_before_1")
diff --git a/...models/first_ever/dataset__first_ever.sql b/...dels/first_ever/dataset__first_ever_1.sql
diff --git a/integration_tests/models/first_ever/first_ever.yml b/integration_tests/models/first_ever/first_ever.yml
@@ -0,0 +1,8 @@
+version: 2
+
+models:
+
+  - name: dataset__first_ever_1
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__first_ever_1")
diff --git a/integration_tests/models/first_in_between/dataset__first_in_between_3.sql b/integration_tests/models/first_in_between/dataset__first_in_between_3.sql
@@ -3,17 +3,36 @@
         ref("input__first_in_between"),
         dbt_activity_schema.activity(
             dbt_activity_schema.all_ever(),
-            "signed up"
+            "signed up",
+            [
+                "activity_id",
+                "entity_uuid",
+                "ts",
+                "revenue_impact",
+                "feature_json"
+            ]
         ),
         [
             dbt_activity_schema.activity(
                 dbt_activity_schema.first_in_between(),
                 "visit page",
-                ["feature_json", "activity_occurrence", "ts"],
+                [
+                    "feature_json",
+                    "activity_occurrence",
+                    "ts"
+                ],
                 additional_join_condition="
                 json_extract({primary}.feature_json, 'type')
                 = json_extract({appended}.feature_json, 'type')
                 "
+            ),
+            dbt_activity_schema.activity(
+                dbt_activity_schema.first_in_between(),
+                "bought something",
+                [
+                    "activity_id",
+                    "ts"
+                ]
             )
         ]
     )

diff --git a/integration_tests/models/last_after/last_after.yml b/integration_tests/models/last_after/last_after.yml
@@ -0,0 +1,8 @@
+version: 2
+
+models:
+
+  - name: dataset__last_after_1
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__last_after_1")
diff --git a/...dels/last_before/dataset__last_before.sql b/...ls/last_before/dataset__last_before_1.sql
diff --git a/integration_tests/models/last_before/last_before.yml b/integration_tests/models/last_before/last_before.yml
@@ -0,0 +1,8 @@
+version: 2
+
+models:
+
+  - name: dataset__last_before_1
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__last_before_1")
diff --git a/...s/models/last_ever/dataset__last_ever.sql b/...models/last_ever/dataset__last_ever_1.sql
diff --git a/integration_tests/models/last_ever/last_ever.yml b/integration_tests/models/last_ever/last_ever.yml
@@ -0,0 +1,8 @@
+version: 2
+
+models:
+
+  - name: dataset__last_ever_1
+    tests:
+      - dbt_utils.equality:
+          compare_model: ref("output__last_ever_1")
diff --git a/integration_tests/models/models.yml b/integration_tests/models/models.yml
diff --git a/...st_before/output/output__first_before.csv b/..._before/output/output__first_before_1.csv
diff --git a/.../first_ever/output/output__first_ever.csv b/...irst_ever/output/output__first_ever_1.csv