CTK: Invoke MongoDB Table Loader with Zyp Transformation
On a specific collection loaded from a MongoDB Extended JSON file,
mask (exclude/ignore/omit) certain elements, in order to import all
records without further errors.

Both elements will be dropped:
- .image.available_sizes
- .screenshots[].available_sizes

The procedure can be improved in a later iteration.
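
To illustrate what dropping these two elements means for a single record, here is a minimal Python sketch. It only approximates the effect of the jq rule used in this commit; the function name `drop_available_sizes` and the sample record are hypothetical and not part of the change.

```python
from copy import deepcopy


def drop_available_sizes(record: dict) -> dict:
    """Return a copy of `record` without the two `available_sizes` elements."""
    result = deepcopy(record)

    # Corresponds to `.image.available_sizes`.
    image = result.get("image")
    if isinstance(image, dict):
        image.pop("available_sizes", None)

    # Corresponds to `.screenshots[].available_sizes`.
    for screenshot in result.get("screenshots") or []:
        if isinstance(screenshot, dict):
            screenshot.pop("available_sizes", None)

    return result


# Hypothetical sample record, shaped like the elements named above.
record = {
    "name": "Example",
    "image": {"available_sizes": [[[150, 99], "a.jpg"]], "attribution": None},
    "screenshots": [
        {"available_sizes": [[[150, 99], "b.jpg"]], "attribution": None},
    ],
}
cleaned = drop_available_sizes(record)
print(cleaned)
```

All other fields of the record are kept unchanged; only the nested arrays are removed.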
amotl committed Sep 19, 2024
1 parent a05e8d2 commit d309a48
Showing 3 changed files with 52 additions and 3 deletions.
2 changes: 1 addition & 1 deletion application/cratedb-toolkit/requirements.txt
@@ -1 +1 @@
-cratedb-toolkit[influxdb,mongodb]==0.0.23
+cratedb-toolkit[influxdb,mongodb]==0.0.24
5 changes: 3 additions & 2 deletions application/cratedb-toolkit/test_io.py
@@ -109,7 +109,7 @@ def test_ctk_load_table_mongodb_json(drop_testing_tables):
     table_cardinalities = {
         "books": 431,
         "city_inspections": 81047,
-        "companies": 2537,
+        "companies": 18801,
         "countries-big": 21640,
         "countries-small": 248,
         "covers": 5071,
@@ -138,7 +138,8 @@ def test_ctk_load_table_mongodb_json(drop_testing_tables):
     command = f"""
     ctk load table \
       "file+bson://{datasets_path}/*.json?batch-size=2500" \
-      --cratedb-sqlalchemy-url="crate://localhost:4200/from-mongodb"
+      --cratedb-sqlalchemy-url="crate://localhost:4200/from-mongodb" \
+      --transformation=application/cratedb-toolkit/zyp-mongodb-json-files.yaml
     """
     print(f"Invoking CTK: {command}", file=sys.stderr)
     subprocess.check_call(shlex.split(command))
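
The multi-line command above relies on two Python behaviours: a backslash directly before a newline inside a non-raw string literal is a line continuation, and `shlex.split` strips the shell-style quotes while splitting. The sketch below shows the resulting argument vector; the value of `datasets_path` is a hypothetical placeholder, and nothing is executed.

```python
import shlex

# Hypothetical path, for illustration only.
datasets_path = "/tmp/datasets"

command = f"""
ctk load table \
  "file+bson://{datasets_path}/*.json?batch-size=2500" \
  --cratedb-sqlalchemy-url="crate://localhost:4200/from-mongodb" \
  --transformation=application/cratedb-toolkit/zyp-mongodb-json-files.yaml
"""

# Split into an argument list the same way the test does before
# handing it to `subprocess.check_call`.
argv = shlex.split(command)
for arg in argv:
    print(arg)
```

Because the list is passed to `subprocess.check_call` without a shell, the `*.json` pattern reaches `ctk` unexpanded, so any glob handling happens in the tool, not in the shell.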
48 changes: 48 additions & 0 deletions application/cratedb-toolkit/zyp-mongodb-json-files.yaml
@@ -0,0 +1,48 @@
# A Zyp Transformation [1] file to support importing datasets
# from mongodb-json-files [2] into CrateDB [3].
#
# [1] https://commons-codec.readthedocs.io/zyp/
# [2] https://github.com/ozlerhakan/mongodb-json-files
# [3] https://cratedb.com/docs/guide/feature/

# Because CrateDB cannot store nested arrays in OBJECT(DYNAMIC) columns,
# this file defines a corresponding transformation to work around the problem.
#
# The workaround applied here is to just exclude/omit the relevant
# `available_sizes` elements completely. Converting them properly can be
# implemented in a later iteration.
#
#   "image": {
#     "available_sizes": [
#       [
#         [
#           150,
#           99
#         ],
#         "assets/images/resized/0001/3896/13896v3-max-150x150.jpg"
#       ],
#     ]
#
# A possible representation could be:
#
#   "image": {
#     "available_sizes": [
#       {
#         "path": "assets/images/resized/0001/3896/13896v3-max-150x150.jpg",
#         "size": {"width": 150, "height": 99},
#       }
#     ]
#   }
---

meta:
  type: zyp-project
  version: 1
collections:
- address:
    container: datasets
    name: companies
  pre:
    rules:
    - expression: .[] |= del(.image.available_sizes, .screenshots[].available_sizes)
      type: jq
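
The file's header comment suggests that `available_sizes` could later be converted instead of dropped. A minimal Python sketch of such a conversion follows, assuming every entry has the shape `[[width, height], path]` shown in that comment. The function name `convert_available_sizes` is hypothetical, and this conversion is not implemented by the commit.

```python
def convert_available_sizes(available_sizes):
    """Turn `[[width, height], path]` entries into flat objects.

    Assumes each entry is a two-item list: a `[width, height]` pair
    followed by a path string, as in the comment of the YAML file.
    """
    converted = []
    for entry in available_sizes or []:
        (width, height), path = entry
        converted.append(
            {"path": path, "size": {"width": width, "height": height}}
        )
    return converted


sample = [
    [[150, 99], "assets/images/resized/0001/3896/13896v3-max-150x150.jpg"],
]
print(convert_available_sizes(sample))
```

The result is an array of objects rather than an array of arrays, which is the shape proposed in the comment as an alternative to omitting the elements.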
