
Commit

Update transform section
volcan01010 committed May 12, 2024
1 parent 743889f commit 3890bdb
Showing 6 changed files with 46 additions and 31 deletions.
16 changes: 8 additions & 8 deletions docs/demo_copy.py
@@ -15,14 +15,14 @@
 )"""


-select_sql = "SELECT name FROM igneous_rock"
-
-
 def transform(chunk):
     for row in chunk:
-        row['category'] = 'igneous'
-        row['last_update'] = dt.datetime.now()
-        yield row
+        new_row = {
+            "name": row["name"],
+            "category": "igneous",
+            "last_update": dt.datetime.now()
+        }
+        yield new_row


etl.log_to_console()
@@ -33,8 +33,8 @@ def transform(chunk):
etl.execute(create_sql, dest)

# Copy data
-rows = etl.iter_rows(select_sql, src, transform=transform)
-etl.load('rock', dest, rows)
+rows = etl.copy_table_rows('igneous_rock', src, dest,
+                           target='rock', transform=transform)

# Confirm transfer
for row in etl.fetchall('SELECT * FROM rock', dest):
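For reference, a runnable sketch of the updated copy flow, assembled from the visible hunks. The collapsed parts of the file (imports, the column list in ``create_sql``, and the connection setup including database file names) are assumptions, not the actual demo contents.

.. code:: python

    import datetime as dt
    import sqlite3

    import etlhelper as etl

    # Destination table; the column definitions are assumed
    create_sql = """
        CREATE TABLE IF NOT EXISTS rock (
            id INTEGER PRIMARY KEY,
            name TEXT UNIQUE,
            category TEXT,
            last_update DATETIME
        )"""


    def transform(chunk):
        # Build a new dictionary for each incoming row
        for row in chunk:
            new_row = {
                "name": row["name"],
                "category": "igneous",
                "last_update": dt.datetime.now(),
            }
            yield new_row


    etl.log_to_console()

    src = sqlite3.connect("igneous_rocks.db")   # assumed file name
    dest = sqlite3.connect("rocks_output.db")   # assumed file name

    # Create destination table
    etl.execute(create_sql, dest)

    # Copy data
    etl.copy_table_rows('igneous_rock', src, dest,
                        target='rock', transform=transform)

    # Confirm transfer
    for row in etl.fetchall('SELECT * FROM rock', dest):
        print(row)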
1 change: 0 additions & 1 deletion docs/etl_functions/demo_named_tuple.py
@@ -3,7 +3,6 @@
import etlhelper as etl
from etlhelper.row_factories import namedtuple_row_factory


with sqlite3.connect("igneous_rocks.db") as conn:
    row = etl.fetchone('SELECT * FROM igneous_rock', conn,
                       row_factory=namedtuple_row_factory)
2 changes: 2 additions & 0 deletions docs/etl_functions/error_handling.rst
@@ -5,6 +5,8 @@ This section describes exception classes and on_error functions.

logged errors

+also handling errors in SQL e.g. ON CONFLICT
+
Handling insert errors
----------------------

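While this section is still a stub, a minimal sketch of the two ideas noted above may help: an ``on_error`` callback and conflict handling in the SQL itself. The table, the row data and the exact ``on_error`` callback signature (a list of failed rows paired with their exceptions) are assumptions for illustration, not confirmed by this commit.

.. code:: python

    import sqlite3

    import etlhelper as etl

    # Let the database skip duplicates rather than raise an error
    insert_sql = """
        INSERT INTO igneous_rock (name)
        VALUES (:name)
        ON CONFLICT (name) DO NOTHING"""

    rows = [{"name": "basalt"}, {"name": "granite"}]
    errors = []


    def collect_errors(failed_rows):
        # Assumed signature: failed_rows is a list of (row, exception) pairs
        errors.extend(failed_rows)


    with sqlite3.connect("igneous_rocks.db") as conn:
        etl.executemany(insert_sql, conn, rows, on_error=collect_errors)

    for row, exception in errors:
        print(row, exception)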
9 changes: 7 additions & 2 deletions docs/etl_functions/extract.rst
@@ -159,18 +159,23 @@ The ``pyodbc`` driver for MSSQL only supports positional placeholders.
When using the ``load`` function in conjunction with ``iter_chunks``, data
must be either named tuples or dictionaries.
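As a hedged illustration of that pattern (the connection setup is assumed, the table names follow the demo script above, and ``iter_chunks`` is assumed to accept the same ``row_factory`` keyword as the other fetch functions):

.. code:: python

    import sqlite3

    import etlhelper as etl
    from etlhelper.row_factories import dict_row_factory

    src = sqlite3.connect("igneous_rocks.db")
    dest = sqlite3.connect("rocks_output.db")

    # Each chunk is a list of dictionaries, which load() accepts
    chunks = etl.iter_chunks("SELECT * FROM igneous_rock", src,
                             row_factory=dict_row_factory)
    for chunk in chunks:
        etl.load("rock", dest, chunk)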

-Transform
+transform
 """""""""

The ``transform`` parameter takes a callable (e.g. function) that
transforms the data before returning it.
See the :ref:`Transform <transform>` section for details.

-Chunk size
+chunk_size
 """"""""""

All data extraction functions use ``iter_chunks`` behind the scenes.
This reads rows from the database in *chunks* to prevent them all being
loaded into memory at once.
The ``chunk_size`` argument sets the number of rows in each chunk.
The default ``chunk_size`` is 5000.
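For example, a small sketch of tuning ``chunk_size``, assuming it is passed as a keyword argument as the surrounding text implies (connection and table names are illustrative):

.. code:: python

    import sqlite3

    import etlhelper as etl

    conn = sqlite3.connect("igneous_rocks.db")

    # Fetch 1000 rows per round trip instead of the default 5000
    for chunk in etl.iter_chunks("SELECT * FROM igneous_rock", conn,
                                 chunk_size=1000):
        print(len(chunk))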

Return values
-------------

TODO!
43 changes: 25 additions & 18 deletions docs/etl_functions/transform.rst
@@ -3,21 +3,18 @@
Transform
^^^^^^^^^

-Data can be transformed in-flight by applying a transform function. This
-is any Python callable (e.g. function or class) that takes an iterator
+ETL Helper functions accept a function as the ``transform`` keyword argument,
+which enables transformation of data in flight.
+
+This is any Python callable (e.g. function or class) that takes an iterator
 and returns another iterator (e.g. list or generator via the ``yield``
-statement). Transform functions are applied to data as they are read
+statement).
+Transform functions are applied to data as they are read
 from the database (in the case of data fetching functions and
 ``copy_rows``), or before they are passed as query parameters (to
-``executemany`` or ``load``). When used with ``copy_rows`` or
-``executemany`` the INSERT query must contain the correct placeholders
-for the transform result.
-
-The ``iter_chunks`` and ``iter_rows`` functions that are used internally
-return generators. Each chunk or row of data is only accessed when it is
-required. This allows data transformation to be performed via
-`memory-efficient
-iterator-chains <https://dbader.org/blog/python-iterator-chains>`__.
+``executemany`` or ``load``).
+When used with ``copy_rows`` or ``executemany`` the INSERT query must contain
+the correct parameter placeholders for the transformed result.

The simplest transform functions modify data returned by mutable row
factories, e.g. ``dict_row_factory``, in-place. The ``yield`` keyword
@@ -27,6 +24,7 @@ that can loop over the rows.
 .. code:: python

     from typing import Iterator

+    import etlhelper as etl
     from etlhelper.row_factories import dict_row_factory
@@ -40,11 +38,12 @@ that can loop over the rows.
             yield row

-    fetchall(select_sql, src_conn, row_factory=dict_row_factory,
-             transform=my_transform)
+    etl.fetchall(select_sql, src_conn, row_factory=dict_row_factory,
+                 transform=my_transform)
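Most of this example is collapsed in the diff above; a complete version, with the hidden lines (the SQL, the connection and the body of ``my_transform``) filled in as assumptions consistent with the surrounding docs, might look like this:

.. code:: python

    import sqlite3
    from typing import Iterator

    import etlhelper as etl
    from etlhelper.row_factories import dict_row_factory

    select_sql = "SELECT * FROM igneous_rock"       # assumed
    src_conn = sqlite3.connect("igneous_rocks.db")  # assumed


    def my_transform(chunk: Iterator[dict]) -> Iterator[dict]:
        # Modify each dictionary in place, then yield it
        for row in chunk:
            row["category"] = "igneous"
            yield row


    etl.fetchall(select_sql, src_conn, row_factory=dict_row_factory,
                 transform=my_transform)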
 It is also possible to assemble the complete transformed chunk and
-return it. This code demonstrates that the returned chunk can have a
+return it.
+This code demonstrates that the returned chunk can have a
 different number of rows, and be of different length, to the input.
 Because ``namedtuple``\ s are immutable, we have to create a ``new_row``
 from each input ``row``.
@@ -53,6 +52,7 @@ from each input ``row``.
     import random
     from typing import Iterator

+    import etlhelper as etl
     from etlhelper.row_factories import namedtuple_row_factory
@@ -68,10 +68,17 @@ from each input ``row``.
         return new_chunk

-    fetchall(select_sql, src_conn, row_factory=namedtuple_row_factory,
-             transform=my_transform)
+    etl.fetchall(select_sql, src_conn, row_factory=namedtuple_row_factory,
+                 transform=my_transform)
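The same treatment for the chunk-level example: the hidden body (filtering rows and building each ``new_row`` from the immutable input) is an assumption based on the visible fragments and the description above.

.. code:: python

    import random
    import sqlite3
    from typing import Iterator

    import etlhelper as etl
    from etlhelper.row_factories import namedtuple_row_factory

    select_sql = "SELECT name FROM igneous_rock"    # assumed
    src_conn = sqlite3.connect("igneous_rocks.db")  # assumed


    def my_transform(chunk: Iterator[tuple]) -> list[tuple]:
        # Return a whole new chunk; it can contain a different number
        # of rows, and rows of a different length, than the input
        new_chunk = []
        for row in chunk:
            if random.random() > 0.5:            # drop roughly half the rows
                new_row = (row.name, "igneous")  # longer than the input row
                new_chunk.append(new_row)
        return new_chunk


    etl.fetchall(select_sql, src_conn, row_factory=namedtuple_row_factory,
                 transform=my_transform)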
 Any Python code can be used within the function and extra data can
 result from a calculation, a call to a webservice or a query against
-another database. As a standalone function with known inputs and
+another database.
+As a standalone function with known inputs and
 outputs, the transform functions are also easy to test.

+The ``iter_chunks`` and ``iter_rows`` functions return generators.
+Each chunk or row of data is only accessed when it is
+required. This allows data transformation to be performed via
+`memory-efficient
+iterator-chains <https://dbader.org/blog/python-iterator-chains>`__.
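To illustrate the testability point above: a transform function only needs an iterable of rows, so it can be exercised with plain Python data and no database at all. A sketch against the hypothetical ``my_transform`` shown earlier; the expected values are illustrative.

.. code:: python

    def test_my_transform():
        fake_chunk = [{"name": "basalt"}, {"name": "granite"}]

        result = list(my_transform(iter(fake_chunk)))

        assert len(result) == 2
        assert all(row["category"] == "igneous" for row in result)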
6 changes: 4 additions & 2 deletions docs/index.rst
@@ -18,7 +18,7 @@ Welcome to ETL Helper's documentation!
:target: https://pypi.org/project/etlhelper
.. image:: https://img.shields.io/pypi/dm/etlhelper?label=Downloads%20pypi

-ETL Helper is a Python ETL library to simplify data transfer into and out of databases.
+ETL Helper is a Python ETL (Extract, Transform, Load) library to simplify data transfer into and out of databases.


.. note:: This documentation is a work in progress in preparation for the upcoming 1.0 release.
@@ -94,7 +94,7 @@ The output is:
Copying data
------------

-This script copies the data to another database, with transformation and logging.
+This script copies data to another database, with transformation and logging.

.. literalinclude:: demo_copy.py
:language: python
@@ -112,3 +112,5 @@ The output is:
{'id': 1, 'name': 'basalt', 'category': 'igneous', 'last_update': '2024-05-08 14:59:54.878726'}
{'id': 2, 'name': 'granite', 'category': 'igneous', 'last_update': '2024-05-08 14:59:54.879034'}
The :doc:`recipes` section has more example code.
