Skip to content

Commit

Permalink
Refactor Extract section
Browse files Browse the repository at this point in the history
  • Loading branch information
volcan01010 committed May 11, 2024
1 parent 83d96c2 commit 743889f
Show file tree
Hide file tree
Showing 2 changed files with 84 additions and 53 deletions.
12 changes: 12 additions & 0 deletions docs/etl_functions/demo_named_tuple.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
"""Extract script using namedtuple_row_factory"""
import sqlite3
import etlhelper as etl
from etlhelper.row_factories import namedtuple_row_factory


with sqlite3.connect("igneous_rocks.db") as conn:
row = etl.fetchone('SELECT * FROM igneous_rock', conn,
row_factory=namedtuple_row_factory)

print(row)
print(row.name)
125 changes: 72 additions & 53 deletions docs/etl_functions/extract.rst
Original file line number Diff line number Diff line change
Expand Up @@ -65,93 +65,112 @@ returns
{'id': 2, 'name': 'granite', 'grain_size': 'coarse'}
Parameters
----------
Keyword arguments
-----------------

All extract functions are derived from :func:`iter_chunks() <etlhelper.iter_chunks>`
and take the same keyword arguments, which are passed through.

parameters
""""""""""

Variables can be inserted into queries by passing them as parameters.
These “bind variables” are sanitised by the underlying drivers to
prevent `SQL injection attacks <https://xkcd.com/327/>`__. The required
prevent `SQL injection attacks <https://xkcd.com/327/>`__.

It is necessary to use the correct
`paramstyle <https://www.python.org/dev/peps/pep-0249/#paramstyle>`__
can be checked with ``MY_DB.paramstyle``. A tuple is used for positional
placeholders, or a dictionary for named placeholders.
for the database type as a placeholder (e.g. ``?``, ``:1``).

.. code:: python
The paramstyle for a DbParams object can be checked with the
:func:`paramstyle <DbParams.paramstyle>` attribute.

select_sql = "SELECT * FROM src WHERE id = :id"
A dictionary is used for named placeholders,

with ORACLEDB.connect("ORA_PASSWORD") as conn:
fetchall(sql, conn, parameters={'id': 1})
.. code:: python
Row factories
-------------
select_sql = "SELECT * FROM src WHERE id = :id" # SQLite style
Row factories control the output format of returned rows.
with sqlite3.connect("rocks.db") as conn:
etl.fetchall(sql, conn, parameters={'id': 1})
For example return each row as a dictionary, use the following:
or a tuple for positional placeholders.

.. code:: python
from etlhelper import fetchall
from etlhelper.row_factories import dict_row_factory
select_sql = "SELECT * FROM src WHERE id = ?" # SQLite style
with sqlite3.connect("rocks.db") as conn:
etl.fetchall(sql, conn, parameters=(1,))
sql = "SELECT * FROM my_table"
Named parameters result in more readable code.

with ORACLEDB.connect('ORACLE_PASSWORD') as conn:
for row in fetchall(sql, conn, row_factory=dict_row_factory):
print(row['id'])

The ``dict_row_factory`` is useful when data are to be serialised to
JSON/YAML, as those formats use dictionaries as input.
row_factory
"""""""""""

Row factories control the output format of returned rows.
The default row factory for ETL Helper is a dictionary, but this can be
changed with the ``row_factory`` argument.

.. literalinclude:: demo_named_tuple.py
:language: python

The output is:

.. code:: bash
Row(id=1, name='basalt', grain_size='fine')
basalt
Four different row_factories are included, based in built-in Python
types:

+------------------+------------------+---------+------------------+
| Row Factory | Attribute access | Mutable | Parameter |
| | | | placeholder |
+==================+==================+=========+==================+
| dict_row_factory | ``row["id"]`` | Yes | Named |
| (default) | | | |
+------------------+------------------+---------+------------------+
| t | ``row[0]`` | No | Positional |
| uple_row_factory | | | |
+------------------+------------------+---------+------------------+
| list_row_factory | ``row[0]`` | Yes | Positional |
+------------------+------------------+---------+------------------+
| namedt | ``row.id`` or | No | Positional |
| uple_row_factory | ``row[0]`` | | |
+------------------+------------------+---------+------------------+
+-----------------------+------------------+---------+------------------+
| Row Factory | Attribute access | Mutable | Parameter |
| | | | placeholder |
+=======================+==================+=========+==================+
| dict_row_factory | ``row["id"]`` | Yes | Named |
| (default) | | | |
+-----------------------+------------------+---------+------------------+
| tuple_row_factory | ``row[0]`` | No | Positional |
+-----------------------+------------------+---------+------------------+
| list_row_factory | ``row[0]`` | Yes | Positional |
+-----------------------+------------------+---------+------------------+
| namedtuple_row_factory| ``row.id`` or | No | Positional (or |
| | ``row[0]`` | | Named with |
| | | | *load*) |
+-----------------------+------------------+---------+------------------+

The choice of row factory depends on the use case. In general named
tuples and dictionaries are best for readable code, while using tuples
or lists can give a slight increase in performance. Mutable rows are
convenient when used with transform functions because they can be
modified without need to create a whole new output row.

When using ``copy_rows``, it is necessary to use appropriate parameter
placeholder style for the chosen row factory in the INSERT query. Using
the ``dict_row_factory`` requires a switch from named to positional
parameter placeholders (e.g. ``%(id)s`` instead of ``%s`` for
PostgreSQL, ``:id`` instead of ``:1`` for Oracle). The ``pyodbc`` driver
for MSSQL only supports positional placeholders.
When loading or copying data, it is necessary to use appropriate parameter
placeholder style for the chosen row factory in the INSERT query.
Using the ``tuple_row_factory`` requires a switch from named to positional
parameter placeholders (e.g. ``%s`` instead of ``%(id)s`` for PostgreSQL,
``:1`` instead of ``:id`` for Oracle).
The ``pyodbc`` driver for MSSQL only supports positional placeholders.

When using the ``load`` function in conjuction with ``iter_chunks`` data
must be either named tuples or dictionaries.

Transform
---------
"""""""""

The ``transform`` parameter allows passing of a function to transform
the data before returning it. The function must take a list of rows and
return a list of modified rows. Rows of mutable types (dict, list) can
be modified in-place, while rows of immutable types (tuples,
namedtuples) must be created as new objects from the input rows. See
``transform`` for more details.
The ``transform`` parameter takes a callable (e.g. function) that
transforms the data before returning it.
See the :ref:`Transform <transform>` section for details.

Chunk size
----------
""""""""""

All data extraction functions use ``iter_chunks`` behind the scenes.
This reads rows from the database in chunks to prevent them all being
loaded into memory at once. The default ``chunk_size`` is 5000 and this
can be set via keyword argument.
This reads rows from the database in *chunks* to prevent them all being
loaded into memory at once.
The ``chunk_size`` argument sets the number of rows in each chunk.
The default ``chunk_size`` is 5000.

0 comments on commit 743889f

Please sign in to comment.