---
layout: default
permalink: recipes
description: How matchID supercharges your powers with recipes
title: Recipes
width: is-10
hero:
---
A recipe consists of steps of treatments applied to rows and columns. The dataset is scanned in chunks, which are loaded into a Pandas dataframe. A recipe is thus basically a treatment applied to a chunk of a dataframe, resulting in a transformed dataframe.
A recipe can call:
- another recipe (and so on, recursively)
- generic recipes (included in the core configuration, but constructed as recipes)
- internals (map, join, eval, replace, keep, delete...): included in the core code, and which can also be extended
- the internal `eval` function, which gives access to creating/modifying columns based on a row function (e.g. col1+col2) within recipes
Summary:
This recipe creates new columns in the dataframe, simply based on other columns.
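For instance, a minimal sketch of such a step, assuming it is the `map` internal and using hypothetical column names:

```yaml
steps:
  - map:
      # assumption: each new column is simply copied from an existing one
      matchid_name_last: last_name
      matchid_location_city: city_label
```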
This recipe keeps only the columns matching a regex, with an optional `where` condition:
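A hedged sketch of a `keep` step (the `select` and `where` option names, and the assumption that `where` filters rows, are illustrative):

```yaml
steps:
  - keep:
      select: ^matchid_                       # keep only columns whose name matches this regex
      where: row['matchid_birth_date'] != ''  # optional row filter, assumed python-style syntax
```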
This recipe deletes the columns matching a regex:
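A minimal sketch, assuming the `delete` internal takes the same kind of regex selection:

```yaml
steps:
  - delete:
      select: ^tmp_   # drop all columns whose name matches this regex (hypothetical prefix)
```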
Renames columns of a dataframe:
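A hedged sketch, assuming a `rename` internal and an old-name-to-new-name mapping (both are assumptions):

```yaml
steps:
  - rename:
      # assumed mapping direction: current column name -> new column name
      name_last: matchid_name_last
      name_first: matchid_name_first
```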
This is the swiss-army-knife recipe, which evaluates a treatment row by row: a new column gets its cell value computed from a python expression. The values of the dataframe are accessible within the `row` array; a particular column value is available as `row['column']`.
Here's an example:
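A minimal sketch, with hypothetical columns `name_first`, `name_last` and `age` (the list-of-expressions layout is an assumption):

```yaml
steps:
  - eval:
      # each entry computes a (new or existing) column from a python expression on row
      - full_name: row['name_first'] + ' ' + row['name_last']
      - is_adult: 1 if int(row['age']) >= 18 else 0
```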
WARNING: the eval function uses `pd.DataFrame.apply` row by row, which is easy to write but sub-optimal for large datasets. You should consider optimizing your `eval` draft with an `exec` as soon as possible: using vectorized functions can lead to 100x speed-ups.
Every data engineer or data scientist wants to optimize the algorithm, knowing what it does.
This method executes pure python code, given that the input data is the `df` dataframe processed by the recipes above. You should use this method every time you process large volumes, because the `apply` method used by `eval` is much slower.
Here's an example:
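A minimal sketch of an `exec` step doing the same kind of computation in a vectorized way (the multi-line string layout and the availability of `pd` in scope are assumptions):

```yaml
steps:
  - exec: |
      # df is the current chunk, as a pandas dataframe
      df['full_name'] = df['name_first'] + ' ' + df['name_last']
      df['age'] = pd.to_numeric(df['age'], errors='coerce')
```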
This method applies regexes to a selection of fields (itself matching a regex), python-style:
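A hedged sketch, assuming this is the `replace` internal and that it takes a field-selection regex plus a list of pattern/replacement pairs:

```yaml
steps:
  - replace:
      select: _name_             # apply to fields whose name matches this regex (hypothetical)
      regex:
        - "\\s+": " "            # collapse repeated whitespace
        - "^(DR|PR)\\.?\\s+": "" # strip a title prefix
```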
This method transforms text to lowercase and removes accents and special characters on selected-by-regex fields:
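A minimal sketch (the internal name `normalize` and the `select` option are assumptions):

```yaml
steps:
  - normalize:
      select: _name_|_city_   # hypothetical regex selecting the text fields to clean
```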
This recipe is a helper for debugging a recipe. It ends the recipe prematurely, without executing the following steps. Complementary helpers are a selection of fields (like `keep`) and of top rows (`head`), to limit the size of the treatment.
Fully shuffles the data (each column independently) using `np.random.permutation`. Used for anonymization.
This recipe converts a selection-by-regex of columns from string to integer, filling unavailable values with NaN or a specified value.
This recipe converts a selection-by-regex of columns from string to float, filling unavailable values with NaN or a specified value.
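A hedged sketch of both conversions (the internal names `to_integer` / `to_float` and the `fillna` option are assumptions):

```yaml
steps:
  - to_integer:
      select: _code$   # hypothetical regex of the columns to cast
      fillna: 0        # assumed option for the fallback value instead of NaN
  - to_float:
      select: _score$
```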
Converts a list to a tuple, which can then be used for indexing in a dataframe (e.g. in a groupby).
Converts a tuple to a list.
Computes n-grams of selected columns
This recipe converts a selection-by-regex of columns from string to a date/time type:
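A minimal sketch (the internal name `to_date` and the `format` option are assumptions):

```yaml
steps:
  - to_date:
      select: _date$    # hypothetical regex of the columns to parse
      format: "%Y%m%d"  # assumed option giving the input date format
```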
This recipe acts like a SQL join, executed by chunks (so slower), but tolerating fuzziness (so better).
This fuzzy join is either in-memory (to match against small referential datasets, i.e. < 500k rows) or based on elasticsearch (for 500k up to more than 100M rows).
fuzzy, in-memory
In the following example we try to match the city label (fuzzily), the department code (strictly) and the country iso code (strictly) to recover the citycode history of a city:
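A hedged sketch of such a join; the option names (`type`, `fuzzy`, `strict`, `select`) and the referential dataset are assumptions meant to illustrate the idea:

```yaml
steps:
  - join:
      type: in_memory              # assumed flag for the in-memory flavour
      dataset: city_history        # hypothetical referential dataset (< 500k rows)
      fuzzy:
        city_label: city_label     # fuzzy match on the city label
      strict:
        dep_code: dep_code         # strict match on the department code
        country_iso: country_iso   # strict match on the country iso code
      select:
        citycode_history: citycode # column recovered from the referential
```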
This example is more frequent and easier, and is useful when you have multiple referential datasets (slower than a SQL join, but it can help to limit the number of in-between datasets):
This last example deals with the problem of a big fuzzy match (up to millions of rows against millions).
Of course you'll need a big cluster if you want to deal with many millions of matches in less than a week!
The fuzzy match just relies on pure elasticsearch queries transformed from json to yaml:
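A hedged sketch of an elasticsearch-based join; the option names and the way row values are injected into the query are assumptions, while the `query` body is a plain elasticsearch query written in yaml:

```yaml
steps:
  - join:
      type: elasticsearch       # assumed flag for the elasticsearch flavour
      dataset: deaths_index     # hypothetical dataset indexed in elasticsearch
      unfold: True
      keep_unmatched: False
      query:
        size: 10
        query:
          bool:
            must:
              - match:
                  matchid_name_last:
                    query: row['matchid_name_last']   # assumed templating of row values
                    fuzziness: AUTO
              - term:
                  matchid_birth_date: row['matchid_birth_date']
```

The options below control how the returned hits are handled.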
- `unfold: False` (default `True`): each row returns a bucket of potential matches; `unfold` splits these buckets into rows, like in a SQL-join operation. If `unfold` is `False`, buckets are returned raw to enable custom operations.
- `keep_unmatched: True` (default `False`): if `unfold` is `True`, keeps rows without a match (if `unfold` is `False`, no analysis of the bucket is done so all rows are kept).
- `unnest: False` (default `True`): by default, the elasticsearch values are split into columns. If `False`, the raw elasticsearch hits are returned one per row, in a column `hit` which contains the json.
- `prefix: myprefix_` (default `hit_`): customizes the prefix of the keys coming from the elasticsearch hits.
This recipe splits a selection-by-regex of columns containing arrays into multiple rows, copying the content of the other columns:
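A minimal sketch (the internal name `unfold` and its `select` option are assumptions):

```yaml
steps:
  - unfold:
      select: ^hit_   # hypothetical regex selecting the array columns to split into rows
```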
This recipe splits a selection-by-regex of columns containing jsons into multiple columns, one per key of the JSON, and deletes the previous columns:
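A minimal sketch (the internal name `unnest` and its options are assumptions):

```yaml
steps:
  - unnest:
      select: ^hit$   # hypothetical regex selecting the json column to expand
      prefix: hit_    # assumed option used to prefix the created columns
```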
Gathers the selected columns and values into a json in the target column:
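A minimal sketch of this reverse operation (the internal name `nest` and its options are assumptions):

```yaml
steps:
  - nest:
      select: ^matchid_     # hypothetical regex of the columns to gather
      target: matchid_json  # assumed option naming the target json column
```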
This computes a groupby and redispatches the value across the group:
This method applies only to a full dataset and should have the flag `chunked: False` in the input dataset, e.g.:
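A hedged sketch of the dataset flag and of a group-level step (the `groupby` internal and its options are assumptions; `clients` and its connector are hypothetical):

```yaml
datasets:
  clients:
    connector: my_postgres   # hypothetical connector
    table: clients
    chunked: False           # required so the groupby sees the full dataset, not a chunk

recipes:
  clients_grouped:
    input: clients
    steps:
      - groupby:
          keys: [matchid_name_last, matchid_birth_date]  # assumed option listing the group keys
          # assumed option redispatching an aggregate back to every row of the group
          agg:
            nb_in_group: count
```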
To apply a model, you need to build one first or use a pre-trained one.
Models can only be applied to the same data as in training, i.e. the same columns in the same order.
Maps a string `POINT(lon, lat)` to a numerical tuple `(lat, lon)`.
Calculates the vincenty distance (from `geopy.distance`) between two wgs84 `(lat, lon)` tuples.
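As a hedged illustration, these helpers would typically be called from `eval` steps; the helper names `geopoint` and `distance` and the column names are assumptions:

```yaml
steps:
  - eval:
      - geo: geopoint(row['location_point'])          # assumed helper mapping POINT(lon, lat) to (lat, lon)
  - eval:
      - dist_km: distance(row['geo'], (48.85, 2.35))  # assumed helper wrapping geopy's vincenty distance
```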
Replaces all values, by key, from a dictionary `dic` like `{"key1": "value1", "key2": "value2"}` into `object`, which can be a string, an array or a dictionary (replaces values which strictly match).
Replaces all values, by regex, from a dictionary `dic` like `{"regex1": "value1", "regex2": "value2"}` into `object`, which can be a string, an array or a dictionary.
`object` may be either a string (or unicode) or an array of strings. For each string value, lower-cases the value and removes accents and special characters.
`object` may be either a string (or unicode) or an array of strings. For each string value, splits it with the `\s+` separator.
nltk tokenizers could be integrated here later.
Flattens the list, i.e. `[[a,b],[c]]` returns `[a,b,c]`.
Computes the sha1 hash key of `str(object)`.
Computes the levenshtein distance between string a and string b. The current version is just a wrapper.
Computes the normalized levenshtein distance between string a and string b, or the minimum of levenshtein_norm between the strings of list a and the strings of list b.
Computes the minimum jaro-winkler distance between the strings of list a and the strings of list b.
Computes n-grams of string a (with n given as an int list, here [2,3]) or n-grams of the strings in list a.
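As with the geographical helpers, a hedged sketch of how these functions would be used inside an `eval` step (the helper names and the column names are assumptions):

```yaml
steps:
  - eval:
      - name_dist: levenshtein_norm(row['name_a'], row['name_b'])  # normalized levenshtein distance
      - name_jw: jw(row['tokens_a'], row['tokens_b'])              # assumed name of the jaro-winkler helper
      - name_grams: ngrams(row['name_a'], [2, 3])                  # 2-grams and 3-grams
```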
SQL often performs better than Python as long as you don't need long code implementations. We recommend using it when performing join operations or simple transformations.
matchID now includes a Postgres database, but other SQLAlchemy-compatible databases could be used. If it doesn't work, please report an issue on Github.
Note that before performing a SQL join operation you should have declared the two datasets in the same database. Be careful about the table names, which may differ between the dataset and the table.
Let's say that `clients` and `deaths` are our two datasets using the same connector (with the same name for the table). Suppose we also previously declared `clients_x_deaths` as a dataset using the same connector.
Here's a simple example of a SQL recipe:
```yaml
  clients_x_deaths:
    input:
      dataset: clients
      select: >
        select *
        from clients, deaths
        where
        (
          levenshtein(clients.matchid_name_last, deaths.matchid_name_last) < 3
          and levenshtein(clients.matchid_name_first, deaths.matchid_name_first) < 3
          and clients.matchid_location_depcode = deaths.matchid_location_depcode
          and clients.matchid_birth_date = deaths.matchid_birth_date
        )
    output: clients_x_deaths
    steps:
```
Note that:
- please mind the indentation, respecting the [YAML standard](https://yaml.org/spec/1.2/spec.html)
- `test_chunk_size` (default 30) applies as a 'LIMIT' to the input dataset (clients) and to the output result while you test the recipe. This is useful to force a response in a decent time, but on the other hand you may get an empty result because of it. This is a classical SQL development problem, and we'll be happy if you suggest a better way to handle it
- if you only want to execute SQL efficiently and store the result in the same database, you have to leave the 'steps' empty
- you can still use recipes after the request within the steps; the result of the request will then be processed by the Python processor
- the output dataset can be in another connector, so the SQL will be executed in the database of the input connector, then simply cast into the other connector
- types are preserved if you cast into another dataset, with the classical problems of going from SQL types to Pandas and then to the output connector (for example Elasticsearch). There is no universal translator for that, so it will be your pain
- if you want a fast copy you can specify `mode: expert` in the input datasource, only for Postgres. This is usually a bit faster as it uses the `copy_expert` method. On the other hand, all types will be cast into strings.