As of v0.0.4, the Dan::Polars synopsis has been extended in multiple ways. This page is a vestigial version of the Dan::Polars documentation. For now it includes only features that are not covered in the Dan synopsis or the Dan::Polars synopsis.
Over time, the synopsis items will be added here in more detail.
This documentation should be read in conjunction with the Polars Book. The content is largely example-based and can be read alongside the Python and Rust examples given there.
The TOC is a subset of the Polars Book TOC.
- Concepts
- Expressions
- Casting
- Aggregation
- Conditionals
- Filter (aka grep)
- Sort
- Missing Data
- Apply (user-defined functions)
- Transformations
my \df1 = DataFrame.new(['Ray Type' => ["α", "β", "X", "γ"]]);
df1.show;
shape: (4, 1)
┌──────────┐
│ Ray Type │
│ ---      │
│ str      │
╞══════════╡
│ α        │
├╌╌╌╌╌╌╌╌╌╌┤
│ β        │
├╌╌╌╌╌╌╌╌╌╌┤
│ X        │
├╌╌╌╌╌╌╌╌╌╌┤
│ γ        │
└──────────┘
my \df2 = df1.drop(['Ray Type']);
df2.show;
shape: (0, 0)
┌┐
╞╡
└┘
say df2.is_empty; #True
my \df = DataFrame.new([
    integers => [1, 2, 3, 4, 5],
    big_integers => [1, 10000002, 3, 10000004, 4294967297],
    floats => [4.0, 5.0, 6.0, 7.0, 8.0],
    floats_with_decimal => [4.532, 5.5, 6.5, 7.5, 8.5],
]);
df.show;
df.select([
    col("integers").cast("f32").alias("integers_as_floats"),
    col("floats").cast("i32").alias("floats_as_integers"),
    col("floats_with_decimal").cast("i32").alias("floats_with_decimal_as_integers"),
]).show;
my \dfs = DataFrame.new([
    integers => [1, 2, 3, 4, 5],
    floats => [4.0, 5.03, 6.0, 7.0, 8.0],
    strings => <4.0 5.0 6.0 7.0 8.0>>>.Str.Array,
]);
dfs.show;
dfs.select([
    col("integers").cast("str"),
    col("floats").cast("str"),
    col("strings").cast("f32"),
]).show;
my \dfs = DataFrame.new([
    integers => [-1, 0, 2, 3, 4],
    floats => [0.0, 1.0, 2.0, 3.0, 4.0],
    bools => [True, False, True, False, True],
]);
dfs.show;
dfs.select([
    col("integers").cast("bool"),
    col("floats").cast("bool"),
    col("bools").cast("i32"),
]).show;
my \df = DataFrame.new([
    nrs => [1, 2, 3, 4, 5],
    nrs2 => [2, 3, 4, 5, 6],
    names => ["foo", "ham", "spam", "egg", ""],
    random => [1.rand xx 5],
    groups => ["A", "A", "B", "C", "B"],
]);
df.show;
#viz. https://pola-rs.github.io/polars-book/user-guide/expressions/operators/#logical
#(gt >, lt <, ge >=, le <=, eq ==, ne !=, and &&, or ||)
df.select([(col("nrs") > 2).alias("jones")]).head;
#df.select([(col("nrs") >= 2).alias("jones")]).head;
#df.select([(col("nrs") < 2).alias("jones")]).head;
#df.select([(col("nrs") <= 2).alias("jones")]).head;
#df.select([(col("nrs") == 2).alias("jones")]).head;
#df.select([(col("nrs") != 2).alias("jones")]).head;
#df.select([((col("nrs") >= 2) && (col("nrs2") == 5)) .alias("jones")]).head;
#df.select([((col("nrs") >= 2) || (col("nrs2") == 5)) .alias("jones")]).head;
The filter method applies to the entire DataFrame.
df.filter([(col("nrs") != 4)]).show;
shape: (4, 5)
┌─────┬──────┬───────┬──────────┬────────┐
│ nrs ┆ nrs2 ┆ names ┆ random   ┆ groups │
│ --- ┆ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i32 ┆ i32  ┆ str   ┆ f64      ┆ str    │
╞═════╪══════╪═══════╪══════════╪════════╡
│ 1   ┆ 2    ┆ foo   ┆ 0.568035 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3    ┆ ham   ┆ 0.4602   ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4    ┆ spam  ┆ 0.647715 ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ 6    ┆       ┆ 0.991221 ┆ B      │
└─────┴──────┴───────┴──────────┴────────┘
Unlike `.filter`, DataFrame `.grep` is implemented by converting a Rust Dan::Polars::DataFrame to a raku Dan::DataFrame (a `.flood`), performing the grep with a raku block-style syntax and then converting back (a `.flush`). The implication is that the syntax is very rich, but the performance is lower than with an Expression filter.
# Grep (binary filter)
say ~df.grep( { .[1] < 0.5 } ); # by 2nd column
say ~df.grep( { df.ix[$++] eq <2022-01-02 2022-01-06>.any } ); # by index (multiple)
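To make the block-style idea concrete, here is a minimal sketch in plain Raku. It assumes the flooded data can be pictured as a simple list of rows; the `@rows` and `@kept` names are hypothetical and this is not how Dan::Polars stores its data internally.

```raku
# Hypothetical stand-in for a flooded DataFrame: a plain list of rows,
# each row being [nrs, random, groups]. Not Dan::Polars internals.
my @rows = [1, 0.57, 'A'], [2, 0.46, 'A'], [3, 0.65, 'B'];

# The block receives one row as the topic; .[1] reads its 2nd column.
my @kept = @rows.grep: { .[1] < 0.5 };

say @kept.elems;    # 1  (only the 0.46 row passes)
say @kept[0][0];    # 2  (first column of the surviving row)
```

Because the predicate is an ordinary Raku block, any Raku expression can be used in it, which is where the richness (and the extra cost of the round trip) comes from.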
Specify an Array[Str] of column names and an Array[Bool] of descending? options:
df.sort(["groups","names"],[False, True]).show;
shape: (5, 5)
┌─────┬──────┬───────┬──────────┬────────┐
│ nrs ┆ nrs2 ┆ names ┆ random   ┆ groups │
│ --- ┆ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i32 ┆ i32  ┆ str   ┆ f64      ┆ str    │
╞═════╪══════╪═══════╪══════════╪════════╡
│ 2   ┆ 3    ┆ ham   ┆ 0.651383 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1   ┆ 2    ┆ foo   ┆ 0.687945 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4    ┆ spam  ┆ 0.020684 ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ 6    ┆       ┆ 0.961176 ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4   ┆ 5    ┆ egg   ┆ 0.666724 ┆ C      │
└─────┴──────┴───────┴──────────┴────────┘
Or, if you prefer a more raku-oriented style, specify a Block:
df.sort( {df[$++]<random>} )[*].reverse^.show;
shape: (5, 5)
┌─────┬──────┬───────┬──────────┬────────┐
│ nrs ┆ nrs2 ┆ names ┆ random   ┆ groups │
│ --- ┆ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i32 ┆ i32  ┆ str   ┆ f64      ┆ str    │
╞═════╪══════╪═══════╪══════════╪════════╡
│ 5   ┆ 6    ┆       ┆ 0.961176 ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1   ┆ 2    ┆ foo   ┆ 0.687945 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4   ┆ 5    ┆ egg   ┆ 0.666724 ┆ C      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3    ┆ ham   ┆ 0.651383 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4    ┆ spam  ┆ 0.020684 ┆ B      │
└─────┴──────┴───────┴──────────┴────────┘
As set out in the Dan synopsis, DataFrame level sort is done like this:
# Sort
say ~df.sort: { .[1] }; # sort by 2nd col (ascending)
say ~df.sort: { -.[1] }; # sort by 2nd col (descending)
say ~df.sort: { df[$++]<C> }; # sort by col C
say ~df.sort: { df.ix[$++] }; # sort by index
Here is another example from the Dan::Polars Nutshell:
$obj .= sort( {$obj[$++]<species>, $obj[$++]<mass>} )[*].reverse^;
Unlike colspec sort, Block sort is implemented by converting a Rust Dan::Polars::DataFrame to a raku Dan::DataFrame (i.e. `.flood`), performing the sort with a raku block-style syntax and then converting back (i.e. `.flush`). The implication is that the syntax is very rich, but the performance is lower.
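For comparison, the ordering requested by the colspec call `df.sort(["groups","names"],[False, True])` (groups ascending, then names descending) can be sketched in plain Raku. This is only an illustration of the ordering rule: the `@rows` list of hashes is a hypothetical stand-in for the example data, not the Dan::Polars implementation.

```raku
# Hypothetical rows as hashes; values copied from the example DataFrame.
my @rows =
    %( nrs => 1, names => 'foo',  groups => 'A' ),
    %( nrs => 2, names => 'ham',  groups => 'A' ),
    %( nrs => 3, names => 'spam', groups => 'B' ),
    %( nrs => 4, names => 'egg',  groups => 'C' ),
    %( nrs => 5, names => '',     groups => 'B' );

# groups ascending; ties broken by names descending (operands swapped).
my @sorted = @rows.sort: {
    $^a<groups> cmp $^b<groups> || $^b<names> cmp $^a<names>
};

say @sorted.map({ .<nrs> }).join(' ');    # 2 1 3 5 4
```

The resulting row order matches the colspec output shown earlier (ham, foo, spam, the empty name, egg).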
The sort method on col Expressions in a select is independently applied to each col.
df.select([(col("names").alias("jones").sort),col("groups").alias("smith").sort,col("nrs").reverse]).head;
shape: (5, 3)
┌───────┬───────┬─────┐
│ jones ┆ smith ┆ nrs │
│ ---   ┆ ---   ┆ --- │
│ str   ┆ str   ┆ i32 │
╞═══════╪═══════╪═════╡
│       ┆ A     ┆ 5   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ egg   ┆ A     ┆ 4   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ foo   ┆ B     ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham   ┆ B     ┆ 2   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam  ┆ C     ┆ 1   │
└───────┴───────┴─────┘
The sort method on col Expressions in a groupby is applied to the list result.
df.groupby(["groups"]).agg([col("nrs").sort]).head;
#df.groupby(["groups"]).agg([col("nrs").reverse]).head;
shape: (3, 2)
┌────────┬───────────┐
│ groups ┆ nrs       │
│ ---    ┆ ---       │
│ str    ┆ list[i32] │
╞════════╪═══════════╡
│ C      ┆ [4]       │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ A      ┆ [1, 2]    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ B      ┆ [3, 5]    │
└────────┴───────────┘
In Dan::Polars, missing data is represented by the raku Type Object (Int, Bool, Str and so on) or by the raku Numeric special values (NaN, +/-Inf).
my \df = DataFrame.new([
    nrs => [1, 2, 3, 4, 5],
    nrs2 => [Num, NaN, 4, Inf, 8.3],
    names => ["foo", Str, "spam", "egg", ""],
    random => [1.rand xx 5],
    groups => ["A", "A", "B", "C", "B"],
    flags => [True,True,False,True,Bool],
]);
df.show;
shape: (5, 6)
┌─────┬──────┬───────┬──────────┬────────┬───────┐
│ nrs ┆ nrs2 ┆ names ┆ random   ┆ groups ┆ flags │
│ --- ┆ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---   │
│ i32 ┆ f64  ┆ str   ┆ f64      ┆ str    ┆ bool  │
╞═════╪══════╪═══════╪══════════╪════════╪═══════╡
│ 1   ┆ null ┆ foo   ┆ 0.074586 ┆ A      ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ NaN  ┆ null  ┆ 0.867919 ┆ A      ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 4.0  ┆ spam  ┆ 0.069183 ┆ B      ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ inf  ┆ egg   ┆ 0.739191 ┆ C      ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5   ┆ 8.3  ┆       ┆ 0.133729 ┆ B      ┆ null  │
└─────┴──────┴───────┴──────────┴────────┴───────┘
And, conversely, when cast back to a (non Polars) Dan DataFrame:
say ~df.Dan-DataFrame;
    nrs  nrs2  names  random               groups  flags
 0  1    Num   foo    0.9188127959571387   A       True
 1  2    NaN   Str    0.08257029673307026  A       True
 2  3    4     spam   0.0682447340762582   B       False
 3  4    Inf   egg    0.3287371781756494   C       True
 4  5    8.3          0.5133318112263049   B       Bool
You can test for what you have with:
Sense | Truthiness | Definedness | Numberness | Finiteness |
---|---|---|---|---|
so | n/a | is_null | is_not_nan | is_finite |
not | is_not | is_not_null | is_nan | is_infinite |
#`[
df.select([(col("nrs") > 2)]).head;
df.select([((col("nrs") > 2).is_not)]).head;
df.select([(col("nrs2").is_null)]).head;
df.select([(col("nrs2").is_not_null)]).head;
df.select([(col("nrs2").is_not_nan)]).head;
df.select([(col("nrs2").is_nan)]).head;
df.select([(col("nrs2").is_finite)]).head;
#]
df.select([(col("nrs2").is_infinite)]).head;
shape: (5, 1)
┌───────┐
│ nrs2  │
│ ---   │
│ bool  │
╞═══════╡
│ null  │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌┤
│ true  │
├╌╌╌╌╌╌╌┤
│ false │
└───────┘
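The distinction between null, NaN and infinite values can also be seen on the raku side. The sketch below uses a hypothetical helper, `cell-kind`, which is not part of Dan::Polars; it only shows how the raku values that Dan::Polars maps to missing data can be told apart in plain Raku.

```raku
# Hypothetical helper (not part of Dan::Polars): label one cell value.
sub cell-kind($x) {
    return 'null'     without $x;                          # type objects such as Num, Str, Bool
    return 'nan'      if $x ~~ Num && $x === NaN;          # NaN == NaN is False, so use ===
    return 'infinite' if $x ~~ Numeric && $x.abs == Inf;   # covers Inf and -Inf
    'value'
}

# The nrs2 column from the example above.
say (Num, NaN, 4, Inf, 8.3).map(&cell-kind).join(' ');
# null nan value infinite value
```

The labels line up with the columns of the test table: `is_null` for the type object, `is_nan` for NaN and `is_infinite` for Inf.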
In Rust Polars, map and apply functions are offered. In Dan::Polars, only `apply` is provided for user-defined functions. Per the Polars user guide:
Use cases for map in the group_by context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results...
Luckily, `apply` works on the smallest logical elements for the operation:
- select context -> single elements
- group by context -> single groups
Dan::Polars `apply` aims to offer near-native Rust Polars performance on user-defined operations embedded in raku code. Long term, it is intended to be suitable for concurrent and parallel processing, so it could be faster than Python Polars. The operation is written in "Rust lambda slang" within your raku code; it is then JIT compiled and made available in a Rust library (`libapply.so` or equivalent) to be called from the Rust Polars library.
Monadic - operations with one argument
Taking this example DataFrame:
my \df = DataFrame.new([
    nrs => [1, 2, 3, 4, 5],
    nrs2 => [2, 3, 4, 5, 6],
    names => ["foo", "ham", "spam", "egg", ""],
    random => [1.rand xx 5],
    groups => ["A", "A", "B", "C", "B"],
]);
df.show;
shape: (5, 5)
┌─────┬──────┬───────┬──────────┬────────┐
│ nrs ┆ nrs2 ┆ names ┆ random   ┆ groups │
│ --- ┆ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i32 ┆ i32  ┆ str   ┆ f64      ┆ str    │
╞═════╪══════╪═══════╪══════════╪════════╡
│ 1   ┆ 2    ┆ foo   ┆ 0.455665 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3    ┆ ham   ┆ 0.961131 ┆ A      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4    ┆ spam  ┆ 0.093231 ┆ B      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4   ┆ 5    ┆ egg   ┆ 0.570909 ┆ C      │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ 6    ┆       ┆ 0.716256 ┆ B      │
└─────┴──────┴───────┴──────────┴────────┘
Here we add one to each i32 value in a `groupby`:
df.groupby(["groups"]).agg([col("nrs").apply("|a: i32| (a + 1) as i32").alias("jones")]).head;
shape: (3, 2)
┌────────┬───────────┐
│ groups ┆ jones     │
│ ---    ┆ ---       │
│ str    ┆ list[i32] │
╞════════╪═══════════╡
│ B      ┆ [4, 6]    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ A      ┆ [2, 3]    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ C      ┆ [5]       │
└────────┴───────────┘
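As a cross-check on the result above, the same computation can be sketched in plain Raku without Polars. The names `&add-one` and `%jones` are hypothetical, and `classify` here only mimics what the group-by context does; it says nothing about how Dan::Polars compiles or runs the Rust lambda.

```raku
my @nrs    = 1, 2, 3, 4, 5;
my @groups = <A A B C B>;

# Plain-Raku counterpart of the Rust lambda "|a: i32| (a + 1) as i32".
my &add-one = -> Int $a { $a + 1 };

# Pair each group label with its value, bucket by label,
# and apply the lambda to every value as it is stored.
my %jones = (@groups Z=> @nrs).classify( *.key, as => { add-one(.value) } );

say %jones<A>.join(', ');    # 2, 3
say %jones<B>.join(', ');    # 4, 6
say %jones<C>.join(', ');    # 5
```

Each group ends up with a list of incremented values, matching the `list[i32]` column in the Polars output.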
The various parts of the raku source code with Rust lambda slang are described below:
# Monadic Real
# apply() is exported directly into client script and acts on the ExprC made by col()
# its argument is a string in the form of a Rust lambda with |signature| (body) as rtn-type
# the lambda takes variable 'a: type' if monadic or 'a: type, b: type' if dyadic
# the body is a valid Rust expression
#df.select([col("nrs").apply("|a: i32| (a + 1) as i32").alias("jones")]).head;
--- ------  ---------- -----  --------  -----  ------   -----            ----
 |    |         |        |       |        |      |        |               -> method head prints top lines of result
 |    |         |        |       |        |      |        |
 |    |         |        |       |        |      |        -> method alias returns a new Expr
 |    |         |        |       |        |      |
 |    |         |        |       |        |      -> lambda return type (Rust)
 |    |         |        |       |        |
 |    |         |        |       |        -> lambda expression using varname (Rust)
 |    |         |        |       |
 |    |         |        |       -> lambda signature with varname : type (Rust)
 |    |         |        |
 |    |         |        -> method apply returns a new Expr with the results of the lambda
 |    |         |
 |    |         -> method col(Str \colname) returns a new (empty) Expr
 |    |
 |    -> method select(Array \exprs) creates a LazyFrame, calls .select(exprs) then .collect
 |
 |
 -> DataFrame object with attributes of pointers to rust DataFrame and LazyFrame structures
Dyadic - operations with two arguments
Taking this example DataFrame:
my \df2 = DataFrame.new([
    keys => ["a", "a", "b"],
    values => [10, 7, 1],
    ovalues => [10, 7, 1],
]);
df2.show;
shape: (3, 3)
┌──────┬────────┬─────────┐
│ keys ┆ values ┆ ovalues │
│ ---  ┆ ---    ┆ ---     │
│ str  ┆ i32    ┆ i32     │
╞══════╪════════╪═════════╡
│ a    ┆ 10     ┆ 10      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a    ┆ 7      ┆ 7       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b    ┆ 1      ┆ 1       │
└──────┴────────┴─────────┘
Here we add one to each i32 value in a `groupby`:
In Dan::Polars, the two sections - Join and Concat - are related via these tables:
Function | Description | Dan |
---|---|---|
join | Join on a column | df1.join(df2, how=>'inner', on=>'col') |
concat | Concatenate along an axis | df1.concat(df2, axis=>0/1) |
Function | Description | Dan |
---|---|---|
concat | Append one Series to another | series1.concat( series2 ) |
The rationale for this solution is set out in Issue #10.
Here is the signature of the Dan::Polars `.join` method:
subset JoinType of Str where <left inner outer cross>.any;
method join( DataFrame \right, Str :$on, JoinType :$how = 'outer' ) { ... }
- use `on => 'colname'` to pass the column on which to do the join
  - Dan::Polars will guess the on column(s) if nothing is supplied
  - `on_right` and `on_left` are not provided
  - ignored if a `cross` join
- use `how => 'jointype'` to specify how to do the join
  - default is `outer`
  - undefined cells are created as `null`
  - `right` is not implemented (swap method call if needed)
  - `asof` and `semi` are not yet implemented
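The `JoinType` subset from the signature can be exercised on its own to see which values of `how` are accepted and what the default is. In this sketch `describe-join` is a hypothetical stand-in that only borrows the named parameters of `.join`; it performs no join.

```raku
subset JoinType of Str where <left inner outer cross>.any;

# Hypothetical stand-in with the same named parameters as the .join signature.
sub describe-join( Str :$on, JoinType :$how = 'outer' ) {
    $on.defined ?? "$how join on $on" !! "$how join, columns guessed"
}

say describe-join();                                     # outer join, columns guessed
say describe-join( on => 'customer_id', :how<inner> );   # inner join on customer_id
say 'right' ~~ JoinType;                                 # False
```

Passing a value outside the subset, such as `how => 'right'`, fails the type check when the routine is called, which is consistent with `right` being listed as not implemented.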
First some examples:
my \df_customers = DataFrame.new([
    customer_id => [1, 2, 3],
    name => ["Alice", "Bob", "Charlie"],
]);
df_customers.show;
shape: (3, 2)
┌─────────────┬─────────┐
│ customer_id ┆ name    │
│ ---         ┆ ---     │
│ i32         ┆ str     │
╞═════════════╪═════════╡
│ 1           ┆ Alice   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2           ┆ Bob     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3           ┆ Charlie │
└─────────────┴─────────┘
my \df_orders = DataFrame.new([
    order_id => ["a", "b", "c"],
    customer_id => [1, 2, 2],
    amount => [100, 200, 300],
]);
df_orders.show;
shape: (3, 3)
┌──────────┬─────────────┬────────┐
│ order_id ┆ customer_id ┆ amount │
│ ---      ┆ ---         ┆ ---    │
│ str      ┆ i32         ┆ i32    │
╞══════════╪═════════════╪════════╡
│ a        ┆ 1           ┆ 100    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b        ┆ 2           ┆ 200    │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c        ┆ 2           ┆ 300    │
└──────────┴─────────────┴────────┘
df_customers.join(df_orders, on => "customer_id", how => "inner").show;
shape: (3, 4)
┌─────────────┬───────┬──────────┬────────┐
│ customer_id ┆ name  ┆ order_id ┆ amount │
│ ---         ┆ ---   ┆ ---      ┆ ---    │
│ i32         ┆ str   ┆ str      ┆ i32    │
╞═════════════╪═══════╪══════════╪════════╡
│ 1           ┆ Alice ┆ a        ┆ 100    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2           ┆ Bob   ┆ b        ┆ 200    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2           ┆ Bob   ┆ c        ┆ 300    │
└─────────────┴───────┴──────────┴────────┘
df_customers.join(df_orders).show; #outer join relying on defaults
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i32         ┆ str     ┆ str      ┆ i32    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2           ┆ Bob     ┆ b        ┆ 200    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2           ┆ Bob     ┆ c        ┆ 300    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘
df_customers.join(df_orders, on => "customer_id", how => "left").show;
^^ same as above (in this example)
For cross join:
my \df_colors = DataFrame.new([
    color => ["red", "blue", "green"],
]);
df_colors.show;
shape: (3, 1)
┌───────┐
│ color │
│ ---   │
│ str   │
╞═══════╡
│ red   │
├╌╌╌╌╌╌╌┤
│ blue  │
├╌╌╌╌╌╌╌┤
│ green │
└───────┘
my \df_sizes = DataFrame.new([
    size => ["S", "M", "L"],
]);
df_sizes.show;
shape: (3, 1)
┌──────┐
│ size │
│ ---  │
│ str  │
╞══════╡
│ S    │
├╌╌╌╌╌╌┤
│ M    │
├╌╌╌╌╌╌┤
│ L    │
└──────┘
df_colors.join( df_sizes, :how<cross> ).show;
shape: (9, 2)
┌───────┬──────┐
│ color ┆ size │
│ ---   ┆ ---  │
│ str   ┆ str  │
╞═══════╪══════╡
│ red   ┆ S    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ red   ┆ M    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ red   ┆ L    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ blue  ┆ S    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...   ┆ ...  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ blue  ┆ L    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ green ┆ S    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ green ┆ M    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ green ┆ L    │
└───────┴──────┘
Here is the signature of the Dan::Polars DataFrame `.concat` method:
method concat( DataFrame:D $dfr, :ax(:$axis) is copy ) { ... }
given $axis {
    when ! .so || /^r/ || /^v/ { 0 }
    when   .so || /^c/ || /^h/ { 1 }
}
- `ax` is an alias for `axis`
- default (False) is vertical
- as values you can use
  - False | True
  - 0 | 1
  - anything with initial char [r]ow or [c]olumn
  - anything with initial char [v]ertical or [h]orizontal
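The rules in the list above can be spelled out as a small standalone routine. This is a sketch under stated assumptions: `axis-number` is a hypothetical helper written for illustration, not the Dan::Polars source, and it rejects unrecognised strings with an error, which the library may or may not do.

```raku
# Hypothetical helper (not the Dan::Polars source): 0 = vertical, 1 = horizontal.
sub axis-number($axis) {
    return 0 without $axis;                      # nothing supplied: vertical
    return ($axis ?? 1 !! 0) if $axis ~~ Int;    # Bool and Int values (Bool is an Int in Raku)
    return 0 if $axis ~~ /^ <[rv]> /;            # [r]ow, [v]ertical
    return 1 if $axis ~~ /^ <[ch]> /;            # [c]olumn, [h]orizontal
    die "unrecognised axis '$axis'";
}

say (Any, False, True, 0, 1, 'row', 'column', 'vertical', 'h').map(&axis-number).join(' ');
# 0 0 1 0 1 0 1 0 1
```

A bare `:axis` adverb, as in `dfa.concat(dfc, :axis)`, passes `True`, which maps to 1 (horizontal) under these rules.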
First, some example data:
my \dfa = DataFrame.new(
    [['a', 1], ['b', 2]],
    columns => <letter number>,
);
my \dfb = DataFrame.new(
    [['c', 3], ['d', 4]],
    columns => <letter number>,
);
my \dfc = DataFrame.new(
    [['cat', 4], ['dog', 4]],
    columns => <animal legs>,
);
dfa.concat(dfb).show; # vertical is default
shape: (4, 2)
┌────────┬────────┐
│ letter ┆ number │
│ ---    ┆ ---    │
│ str    ┆ i32    │
╞════════╪════════╡
│ a      ┆ 1      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b      ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c      ┆ 3      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d      ┆ 4      │
└────────┴────────┘
dfa.concat(dfc, :axis).show; # horizontal or column-wise
shape: (2, 4)
┌────────┬────────┬────────┬──────┐
│ letter ┆ number ┆ animal ┆ legs │
│ ---    ┆ ---    ┆ ---    ┆ ---  │
│ str    ┆ i32    ┆ str    ┆ i32  │
╞════════╪════════╪════════╪══════╡
│ a      ┆ 1      ┆ cat    ┆ 4    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b      ┆ 2      ┆ dog    ┆ 4    │
└────────┴────────┴────────┴──────┘
my \s = Series.new( [b=>1, a=>0, c=>2] );
my \t = Series.new( [f=>1, e=>0, d=>2] );
my $u = s.concat: t; # concatenate
$u.show;
shape: (6,)
Series: 'anon' [i32]
[
1
0
2
1
0
2
]
Copyright(c) 2022-2023 Henley Cloud Consulting Ltd.