THIS MODULE IS EXPERIMENTAL AND SUBJECT TO CHANGE WITHOUT NOITCE
Top level raku Data ANalysis Module that provides a base set of raku-style datatype roles, accessors & methods, primarily:
- Series
- DataFrames
A common basis for bindings such as ... Dan::Pandas (via Inline::Python), Dan::Polars (via NativeCall / Rust FFI), etc.
It's rather a zen concept since raku contains many Data Analysis constructs & concepts natively anyway (see note 7 below)
Contributions via PR are very welcome - please see the backlog Issue, or just email [email protected] to share ideas!
zef install Dan;
more examples in bin/synopsis.raku
use Dan :ALL;
### Series ###
my \s = Series.new( [b=>1, a=>0, c=>2] ); #from Array of Pairs
# -or- Series.new( [rand xx 5], index => <a b c d e>);
# -or- Series.new( data => [1, 3, 5, NaN, 6, 8], index => <a b c d e f>, name => 'john' );
say ~s;
# Accessors
say s[1]; #2 (positional)
say s<b c>; #2 1 (associative with slice)
# Map/Reduce
say s.map(*+2); #(3 2 4)
say [+] s; #3
# Hyper
say s >>+>> 2; #(3 2 4)
say s >>+<< s; #(2 0 4)
# Update
s.data[1] = 1; # set value
s.splice(1,2,(j=>3)); # update index & value
# Combine
my \t = Series.new( [f=>1, e=>0, d=>2] );
s.concat: t; # concatenate
say "=============================================";
### DataFrames ###
my \dates = (Date.new("2022-01-01"), *+1 ... *)[^6];
my \df = DataFrame.new( [[rand xx 4] xx 6], index => dates, columns => <A B C D> );
# -or- DataFrame.new( [rand xx 5], columns => <A B C D>);
# -or- DataFrame.new( [rand xx 5] );
say ~df;
say "---------------------------------------------";
# Data Accessors [row;col]
say df[0;0];
df[0;0] = 3; # set value
# Cascading Accessors (ok to mix Positional and Associative)
say df[0][0];
say df[0]<A>;
say df{"2022-01-03"}[1];
# Object Accessors & Slices (see note 1)
say ~df[0]; # 1d Row 0 (DataSlice)
say ~df[*]<A>; # 1d Col A (Series)
say ~df[0..*-2][1..*-1]; # 2d DataFrame
say ~df{dates[0..1]}^; # the ^ postfix converts an Array of DataSlices into a new DataFrame
say "---------------------------------------------";
### DataFrame Operations ###
# 2d Map/Reduce
say df.map(*.map(*+2).eager);
say [+] df[*;1];
say [+] df[*;*];
# Hyper
say df >>+>> 2;
say df >>+<< df;
# Transpose
say ~df.T;
# Describe
say ~df[0..^3]^; # head
say ~df[(*-3..*-1)]^; # tail
say ~df.shape;
say ~df.describe;
# Sort
say ~df.sort: { .[1] }; # sort by 2nd col (ascending)
say ~df.sort: { -.[1] }; # sort by 2nd col (descending)
say ~df.sort: { df[$++]<C> }; # sort by col C
say ~df.sort: { df.ix[$++] }; # sort by index
# Grep (binary filter)
say ~df.grep( { .[1] < 0.5 } ); # by 2nd column
say ~df.grep( { df.ix[$++] eq <2022-01-02 2022-01-06>.any } ); # by index (multiple)
say "---------------------------------------------";
my \df2 = DataFrame.new([
A => 1.0,
B => Date.new("2022-01-01"),
C => Series.new(1, index => [0..^4], dtype => Num),
D => [3 xx 4],
E => Categorical.new(<test train test train>),
F => "foo",
]);
say ~df2;
say df2.data;
say df2.dtypes;
say df2.index; #Hash (name => row number) -or- df.ix; #Array
say df2.columns; #Hash (label => col number) -or- df.cx; #Array
say "---------------------------------------------";
### DataFrame Splicing ### (see notes 2 & 3)
# row-wise splice:
my $ds = df2[1]; # get a DataSlice
$ds.splice($ds.index<d>,1,7); # tweak it a bit
df2.splice( 1, 2, [j => $ds] ); # default
# column-wise splice:
my $se = df2.series: <a>; # get a Series
$se.splice(2,1,7); # tweak it a bit
df2.splice( :ax, 1, 2, [K => $se] ); # axis => 1
say "---------------------------------------------";
### DataFrame Concatenation ### (see notes 4 & 5)
my \dfa = DataFrame.new(
[['a', 1], ['b', 2]],
columns => <letter number>,
);
#`[
letter number
0 a 1
1 b 2
#]
my \dfc = DataFrame.new(
[['c', 3, 'cat'], ['d', 4, 'dog']],
columns => <animal letter number>,
);
#`[
letter number animal
0 c 3 cat
1 d 4 dog
#]
dfa.concat: dfc; # row-wise / outer join is default
#`[
letter number animal
0 a 1 NaN
1 b 2 NaN
0⋅1 c 3 cat
1⋅1 d 4 dog
#]
dfa.concat: dfc, join => 'inner';
#`[
letter number
0 a 1
1 b 2
0⋅1 c 3
1⋅1 d 4
#]
my \dfd = DataFrame.new( [['bird', 'polly'], ['monkey', 'george']],
columns=> <animal name>, );
dfa.concat: dfd, axis => 1; #column-wise
#`[
letter number animal name
0 a 1 bird polly
1 b 2 monkey george
#]
say "=============================================";
Notes:
[1] raku accessors may use any function that makes a List, e.g.
Positional slices: [1,3,4], [0..3], [0..*-2], [*]
Associative slices: <A C D>, {'A'..'C'}
viz. https://docs.raku.org/language/subscripts
[2] splice is the core update method
for all add, drop, move, delete, update & insert operations
viz. https://docs.raku.org/routine/splice
[3] named parameter 'axis' indicates if row(0) or col(1)
if omitted, default=0 (row) / 'ax' is an alias
use a Pair literal like :!axis, :axis(1) or :ax
[4] concat is the core combine method
for all join, merge & combine operations
duplicate labels are extended with $mark ~ $i++
# $mark = '⋅'; # unicode Dot Operator U+22C5
use :ii (:ignore-index)
to reset the index (row or col)
[5] concat supports join => outer|inner|right|left
unknown values are set to NaN
default is outer, :jn is alias, and you can go :jn on first letter
set axis param (see splice above) for col-wise concatenation
[6] relies on hypers instead of overriding dyadic operators [+-*/]
say ~my \quants = Series.new([100, 15, 50, 15, 25]);
say ~my \prices = Series.new([1.1, 4.3, 2.2, 7.41, 2.89]);
say ~my \costs = Series.new( quants >>*<< prices );
[7] what are we getting from raku core that others do in libraries?
- pipes & maps
- multi-dimensional arrays
- slicing & indexing
- references & views
- map, reduce, hyper operators
- operator overloading
- concurrency
- types (incl. NaN)
copyright(c) 2022-2024 Henley Cloud Consulting Ltd.