Vector arithmetic #153

danielpcox · 2021-06-25T19:31:29Z

Adds a new Math method to dataframe.DataFrame capable of computing n-ary arithmetic functions against entire selected columns, storing the result in a new column (or replacing an existing one). Supports int and float64 types. Supports operator specification by string (e.g., "+", "/", etc.) or by a unary, binary, or ternary int or float64 function (e.g., for supplying a float64 function from Go's math package). For example:

/*  `df` is a 5x4 DataFrame:

   Strings  Floats   Primes Naturals
0: e        2.718000 1      1
1: Pi       3.142000 3      2
2: Phi      1.618000 5      3
3: Sqrt2    1.414000 7      4
4: Ln2      0.693000 11     5
   <string> <float>  <int>  <int>
*/
df := New(
	series.New([]string{"e", "Pi", "Phi", "Sqrt2", "Ln2"}, series.String, "Strings"),
	series.New([]float64{2.718, 3.142, 1.618, 1.414, 0.693}, series.Float, "Floats"),
	series.New([]int{1, 3, 5, 7, 11}, series.Int, "Primes"),
	series.New([]int{1, 2, 3, 4, 5}, series.Int, "Naturals"),
)

// New method `Math` takes a new column name, an operator (string or func) and at least one column name
withNewDiffColumn := df.Math("Diff", "-", "Floats", "Primes")

fmt.Println(withNewDiffColumn)

/* New DataFrame now has a column named "Diff" which is
    the result of subtracting Primes from Floats.
	
    Strings  Floats   Primes Naturals Diff
 0: e        2.718000 1      1        1.718000  
 1: Pi       3.142000 3      2        0.142000  
 2: Phi      1.618000 5      3        -3.382000 
 3: Sqrt2    1.414000 7      4        -5.586000 
 4: Ln2      0.693000 11     5        -10.307000
    <string> <float>  <int>  <int>    <float> 
*/

There are more examples in the docs and tests.
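To illustrate the string-operator form on its own, here is a minimal standalone sketch that works on plain float64 slices rather than gota Series. The names `binaryOps` and `arithmetic` are hypothetical, and the left-to-right fold across more than two columns is an assumption; the PR text does not say how n-ary operators are combined.

```go
package main

import "fmt"

// binaryOps maps operator strings to float64 functions.
// Hypothetical table for illustration; not gota's implementation.
var binaryOps = map[string]func(a, b float64) float64{
	"+": func(a, b float64) float64 { return a + b },
	"-": func(a, b float64) float64 { return a - b },
	"*": func(a, b float64) float64 { return a * b },
	"/": func(a, b float64) float64 { return a / b },
}

// arithmetic applies op row by row, folding left to right across the
// operand columns (assumed semantics for more than two operands).
func arithmetic(op string, cols ...[]float64) ([]float64, error) {
	f, ok := binaryOps[op]
	if !ok {
		return nil, fmt.Errorf("unknown operator %q", op)
	}
	if len(cols) == 0 {
		return nil, fmt.Errorf("at least one operand column is required")
	}
	// Copy the first column so the caller's data is not modified.
	out := append([]float64(nil), cols[0]...)
	for _, c := range cols[1:] {
		if len(c) != len(out) {
			return nil, fmt.Errorf("column lengths differ: %d vs %d", len(out), len(c))
		}
		for i := range out {
			out[i] = f(out[i], c[i])
		}
	}
	return out, nil
}

func main() {
	res, err := arithmetic("-", []float64{10, 20}, []float64{1, 2}, []float64{3, 4})
	if err != nil {
		panic(err)
	}
	fmt.Println(res) // [6 14]
}
```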

This PR also adds a new FindElem method to dataframe.DataFrame which lets a user pull a particular series.Element out of a DataFrame by specifying a column and value to select a row (assumed to be unique), and another column to find a particular value within that row. For example, the following line will search through the "Metric" column of each row for a value "envoy_cluster_upstream_rq_active", and then it will return the series.Element from that row corresponding to the "Value" column:

df.FindElem("Metric", "envoy_cluster_upstream_rq_active", "Value")

chrmang

Hi Daniel,

thank you very much for your contribution. The df.Math is a very interesting function, but is it possible for you to do a little rework? I don't want to reduce it to only math.

A footprint like Apply2Element(resultcol string, f func(e1, e2 Element) Element, col1, col2 string) DataFrame gives more flexibility and can work even on string Series.
Maybe Apply2Float64(resultcol string, f func(e1, e2 float64) Element, col1, col2 string) DataFrame is also handy. Especially in combination with the "math" package.
Automatic coercion can cause subtle bugs in user code. Please, don't use it. If needed, this can be done in f.
With df.Filter it is already possible to select one or more Rows of a Dataframe.
Maybe we should add a df.Head function to select only the first.

chrmang · 2021-06-27T10:49:00Z

CHANGELOG.md

@@ -12,6 +12,8 @@ This project adheres to [Semantic Versioning](http://semver.org/).
 - Combining filters with AND
 - User-defined filters
 - Concatination of Dataframes
+- Math for vector operations on multiple columns


Please move it to the 0.12.0 section

Will do 👌

danielpcox · 2021-06-30T02:57:24Z

Thanks for your comments. I agree this can be reworked to accommodate more than math. However, I'd be very sorry to see the math-specific string flavors of op go, or to add more overhead to a simple operation such that it can no longer be performed succinctly in the 90% case.

Counter-proposal:

What if instead we split df.Math into three different new methods:

The first would have signature Arithmetic(resultcol string, op string, operandcols ...string) DataFrame, which only takes the limited string ops ("+", "-", "*", "/", and "%"), and takes variadic operands like df.Math currently does. I'd also like Arithmetic to be allowed to coerce values, because its purpose is to make common operations as easy as possible, but more on that below.
The second would have signature something like ElemMultiApply(resultcol string, op func(elements ...Element) Element, operandcols ...string) DataFrame, where the user passes a variadic function on Elements and however many columns, and it gets applied without coercion.
The third would have signature something like FloatMultiApply(resultcol string, op interface{}, operandcols ...string) DataFrame and use the same techniques as in the current df.Math to support unary, binary, and ternary op functions on at least float64 values (and I'd really like to be able to automatically convert the ints in mixed operand columns to float64 as necessary to enable pleasant access to the math package on integers - but see below for coercion discussion). It would have to be able to support all three arities to be able to take any function from math directly.

No coercion?

Is the request not to do automatic coercion a gota policy set in stone? I thought there was already automatic coercion in gota. Capply and Rapply in the readme say "casting the types as necessary", and the function I used to figure out what the output should be (int or float64) was already there; I just moved it out of a function to make it accessible to Math.

Are you sure you wouldn't want it in, when it's only automatic coercion in one direction (int -> float64) and it only happens when the input columns are mismatched (at least one float64 column among the operands)? I would personally much prefer a concise API with a few well-documented potential gotchas to a verbose API that makes me do extra work in the most common cases. Coercion is also how I managed to make it possible without much ceremony to pass any function from Go's math package in as op and have it correctly apply to columns of mixed type (they get detected and cast to float64 to be compatible, and the output is always float64).
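The one-directional rule described here (widen int to float64 only when at least one operand column is float64) can be sketched as follows. The `column` type and `subtract` function are hypothetical stand-ins for gota's Series and the proposed method, and the sketch assumes both columns have the same length.

```go
package main

import "fmt"

// column is a hypothetical stand-in for a typed Series: exactly one of
// Ints or Floats is set.
type column struct {
	Ints   []int
	Floats []float64
}

func (c column) isFloat() bool { return c.Floats != nil }

// asFloats widens an int column to float64; float columns pass through.
func (c column) asFloats() []float64 {
	if c.isFloat() {
		return c.Floats
	}
	out := make([]float64, len(c.Ints))
	for i, v := range c.Ints {
		out[i] = float64(v)
	}
	return out
}

// subtract returns a - b row by row. The result stays int only when both
// operands are int; one float64 operand widens the whole operation.
// Assumes a and b have equal length.
func subtract(a, b column) column {
	if !a.isFloat() && !b.isFloat() {
		out := make([]int, len(a.Ints))
		for i := range out {
			out[i] = a.Ints[i] - b.Ints[i]
		}
		return column{Ints: out}
	}
	af, bf := a.asFloats(), b.asFloats()
	out := make([]float64, len(af))
	for i := range out {
		out[i] = af[i] - bf[i]
	}
	return column{Floats: out}
}

func main() {
	primes := column{Ints: []int{1, 3, 5, 7, 11}}
	naturals := column{Ints: []int{1, 2, 3, 4, 5}}
	floats := column{Floats: []float64{2.718, 3.142, 1.618, 1.414, 0.693}}

	fmt.Println(subtract(primes, naturals).Ints) // [0 1 2 3 6]
	fmt.Println(subtract(floats, primes).isFloat()) // true
}
```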

There's also type coercion in Pandas and R, and people seem to be able to handle it. I think a nice pile of warnings in the documentation would suffice, at least for me, and what we get for it is agility and API clarity. (And the reason I'm using gota in the first place is because idiomatic Go doesn't let me express a complex high level thought succinctly enough to do it often.)

All of that said, I'm flexible here, and it's your project. :)

FindElem

As for FindElem, I think I can just remove that without sacrificing much. I currently perform the same operation in my existing code with df.Filter(...).Elem(0,1).Float() which is succinct enough. (If we do add something that gives you only the first match later though, I'd suggest First or FirstRow rather than Head, because df.head in Pandas shows the first n, defaulting to 5.)

As an aside, for when there are many rows, I was thinking of adding Index(columnname string) DataFrame which would build an index of the values in that column to their row number, and if a user chose to build such an index up-front, anything that needed to search for a value (e.g., Filter) would make use of it to improve performance. That's still possible with the Filter-Elem-Float paradigm for looking up a value.
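A minimal sketch of the indexing idea, assuming a string column held as a plain slice. `buildIndex` is a hypothetical helper and the metric names are made up for the example; how such an index would be wired into Filter is left open, as in the comment above.

```go
package main

import "fmt"

// buildIndex maps each distinct value in a column to the row numbers where
// it occurs, so later lookups avoid a full scan of the column.
func buildIndex(col []string) map[string][]int {
	idx := make(map[string][]int, len(col))
	for row, v := range col {
		idx[v] = append(idx[v], row)
	}
	return idx
}

func main() {
	// Hypothetical "Metric" column values.
	metric := []string{"rq_total", "rq_active", "rq_total"}
	idx := buildIndex(metric)
	fmt.Println(idx["rq_active"], idx["rq_total"]) // [1] [0 2]
}
```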

chrmang · 2021-07-05T19:22:44Z

Thank you for your detailed explanation. Let me explain my opinion about coercion.

In many cases it can be a great thing. Especially when dealing with AI it is useful, because sooner or later all variables are float and a loss of one or two digits of precision is not a problem. In other use cases this would not be acceptable - think of financial services. And it can cause bugs like #154. That bug was in gota; other bugs can be in user code. It's all about the use case.
Go is designed as a typed language and we should use the benefits of compile-time type checking. You are right, gota is full of automatic coercion, but this is the preferred way in Python, R and JavaScript. What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?
In the future maybe I will find a way to replace interface{} with generics or generate or ... Nevertheless we should move forward and improve the library's usability.
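The precision concern can be made concrete with a small standalone example: float64 has a 53-bit significand, so large integers do not survive a round trip through float64. `roundTrips` is a hypothetical helper written for this illustration.

```go
package main

import "fmt"

// roundTrips reports whether n survives conversion to float64 and back.
func roundTrips(n int64) bool {
	return int64(float64(n)) == n
}

func main() {
	// 2^53 + 1 is the smallest positive integer float64 cannot represent.
	var cents int64 = 1<<53 + 1
	fmt.Println(cents)                 // 9007199254740993
	fmt.Println(int64(float64(cents))) // 9007199254740992
	fmt.Println(roundTrips(cents))     // false
}
```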

The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it.
Can you change your PR, please?

danielpcox · 2021-07-12T20:42:28Z

What benefit would people have if we write a 1:1 copy of pandas in Go? Speed? pandas is written in C with a Python wrapper - it's already fast. What else if not type safety?

Hmm... I admit, in my particular case, the only reason I'm using Go is because I have to. So at least one benefit would be that people don't have to leave their preferred language (which is great for other purposes) to manipulate tabular data in a readable way. The company I work for has services written in Go, and we need a compact way to express quite a large collection of high level operations on tables. I don't think idiomatic Go is the best language for doing that (partly because of static typing, but mostly because the for loop reigns supreme in Go, and because of inline error checking), but I can't rewrite someone's entire service just because I'd prefer to do the number crunching with Pandas. I was pleased to find gota because it bends a few of the laws that make Go especially painful for high-level data wrangling, and therefore finally made my code readable. It's not Go, it's a DSL, and that's my favorite thing about it.

The idea of your counter-proposal is good. You know my opinion about coercion, but let's have a try. Go for it.

Do you mean "go for it" as-written, or without any automatic int->float64 coercion? If the latter, perhaps I should also add a fourth method to DataFrame that explicitly coerces a column's types, so there isn't a big drop in readability when the types don't match.

danielpcox added 12 commits June 17, 2021 23:22

wip vector arithmetic

1260d4f

fixed Math

c9d53dc

use go modules

2ce8e1b

added tests for Math method

b885474

error tests and coerce ints with float func op

03d5231

update replace in go.mod to repo

7527397

adding FindElem to easily select a single value by labels

aa97592

adding tests for FindElem

2fed518

documentation

05485a1

removing replace directive from go.mod for PR

44283e0

Merge remote-tracking branch 'origin/dev' into math-method

c155292

adding Math and FindElem to changelog

0d7eb16

danielpcox changed the title ~~Math method~~ Vector arithmetic Jun 25, 2021

This was referenced Jun 26, 2021

Looking for contributors #78

Open

Vector arithmetic #152

Closed

chrmang requested changes Jun 27, 2021
