Handle class #6

Open
eusebe opened this issue Feb 24, 2015 · 6 comments

@eusebe

eusebe commented Feb 24, 2015

Hi,

This is not really a bug, but a feature request: is it possible to detect class changes?

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> class(x[, 1])
[1] "factor"
> class(x[, 2])
[1] "character"
> class(x[, 3])
[1] "numeric"
> render_diff(diff_data(y, x))
@@ Sepal.Length Sepal.Width ...

Thanks!
David

@edwindj
Owner

edwindj commented Feb 24, 2015

Hi David,

Very good question! The answer is yes and no.

The object returned by diff_data knows about the class changes, and patch_data uses them:

> y <- iris[1:3,]
> x <- y
> x$Sepal.Width <- as.character(x$Sepal.Width)
> x$Sepal.Length <- as.factor(x$Sepal.Length)
> str(patch_data(y, diff_data(y, x)))
'data.frame':   3 obs. of  5 variables:
 $ Sepal.Length: Factor w/ 3 levels "4.7","4.9","5.1": 3 2 1
 $ Sepal.Width : chr  "3.5" "3" "3.2"
 $ Petal.Length: num  1.4 1.4 1.3
 $ Petal.Width : num  0.2 0.2 0.2
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1

However, the Coopy Diff Format does not know about type changes, so storing a diff and loading it again won't give the correct patch:

x <- y <- iris[1:3,]
x$Sepal.Width <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)
write_diff(diff_data(y, x), file = "diff.csv")
d_yx <- read_diff("diff.csv")
str(patch_data(y, d_yx))

The author of the Coopy Diff Format welcomes the addition of type information, which is a good thing. Note, however, that this probably will not handle factor/character switches, since these are very R-specific.
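
The loss of class information on a round trip is a property of plain-text tables in general. A minimal sketch of the effect, assuming base R's write.csv and read.csv as stand-ins (this is not what write_diff and read_diff do internally):

```r
# Build the same modified data frame as in the example above.
y <- iris[1:3, ]
x <- y
x$Sepal.Width <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

# Round-trip x through a plain-text CSV file (stand-in for a stored diff).
tmp <- tempfile(fileext = ".csv")
write.csv(x, tmp, row.names = FALSE)
x_back <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

# Record the first class of every column before and after the round trip.
first_class <- function(col) class(col)[1]
classes_before <- vapply(x, first_class, character(1))
classes_after <- vapply(x_back, first_class, character(1))

print(rbind(before = classes_before, after = classes_after))
```

The text file only carries cell values, so the reader has to guess the classes again when loading.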

@edwindj
Owner

edwindj commented Feb 24, 2015

I will try to add class changes to the diff format, but this will not land in the master branch before April (due to time constraints).
Current thoughts:

  • extend the Coopy highlighter diff format with type info
  • have an R-specific storage format: factors, for example, can be quite nasty (e.g. a different order of levels). diff_data and patch_data take this into account, but it is not stored in the diff format.

@paulfitz

One way to do this with the existing diff format would be to generate two diffs: the existing one comparing content, and a second one that compares metadata. So for a table like this:

Sepia.Length Sepia.Width ...
10.2 55.3 ...
9.1 1.2 ...

There would also be another (imaginary) table of type information like this:

Name TypeNuance1 TypeNuance2 ...
Sepia.Length ... ... ...
Sepia.Width ... ... ...
... ... ... ...

Where the columns are a flattened version of whatever parameters it takes to fully describe types in R. Diffs of the second table would then give a clear record of type changes.
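
A rough sketch of what such a type table could look like in R. The helper name type_table and the choice of two properties (Class and Storage) are assumptions made for illustration; a complete description of R types would need more columns.

```r
# Hypothetical helper: one row per column of df, one column per type property.
type_table <- function(df) {
  data.frame(
    Name = names(df),
    Class = unname(vapply(df, function(col) paste(class(col), collapse = "/"),
                          character(1))),
    Storage = unname(vapply(df, typeof, character(1))),
    stringsAsFactors = FALSE
  )
}

y <- iris[1:3, ]
x <- y
x$Sepal.Width <- as.character(x$Sepal.Width)
x$Sepal.Length <- as.factor(x$Sepal.Length)

meta_y <- type_table(y)
meta_x <- type_table(x)

# Rows of the type table that differ between y and x.
changed <- meta_y$Class != meta_x$Class | meta_y$Storage != meta_x$Storage
print(data.frame(
  Name = meta_y$Name[changed],
  OldClass = meta_y$Class[changed],
  NewClass = meta_x$Class[changed]
))
```

Since meta_y and meta_x are ordinary data frames, they could in principle be compared with diff_data like any other pair of tables.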

@edwindj
Owner

edwindj commented Feb 25, 2015

@paulfitz I like your idea of creating a metadiff, which is very flexible.

However I see the following issues:

  • Instead of one diff file we end up with two diff files. I think it is desirable to store the diff in one textual form.
  • R has so-called factor columns, which use a fixed, ordered set of values (called levels). Each factor column has its own levels. In R it is desirable to track changes in this set of values,
    e.g. c("female", "male") -> c("male", "female") or c("female", "male") -> c("total", "female", "male"). I'm not sure what the best way is to encode this in a metadiff (one column or many columns?).
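
To make the two kinds of level change concrete, here is a minimal sketch in base R. The helper compare_levels is hypothetical and only separates added, removed and reordered levels; it is not part of daff.

```r
# Hypothetical helper: describe how a factor's levels changed.
compare_levels <- function(old, new) {
  list(
    added = setdiff(new, old),
    removed = setdiff(old, new),
    # Same set of levels, but in a different order.
    reordered = setequal(old, new) && !identical(old, new)
  )
}

# The two examples from the list above.
reorder <- compare_levels(c("female", "male"), c("male", "female"))
extend <- compare_levels(c("female", "male"), c("total", "female", "male"))

str(reorder)
str(extend)
```

A metadiff would need to record at least these three pieces of information per factor column, which is one argument for using several columns rather than one.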

Maybe we should have both: simple type information encoded in the Coopy highlighter diff format, and extended type information in a metadiff.

Shall we move this discussion to http://dataprotocols.org/tabular-diff-format ?

@gwarnes-mdsol
Collaborator

Perhaps it would be easiest to implement a second function ('diff_meta'?) to handle metadata changes.

A simplistic way to handle (changes in) factor levels is to concatenate them into a single string with some separator, e.g.:

set.seed(42)
f <- factor(sample(letters[1:5], 20, replace=TRUE))
f.levels <- paste( dQuote(levels(f)), collapse=", ")
f.levels
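
If the concatenated string also has to be read back, a plain separator is easier to parse than dQuote output, which may use locale-dependent curly quotes. A sketch of the round trip, assuming "|" as the separator (any string that occurs in no level would do):

```r
set.seed(42)
f <- factor(sample(letters[1:5], 20, replace = TRUE))

sep <- "|"  # assumed separator; it must not occur inside any level
stopifnot(!any(grepl(sep, levels(f), fixed = TRUE)))

# Encode the levels as one string, then decode them again.
encoded <- paste(levels(f), collapse = sep)
decoded <- strsplit(encoded, sep, fixed = TRUE)[[1]]

# Use the decoded levels to rebuild a factor with a different level order.
g <- factor(as.character(f), levels = rev(decoded))

print(encoded)
print(levels(g))
```

The values of g are the same as those of f; only the order of the levels differs, which is exactly the kind of change a content diff would not show.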

@paulfitz

My overall feeling is that it is hard to make meta stuff work cleanly enough for it to actually save people time for real.

That said, some basic support for column metadata diffing did end up being added to daff. For example, when comparing sqlite databases:
[image: sqldiff]
Basically, a data table can have a meta-table where cells specify properties of the columns of the data table. For sqlite databases I added just a single "type" property. When showing diffs, the meta-table is diffed just like regular tables, then prepended to the data diff with @ decoration to distinguish it from data.
