Extend Coopy Highlighter Diff format with column type changes #3

Open
danfowler opened this issue Jul 1, 2016 · 12 comments

Comments

@danfowler
Contributor

From @edwindj on February 25, 2015 14:10

A useful addition to the coopy highlighter diff format would be column type changes.

For example:

dataset a
A,B
1.1,1

and

dataset b
A,B,C
1,"1",2.1

The Coopy Diff is:

!,,+++
@@,A,B,C
->,1.1->1,1,2.1

A typed version of the format could be:

!,{number->integer},{integer->string},+++{number}
@@,A,B,C
->,1.1->1,1,2.1

Here the schema row can contain a column type change. IMHO type information should not be obligatory; an implementation should interpret it as a type suggestion, since types differ across programming languages. The types of JSON Table Schema seem like a good candidate for denoting common types.
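A minimal Python sketch of how a consumer might read such a typed schema row. The cell grammar used here (an optional `+++` or `---` marker followed by an optional `{old->new}` or `{type}` block) is an assumption inferred from the example above, not part of the published Coopy spec, and `parse_schema_cell` is a hypothetical helper name.

```python
import re

# Assumed grammar: optional "+++"/"---" marker, then optional "{...}" type block.
CELL = re.compile(r"^(\+\+\+|---)?(?:\{([^{}]+)\})?$")

def parse_schema_cell(cell):
    """Split one schema-row cell into marker, old type and new type."""
    match = CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised schema cell: {cell!r}")
    marker, types = match.group(1), match.group(2)
    if types is None:
        old = new = None                      # no type suggestion given
    elif "->" in types:
        old, new = (t.strip() for t in types.split("->", 1))
    else:
        old = new = types.strip()             # a single type, no change
    if marker == "+++":
        old = None                            # added column has no old type
    elif marker == "---":
        new = None                            # removed column has no new type
    return {"marker": marker, "old_type": old, "new_type": new}

row = "!,{number->integer},{integer->string},+++{number}".split(",")
parsed = [parse_schema_cell(cell) for cell in row[1:]]
for cell in parsed:
    print(cell)
```

Since the types are only suggestions, a consumer that does not understand a type name could simply carry the string along unchanged.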

Copied from original issue: frictionlessdata/datapackage#164

@danfowler
Contributor Author

From @paulfitz on February 26, 2015 22:46

Thanks @edwindj. There's also work on refining the types in JSON Table Schema in #159.

What do you think about leaving types in a separate optional row, like:

!,,,+++
@type,number->integer,integer->string,number
@@,A,B,C
->,1.1->1,1,2.1

I'm thinking that the spec could leave space for meta data associated with columns via a series of @foo lines (say @type, @precision, @special_stuff_for_R etc). Conforming consumers of diffs can ignore all that stuff, or try to use it. Conforming producers of diffs can add some of that stuff, or none of it.

The advantage of the separate rows is that the cells can behave exactly as in ordinary rows and be parsed in just the same way.
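As a rough Python sketch of that idea: a consumer can peel off any `@name` rows with the same CSV parser it uses for every other row, and then either use them or drop them. The function name `split_diff` and the rule "any tag starting with `@` other than `@@` is metadata" are assumptions made for illustration, not something the spec defines.

```python
import csv
import io

def split_diff(text):
    """Separate '@name' metadata rows from the ordinary rows of a diff."""
    metadata = {}
    body = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        tag = row[0]
        # "@@" is the existing header row; any other "@name" row is
        # treated as per-column metadata (assumption for this sketch).
        if tag.startswith("@") and tag != "@@":
            metadata[tag[1:]] = row[1:]
        else:
            body.append(row)
    return metadata, body

DIFF = """!,,,+++
@type,number->integer,integer->string,number
@@,A,B,C
->,1.1->1,1,2.1
"""

metadata, body = split_diff(DIFF)
header = next(row for row in body if row[0] == "@@")[1:]
types = dict(zip(header, metadata.get("type", [])))
print(types)
```

A consumer that does not care about types can use `body` alone and never look at `metadata`.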

@danfowler
Contributor Author

From @paulfitz on February 26, 2015 23:9

Also, I understand from edwindj/daff#6 that you like having a single file for expressing diffs, and that may be the way to go. But just as the Tabular Data Package spec proposes data in csv and schema in json, there may be something to be said for expressing schema differences in a hierarchical format like json rather than trying to flatten types out.

@danfowler
Contributor Author

From @edwindj on February 27, 2015 7:8

@paulfitz I like your syntax for extra lines that may be ignored by consumers.
Maybe we can add this to the spec of Coopy Highlighter Diff in any case.

Regarding type changes in one file or two: should we follow the diff paradigm of storing all changes in one text, or should we follow the JSON Table Schema paradigm of describing metadata (changes) in a json file? The latter option would force all users to use JSON Table Schema, which I find too strict. Maybe we should support both, with a preference for JSON Table Schema: when a schema is available it should be used, otherwise a less expressive form can be used with the @type syntax.

Note that a solution in the spirit of datapackage probably would not calculate a diff, but just reference two resources: table remote and table local.

@danfowler
Contributor Author

From @paulfitz on February 28, 2015 4:0

I agree it would make sense to stick the new syntax in. I could also take a shot at adding support for it in daff. What I'd do is just ask the source of the tables if there's any meta-data, diff that, and pass it along. For patching, I'm not 100% clear what would happen, but basically daff should tell you what meta-data changes happened and let you take action based on them.

This feature should make diffs more useful within an environment with a single kind of data source, even if it wouldn't be very useful for interchange between different kinds of data sources.

@danfowler
Contributor Author

From @edwindj on March 1, 2015 9:59

Great! I will follow your changes and implement them in daff for R.

@danfowler
Contributor Author

From @rgrp on May 26, 2015 16:8

@paulfitz should this remain open - are there pending changes? Otherwise let's close with summary.

@danfowler
Contributor Author

From @paulfitz on May 31, 2015 3:56

@rgrp can we keep it open a while longer? I've been plugging away on this, close to maturing.

@danfowler
Contributor Author

From @rgrp on May 31, 2015 8:4

@paulfitz fantastic!

@danfowler
Contributor Author

From @edwindj on May 31, 2015 10:3

@paulfitz Great!

@danfowler
Contributor Author

From @paulfitz on October 10, 2015 15:50

I implemented a version of this some time back, and then got distracted working on a demo for it with sqlite. Suppose we have a birds table as follows:

# schema: id INTEGER PRIMARY KEY, name TEXT, count TEXT
id,name,count
-------------
1,robin,251
2,eagle,10
3,pigeon,140

And we modify the type of a column, add another column, and add a row:

# schema: id INTEGER PRIMARY KEY, name TEXT, count INTEGER, weather TEXT
id,name,count,weather
---------------------
1,robin,251,warm
2,eagle,10,
3,pigeon,140,
4,penguin,5,cold

Then daff would report this diff:

[image: sqlite_diff]

To use this in R, you'd need to implement some code that reports the properties of each column that you care about. That is sufficient for diffing. For patching, you'd need to be able to accept a description of the changes in a particular format and make them happen. I'll need to document this better if you're still interested in pursuing this @edwindj.
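To make the "report the properties of each column" step concrete, here is a small Python sketch against SQLite using only the standard `sqlite3` module. It reads the declared column types of two tables with `PRAGMA table_info` and builds an `@type`-style row from the differences. The table names `birds_old` and `birds_new`, the helper names, and the exact cell layout (empty cell for an unchanged type, bare type for an added column) are assumptions for illustration; this is not daff's actual interface or output.

```python
import sqlite3

def column_types(conn, table):
    """Map column name -> declared type for one table.

    PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    The table name is interpolated directly, so it must be trusted.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {name: decl_type for _cid, name, decl_type, *_rest in rows}

def type_row(old, new):
    """Build a hypothetical '@type' row describing column type changes."""
    cells = ["@type"]
    names = list(old) + [name for name in new if name not in old]
    for name in names:
        before, after = old.get(name), new.get(name)
        if before == after:
            cells.append("")                    # type unchanged
        elif before is None:
            cells.append(after)                 # column added
        elif after is None:
            cells.append(before)                # column removed
        else:
            cells.append(f"{before}->{after}")  # type changed
    return cells

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE birds_old (id INTEGER PRIMARY KEY, name TEXT, count TEXT)"
)
conn.execute(
    "CREATE TABLE birds_new "
    "(id INTEGER PRIMARY KEY, name TEXT, count INTEGER, weather TEXT)"
)

row = type_row(column_types(conn, "birds_old"), column_types(conn, "birds_new"))
print(",".join(row))
```

The same shape should carry over to other sources: only `column_types` is specific to SQLite, so an R data frame or a CSV with a schema file would need its own version of that one function.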

@danfowler
Contributor Author

From @edwindj on October 11, 2015 11:56

@paulfitz, I'm still interested :-). Documentation would help, but I will update my R code so this example works. It won't be until the end of this week.

@danfowler
Contributor Author

From @rgrp on March 7, 2016 18:59

@edwindj @paulfitz can this be closed?
