
handling for arbitrary derive cycles #46

Open · ahl opened this issue May 13, 2022 · 4 comments

ahl commented May 13, 2022

JSON Schema can define types that have cycles. This is simple to handle in languages like JavaScript or Java, but more complex in Rust, since those cycles must be explicitly broken with a Box<T>. Note that use of a Vec or HashMap also breaks the containment cycle, but has implications for derive computation that we will discuss later. Note too that there may be multiple valid choices of where to "break" a cycle, some requiring more breaks than others. Note also that it may not be feasible to reconstruct the original types, e.g. if the JSON Schema were derived from Rust types, because the information about Box indirections is explicitly discarded (probably reasonably so, but one could imagine including hints; more on that later as well).

Currently we break trivial A -> A cycles such as:

struct A {
    a: Option<Box<A>>, // this needs to be boxed
}

We can do this without any graph traversal, and it solved the proximate problem.

The more general case requires us to decompose the type graph into strongly connected components that form a DAG (e.g. with the algorithms of Tarjan, Dijkstra, or Kosaraju). In this case, the edges are defined by struct or newtype containment, either direct or via an Option type. Within each strongly connected component we would then determine where to "break" the cycles by inserting Boxes. Solving this optimally (this is the minimum feedback arc set problem) requires exponential time in the general case. While the number of nodes (types) in a cycle is likely to be small, we may still elect for a heuristic, the simplest of which is to cut all edges within the component. There's very little harm in cutting more than is absolutely required: serialization isn't affected, for example; the only consequence is to the legibility and ergonomics of the generated types.
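As a rough sketch of that cut-everything heuristic (not what typify does today; this assumes the petgraph crate, with nodes as types and edges as containment):

use std::collections::HashSet;

use petgraph::algo::tarjan_scc;
use petgraph::graph::{DiGraph, NodeIndex};

// Nodes are types; edges are containment (direct or via Option).
// Returns the set of edges to replace with a Box<T> indirection.
fn edges_to_box(graph: &DiGraph<&str, ()>) -> HashSet<(NodeIndex, NodeIndex)> {
    let mut boxed = HashSet::new();
    for scc in tarjan_scc(graph) {
        let members: HashSet<_> = scc.iter().copied().collect();
        for &node in &scc {
            for next in graph.neighbors(node) {
                // Box every edge that stays inside the component. This may
                // box more than strictly necessary, but it is guaranteed to
                // break every containment cycle (including A -> A self-loops).
                if members.contains(&next) {
                    boxed.insert((node, next));
                }
            }
        }
    }
    boxed
}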

For JSON Schema generated from Rust types, it could be helpful to annotate boxed types with an extension. This could act as a heuristic when slicing a strongly connected component, i.e. we check whether those extensions properly break the containment cycle and do something else if they don't.


The derive macros we apply to types have a similar problem. Consider, for example, the following type:

struct A {
    value: u32,
}

For this struct we could #[derive(Eq, PartialEq)], but if we change the u32 to an f32 we could not! A Vec<T> is Eq only if T: Eq, and a HashSet<T> isn't Ord regardless of the traits T implements.

From the list of desirable traits to implement, such as Hash, Ord, and Eq, the ones we can apply to a type depend on the types to which it refers, and those references may form a cycle. As above, we must compute the strongly connected components. Above, the edges were containment; here, the edges are all references (i.e. a Vec is an edge here but not above). Within each strongly connected component we must take the intersection of all supportable traits.
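A sketch of that per-component intersection, under the same petgraph assumption (the local map, recording which traits each type's own fields allow, is an assumed input):

use std::collections::{BTreeSet, HashMap};

use petgraph::algo::tarjan_scc;
use petgraph::graph::{DiGraph, NodeIndex};

type Traits = BTreeSet<&'static str>; // e.g. {"Eq", "Ord", "Hash"}

// `local` holds the traits each type could derive looking only at its
// "known" (non-node) fields; e.g. a struct holding an f32 loses "Eq".
fn scc_trait_intersection(
    graph: &DiGraph<&str, ()>,
    local: &HashMap<NodeIndex, Traits>,
) -> HashMap<NodeIndex, Traits> {
    let mut result = HashMap::new();
    // tarjan_scc returns components in reverse topological order, so each
    // component only refers to components that have already been resolved.
    for scc in tarjan_scc(graph) {
        // Intersect the locally supportable traits of every member...
        let mut traits: Traits = scc
            .iter()
            .map(|n| local[n].clone())
            .reduce(|a, b| &a & &b)
            .unwrap_or_default();
        // ...and intersect with everything the component points at.
        for &node in &scc {
            for succ in graph.neighbors(node) {
                if let Some(t) = result.get(&succ) {
                    traits = &traits & t;
                }
            }
        }
        for node in scc {
            result.insert(node, traits.clone());
        }
    }
    result
}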

justinmmott commented

Hey Adam, I was giving this some thought and I think I've come up with a solution that would run in O(n²). I thought about it slightly differently: if we have a DAG then we don't need any boxes, so I was thinking about how we can turn a directed graph into a DAG by removing nodes. This led me to the following solution:

Let's say we have a graph like this with some SCC. (taken from the example on Wikipedia)
[image: the example directed graph, with strongly connected components, from Wikipedia]

Now let's construct a BFS tree. WLOG I assumed the nodes were alphabetically ordered.
[image: BFS tree of the graph]

If we subtract all of these edges from the original graph we get this:
[image: back-edge graph: the original graph minus the BFS tree edges]

This is a graph of all of the back-edges, i.e. the edges that are creating our cycles. We want this graph to have no edges in order for our original graph to be a DAG. So, we can just repeatedly remove the node that has the most edges (here I'm counting both incoming and outgoing edges). This is where the O(n²) factor comes from: we need to keep a max Fibonacci heap, and in the worst case we could be decreasing the priority of every key in the heap, which in a Fibonacci heap would take O(n) operations. So, in the example we would first remove F, which would look something like this:
[image: back-edge graph after removing F]

Then D, A, and C. We can think of each of these removals as boxing that type.

Now that there are no edges in the back-edge graph, we know that our original graph is a DAG, so no further work is needed. Our final types would look something like this (with a rough code sketch of the whole procedure after the structs):

struct A {
    b: B,
}

struct B {
    e: E,
    f: Box<F>,
}

struct C {
    d: Box<D>,
    g: G,
}

struct D {
    c: Box<C>,
    h: H,
}

struct E {
    a: Box<A>,
    f: Box<F>,
}

struct F {
    g: G,
}

struct G {
    f: Box<F>,
}

ahl commented Oct 11, 2022

Interesting. Would H look like this?

struct H {
    d: Box<D>,
    g: G,
}

When we "remove" a node, it sounds like that type would always then be referred to via Box<T>. Would it be viable to turn all of those back-edges into Boxes?

In addition to "containment cycles", I'm also interested in dependency cycles so that we can use a more expansive set of derive macros. Consider, say, that a struct containing an f64 can't have the Eq derive macro applied. So in the example above, if A contained an f32 then A, B, and E would not be able to derive Eq, but other types might. Vaguely, I had assumed that a proper solution to this problem would involve identifying SCCs and doing some sort of topological sort.

It may be that these are different enough problems that we want to address containment and derives distinctly.


justinmmott commented Oct 11, 2022

Interesting. Would H look like this?

struct H {
    d: Box<D>,
    g: G,
}

Oops, I forgot to include H, but yes, that is what H would look like.

When we "remove" a node, it sounds like that type would always then be referred to via Box<T>.

Yes, by "removing" a node we are making all references to its type a Box<T> rather than T.

Would it be viable to turn all of those back-edges into Boxes?

That's the current idea with "removing" nodes until there are no back-edges left: by changing all references to a node's type into a Box we remove its back-edges. So, by removing the nodes that have the most edges, we can remove all of the back-edges while transforming the minimal number of types into a Box<T>.

Edit: I realized I might be misunderstanding your question. The solution I've described optimizes for "removing" the minimal number of nodes. I thought that this would be nice because it would make things easier to implement. However, you could be asking about the fact that, since we are changing every reference to this type to a Box<T> rather than a T, we are not optimizing for the minimal number of Boxes. My current thinking is that we would require one Box per back-edge, so maybe we could just do that rather than trying to optimize for the minimal node "removal"; a sketch of that per-edge approach is below.
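Something like this, maybe (a hypothetical sketch; it uses a DFS rather than the BFS tree above, since the back-edges of a DFS are exactly the edges that close cycles, and it assumes every node appears as a key in the graph):

use std::collections::{HashMap, HashSet};

type Graph = HashMap<char, Vec<char>>;

enum State { OnStack, Done }

// Classify edges with a DFS: an edge pointing at a node still on the
// DFS stack closes a cycle, so that particular reference gets boxed.
fn back_edges(graph: &Graph) -> HashSet<(char, char)> {
    fn dfs(
        node: char,
        graph: &Graph,
        state: &mut HashMap<char, State>,
        boxed: &mut HashSet<(char, char)>,
    ) {
        state.insert(node, State::OnStack);
        for &next in &graph[&node] {
            match state.get(&next) {
                Some(State::OnStack) => {
                    boxed.insert((node, next)); // this edge closes a cycle
                }
                Some(State::Done) => {} // already fully explored
                None => dfs(next, graph, state, boxed),
            }
        }
        state.insert(node, State::Done);
    }

    let mut state = HashMap::new();
    let mut boxed = HashSet::new();
    for &node in graph.keys() {
        if !state.contains_key(&node) {
            dfs(node, graph, &mut state, &mut boxed);
        }
    }
    boxed
}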

In addition to "containment cycles", I'm also interested in dependency cycles so that we can use a more expansive set of derive macros. Consider, say, that a struct containing an f64 can't have the Eq derive macro applied. So in the example above, if A contained an f32 then A, B, and E would not be able to derive Eq, but other types might. Vaguely, I had assumed that a proper solution to this problem would involve identifying SCCs and doing some sort of topological sort.

I think I'm confused about how having a type reference cycle causes an issue. I'll go over my current understanding of the problem and a proposed solution. Hopefully that will help you find where I am missing some understanding of why these cycles could cause an issue.

Problem statement: We want to find the maximal set of desired traits that we can implement using the derive macro, on a per-type basis.

Proposed solution: Start by assuming that every type can implement all of our desired traits: Eq, Ord, etc. Construct a graph exactly like the one we made for the "containment cycle" problem. Then, on a per-node basis, we examine all of its type references. I'm not really sure of the terminology here, but we only want to look at the types that aren't also nodes themselves, which I'm going to refer to as "known" types from here on out. So, for example, if we had the following structs:

struct A {
    b: B,
}
struct B {
    inner: f64
}

We wouldn't look at A.b but only B.inner, as A.b is a reference to another node whereas B.inner is a "known" type. These "known" types won't always be primitive types, as in the case of a Vec or a HashSet, which is why I'm unsure of the terminology.

If any of these "known" types is restrictive in any way, then we want to apply that restriction to the current type. In the example above, we would say that B could not derive the Eq trait. The next step would be to DFS/BFS out from this node on the transpose graph and apply the same restriction to all nodes we hit. (The transpose graph is the graph we get by flipping the direction of every edge in a directed graph.) So, in the example from the previous discussion, if A had an f64 we would traverse out and hit E and then B, and then the traversal would end. So, we would also remove the Eq trait from those nodes. If we do this for every node, we should only be removing desirable traits from nodes that have a direct or nested reference to some restrictive "known" type.

We should be able to look only at the "known" types because we are doing this for every node: if there is a path C -> B -> A, then when we are at node A we will traverse to node C on the transpose graph. So, we need not worry about the reference to type A while we are processing node B.

In the worst case, this would require a traversal from every node that reaches every other node, i.e. a graph where all nodes are contained in a single SCC and every node has a restrictive "known" type. This would have a runtime of O(n²).
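A sketch of that propagation (hypothetical char node labels; assumes every node appears as a key in the graph):

use std::collections::{HashMap, HashSet};

type Graph = HashMap<char, Vec<char>>;

// When a type's "known" fields rule out a trait, walk the transpose
// graph and strip that trait from everything that can reach the type.
fn propagate_restrictions(
    graph: &Graph,
    // Traits each node locally cannot support, e.g. {'B': {"Eq"}} for a
    // struct B holding an f64.
    restricted: &HashMap<char, HashSet<&'static str>>,
) -> HashMap<char, HashSet<&'static str>> {
    // Build the transpose graph by flipping every edge.
    let mut transpose: Graph = graph.keys().map(|&k| (k, vec![])).collect();
    for (&u, vs) in graph {
        for &v in vs {
            transpose.entry(v).or_default().push(u);
        }
    }

    let mut out: HashMap<char, HashSet<&'static str>> = HashMap::new();
    for (&start, traits) in restricted {
        // DFS on the transpose graph from each locally restricted node,
        // applying its restrictions to everything we hit.
        let mut stack = vec![start];
        let mut seen = HashSet::from([start]);
        while let Some(node) = stack.pop() {
            out.entry(node).or_default().extend(traits.iter().copied());
            for &prev in &transpose[&node] {
                if seen.insert(prev) {
                    stack.push(prev);
                }
            }
        }
    }
    out
}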

ahl changed the title from "handling for arbitrary containment and derive cycles" to "handling for arbitrary derive cycles" on May 18, 2023

ahl commented May 18, 2023

Note that #300 addressed containment cycles; there is still more work to be done in terms of properly selecting which derive macros to apply.
