-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Canonicalization for cell arrays #552
base: master
Are you sure you want to change the base?
Conversation
if (a < b) return -1; | ||
if (a > b) return +1; | ||
return 0; | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: closing }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really exciting! Made some suggestions, mostly naming
2. canonical | ||
3. compacted and canonical | ||
|
||
#### Low52 ordered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Not keen on this name. Maybe "cell digit ordered" or similar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to avoid just using "ordered", since there are potentially multiple different orderings someone might do, like the standard uint64_t
ordering, for example.
What don't you like about lower 52? :) I liked that it was a very distinct name, so it was immediately clear to a reader that you're talking about a specific concept. It's also easier to code search. I'm worried that "cell digit ordered" is a bit too generic of a name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer the actual ordering to be opaque to the end user - the idea is:
- We have a special, somewhat opaque format for sets of H3 indexes, and if you use it, you get access to these set functions.
- We have 3 levels of canonicalization, L1, L2, L3. Each one is more expensive to apply than the last, but the subsequent runtime of the set functions is faster.
Beyond that, the user shouldn't care. Calling this "Low52" (very concrete), "Canonical" (completely opaque), and "Compacted Canonical" (partly concrete, partly opaque) just seems to invite confusion for the user about what they should use - the names here are about the implementation, not the end use. Treating all of the formats as opaque helps to ensure that they are used only with appropriate functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm convinced to keep things opaque. And now I'm considering getting rid of "L1", as I don't see any use cases. L1 was more of a by-product of me figuring out how to get things working.
As far as the algorithms are concerned, they won't care if a set is compacted or not as long as it is canonical. Because of that, I might also avoid a separate type for "L3" and just leave it to the user to keep track of whether a set is compacted. (I also don't think there's an obvious/easy test for if a set is compacted; you basically just have to run through the compact logic again and check that there are no changes.)
With that in mind, what would you think of something like this (modulo names):
typedef struct {
H3Index *cells;
int64_t numCells;
} CellArray;
bool isCanonicalSet(CellArray A);
H3Error toCanonicalSet(CellArray *A);
// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error canonicalCompact(CellArray *A);
// functions below work on canonical; are faster on canonical compacted.
bool setContains(CellArray A, H3Index h);
bool doSetsIntersect(CellArray A, CellArray B);
bool isSubset(CellArray A, CellArray B);
H3Error setIntersection(CellArray A, CellArray B, CellArray *C);
H3Error setUnion(CellArray A, CellArray B, CellArray *C);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those look good to me! For clarity, the set functions only work if the set is canonical, right? I'd love some way to enforce this at a type level, e.g. instead of CellArray
call it CanonicalSet
, then take the args H3Index *cells, int64_t numCells
for the functions that don't have this requirement (isCanonicalSet
, toCanonicalSet
, canonicalCompact
-- which BTW I'd call toCompactCanonicalSet
). That way the user has to either pass their array through toCanonicalSet
or toCompactCanonicalSet
in order to use the set functions, or at least they need to explicitly create a CanonicalSet
themselves, affirming that their input is canonical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you'd have:
typedef struct {
H3Index *cells;
int64_t numCells;
} CanonicalSet;
bool isCanonicalSet(H3Index *cells, int64_t numCells);
H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);
// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error toCompactCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);
// functions below work on canonical; are faster on canonical compacted.
bool setContains(CanonicalSet A, H3Index h);
bool doSetsIntersect(CanonicalSet A, CanonicalSet B);
bool isSubset(CanonicalSet A, CanonicalSet B);
H3Error setIntersection(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
H3Error setUnion(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
A couple of questions here:
- If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.
- In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like
maxSetIntersectionSize
(size of the larger set) andmaxSetUnionSize
(sum of the set sizes) to help callers allocate memory for theout
set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those look good to me! For clarity, the set functions only work if the set is canonical, right?
Yes. We could write up versions that work on sorted but not canonical sets, but I don't think it is worth it. They're mostly the same; the non-canonical but sorted sets just introduce a few extra annoying edge cases you have to consider.
H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);
I was thinking about this too. My hesitation (and why I originally wrote it as H3Error toCanonicalSet(CellArray *A)
) was that out
would point to the same memory as cells
(since the operation is in-place). Maybe not a big deal, but I worry about issues that come up with multiple references to the same memory, like double calls to free
.
- If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.
A set of cells is different from its compact representation (that compact representation could be uncompacted to multiple different resolutions, for example). And if users want to uniquely identify a set of cells with a hash, I think we still want to provide them with a way to get a canonical representation of any set of cells.
I'm imagining situations where uncompacted sets of cells are efficiently stored as a tuple (compacted set id, uncompact resolution)
, and we'd want a hash that distinguishes the compacted and uncompacted sets.
- In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like
maxSetIntersectionSize
(size of the larger set) andmaxSetUnionSize
(sum of the set sizes) to help callers allocate memory for theout
set
Agreed. But it is actually the sum of the set sizes in both cases, and I was thinking that was easy enough to remember for now. But you're probably right in that we should provide functions so users don't need to know that.
And it's the sum, even for intersection, because things get weird when you start working with compact canonical sets.
For example, the intersection of the first two sets (in the sense we're talking about) here is the third:
(Maybe actually, the worst-case bound is the sum of the set sizes minus 2?)
And it might be possible to have a slightly expensive function that computes the exact intersection size so you could allocate the exact amount of space needed, but that would result in basically running the intersection algorithm twice (I think). But maybe that's worth it?
We can check this property by ensuring that | ||
|
||
```c | ||
cmpCanon(a[i-1], a[i]) == -2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, naming - maybe just "cmpOrdered" and "cmpOrderedSet"?
Either way, I'd prefer Canonical
to Canon
(I think of a "canon" in this context as being a set of things, e.g "the Shakespeare canon" is the set of recognized works, each of which is canonical)
|
||
typedef struct { | ||
H3Index *cells; | ||
int64_t N; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numCells
?
- `cmpCanon(a, b) == -1` if `a` is a child (or further descendant) of `b` | ||
- `cmpCanon(a, b) == +1` if `b` ... `a` | ||
- `cmpCanon(a, b) == -2` if `a` < `b` in the low52 ordering, but they are not related | ||
- `cmpCanon(a, b) == +2` if `b` < `a` ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for this?
int isLow52Sorted(const H3Index *cells, const int64_t N); | ||
H3Error low52Sort(H3Index *cells, const int64_t N); | ||
|
||
int isCanonicalCells(const H3Index *cells, const int64_t N); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. Maybe isCanonicalSet
or isOrderedSet
?
Basically I want singular, type-style names (and maybe type aliases) for the different kinds of arrays, e.g. OrderedList
, OrderedSet
, CompactOrderedSet
. Maybe these should even be structs with a "type" enum specifying which one they are, so that set operations can work on all of them equivalently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea! I think things like OrderedList
, OrderedSet
, CompactOrderedSet
would definitely help to make things clearer.
But that starts to suggest changes to the API to start using these structs instead of our usual pointer and number of elements. Would we want to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I was thinking of using "canonical" instead of "set", because they're slightly different. You can have a distinct set of cells that includes parent/child cells, but canonical forbids that. So wanted a concept that was very obvious to a reader as distinct from the usual notion of "set".
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like Low52List
, CanonicalSet
? I'm not yet sure if we need a separate type for compact and canonical tho...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above. I think having these new functions (and, for the moment, only these new functions) use the structs as semi-opaque objects would be valuable, and I would choose names that indicate the use cases and tradeoffs the user incurs, rather than the implementation that leads to these tradeoffs.
No description provided.