Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated naming conventions #84

Open
wants to merge 64 commits into
base: main
Choose a base branch
from
Open

Updated naming conventions #84

wants to merge 64 commits into from

Conversation

bcpierce00
Copy link
Collaborator

No description provided.

@bcpierce00
Copy link
Collaborator Author

bcpierce00 commented Sep 1, 2024

This PR includes the original naming-conventions PR plus a bunch of extra tidying, polishing, and reworking in light of discussions. I believe everything is now consistent and the conventions are working pretty well. It is ready to merge IMHO.

@bcpierce00
Copy link
Collaborator Author

@cp526 We are waiting for your confirmation before merging this, whenever you get back to work...

@cp526
Copy link
Collaborator

cp526 commented Sep 5, 2024

That looks reasonable, but – without wanting to drag this revision out more – one point I'm not completely convinced about is requiring CN identifiers to be uppercase.

I see there's maybe some value in syntactically distinguishing C and CN variables for emphasising that C variables refer to (pure snapshots of) mutable objects whereas CN variables are immutable, but on the other hand CN specifications should (as much as reasonable) look natural to C programmers, who may want to follow the same naming conventions in their CN development as in their C code (possibly with project/organisation-specific naming conventions for identifiers).

A second concern is entangled with specification language design. Rust, Haskell, and OCaml all require lowercase variable names and uppercase constructor names. A benefit is that variables and constructors are syntactically distinct, so when the user misspells a constructor name in a pattern-match expression the compiler can detect this and not misunderstand it as a catch-all variable pattern (which would be a difficult-to-diagnose-bug). CN is currently fine in that regard even without an enforced restriction on the capitalisation of variable and constructor names, because CN constructors always take a parenthesis-enclosed list of arguments (Constr { bt1 arg1, ... btn argn }) using a (somewhat heavy-weight) record-like notation, which applies even when the constructor has no arguments. If we wanted to generalise CN datatype constructors so they can take arbitrary arguments, or no arguments, then we would need a different mechanism for syntactically distinguishing variables from constructors, and the capitalisation route Rust, Haskell, and OCaml take would be the obvious one. (But currently that's only a hypothetical problem.)

Either way, ignoring variables names, I see no benefit in requiring CN identifiers for types, record field names, or constructor argument names to be capitalised (though I may have missed some discussion about this), but would think that lower case names for these would better fit typical C code.

@bcpierce00
Copy link
Collaborator Author

bcpierce00 commented Sep 5, 2024 via email

@PeterSewell
Copy link
Contributor

PeterSewell commented Sep 5, 2024 via email

@thatplguy
Copy link
Collaborator

What's the next step here? Continue with this PR as is, and try adopting the naming scheme @bcpierce00 proposed? Or is there an alternate proposal?

I expect we'll make more changes to the syntax over the next year, so there will be future opportunities to change the naming scheme as well, as painful as that is.

@bcpierce00
Copy link
Collaborator Author

bcpierce00 commented Sep 6, 2024 via email

@@ -0,0 +1,8 @@
/*@
datatype Dll {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Dllist to mirror the C type struct dllist?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would be better.

@thatplguy
Copy link
Collaborator

Ok, that makes sense. Before we think about considering other schemes, let me recap to make sure I'm understanding correctly.

The main goal in adopting the uppercase naming scheme is to make a clear, explicit connection between CN-level things and C-level things, e.g.

  • for the C type, struct list_int, in CN we would have predicate List_Int and datatype List_Int
  • for the C variable sorted_nums, in CN we would have take Sorted_Nums = List_Int(sorted_nums)

The drawbacks are

  1. Using uppercase names everywhere is clunky and not a common language convention.
  2. It's not effective when C code also uses uppercase names.
  3. Other languages use capitalization to syntactically distinguish constructors, which has some benefits as per @cp526's comment. (It's not clear to me that restricting case in this way is actually an option for CN, since C allows uppercase names.)
  4. The naming scheme can still be confusing for C → CN lifting, e.g. the P in take P = Owned(p) IMO is a confusing name for the value pointed to by p.

Does that sound right?

@thatplguy
Copy link
Collaborator

I suppose it makes sense to list non-goals as well. Here's what we're not trying to accomplish with the uppercase naming convention:

  1. Avoiding top-level name conflicts between C and CN. Library developers will already have a convention, and it would make sense to use the same convention for CN identifiers. E.g. the pthreads library prefixes pthreads_ to top-level identifiers, so CN top-level identifiers should use the same prefix (but maybe capitalized 🙂).
  2. Syntactically distinguishing C vs. CN identifiers. If we're worried about developers getting confused by the type/provenance of an identifier, there are other approaches we should consider, like go-to-definition code navigation, showing the identifier type on hover in the IDE, etc.

@bcpierce00
Copy link
Collaborator Author

bcpierce00 commented Sep 6, 2024 via email

@cp526
Copy link
Collaborator

cp526 commented Sep 6, 2024

From my perspective, maintaining a syntactic distinction between C things and CN things should be a non-goal. Indeed, in the implementation of the CN frontend we've put some effort to integrate C things more seamlessly into CN specifications (e.g. one can mention C variables without a quotation mechanism and "dereference" owned pointers in specifications).

It's not completely clear where exactly the line should be drawn anyway: in take v = Owned<int>(p), is v a CN thing or a C thing? The name v is introduced by CN, but the pointee of p exists in C just the same.

@cp526
Copy link
Collaborator

cp526 commented Sep 6, 2024

  • A CN identifier that represents the state of a mutable data
    structure after some function returns should be named the same as
    the starting state of the data structure, with an _post at the
    end.
    - E.g., The list copy function takes a linked list l
    representing a sequence L and leaves l at the end pointing
    to a final sequence L_post such that L == L_post.
    (Moreover, it returns a new sequence Ret with L == Ret.)

From my understanding of earlier comments, this part of the scheme is one of the main motivations for using uppercase letters for CN identifiers -- so one can mechanically derive a good name for a CN variable introduced by a resource take binding (?).

A different option would be to use l_pre and l_post in place of L and L_post, or to number different "generations" of l as l1, l2, etc. Either of these would make the pre-state names a bit longer, but it would make for more uniform names in the presence of loops, which will need variable names such as l_inv or l3 or similar.

The text above is from the section 'For new code', but for code not written by us (the more interesting case), a deterministic naming scheme is trickier.

  1. If we're not writing the C code, the mechanically derived name won't always be good, as in Cole's example; e.g. the linked list pointer from above (there l) might have been called p, and take P = LinkedList(l) would not be a great name for the mathematical list. Perhaps a type-based naming scheme, whereby CN identifiers are always prefixed with letters determined by their CN type (or the resource predicate whose output they bind) could work? As discussed on a previous call, this could actually be mechanically enforced by CN, as Lem does, which allows users to associate prefix naming rules with type definitions.

  2. The close correspondence found in tutorial examples of (a) a C-type definition, (b) the associated CN resource predicate, and (c) the mathematical value represented by it may not be the common case in real-world examples (not that we've done enough of those yet), but perhaps that's obvious. In the examples we've done in CN, we've had resource predicates that "package together" ownership for different related bits of memory at different C-types and associated constraints; we'll also sometimes want resource predicates whose output does not include all the information represented in memory, for instance a page table predicate whose output is not table-shaped but a mapping from physical to virtual addresses (so a looser correspondence of datatype and resource predicate).

@cp526
Copy link
Collaborator

cp526 commented Sep 6, 2024

Ok, that makes sense. Before we think about considering other schemes, let me recap to make sure I'm understanding correctly.

The main goal in adopting the uppercase naming scheme is to make a clear, explicit connection between CN-level things and C-level things, e.g.

  • for the C type, struct list_int, in CN we would have predicate List_Int and datatype List_Int
  • for the C variable sorted_nums, in CN we would have take Sorted_Nums = List_Int(sorted_nums)

The drawbacks are

  1. Using uppercase names everywhere is clunky and not a common language convention.
  2. It's not effective when C code also uses uppercase names.
  3. Other languages use capitalization to syntactically distinguish constructors, which has some benefits as per @cp526's comment. (It's not clear to me that restricting case in this way is actually an option for CN, since C allows uppercase names.)

It would be an option if we needed it, because pattern matching can only ever introduce new CN variables. If CN restricted variables to lower-case names and constructors to upper-case names, an uppercase string in a pattern (in a pattern-matching expression) could reliably be interpreted as a constructor name since it could not be introducing any new C variables.

  1. The naming scheme can still be confusing for C → CN lifting, e.g. the P in take P = Owned(p) IMO is a confusing name for the value pointed to by p.

Does that sound right?

@thatplguy
Copy link
Collaborator

A different option would be to use l_pre and l_post in place of L and L_post, or to number different "generations" of l as l1, l2, etc. Either of these would make the pre-state names a bit longer, but it would make for more uniform names in the presence of loops, which will need variable names such as l_inv or l3 or similar.

I was just going to suggest something similar, e.g. _in (requires), _out (ensures), and _inv (loop invariants). I reviewed the tutorial examples up to (not including) the case studies, and I think this would work, along with a few other complementary changes:

  1. No requirements for local CN variables, e.g. let sum = (i64) x + i64 y; is fine.
  2. Predicates that convert C memory objects to a CN logical representation are named ctypename_at, e.g. sllist_at.
  3. Logical datatype representations often will naturally have different names than their C counterparts, e.g. List vs. sllist. When they coincide, prepend cn_ to the logical datatype, e.g. cn_sllist.

FWIW this is how I found myself naturally writing CN specs before @bcpierce00 started on the style guide.

I think it's also worth being more explicit somewhere in the tutorial about this style of verification, where one builds a separate (logical) representation of correct behavior and then relates the logical representation to the C code – calling that out lets us make the point that one should pick some naming scheme for the logical representation that reflects the corresponding C implementation, even if it's not the one we recommend.

@bcpierce00
Copy link
Collaborator Author

I agree that the l_in / l_out convention serves the (critical) purpose of establishing a clear rule for "what to call the thing we get when we take Owned(l). I also like the symmetry of the _in and _out suffixes.

I remain uneasy about mashing the C and CN worlds together -- to me, l_pre suggests "the state of the heap structure l at the beginning of the function," rather than "the mathematical abstraction of this heap structure," and I fear our users are going to get tripped up by this. At the very least, we are increasing the load on the tooltips in the Verse IDE, which will now bear more of the responsibility of helping users understand which world a given identifier belongs to. But I sense I am not convincing people on this point.

@cp526
Copy link
Collaborator

cp526 commented Sep 7, 2024

I agree that the l_in / l_out convention serves the (critical) purpose of establishing a clear rule for "what to call the thing we get when we take Owned(l). I also like the symmetry of the _in and _out suffixes.

*_in and *_out is quite good, and (positively) a bit less math-sy than *_pre and *_post; *_inv unfortunately isn't quite symmetric (in either scheme, with pre/post or with in/out). Maybe there's something better than *inv*? *while or *loop? Still not symmetric.

I remain uneasy about mashing the C and CN worlds together -- to me, l_pre suggests "the state of the heap structure l at the beginning of the function," rather than "the mathematical abstraction of this heap structure," and I fear our users are going to get tripped up by this. At the very least, we are increasing the load on the tooltips in the Verse IDE, which will now bear more of the responsibility of helping users understand which world a given identifier belongs to. But I sense I am not convincing people on this point.

I better understand this concern now, but I wonder whether this problem, in fact, arises from how we are using the name of the pointer (here l) in constructing the name of the resource output? If the pointer, for instance, as in Cole's example above, is called p, then neither P nor P_in nor p_in is a good name for the output, and even the capitalised P or P_in might be misunderstood as the heap state under p, no?

Would it not be sufficient, and make specifications clearer, if the user could more freely pick a suitable name for the resource output -- perhaps based on its type, as previously suggested -- and we advise them to uniformly use the *_in/*_out suffixing scheme for consistency between preconditions, postconditions, and loop invariants? For instance, the user might choose a convention to call list-typed variables xs, ys, etc., and then specify take xs_in = List__I32(p) in the precondition and take xs_out = List__I32(p) in the postcondition. In the case of Owned this scheme might lead to names such as take i_in = Owned<int>(p) and take node_in = Owned<struct node>(p). The user could choose to include the name of the pointer p in the name of the resource output, perhaps to clearly disambiguate between multiple outputs of the same kind, but I'm not sure it's necessary (or always useful) to insist on it.
If we decided to generalise CN resource predicates to not require a first pointer-typed argument ( rems-project/cerberus#303 ) -- not clear right now -- we would lose that "hook" for mechanically deriving the output name.

For completeness, rems-project/cerberus#312 discusses replacing the take notation. If we changed the notation for resource assertions to not require names for resource outputs, but instead (as discussed in some form there) had alternative syntax for referring to the outputs of a resource without such names, then this aspect of the naming scheme would become less important.

@thatplguy
Copy link
Collaborator

I remain uneasy about mashing the C and CN worlds together -- to me, l_pre suggests "the state of the heap structure l at the beginning of the function," rather than "the mathematical abstraction of this heap structure," and I fear our users are going to get tripped up by this. At the very least, we are increasing the load on the tooltips in the Verse IDE, which will now bear more of the responsibility of helping users understand which world a given identifier belongs to. But I sense I am not convincing people on this point.

Hm. Maybe I'm confused in the same way we fear our user might be 🙂 . In my mind, the important distinction is made by the types involved. Our verification strategy, broadly speaking, is to develop a CN model of correct computation and relate it to the C implementation. The model happens to use richer abstractions, like mathematical integers, to more easily represent programmer intentions.

In that sense, the v in let v = Owned<int>(p) is in fact representing the state of the heap at p, because v has type int. But when we use a custom resource predicate to project the heap structure into a CN datatype, we're relating the implementation to the mathematical abstraction denoted by the datatype. The same thing happens when we lift v into a larger integer type, e.g. let sum = (i64) v + (i64) 1.

If that's more or less accurate, and the types and the location of their definitions (in C vs in CN) convey the distinction, then I'd be in favor of relying on the tools for showing types and definitions in the IDE rather than pushing that information into naming conventions.

@bcpierce00
Copy link
Collaborator Author

bcpierce00 commented Sep 7, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants