
Indexing Grammar in combination with vector construction using [...] causes ambiguous syntax. #109

Open
sebffischer opened this issue Mar 31, 2024 · 9 comments
Labels
theme-internals Relates to internal operations of the language type-design Discussion regarding design of enhancements or project at large type-enhancement New feature or request

Comments

@sebffischer
Collaborator

Because [[ can be used with a single number like 1, and [ with a vector like [1], it is not clear what the expression x[[1]] represents, i.e. whether it calls [[ or [.
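
For concreteness, here is a minimal Rust sketch of the two readings the parser has to choose between. The AST names are hypothetical and not the project's actual parser types; the point is only that both parses are syntactically valid for the same source text.

// Hypothetical AST, illustrating why `x[[1]]` is ambiguous once `[...]`
// is also a vector constructor.
#[derive(Debug)]
enum Expr {
    Symbol(&'static str),
    Number(f64),
    VectorLiteral(Vec<Expr>),          // `[1]` builds a vector
    SingleIndex(Box<Expr>, Box<Expr>), // `x[i]`
    DoubleIndex(Box<Expr>, Box<Expr>), // `x[[i]]`
}

fn main() {
    // Parse 1: `[[` is a single token, so the index is the number 1.
    let as_double_bracket = Expr::DoubleIndex(
        Box::new(Expr::Symbol("x")),
        Box::new(Expr::Number(1.0)),
    );

    // Parse 2: `[` indexes with the vector literal `[1]`.
    let as_vector_index = Expr::SingleIndex(
        Box::new(Expr::Symbol("x")),
        Box::new(Expr::VectorLiteral(vec![Expr::Number(1.0)])),
    );

    // Both are valid readings of the same source text.
    println!("{:?}", as_double_bracket);
    println!("{:?}", as_vector_index);
}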

@dgkf
Owner

dgkf commented Apr 1, 2024

I'm very aware - I haven't really figured out a good path forward. I'm at an impasse with this syntax. This is described in this issue, but it warrants its own dedicated issue. I'm glad you homed in on it as well - I'd love to have other ideas bouncing around on this.

I really want this syntax to make more sense because I found the [ vs [[ syntax really convoluted in R when I was learning and I still see people struggling with it frequently.

I'd really like to retire [[ as a separate postfix operator, and instead use dispatch. To me this just feels much more "R" (even if it's not how R works). Making that work raises a few questions:

  • How do you differentiate a scalar index from a vector index such that:
    x[<scalar>] produces an element of x
    x[<vector>] produces a slice of x
  • Does this mean we need a scalar type?
  • Even if we did have a scalar type, and y = 1; x[y] (x[<scalar>]) could be differentiated from y = [1]; x[y] (x[<vector>]), this would still be the opposite of R's behavior. Personally I'm okay with that if the syntax can be consolidated.
  • I think R's roots as an array-programming-language are pretty magical and I wouldn't want to abandon that philosophy just to add scalars, so the semantics of using scalars in an array language need to be hashed out.

I don't have good answers to these, so for now it remains a wide-open experiment.
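
To make the "dispatch on the index's type" idea concrete, here is a rough Rust sketch using a hypothetical Value enum rather than the project's real types: x[<scalar>] extracts an element while x[<vector>] returns a slice.

// Hypothetical value representation with a scalar/vector split.
#[derive(Debug, Clone)]
enum Value {
    Scalar(f64),
    Vector(Vec<f64>),
}

// `[` dispatches on the index's type: a scalar index extracts an element,
// a vector index returns a slice. Indices are 1-based, as in R.
fn index(x: &[f64], i: &Value) -> Value {
    match i {
        Value::Scalar(n) => Value::Scalar(x[(*n as usize) - 1]),
        Value::Vector(is) => Value::Vector(
            is.iter().map(|n| x[(*n as usize) - 1]).collect(),
        ),
    }
}

fn main() {
    let x = vec![10.0, 20.0, 30.0];
    println!("{:?}", index(&x, &Value::Scalar(1.0)));            // Scalar(10.0)
    println!("{:?}", index(&x, &Value::Vector(vec![1.0, 2.0]))); // Vector([10.0, 20.0])
}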

@dgkf dgkf added type-design Discussion regarding design of enhancements or project at large status-feedback-needed Needs further discussion before moving further meta-discussion Discussions and resources labels Apr 1, 2024
@sebffischer
Collaborator Author

The only way I see to avoid both introducing scalars and having a dedicated [[ operator (or an additional parameter to the [ operator) would be to use non-standard evaluation.

For example:

> l <- list(1, 2, 3)
> l[1]
[1] 1
> l[1:1] 
[[1]]
[1] 1

However, this then gets a little tricky when the argument passed to [(x, i) is a variable:

> l <- list(1, 2, 3)
> ii <- 1:2
> l[ii]
# Error: Must have length 1
> l[ii[1]:ii[length(ii)]] # needless to say this is extremely ugly, but this could be made prettier 
[[1]]
[1] 1 2
> ii1 <- 1
> l[ii1]
[1] 1

I am not sure this is altogether much different from the [[ operator; the internal implementation just uses non-standard evaluation instead of a special operator. Needless to say, the syntax I suggested above is not pretty. I am also not sure whether I would want to bake NSE into such a core data structure of the language. And the above is probably no easier to understand than the difference between the [[ and [ operators (maybe even harder?).
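
To sketch the NSE idea in Rust (the IndexExpr and Subset names are hypothetical, not the interpreter's actual AST): the subset kind is chosen from the syntax of the unevaluated index expression, and only a range produces a slice, even the degenerate 1:1.

// Hypothetical, unevaluated index expressions as `[` would see them.
#[derive(Debug)]
enum IndexExpr {
    Literal(usize),      // l[1]
    Range(usize, usize), // l[1:2]
}

#[derive(Debug)]
enum Subset {
    Element(f64),   // behaves like R's `[[`
    Slice(Vec<f64>), // behaves like R's `[`
}

// Decide element-vs-slice from the *syntax* of the index, then subset.
// A bare variable index (e.g. l[ii]) would be handled like the Literal
// case: evaluate it and require length 1, which is why l[ii] above
// errors when ii evaluates to 1:2.
fn subset(l: &[f64], expr: &IndexExpr) -> Subset {
    match expr {
        IndexExpr::Literal(i) => Subset::Element(l[i - 1]),
        IndexExpr::Range(a, b) => Subset::Slice(l[a - 1..*b].to_vec()),
    }
}

fn main() {
    let l = vec![1.0, 2.0, 3.0];
    println!("{:?}", subset(&l, &IndexExpr::Literal(1)));  // Element(1.0)
    println!("{:?}", subset(&l, &IndexExpr::Range(1, 1))); // Slice([1.0])
}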

Even if we did have a scalar type, and y = 1; x[y] (x[<scalar>]) could be differentiated from y = [1]; x[y] (x[<vector>]), this would still be the opposite of R's behavior. Personally I'm okay with that if the syntax can be consolidated.

Maybe this would be an argument to construct vectors with (...) and lists with [...]?
Then the syntax would be x[(1)] to slice the vector and x[1] to access the scalar, avoiding the confusion for R users.
However, I think x[(1)] looks a little weird, so I would probably just bite the bullet and live with the inverted semantics of [[.

I think R's roots as an array-programming-language are pretty magical and I wouldn't want to abandon that philosophy just to add scalars, so the semantics of using scalars in an array language need to be hashed out.

Agreed! However, maybe we can introduce scalars without moving far away from this idea.
What I like about the introduction of scalars (without thinking through all the details):

  • We can more naturally express vectors that contain doubles or NAs, because we define vector types in terms of the scalar values they contain, i.e. double() is an Obj::Vector whose elements are all of type Obj::Scalar(Double) or Obj::Scalar(NA) (see the sketch after this list).
  • This should have some side-benefits regarding performance as we don't have to box all length 1 vectors. (Admittedly, with the Altreps used throughout this project, the benefits are probably smaller than in R, where vectors are almost always allocated).
  • We should be able to keep the semantics of scalars almost identical to length-1 vectors (except for cases like the subsetting, where we want to differentiate). For example, 1 == [1, 2, 3] should result in [true, false, false] and not false.
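
As a loose sketch of the first point above, with Obj and Atom as illustrative names rather than the crate's actual types: a scalar and a vector are defined over the same atomic values, and scalar-vs-vector comparison recycles the scalar (NA propagation is simplified away here for brevity).

// Hypothetical atomic values shared by scalars and vectors.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Atom {
    Double(f64),
    NA,
}

#[derive(Debug, Clone)]
enum Obj {
    // An unboxed length-1 value...
    Scalar(Atom),
    // ...and a vector defined in terms of the same atomic values.
    Vector(Vec<Atom>),
}

// `1 == [1, 2, 3]` recycles the scalar against the vector, yielding
// [true, false, false] rather than a single false.
fn eq(lhs: &Obj, rhs: &Obj) -> Vec<bool> {
    match (lhs, rhs) {
        (Obj::Scalar(a), Obj::Vector(v)) => v.iter().map(|b| a == b).collect(),
        (Obj::Vector(v), Obj::Scalar(a)) => v.iter().map(|b| b == a).collect(),
        (Obj::Scalar(a), Obj::Scalar(b)) => vec![a == b],
        (Obj::Vector(v), Obj::Vector(w)) => {
            v.iter().zip(w.iter()).map(|(a, b)| a == b).collect()
        }
    }
}

fn main() {
    let one = Obj::Scalar(Atom::Double(1.0));
    let v = Obj::Vector(vec![Atom::Double(1.0), Atom::Double(2.0), Atom::Double(3.0)]);
    println!("{:?}", eq(&one, &v)); // [true, false, false]

    // NA handling is simplified: a full version would propagate NA here
    // rather than returning false.
    let missing = Obj::Scalar(Atom::NA);
    println!("{:?}", eq(&missing, &one)); // [false]
}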

@dgkf
Owner

dgkf commented Apr 8, 2024

Agreed! However, maybe we can introduce scalars without moving far away from this idea.

Yeah, this has sort of been what I've been leaning toward, so it's good to hear you reinforce that direction.

As I think about it more, I think we'll have Scalar, Vector, and in the future (maybe Matrix? and) NDArray for higher-order arrays. For things that are used frequently, like scalars and vectors, I think it's good to specialize the representation and only defer to a more generic representation for higher-dimensional data.

This should have some side-benefits regarding performance as we don't have to box all length 1 vectors.

Certainly - especially if those basic types all implement Copy, we save a lot on performance and avoid the overhead of dealing with boxes/shared references.
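
Roughly, the specialization could look like the following sketch; the names and generics are assumptions, not the project's code. The point is that Scalar<T> can be a plain Copy value with no allocation, Vector<T> owns a buffer, and only NDArray<T> pays for shape metadata.

// Hypothetical three-tier representation.
#[derive(Debug, Clone, Copy)]
struct Scalar<T: Copy>(T);

#[derive(Debug, Clone)]
struct Vector<T>(Vec<T>);

#[derive(Debug, Clone)]
struct NDArray<T> {
    dims: Vec<usize>,
    data: Vec<T>,
}

fn main() {
    let s = Scalar(1.0_f64);
    let s2 = s; // Copy: no clone, no boxing, no reference counting
    let v = Vector(vec![1.0, 2.0, 3.0]);
    let m = NDArray { dims: vec![2, 2], data: vec![1.0, 2.0, 3.0, 4.0] };
    println!("{:?} {:?} {:?} {:?}", s, s2, v, m);
}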


Cool, well that's about all the confirmation I needed. I think Scalar is a necessary feature here.

@dgkf dgkf added type-enhancement New feature or request theme-internals Relates to internal operations of the language and removed status-feedback-needed Needs further discussion before moving further meta-discussion Discussions and resources labels Apr 8, 2024
@sebffischer
Collaborator Author

Just so I don't forget: the AtomicMode trait should then probably also be renamed, as Vectors are no longer the atomic data structure once we have scalars.

@dgkf
Owner

dgkf commented Apr 9, 2024

We might be able to keep it, since it applies to the T in Vector<T>, not the Vector itself.

I imagine that any T in either a Vector<T> or a Scalar<T> will probably need to share some foundational traits, which is currently what AtomicMode is for. That said, we might consider renaming it anyways to be more descriptive.

I'm not sure I agree that Vector should become Vector<Scalar<T>>. In principle I think this would be preferred, but because it's such a widely used data type, we should probably minimize the layers of computation when working with its data.
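
As a stand-in sketch of that point (mirroring the idea only, not AtomicMode's real definition or methods): the trait constrains the element type T, so the same bound can serve both Vector<T> and a future Scalar<T> without wrapping vector elements in Scalar.

// Hypothetical stand-in for the shared element-type trait.
trait AtomicMode: Clone + std::fmt::Debug {
    fn type_name() -> &'static str;
}

impl AtomicMode for f64 {
    fn type_name() -> &'static str { "double" }
}

impl AtomicMode for bool {
    fn type_name() -> &'static str { "logical" }
}

// Both containers bound their element type by the same trait.
#[derive(Debug, Clone)]
struct Vector<T: AtomicMode>(Vec<T>);

#[derive(Debug, Clone)]
struct Scalar<T: AtomicMode>(T);

fn main() {
    let v = Vector(vec![1.0_f64, 2.0]);
    let s = Scalar(true);
    println!("Vector<{}>: {:?}", <f64 as AtomicMode>::type_name(), v);
    println!("Scalar<{}>: {:?}", <bool as AtomicMode>::type_name(), s);
}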

@sebffischer
Collaborator Author

sebffischer commented Apr 9, 2024

As I think about it more, I think we'll have Scalar, Vector, and in the future (maybe Matrix? and) NDArray for higher order arrays.

I think it would be really cool if we could conceptualize a Scalar as a 0-dimensional array. Not necessarily in the internal representation, where I agree that overhead should be as minimal as possible, but in terms of its semantics.
(Then an array is still the most atomic data structure and AtomicMode is not a misnomer; rather, it should be implemented for scalar, vector, matrix, and nd-array.)

I'm not sure I agree that Vector should become Vector<Scalar<T>>. In principle I think this would be preferred, but because it's such a widely used data type, we should probably minimize the layers of computation when working with its data.

Yes I think that makes sense!

@dgkf
Owner

dgkf commented Apr 9, 2024

I think it would be really cool if we could conceptualize a Scalar as a 0-dimensional array.

Yeah, that's how I see it too. I think we'll have distinct data types, but I think the broadcasting should operate such that a 0-dim array is operationally identical, at least for broadcasting/recycling operations like mathematical operators.
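
A toy sketch of the "scalar as 0-dim array" view: everything carries a shape, a scalar's shape is just [], and recycling makes the scalar behave like a length-1 vector. The names and the simplified recycling rule are illustrative only.

// Hypothetical shape-carrying array type.
#[derive(Debug, Clone)]
struct Array {
    shape: Vec<usize>, // [] for a scalar, [n] for a vector, [r, c] for a matrix...
    data: Vec<f64>,
}

impl Array {
    fn scalar(x: f64) -> Self {
        Array { shape: vec![], data: vec![x] }
    }
    fn vector(xs: Vec<f64>) -> Self {
        Array { shape: vec![xs.len()], data: xs }
    }
}

// Element-wise addition with naive recycling: the shorter buffer repeats.
// (Real broadcasting rules would compare shapes dimension by dimension.)
fn add(a: &Array, b: &Array) -> Array {
    let (long, short) = if a.data.len() >= b.data.len() { (a, b) } else { (b, a) };
    let data = long
        .data
        .iter()
        .enumerate()
        .map(|(i, x)| x + short.data[i % short.data.len()])
        .collect();
    Array { shape: long.shape.clone(), data }
}

fn main() {
    let s = Array::scalar(10.0);
    let v = Array::vector(vec![1.0, 2.0, 3.0]);
    // The 0-dim scalar is operationally identical to a length-1 vector:
    println!("{:?}", add(&s, &v)); // data: [11.0, 12.0, 13.0]
}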

@dgkf
Owner

dgkf commented Apr 15, 2024

Was reading Designing Types for R, Empirically (Vitek et al., 2020) today and just wanted to highlight some of their conclusions (Section 6.1.1). It's worth a read through all their commentary, but there are at least two that might inform a direction for this project:

Supporting the inclusion of an explicit missing/na

Similar to Julia's Vector{Union{_, Missing}}

[on] NAs. Our data supports making the presence of NAs explicit. Only 2923 (or 2.78%) of arguments are marked as possibly having NAs, thus the overwhelming majority of types appear to be NA-free. In practice, programmers check for them and sanitize them if they are present.

Supporting the introduction of a Scalar

[on] Scalars. The data also suggests that programmers often use scalars, and do dimensionality checks on their data. In our data 25,064 (or 33.33%) of the arguments are scalar types. While not completely surprising, this is a rather large number. Consider the hankel.matrix function, it takes two arguments and checks that n is int, that x is a vector, and also, indirectly, that n is a scalar (this comes from the fact that it is used in the guard of a conditional which fails if n is not a scalar).
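
For what it's worth, carrying explicit missingness in the element type, in the spirit of Julia's Vector{Union{T, Missing}}, could look something like this sketch. MaybeNA is a hypothetical name (the project may already have an equivalent internally), and an NA-free vector type could skip the check entirely, which is the paper's point about most values being NA-free.

// Hypothetical explicit-missingness wrapper around an element type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum MaybeNA<T> {
    NA,
    Value(T),
}

// NA propagates through arithmetic; an NA-free vector type would not
// need this branch at all.
fn add(a: MaybeNA<f64>, b: MaybeNA<f64>) -> MaybeNA<f64> {
    match (a, b) {
        (MaybeNA::Value(x), MaybeNA::Value(y)) => MaybeNA::Value(x + y),
        _ => MaybeNA::NA,
    }
}

fn main() {
    let xs = vec![MaybeNA::Value(1.0), MaybeNA::NA, MaybeNA::Value(3.0)];
    let bumped: Vec<_> = xs.iter().map(|x| add(*x, MaybeNA::Value(1.0))).collect();
    println!("{:?}", bumped); // [Value(2.0), NA, Value(4.0)]
}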

@sebffischer
Collaborator Author

Thanks so much for sharing this, the title sounds fascinating. Maybe we could even try to reach out to Jan Vitek?
