Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Row getVector* methods are now templates #122

Merged
merged 2 commits into from
May 2, 2020

Conversation

ilijapuaca
Copy link
Contributor

This seemed like a pitfall for anyone starting fresh -- all the data loaded through arrow loader would end up using doubles, while in our accessors we're always requesting floats. Current implementation requires ugly casting, or even worse returning the implicit default value without it.

Not sure whether this is the best implementation out there, but it gets rid of ~250 lines of boilerplate code that was in there before.

@ilijapuaca ilijapuaca requested a review from ibgreen May 1, 2020 11:03
// NOTE: Decimal array data is being loaded as double
auto startPositions = row.getDoubleVector3("start");
return mathgl::Vector3<float>{startPositions};
return row.getVector3<float>("start");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative is row.get<Vector3<float>>()

You could do this with partial template specializations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the method signature would end up being auto get(const std::string& columnName, const T& defaultValue = {}) const -> T?

That doesn't sound like a good idea to me as whoever is calling this might expect us to support any type out there, while in reality it would be only vectors. We could throw an error for when specialization isn't implemented for that type, but it just seems unnecessary and it does not seem to add too much value

@@ -41,7 +41,7 @@ auto Row::getInt(const std::string& columnName, int defaultValue) const -> int {

auto chunk = this->_getChunk(columnName);
auto doubleValue = this->_getDouble(chunk);
if (doubleValue.has_value()) {
if (doubleValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a JavaScript-ism to me. Maybe rename to doubleOption? The value part of the name combined with using auto makes me think doubleValue actually has type `double., for which this if statement doesn't really make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also agree that using has_value is more verbose and improves the readability, especially when using auto. @ibgreen seems to prefer leaving that out, so I was trying to be consistent.

Updated auto to explicit type as a middle ground

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see both view points. pointers in C++ are routinely compared as boolean values.

if (pointer) vs if (pointer === nullptr)

And C++ has the conversion to boolean built-in.

std::optional is clearly designed to mimic the pointer idiom. It support similar style if checking, and can even be "dereferenced" using the * operator (instead of get_value)

So - we may want to pick a style (e.g. pointer like vs explicit has/get_value calls) since we are increasingly making use of std::optional.

@@ -104,171 +104,6 @@ auto Row::getString(const std::string& columnName, const std::string& defaultVal
return defaultValue;
}

auto Row::getFloatVector2(const std::string& columnName, const Vector2<float>& defaultValue) const -> Vector2<float> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue with templates is that all the code goes into header files and needs to be compiled (or at least instantiated) again and again for each type, which slows down compilation and bloats the executable. So it can sometimes be good to have a smaller template that does type casting and such, and have bigger non-templated helper methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly what I was trying to do. I don't see a way to make _getVectorData smaller without breaking it down into individual methods that would each return a vector of data. That would be additional overhead on top of what was just added, where we'd be filling 2 vectors and casting them in order to form the final data.

Open to suggestions if you have an improvement in mind, I may be missing something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I comment linearly as I read through the PR. I was pleasantly surprised at how compact the code was when I got a bit further, but forgot to modify this comment. All good.


return Vector4<float>{first, second, third, fourth};
}
return std::make_shared<ListArrayMetadata>(offset, length, listArray->values());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are not sharing it, don't return a shared pointer. Just return the struct by value. It is faster and cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need a way to check whether it's valid, so it was either this, an optional, or a raw pointer. I guess an optional might make more sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - you could add e.g. a valid flag to the struct, or use an optional.

-> mathgl::Vector4<float>;
auto getDoubleVector4(const std::string& columnName, const mathgl::Vector4<double>& defaultValue = {}) const
-> mathgl::Vector4<double>;
template <typename T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative is row.get<Vector3<float>>()

You could do this with partial template specializations.

I.e. you would still have all these methods, but they would be specializations rather than different methods.

auto metadata = this->_getListArrayMetadata(this->_getChunk(columnName), this->_chunkRowIndex);

auto data = std::make_shared<std::vector<T>>();
if (auto doubleArray = std::dynamic_pointer_cast<arrow::DoubleArray>(metadata->values)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dynamic casts are somewhat expensive as they have to inspect RTTI data. It is rarely an issue, only if done very frequently inside loops - but unfortunately that is the case with all Row methods.

I would strongly prefer that we switched on the arrow type constant here instead.

We should set up a simple benchmark so we can test the row method performance (in particular we need to cache the string value lookups).

@ilijapuaca
Copy link
Contributor Author

Landing this as some of the stuff I'm working on calls out to getVector APIs. Will revisit any additional comments in a followup

@ilijapuaca ilijapuaca merged commit 3287fc2 into master May 2, 2020
@ilijapuaca ilijapuaca deleted the feature/arrow-row-casting branch May 2, 2020 04:28
auto metadata = optionalMetadata.value();
std::vector<T> data;
switch (metadata.values->type_id()) {
case arrow::Type::type::DOUBLE: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

-> mathgl::Vector3<T> {
auto optionalData = this->_getVectorData<T>(columnName);
if (!optionalData) {
probegl::WarningLog() << "Unsupported vector data type, returning default value";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful with these warnings, these could cause massive cascades if processing a million row table.

The probe.gl logging system in JS has warning deduplication as a built-in feature. I.e. each warning message string is cached and only printed once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could put that in place if it's not too much work, not sure how you went about doing that in JS. For now, I think there should be something in place that notifies the user of the API that something has gone wrong during the conversion though, as ugly as this may be

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a set recording which strings have already been logged. Could also be a map<string, timestamp> and only log if say 1 second has passed.

I believe the log function has an option.once. This is set by default on log.warn and log.error (the api is slightly different).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened up a new issue #129

@@ -41,7 +41,7 @@ auto Row::getInt(const std::string& columnName, int defaultValue) const -> int {

auto chunk = this->_getChunk(columnName);
auto doubleValue = this->_getDouble(chunk);
if (doubleValue.has_value()) {
if (doubleValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see both view points. pointers in C++ are routinely compared as boolean values.

if (pointer) vs if (pointer === nullptr)

And C++ has the conversion to boolean built-in.

std::optional is clearly designed to mimic the pointer idiom. It support similar style if checking, and can even be "dereferenced" using the * operator (instead of get_value)

So - we may want to pick a style (e.g. pointer like vs explicit has/get_value calls) since we are increasingly making use of std::optional.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants