Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

readExcel created <Comparable> column #150

Closed
LeandroC89 opened this issue Aug 14, 2022 · 6 comments
Closed

readExcel created <Comparable> column #150

LeandroC89 opened this issue Aug 14, 2022 · 6 comments
Assignees
Labels
question Further information is requested
Milestone

Comments

@LeandroC89
Copy link
Contributor

Hello,
found this annoying situation where a schema would be printed as

F1: Comparable
F2: String
F3: String
F4: String
F5: Comparable
F6: String

This creates 3 kinds of issues:

  1. If a filter is used, sometimes only one of the entries will be retrieved even if there are 2 for the given filter.
  2. If an update or sort operation is used this issues occurs -> class java.lang.Double cannot be cast to class java.lang.String (java.lang.Double and java.lang.String are in module java.base of loader 'bootstrap')
  3. If a convert{ "F1"() }.to() is called this error occurs -> Can't find converter from kotlin.Comparable<> to kotlin.String*

For situation 1 I tried to update or convert the column to a String, hence why I discovered situations 2 & 3

Thanks

@koperagen
Copy link
Collaborator

Hi! Would it be helpful and clear what's going on if schema was printed like this?

F1: Comparable (String & Double)
F2: String
F3: String
F4: String
F5: Comparable (String & Double)
F6: String

@LeandroC89
Copy link
Contributor Author

LeandroC89 commented Aug 16, 2022

Hi,
my issue is not with the schema itself but with using Comparable as a type.

If you have 2 occurrences of the same value on the same column. Excel filter would detect both instances. But Dataframe filter may return only 1. (I suppose this has to do with one being considered String and the other a Double. Despite the reason it is a huge risk as filter suddenly becomes unreliable)

Same happens with the sort operation, instead of sorting it throws an Exception since it can't compare String with a Double.

If one tries to use convert to change types, it either isn't possible or you have define the column type as either String or Double since Comparable is not a valid type. And then you get the same Exception as before.

If one tries the convert to String it'll fail. Likewise for the update function.

My workaround was to add a new column which is the Comparable typed column .toString() and use it instead.

My point is, if Comparable is unreliable and clearly the library isn't prepared for it, wouldn't it be better to simply remove it? Make it so if a Text is found on a column of Number up until then, then the whole column would be type String.

I even ensured I had selected all populated Excel cells and had them typed as Text beforehand (not that Excel is any good at typings unless you go through each cell, double click and press Enter but that would take forever).

Thank you.

@koperagen
Copy link
Collaborator

Make it so if a Text is found on a column of Number up until then, then the whole column would be type String.

This sounds good, i think. If it means no loss of data (i.e. all those numbers can be converted back by something like it.toDoubleOrNull()), then we probably can do it.

But i still would like to see if we can improve experience of how to handle situations when this weird type shows up in input data.

Because, in fact, you can tidy up this column like this
df.convert { F1 }.with(Infer.Type) { (it as? Double)?.let { it.toString() } }

Another solution could be saving F1 as a ColumnGroup:
F1

  • string: String
  • double: Double

So you could
df.convert { F1 }.with { it.string ?: it.double?.toString() }

What do you think? My concern is that first solution is probably easy to miss, and the second one can be confusing, because you suddenly get ColumnGroup instead of DataColumn. Maybe we should print schema after read operations by default in notebooks and add some extra information, idk

@LeandroC89
Copy link
Contributor Author

Hello,
and thank you for your explanation.

I'm having some difficulty getting the inferType to properly work. I managed to get it like this:
(I wanted nulls as "" for this case)

.convert { "F1"<Any>() }
            .with(inferType = true) {
                 it.toString() }

I used Any because Comparable is not a valid Column Type, whether used for String access or for Column Accessors
(Using String would cause an issue with InferType)

Still had to deal with the ".0" resulting from the conversion Double to String but i's already something I can work with.


But as for the Comparable use case:

  • It can cause issues when using sort
  • Filters may become unreliable if data isn't properly converted beforehand
  • Column is not accepted, having to resort to using one of the column types or Any

Does it make sense to keep it as a possible column type when reading a file? (Just adding some remarks, I already have a workaround for my main issue thanks to your reply 👍 )

Thank you!

@zaleslaw zaleslaw added this to the 0.11.0 milestone Apr 25, 2023
@zaleslaw zaleslaw added the question Further information is requested label Apr 25, 2023
@zaleslaw zaleslaw self-assigned this Jun 12, 2023
@zaleslaw zaleslaw modified the milestones: 0.11.0, 0.12.0, Backlog Jun 19, 2023
@Jolanrensen Jolanrensen self-assigned this Jun 22, 2023
@Jolanrensen Jolanrensen modified the milestones: Backlog, 0.12.0 Jun 22, 2023
@zaleslaw zaleslaw removed their assignment Jun 23, 2023
@Jolanrensen Jolanrensen added the invalid This issue/PR doesn't seem right label Oct 9, 2023
@Jolanrensen
Copy link
Collaborator

@koperagen had other ideas for handling multiple types in one column, which I summarized here: #466.

@Jolanrensen Jolanrensen modified the milestones: 0.12.0, Backlog Nov 7, 2023
@zaleslaw zaleslaw removed the invalid This issue/PR doesn't seem right label Apr 8, 2024
@koperagen koperagen modified the milestones: Backlog, 0.14.0 Jul 5, 2024
@koperagen
Copy link
Collaborator

Found a good way to solve it after all #745

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants