
🐛 Mongo to BigQuery: Long (Int64) is serialized as float-like (scientific) number #9590

Closed
joelluijmes opened this issue Jan 19, 2022 · 7 comments
Labels
area/connectors · community · connectors/source/mongodb · normalization · team/destinations · type/bug

Comments

joelluijmes (Contributor) commented Jan 19, 2022

Environment

  • Airbyte version: 0.35.5-alpha
  • OS Version / Instance: Google Kubernetes
  • Deployment: Kubernetes
  • Source Connector and version: Mongo V2 0.1.11
  • Destination Connector and version: BigQuery 0.6.3
  • Severity: High
  • Step where error happened: Sync job

Current Behavior

In our Mongo source we have large numeric data (i.e., of type long), for instance the value 3241201784 (> 2.1B, i.e., beyond Int32 range). However, somewhere during the sync (I'm not sure in which component) the value is serialized as a float-like type: 3.241201784E9.
This is a problem because the value is actually a numeric id, so I need the exact value, not a float-like approximation.

For the records I checked, the value as-is preserves the same precision (i.e., the same number of significant digits is present), so theoretically I could convert them back to long/int64. However, I'm not confident this will always be the case, and/or that it will still hold once we roll over to >10B records.
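
For what it's worth, the exact rendering 3.241201784E9 is what Java's Double.toString produces once the value has been widened to a double anywhere in the pipeline. A minimal sketch of mine (not Airbyte code) that reproduces the notation:

```java
public class ScientificNotationRepro {
    public static void main(String[] args) {
        // Double.toString switches to scientific notation for magnitudes >= 10^7,
        // so a long widened to a double prints exactly the reported value.
        double widened = 3241201784L; // implicit long -> double conversion
        System.out.println(widened);  // prints: 3.241201784E9
    }
}
```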

Expected Behavior

Instead, I'd expect one of the following behaviors:

  • The value is stored as an actual numeric type (long).
  • If that is not possible due to Java/JSON serialization, the value is stored as a string. That way we don't lose precision (see the sketch below).
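
To back up the precision concern: a double has only 53 mantissa bits, so longs above 2^53 cannot be recovered from a float-like value, while a string round-trip stays exact. A small self-contained check (my own example, not connector code):

```java
public class PrecisionCheck {
    public static void main(String[] args) {
        long big = 9_007_199_254_740_993L; // 2^53 + 1, not representable as a double
        System.out.println((long) (double) big == big);                // false: precision lost
        System.out.println(Long.parseLong(Long.toString(big)) == big); // true: exact round-trip
    }
}
```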

Steps to Reproduce

  1. Put a large number (> 2^31) in mongo (see the snippet after this list)
  2. Sync it through Airbyte to BigQuery
  3. See the result rendered as a scientific, float-like value
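
For step 1, a minimal way to seed such a document with the MongoDB Java sync driver (the connection string, database, and collection names are placeholders):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SeedLargeLong {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                client.getDatabase("test").getCollection("items");
            // Stored as BSON int64; the custom _id mirrors the setup described above.
            coll.insertOne(new Document("_id", 3241201784L));
        }
    }
}
```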

Are you willing to submit a PR?

Sure, but I could use some guidance on how to debug the full flow to see where the serialization goes wrong.

joelluijmes (Contributor, Author) commented Jan 19, 2022

Triage:

Airbyte did recognize the mongo schema as number (see the screenshot below). FYI, the field with the large values is _id (i.e., a custom _id field; we're not using ObjectId here). Looking at the source, INT32, INT64, DOUBLE and DECIMAL are all treated as JSON number, so that part seems fine.

[screenshot: discovered schema showing the _id field typed as number]

Subsequently, I hypothesize the issue is either in the serialization to JSON at the source (i.e., an issue with the library), or that upon deserialization in the destination connector the value is assumed to be float-like because it's large.
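
A quick way to probe that second hypothesis with Jackson (the JSON library the connectors use): the parser itself keeps the value integral, but any consumer that trusts the declared number schema type and reads the field as a double reproduces the observed rendering. A sketch of mine, not actual connector code:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class HypothesisCheck {
    public static void main(String[] args) throws Exception {
        JsonNode record = new ObjectMapper().readTree("{\"_id\": 3241201784}");
        System.out.println(record.get("_id").isIntegralNumber()); // true: parsed as a long
        System.out.println(record.get("_id").asDouble());         // prints: 3.241201784E9
    }
}
```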

harshithmullapudi added the area/connectors label and removed the needs-triage label on Jan 20, 2022
harshithmullapudi (Contributor) commented

Hey, does it make sense to change it to a string and convert it to whatever format you need via custom dbt? Does this sound good?

joelluijmes (Contributor, Author) commented

If I can instruct Airbyte to treat it as a string, that'd be great! Is that possible?
If you mean at the Mongo side: no, that won't be viable, as we'd have to change our applications.

auyer commented Apr 1, 2022

Hi!
I had the same issue while loading BIGSERIAL from Postgres to S3 (Parquet). No dbt transformations involved (or even supported, for that matter).
[two screenshots illustrating the same float-like rendering]

grishick removed this from GL Roadmap on Apr 21, 2022
grishick changed the title from "🐛 Long (Int64) is serialized as float-like (scientific) number" to "🐛 Mongo to BigQuery: Long (Int64) is serialized as float-like (scientific) number" on May 3, 2022
grishick added the team/destinations label on Sep 27, 2022
joelluijmes (Contributor, Author) commented Dec 1, 2022

I noticed a PR was merged which should fix this issue (#14362). After my report, two other issues (#12606 and #12057) were opened, and they were supposedly closed by the aforementioned PR.

I just tested this again, and unfortunately the data is still synced as floats instead of integers. I created a new connection to ensure schema discovery ran again.

Can this issue be investigated again?


Meanwhile we are running:

edgao (Contributor) commented Dec 2, 2022

Ah, that was my bad - I didn't notice the mongodb entry in the list of sources. Unfortunately, Mongo isn't a JDBC source (since it's not even an RDBMS), so it wasn't solved by #14362.

I did a little digging - updating this mapping is a good starting point (i.e., mapping INT32 and INT64 to JsonSchemaType.INTEGER instead of NUMBER). I probably can't get to this in a reasonable timeframe, but I'm super happy to help workshop a pull request with anyone interested. (There's probably some additional work to update the tests etc. that I didn't find on my first pass.)

Under the hood, what's happening is that source-mongodb is discover-ing these fields as numbers, so the destination creates a floating-point column. If we update source-mongodb to discover them as integers, that will probably be sufficient to make the destination write them correctly. (Theoretically we'd also want to update how source-mongodb read-s these values, but I think it already handles integers correctly.)
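
For anyone picking this up, a hedged sketch of what that mapping change could look like. The real class and method names in source-mongodb may differ, and JsonSchemaType here is a local stand-in for Airbyte's enum of the same name:

```java
import org.bson.BsonType;

public class MongoSchemaMapping {

    // Local stand-in for Airbyte's JsonSchemaType enum, just for this sketch.
    enum JsonSchemaType { INTEGER, NUMBER, STRING }

    static JsonSchemaType toJsonSchemaType(final BsonType bsonType) {
        switch (bsonType) {
            case INT32:
            case INT64:
                return JsonSchemaType.INTEGER; // previously NUMBER: the proposed fix
            case DOUBLE:
            case DECIMAL128:
                return JsonSchemaType.NUMBER;  // genuinely floating-point BSON types
            default:
                return JsonSchemaType.STRING;  // simplified fallback for this sketch
        }
    }
}
```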

marcosmarxm (Member) commented

Normalization and custom dbt are deprecated features and will soon be removed from the project. For that reason, I'm closing this issue as it won't be implemented.
