Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In case of errors on some items during sync, no items are saved #3259

Closed
gbitaudeau opened this issue Oct 2, 2023 · 40 comments
Closed

In case of errors on some items during sync, no items are saved #3259

gbitaudeau opened this issue Oct 2, 2023 · 40 comments
Assignees
Labels
bug Something isn't working datastore Issues related to the DataStore category

Comments

@gbitaudeau
Copy link

Describe the bug

If during initial sync, AppSync responds with some items on error, these items are null in items response array. The other items which are not in error are not saved in datastore. Here a redacted result for sync Model named Draw with an error on one item:

{
  "data": {
    "syncDraws": {
      "items": [
        {
          "id": "70083950-9b18-4e62-90c6-433ae933def3",
          "drawBatch": {
            "id": "e236042a-7c21-4b82-90b3-d8b73d2c6b99",
            "_deleted": true
          },
          "item": {
            "_deleted": true,
            "id": "fa986c78-ff21-46e2-82ff-8d0a218510bc"
          },
          "owner": "cb8c1cd8-f22c-4e30-8a7c-42fa3e2e16b1",
          "updatedAt": "2023-09-20T09:09:06.134Z"
        },
        null
  },
  "errors": [
    {
      "path": [
        "syncDraws",
        "items",
        17,
        "drawBatch"
      ],
      "data": null,
      "errorType": "Unauthorized",
      "errorInfo": null,
      "locations": [
        {
          "line": 10,
          "column": 7,
          "sourceName": null
        }
      ],
      "message": "Not Authorized to access drawBatch on type Draw"
    }
  ]
}

Steps To Reproduce

Steps to reproduce the behavior:
1. Consider a schema.graphql where 2 objects are linked together with `@hasMany` and `@belongsTo` and each object have the following auth policy: `@auth(rules: [{ allow: owner}])`.
2. Create on object of each type and linked then together.
3. Change the owner of the object with the `@hasMany` relationship.
4. The sync of the object type with `@belongsTo` is no longer working.

Expected behavior

DataStore's initial sync should reconcile partial GraphQL responses by saving the items in the local Database and emitting the errors to the errorHandler.

Amplify Framework Version

2.19.0

Amplify Categories

DataStore

Dependency manager

Swift PM

Swift version

5.9

CLI version

12.4.0

Xcode version

15.0

Relevant log output

No response

Is this a regression?

No

Regression additional context

No response

Platforms

iOS

OS Version

iOS 17.0.1

Device

iPhone

Specific to simulators

No response

Additional context

See discussion #3232 (comment)

@lawmicha lawmicha added bug Something isn't working datastore Issues related to the DataStore category labels Oct 2, 2023
@lawmicha
Copy link
Member

lawmicha commented Oct 2, 2023

Hi @gbitaudeau, do you have an redacted version of your schema for us to test with?

@gbitaudeau
Copy link
Author

Sorry for the delay, I will make a short project with a simplified schema and add it here as soon as I can.

@thisisabhash
Copy link
Member

Great, thanks. Let us know how it goes and if you have any questions.

@gbitaudeau
Copy link
Author

Hi,

Here you can find a way to reproduce easily a case where sync failed and no items were saved in datastore.

Please note that to make as short as possible, I changed how to reproduce the problem because I encountered this case due to authorization issue (what I described initially: owner changed on some model linked to other), but it takes more step to reproduce as you need to manage auth, signup, signin...

So using this schema:

# This "input" configures a global authorization rule to enable public access to
# all models in this schema. Learn more about authorization rules here: https://docs.amplify.aws/cli/graphql/authorization-rules
input AMPLIFY { globalAuthRule: AuthRule = { allow: public } } # FOR TESTING ONLY!

type Blog @model {
  id: ID!
  name: String!
  posts: [Post] @hasMany
}

type Post @model {
  id: ID!
  title: String!
  blog: Blog! @belongsTo
  comments: [Comment] @hasMany
}

type Comment @model {
  id: ID!
  post: Post! @belongsTo
  content: String!
}

You execute the following code:

import Foundation
import Amplify
import AWSDataStorePlugin
import AWSAPIPlugin

let task = Task {
    do {
        try await AmplifyConf.instance.start()
        try await AmplifyConf.instance.createNewObjects()
        guard try await AmplifyConf.instance.getComments().count > 0 else {
            fatalError("Comments are empty")
        }
        try await AmplifyConf.instance.makeWrongCommentForeignKey()
        try await AmplifyConf.instance.clearAndStart()
        try await Task.sleep(seconds: 2) // wait sync query to finish
        guard try await AmplifyConf.instance.getComments().count > 0 else {
            fatalError("Comments are empty after clearAndStart") // <== we finish here
        }
    } catch {
        print ("Error: \(error)")
    }
}

public class AmplifyConf {
    public static let instance = AmplifyConf()
    private var listenAmplifyHubTask: Task<Void, Never>?

    private let blog = Blog(id: "THE BLOG", name: "blob")
    
    private init() {
    }
    
    func start() async throws {
//        Amplify.Logging.logLevel = .debug
        
        let datastoreConfig = DataStoreConfiguration.custom { error in
            var dumped = String()
            dump(error, to: &dumped)
            print("ERROR \(error)")
        }
        try Amplify.add(plugin: AWSDataStorePlugin(modelRegistration: AmplifyModels(), 
                                                   configuration: datastoreConfig))
        try Amplify.add(plugin: AWSAPIPlugin(modelRegistration: AmplifyModels()))
        try Amplify.configure()
        
        listenAmplifyHub()
        
        try await clearAndStart()
    }
    
    private func listenAmplifyHub() {
        self.listenAmplifyHubTask = Task {
            for await event in Amplify.Hub.publisher(for: .dataStore).values {
                
                var dumped = String()
                dump(event, to: &dumped)
                print("event \(dumped)")
                
//                print("[Amplify HubEvent](dataStore) \(event.eventName)")
                 
            }
        }
    }
    
    public func clearAndStart() async throws {
        try await Amplify.DataStore.clear()
        try await Amplify.DataStore.start()
    }
    
    public func createNewObjects() async throws {
        let dateStr = "\(Date())"
        let post = Post(title: "post-\(dateStr)", blog: blog)
        let comment1 = Comment(post: post, content: "Content1 \(dateStr)")
        let comment2 = Comment(post: post, content: "Content2 \(dateStr)")
        try await Amplify.DataStore.save(blog)
        try await Amplify.DataStore.save(post)
        try await Amplify.DataStore.save(comment1)
        try await Amplify.DataStore.save(comment2)
    }
    
    public func makeWrongCommentForeignKey() async throws {
        guard var comment = try await Amplify.DataStore.query(Comment.self, sort: .descending(Comment.CodingKeys.createdAt), paginate: .firstResult).first else {
            print ("No comment")
            return
        }
        let post = Post(title: "Not Saved", blog: blog)
        comment.setPost(post)
        let _ = try await Amplify.API.mutate(request: .updateMutation(of: comment, version: 1))
    }
    
    public func getComments() async throws -> [Comment] {
        return try await Amplify.DataStore.query(Comment.self)
    }
}

For me the import part is not the details about makeWrongCommentForeignKey but the fact that after a mutation which make some model objects invalids, no model objects are synchronized at all.

The problem come from how SyncComment query response is managed in Amplify Datastore. Here a sample response:

{
  "data": {
    "syncComments": {
      "items": [
        null,
        {
          "_deleted": null,
          "_lastChangedAt": 1697194380978,
          "_version": 1,
          "content": "Content 2023-10-13 10:53:00 +0000",
          "createdAt": "2023-10-13T10:53:00.976Z",
          "id": "30A91BAD-FFE6-4798-9B12-4110C20BE538",
          "post": {
            "id": "30D0339C-DBA0-4F41-A442-D33C12E4CFB1",
            "_deleted": null
          },
          "updatedAt": "2023-10-13T10:53:00.976Z",
          "postCommentsId": "30D0339C-DBA0-4F41-A442-D33C12E4CFB1"
        }
      ]
    }
  },
  "errors": [
    {
      "path": [
        "syncComments",
        "items",
        0,
        "post"
      ],
      "locations": null,
      "message": "Cannot return null for non-nullable type: 'Post' within parent 'Comment' (/syncComments/items[0]/post)"
    }
  ]
}

Hope you have enough information.

@thisisabhash
Copy link
Member

Thank you, we will investigate this and post further updates.

@thisisabhash thisisabhash self-assigned this Jan 10, 2024
@gbitaudeau
Copy link
Author

Hi @thisisabhash ,

In our latest production release, we added more logs to Crashlytics to be able to measure whether or not this bug affects a large number of our users.. An analysis on our data and Crashlytics show us than more than 500 users have the issue.
Whosnext -WhosnextCore_AmplifyCoreError__200 -_Console_Firebase

All affected users have model objects which suffer from a foreign key that targets an object not present in DynamoDB.
They get the following error during initial sync :

▿ DataStoreError: failed to process graphqlResponseData
Caused by:
APIError: failed to process graphqlResponseData
Caused by:
DataStoreError: The key `__typename` was not found
Recovery suggestion: Check if the parsed JSON contains the expected `__typename`
  ▿ api: (2 elements)
    ▿ .0: APIError: failed to process graphqlResponseData
Caused by:
DataStoreError: The key `__typename` was not found
Recovery suggestion: Check if the parsed JSON contains the expected `__typename`
      ▿ operationError: (3 elements)
        - .0: \"failed to process graphqlResponseData\"
        - .1: \"\"
        ▿ .2: Optional(DataStoreError: The key `__typename` was not found
Recovery suggestion: Check if the parsed JSON contains the expected `__typename`)
          ▿ some: DataStoreError: The key `__typename` was not found
Recovery suggestion: Check if the parsed JSON contains the expected `__typename`
            ▿ decodingError: (2 elements)
              - .0: \"The key `__typename` was not found\"
              - .1: \"Check if the parsed JSON contains the expected `__typename`\"
    - .1: nil

This error occurs when sync response contain a content like :

{
  "data": {
    "syncComments": {
      "items": [
        null,
        {
          "_deleted": null,
          "_lastChangedAt": 1697194380978,
          "_version": 1,
          "content": "Content 2023-10-13 10:53:00 +0000",
          "createdAt": "2023-10-13T10:53:00.976Z",
          "id": "30A91BAD-FFE6-4798-9B12-4110C20BE538",
          "post": {
            "id": "30D0339C-DBA0-4F41-A442-D33C12E4CFB1",
            "_deleted": null
          },
          "updatedAt": "2023-10-13T10:53:00.976Z",
          "postCommentsId": "30D0339C-DBA0-4F41-A442-D33C12E4CFB1"
        }
      ]
    }
  },
  "errors": [
    {
      "path": [
        "syncComments",
        "items",
        0,
        "post"
      ],
      "locations": null,
      "message": "Cannot return null for non-nullable type: 'Post' within parent 'Comment' (/syncComments/items[0]/post)"
    }
  ]
}

We investigated to try to find the root cause of why some foreign keys reference missing objects and we were able to find at least 3 possible scenarios (but perhaps there are others ...):

  1. Case of an expired Cognito Token in request authorization header which produce an HTTP error code 401
  2. Case of a network error which sometimes produce an URLError with the code URLError.Code. secureConnectionFailed
  3. Case of a Amplify.DataStore.stop() while the mutations is processed (we think it can happen if application switch from background to foreground and vice versa)

I will describe just after how we were able to reproduce these 3 cases on the schema given here #3259 (comment). I will used the model describe by this schema instead of our real model to simplify.

When one of these error occurs on a create mutation for a post, the post is not saved in server side but is present in local datastore. Restarting the datastore will not send the post mutation. Then if a comment which belongs to this post is created, it is saved in dynamoDB and then sync failed.

To be able to fix sync problem in our application, we took the decision to replace all relations (hasMany , belongsTo , ...) with their foreign keys in the schema.graphql used to generate our model objects. Then we made some logic to compare our own device database with the content of the datastore to create again missing objects in datastore. This is a heavy process and we hope to not lose some data.

Case 1: Expired Cognito Token

We use amazon cognito user pools and all objects have only one auth rule : @auth(rules: [{ allow: owner}]) .
To simulate this error:

  • configure cognito token to be valid only during 5 min
  • start datastore
  • create a post and a comment using:
let post = Post(id:"post-auth", title: "post", blog: blog)
let comment1 = Comment(id:"comment1-auth", post: post, content: "Content-1")
                     
try await Amplify.DataStore.save(post)
try await Amplify.DataStore.save(comment1)
  • use a https proxy to rewrite the authorization header on CreatePost mutation with an expired token (took from a previous request made more than 5min ago). Don't rewrite authorization token on CreateMessage mutation.
  • start the sync engine again using:
try await Amplify.DataStore.stop()
try await Amplify.DataStore.start()
  • the error occurs.

The server return an HTTP error code 401. If it can help, we observed that in method SyncMutationToCloudOperation.getRetryAdviceIfRetryable l261 on

case .httpStatusError(_, let httpURLResponse) where httpURLResponse.statusCode == 401:
            advice = shouldRetryWithDifferentAuthType()

shouldRetryWithDifferentAuthType return a RequestRetryAdvice with shouldRetry set to false.

Case 2: Network secureConnectionFailed error

Simulate this error is quite challenging because the request CreatePost mutation should be stop during the ssl steps and other requests should not be stopped. In our log, this error look like:

Task <C383C0CB-CA8E-4D4B-9218-405A6ABC57CC>.<117> finished with error [-1200] Error Domain=NSURLErrorDomain Code=-1200 "An SSL error has occurred and a secure connection to the server cannot be made." UserInfo={NSErrorFailingURLStringKey=https://c5i3idiurzanjmgvpytfeppfkq.appsync-api.eu-west-3.amazonaws.com/graphql, NSLocalizedRecoverySuggestion=Would you like to connect to the server anyway?, _kCFStreamErrorDomainKey=3, _NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask <C383C0CB-CA8E-4D4B-9218-405A6ABC57CC>.<117>, _NSURLErrorRelatedURLSessionTaskErrorKey=(
    "LocalDataTask <C383C0CB-CA8E-4D4B-9218-405A6ABC57CC>.<117>"
), NSLocalizedDescription=An SSL error has occurred and a secure connection to the server cannot be made., NSErrorFailingURLKey=https://c5i3idiurzanjmgvpytfeppfkq.appsync-api.eu-west-3.amazonaws.com/graphql, NSUnderlyingError=0x283dfd590 {Error Domain=kCFErrorDomainCFNetwork Code=-1200 "(null)" UserInfo={_kCFStreamPropertySSLClientCertificateState=0, _kCFNetworkCFStreamSSLErrorOriginalValue=-9816, _kCFStreamErrorDomainKey=3, _kCFStreamErrorCodeKey=-9816, _NSURLErrorNWPathKey=satisfied (Path is satisfied), viable, interface: en0[802.11], ipv4, dns, proxy, uses wifi}}, _kCFStreamErrorCodeKey=-9816}

Stopping the connection randomly during a request doesn't always produce this error (but we randomly succeeded to get it severals times). To help, we modify RequestRetryablePolicy.determineRetryRequestAdvice with this version who simulate an urlError code not in the list (as secureConnectionFailed):

private func determineRetryRequestAdvice(basedOn urlError: URLError,
                                             attemptNumber: Int) -> RequestRetryAdvice {
//        switch urlError.code {
//        case .notConnectedToInternet,
//             .dnsLookupFailed,
//             .cannotConnectToHost,
//             .cannotFindHost,
//             .timedOut,
//             .dataNotAllowed,
//             .cannotParseResponse,
//             .networkConnectionLost:
//            let waitMillis = retryDelayInMillseconds(for: attemptNumber)
//            return RequestRetryAdvice(shouldRetry: true, retryInterval: .milliseconds(waitMillis))
//        default:
//            break
//        }
        return RequestRetryAdvice(shouldRetry: false)
    }
  • start datastore
  • create a post and a comment using:
let post = Post(id:"post-ne", title: "post", blog: blog)
let comment1 = Comment(id:"comment1-ne", post: post, content: "Content-1")
                     
try await Amplify.DataStore.save(post)
try await Amplify.DataStore.save(comment1)
  • use a https proxy to abort the CreatePost request
  • start the sync engine again using:
try await Amplify.DataStore.stop()
try await Amplify.DataStore.start()
  • the error occurs.

In case 1 and 2, the CreateMessage request finish with the following error:

"▿ DataStoreError: GraphQL service returned a successful response containing errors: [Amplify.GraphQLError(message: \"Cannot return null for non-nullable type: \\\'Post\\\' within parent \\\'Comment\\\' (/createComment/post)\", locations: nil, path: Optional([Amplify.JSONValue.string(\"createComment\"), ....

Case 3: DataStore.stop during mutation

Use the following code:

import Foundation
import Amplify
import AWSDataStorePlugin
import AWSAPIPlugin

let task = Task {
    await AmplifyConf.instance.testStartStopLostMutation()
}

public actor AmplifyConf {
    public static let instance = AmplifyConf()
    private var listenAmplifyHubTask: Task<Void, Never>?
    
    private let blog = Blog(id: "THE BLOG", name: "blob")
    
    private var outboxEmpty = true
    
    private init() {
    }
    
    func start() async throws {
        //Amplify.Logging.logLevel = .debug
        
        let datastoreConfig = DataStoreConfiguration.custom(errorHandler: { error in
            var dumped = String()
            dump(error, to: &dumped)
            print("ERROR \(error)")
        })
        
        try Amplify.add(plugin: AWSDataStorePlugin(modelRegistration: AmplifyModels(),
                                                   configuration: datastoreConfig))
        try Amplify.add(plugin: AWSAPIPlugin(modelRegistration: AmplifyModels()))
        try Amplify.configure()
        
        listenAmplifyHub()
        
//        try await Amplify.DataStore.clear()
        try await Amplify.DataStore.start()
    }
    
    func testStartStopLostMutation() async {
        do {
            try await start()
            print(">>> wait blog")
            let blog = try await subscribeToBlog()
            
            let j = Int16.random(in: 0...Int16.max)
            
            for i in 0...100 {
                print(">>> for n°\(i)")
                let post = Post(id:"post-\(j)-\(i)", title: "post-\(j)-\(i)", blog: blog)
                let comment1 = Comment(id:"comment1-\(j)-\(i)", post: post, content: "Content-1-\(j)-\(i)")
                
                self.outboxEmpty = false
                let task = Task {
                    Task.detached {
                        try await Amplify.DataStore.stop()
                        try await Amplify.DataStore.start()
                    }
                    Task.detached {
                        try await Amplify.DataStore.save(post)
                        try await Amplify.DataStore.save(comment1)
                    }
                }
                let _ = await task.result
                    
                while self.outboxEmpty == false {
                    try await Task.sleep(seconds: 0.01)
                }
            }
        } catch {
            print ("ERROR: \(error)")
        }
    }
    
    // You must hold a reference to your subscription.
    var blogSubscription: AmplifyAsyncThrowingSequence<DataStoreQuerySnapshot<Blog>>?

    // Subscribe
    func subscribeToBlog() async throws -> Blog {
        let blogSubscription = Amplify.DataStore.observeQuery(
            for: Blog.self,
            where: Blog.self.keys.id.eq("THE BLOG")
        )
        
        // hold onto your subscription
        self.blogSubscription = blogSubscription
        // observe new snapshots
        for try await blogSnapshot in blogSubscription {
            if let blog = blogSnapshot.items.first {
                self.blogSubscription = nil
                return blog
            }
        }
        
        fatalError("No blog")
    }

    
    private func listenAmplifyHub() {
        self.listenAmplifyHubTask = Task {
            for await event in Amplify.Hub.publisher(for: .dataStore).stream {
                switch event.eventName {
                case HubPayload.EventName.DataStore.outboxStatus:
                    if let outboxEvent = event.data as? OutboxStatusEvent {
                        self.outboxEmpty = outboxEvent.isEmpty
                        print(">>> outboxEmpty \(outboxEmpty)")
                    }
                default:
                    break
                }
            }
        }
    }
}
  • add a breakpoint inside DataStoreConfiguration.custom(errorHandler
  • launch it on a device
  • when a comment is created but its post is missing, the breakpoint should pause the process

@gbitaudeau
Copy link
Author

I forgot to mention that we use Amplify 2.23.0 for latest production release.

@thisisabhash
Copy link
Member

Hello @gbitaudeau , Thank you for the detailed information on the ticket. We are investigating this issue with priority.

@5d
Copy link
Member

5d commented Jan 17, 2024

Hi @gbitaudeau ,

I've implemented a solution in the 5d/partial-sync branch. Could you please test it out when you have time?

@gbitaudeau
Copy link
Author

Thanks @5d , I will take a look when I can.

Just to resume, there are 3 parts in this problem:

  1. Make the sync engine ready even if there are errors in the sync response (which is the purpose of your branch I think). In our case, as we change the model to remove relations (hasMany, belongsTo, ...), this is not our first priority.
  2. Ensure that mutations that are not passed server-side are not lost. As long as this is the case, Datastore and DynamoDB will continue to diverge and the number of objects with wrong foreign key will increase. This is what we are waiting the most.
  3. Find how we can send again missing objects on the server and cleanup object with an incorrect foreign key that are no longer needed on the server. We've already worked on this and were able to build logic which covers send again missing objects. We can't find a good cleaning solution at the moment. This is our lowest priority.

@lawmicha
Copy link
Member

Hi @gbitaudeau, thanks for your patience and the detailed analysis.

For (1), that is right, #3474 is specifically addressing handling partial responses (response with items and errors), items should be reconciled to the local store. errors should be passed to the errorHandler. It should also handle decoding null items in items successfully.

For (2), we have two PRs which will update the retry determination logic to true to retry the request. the retry handler will retry with exponential back-off to avoid a retry storm.

#3475 - Adds the secureConnectionFailed case as retryable
#3477 - Adds the session expired scenario as retryable

With these two changes, DataStore should retry on the one mutation event indefinitely as long as the retry determination logic returns true.

The server return an HTTP error code 401.

I understand this was to mock an error response from the service but If the server actually returns 401 unauthorized, DataStore will not retry. DataStore does not retry on unauthorized errors since it may never succeed, and block the rest of the items. If the Refresh token is expired and cannot refresh the access token, then we will retry (after this PR is in #3477).

For (3). In the event that a mutation event fails to sync to AppSync and it is determined not to retry, all failure code paths end up calling the errorHandler before deleting the mutation event. The error handler returns an error that can be cast to DataStoreError.api(amplifyError, mutationEvent). Here should provide you a stream of all the mutation events that failed. Would you be able to clean up the local DB from this event using mutationEvent.modelId ?

@thisisabhash
Copy link
Member

We have added the above changes to this branch: https://github.com/aws-amplify/amplify-swift/commits/fix/gh-3259/
Please try it out and let us know.

@lawmicha
Copy link
Member

Hi @gbitaudeau, a question about the repro step in the issue description

Change the owner of the object with the @hasMany relationship.

What steps did you take to change the parent's owner to another? Which schema and operations did you perform on which model?

@gbitaudeau
Copy link
Author

Hi,

Thanks all for all your work and for the version 2.25.5 👍

I made some tests on it, and here what I can say:

Also to give you more insight, we listen HubPayload.EventName.Auth.sessionExpired and when this event occurs, we alway sign out user using await Amplify.Auth.signOut() to be able to sign in him again automatically. Which means that during a short period, we don't have a user signed in but we don't want to loose mutations in Datastore. Maybe there is a better way to manage that ...

What steps did you take to change the parent's owner to another? Which schema and operations did you perform on which model?

I don't know why we got this, but I suspect bugs on our side with the signOut/signIn process explain just before.

For (3). In the event that a mutation event fails to sync to AppSync and it is determined not to retry, all failure code paths end up calling the errorHandler before deleting the mutation event. The error handler returns an error that can be cast to DataStoreError.api(amplifyError, mutationEvent). Here should provide you a stream of all the mutation events that failed. Would you be able to clean up the local DB from this event using mutationEvent.modelId ?

I well get the stream of mutation events which fail, but I don't really know what I can do with it because :

  • I don't want to remove these objects from my local DB otherwise my users will think that my application lost data
  • In case of a create/delete mutation, I don't know how to send them again server side

@lawmicha
Copy link
Member

lawmicha commented Jan 29, 2024

Hi @gbitaudeau, thanks for the reply

unknown, internationalRoamingOff, XXXCertificateXXX, ... Maybe it would be better to invert the logic here

We actually started with this inversed but it caused a retry storm. I think we can explore inversing it in certain places and making it smarter, ie. if the same URLError occurs more than x number of times, then don't retry.

the server actually return a 401 unauthorized (and it's not a mock) but it's just my token which is expired

This is difficult for the client (Datastore) to handle if the server is sending back Unauthorized instead of something like TokenExpired error. I'm looking to see if I can cover most use cases by checking the expiry of the token before we send the request, or checking the expiry if we sent the request and received unauthorized. If it was due to an expired token, then we can keep the mutation event and retry.

we alway sign out user using await Amplify.Auth.signOut()

If you sign the user out, can you also stop DataStore? This will prevent the mutations from being sent unncessarily if you don't expect any requests to be sent with a public auth type like API key. And then start DataStore after the user has signed in successfully?

I well get the stream of mutation events which fail

I'm working on some additional checks to further categorize the mutation event failures as retryable (checking the expiry of the token, and checking if the error is due to the user signed out). However, if on the off chance, even after these changes, if you can actually determine that the mutation is one that you want to keep, you can perform the mutation against AppSync directly though Amplify.API.mutate(request:) API. Keep in mind that doing this requires a very specific selection set to be generated to work with DataStore to receive the corresponding subscription event.

@gbitaudeau
Copy link
Author

If you sign the user out, can you also stop DataStore? This will prevent the mutations from being sent unncessarily if you don't expect any requests to be sent with a public auth type like API key. And then start DataStore after the user has signed in successfully?

Good point, after checking, I don't know why but we forgot to stop the datastore when the user is signed out for whatever reason. I will add this on our side, thanks.

you can perform the mutation against AppSync directly

This could be done, but it might need quite a big amount of work ... May be a nice feature on your side could be either :

  • with DataStoreConfiguration to add a mutationErrorHandler which can force a mutation to be kept by datastore giving a retry delay, a retry priority (if other mutations should be sent before this one will be retried) or something like that
  • persist a "mutation on error" list which can be consult and easily retried

Actually, to "replay" mutation we follow this flow :

  1. clear the DataStore
  2. wait it to be full sync
  3. compare it with our local base and replay all missing creates / updates / deletes again

@lawmicha
Copy link
Member

lawmicha commented Feb 1, 2024

#3487 further improves the retryability determination and has been released in https://github.com/aws-amplify/amplify-swift/releases/tag/2.25.6 Please give this update a try.

@gbitaudeau Could you open a feature request for that? I'll look into flushing out the details. Feel free to reference back to this issue to keep the context. We'd like to resolve this bug since the root cause has been fixed and released.

For replaying the mutation, in flow 3, do you replay all missing creates / updates / deletes against the local database by calling DataStore.save/delete?

@lawmicha lawmicha closed this as completed Feb 1, 2024
@gbitaudeau
Copy link
Author

gbitaudeau commented Feb 1, 2024

Case 3: DataStore.stop during mutation

@lawmicha did you test this scenario ?

@lawmicha
Copy link
Member

lawmicha commented Feb 1, 2024

I didn't test this one, it looks like it simplifies to a sequence of calls in this order that may cause a problem:

  1. save(post) saves to local database, but no mutation event saved perhaps
  2. stop() prevents (1) from saving the mutation event
  3. start() post doesn't sync
  4. save(comment) comment saves to local db, and saves mutation event
  5. comment mutation event is synced to AppSync, and fails due to missing post

Did you have the error log from the errorHandler callback?

cc @thisisabhash / @5d

@lawmicha
Copy link
Member

lawmicha commented Feb 1, 2024

I've re-opened this issue to track Case 3, @thisisabhash had attempted earlier but could not reproduce even with the sample code. If you can reproduce, can you try again on my branch? lawmicha.save-sync-transaction. This assumes the problem is due to the sequence I explained here #3259 (comment) which seems pretty probable. The fix i'm adding in #3492 simply wraps the model write and mutation event write in a transaction

@thisisabhash thisisabhash added the pending-community-response Issue is pending response from the issue requestor label Feb 2, 2024
@gbitaudeau
Copy link
Author

gbitaudeau commented Feb 5, 2024

Hi @lawmicha ,

Using your branch doesn't fix the issue.

Here the log I had (using the sample code for case 3):

Adding plugin: AWSDataStorePlugin.AWSDataStorePlugin)
Adding plugin: <AWSAPIPlugin.AWSAPIPlugin: 0x28320d800>)
Configuring
Configuration: nil
No plugin found for configuration key `awsCognitoAuthPlugin`. Add a plugin for that key.
Configure finished
Initializing database connection: /private/var/mobile/Containers/Data/Application/911A50B0-5F0E-4303-971E-271DA2D82978/Documents/TestAmplifyGBI.db
Setting up 3 models
Setting up 3 models
Checking MutationSyncMetadata records, SQL: select (select count(1) as count from MutationSyncMetadata) as allRecords,
(select count(1) as count from MutationSyncMetadata where id like '%|%') as newKeys
No MutationSyncMetadata migration needed.
Detected ModelSyncMetadata table has syncPredicate column. No migration needed
Initializing database connection: /private/var/mobile/Containers/Data/Application/911A50B0-5F0E-4303-971E-271DA2D82978/Documents/TestAmplifyGBI.db
Setting up 3 models
Setting up 3 models
Checking MutationSyncMetadata records, SQL: select (select count(1) as count from MutationSyncMetadata) as allRecords,
(select count(1) as count from MutationSyncMetadata where id like '%|%') as newKeys
No MutationSyncMetadata migration needed.
Detected ModelSyncMetadata table has syncPredicate column. No migration needed
pauseSubscriptions()
>>> wait blog
pauseMutations()
clearStateOutgoingMutations(storageAdapter:)
[InitializeSubscription] initializeSubscriptions(api:storageAdapter:)
Time to generate snapshot: 0.000429334 seconds. isSynced: false, count: 0
[EA163A79-D68A-4D76-87FB-E4D67A1A67A7] - Try [1/1]
[5FEAC80C-7474-4141-B5A0-9BBD437657A0] - Try [1/1]
[998E0D0C-66C4-4C5D-9BE2-FDAB770FD330] - Try [1/1]
Received subscription: PassthroughSubject
Received subscription: PassthroughSubject
Received subscription: PassthroughSubject
[92201CE6-BC85-4C2F-A761-675A09D6646C] - Try [1/1]
[C306C6AE-51FA-4F85-AA7C-B72B7FC5E116] - Try [1/1]
[C655C65F-03F7-4EEC-86DB-7A51D71931B8] - Try [1/1]
[RealtimeConnectionProvider] Status: notConnected. Connectivity status: satisfied
[3727A837-B933-407B-9C6A-6CE576D50833] - Try [1/1]
[18B216B6-8BA7-4733-A032-7823E723B575] - Try [1/1]
[90D459BC-7DD6-4D2F-BFC4-1A60E719D490] - Try [1/1]
[RealtimeConnectionProvider] WebsocketDidConnect, sending init message
[RealtimeConnectionProvider] Starting stale connection timer for 300.0s
[RealtimeConnectionProvider] received connectionAck
[AppSyncSubscriptionConnection]: Connection connected, start subscription A860D25D-9533-4736-8438-C2D72F6439B8.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 450AE70B-2D82-4886-B909-0275E7B655B4.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 60A01A66-38F1-473B-BA36-405185F28316.
[AppSyncSubscriptionConnection]: Connection connected, start subscription C6034D6D-0804-4832-AAA3-B0DFB698033F.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 594676DA-001A-4845-A2B4-19245E5188AC.
[AppSyncSubscriptionConnection]: Connection connected, start subscription AF4ACADF-31A5-48E4-9157-EBDD40342784.
[AppSyncSubscriptionConnection]: Connection connected, start subscription EC578EB7-493E-4D2A-B3CD-FE709D8ADDAA.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 6B1B54D6-6C5E-4B71-90AB-B19ACA1C8CD6.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 8CF5E170-A1D0-4CC7-B452-CD9A4094C50E.
connectionState now connected
connectionState now connected
connectionState now connected
[InitializeSubscription.6] performInitialSync()
Beginning initial sync
Beginning sync for Blog
[D3F5B793-EC35-4843-8DCD-0DD189D3E24D] - Try [1/1]
Starting query 5D5277B1-0015-4B02-8908-79734082B87C
Starting network task for query 5D5277B1-0015-4B02-8908-79734082B87C
[Operation D3F5B793-EC35-4843-8DCD-0DD189D3E24D] - Success
Beginning sync for Post
[056A7873-B4C4-4625-A507-333C096F9882] - Try [1/1]
Starting query 145F7D42-217E-4B04-A18D-657341C0D1E4
Starting network task for query 145F7D42-217E-4B04-A18D-657341C0D1E4
total time: 0.019735459s
Time to generate snapshot: 0.0 seconds. isSynced: true, count: 1
>>> for n°0
Received completion: finished
[AppSyncSubscriptionConnection] Unsubscribe A860D25D-9533-4736-8438-C2D72F6439B8
[AppSyncSubscriptionConnection] Unsubscribe 6B1B54D6-6C5E-4B71-90AB-B19ACA1C8CD6
[AppSyncSubscriptionConnection] Unsubscribe 594676DA-001A-4845-A2B4-19245E5188AC
Received completion: finished
[AppSyncSubscriptionConnection] Unsubscribe C6034D6D-0804-4832-AAA3-B0DFB698033F
[AppSyncSubscriptionConnection] Unsubscribe 8CF5E170-A1D0-4CC7-B452-CD9A4094C50E
[AppSyncSubscriptionConnection] Unsubscribe AF4ACADF-31A5-48E4-9157-EBDD40342784
Received completion: finished
[AppSyncSubscriptionConnection] Unsubscribe 450AE70B-2D82-4886-B909-0275E7B655B4
[AppSyncSubscriptionConnection] Unsubscribe 60A01A66-38F1-473B-BA36-405185F28316
[AppSyncSubscriptionConnection] Unsubscribe EC578EB7-493E-4D2A-B3CD-FE709D8ADDAA
[RealtimeConnectionProvider] all subscriptions removed, disconnecting websocket connection.
RemoteSyncEngine publisher completed successfully
pauseSubscriptions()
pauseMutations()
clearStateOutgoingMutations(storageAdapter:)
[InitializeSubscription] initializeSubscriptions(api:storageAdapter:)
Received subscription: PassthroughSubject
Received subscription: PassthroughSubject
Received subscription: PassthroughSubject
[FA856293-006F-4965-AF3A-0B3BE817C52F] - Try [1/1]
Time to generate snapshot: 0.002628417 seconds. isSynced: false, count: 1
[AB24D5B9-0CC4-42DC-A080-9C3467F51B40] - Try [1/1]
[618CB455-52CB-4ABD-92FD-2FA935F98965] - Try [1/1]
[23E62CA2-B70C-4BD4-A047-4C5B81A78408] - Try [1/1]
[61731032-52FD-43AD-9448-B348AC0C966C] - Try [1/1]
[9301AAC5-B140-4935-8BC0-DBFD4364C11A] - Try [1/1]
[C2791C5A-8F9F-4D85-977E-ABC5D27E3C84] - Try [1/1]
[9B5E10D6-DD28-47FA-927A-E6DBA88B39D9] - Try [1/1]
[01E89560-4491-4FC7-917E-2A64D0A38ED1] - Try [1/1]
[Operation 056A7873-B4C4-4625-A507-333C096F9882] - Success
Beginning sync for Comment
[37EAA631-6FD8-4F37-B7DE-59627FC56DE4] - Try [1/1]
Starting query A1531A35-DDEF-43D9-A10F-DF61BD45BF96
Starting network task for query A1531A35-DDEF-43D9-A10F-DF61BD45BF96
[RealtimeConnectionProvider] WebsocketDidConnect, sending init message
[RealtimeConnectionProvider] Starting stale connection timer for 300.0s
[RealtimeConnectionProvider] received connectionAck
[AppSyncSubscriptionConnection]: Connection connected, start subscription 6E238B18-A5F2-4D27-9B7A-DF9C73D8A261.
[AppSyncSubscriptionConnection]: Connection connected, start subscription EB802CBD-A887-4470-9809-433B1A74F20C.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 34BA061E-AD3D-4610-996A-E555FCB02348.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 640AF21B-7B80-48B1-A01E-6D3E38850B63.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 50BF263C-BC74-4991-A64B-BB4A389D112C.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 48C61355-6178-4BAA-B12A-ECD4BEE871E0.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 006A75DD-4048-4044-9547-939DF5C07A07.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 735B2AE2-C688-47C5-AB3C-F69F72A5B28C.
[AppSyncSubscriptionConnection]: Connection connected, start subscription FBAC43A9-D363-4AB0-BAB1-2420D1FE52A8.
[Operation 37EAA631-6FD8-4F37-B7DE-59627FC56DE4] - Success
connectionState now connected
connectionState now connected
connectionState now connected
[InitializeSubscription.6] performInitialSync()
Beginning initial sync
Beginning sync for Blog
[77DD18E0-2456-4900-AFF0-8CD365A105CB] - Try [1/1]
Starting query C3208641-D23D-438E-8759-84B0F2F33AFF
Starting network task for query C3208641-D23D-438E-8759-84B0F2F33AFF
[Operation 77DD18E0-2456-4900-AFF0-8CD365A105CB] - Success
enqueue(_:) skipping reconciliation, no models to enqueue.
Beginning sync for Post
Time to generate snapshot: 0.0 seconds. isSynced: true, count: 1
[37F90946-FADC-4754-8C65-591AD39F37F6] - Try [1/1]
Starting query 692D41BC-42FC-4118-A652-453F2B9CEDFC
Starting network task for query 692D41BC-42FC-4118-A652-453F2B9CEDFC
[Operation 37F90946-FADC-4754-8C65-591AD39F37F6] - Success
enqueue(_:) skipping reconciliation, no models to enqueue.
Beginning sync for Comment
[179E9210-E3EC-4968-8C7D-D3795288418F] - Try [1/1]
Starting query 9E91B344-B2F2-4E78-BE0D-C8950E9F4FC9
Starting network task for query 9E91B344-B2F2-4E78-BE0D-C8950E9F4FC9
[Operation 179E9210-E3EC-4968-8C7D-D3795288418F] - Success
enqueue(_:) skipping reconciliation, no models to enqueue.
Successfully finished sync
activateCloudSubscriptions()
startMutationQueue(api:mutationEventPublisher:reconciliationQueue:)
>>> nbStart 1
RemoteSyncEngine SyncEngineActive
>>> outboxEmpty false
>>> outboxEmpty false
sendMutationToCloud(withAuthType:)
Starting mutation 0C953B09-91A0-4175-88FE-21DBDA76F78C
Starting network task for mutation 0C953B09-91A0-4175-88FE-21DBDA76F78C
[CascadeDelete.1] Deleting MutationEvent with identifiers: ["5D6C1A70-0E49-4486-95B7-5993B16E4769"]
[CascadeDelete.3] Local cascade delete of MutationEvent successful!
[CascadeDelete.4] sending a total of
1 delete mutations
total time: 0.016674917s
>>> outboxEmpty false
>>> outboxEmpty false
sendMutationToCloud(withAuthType:)
Starting mutation 1DD25669-84F8-40D3-BAEF-70C6146393C3
Starting network task for mutation 1DD25669-84F8-40D3-BAEF-70C6146393C3
total time: 0.001397542s
[CascadeDelete.1] Deleting MutationEvent with identifiers: ["C18EA7DB-CC92-4F23-97A1-EE2AB91FE455"]
[CascadeDelete.3] Local cascade delete of MutationEvent successful!
[CascadeDelete.4] sending a total of
1 delete mutations
>>> outboxEmpty true
total time: 0.018948125s
>>> for n°1
Received completion: finished
[AppSyncSubscriptionConnection] Unsubscribe 640AF21B-7B80-48B1-A01E-6D3E38850B63
[AppSyncSubscriptionConnection] Unsubscribe 34BA061E-AD3D-4610-996A-E555FCB02348
[AppSyncSubscriptionConnection] Unsubscribe 6E238B18-A5F2-4D27-9B7A-DF9C73D8A261
Received completion: finished
[AppSyncSubscriptionConnection] Unsubscribe 006A75DD-4048-4044-9547-939DF5C07A07
[AppSyncSubscriptionConnection] Unsubscribe 48C61355-6178-4BAA-B12A-ECD4BEE871E0
[AppSyncSubscriptionConnection] Unsubscribe 735B2AE2-C688-47C5-AB3C-F69F72A5B28C
Received completion: finished
[AppSyncSubscriptionConnection] Unsubscribe EB802CBD-A887-4470-9809-433B1A74F20C
[AppSyncSubscriptionConnection] Unsubscribe FBAC43A9-D363-4AB0-BAB1-2420D1FE52A8
[AppSyncSubscriptionConnection] Unsubscribe 50BF263C-BC74-4991-A64B-BB4A389D112C
[RealtimeConnectionProvider] all subscriptions removed, disconnecting websocket connection.
RemoteSyncEngine publisher completed successfully
pauseSubscriptions()
pauseMutations()
clearStateOutgoingMutations(storageAdapter:)
[InitializeSubscription] initializeSubscriptions(api:storageAdapter:)
Received subscription: PassthroughSubject
Received subscription: PassthroughSubject
[1EEA40C8-8FB5-4E71-A310-A217D8061AB7] - Try [1/1]
Received subscription: PassthroughSubject
[0CFCDE7E-00A6-4D2E-9B79-AD35D936C7BE] - Try [1/1]
[7E972998-0A0C-4F7A-9A03-6C705C4DA54F] - Try [1/1]
[67E5FA45-6388-4F3F-8C7E-62D5D7A52F6F] - Try [1/1]
Time to generate snapshot: 0.004691833 seconds. isSynced: false, count: 1
[7DF1EBE5-35F5-440C-99A3-539397C70131] - Try [1/1]
[05EB9855-9143-44F2-9F4A-A4005FB71746] - Try [1/1]
[435E20FA-A056-42C2-A1C5-7A83C76809BC] - Try [1/1]
[19FFA126-6A29-483E-BDBB-9F0D789C70B6] - Try [1/1]
[0A6575E6-230F-443A-BF80-851A96ADA35B] - Try [1/1]
[RealtimeConnectionProvider] WebsocketDidConnect, sending init message
[RealtimeConnectionProvider] Starting stale connection timer for 300.0s
[RealtimeConnectionProvider] received connectionAck
[AppSyncSubscriptionConnection]: Connection connected, start subscription 3E7766DC-0A76-4359-88EC-6023824170FE.
[AppSyncSubscriptionConnection]: Connection connected, start subscription D8BD11A8-FE14-4E51-8274-D982B075DA62.
[AppSyncSubscriptionConnection]: Connection connected, start subscription EF73AC4D-933F-470D-A49E-A48DD7D063EB.
[AppSyncSubscriptionConnection]: Connection connected, start subscription CA01C616-6215-46B6-BF3C-6C746DE4A0E9.
[AppSyncSubscriptionConnection]: Connection connected, start subscription D62FC4A0-E1AE-40F4-A8D9-6FCDBA30FD03.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 753C1652-CE87-4D5D-A38E-17761F157C48.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 7AA57388-BE76-4E11-AFB4-636861597899.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 8086D71B-776E-41A6-86DB-4D16CDFFDD57.
[AppSyncSubscriptionConnection]: Connection connected, start subscription 9F517B57-186D-4950-9F53-35F0D71EF003.
connectionState now connected
connectionState now connected
connectionState now connected
[InitializeSubscription.6] performInitialSync()
Beginning initial sync
Beginning sync for Blog
[2402A7E4-5D54-4E96-8EB5-2957F6D51B8D] - Try [1/1]
Starting query A3D94B28-9EA3-4507-B4D1-BF8B79604613
Starting network task for query A3D94B28-9EA3-4507-B4D1-BF8B79604613
[Operation 2402A7E4-5D54-4E96-8EB5-2957F6D51B8D] - Success
enqueue(_:) skipping reconciliation, no models to enqueue.
Beginning sync for Post
Time to generate snapshot: 0.0 seconds. isSynced: true, count: 1
[A08C93C3-959A-48A0-863B-1577F8CE68CA] - Try [1/1]
Starting query B77928E3-1B93-4B45-90A5-0B78DBD0E56E
Starting network task for query B77928E3-1B93-4B45-90A5-0B78DBD0E56E
[Operation A08C93C3-959A-48A0-863B-1577F8CE68CA] - Success
total time: 0.004153292s
Beginning sync for Comment
[4504C1A5-69CA-4403-A32E-0BED98E28732] - Try [1/1]
Starting query BC55B739-208E-4AC7-97AE-9A4D146C94B4
Starting network task for query BC55B739-208E-4AC7-97AE-9A4D146C94B4
[Operation 4504C1A5-69CA-4403-A32E-0BED98E28732] - Success
total time: 0.00252s
Successfully finished sync
activateCloudSubscriptions()
startMutationQueue(api:mutationEventPublisher:reconciliationQueue:)
>>> nbStart 2
RemoteSyncEngine SyncEngineActive
>>> outboxEmpty false
>>> outboxEmpty false
sendMutationToCloud(withAuthType:)
Starting mutation 2A5300EE-F31C-4961-ADCA-4BDAD988DB54
Starting network task for mutation 2A5300EE-F31C-4961-ADCA-4BDAD988DB54
GraphQLError missing extensions and errorType GraphQLError(message: "Cannot return null for non-nullable type: \'Post\' within parent \'Comment\' (/createComment/post)", locations: nil, path: Optional([Amplify.JSONValue.string("createComment"), Amplify.JSONValue.string("post")]), extensions: nil)

The error handler (in DataStoreConfiguration ) is called with this error as if the mutation for the post never existed:

"▿ DataStoreError: GraphQL service returned a successful response containing errors: [Amplify.GraphQLError(message: \"Cannot return null for non-nullable type: \\\'Post\\\' within parent \\\'Comment\\\' (/createComment/post)\", locations: nil, path: Optional([Amplify.JSONValue.string(\"createComment\"), Amplify.JSONValue.string(\"post\")]), extensions: nil)]\nRecovery suggestion: The list of `GraphQLError` contains service-specific messages\nCaused by:\nGraphQLResponseError<MutationSync<AnyModel>>: GraphQL service returned a successful response containing errors: [Amplify.GraphQLError(message: \"Cannot return null for non-nullable type: \\\'Post\\\' within parent \\\'Comment\\\' (/createComment/post)\", locations: nil, path: Optional([Amplify.JSONValue.string(\"createComment\"), Amplify.JSONValue.string(\"post\")]), extensions: nil)]\nRecovery suggestion: The list of `GraphQLError` contains service-specific messages\n  ▿ api: (2 elements)\n    ▿ .0: GraphQLResponseError<MutationSync<AnyModel>>: GraphQL service returned a successful response containing errors: [Amplify.GraphQLError(message: \"Cannot return null for non-nullable type: \\\'Post\\\' within parent \\\'Comment\\\' (/createComment/post)\", locations: nil, path: Optional([Amplify.JSONValue.string(\"createComment\"), Amplify.JSONValue.string(\"post\")]), extensions: nil)]\nRecovery suggestion: The list of `GraphQLError` contains service-specific messages\n      ▿ error: 1 element\n        ▿ Amplify.GraphQLError\n          - message: \"Cannot return null for non-nullable type: \\\'Post\\\' within parent \\\'Comment\\\' (/createComment/post)\"\n          - locations: nil\n          ▿ path: Optional([Amplify.JSONValue.string(\"createComment\"), Amplify.JSONValue.string(\"post\")])\n            ▿ some: 2 elements\n              ▿ Amplify.JSONValue.string\n                - string: \"createComment\"\n              ▿ Amplify.JSONValue.string\n                - string: \"post\"\n          - extensions: nil\n    ▿ .1: Optional(Amplify.MutationEvent(id: \"AF15AB4B-090F-445C-8667-514AA39BEFD9\", modelId: \"comment1-25836-1\", modelName: \"Comment\", json: \"{\\\"content\\\":\\\"Content-1-25836-1\\\",\\\"createdAt\\\":null,\\\"id\\\":\\\"comment1-25836-1\\\",\\\"post\\\":{\\\"identifiers\\\":[{\\\"name\\\":\\\"id\\\",\\\"value\\\":\\\"post-25836-1\\\"}],\\\"source\\\":\\\"DataStore\\\"},\\\"updatedAt\\\":null}\", mutationType: \"create\", createdAt: Amplify.Temporal.DateTime(foundationDate: 2024-02-05 14:12:58 +0000, timeZone: Optional(GMT (fixed))), version: nil, inProcess: true, graphQLFilterJSON: nil))\n      ▿ some: Amplify.MutationEvent\n        - id: \"AF15AB4B-090F-445C-8667-514AA39BEFD9\"\n        - modelId: \"comment1-25836-1\"\n        - modelName: \"Comment\"\n        - json: \"{\\\"content\\\":\\\"Content-1-25836-1\\\",\\\"createdAt\\\":null,\\\"id\\\":\\\"comment1-25836-1\\\",\\\"post\\\":{\\\"identifiers\\\":[{\\\"name\\\":\\\"id\\\",\\\"value\\\":\\\"post-25836-1\\\"}],\\\"source\\\":\\\"DataStore\\\"},\\\"updatedAt\\\":null}\"\n        - mutationType: \"create\"\n        ▿ createdAt: Amplify.Temporal.DateTime\n          ▿ foundationDate: 2024-02-05 14:12:58 +0000\n            - timeIntervalSinceReferenceDate: 728835178.6199999\n          ▿ timeZone: Optional(GMT (fixed))\n            ▿ some: GMT (fixed)\n              - identifier: \"GMT\"\n              - kind: \"fixed\"\n              ▿ abbreviation: Optional(\"GMT\")\n                - some: \"GMT\"\n              - secondsFromGMT: 0\n              - isDaylightSavingTime: false\n        - version: nil\n        - inProcess: true\n        - graphQLFilterJSON: nil\n"

In this log, the error appears for the first comment, but if I try several times, the comment in error is not always the first.

@github-actions github-actions bot removed the pending-community-response Issue is pending response from the issue requestor label Feb 5, 2024
@lawmicha
Copy link
Member

lawmicha commented Feb 5, 2024

Thanks for the logs @gbitaudeau, I noticed in the sample code, the post and comment objects are created with non-unique identifiers. This could cause some problems with the MutationEvent's incorrectly being generated as a "create" type when it checks the local database, as it doesn't exist yet during the sync. If a MutationEvent of create is sent to AppSync, and the post already exists in DynamoDB, it should fail. Re-using identifiers and failing to sync certain items as "create" mutations doesn't seem to be related to the issue you're seeing since the error is on the response for creating a comment coming back from AppSync. If you don't have a use case for re-using identifiers, I will suggest to still update the code snippet to create new post and comments:

// Code from Case 3 code sample
let post = Post(id:"post-\(j)-\(i)", title: "post-\(j)-\(i)", blog: blog)
let comment1 = Comment(id:"comment1-\(j)-\(i)", post: post, content: "Content-1-\(j)-\(i)")

Add uniqueness to the id generation such as "\(UUID().uuidString)-post-\(j)-\(i)" and "\(UUID().uuidString)"-comment1-\(j)-\(i)". This way we aren't cross contaminating data in tests.

Can you dig into the state of the comment with id comment1-25836-1 which you're seeing an error with by directly reviewing the DynamoDB data and querying for it from AppSync?

@lawmicha
Copy link
Member

lawmicha commented Feb 5, 2024

I was able to reproduce this (with the non-unique ids sample code from Case 3). Here's the state of the object directly querying AppSync.

DataStore's create mutation for

query GetCommentNoPost {
  getComment(id: "comment1-8087-0") {
    _deleted
    _lastChangedAt
    _version
    content
    createdAt
    id
    postCommentsId
    updatedAt
  }
}
{
  "data": {
    "getComment": {
      "_deleted": null,
      "_lastChangedAt": 1707150477530,
      "_version": 1,
      "content": "Content-1-8087-0",
      "createdAt": "2024-02-05T16:27:57.507Z",
      "id": "comment1-8087-0",
      "postCommentsId": "post-8087-0",
      "updatedAt": "2024-02-05T16:27:57.507Z"
    }
  }
}
query GetCommentAndPost {
  getComment(id: "comment1-8087-0") {
    _deleted
    _lastChangedAt
    _version
    content
    createdAt
    id
    postCommentsId
    updatedAt
    post {
      _deleted
      _lastChangedAt
      _version
      blogPostsId
      createdAt
      id
      title
      updatedAt
    }
  }
}
{
  "data": {
    "getComment": null
  },
  "errors": [
    {
      "path": [
        "getComment",
        "post"
      ],
      "locations": null,
      "message": "Cannot return null for non-nullable type: 'Post' within parent 'Comment' (/getComment/post)"
    }
  ]
}
query GetPost {
  getPost(id: "post-8087-0") {
    id
    _deleted
    _lastChangedAt
    _version
    blogPostsId
    createdAt
    title
    updatedAt
  }
}

{
  "data": {
    "getPost": null
  }
}

Based on the state of these objects, it appears the post was not saved in DynamoDB, create mutation for comment was successful in creating a comments that points to a post which does not exist. The second query with the comment and nested post fields error because the call is looking to resolve the post from the Post table in dynamoDB but it does not exist

@gbitaudeau
Copy link
Author

@lawmicha , I understand your point about id uniqueness, but I cleanup everything before each test (local Datastore and DynamoDB). I used let j = Int16.random(in: 0...Int16.max) instead of UUID just as convenience to find corrupted data easily.

I changed the code to use UUID and got the same error:

let postId = "post-\(i)-\(UUID().uuidString)"
let post = Post(id: postId, title: "post-\(i)", blog: blog)
let commentId = "comment-\(i)-\(UUID().uuidString)"
let comment1 = Comment(id: commentId, post: post, content: "Content-\(i)")

Here you can find logs, and a Charles proxy file ( https://www.charlesproxy.com )
missingPost.zip

Here the dynamo table contents:
dynamoDB.zip

@lawmicha
Copy link
Member

lawmicha commented Feb 5, 2024

Oh sorry, I missed that j is pretty random, I'm running the sample and can reproduce the issue

@lawmicha
Copy link
Member

lawmicha commented Feb 5, 2024

Just a short update, when I breakpoint into the error handler for the createComment error, I was able to check the local DB contents using https://sqlitebrowser.org/ and can see that both the post and comment's mutation events are there.

What's unexpected is they both have an InProcess state of true. This means, the post mutation event was picked up (marked as InProcess true) and got interrupted by DataStore.stop. When DataStore.starts, it should start off by marking them all as false before picking up the oldest, which should have been the post MutationEvent. There's a bug in which the comment mutation event is being picked up because the post mutation event is hidden from the query due to it already having InProcess state of true.

@lawmicha
Copy link
Member

lawmicha commented Feb 5, 2024

Updated the branch lawmicha.save-sync-transaction with a fix that requests the next MutationEvent to include in process ones (this will include the post MutationEvent which was interrupted). I've tested this manually using the code sample from Case 3 and no longer see the errorHandler called. @gbitaudeau can you refresh package cache to pick up the latest from #3492, to include this commit 4dfbdaa ?

@gbitaudeau
Copy link
Author

@lawmicha I tested your commit 4dfbdaa but still reproduce the issue.

@lawmicha
Copy link
Member

lawmicha commented Feb 6, 2024

Hey @gbitaudeau, some more testing on my end. I ended up spliting the sample code in Case 3 into three sections

The @main app's init instantiates the plugins and configures Amplify

init() {
        Task {
            try await AmplifyConf.instance.start()
        }
    }
// In AmplifyConf
func start() async throws {
        Amplify.Logging.logLevel = .verbose

        let datastoreConfig = DataStoreConfiguration.custom(errorHandler: { error in
            var dumped = String()
            dump(error, to: &dumped)
            print("[lawmicha] ERROR \(error)")
        })
        try Amplify.add(plugin: AWSDataStorePlugin(modelRegistration: AmplifyModels(),
                                                   configuration: datastoreConfig))
        try Amplify.add(plugin: AWSAPIPlugin(modelRegistration: AmplifyModels()))
        try Amplify.configure()
    }

A button to start DataStore without running the test:

Button("Start") {
      Task {
          try await Amplify.DataStore.start()
      }
  }

and the test itself:

Button("Test") {
      Task {
          await AmplifyConf.instance.testStartStopLostMutation()
      }
  }
// Updated `testStartStopLostMutation()
func testStartStopLostMutation() async {
        do {
            print(">>> wait blog")
            let blog = try await subscribeToBlog()
            let j = Int16.random(in: 0...Int16.max)
            for i in 0...100 {
                print(">>> for n°\(i)")
                let post = Post(id:"post-\(j)-\(i)", title: "post-\(j)-\(i)", blog: blog)
                let comment1 = Comment(id:"comment1-\(j)-\(i)", post: post, content: "Content-1-\(j)-\(i)")

                let task = Task {
                    Task.detached {
                        try await Amplify.DataStore.stop()
                        try await Amplify.DataStore.start()
                    }
                    Task.detached {
                        try await Amplify.DataStore.save(post)
                        try await Amplify.DataStore.save(comment1)
                    }
                }
                let _ = await task.result
            }
        } catch {
            print ("ERROR: \(error)")
        }
    }

In the code above, i removed the wait for outbox to be empty so it should actually be a greater stress test.

Just starting DataStore, without clearing out DynamoDB, will get some errors back on the errorHandler because of some of my previous runs were against the code with the bug, so there are a bunch of comments pointing to non-existent posts for me. This data can be cleaned up so it doesn't get pulled in anymore.

Can you confirm the error is coming back is during create mutations, and not during a syncComments call? You'll be able to see the difference from the path in the error like Cannot return null for non-nullable type: \'Post\' within parent \'Comment\' (/syncComments/items[56]/post) or (/createComment/post). If you didn't clean up the DynamoDB table this run then it could get partial responses, that is a list of comments that will get reconciled to the local database and errors on each comment that references to an non-existent post, which are sent back on the errorHandler.

Running the "Test" button is giving me a new error with ConditionalCheckFailed, which is due to duplicate mutation events processing. The good news is that data integrity is not impacted by this, since the second create mutation will fail, while a hypothetical update mutation may cause an additional version bump and appear to refresh if it's backing some UI. I'm going to look into seeing how DataStore can properly cancel the SyncMutationToCloudOperation when DataStore is stopped to prevent the duplicate event processing.

@gbitaudeau
Copy link
Author

Just starting DataStore, without clearing out DynamoDB, will get some errors back on the errorHandler because of some of my previous runs were against the code with the bug, so there are a bunch of comments pointing to non-existent posts for me. This data can be cleaned up so it doesn't get pulled in anymore.

That's why I cleanup everything before each test ;)

Can you confirm the error is coming back is during create mutations, and not during a syncComments call?

Yes I can confirm this point: the error occurs on a createComment mutation.

About ConditionalCheckFailed: this error is not a problem on our side because it happens quite often: we observe it as soon as a mutation request is retried although it had previously been interrupted (by a network issue as instance) while being well executed on the server in the same time.

@lawmicha
Copy link
Member

lawmicha commented Feb 6, 2024

Other than ConditionalCheckFailed errors on createComment mutation, are you experiencing any other errors with the latest on lawmicha.save-sync-transaction?

@gbitaudeau
Copy link
Author

Sorry if I wasn't clear enough but with latest lawmicha.save-sync-transaction , I still get :

GraphQLError missing extensions and errorType GraphQLError(message: "Cannot return null for non-nullable type: \'Post\' within parent \'Comment\' (/createComment/post)", locations: nil, path: Optional([Amplify.JSONValue.string("createComment"), Amplify.JSONValue.string("post")]), extensions: nil)

@gbitaudeau
Copy link
Author

If this can help, when I breakpoint into the error handler for the createComment error, like you, I was able to check the local DB contents using https://sqlitebrowser.org/ and can see that only the comment's mutation events is there.

@lawmicha
Copy link
Member

lawmicha commented Feb 6, 2024

Thanks for the details, I added back the wait for outbox code to match exactly Case 3, and can see that one of the post's mutation event isn't being created

save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Post(id: "post-9377-0", title: "post-9377-0", _blog: Amplify.LazyReference<amplify3259.Blog>, comments: Optional(Amplify.List<amplify3259.Comment>), createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Comment(id: "comment1-9377-0", _post: Amplify.LazyReference<amplify3259.Post>, content: "Content-1-9377-0", createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Post(id: "post-9377-1", title: "post-9377-1", _blog: Amplify.LazyReference<amplify3259.Blog>, comments: Optional(Amplify.List<amplify3259.Comment>), createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Comment(id: "comment1-9377-1", _post: Amplify.LazyReference<amplify3259.Post>, content: "Content-1-9377-1", createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Post(id: "post-9377-2", title: "post-9377-2", _blog: Amplify.LazyReference<amplify3259.Blog>, comments: Optional(Amplify.List<amplify3259.Comment>), createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Comment(id: "comment1-9377-2", _post: Amplify.LazyReference<amplify3259.Post>, content: "Content-1-9377-2", createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Post(id: "post-9377-3", title: "post-9377-3", _blog: Amplify.LazyReference<amplify3259.Blog>, comments: Optional(Amplify.List<amplify3259.Comment>), createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Comment(id: "comment1-9377-3", _post: Amplify.LazyReference<amplify3259.Post>, content: "Content-1-9377-3", createdAt: nil, updatedAt: nil)
save(_:modelSchema:condition:eagerLoad:completion:) syncing mutation for Comment(id: "comment1-9377-4", _post: Amplify.LazyReference<amplify3259.Post>, content: "Content-1-9377-4", createdAt: nil, updatedAt: nil)

In this instance, comment1-9377-4's post's mutation event was not created, despite the SQL transaction. The save for post-9377-4 was successful but did not enqueue the post's mutation event.

@lawmicha
Copy link
Member

lawmicha commented Feb 6, 2024

Hey @gbitaudeau, I've updated the PR with some changes to actually do what I had intended it to do. DataStore.save(post) will fail if DataStore.stop() interrupts it between writing the post model and the post mutation event. Internally they're two different SQL prepared statements, and so we're wrapping it under a transaction and throwing to roll back the first write.

This will result in a failure thrown when saving the post, so it won't continue onward to save the comment. In the sample code, I've updated to catch and log this:

Task.detached {
      do {
          try await Amplify.DataStore.save(post)
          try await Amplify.DataStore.save(comment1)
      } catch {
          print("Failed to save: \(error)")
      }
  }

My breakpoint on the errorHandler is no longer getting called since the comment never gets saved after the failure from saving the post.

Please go ahead and try the latest from lawmicha.save-sync-transaction

@gbitaudeau
Copy link
Author

Hi @lawmicha , all tests are ok on my side, failure on save is nice in this case. Thank you very much.

During this test I can see that there are memory leaks on web sockets: each DataStore.start() add new web sockets but old ones remain active. In my mind, we should use Starscream 4.0.6 which include this fix: daltoniam/Starscream@d2a9549
Also I found a small leak in ModelSyncedEventEmitter, you can find a patch for this here:
memoryLeak.patch

This memory leak might be better tracked by another issue but as the code to easily reproduce it is here, I let you organise this as you want.

@lawmicha
Copy link
Member

lawmicha commented Feb 9, 2024

Thanks for confirming your tests are working. I'll keep this issue open until my PR is merged and released. In terms of the Starscream dependency and memory leak patch, i'll create two new issues to track it, thanks for the details!

@lawmicha
Copy link
Member

The fix in #3492 has been released in 2.26.3 https://github.com/aws-amplify/amplify-swift/releases/tag/2.26.3

Copy link
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working datastore Issues related to the DataStore category
Projects
None yet
Development

No branches or pull requests

4 participants