Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: cancelling MoveTables errors with 'cannot remove tables since one or more do not exist in the denylist' #15963

Closed
wiebeytec opened this issue May 17, 2024 · 11 comments · Fixed by #15977

Comments

@wiebeytec
Copy link

wiebeytec commented May 17, 2024

Overview of the Issue

(Edit with synopsis: it seems to delete the table while the stream is still running.)

I was trying a MoveTables on a 2 TB table. It had been running for a few hours. Then I cancelled it, which gave this error:

$ ./vtctldclient MoveTables cancel --workflow hourlies  --target-keyspace sites2023 
E0516 06:03:46.566627 1566808 main.go:56] rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist

Running vtctldclient MoveTables status --workflow hourlies --target-keyspace sites2023 would say it's in error. I had to remove it using Workflow delete with --keep-data (because otherwise it wouldn't remove because the source table was already gone):

./vtctldclient Workflow delete --workflow hourlies --keyspace sites2023 --keep-data

So to be clear, I did not remove any tables manually. There may be some race condition in checking/clearing the deny list and removing the table?

I later saw this in the log:

utils.go:218] Got unrecoverable error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist

That suggests it dropped the table while the stream was still running.

Reproduction Steps

Try to cancel a MoveTables of a large table (that may have been running for a while), like:

./vtctldclient MoveTables cancel --workflow hourlies  --target-keyspace sites2023 

Binary Version

# vtgate --version
vtgate version Version: 19.0.3 (Git revision cb5464edf5d7075feae744f3580f8bc626d185aa branch 'HEAD') built on Thu Apr  4 12:18:41 UTC 2024 by runner@fv-az1543-228 using go1.22.2 linux/amd64

Operating System and Environment details

Ubuntu 22.04

Log Fragments

mei 16 06:03:52 vitess-mysql-a1 start_vttablet[585]: E0516 06:03:52.597190     585 utils.go:218] Got unrecoverable error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values DELETED-VERY_LONG
mei 16 06:03:52 vitess-mysql-a1 start_vttablet[585]: E0516 06:03:52.647956     585 controller.go:272] INTERNAL: unable to setState() in controller: could not insert into log table: insert into _vt.vreplication_log(vrepl_id, type, state, message) values(15, 'State Changed', 'Error', 'task error: failed inserting rows: Table \'vt_sites2023.logDataHourlyDeltas\' doesn\'t exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values DELETED-VERY_LONG
mei 16 06:03:52 vitess-mysql-a1 start_vttablet[585]: E0516 06:03:52.669643     585 dbclient.go:127] error in stream 15, will retry after 5s: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values DELTED_VERY_LONG
mei 16 06:09:58 vitess-mysql-a1 start_vttablet[585]: W0516 06:09:58.185446     585 rpc_server.go:72] TabletManager.ExecuteFetchAsDba(query:"drop table `vt_sites2023`.`logDataHourlyDeltas`" db_name:"vt_sites2023" max_rows:1 disable_foreign_key_checks:true)(on vic-eu-central-1-fakelive-0000000300 from ) error: rpc error: code = Unknown desc = Unknown table 'vt_sites2023.logDataHourlyDeltas' (errno 1051) (sqlstate 42S02) during query: drop table `vt_sites2023`.`logDataHourlyDeltas`
mei 16 06:11:25 vitess-mysql-a1 start_vttablet[585]: W0516 06:11:25.437655     585 rpc_server.go:72] TabletManager.ExecuteFetchAsDba(query:"drop table `vt_sites2023`.`logDataHourlyDeltas`" db_name:"vt_sites2023" max_rows:1 disable_foreign_key_checks:true)(on vic-eu-central-1-fakelive-0000000300 from ) error: rpc error: code = Unknown desc = Unknown table 'vt_sites2023.logDataHourlyDeltas' (errno 1051) (sqlstate 42S02) during query: drop table `vt_sites2023`.`logDataHourlyDeltas`
mei 16 06:11:49 vitess-mysql-a1 start_vttablet[585]: W0516 06:11:49.075837     585 rpc_server.go:72] TabletManager.ExecuteFetchAsDba(query:"drop table `vt_sites2023`.`logDataHourlyDeltas`" db_name:"vt_sites2023" max_rows:1 disable_foreign_key_checks:true)(on vic-eu-central-1-fakelive-0000000300 from ) error: rpc error: code = Unknown desc = Unknown table 'vt_sites2023.logDataHourlyDeltas' (errno 1051) (sqlstate 42S02) during query: drop table `vt_sites2023`.`logDataHourlyDeltas`
@wiebeytec wiebeytec added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels May 17, 2024
@mattlord
Copy link
Contributor

I’m afraid that I will have to mark this as can’t repeat as there’s no test case. Canceling a MoveTables workflow on a large table is something that happens all the time. So to me it would seem that there’s more to it than that.

You didn’t provide enough detail to try and determine what happened in your case either.

Someone or something dropped the logDataHourlyDeltas table in the middle of the import. As you saw, there was an insert error showing that before you even tried to cancel the workflow, no?

We could turn this into a FR / bug about allowing cancel and cleaning up when the table is already gone. That’s a known issue as you saw.

Please let me know what you would like to do. Thanks!

@mattlord mattlord added Can't Repeat and removed Needs Triage This issue needs to be correctly labelled and triaged labels May 17, 2024
@mattlord mattlord self-assigned this May 17, 2024
@wiebeytec
Copy link
Author

wiebeytec commented May 17, 2024

Someone or something dropped the logDataHourlyDeltas table in the middle of the import. As you saw, there was an insert error showing that before you even tried to cancel the workflow, no?

I'm positive nobody deleted that table. And I tried cancelling it several times, which may explain the the cancel line below that.

I can reproduce it, but only about 70% of the time (the --auto-start=false turned out not to be important):

$ ./vtctldclient MoveTables create --auto-start=false --workflow hourlies --source-keyspace legacy2023 --target-keyspace sites2023 --tables logDataHourlyDeltas
The following vreplication streams exist for workflow sites2023.hourlies:                                             
                                                                                                                      
id=16 on sites2023/vic-eu-central-1-fakelive-301: Status: Stopped. VStream has not started.                                                                                                                                                 
id=16 on sites2023/vic-eu-central-1-fakelive-300: Status: Stopped. VStream has not started.                           
id=16 on sites2023/vic-eu-central-1-fakelive-303: Status: Stopped. VStream has not started.                           
id=16 on sites2023/vic-eu-central-1-fakelive-302: Status: Stopped. VStream has not started.                           
                                                                                                                      
Traffic State: Reads Not Switched. Writes Not Switched

Then:

$ ./vtctldclient MoveTables start --workflow hourlies  --target-keyspace sites2023                                    
{                                                                                                                     
  "summary": "Successfully updated the hourlies workflow on (4) target primary tablets in the sites2023 keyspace",    
  "details": [                                                                                                        
    {                                                                                                                 
      "tablet": {                                                                                                     
        "cell": "vic-eu-central-1-fakelive",                                                                          
        "uid": 300                                                                                                    
      },                                                                                                                                                                                                                                    
      "changed": true                                                                                                 
    },                                                                                                                                                                                                                                      
    {                                                                                                                 
      "tablet": {                                                                                                                                                                                                                           
        "cell": "vic-eu-central-1-fakelive",                                                                          
        "uid": 301                                                                                                    
      },                                                                                                              
      "changed": true                                                                                                 
    },                                                                                                                
    {                                                                                                                                                                                                                                       
      "tablet": {                                                                                                     
        "cell": "vic-eu-central-1-fakelive",                                                                                                                                                                                                
        "uid": 302                                                                                                    
      },                                                                                                              
      "changed": true                                                                                                 
    },                                                                                                                
    {                                                                                                                 
      "tablet": {                                                                                                     
        "cell": "vic-eu-central-1-fakelive",                                                                          
        "uid": 303                                                                                                    
      },                                                                                                              
      "changed": true                                                                                                 
    }                                                                                                                 
  ]                                                                                                                   
}

Then:

$ ./vtctldclient MoveTables status --workflow hourlies  --target-keyspace sites2023
The following vreplication streams exist for workflow sites2023.hourlies:

id=16 on sites2023/vic-eu-central-1-fakelive-303: Status: Copying. VStream Lag: 0s.
id=16 on sites2023/vic-eu-central-1-fakelive-302: Status: Copying. VStream Lag: 0s.
id=16 on sites2023/vic-eu-central-1-fakelive-301: Status: Copying. VStream Lag: 0s.
id=16 on sites2023/vic-eu-central-1-fakelive-300: Status: Copying. VStream Lag: 0s.

Traffic State: Reads Not Switched. Writes Not Switched
$ ./vtctldclient MoveTables cancel --workflow hourlies  --target-keyspace sites2023
E0517 06:30:45.734471 1574458 main.go:56] rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist

This is the vttablet log of only the one cancel operation (lines cut to 500 chars length):

mei 17 06:45:09 vitess-mysql-a1 start_vttablet[583]: E0517 06:45:09.954921     583 utils.go:218] Got unrecoverable error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (975,1595223000,104,0,0.273064,1595223926), (975,1595223000,105,0,0.0910206,1595223926), (975,1595223000,341,11,0.364084,1595223926), (975,1595223000,533,0,903,1595223926), (975,1595223900,101,0,0.0910225,1595224811), (975,1595223900,

mei 17 06:45:09 vitess-mysql-a1 start_vttablet[583]: E0517 06:45:09.963547     583 controller.go:272] INTERNAL: unable to setState() in controller: could not insert into log table: insert into _vt.vreplication_log(vrepl_id, type, state, message) values(25, 'State Changed', 'Error', 'task error: failed inserting rows: Table \'vt_sites2023.logDataHourlyDeltas\' doesn\'t exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (975,1595223000,104,0,0.273064,1595223926), (975,1595223000,105,0,0.0

mei 17 06:45:09 vitess-mysql-a1 start_vttablet[583]: E0517 06:45:09.966850     583 dbclient.go:127] error in stream 25, will retry after 5s: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (975,1595223000,104,0,0.273064,1595223926), (975,1595223000,105,0,0.0910206,1595223926), (975,1595223000,341,11,0.364084,1595223926), (975,1595223000,533,0,903,1595223926), (975,1595223900,101,0,0.0910225,1595224811

@mattlord
Copy link
Contributor

Thank you for the additional info, @wiebeytec ! I'll have to do some investigating and testing (on main) then to see if I can come up with any explanation and a test case. Something quite odd and unexpected is certainly happening in your case — I just don't know what and will have to try and find out.

This is the vttablet log of only the one cancel operation (lines cut to 500 chars length):

Did you see this on all of that tablets though?

@wiebeytec
Copy link
Author

Did you see this on all of that tablets though?

I just performed the operation in a loop 20 times. It failed 20 times, and it was 100% consistent that one shard/tablet did not produce the "table doesn't exist" error upon cancel, and the other three did. In the above list, vic-eu-central-1-fakelive-0000000303 was always fine. This is the last tablet of shard sites2023/cc-

I also looked with show tables on vic-eu-central-1-fakelive-0000000303 that the table indeed kept coming and going.

@mattlord
Copy link
Contributor

I think I see the problem, shown with this patch here:

diff --git a/go/vt/vtctl/workflow/server.go b/go/vt/vtctl/workflow/server.go
index 17b01736a7..1bac98f55c 100644
--- a/go/vt/vtctl/workflow/server.go
+++ b/go/vt/vtctl/workflow/server.go
@@ -1970,14 +1970,6 @@ func (s *Server) WorkflowDelete(ctx context.Context, req *vtctldatapb.WorkflowDe
        span.Annotate("keep_routing_rules", req.KeepRoutingRules)
        span.Annotate("shards", req.Shards)

-       // Cleanup related data and artifacts.
-       if _, err := s.DropTargets(ctx, req.Keyspace, req.Workflow, req.KeepData, req.KeepRoutingRules, false); err != nil {
-               if topo.IsErrType(err, topo.NoNode) {
-                       return nil, vterrors.Wrapf(err, "%s keyspace does not exist", req.Keyspace)
-               }
-               return nil, err
-       }
-
        deleteReq := &tabletmanagerdatapb.DeleteVReplicationWorkflowRequest{
                Workflow: req.Workflow,
        }
@@ -2002,6 +1994,14 @@ func (s *Server) WorkflowDelete(ctx context.Context, req *vtctldatapb.WorkflowDe
                return nil, vterrors.Errorf(vtrpcpb.Code_FAILED_PRECONDITION, "the %s workflow does not exist in the %s keyspace", req.Workflow, req.Keyspace)
        }

+       // Cleanup related data and artifacts.
+       if _, err := s.DropTargets(ctx, req.Keyspace, req.Workflow, req.KeepData, req.KeepRoutingRules, false); err != nil {
+               if topo.IsErrType(err, topo.NoNode) {
+                       return nil, vterrors.Wrapf(err, "%s keyspace does not exist", req.Keyspace)
+               }
+               return nil, err
+       }
+
        response := &vtctldatapb.WorkflowDeleteResponse{}
        response.Summary = fmt.Sprintf("Successfully cancelled the %s workflow in the %s keyspace", req.Workflow, req.Keyspace)
        details := make([]*vtctldatapb.WorkflowDeleteResponse_TabletInfo, 0, len(res))

We're removing the tables and other related artifacts before removing the workflow, so there's a race there that you would see when there's a high write rate in the stream.

If you don't mind, would you mind doing the cancel with the legacy vtctlclient MoveTables (no d) command? The vtctldclient implementation is new in v18+ and that's where the issue lies so that should work fine and also points to the cause above. If not, no worries. This is almost certainly the cause and I will work up a test case this week.

@wiebeytec
Copy link
Author

Great to read you found something.

However, when I try it with vtctlclient, it also errors:

vtctlclient --server 127.0.0.1:15999 MoveTables cancel sites2023.hourlies

E0521 02:12:46.851596 1603627 main.go:96] E0521 00:12:46.851158 vtctl.go:2196] 
cannot remove tables since one or more do not exist in the denylist

The following vreplication streams exist for workflow sites2023.hourlies:

id=87 on -40/vic-eu-central-1-fakelive-0000000300: Status: Error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (2441,1555932600,103,0,1.2,1555933525), etc
id=87 on 40-80/vic-eu-central-1-fakelive-0000000301: Status: Error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (2349,1637340300,101,0,0.109238,1637341230), etc
id=87 on 80-cc/vic-eu-central-1-fakelive-0000000302: Status: Error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (2212,1616848200,533,0,901,1616849106),etc
id=87 on cc-/vic-eu-central-1-fakelive-0000000303: Status: Error: task error: failed inserting rows: Table 'vt_sites2023.logDataHourlyDeltas' doesn't exist (errno 1146) (sqlstate 42S02) during query: insert into logDataHourlyDeltas(idSite,`timestamp`,idDataAttribute,instance,valueFloat,lastUpdate) values (2330,1576764900,104,0,0.01,1576765822), etc

MoveTables Error: rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist
E0521 02:12:46.888985 1603627 main.go:105] remote error: rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist

@mattlord
Copy link
Contributor

mattlord commented May 21, 2024

@wiebeytec was that a workflow that you also created with vtctlclient? I wonder if the underlying problem happened earlier on (not putting a denied table entry in place) and we only saw it on cancel.

@wiebeytec
Copy link
Author

wiebeytec commented May 22, 2024

It wasn't, but you do have a point: whether I use vtctlclient or vtctldclient, the denied_tables is not updated. Thinking back on it, I noticed this yesterday when I was moving two more tables.

Currently, this is the list:

$ ./vtctldclient GetShard legacy2023/0
{
  "keyspace": "legacy2023",
  "name": "0",
  "shard": {
    "primary_alias": {
      "cell": "vic-eu-central-1-fakelive",
      "uid": 600
    },
    "primary_term_start_time": {
      "seconds": "1714006345",
      "nanoseconds": 596022682
    },
    "key_range": null,
    "source_shards": [],
    "tablet_controls": [
      {
        "tablet_type": "PRIMARY",
        "cells": [],
        "denied_tables": [
          "alarmLog",
          "lastLogData",
          "sites",
          "users2sites"
        ],
        "frozen": false
      }
    ],
    "is_primary_serving": true
  }
}

But, there are actually two more tables I moved yesterday. And the logDataHourlyDeltas is also not appearing.

The four that are there now, were put there with a single MoveTables workflow.

I looked in the logs for vttablet on the shards and legacy2023, but no mention about the denied_tables. Just some stuff about vreplication and streaming rows.

@mattlord
Copy link
Contributor

Thank you again, @wiebeytec ! I see what the issue is and I've updated the PR description to reflect the changes that will be made: #15977

@wiebeytec
Copy link
Author

👍

Will placing the table in the deny list be bypassed with the routing rules? Otherwise table access is blocked.

@mattlord
Copy link
Contributor

mattlord commented May 23, 2024

Will placing the table in the deny list be bypassed with the routing rules? Otherwise table access is blocked.

The table should only be in the deny list on the keyspace where queries for the table are NOT being routed. They are an additional safety mechanism to prevent queries against the table from being served from the wrong keyspace / side of the workflow if you e.g. use shard targeting in VTGate or query the VTTablet directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants