
tcmu-runner auto-restarts on timeout and targets cannot be found #665

Open
YMhuan opened this issue Jul 26, 2021 · 7 comments

Comments

@YMhuan

YMhuan commented Jul 26, 2021

Hello!

  • tcmu version: tcmu-runner-1.4.0

  • Linux version & kernel version: CentOS 7.7.1908, kernel 3.10.0-1127

  • Ceph version: 13.2.8 (Mimic)
    272 OSDs across 8 storage nodes
    32 RBD volumes (LUNs) across 8 clients; 8 gateways (each client logs in to 1 gateway); 1 target

  • Cluster IO:
    650 MB/s, 1k IOPS write

  1. What I did
    Simulated an inactive PG while watching the resulting changes in cluster IO and the iSCSI LUNs.

  2. What I expected to see
    LUN times out -> disconnects -> re-login
    But what actually happened was:
    (1) Some tcmu-runner services seem to have restarted after the timeout (4 of the 8 tcmu-runner services were restarted).
    (screenshot omitted)

    (2) Cannot discover targets using the gateway IP:port (see the example below).

e.g.:
vm-node11 (the client) was connected to 66.66.66.2:3260, but after tcmu-runner auto-restarted, the VM can't discover targets with the command: iscsiadm -m discovery -t sendtargets -p 66.66.66.2:3260
Command output on the client:
(screenshot omitted: no targets found)

  • Appendix:
Last 25 lines of the tcmu-runner log at 66.66.66.2:

25970 2021-07-26 14:47:10.815 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25971 2021-07-26 14:47:10.826 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25972 2021-07-26 14:47:10.840 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25973 2021-07-26 14:47:10.840 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-1: Starting lock acquisition operation.
25974 2021-07-26 14:47:10.848 1825526 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-11-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
25975 2021-07-26 14:47:10.849 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-2: Starting lock acquisition operation.
25976 2021-07-26 14:47:10.857 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-3: Starting lock acquisition operation.
25977 2021-07-26 14:47:10.865 1825526 [INFO] alua_implicit_transition:569 rbd/rbd.xxx30T-11-4: Starting lock acquisition operation.
25978 2021-07-26 14:47:10.921 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-1: Acquired exclusive lock.
25979 2021-07-26 14:47:10.927 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-2: Acquired exclusive lock.
25980 2021-07-26 14:47:10.997 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-4: Acquired exclusive lock.
25981 2021-07-26 14:47:11.013 1825526 [WARN] tcmu_rbd_lock:757 rbd/rbd.xxx30T-11-3: Acquired exclusive lock.
25982 2021-07-26 14:53:21.238 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25983 2021-07-26 14:53:21.238 1825526 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-11-3: Handler connection lost (lock state 1)
25984 2021-07-26 14:53:21.239 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25985 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25986 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25987 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25988 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25989 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25990 2021-07-26 14:53:38.087 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25991 2021-07-26 14:53:38.093 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
25992 2021-07-26 14:53:38.097 1825526 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.fixxxxxxe.iscsi-gw:xxxtor/tpgt_7.
25993 2021-07-26 14:54:41.430 1832626 [INFO] dyn_config_start:422: Inotify is watching "/etc/tcmu/tcmu.conf", wd: 1, mask: IN_ALL_EVENTS
LOG ENDS HERE (no further entries for 3 hours).

END
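Note: the log shows tpgt_7 being disabled at 14:53:38 and a new tcmu-runner PID (1832626) appearing at 14:54:41, with no matching "Enabled" line afterwards — unlike the healthy node's log later in this thread. One way to check on the gateway whether the target portal group is still disabled, assuming the standard LIO configfs layout (<target_iqn> is a placeholder for the real IQN):

cat /sys/kernel/config/target/iscsi/<target_iqn>/tpgt_7/enable   # 1 = enabled, 0 = still disabled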

Thank you for reading~

@YMhuan
Author

YMhuan commented Jul 26, 2021

After I manually restarted tcmu-runner & rbd-target-api on 66.66.66.2, it was fixed.
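For reference, the manual recovery amounts to restarting the two services (standard unit names from tcmu-runner and ceph-iscsi; sketched here for completeness):

systemctl restart tcmu-runner
systemctl restart rbd-target-api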

@lxbsz
Collaborator

lxbsz commented Jul 26, 2021

What are your tcmu-runner.service settings? Does systemd restart it after something happens? Do you have any logs about why tcmu-runner crashed?

> 25989 2021-07-26 14:53:21.240 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
> 25990 2021-07-26 14:53:38.087 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.
> 25991 2021-07-26 14:53:38.093 1825526 [ERROR] tcmu_rbd_handle_timedout_cmd:992 rbd/rbd.xxx30T-11-3: Timing out cmd.

The rbd handler sets rados_osd_op_timeout to 30 by default, so requests to the Ceph cluster will time out if an OSD doesn't respond to tcmu-runner within 30 seconds. Have you checked the Ceph logs? Are there any errors?
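Some standard Ceph-side checks that would surface stuck PGs or slow OSD requests during the test window (a suggestion, not from the original thread):

ceph health detail                                   # shows which PGs are inactive and why
ceph pg dump_stuck inactive                          # lists PGs stuck in an inactive state
grep -i 'slow request' /var/log/ceph/ceph-osd.*.log  # OSD-side evidence of stalled ops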

@YMhuan
Author

YMhuan commented Jul 26, 2021

> What are your tcmu-runner.service settings? Does systemd restart it after something happens? Do you have any logs about why tcmu-runner crashed?
>
> [...]
>
> Have you checked the Ceph logs? Are there any errors?

The settings of the tcmu-runner service are as follows:

[Unit]
Description=LIO Userspace-passthrough daemon
After=network.target

[Service]
LimitNOFILE=1000000
Type=dbus
BusName=org.kernel.TCMUService1
KillMode=process
ExecStart=/usr/bin/tcmu-runner

[Install]
WantedBy=multi-user.target

I didn't find any log about why tcmu-runner restarted in /var/log/tcmu-runner.log. What I pasted in the issue is all of the logs from my test.

In the Ceph cluster I got one inactive PG.
On other nodes (where tcmu-runner did not auto-restart), I found logs like this:

2021-07-26 14:47:36.555 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 131072, data offset: 7621352292352, elapsed_usec: 44276
2021-07-26 14:47:36.571 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 7621342973952, elapsed_usec: 14293
2021-07-26 14:47:36.613 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-2: Timing out cmd.
2021-07-26 14:47:36.613 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-2: Handler connection lost (lock state 0)
2021-07-26 14:47:36.615 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-2: io mod: read mode, data len: 131072, data offset: 7753758081024, elapsed_usec: 30001306
2021-07-26 14:47:36.662 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-2: Timing out cmd.
2021-07-26 14:47:36.662 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-2: io mod: read mode, data len: 262144, data offset: 7753758212096, elapsed_usec: 30000921
2021-07-26 14:47:37.341 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-4: Timing out cmd.
2021-07-26 14:47:37.341 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-4: Handler connection lost (lock state 0)
2021-07-26 14:47:37.343 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 7621768609792, elapsed_usec: 30000880
2021-07-26 14:47:47.416 4050954 [ERROR] tcmu_rbd_handle_timedout_cmd:1001 rbd/rbd.xxx30T-6-1: Timing out cmd.
2021-07-26 14:47:47.416 4050954 [ERROR] tcmu_notify_conn_lost:187 rbd/rbd.xxx30T-6-1: Handler connection lost (lock state 0)
2021-07-26 14:47:47.418 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-1: io mod: read mode, data len: 4096, data offset: 7747494891520, elapsed_usec: 30000675
2021-07-26 14:47:47.419 4050954 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxx/tpgt_3.
2021-07-26 14:47:47.523 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-1: HA not supported.
2021-07-26 14:47:47.601 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-4: HA not supported.
2021-07-26 14:47:47.649 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.
2021-07-26 14:47:47.649 4050954 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2021-04.com.xxxxxxxx.iscsi-gw:xxxx/tpgt_3.
2021-07-26 14:48:21.371 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.384 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.437 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-26 14:48:21.442 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported

I expected the gateway to reconnect automatically after the timeout disconnect, but I can't discover or log in to the gateway on the node where tcmu-runner auto-restarted.
Is there something I missed?

@lxbsz
Collaborator

lxbsz commented Jul 26, 2021

> The settings of the tcmu-runner service are as follows:
>
> [...]
>
> I didn't find any log about why tcmu-runner restarted in /var/log/tcmu-runner.log. What I pasted in the issue is all of the logs from my test.

Are you using a container or something similar? I know that in some container environments the container service monitors the tcmu-runner service's status, and if it's dead the container will try to pull it back up.

By default tcmu-runner won't be auto-restarted by its systemd service, so I am curious how this could happen for you.
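Two quick checks to confirm whether systemd itself restarted the daemon (standard systemctl/journalctl usage; the timestamp is taken from the log above):

systemctl show tcmu-runner -p Restart                  # "Restart=no" means systemd won't restart it on exit
journalctl -u tcmu-runner --since "2021-07-26 14:50"   # unit start/stop/crash events around the incident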

> In the Ceph cluster I got one inactive PG.
> On other nodes (where tcmu-runner did not auto-restart), I found logs like this:
>
> [...]
> 2021-07-26 14:47:47.523 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-1: HA not supported.
> 2021-07-26 14:47:47.601 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-4: HA not supported.
> 2021-07-26 14:47:47.649 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.

Were you building tcmu-runner from source?
Why is HA not supported on this node? That is usually because tcmu-runner was built against an older Ceph version that doesn't support the exclusive lock.

While on the other node, was tcmu-runner built with a newer Ceph version?

@YMhuan
Author

YMhuan commented Jul 27, 2021

> Were you building tcmu-runner from source?
> Why is HA not supported on this node? That is usually because tcmu-runner was built against an older Ceph version that doesn't support the exclusive lock.
>
> While on the other node, was tcmu-runner built with a newer Ceph version?

Thank you for replying.
The Ceph cluster version is 13.2.8 on all storage nodes.
tcmu-runner was built from the 1.4.0 source code.
'HA not supported' occurred on the "other node", which I didn't paste yesterday. Like this:

2021-07-27 15:04:47.234 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-1: io mod: read mode, data len: 262144, data offset: 7866031341568, elapsed_usec: 964577
2021-07-27 15:04:47.251 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 131072, data offset: 7623567015936, elapsed_usec: 590750
2021-07-27 15:04:47.251 4050954 [INFO] tgt_port_grp_recovery_thread_fn:245: Disabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxxx/tpgt_3.
2021-07-27 15:04:47.396 4050954 [WARN] tcmu_rbd_check_excl_lock_enabled:796 rbd/rbd.xxx30T-6-2: HA not supported.
2021-07-27 15:04:47.396 4050954 [INFO] tgt_port_grp_recovery_thread_fn:271: Enabled iscsi/iqn.2021-04.com.xxxxxxxxx.iscsi-gw:xxxxxxx/tpgt_3.
2021-07-27 15:05:46.317 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-1: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.337 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-4: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.348 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-3: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.361 4050954 [WARN] tcmu_print_cdb_info:1193 rbd/rbd.xxx30T-6-2: a3 c 1 12 0 0 0 0 2 0 0 0 is not supported
2021-07-27 15:05:46.454 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 0, elapsed_usec: 93241
2021-07-27 15:05:46.489 4050954 [INFO] rbd_finish_aio_generic:1141 rbd/rbd.xxx30T-6-4: io mod: read mode, data len: 4096, data offset: 4096, elapsed_usec: 34772

@YMhuan
Author

YMhuan commented Jul 28, 2021

What I can find on the gateway node (storage node):

[root@gateway6 ~]# netstat -anp |grep 3260
tcp        0      0 66.66.66.1:3260         0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.3:3260         0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.2:3260         0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.170:3260       0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.169:3260       0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.168:3260       0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.167:3260       0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.166:3260       0.0.0.0:*               LISTEN      -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47513    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46572    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47373    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47443    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47225    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46863    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47005    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47586    FIN_WAIT2   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47293    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46722    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46933    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46430    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:33260        66.66.66.168:7159       ESTABLISHED 3000906/ceph-osd    
tcp        0      0 66.66.66.1:3260         172.168.188.19:47152    TIME_WAIT   -                   
tcp        0      0 88.88.88.1:7131         88.88.88.168:33260      ESTABLISHED 2999805/ceph-osd    
tcp        0      0 88.88.88.1:7157         88.88.88.169:33260      ESTABLISHED 3007121/ceph-osd    
tcp        0      0 66.66.66.1:3260         172.168.188.19:46651    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:47081    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46504    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46351    TIME_WAIT   -                   
tcp        0      0 66.66.66.1:3260         172.168.188.19:46791    TIME_WAIT   -   

The 3260 portals are still listening, so the TCP listeners themselves survived. On the client, the iscsid service shows:

[root@vm-node11 ~]# systemctl status iscsid
● iscsid.service - Open-iSCSI
   Loaded: loaded (/usr/lib/systemd/system/iscsid.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2021-07-24 11:00:33 CST; 4 days ago
     Docs: man:iscsid(8)
           man:iscsiadm(8)
 Main PID: 4459 (iscsid)
   CGroup: /system.slice/iscsid.service
           ├─4458 /usr/sbin/iscsid
           └─4459 /usr/sbin/iscsid

Jul 28 14:47:33 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:36 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:39 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:42 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:45 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:48 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:51 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:54 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:47:57 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
Jul 28 14:48:00 vm-node11 iscsid[4458]: conn 0 login rejected: target error (03/01)
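For what it's worth, iSCSI login status 03/01 is Status-Class 0x03 (target error) with Status-Detail 0x01 (service unavailable) — exactly what an initiator sees while the target portal group is disabled, which matches the TPG on that gateway never being re-enabled. As a last resort one could flip the configfs flag manually (a sketch; <target_iqn> and <N> are placeholders, and tcmu-runner's recovery thread is what normally re-enables it, so restarting the services is the safer fix):

echo 1 > /sys/kernel/config/target/iscsi/<target_iqn>/tpgt_<N>/enable   # manual re-enable, last resort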

@YMhuan
Author

YMhuan commented Jul 28, 2021

I will set "log_level = 5" in /etc/tcmu/tcmu.conf and try to reproduce this error.
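For reference, a one-liner for that change (tcmu-runner watches /etc/tcmu/tcmu.conf via inotify, as the dyn_config_start log line above shows, so it should take effect without a restart):

sed -i 's/^#\? *log_level *=.*/log_level = 5/' /etc/tcmu/tcmu.conf   # 5 = most verbose logging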
