
iocp: fix crash where GetQueuedCompletionStatus() writes to freed WSAOVERLAPPED memory #4136

Open
jimying wants to merge 10 commits into base: master

Conversation

@jimying (Contributor) commented Nov 7, 2024

Try to fix issue #985.

@nanangizz (Member)

Great! I assume you've tested this and it does fix the issue :)

I think we also need to run some kind of stress test, e.g. the ioqueue test in pjlib-test, to make sure all memory pools (of pending ops) are properly released. Note that after an ioqueue key is unregistered, the key is put into the closing-key list and soon into the free-key list to be reused by another socket. We need to make sure that all pending ops have been freed before the key is freed and reused.

Next, perhaps we can apply a bit of optimization, e.g. instead of a mem-pool for each pending-op, a mem-pool per ioqueue key, to avoid multiple alloc+free cycles for multiple pending ops, using the same mechanism as the ioqueue keys (an additional list keeping unused pending-op instances to be reused later), as sketched below.
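
Roughly, the reuse could look like this (all names here are hypothetical, just to illustrate the idea):

/* Hypothetical per-key reuse of pending-op instances, so a pool is not
 * created and released for every single operation. */
struct pending_op
{
    PJ_DECL_LIST_MEMBER(struct pending_op);   /* list node for the free list */
    /* ... per-operation state (OVERLAPPED, buffers, callbacks, ...) ... */
};

static struct pending_op* get_pending_op(pj_ioqueue_key_t *key)
{
    struct pending_op *op;

    if (!pj_list_empty(&key->free_op_list)) {
        /* Reuse a previously allocated instance. */
        op = key->free_op_list.next;
        pj_list_erase(op);
    } else {
        /* Grow from the key's pool only when the free list is empty. */
        op = PJ_POOL_ZALLOC_T(key->pool, struct pending_op);
        pj_list_init(op);
    }
    return op;
}

static void put_pending_op(pj_ioqueue_key_t *key, struct pending_op *op)
{
    /* Keep the instance on the key's free list for later reuse. */
    pj_list_push_back(&key->free_op_list, op);
}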

@jimying (Contributor, Author) commented Nov 7, 2024

Note:
this only fixes the issue when PJ_IOQUEUE_HAS_SAFE_UNREG=1;
PJ_IOQUEUE_HAS_SAFE_UNREG=0 still has the memory error.

@nanangizz (Member)

Note: this only fixes the issue when PJ_IOQUEUE_HAS_SAFE_UNREG=1; PJ_IOQUEUE_HAS_SAFE_UNREG=0 still has the memory error.

When PJ_IOQUEUE_HAS_SAFE_UNREG==1, the key won't be reused for 500 ms (configurable via PJ_IOQUEUE_KEY_FREE_DELAY). This is actually a bit risky, e.g. under high CPU load the cancellation may take longer?

@nanangizz (Member) commented Nov 14, 2024

Tried to run pjlib-test with this PR patch on VS2005 and pool debugging enabled (so the pool uses normal malloc()/free(), by setting PJ_POOL_DEBUG to 1 in config_site.h), and I got an assertion:

 	_wassert(const wchar_t * expr=0x004d1e90, const wchar_t * filename=0x004d1cf0, unsigned int lineno=588) Line 212	C
 	pj_ioqueue_register_sock2(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, pj_grp_lock_t * grp_lock=0x00000000, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 588 + 0x2c bytes	C
 	pj_ioqueue_register_sock(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 665 + 0x1f bytes	C
 	unregister_test(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 554 + 0x23 bytes	C
 	udp_ioqueue_test_imp(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 1190 + 0x9 bytes	C
 	udp_ioqueue_test() Line 1255 + 0x9 bytes	C
 	test_inner() Line 171 + 0x2a bytes	C
 	test_main() Line 245 + 0x5 bytes	C

Not sure if this is the same issue, but this assertion does not happen when using ioqueue select.

@jimying (Contributor, Author) commented Nov 14, 2024

@nanangizz Without this patch, does this assert also occur?

@nanangizz (Member)

@nanangizz Without this patch, does this assert also occur?

Yes, same assert without this patch.

Comment on lines 753 to 755
BOOL rc = fnCancelIoEx(key->ioqueue->iocp, (LPOVERLAPPED)&op->pending_key);
if (rc)
    PJ_PERROR(2, (THIS_FILE, PJ_RETURN_OS_ERROR(GetLastError()), "cancel io error"));
Review comment (Member):

  • actually when rc == 1, it succeeds
  • shouldn't the handle be key->hnd instead of key->ioqueue->iocp?

@jimying (Contributor, Author) Nov 14, 2024

You are right.

The MSDN doc says:
If this parameter is NULL, all I/O requests for the hFile parameter are canceled.

It's better to use fnCancelIoEx(key->hnd, NULL) to cancel all pending operations. I tested it and it looks okay.
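
Something like this (simplified sketch, not necessarily the exact change committed here):

/* Cancel all pending I/O on the socket handle. Note that CancelIoEx()
 * returns non-zero on success, so only report an error on failure. */
BOOL rc = fnCancelIoEx(key->hnd, NULL);  /* NULL: cancel every request on key->hnd */
if (!rc) {
    DWORD err = GetLastError();
    if (err != ERROR_NOT_FOUND)          /* nothing left to cancel is not an error */
        PJ_PERROR(2, (THIS_FILE, PJ_RETURN_OS_ERROR(err), "CancelIoEx() error"));
}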

@jimying (Contributor, Author) commented Nov 17, 2024

Tried to run pjlib-test with this PR patch on VS2005 and pool debugging enabled ... I got an assertion [stack trace quoted above] ... this assertion does not happen when using ioqueue select.

I found the reason: the key was unregistered twice.
Fixed: when key->closing == 1, don't unregister again.
Now the test passes.
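
Roughly, the guard looks like this (simplified sketch; the actual commit may differ slightly):

/* At the top of the unregister path, with the key's lock held: if the key
 * is already being closed, don't unregister it a second time. */
if (key->closing)
    return PJ_SUCCESS;
key->closing = 1;
/* ... continue with the normal unregistration (cancel pending I/O, etc.) ... */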

@jimying jimying marked this pull request as ready for review November 22, 2024 02:52
@nanangizz (Member)

Thanks @jimying.

Honestly I haven't had a chance to reproduce the original issue and test the proposed solution. I believe you are using this ioqueue in the real world, experienced the issue, and found that this solution works, is that correct?

Next, here are a few notes about the proposed solution:

  • As you've mentioned, the approach requires PJ_IOQUEUE_HAS_SAFE_UNREG to work, so I think this mode has to be enforced somehow.
  • Also, as I've mentioned before, the safe key unregistration relies on a timer (default is 500 ms); there is a risk when CPU load is high, and increasing the timeout is not ideal, as the risk is still there, only reduced to some degree. Perhaps the unregistration should rely on zero pending operations instead, or on some combination of the two.
  • The memory pool per operation can be optimized. Note that a single ioqueue key can have many rapid operations, e.g. receiving RTP packets, so creating and releasing a pool for each operation does not sound very optimal. An idea is to have a pool of pending_op instances (just like the pool of keys), maybe owned by the ioqueue or by the key.

Also, this ioqueue has been disabled for quite some time, and some improvements in the ioqueue area may not have been integrated into it, e.g. group lock for the key. So please understand that there may still be some steps required to enable this ioqueue again :)

@jimying (Contributor, Author) commented Nov 23, 2024

@nanangizz I wrote a simple demo to reproduce the crash issue in msys2, #4172.

I have tested it; with the old code it reproduces the crash 100% of the time.

To test the new code, we can git cherry-pick the demo patch onto this branch.

@nanangizz (Member)

Thanks @jimying.

@nanangizz nanangizz modified the milestones: release-2.15, release-2.16 Dec 3, 2024
@sauwming (Member)

Hi @jimying, please let us know whether you plan to incorporate @nanangizz's suggestions above.
In my opinion, the first two (enforcing PJ_IOQUEUE_HAS_SAFE_UNREG and zero pending-op) are important and shouldn't require much modification.
As for the last suggestion (memory pool), I think it's optional if you don't have the time to implement it; just put a note in the code for future optimisation.

@jimying (Contributor, Author) commented Dec 13, 2024

@sauwming Sorry, I will submit it as soon as possible, today or tomorrow.

@jimying (Contributor, Author) commented Dec 13, 2024

enforcing PJ_IOQUEUE_HAS_SAFE_UNREG and zero pending-op

The new commits do the following (sketched below):

  1. Remove the macro PJ_IOQUEUE_HAS_SAFE_UNREG and keep only the PJ_IOQUEUE_HAS_SAFE_UNREG=1 logic.
  2. Remove closing_list; when ref_count reaches 0, return the key to the free list.
     To achieve this, a reference is added when a pending-op is allocated and decremented when it is freed.
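
Roughly, the idea looks like this (simplified sketch; the helper names here are illustrative, the actual code in the commits may differ, and it assumes struct pending_op keeps its owning pool and key):

/* Every pending operation holds a reference on its key, so the key cannot
 * be freed and reused while an operation is still outstanding. */
static void pending_op_attach(pj_ioqueue_key_t *key, struct pending_op *op)
{
    op->key = key;
    pj_atomic_inc(key->ref_count);
}

/* When the operation completes (or is cancelled), its memory is released and
 * the reference is dropped; the last release returns the key to the free list. */
static void pending_op_release(struct pending_op *op)
{
    pj_ioqueue_key_t *key = op->key;

    pj_pool_release(op->pool);                      /* pool owning this op */
    if (pj_atomic_dec_and_get(key->ref_count) == 0)
        put_key_to_free_list(key->ioqueue, key);    /* illustrative helper */
}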

@@ -133,12 +133,9 @@ struct pj_ioqueue_key_t
int connecting;
#endif

#if PJ_IOQUEUE_HAS_SAFE_UNREG
pj_atomic_t *ref_count;
@sauwming (Member) Dec 20, 2024

I don't think we should remove all the code related to PJ_IOQUEUE_HAS_SAFE_UNREG. I believe what @nanangizz meant was that we need to make sure PJ_IOQUEUE_HAS_SAFE_UNREG is set to 1.

Since by default PJ_IOQUEUE_HAS_SAFE_UNREG is already 1, IMO adding a check below should be sufficient, just to make sure that users don't accidentally disable it:

#if PJ_IOQUEUE_HAS_SAFE_UNREG == 0
    /* IOCP only works with PJ_IOQUEUE_HAS_SAFE_UNREG enabled, otherwise we will get memory error in ... */
    #error "PJ_IOQUEUE_HAS_SAFE_UNREG must be enabled to use ioqueue IOCP"
#endif

Another alternative is to force enable it in pjlib/config.h, something like:

#if PJ_IOQUEUE_IMP == PJ_IOQUEUE_IOCP
#   define PJ_IOQUEUE_HAS_SAFE_UNREG    1
#endif

But this is more complicated since we need to define PJ_IOQUEUE_IMP similar to PJ_SSL_SOCK_IMP.
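
Roughly, something like this (value names are just illustrative, modeled on PJ_SSL_SOCK_IMP):

/* Hypothetical ioqueue implementation selector in pjlib config. */
#define PJ_IOQUEUE_SELECT   1
#define PJ_IOQUEUE_EPOLL    2
#define PJ_IOQUEUE_IOCP     3

#ifndef PJ_IOQUEUE_IMP
#   define PJ_IOQUEUE_IMP   PJ_IOQUEUE_SELECT
#endif

#if PJ_IOQUEUE_IMP == PJ_IOQUEUE_IOCP
    /* IOCP only works with safe key unregistration. */
#   undef  PJ_IOQUEUE_HAS_SAFE_UNREG
#   define PJ_IOQUEUE_HAS_SAFE_UNREG    1
#endif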

@jimying (Contributor, Author) replied:

The logic for PJ_IOQUEUE_HAS_SAFE_UNREG=0 is no longer needed; I think it's better to remove it.

Reply (Member):

It's still there in all other ioq backends.
