Improve Performance for SRS #1673

winlinvip · 2020-03-26T00:58:11Z

Performance optimization is a never-ending task that needs continuous work. SRS2 already went through a major round of optimization that raised capacity from about 3k to 7k clients. Further optimizations are needed, and the process and data will be posted in this issue.

The earlier SRS2 optimization results are listed below for reference:

Play RTMP benchmark

The data for playing RTMP was benchmarked by [SB][srs-bench]:

Update	SRS	Clients	Type	CPU	Memory	Commit
2014-12-07	2.0.67	10k(10000)	players	95%	656MB	code
2014-12-05	2.0.57	9.0k(9000)	players	90%	468MB	code
2014-12-05	2.0.55	8.0k(8000)	players	89%	360MB	code
2014-11-22	2.0.30	7.5k(7500)	players	87%	320MB	code
2014-11-13	2.0.15	6.0k(6000)	players	82%	203MB	code
2014-11-12	2.0.14	3.5k(3500)	players	95%	78MB	code
2014-11-12	2.0.14	2.7k(2700)	players	69%	59MB	-
2014-11-11	2.0.12	2.7k(2700)	players	85%	66MB	-
2014-11-11	1.0.5	2.7k(2700)	players	85%	66MB	-
2014-07-12	0.9.156	2.7k(2700)	players	89%	61MB	code
2014-07-12	0.9.156	1.8k(1800)	players	68%	38MB	-
2013-11-28	0.5.0	1.8k(1800)	players	90%	41MB	-

Publish RTMP benchmark

The data for publishing RTMP was benchmarked by [SB][srs-bench]:

Update	SRS	Clients	Type	CPU	Memory	Commit
2014-12-04	2.0.52	4.0k(4000)	publishers	80%	331MB	code
2014-12-04	2.0.51	2.5k(2500)	publishers	91%	259MB	code
2014-12-04	2.0.49	2.5k(2500)	publishers	95%	404MB	code
2014-12-04	2.0.49	1.4k(1400)	publishers	68%	144MB	-
2014-12-03	2.0.48	1.4k(1400)	publishers	95%	140MB	code
2014-12-03	2.0.47	1.4k(1400)	publishers	95%	140MB	-
2014-12-03	2.0.47	1.2k(1200)	publishers	84%	76MB	code
2014-12-03	2.0.12	1.2k(1200)	publishers	96%	43MB	-
2014-12-03	1.0.10	1.2k(1200)	publishers	96%	43MB	-

Play HTTP FLV benchmark

The data for playing HTTP FLV was benchmarked by [SB][srs-bench]:

Update	SRS	Clients	Type	CPU	Memory	Commit
2014-05-25	2.0.171	6.0k(6000)	players	84%	297MB	code
2014-05-24	2.0.170	3.0k(3000)	players	89%	96MB	code
2014-05-24	2.0.169	3.0k(3000)	players	94%	188MB	code
2014-05-24	2.0.168	2.3k(2300)	players	92%	276MB	code
2014-05-24	2.0.167	1.0k(1000)	players	82%	86MB	-

Latency benchmark

The latency between encoder and player with the realtime config ([CN][v3_CN_LowLatency], [EN][v3_EN_LowLatency]):

Update	SRS	VP6	H.264	VP6+MP3	H.264+MP3
2014-12-16	2.0.72	0.1s	0.4s	0.8s	0.6s
2014-12-12	2.0.70	0.1s	0.4s	1.0s	0.9s
2014-12-03	1.0.10	0.4s	0.4s	0.9s	1.2s

winlinvip · 2020-03-26T01:01:01Z

SRS4: Refine ST Iterate Coroutines Performance

An optimization in ST (State Threads) could improve performance by 5% to 10%. It mainly addresses the cost of iterating over coroutines. For the data, see ossrs/state-threads#5 (comment).

This optimization involves significant changes, so it will not be implemented in SRS3, but is expected to be in SRS4.

MacBook Pro information:

macOS Mojave
Version 10.14.6 (18G3020)
MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3

Docker information:

Docker Desktop 2.2.0.3(42716)
Engine: 19.03.5
Resources: CPUs 4, Memory 2GB, Swap 1GB

Note: SRS is bound to CPU0, and SB is bound to CPU2-3.
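
The note does not say how the binding was done. On Linux, one common way is the `taskset` command or the `sched_setaffinity` system call. The sketch below uses Python's `os.sched_setaffinity` to show the idea; the helper name `pin_to_cpus` is hypothetical and is not part of SRS or SB.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict process `pid` (0 means the caller) to the given CPU indexes.

    Hypothetical helper for illustration only. It works on Linux, because
    os.sched_setaffinity is not available on every platform.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity is not supported on this platform")
    os.sched_setaffinity(pid, set(cpus))
    # Read the mask back so the caller can confirm what the kernel applied.
    return os.sched_getaffinity(pid)
```

For example, `pin_to_cpus([0])` would keep the calling process on CPU0, provided that CPU is in its allowed set. Keeping the server and the load generator on separate cores stops them from competing for the same CPU during a benchmark.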

SRS3 for Playing Baseline

SRS3 without this optimization serves as the performance baseline against which the improvement from this PR is measured.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 03:44:38 up 14:03,  0 users,  load average: 1.72, 1.71, 1.74
Tasks:  12 total,   1 running,  11 sleeping,   0 stopped,   0 zombie
%Cpu0  : 44.7 us, 14.9 sy,  0.0 ni, 32.5 id,  0.0 wa,  0.0 hi,  7.8 si,  0.0 st
%Cpu1  :  1.5 us,  2.9 sy,  0.0 ni, 95.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 21.2 us, 11.2 sy,  0.0 ni, 67.3 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.0 us,  8.4 sy,  0.0 ni, 75.3 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
KiB Mem :  2037260 total,   490352 free,  1188940 used,   357968 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   704796 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6654 root      20   0  463540 331388   2960 S  24.6 16.3  21:00.42 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6606 root      20   0  449600 317332   2824 S  20.6 15.6  20:56.26 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11191 root      20   0 1339072 194020   5440 S  64.1  9.5   1:43.16 ./gprof.srs_3_baseline -c console.conf 

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4002

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 19   9  70   0   0   2|   0     0 | 134M  134M|   0     0 |4500  6374 
 24  14  58   0   0   4|   0     0 | 184M  184M|   0     0 |4829  5833 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_baseline gmon.out |more
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 19.71      8.35     8.35                             _st_epoll_dispatch
 16.91     15.52     7.17 45118865     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 10.29     19.88     4.36  1857259     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  9.33     23.83     3.96 45118865     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
  4.65     25.80     1.97     4000     0.49     3.17  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  3.54     27.30     1.50     7295     0.21     1.47  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  3.42     28.75     1.45  1857259     0.00     0.00  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.16     30.09     1.34 45086840     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.36     31.09     1.00 45118865     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)

Interpretation:

CPU usage is 64%, with 44% in user space and 14% in system space.
The hot functions in user space are mainly _st_epoll_dispatch and the RTMP message processing logic.

SRS3 for Playing with ST Refined

SRS3, with this PR merged, optimizes the ST iteration logic.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 04:00:43 up 14:19,  0 users,  load average: 1.47, 1.57, 1.62
Tasks:  13 total,   3 running,  10 sleeping,   0 stopped,   0 zombie
%Cpu0  : 40.6 us, 10.2 sy,  0.0 ni, 43.3 id,  0.0 wa,  0.0 hi,  5.8 si,  0.0 st
%Cpu1  :  1.0 us,  2.1 sy,  0.0 ni, 96.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 17.7 us, 11.8 sy,  0.0 ni, 70.1 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu3  : 16.8 us,  9.5 sy,  0.0 ni, 73.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   429264 free,  1226620 used,   381376 buff/cache
KiB Swap:  1048572 total,  1028092 free,    20480 used.   667064 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
 6606 root      20   0  449356 317088   2824 S  19.3 15.6  24:59.70 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
 6654 root      20   0  448304 316176   2960 R  19.9 15.5  25:11.48 ./objs/sb_rtmp_load -c 2000 -r rtmp://127.0.0.1:1935/live/livestream      
11352 root      20   0 1357608 241384   5344 R  54.8 11.8   2:25.22 ./gprof.srs_3_st -c console.conf

Mac:trunk chengli.ycl$ docker exec git netstat -anp|grep srs|wc -l
    4003

Mac:trunk chengli.ycl$ docker exec git dstat -N lo
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
 21  10  67   0   0   2|   0     0 | 111M  111M|   0     0 |4563  6364 
 23   9  66   0   0   2|   0     0 | 121M  121M|   0     0 |4505  6306 
 20   9  69   0   0   2|   0     0 | 130M  130M|   0     0 |4812  6843 

[root@de6e1cac0533 trunk]# gprof -b gprof.srs_3_st gmon.out |more
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 22.33     14.96    14.96 82024549     0.00     0.00  SrsConsumer::enqueue(SrsSharedPtrMessage*, bool, SrsRtmpJitterAlgorithm)
 13.08     23.73     8.77 82024549     0.00     0.00  SrsFastVector::push_back(SrsSharedPtrMessage*)
 12.30     31.97     8.24  3312993     0.00     0.00  SrsProtocol::do_send_messages(SrsSharedPtrMessage**, int)
  5.25     35.49     3.52     4001     0.88     5.96  SrsRtmpConn::do_playing(SrsSource*, SrsConsumer*, SrsQueueRecvThread*)
  5.07     38.89     3.40    13188     0.26     1.73  SrsSource::on_audio_imp(SrsSharedPtrMessage*)
  4.54     41.93     3.04  3312993     0.00     0.01  SrsProtocol::send_and_free_messages(SrsSharedPtrMessage**, int, int)
  3.49     44.27     2.34 82013595     0.00     0.00  srs_chunk_header_c0(int, unsigned int, int, signed char, int, char*, int)
  2.63     46.03     1.76 82024549     0.00     0.00  SrsRtmpJitter::correct(SrsSharedPtrMessage*, SrsRtmpJitterAlgorithm)
  2.28     47.56     1.53     7656     0.20     1.68  SrsSource::on_video_imp(SrsSharedPtrMessage*)
  2.13     48.99     1.43                             st_writev

Interpretation:

CPU usage is 54%, with 40% in user space and 10% in system space.
The hot functions in user space are mainly the RTMP message processing logic.

Note: After optimizing ST, CPU usage drops from 64% to 54%, and _st_epoll_dispatch is no longer a hotspot function.
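
The two gprof flat profiles above cover runs of different length (the call counts differ), and gprof rounds the per-call columns to 0.00 ms, so the rows are hard to compare by eye. As a rough aid, the sketch below divides self seconds by calls for a few rows copied from the profiles. gprof samples every 0.01 seconds, so treat the results as approximate.

```python
# Rows copied from the two gprof flat profiles above:
# function name -> (self seconds, calls)
BASELINE = {
    "SrsConsumer::enqueue": (7.17, 45118865),
    "SrsFastVector::push_back": (3.96, 45118865),
    "SrsProtocol::do_send_messages": (4.36, 1857259),
}
ST_REFINED = {
    "SrsConsumer::enqueue": (14.96, 82024549),
    "SrsFastVector::push_back": (8.77, 82024549),
    "SrsProtocol::do_send_messages": (8.24, 3312993),
}


def ns_per_call(self_seconds, calls):
    """Average self time per call, in nanoseconds."""
    return self_seconds / calls * 1e9


def per_call_table(profile):
    """Map each function name to its average self time per call (ns)."""
    return {name: ns_per_call(sec, calls) for name, (sec, calls) in profile.items()}


if __name__ == "__main__":
    base = per_call_table(BASELINE)
    refined = per_call_table(ST_REFINED)
    for name in BASELINE:
        print(f"{name}: {base[name]:.0f} ns -> {refined[name]:.0f} ns per call")
```

Normalizing by call count removes the effect of run length. It does not remove other differences between the two runs, such as the traffic level reported by dstat.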

winlinvip · 2020-03-26T01:21:13Z

SRS3: Use Compiler O2 To Improve Performance

SRS 1, 2, and 3 have always been built with O0 by default, which disables compiler optimization. The data below compares the results before and after enabling optimization.

MacBook Pro information:

macOS Mojave
Version 10.14.6 (18G3020)
MacBook Pro (Retina, 15-inch, Mid 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 16 GB 1600 MHz DDR3

Docker information:

Docker Desktop 2.2.0.3(42716)
Engine: 19.03.5
Resources: CPUs 4, Memory 2GB, Swap 1GB

Note: SRS is bound to CPU0, and SB is bound to CPU2-3.

SRS3 Play Baseline

First, the baseline data: CPU usage averages 66%, with 39% in user space and 22% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:03:30 up 1 day, 14 min,  0 users,  load average: 1.53, 1.39, 1.12
Tasks:   5 total,   3 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu0  : 39.6 us, 22.9 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi,  8.9 si,  0.0 st
%Cpu1  :  0.3 us,  1.7 sy,  0.0 ni, 97.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 21.3 us, 11.8 sy,  0.0 ni, 66.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 26.7 us, 15.2 sy,  0.0 ni, 58.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  2037260 total,   412404 free,  1260192 used,   364664 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   640028 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  555112 393012   3056 S  26.7 19.3   4:58.08 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  555004 392828   3000 R  35.3 19.3   5:34.46 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88034 root      20   0 1651656 218748   5484 R  66.3 10.7  12:38.10 ./srs_3_baseline -c console.conf                                          
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.46 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.51 bash

SRS3 Play with Compiler O2

After enabling the O2 compiler option for SRS3, performance improves by roughly 10%: CPU usage falls from 66% in the baseline to around 52%, with 26% in user space and 17% in system space.

Mac:trunk chengli.ycl$ docker exec -it git top
top - 01:09:24 up 1 day, 20 min,  0 users,  load average: 1.23, 1.38, 1.20
Tasks:   5 total,   1 running,   4 sleeping,   0 stopped,   0 zombie
%Cpu0  : 26.7 us, 17.8 sy,  0.0 ni, 46.2 id,  0.0 wa,  0.0 hi,  9.2 si,  0.0 st
%Cpu1  :  1.8 us,  4.8 sy,  0.0 ni, 93.0 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
%Cpu2  : 24.3 us, 11.4 sy,  0.0 ni, 64.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 20.6 us, 10.7 sy,  0.0 ni, 68.4 id,  0.0 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem :  2037260 total,   375336 free,  1307788 used,   354136 buff/cache
KiB Swap:  1048572 total,   939260 free,   109312 used.   594752 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                   
88041 root      20   0  550440 388408   3056 S  31.2 19.1   6:55.76 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88046 root      20   0  545716 383624   3000 S  24.6 18.8   7:27.84 ./objs/sb_rtmp_load -c 2500 -r rtmp://127.0.0.1:1935/live/livestream      
88085 root      20   0 1713060 290732   5040 S  52.5 14.3   2:38.46 ./srs_3_o2 -c console.conf                                                
88035 root      20   0   58284   3716   3196 R   0.0  0.2   0:00.60 top                                                                       
    1 root      20   0   11944   2628   2336 S   0.0  0.1   0:01.54 bash

c47b9e46
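
A CPU saving can be expressed either as a drop in percentage points or as a change relative to the baseline, and the two readings give different numbers. The sketch below computes both from the %CPU of the SRS process in the two `top` snapshots above. The function name `cpu_change` is hypothetical, and a single snapshot per build is only a rough indication.

```python
def cpu_change(before_pct, after_pct):
    """Return (drop in percentage points, drop relative to `before_pct` in percent)."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    points = before_pct - after_pct
    relative = points / before_pct * 100.0
    return points, relative


if __name__ == "__main__":
    # %CPU of ./srs_3_baseline and ./srs_3_o2 in the `top` output above.
    points, relative = cpu_change(66.3, 52.5)
    print(f"{points:.1f} points, {relative:.1f}% relative to the baseline")
```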

winlinvip · 2020-03-27T11:47:02Z

The baseline turned out to be unstable in the Docker environment: results are sometimes high and sometimes low, with significant differences between runs, as shown in the following figure:

Several optimizations have been made, and some of them, such as enabling O2, are expected to help. Because the baseline is unstable, they are on hold for now and will be tested on a physical machine later. The optimization branches are:

compiler O2: Enable O2 optimization during compilation.
inline: Enable inline optimization for hot spot functions.
tcmalloc: Use tcmalloc for memory allocation.
st: Merge ST improvement #5 to optimize busy coroutine scheduling performance.
large iovs: Increase the number of mw_msgs combined for writing.
perf stat: Count the number of mw messages.
fast vector: Optimize the queue for each consumer.
mr always: Always enable mr read waiting.
mr buffer: Always read a fixed length of data.
small buffer: Using a small buffer may provide better performance.
vector queue: Using vector directly is also an option.
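
One way to put a number on an unstable baseline is to look at the spread of repeated samples before comparing branches. The sketch below computes the mean and coefficient of variation, using the net/lo recv samples (in MB) from the two dstat outputs earlier in this thread purely as input. With only two or three samples per run the result is illustrative, and the function name `spread` is hypothetical.

```python
import statistics


def spread(samples):
    """Return (mean, coefficient of variation in percent) for the samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean * 100.0
    return mean, cv


if __name__ == "__main__":
    # net/lo recv column from the dstat outputs above, in MB.
    runs = {
        "baseline": [134, 184],
        "st refined": [111, 121, 130],
    }
    for name, samples in runs.items():
        mean, cv = spread(samples)
        print(f"{name}: mean {mean:.0f}M, variation {cv:.1f}%")
```

If the variation within one build is of the same order as the gain being measured, the comparison is not conclusive, which is the reason given above for moving the tests to a physical machine.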

winlinvip · 2020-04-19T10:42:51Z

For ST, the areas that can be optimized are:

Use of timer and cond; see "Refine SRS timer and cond for performance issue" (#1711).
IO event processing, which requires traversing io_q; see "Support MSG_ZEROCOPY for streaming server" (state-threads#13 (comment)).
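
To see why traversing the whole wait queue matters with thousands of connections, the sketch below contrasts two ways of waking coroutines after an IO poll. It is a simplified model and not ST's code: the function names, the `(name, fd)` pairs, and the dictionary index are assumptions made for illustration.

```python
def dispatch_by_scan(waiters, ready_fds):
    """Scan every waiter and wake those whose fd fired.

    `waiters` is a list of (coroutine_name, fd) pairs standing in for the
    wait queue. Returns (woken names, number of queue entries inspected).
    """
    ready = set(ready_fds)
    woken = [name for name, fd in waiters if fd in ready]
    return woken, len(waiters)


def dispatch_by_index(waiters_by_fd, ready_fds):
    """Look up waiters directly from the fds that fired.

    `waiters_by_fd` maps fd -> list of coroutine names. Returns
    (woken names, number of lookups performed).
    """
    woken = []
    for fd in ready_fds:
        woken.extend(waiters_by_fd.get(fd, []))
    return woken, len(ready_fds)


if __name__ == "__main__":
    # 4000 waiting coroutines, as in the play benchmark, but only 3 ready fds.
    waiters = [(f"co{i}", i) for i in range(4000)]
    index = {fd: [name] for name, fd in waiters}
    ready = [5, 17, 3999]
    print(dispatch_by_scan(waiters, ready)[1], "entries inspected by scan")
    print(dispatch_by_index(index, ready)[1], "lookups by index")
```

In this model the scan costs grow with the number of waiting coroutines, while the indexed lookup grows only with the number of fds that fired.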

For analysis on ST, refer to: https://github.com/ossrs/state-threads/tree/srs#analysis

About setjmp and longjmp, read setjmp.
About the stack structure, read stack.
About asm code comments, read #91d530e.
About the scheduler, read #13-scheduler.
About the IO event system, read #13-IO.

winlinvip added the Enhancement Improvement or enhancement. label Mar 26, 2020

winlinvip added this to the SRS 3.0 release milestone Mar 26, 2020

winlinvip assigned winlinvip and chen-guanghua and unassigned winlinvip Aug 23, 2021

winlinvip modified the milestones: SRS 3.0 release, SRS 5.0 release Aug 26, 2021

ossrs locked and limited conversation to collaborators Jul 18, 2023

winlinvip converted this issue into discussion #3666 Jul 18, 2023

winlinvip added the TransByAI Translated by AI/GPT. label Jul 28, 2023
