%&cheri_misidioms_preamble
\endofdump
\begin{document}
\begin{abstract}
\noindent
Several open-source memory allocators have been ported to CHERI, a hardware capability
platform. In this paper we examine the security and performance of these
allocators when run under CheriBSD on Arm's prototype Morello platform.
We introduce a number of security attacks and show that all but one allocator
are vulnerable to some of the attacks --- including the default CheriBSD allocator.
We then show that while some forms of allocator performance are meaningful,
comparing the performance of hybrid and pure capability (i.e.~`running in
non-CHERI vs.~running in CHERI modes') allocators does not currently appear to be meaningful.
Although we do not fully understand the
reasons for this, it seems to be at least as much due to factors such as
immature compiler toolchains and prototype hardware as it is due to the effects of
capabilities on performance.
\end{abstract}
\maketitle
\section{Introduction}
CHERI (Capability Hardware Enhanced RISC Instructions) provides and enforces hardware
capabilities that allow programmers to make strong security guarantees
about the memory safety properties of their programs~\cite{watson20cheriintroduction}. However, capabilities are
not magic --- programmers must first decide which memory safety properties they
wish to enforce and then write their software in such a way to enforce those properties.
Mistakes or oversights undermine
the security guarantees that programmers believe their code possesses.
In this paper we study memory allocators (henceforth just ``allocators'')
in the context of CHERI. Apart from some embedded
systems (which preallocate fixed quantities of memory), allocators are
ubiquitous, because they allow us to write programs that are generic over a
variety of memory usage patterns. Allocators' security properties and performance are a fundamental pillar supporting
the security properties and performance of software in general --- any
security flaws and/or performance problems in allocators thus have
significant, widespread, consequences.
In this paper we show that most CHERI allocators are subject to surprisingly simple attacks.
We then show that while some aspects of allocator performance can be meaningfully
compared, it does not currently appear to be meaningful to compare the performance of hybrid
and pure capability (roughly speaking: ``running in
non-CHERI vs.~running in CHERI modes'') allocators.
We analyse some of the likely factors for this latter case,
which suggest that this may be at least as much due to factors such as immature compiler
toolchains and prototype hardware as to the effects of capabilities on performance.
We do not claim that our
work is definitive, though it does suggest two things: that some allocators
undermine the security properties one might reasonably expect from software running on
pure capability CHERI; and that it is currently difficult to reason about the
performance impact of CHERI on software. Our experiments
are repeatable and can be downloaded from
\url{https://archive.org/details/cheri_allocator_ismm}.
This paper is structured as follows. First, we introduce the necessary
background: a brief overview of capabilities and CHERI (\autoref{sec:cheri});
and our running example, a trivial bump allocator (\autoref{sec:bumppointerallocator}).
We then introduce our study: the allocators
under consideration (\autoref{sec:cheriallocators}); our attacks (\autoref{sec:atks});
a partial performance evaluation (\autoref{sec:performance}); and an analysis
of some of the performance discrepancies we uncovered (\autoref{sec:dissection}).
\section{CHERI Overview}
\label{sec:cheri}
In this section, we provide a simple overview of CHERI, and its major concepts.
Since CHERI has developed over a number of years, and is explained across
a range of documentation and papers, some concepts have acquired more than one name,
or names that subtly conflict with mainstream
software development definitions. We use a single name for each concept, sometimes
introducing new names where we believe that makes things clearer.
A \emph{capability} is a token that gives those who bear it \emph{abilities} to
perform certain actions. By restricting who has access to a given capability,
one can enforce security properties (e.g.~``this part of the software can
only read from memory address range X to Y''). Capabilities have a long history: \cite{levy84capability}
provides a narrative overview of capability architectures, and
may usefully be augmented by more recent work such as~\cite{miller06robust}.
A good first intuition is that CHERI is a modern version of this longstanding
idea, with finer-grained permissions, and adapted to work on recent
processor instruction sets.
We use the term \emph{CHERI} to refer to the `abstract capability machine'
that software can observe: that is, the combination of a capability hardware
instruction set, an ABI (roughly speaking, the interface between userland and kernel
e.g.~\cite{brooks19cheriabi}), a user-facing library that exposes
capability-related functions, and a CHERI-aware language (e.g.~CHERI C~\cite{watson20chericprogramming},
an adaptation, in the sense of both extending and occasionally altering, of C).
Except where we use the name of a different hardware implementation (e.g.~CHERI
RISC-V), we assume the use of Arm's `Morello' hardware, which is a
prototype ARMv8 chip extended with CHERI instructions.
\label{abilities}
Conceptually, a CHERI system starts with a `root' capability that has
the maximum set of abilities. Each
new \emph{child} capability must be derived from one or more \emph{parent}
capabilities. A child capability must have the same abilities as its
parent, or fewer: put another way, capabilities' abilities monotonically decrease.
An \emph{authentic}\footnote{CHERI calls these
`tagged' or `valid' (and their inauthentic counterparts `untagged' or `invalid').}
capability is one that has been derived from authentic parents according to
CHERI's rules. Attempts to create capabilities that violate CHERI's rules
cause the hardware to produce an \emph{inauthentic} result, guaranteeing
that capabilities cannot be forged.
On Morello and CHERI RISC-V, capabilities behave as if they are 128 bits in
size, but also carry an additional (129th) bit that records the authenticity of each
capability. Software can read, and unset, the authenticity bit, but cannot set it:
only a child capability
derived, correctly, from authentic parent capabilities can itself be authentic.
A capability consists of a memory
\emph{address}\footnote{This portion of a capability does not \emph{have} to store an
address, though typically it does so, and the CHERI API calls it \texttt{address}.
In the context of this paper, since it always stores an address, we stick with this name.},
and its abilities: a set of \emph{permissions} (only a subset of
which we consider in this paper);
and \emph{bounds}, the memory range on which
the capability can operate.
Permissions include the ability to read / write from / to memory.
A \emph{permissions check} is said to be successful if the permission required
for a given operation is provided by a given capability.
A capability's bounds
are from a \emph{low} (inclusive) to a \emph{high} (exclusive) address: when
we refer to a capability's bounds being of `$x$' bytes we mean that
$\textit{high}-\textit{low}=x$. An address is
\emph{in-bounds} for a given capability if it is contained within the
capability's bounds, or \emph{out-of-bounds} otherwise; a capability is
in (or out) of bounds if its address is in (or out) of
bounds\footnote{\cite{woodruff19chericoncentrate} shows why authentic
capabilities can have an out-of-bounds address.}.
A \emph{bounds check} is said to be successful if a given capability, address, or
address range, is in-bounds for a given capability.
A processor instruction that operates on a capability requires at least one of: an authenticity
check, a permissions check, or a bounds check. If a capability fails the
relevant checks, then the hardware either produces a
\lstinline{SIGPROT} exception (similar to \lstinline{SIGSEGV})
that terminates the program, or produces an inauthentic capability (where this is not
in violation of the CHERI rules).
CHERI allows both double-width capabilities and
single-width addresses-as-pointers to exist alongside each other at any time.
Conventionally, a program which uses both traditional addresses and
capabilities is said to be operating in \emph{hybrid} mode while a program
which uses only capabilities is in \emph{pure capability} -- henceforth ``purecap'' -- mode.
Some caution is necessary with these terms, because different parts of a system
may be hybrid or purecap. For example, kernels often use hybrid mode while
some parts of userland may use purecap: at least one of the sides in such
a relationship must translate between the hybrid and purecap worlds as necessary.
In this paper we make two simplifying assumptions: that we only have to
consider userland, and that userland is entirely purecap or entirely hybrid.
CHERI does not presuppose a particular Operating System (OS). While there is a
CHERI Linux port, at the time of writing the most mature OS for CHERI hardware
is CheriBSD, a FreeBSD descendant. In this paper we use CheriBSD.
\section{A Basic Pure Capability Allocator}
\label{sec:bumppointerallocator}
To illustrate how CHERI affects allocators, in this section we adapt a simple
non-CHERI aware allocator to become CHERI aware. For simplicity's sake,
we assume that this means at least running successfully on, and preferably
taking advantage of, purecap CHERI.
\begin{figure}[t]
\lstinputlisting[
language=C,
xleftmargin=1.3em,
caption={
A simple, but complete, non-CHERI aware, bump pointer allocator:
\fnc{malloc} works as per normal; \fnc{free} is a no-op; and
\fnc{realloc} always allocates a new chunk, copying
over the old block.
\fnc{\_\_builtin\_align\_up(v, a)} is an LLVM / clang primitive which rounds
\texttt{v} up to the smallest multiple of \texttt{a} that is at least \texttt{v};
\texttt{\_Alignof(max\_align\_t)} returns an alignment sufficiently large
for any scalar type (i.e.~integers and pointers).
},
label=lst:bump_alloc1]{code/bump_alloc1.c}
\end{figure}
\autoref{lst:bump_alloc1} shows a simple, complete, example of a C bump allocator:
\fnc{malloc} works as per normal; \fnc{free} is
a no-op; and \fnc{realloc} always allocates a new chunk of memory. The
allocator reserves a large chunk of memory using a single \fnc{mmap} call and then
doles out chunks on each \fnc{malloc} / \fnc{realloc} call. The bump pointer
moves through the \fnc{mmap}ed chunk until it reaches the upper limit, at which
point the allocator returns \texttt{NULL}. Though intentionally simplistic,
\fnc{realloc} is correct even when the block is increased in size.
\subsection{Adapting the Allocator to CHERI}
\label{sec:adapting_to_cheri}
Perhaps surprisingly, our simple bump allocator compiles, and \fnc{malloc} runs
correctly, on a purecap CHERI system. As this suggests, CHERI C is largely source
compatible with normal C code, though pointer types are transparently `upgraded' to become
\emph{capability types} (on Morello occupying exactly twice the space of a
non-capability pointer). CHERI also implies changes in libraries: on CheriBSD,
for example, \fnc{mmap} returns a capability whose bounds cover at least
the size requested: from that capability our bump allocator derives new
capabilities that differ in their address but not their bounds. In other
words, calling \fnc{malloc} just once gives the caller the ability to read and
write from all past and future blocks returned by \fnc{malloc}!
\begin{figure}[t]
\lstinputlisting[language=C,
xleftmargin=1.3em,
caption={
Replacing the non-CHERI aware \fnc{malloc} from ~\autoref{lst:bump_alloc1}
with a CHERI-aware alternative
using the idioms suggested in~{\cite[p.~30]{watson20chericprogramming}}.
This \fnc{malloc} returns a capability whose bounds are sufficient
to cover \texttt{size} bytes starting at the capability's address
(calculated in lines 5--10),
such that two callers of \fnc{malloc} cannot read or write
each other's blocks. We also have to update \fnc{realloc} so that it never tries to copy
more data from the old block than the \texttt{ptr} capability gives
it access to.
},
label=lst:bump_alloc2]
{code/bump_alloc2.c}
\end{figure}
As this suggests, using CHERI without careful thought may give
no additional security benefits. This then raises the question:
how should a secure `CHERI aware' allocator behave? There can be no single
answer to this question, but we believe that most programmers would at least expect
\texttt{malloc} to return a capability whose bounds are restricted to the block of memory
allocated. \autoref{lst:bump_alloc2}
shows how to adapt \texttt{malloc} to do this.
The code to create the capability (using the idioms suggested
in~\cite[p.~30]{watson20chericprogramming}) is more involved than one might
first expect. The underlying cause is that there aren't, and cannot reasonably
be, enough bits in CHERI's bounds to precisely represent every possible address
and size. Modern CHERI therefore uses an encoding for bounds that allows small
bounds to be precisely represented, at the expense of larger bounds becoming
progressively less precise~\cite{woodruff19chericoncentrate}. On Morello,
the smallest bound that cannot be precisely represented is 16,385 bytes,
which is rounded up to 16,392 bytes\footnote{For CHERI RISC-V the first unrepresentable length is
4,097 bytes, which is rounded up to 4,104.}. Our capability aware
\texttt{malloc} thus has to ensure that both the capability's low and high
bound addresses are rounded down and up (respectively) in a way that ensures
that the address and size can be fully covered.
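For concreteness, the following is a minimal sketch of such a bounds-narrowing
\fnc{malloc}, reusing the globals from the sketch above (illustrative only, and
not the exact listing of \autoref{lst:bump_alloc2}). It assumes that the
\texttt{cheriintrin.h} helpers \texttt{cheri\_representable\_length} and
\texttt{cheri\_representable\_alignment\_mask}, thin wrappers over the
corresponding compiler builtins, are available:
\begin{lstlisting}[language=C]
// Illustrative sketch only; assumes the cheriintrin.h helpers
// cheri_representable_length() and cheri_representable_alignment_mask().
#include <cheriintrin.h>

void *malloc(size_t size) {
  // ...arena initialisation as in the earlier sketch...
  // Round the request up to a representable length and compute the
  // alignment the base address needs for the bounds to be exact.
  size_t rounded = cheri_representable_length(size);
  size_t align = ~cheri_representable_alignment_mask(size) + 1;
  if (align < _Alignof(max_align_t))
    align = _Alignof(max_align_t);
  char *block = __builtin_align_up(heap, align);
  if (block > heap_end || (size_t)(heap_end - block) < rounded)
    return NULL;                    // out of memory
  heap = block + rounded;           // bump past the (padded) block
  // Narrow the returned capability so it covers only this block.
  return cheri_bounds_set(block, rounded);
}
\end{lstlisting}
\noindent For sizes below the representability threshold discussed above,
\texttt{rounded} equals \texttt{size} and the required alignment degenerates
to \texttt{\_Alignof(max\_align\_t)}.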
The two versions of our allocator have meaningfully different security properties,
even when we run both on a purecap system. For example, consider this simple
C snippet which models a buffer overrun:
\begin{lstlisting}[language=C]
char *b = malloc(1);
b[0] = 'a';
b[1] = 'b';
\end{lstlisting}
\noindent
On a non-CHERI system, or a CHERI system with \autoref{lst:bump_alloc1}
as an allocator, this snippet compiles and runs without error. However, using
the allocator from \autoref{lst:bump_alloc2} on a purecap CHERI system,
while the snippet compiles, the tighter
capability bounds returned by \texttt{malloc} cause a \lstinline{SIGPROT}
when executing line 3. This demonstrates how CHERI can prevent
programmer errors becoming security violations.
However, just because a program compiles with CHERI C does not guarantee that
it will run without issue: when run on CHERI, the \fnc{realloc} in~\autoref{lst:bump_alloc1}
causes a \lstinline{SIGPROT} when asked to
increase the size of a block. This occurs because \fnc{memcpy} tries to copy
beyond the bounds of the input capability (e.g.~if the existing block is 8
bytes and we ask to resize it to 16 bytes, \fnc{memcpy} tries to read
16 bytes from a capability whose bounds are 8 bytes). On a non-CHERI system,
this is not a security violation, but it is treated as one on CHERI. In
\autoref{lst:bump_alloc2} we thus provide an updated \fnc{realloc} which
uses \texttt{cheri\_length\_get} (which returns a capability's bounds
in bytes) to ensure that it never copies more data than the input capability's
bounds allow.
\section{CHERI Allocators}
\label{sec:cheriallocators}
In this paper we consider a number of allocators that are available for
CheriBSD. We first explain the set of allocators we use, before
exploring in more detail how the allocators have been adapted (if at all) for
CHERI.
\subsection{The Allocators Under Consideration}
\begin{table}[tb]
\begin{center}
\begin{tabular}{llrrr}
\toprule
Allocator & Version & SLoC & \multicolumn{2}{c}{Changed}\\
\cmidrule(lr){4-5}
& & & LoC & \multicolumn{1}{c}{\%}\\
\midrule
\input{./data/results/slocs.tex}
\\ \bottomrule
\end{tabular}
\end{center}
\caption{The allocators we examined, their size in Source Lines of Code
(SLoC), and the number of lines changed (as an absolute value and relative
percentage) to adapt them for purecap CheriBSD. The top portion
of the table shows the allocators which
passed a basic test and are used in our experiments; the bottom portion
shows the allocators which failed a basic test.}
\label{tab:allocator_summary}
\end{table}
A number of allocators are available for purecap CheriBSD, installable via three
different routes: as part of the base distribution; via CheriBSD
\emph{packages}; or via external sources. We examined allocators
available from all three sources. We excluded
allocators aimed at debugging (e.g.~\emph{ElectricFence}).
We then ran a simple validation test, \fnc{malloc}ing a block of memory,
copying data into the block, and then \fnc{free}ing the block: we
excluded any allocator which failed this test.
On that basis, the allocators we consider in this paper, and the names we use
for them for the rest of this paper, are as follows:
\setlength{\leftskip}{6pt}
\vspace{6pt}\noindent
\memalloc{jemalloc}, a modified version of the well-known
allocator~\cite{evans06scalable}: this is the default allocator
for CheriBSD.
\vspace{6pt}\noindent
\memalloc{libmalloc-simple}\footnoteurl{https://github.com/CTSRD-CHERI/cheribsd/commit/e85ccde6d78d40f130ebf126a001589d75d60473}{23rd
February 2023}, a port of the allocator in the FreeBSD utility
\texttt{rtld-elf}\footnoteurl{https://github.com/freebsd/freebsd-src/blob/releng/4.3/libexec/rtld-elf/malloc.c}{
23rd of February 2023}, based on Kingsley's malloc from 4.2BSD.
\vspace{6pt}\noindent
\memalloc{snmalloc-cheribuild}, a version of
\emph{snmalloc}~\cite{lietar19snmalloc} that can be installed via
\texttt{cheribuild}. We found this version to have several problems
which we rectified by manually building a newer version from \emph{snmalloc}'s
GitHub repository. We term this more recent version \memalloc{snmalloc-repo}.
\vspace{6pt}\noindent
\memalloc{dlmalloc-cheribuild}, a modified version of the well-known
allocator~\cite{lea96memory},
installable via \texttt{cheribuild}. \memalloc{dlmalloc-pkg64c} is an
unmodified version of the allocator, available as a package in CheriBSD. Both
versions are based on dlmalloc 2.8.6.
\vspace{6pt}\noindent
\memalloc{ptmalloc}~\cite{gloger06ptmalloc} is an extension of
\memalloc{dlmalloc}, with added support for multiple threads.
\vspace{6pt}\noindent
\memalloc{bump-alloc-nocheri} is the simple, non-CHERI-aware bump
allocator from~\autoref{lst:bump_alloc1}. Conversely, the CHERI aware version is
\memalloc{bump-alloc-cheri}, presented
in~\autoref{lst:bump_alloc2}.
\setlength{\leftskip}{0pt}
\vspace{6pt}\noindent
\autoref{tab:allocator_summary} shows the version of each allocator we used.
We have not included two other
major memory allocators that have only been
partly ported to CHERI: the \emph{Boehm-Demers-Weiser} conservative garbage
collector; and the \emph{WebKit} garbage collector.
\subsection{How Much Have the Allocators Been Adapted for CHERI?}
\label{sec:rqs}
As we saw from \autoref{lst:bump_alloc1}, simple allocators may not need
adapting for CHERI, though they are then likely to derive only minor security gains. In
practice, we expect most allocators to incorporate at least the capability bounds enforcement
of \autoref{lst:bump_alloc2}. Indeed, more sophisticated allocators tend
to crash on CHERI without at least some modifications. For example, most of the allocators available
via CheriBSD's package installer (e.g.~dlmalloc-pkg64c) have had no source-level
changes for CHERI: they compile correctly but crash on even the most trivial examples.
Understanding the details of the CHERI modifications to all of the allocators under
consideration is beyond the scope of this work.
Instead, \autoref{tab:allocator_summary} shows what proportion of an
allocator's LoC is `CHERI specific' by calculating the percentage of lines of code contained within
\texttt{\#ifdef CHERI} blocks and similarly guarded code. This count is an under-approximation, as some
code outside such \texttt{\#ifdef} blocks may also have been adapted, but it
gives a rough idea of the extent of changes.
With the exception of the extremely small
\texttt{bump-alloc} and \texttt{libmalloc-simple},
the pure capability memory manager libraries in \autoref{tab:allocator_summary} have, on average,
2.31\% of their SLoC changed.
Although this is a relatively small proportion,
it is two orders of magnitude larger than the
0.026\% of lines that were adapted when porting a desktop environment (including X11 and
KDE)~\cite{watson21assessing}. It is a reasonable assumption that the
lower-level, and more platform dependent, nature of allocators
requires more LoC to be adapted.
\section{The Attacks}
\label{sec:atks}
\begin{table}[t]
\begin{center}
\begin{tabular}{lccccc}
\toprule
Allocator & \tblescunauthentic & \tblescperms & \tblnarrowwidencapleak & \tbloverlap & \tblundef\\
\midrule
\input{./data/results/tests.tex}
\\ \bottomrule
\end{tabular}
\caption{Attacks per allocator: $\times$ indicates that an allocator is vulnerable to an attack;
$\checkmark$ that the allocator is invulnerable; and $\oslash$ a failure for other
reasons (e.g.~a segfault).}
\label{tab:atks}
\end{center}
\end{table}
Our definition of CHERI in~\autoref{sec:cheri} might suggest that software
running on CHERI hardware is invulnerable to attack. This is not so: rather, CHERI gives us the
tools to make secure software, but it is up to us to use them correctly --- and wisely.
We must decide which attack model is relevant to our use-case, and then write,
or adjust, the software, to withstand such attacks. In our context, allocators are
subject to spatial (e.g.~buffer overrun) or temporal (e.g.~a sequence of
function calls) attacks, and those attacks can either target an allocator's
internals (e.g.~corrupting private data-structures) or its interface
(e.g.~allowing user code to bypass security checks).
In this section we introduce a number of simple `attacks' on CHERI allocators (4
temporal and 1 spatial) and then run those attacks on the allocators from
\autoref{sec:cheriallocators}, with the results shown in \autoref{tab:atks}.
Even the default CheriBSD allocator is vulnerable to some attacks: only
snmalloc is invulnerable.
In the rest of this section we explain each attack, giving C code using the
CHERI API. Our code examples assume that we start with a
`non-attacker' who allocates memory and hands it over to another part of the system which has been
taken over by an `\colorbox{gray!20}{attacker}' (whose code has a light grey background).
For each (allocator, attack) pair, we state whether it is
vulnerable, invulnerable, or whether the attack fails for other reasons.
We model this via a series of \lstinline{assert}s:
if all the \lstinline{assert}s pass, the attack is successful. The code we show
in the paper is elided relative to the version we run, which contains changes
that make it possible for us to automate the running of the attacks over
multiple allocators. The
full code (available as part of our experiment) must be considered the definitive source of
truth for \autoref{tab:atks}.
Our descriptions use the following CHERI functions:
\setlength{\leftskip}{6pt}
\vspace{6pt}
\noindent\texttt{void *cheri\_address\_set(void *c, \\ptraddr\_t a)}
\hspace{\parindent}Takes a capability \texttt{c} as input and returns a capability
that is a copy of \texttt{c} except with the address \texttt{a}.
\texttt{ptraddr\_t} is a CHERI C integer type that is guaranteed to be big enough
to represent addresses but, unlike \texttt{intptr\_t}, is not big enough to
represent capabilities.
\vspace{6pt}
\noindent\texttt{ptraddr\_t cheri\_base\_get(void *c)}\\
Returns the address of the lower bound of a capability \texttt{c}.
\vspace{6pt}
\noindent\texttt{void *cheri\_bounds\_set(void *c, \\size\_t s)}
\hspace{\parindent}Takes a capability \texttt{c} as input and returns a new capability
that is a copy of \texttt{c} except with bounds \texttt{s}.
\vspace{6pt}
\noindent\texttt{size\_t cheri\_length\_get(void *c)}\\
Returns the bounds of a capability \texttt{c}.
\vspace{6pt}
\noindent\texttt{\_Bool cheri\_tag\_get(void *c)}\\
Returns true if the capability \texttt{c} is authentic.
\vspace{6pt}
\noindent\texttt{void* cheri\_perms\_and(void *c,\\ size\_t perms)}
\hspace{\parindent}Returns the capability \texttt{c} with its permissions bitwise-ANDed
with \texttt{perms}.
\vspace{6pt}
\noindent\texttt{size\_t cheri\_perms\_get(void *c)}\\
Returns the permissions of capability \texttt{c}.
\setlength{\leftskip}{0pt}
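As a brief orientation, the following illustrative snippet exercises several of
these functions on a freshly allocated block, assuming (as for the allocators
above) that \fnc{malloc} returns a capability whose address is its lower bound:
\begin{lstlisting}[language=C]
char *buf = malloc(64);
// Derive a capability restricted to the first 16 bytes of the block.
char *window = cheri_bounds_set(buf, 16);
assert(cheri_length_get(window) == 16);
assert(cheri_base_get(window) == cheri_base_get(buf));
// Drop all permissions: the result is still authentic, but any attempt
// to load or store through it will now raise a SIGPROT.
char *inert = cheri_perms_and(window, 0);
assert(cheri_tag_get(inert));
assert((cheri_perms_get(inert) & CHERI_PERM_STORE) == 0);
\end{lstlisting}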
\subsection{\narrowwiden: Narrowing then Widening Can Allow Access to Hidden Data}
\label{narrowwiden}
In the simple bump allocator of \autoref{lst:bump_alloc1}, \texttt{realloc} always
allocates a new block of memory. While this is always correct, it is inefficient,
in part because it requires copying part of the block's existing content.
Most allocators thus try to avoid allocating a new block of memory if: the
new size is the same as, or smaller than, the existing size; or if
the new size would not lead to the block overwriting its nearest neighbour.
The latter optimisation is dangerous for a CHERI allocator.
Consider the case where \texttt{realloc} wants to increase a
block in size, and there is sufficient room to do so without moving the block.
\texttt{realloc} needs to return a capability whose bounds encompass the new
(larger) size. However, such a capability cannot be derived from the input
capability: doing so would violate the property that a capability's abilities
must monotonically decrease, and would therefore produce an inauthentic
capability. Thus, the allocator needs access to a `super'
capability which it can use to derive a capability representing the new bounds.
Let us call the `super' capability \texttt{SC} and introduce a function
\texttt{size\_of\_bucket} which tells us the maximum space available for
the block starting at \texttt{ptr}. Eliding extraneous details (e.g.~about
alignment), \fnc{realloc} will then look as follows:
\begin{lstlisting}[language=C]
void *realloc(void *ptr, size_t size) {
if (size <= size_of_bucket(ptr)) {
// No need to reallocate.
void *new_ptr =
cheri_address_set(SC, (ptraddr_t)ptr);
return cheri_bounds_set(
new_ptr, size);
} else {
// Allocate a larger region of memory
// and copy the old contents.
}
}
\end{lstlisting}
\noindent Lines 4--7 need to deal with the
case where the block is to be increased in size but will still fit
in its current bucket. We
first use \fnc{cheri\_address\_set} to derive a new capability from
\texttt{SC} whose address is the same as \texttt{ptr} but whose bounds will be
those of \texttt{SC} (lines 4 and 5) before narrowing those bounds to
\texttt{size} (lines 6 and 7).
When implemented in this style, an allocator can be subject to the following
attack:
\begin{lstlisting}[language=C,linebackgroundcolor={\ifnum\value{lstnumber}>4\color{gray!20}\fi}]
uint8_t *arr = malloc(256);
for (int i = 0; i < 256; i++)
arr[i] = i;
arr = realloc(arr, 1);
arr = realloc(arr, 256);
for (int i = 0; i < 256; i++)
assert(arr[i] == i);
\end{lstlisting}
\noindent
We first allocate a block of memory, receiving a capability with
a bounds of 256 bytes (line 1). We fill the block up with data (lines 2 and 3)
then \fnc{realloc} the block down to a single byte, receiving back a capability
whose bounds are 1 byte\footnote{Some allocators return bounds bigger than 1 byte, though
none we tested returned a bounds of 256 bytes or more.} (line 4).
At this point, we expect to have permanently lost access to the values written to bytes 2-255
in lines 2 and 3 --- if an attacker \fnc{realloc}s the block back to
its original size they should not have access to the values written to bytes
2-255. However, allocators using the optimisation above will often
return a capability that covers the same portion of memory as the original
block, without zeroing it, allowing an attacker to read the original bytes out
unchanged (lines 6 and 7).
It might seem merely undesirable for \fnc{realloc} to allow an attacker access to the
original data, but in a capability system this attack is particularly
egregious if that data contains capabilities, since an attacker can read those
and gain new abilities.
\subsubsection{Mitigations}
Mitigating this attack is relatively simple. When \fnc{realloc} shrinks a
block, any excess storage should be zeroed. Note that we consider this safer
than the seemingly similar alternative of zeroing excess storage when
\fnc{realloc} enlarges a block, because that implies a delay in zeroing that
might give an attacker other unexpected opportunities to read data.
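Concretely, the shrink path of the bucket-reusing \fnc{realloc} sketched in
\autoref{narrowwiden} could zero the excess immediately, taking the old size
from the input capability's bounds (a sketch, again using the hypothetical
\texttt{SC} and \fnc{size\_of\_bucket}):
\begin{lstlisting}[language=C]
// Shrink (or same-size) path of realloc only; the growing path is unchanged.
if (size <= size_of_bucket(ptr)) {
  void *new_ptr = cheri_address_set(SC, (ptraddr_t)ptr);
  size_t old_size = cheri_length_get(ptr);
  if (size < old_size)
    // Zero the excess immediately, so that a later, widening, realloc
    // cannot hand the old contents (or old capabilities) back out.
    memset((char *)new_ptr + size, 0, old_size - size);
  return cheri_bounds_set(new_ptr, size);
}
\end{lstlisting}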
\subsection{\escperms: Escalate Permissions}
When an allocator uses a `super' capability (as seen in
\autoref{narrowwiden}), there may be potential to upgrade a
capability's permissions as shown in the following simple attack:
\begin{lstlisting}[language=C,linebackgroundcolor={\ifnum\value{lstnumber}>6\color{gray!20}\fi}]
uint8_t *arr = malloc(16);
assert(cheri_perms_get(arr)
& CHERI_PERM_STORE);
arr = cheri_perms_and(arr, 0);
assert((cheri_perms_get(arr)
& CHERI_PERM_STORE) == 0);
arr = realloc(arr, 16);
assert(cheri_perms_get(arr)
& CHERI_PERM_STORE);
\end{lstlisting}
\noindent
We first allocate a block and check that the capability returned
is allowed to store data (\texttt{CHERI\_PERM\-\_STORE}) to that block (lines 1--3).
We deliberately remove the store permission (line 4), checking
that this permission really has been removed (lines 5 and 6). We then call
\fnc{realloc} (without changing the block's size) and check whether we have
regained the ability to store data via the capability.
\subsubsection{Mitigations}
There are two ways of mitigating such an attack. The simplest is to AND
the output capability's permissions with the input capability's permissions:
doing so guarantees that the output capability has no more permissions
than the input capability.
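In code, the first mitigation amounts to a single extra step just before
\fnc{realloc} returns, where \texttt{ptr} is the input capability and
\texttt{new\_ptr} the capability about to be handed back (a sketch):
\begin{lstlisting}[language=C]
// Ensure the returned capability carries no permission that the
// caller's input capability did not already have.
new_ptr = cheri_perms_and(new_ptr, cheri_perms_get(ptr));
return new_ptr;
\end{lstlisting}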
However, in some cases, one should consider validating the input capability to
decide whether any action should be possible. For example, if handed a
capability whose address is genuinely an allocated block, but where the
capability has the read and write permissions unset, should \fnc{realloc}
refuse to reallocate the block? Perhaps \fnc{realloc} should check that
the capability handed to it has exactly the same permissions as the
capability handed out by the most recent \fnc{malloc} or \fnc{realloc}?
There are no universal answers, but some cases may be
easier to rule upon than others.
\subsection{\escinauthentic: Escalate Inauthentic Capabilities}
\label{sec:escinauthentic}
An important variant on \escperms is to see whether an allocator will
reallocate a block pointed to by an inauthentic capability and return
an authentic capability:
\begin{lstlisting}[language=C,linebackgroundcolor={\ifnum\value{lstnumber}>4\color{gray!20}\fi}]
uint8_t *arr = malloc(16);
assert(cheri_tag_get(arr));
arr = cheri_tag_clear(arr);
assert(!cheri_tag_get(arr));
arr = realloc(arr, 16);
assert(cheri_tag_get(arr));
\end{lstlisting}
\noindent
Interestingly, none of the allocators we examined was vulnerable to this
attack. However, several fall into something of a
grey zone: despite not explicitly checking the capability's authenticity, the
allocators cause a \lstinline{SIGPROT} when they try to perform an operation on
the inauthentic capability. It is difficult to know whether this was an
expected outcome or not, in the sense that programs often explicitly rely on
null pointer dereferencing causing \lstinline{SIGSEGV} to maintain certain
security properties. However, since the outcome is a reasonable one, we have
chosen to give the allocators the benefit of the doubt in this case, and have
classified them as
invulnerable to this attack.
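An allocator that prefers to make the check explicit, rather than relying on a
later \lstinline{SIGPROT}, could place a guard such as the following at the top
of \fnc{realloc} (a sketch):
\begin{lstlisting}[language=C]
// Refuse inauthentic capabilities up front instead of relying on the
// hardware to trap when the capability is later dereferenced.
if (ptr != NULL && !cheri_tag_get(ptr))
  return NULL;
\end{lstlisting}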
\subsection{\myundef: Authentic Capabilities from Undefined Behaviour}
It is easy to assume that authentic capabilities can only be derived if one
follows CHERI C's rules correctly. However, it is possible for an attacker to use undefined
behaviour at the language level to trick an allocator into returning authentic capabilities
that it should not possess, as shown in this attack:
\begin{lstlisting}[language=C,linebackgroundcolor={\ifnum\value{lstnumber}>5\color{gray!20}\fi}]
uint8_t *arr = malloc(256);
for (int i = 0; i < 256; i++)
arr[i] = i;
arr = realloc(arr, 1);
free(arr);
arr = malloc(256);
for (int i = 0; i < 256; i++)
assert(arr[i] == i);
\end{lstlisting}
\noindent
This follows a similar pattern to \narrowwiden. We first allocate a block
and fill it with data (lines 1--3). Although not strictly necessary to demonstrate
the attack, we then reallocate the block down to a single byte, modelling the
case where we pass a capability with few abilities to an attacker (line 4). The
attacker then frees that block (line 5) and immediately allocates a block of the same
size (line 6) hoping that the new block is allocated in the same place as the old
block. If that is the case, they will obtain a capability spanning the same memory as the old block,
which allows them access to secret data (lines 7 and 8).
Interestingly, both the `non-attack' and `attack' portions of this attack
involve undefined behaviour. Most obviously, the attack portion of the code reads
data via a capability / pointer that it cannot ensure has been initialised. Less
obviously, both attacker and non-attacker have an equivalent capability (with
the same address and bounds) but, due to capability / pointer provenance rules,
the non-attacker's version of the capability is, technically speaking, no
longer valid by those rules. This outcome is unlikely to trouble an attacker.
\subsubsection{Mitigations}
There are no general mitigations for \myundef. For the particular concrete
example, a partial mitigation is for \fnc{free} to scrub memory so that, at
least, whatever was present in the buffer cannot be read by the attacker.
More generally, attackers may sometimes be able to use undefined behaviour to `alias'
a capability. In most such cases, the only solution is to
scan memory looking for all references to a capability whose bounds
encompass an address and render them inauthentic (so-called `revocation'~\cite{xia19cherivoke}),
downgrading a security leak into a denial-of-service.
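Returning to the concrete example, a scrubbing \fnc{free} might look as follows,
assuming (as for our bump allocator) that the capability's address is its lower
bound (a sketch):
\begin{lstlisting}[language=C]
void free(void *ptr) {
  if (ptr == NULL || !cheri_tag_get(ptr))
    return;
  // Zero the block so that a later allocation reusing this memory
  // cannot leak whatever the previous owner stored here.
  memset(ptr, 0, cheri_length_get(ptr));
  // ...return the block to the allocator's free data-structures...
}
\end{lstlisting}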
\subsection{\overlap: Capabilities Whose Memory Bounds Overlap with Another's}
\label{sec:overlap}
A capability's bounds span a portion of memory from a low to a high address. If
an allocator returns two distinct capabilities whose bounds overlap
(e.g.~because of the bounds imprecision we saw in
\autoref{sec:adapting_to_cheri} as a result of
\cite{woodruff19chericoncentrate}), an attacker might be able to read or write
memory they should not have access to. An example attack in this mould is:
\begin{lstlisting}[language=C,linebackgroundcolor={\ifnum\value{lstnumber}>1\color{gray!20}\fi}]
void *b1 = malloc(16);
void *b2 = malloc(16);
assert(
cheri_base_get(b1) >= cheri_base_get(b2)
&& cheri_base_get(b1) <
cheri_base_get(b2) + cheri_length_get(b2)
);
\end{lstlisting}
\noindent
In practice, such a simple attack is unlikely to succeed on all but the
most basic allocators, as the most likely attack vector is when an allocator
fails to take into account bounds imprecision. The `full' \overlap
attack initially finds the first 512 lengths that are not precisely representable
as bounds and then randomly allocates multiple blocks to see if any of the
resulting capabilities overlap.
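The core check used by the full attack can be expressed as a small helper over
\texttt{cheri\_base\_get} and \texttt{cheri\_length\_get} (a sketch; the
helper's name is ours):
\begin{lstlisting}[language=C]
#include <stdbool.h>

// Returns true if the bounds of capabilities a and b overlap, i.e. if
// they grant access to at least one common byte of memory.
static bool caps_overlap(void *a, void *b) {
  ptraddr_t a_lo = cheri_base_get(a);
  ptraddr_t a_hi = a_lo + cheri_length_get(a);
  ptraddr_t b_lo = cheri_base_get(b);
  ptraddr_t b_hi = b_lo + cheri_length_get(b);
  return a_lo < b_hi && b_lo < a_hi;
}
\end{lstlisting}
\noindent The full attack applies a check of this kind to every pair of
capabilities it obtains.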
\subsection{Failed Attacks}
Two attacks in~\autoref{tab:atks} failed ($\oslash$) due to what we
believe are unintended bugs (as distinct from
\autoref{sec:escinauthentic}, where we could not distinguish unintended
bugs from careful, efficient, programming), and thus we cannot state whether the allocator is
vulnerable or invulnerable.
\escperms fails on \memalloc{bump-alloc-nocheri} because \fnc{realloc}
causes a \lstinline{SIGPROT} when trying to increase a block in size. By
design, \memalloc{bump-alloc-cheri} contains a version of \fnc{realloc} which
fixes this issue (see~\autoref{sec:adapting_to_cheri}).
\overlap fails on \memalloc{snmalloc-cheribuild} due to what we believe is
an internal snmalloc bug. Because of this, we included a newer version
of snmalloc as \memalloc{snmalloc-repo} which is invulnerable to \overlap.
\section{Performance Evaluation}
\label{sec:performance}
We wanted to understand both the relative performance of allocators under CHERI,
and also the performance impact of porting allocators to CHERI. The former
measures performance within hybrid and purecap and is
straightforward (\autoref{ssec:withinhybridpurecap}). The latter measures
performance across hybrid and purecap and shows far greater differences
than we expected (\autoref{ssec:acrosshybridpurecap}). Because of that,
in \autoref{sec:dissection} we attempt to understand possible causes
for these differences.
\subsection{Methodology}
\begin{table}[t]
\begin{center}
\begin{tabular}{lll}
\toprule
Benchmark & Source & Characterisation \\
\midrule
barnes & mimalloc & Floating-point compute \\
binary-tree & boehm & Alloc \& pointer indirection \\
cfrac & mimalloc & Alloc \& int compute \\
espresso & mimalloc & Alloc \& int compute \\
glibc-simple & mimalloc & Alloc \\
%glibc-thread & mimalloc & alloc, multi-thread \\
mstress & mimalloc & Alloc \\
richards & richards & Pointer indirection \\
%rptest & mimalloc & \dejice{alloc, multi-thread} \\
%xmalloc & mimalloc & alloc, multi-thread \\ \hline
\bottomrule
\end{tabular}
\end{center}
\caption{Our benchmark suite. We list the benchmark name, the source
of the benchmark, and a brief characterisation of it as a workload.
\emph{Alloc} is short-hand for `allocator intensive' (i.e.~frequent
allocation and deallocation).}
\label{tab:benchmarks}
\end{table}
\begin{figure*}[t]
\begin{center}
\includegraphics[width=\textwidth]{fig/sn_je_malloc_bm_overall.pdf}
\vspace{-25pt}
\end{center}
\caption{\label{fig:eval jemalloc}
\memalloc{jemalloc} (top) and \memalloc{snmalloc} (bottom) running on purecap, normalised
to their respective hybrid allocators. For example, the wall-clock
execution time \emph{total-time} of
barnes with \memalloc{jemalloc} on purecap is 1.22x greater than
on hybrid. To understand why purecap is slower than we expected, we recorded
several performance metrics: \emph{rss-kb} is
maximum memory utilisation; \emph{INST\_RETIRED} the number of instructions
retired while executing the benchmark; and \emph{\{L1-I, L1-D, L2-D\}\_CACHE}
the L1 instruction, L1 data, and L2 data, cache misses respectively.
None of these factors provides obvious clues as to why purecap is so
much slower than hybrid.
}
\end{figure*}
We conducted our experiments on Arm's prototype Morello hardware:
a quad-core Armv8-A 2.5GHz CPU, with 64KiB L1 data cache,
64KiB L1 instruction cache, 1MiB L2 unified cache per core, two 1MiB L3 unified
caches (each shared between a pair of cores), and 16GiB DDR4 RAM. We ran
CheriBSD 22.12 as the OS, with both purecap (\texttt{/usr/lib/}) and
hybrid (\texttt{/usr/lib64/}) userlands installed.
We wanted a benchmark suite that contains benchmarks written in C, that have
minimal library dependencies (so that we best understand what is being run),
and that execute fixed workloads (rather than those that execute for a fixed
time). We selected 5 benchmarks from the \emph{mimalloc-bench}
suite~\cite{leijen19mimalloc} that meet these criteria, as well as the `classic'
binarytrees~\cite{boehm14artificial} and richards~\cite{richards99bench}
benchmarks. \autoref{tab:benchmarks} shows our complete benchmark suite.
Our benchmark suite deliberately contains a mix of allocation-heavy benchmarks
and non-allocation-heavy benchmarks, where the latter can serve as a
partial `control' to help us understand the effects of allocators on
performance relative to other factors.
We compile each benchmark with clang's \texttt{-O3} optimisation level. We
use \texttt{LD\_PRELOAD} at runtime to dynamically switch between allocators.
We measure wall-clock time on an otherwise unloaded Morello machine.
Since jemalloc and snmalloc were the highest performing allocators, we
concentrate on them for the rest of this section.
\subsection{Results Within Hybrid and Purecap}
\label{ssec:withinhybridpurecap}
Using the geometric mean, snmalloc is faster than jemalloc by 1.25x
and 1.24x in hybrid and purecap respectively.
Overall, the wall-clock time roughly correlates with instruction counts
and L1 instruction cache misses. snmalloc retires a similar number of instructions
compared to jemalloc in hybrid and 1.19x fewer in purecap. snmalloc also
has 1.73x and 1.6x fewer L1-I cache misses than jemalloc in hybrid and purecap respectively.
It is difficult to know whether this difference is because snmalloc is
faster in general, or because it has been more carefully tuned for purecap than
has jemalloc --- indeed, a combination of both factors is plausible.
Because the overall performance differences are clear and because, as we will soon see, there are much
greater differences to consider elsewhere, we do not dwell further on these results.
\subsection{Results Across Hybrid and Purecap}
\label{ssec:acrosshybridpurecap}
\autoref{fig:eval jemalloc} shows a comparison of \memalloc{jemalloc} and \memalloc{snmalloc}
across hybrid and purecap. We expected to see some performance difference between hybrid and purecap:
capabilities are likely to incur some costs due to their increased size
(e.g.~increased cache pressure); and purecap allocators make use of additional
security properties, such as bounds information, which require executing more
instructions. However, the differences are far greater than we expected:
using the geometric mean, jemalloc and snmalloc purecap
are 1.56x and 1.61x slower respectively than their hybrid counterparts.
This is a larger slowdown than we believe can be explained by the additional costs of using
capabilities in the allocator --- indeed, richards, a benchmark which performs
little allocation, also slows down by more than 2x.
\begin{figure}
\begin{center}
\includegraphics[width=1.0\columnwidth,trim={0.5cm 7cm 3.5cm 6.7cm},clip]{fig/instruction_mix/stacked_bar_chart.pdf}
\vspace{-30pt}
\end{center}
\caption{\label{fig:instrmix}
Dynamic instruction mix for hybrid (left-hand bar) and purecap (right-hand bar) benchmark runs.
Most of these are as expected. For example, branches are the same in hybrid
and purecap; some pointer arithmetic (a subset of `data proc' in hybrid)
moves from AArch64 to Morello (`morello arith'). Similarly, some loads
and stores move from AArch64 to Morello. There is a small but noticeable
increase in the overall quantity of loads and stores (AArch64 and Morello
combined). The `Morello misc' category captures instructions related to
capability bounds, tag checks, and so on that we would only expect to see in
any quantity in purecap.}
\end{figure}
\begin{figure}
\begin{center}
\includegraphics[width=1.0\columnwidth]{fig/fvp-stats.pdf}
\vspace{-25pt}
\end{center}
\caption{\label{fig:fvpstats}
Instruction counts per symbol, based on FVP traces for three benchmarks using
the CheriBSD system allocator (jemalloc). mstress and richards were
parameterised to perform fewer iterations than the default. Note that whilst
binary\_trees is dominated by the effects of \fnc{malloc} and \fnc{free}, the others
are dominated by their own code.
}
\end{figure}
\section{Analysing The Disparity Between Hybrid and Purecap Performance}
\label{sec:dissection}
The disparity in performance between hybrid and purecap surprised us,
and we thus explored three different avenues in an attempt to understand
the causes. In this section, we explain these avenues in ascending order of
difficulty.
Our results suggest that it is likely that much of the difference we see is not inherent to the use
of capabilities in a CPU, but is the result of immature toolchains and
micro-architectural anomalies common in a first prototype implementation of an
ambitious platform such as Morello. Our results should be interpreted in
that context: it is likely that further research, and future evolutions of
CHERI hardware, will change our view of capabilities' effect on performance.
Indeed, since we submitted this paper we have become aware of other researchers who are also
investigating similar issues\footnote{Although not available at the time of writing, we have been
informed that a detailed performance white paper will soon be available at
\url{https://www.arm.com/-/media/Files/pdf/white-paper/prototype-morello-performance-results}}.
\subsection{Simple Metrics}
We started by measuring several simple metrics, including the
resident set size (RSS), and several CPU performance counters: the results are
shown in ~\autoref{fig:eval jemalloc}.
We hoped that this might highlight possible causes such as increased cache pressure
in purecap. Unsurprisingly, overall purecap has higher costs across all
metrics. L2 data cache accesses increase in almost all cases in purecap,
suggesting increased pressure on the L1 data cache with capabilities. In most
cases memory usage is only slightly worse in purecap, though in the
allocator-intensive mstress RSS doubles.
However, overall, these factors do not suggest an obvious cause for the slowdown
we see.
\subsection{Class and Quantity of Instructions}
We wondered whether the higher instruction counts on purecap benchmarks
could be explained by the extra CHERI instructions that
an allocator needs (compare Listings~\ref{lst:bump_alloc1} and~\ref{lst:bump_alloc2}).
To understand this, we wanted
to count how often different `classes' of instructions were executed.
Unfortunately, there is currently no direct way on either CheriBSD or Morello
to obtain such a count, so we had to cobble together two approaches: a fast,
somewhat imprecise, approach based on QEMU; and a slow, but more
precise, approach based on Arm's Morello Platform FVP~\cite[p.~23]{morellofvp}.
In essence, the latter serves as a sanity check for the former.
We wrote a custom QEMU plugin that maintains counts for each class of
instructions as a Morello instance is running. This includes the kernel booting
and shutting down as well as the benchmark we are interested in. We repeatedly
ran CheriBSD without any meaningful workload, so that we could find the
instruction counts that together constitute the `head' and `tail' of execution.
When we ran a benchmark, we then subtracted the head/tail instruction counts to
obtain the `benchmark only' instruction counts.
To validate these results, we used the Morello Platform FVP, which can emit \emph{tarmac}
traces~\cite[p.~5452]{fvpguide} representing complete records of execution. We
altered the FVP so that when we executed an otherwise unused instruction, it
toggled tracing on and off. We executed a complete run of the binary\_tree
benchmark, examined the traces, and counted the instructions contained therein,
which were a close match to our QEMU instruction counts. This gives us confidence
that our QEMU figures are representative.
Since our QEMU approach only has to track a few integers, whereas FVP produces
tarmac traces that are often a TiB long for our workloads, our two approaches
differ in performance by about 3 orders of magnitude. Running our full
benchmark suite for its full duration would be infeasible in FVP, so the
results are from our QEMU approach.
The instruction mixes in \autoref{fig:instrmix} look largely sensible: some
aspects (e.g.~branches) are identical in hybrid and purecap; some
vary where capabilities are sometimes used (e.g.~loads and stores);
and purecap uses Morello instructions. Overall, while there are some
minor oddities (e.g.~mstress uses Neon vector instructions -- classified as
floating point -- to a much greater degree on hybrid compared to purecap), there are no obvious smoking
guns.
We then wrote an analysis tool for our FVP traces (which contain virtual
addresses) to provide per-function profiling, and ran this for one full
benchmark, and shortened versions of two others.
\autoref{fig:fvpstats} shows that some
functions execute many more instructions in purecap (vs.~hybrid) than
others. \fnc{malloc} and \fnc{free} are particularly strongly affected, but the
effect on other functions is fairly uniform.
\subsection{Low-level Differences}
Hybrid and purecap CHERI imply certain differences which we would expect to