SKIVA: Flexible and Modular
Side-channel and Fault Countermeasures

Pantea Kiaei¹, Darius Mercadier², Pierre-Evariste Dagand³,
Karine Heydemann⁴ and Patrick Schaumont⁵

¹ Virginia Tech, Blacksburg, USA, pantea95@vt.edu
² LIP6, Paris, France, darius.mercadier@gmail.com
³ LIP6, Paris, France, pierre-evariste.dagand@lip6.fr
⁴ LIP6, Paris, France, karine.heydemann@lip6.fr
⁵ Virginia Tech, Blacksburg, USA, schaum@vt.edu

Abstract. We describe SKIVA, a customized 32-bit processor enabling the design of software countermeasures for a broad range of implementation attacks covering fault injection and side-channel analysis of timing-based and power-based leakage. We design the countermeasures as variants of bitslice programming. Our protection scheme is flexible and modular, allowing us to combine higher-order masking – fending off side-channel analysis – with complementary spatial and temporal redundancy – protecting against fault injection. Multiple configurations of side-channel and fault protection enable the programmer to select the desired number of shares and the desired redundancy level for each slice. Recurring and security-sensitive operations are supported in hardware through a custom instruction set extension. The new instructions support bitslicing, secret-share generation, redundant logic computation, and fault detection. We demonstrate and analyze multiple versions of AES from a side-channel analysis and a fault-injection perspective, in addition to providing a detailed performance evaluation of the protected designs.

Keywords: Bitslicing · Side-channel attacks · Fault attacks · Custom-instruction extensions · Software Countermeasures

1 Introduction

Side-channel analysis and fault attacks have plagued cryptographic software on embedded processors for many years. The threat of power-based and timing-based side-channel leakage is well understood and countermeasures such as masking and constant-time programming figure prominently in the cryptographer’s toolbox [RBV17, BDF+17]. In parallel, the research community has gained more insight into the fault behavior of hardware and software, thus greatly increasing the potency of fault attacks [YSW18, RNR+15]. The impact of fault attacks is minimized with fault detection and temporal or spatial redundancy of the software execution [LHB14, BCR16].

Although there exists an extensive array of specific, dedicated countermeasures, there is surprisingly few work available [RDB+18, SMG16, SBD+18] offering protection against both side-channel analysis and fault injection. This is especially true for software. The programmer is left selecting candidate solutions, figuring out if and how they can safely be assembled. This is not an easy task because countermeasures may interact in non-trivial (and unsafe) manners. For example, time-redundant software [CPT17] or error-detecting codes [RBIK12] as fault countermeasures may increase side-channel leakage, while constant-time programming as a side-channel countermeasure increases the risk of precisely synchronized fault attacks [SMKLM02].
In this paper, we introduce SKIVA, a processor that enables a modular approach to countermeasure design, giving programmers the flexibility to protect their ciphers against timing-based side-channel analysis, power-based side-channel analysis and/or fault injection at various levels of security. We leverage existing techniques in higher-order masking, in spatial and in temporal redundancy. Modularity is achieved through bitslicing, each countermeasure being expressed as a transformation from a bitsliced design into another bitsliced design. The capabilities of SKIVA are demonstrated on the Advanced Encryption Standard, but the proposed techniques can be applied to other ciphers as well.

Tackling physical effects with software. The protection of software against side-channel analysis and fault attacks is challenging. Side-channel leakage and faults are physical effects in the processor hardware. The programmer can control macro-level properties such as the control path of software or the amount and nature of memory references. But a large portion of software execution occurs “under the hood”. As a processor fetches, decodes and executes each instruction, the sensitive data handled by an instruction moves through the processor hardware. The timing, power consumption, and fault sensitivity of each instruction is obscured to the programmer. Due to this hardware abstraction, predicting physical execution properties of sensitive data is very hard for the programmer, as the following examples illustrate:

- The instruction timing is determined by the micro-architecture pipeline configuration, the cache organization, the presence of branch prediction, among other factors. A programmer cannot predict the execution time of a program from the source code alone. The processor hardware is optimized to make the common case fast [PH12], but it is unable to deliver strong guarantees on the timing of a single, specific instruction. Instead, instruction timing is strongly affected by the execution context.
- The instruction power dissipation is affected by signal transitions on programmer-invisible processor structures such as buses, buffers, memories, and logic. The power dissipation of these signal transitions is proportional to the Hamming distance between former and current data values in the hardware. In many cases, for example when the hardware is a shared resource, the former data value is unknown or invisible to the programmer.
- The instruction fault-sensitivity is determined by the electrical properties of hardware structures in the processor including their critical path and their threshold levels [BGV11, YSW18]. However, published hardware datasheets only list typical, maximum or minimum ratings. The fault sensitivity of a specific instruction is therefore unknown to the programmer. This applies not only to timing, but also to power dissipation.

As a result, contemporary processors do not offer a comprehensive guarantee on the physical execution properties of hardware: implementing a secure (yet reasonably efficient) cipher is thus exceedingly hard [BGG+14].

Symmetry from bitslicing. In order to get a handle on this problem, we adopt a bitsliced execution model. In the bitsliced model, the $n$-bit datapath of the processor is seen as $n$ 1-bit processors operating in parallel. Such an SIMD array of $n$ parallel 1-bit processors offers a significant degree of symmetry and regularity. Through this symmetry, the programmer gets a grip on the physical execution properties of software, at least in a relative sense:

- The cycle-time of parallel bitslices within an instruction is matched. The amount of clock cycles, used by any single bitslice within a processor word to complete a bitslice program, is the same as for any other bitslice of the same word. This property also holds under typical processor latency effects (pipeline hazards, caches, branch prediction). Every bit of a processor word experiences the same clock-cycle delay.
Figure 1: In a standard representation, processor registers are allocated per data word. In a bitsliced representation, processor registers are allocated per bit-weight of a block of data words. In an aggregated bitslice representation, multiple bitslices are allocated per data bit. Aggregated bitslices can be shares of a masked design, redundant data of a fault-protected design, or a combination of those.

- The power consumption of parallel bitslices in an instruction is matched. If two parallel bits in a CPU word make the same transition under the same bit-wise instruction, then they will have the same power dissipation. Of course, process manufacturing variations, and variations in on-chip and PCB routing may cause small differences in power. But these are second-order effects compared to the first-order symmetry obtained by the bitslices within a processor word.

- The instruction fault sensitivity of parallel bitslices in an instruction is matched. For example, if two parallel bits in a CPU word experience the same timing fault under the same bit-wise instruction, then we expect a matched fault effect. As with power consumption, there may be small static variations due to process manufacturing and routing [LSG+10].

Countermeasure design through bitslice aggregation. The symmetry of bitslices in a processor word is the basis for the modular protection schemes enabled by SKIVA. Figure 1 demonstrates three different organizations of a register file in a processor. We obtain the bitslice representation through a matrix transposition of the input data, so that one processor register contains all bits of a given weight. The key idea of bitslice aggregation is to allocate multiple slices to the representation of each data-bit.

Timing-based Side-channel Leakage. Bitslices – aggregated or not – are naturally synchronized, and always use a matching amount of cycles. Furthermore, bitslice programming naturally leads to straight-line programs with precisely defined timing characteristics [Bit97]. Straight-line programs have to be written at the level of bit-operations, and hence they are not easy to develop for the programmer. Fortunately, bitslice software generation can be automated [Por01, SS16, MDLM18].

Power-based Side-channel Leakage. In an order-\(d\) masked implementation, a single secret data bit is split into \(d+1\) shares using a masking method and \(d\) random shares. An order-\(d\) masked implementation is theoretically protected against order-\(d\) side-channel attacks. By allocating different shares as parallel slices, we obtain a parallel masked implementation [BDF+17], as demonstrated by Balasch et al. for ARM [BGRV15] and by Gregoire et al. for ARM-NEON [GPSS18]. Conceptually, their aggregate represents a single bit.
Fault redundancy. Bitslice aggregation enables spatial redundancy by encoding a single bit multiple times in each of the available slices. This allows the detection of data errors such as bit-flip and stuck-at faults and it is one element in a comprehensive fault countermeasure [PYGS16, LCFS18]. Interestingly, aggregation of bitslices can also offer support for temporal redundancy. In an iterative block cipher, for example, different bitslices can execute different rounds of a redundant cipher. This protects against instruction-skip and faults on the control path.

Contributions. In this paper we introduce SKIVA, a processor with built-in support for modular countermeasures against side-channel analysis and fault analysis. We make the following contributions:

1. A flexible and modular methodology for designing countermeasures. It enables the combination of a higher-order masking with spatial fault-redundancy and with temporal fault-redundancy. The number of shares and fault-redundancy levels are statically determined by the programmer (single, double, quadruple shares and single, double, quadruple fault-redundancy).

2. Hardware support for the proposed methodology in SKIVA, a processor with instruction set extensions specialized for bitsliced transposition, bitsliced masked operation, bitsliced fault detection, redundant bitsliced expansion and Boolean operations on complementary data.

3. Performance analysis of the Advanced Encryption Standard on SKIVA, under multiple levels of side-channel and fault-resistance.

4. Side-channel leakage evaluation of SKIVA implemented as a soft-core processor on a SAKURA-G FPGA board, extensive code size and performance evaluation, theoretical as well as empirical analysis of fault detection coverage.

Outline. In Section 2, we capture preliminaries to establish a common background among readers. We discuss the attacker model (Section 2.1) and review the related work, covering the design of countermeasures (Section 2.2) and the design of bitsliced software (Section 2.3). In Section 3, we introduce several modular countermeasure schemes. Starting with bitslicing (Section 3.1), we describe our systematic treatment of higher-order masking (Section 3.2), intra-instruction redundancy (Section 3.3), and temporal redundancy (Section 3.4). Finally, we demonstrate that these features naturally combine to form a coherent countermeasure within the assumptions of the attacker model (Section 3.5). In Section 4, we discuss the design and implementation of SKIVA, covering the instruction set extension and its overhead. In Section 5, we evaluate the software performance results, quantify the side-channel leakage of several levels of countermeasures, analytically and empirically bound the impact of fault attacks.

2 Preliminaries

We introduce the attacker model that is covered by our countermeasures (Section 2.1). We then review the literature for existing protection schemes (Section 2.2), with an eye toward modular techniques as well as countermeasures against fault and side-channel attacks. Finally, we recall the notion of bitslicing and review related work protecting bitsliced designs against fault-attacks and side-channel attacks (Section 2.3).

2.1 Attacker Model

The attacker model captures the presumed capabilities of an attacker. SKIVA considers adversaries with fault-injection and side-channel measurement capabilities.
Fault attacker model. We consider an attacker who intends to perform Differential Fault Analysis (DFA) [BS97, TMA11] or Statistical Fault Analysis (SFA) [FJLT13]. To carry out such an attack, she must induce a fault on an intermediate value or on control flow (e.g., to force a loop count reduction [DMM+13]). Fault injection is achieved by stressing the electrical environment of the digital hardware. This induces transient faults that set, reset, or flip one or several bits in a storage element (register or memory) of the platform. The exact effect of a fault (or its statistical distribution) depends highly on the injection technique, its parameters, the target architecture, the manufacturing technology of the attacked device and the attacker’s skills and time.

It is however generally agreed that there is a trade-off between the temporal and spatial resolution of a fault on the one hand, and the complexity of the fault injection on the other [YSW18]. The presumed fault attacker of SKIVA is not all-powerful: we consider low-cost injection means, with limited temporal resolution and/or spatial control over the fault effects. Concretely, we assume that the attacker is able to inject transient faults affecting a selected bit, byte, half-word, or word. We further restrict multi-bits faults to either setting or resetting the entire byte, half-word, or word (stuck-at 1 and stuck-at 0 models), or overwriting a selected byte, half-word or word with a random value. This excludes expensive equipment (e.g., multi-spot laser [Col19]) and the ability to inject chosen values.

At the software level, this manifests itself as either a data corruption or an instruction corruption. A data corruption results from either a direct corruption (e.g., a corruption of data transfer, of the bus, data path, or computational logic) or from indirect corruption (e.g., modification of an instruction opcode). Instruction corruption may occur either during instruction fetch [MDH+13] or instruction read from Flash [Col19]. Most instruction corruptions reduce to a data corruption: an instruction is substituted for another, leading to an incorrect value to be stored in a register. In line with the limited capabilities of our attacker, we consider that an attacker is unable to corrupt the address of a jump to a chosen value. We therefore model control faults as instruction skip, whereby a chosen instruction is simply ignored during execution.

Side-channel attacker model. SKIVA is oriented towards embedded applications. The attacker controls data input, and can perform chosen-plaintext or chosen-ciphertext operations. This enables the gamut of differential analysis techniques. We assume an attacker who can observe the timing as well as the power dissipation of the processor. Timing measurements, such as done by precise measurement of input/output operations, proceed at the cycle-accurate level. This resolution enables extraction of data-dependent control-flow, and a slew of micro-architectural attacks [GYCH16]. The attacker is also able to monitor power dissipation by shunting the power supply or by measuring electromagnetic emissions. SKIVA is thus subject to side-channel attacks such as cache-timing analysis [Ber05], correlation power analysis (CPA) [BCO04], and differential electromagnetic analysis [CPM+18]. We also consider high-order side-channel analysis. The use of aggregated bitslices in SKIVA means that all shares of a secret are processed in parallel. In this contribution, we aim to demonstrate that SKIVA successfully supports this mode of operation. For this reason, we use a univariate leakage assessment methodology [SM15a] that evaluates leakage in the sample stream at multiple different leakage orders. In a generalized higher-order side-channel evaluation methodology, the attacker would also combine independent observations of side-channel leakage. Multi-variate leakage evaluation is outside of the scope of this contribution.

Combined attacker model. We also consider the case of an active attacker, combining both side-channel measurements and fault injection [GM10, AVFM07]. For instance, it has been shown that fault protection mechanisms tend to increase leakage and thus
facilitate side-channel analysis [RBK12, CPT17]. We take this effect into account in our experimental evaluation (Section 5.2).

Faults can also be used to mitigate the effect of SCA countermeasures [RLK11, DV12]. Lacking a well-established methodology to evaluate countermeasures against such attacks, we exclude it from our attacker model. In particular, we assume that our target platform offers an embedded and protected True Random Number Generator (TRNG) providing 32 bit of randomness at regular intervals in a dedicated register. We shall therefore ignore fault attacks specifically targeting the TRNG to disable re-masking [YYP18].

Recent advances in the area of Statistical Ineffective Fault Attacks (SIFA) [DEK18] have demonstrated that countermeasures based (solely) on fault detection may be insufficient, even in the context of a masked implementation. SKIVA offers a framework for exploring the design space of countermeasures resilient to SIFA, such as self-destruction [YGD16], fault-correction, or hiding. However, designing and evaluating countermeasures against SIFA is a burgeoning area of research: in the present work, we therefore exclude this vector from our attacker model.

2.2 Countermeasures

A limitation of many countermeasures is that they are tailored to a specific algorithm. A systematic countermeasure is one which can be applied to any cryptographic operation. Instruction-level countermeasures are generic. They can be applied at the assembly level or as part of a compiler pipeline. We strive to design a set of systematic countermeasures that are modular, i.e., program transformations that can safely be chained one after the other, yielding an overall protection equal to the sum of its parts. Armed with these building blocks, programmers can adjust the security of their ciphers at will.

**Systematic countermeasures.** Systematic fault countermeasures are possible through automated instruction duplication and control-flow tracking [PHBC17, LHB14] or by exploiting intra-instruction redundancy in the target instruction set [CSN17]. Intra-instruction redundancy is enabled either by SIMD or custom instructions (Section 4).

Systematic side-channel countermeasures against power-based side-channel analysis have overwhelmingly used masking techniques, driven by correctness criteria for the resulting side-channel security such as Perfect Masking [BGK05], Threshold Implementations [NRR06], and DOM [GMK16]. The difficulty of producing secure and efficient masked implementations in software has lead to various attempts at automating this process. In the setting of the CAO domain-specific language [BMP11], it has been shown that the (strictly necessary) masking gadgets can be automatically synthesized from user-given annotations identifying public and private data [MOPT11, MOPT12]. A similar technique can be applied directly to C code [ABMP13, EW14] or assembly [BRN15], instrumenting the LLVM compiler infrastructure to carry public/private annotations to the intermediate representation, performing static analysis to identify vulnerable program points and inserting standard masking gadgets [RP10]. Threshold implementation lends itself naturally – by its very design – to a systematic treatment, which has been implemented as a compilation pass in the LLVM compiler [LAZ17].

**Modular countermeasures.** To the best of our knowledge, very few work tackles the issue of systematically protecting a cipher against both faults and side-channel analysis. One line of work, targeting hardware implementations, applies error-detecting codes on top of a threshold implementation [SMG16]. Another approach is the "tile-probe-and-fault" model [RDB18] that postulates the physical isolation of the underlying hardware (the tiles). However, from the author’s own admission, “this model [...] does not perfectly fit commercial off-the-shelf multi-core architectures”. To address these shortcomings, the
FRIT permutation \cite{SBD+18} has been specifically designed to allow inexpensive fault detection while being protected against side-channel attacks. This latter work demonstrates that combined defense is feasible at a software level (by exhibiting a secure, bitsliced implementation of FRIT). However, supporting legacy ciphers remains an open question.

2.3 Bitsliced software design

Bitslicing is a folklore technique to produce high-throughput, constant-time software implementations of cryptographic primitives \cite{Bih97, KS09}. A cipher is expressed as a Boolean circuit. The circuit is compiled into a straight-line program by leveling the circuit and translating each Boolean operation to a corresponding bitwise CPU instruction. Since the CPU manipulates registers of 32 bits, running the resulting program amounts to running 32 parallel instances of the original Boolean circuit.

**Bitslicing versus wordslicing.** In a block cipher, the state variables are \(k\)-bit wide. The bitsliced version of the cipher will store these \(k\) bits in a transposed manner, such that register \(i\) will contain the \(i\)-th bit of the state. This approach has been used for DES \cite{Bih97} as well as for AES \cite{RSD06}. However, one can also adopt wordslicing, which stores groups of \(b\) bits out of a \(k\)-bit state per register. A wordsliced design requires \(k/b\) registers, as opposed to \(k\) registers for a bitsliced design. Wordsliced design has been demonstrated for AES \cite{Kön08, KS09}. The choice between bitslicing and wordslicing has significant impact on the efficiency of the resulting design. The resulting code also changes significantly with the slicing strategy. The bitsliced implementation of AES has to juggle with 128 machine words while being restricted to straightforward logical instructions. The wordsliced implementation of AES fits within 8 registers, at the expense of complex permutations within individual words. On an embedded RISC-like CPU, our experiments have shown that the bitsliced implementation yields a higher throughput than the wordsliced one (Section 5.1). Conversely, on a high-end SIMD CPU, earlier work has shown that wordslicing is key to reach speed records in software encryption \cite{KS09}. The lack of SIMD instructions and the lesser register pressure for RISC CPUs thus favors bitsliced implementations, hence our focus on bitslicing in the present work.

**Countermeasures for bitsliced designs.** Many hardware-oriented countermeasures can be applied as transformations on the Boolean programs of bitsliced designs. An early effort to address power-based side-channel leakage is the duplication method \cite{DPA00}. More recently, several masking-oriented techniques have been proposed \cite{CS18, BDF+17, JS17, GPSS18}. Bitslicing is also a systematic countermeasure against timing attacks. By construction, a Boolean program runs in constant time. For instance, S-boxes have data-independent runtime when run as Boolean programs. Similarly, conditionals in a Boolean program are implemented through data-multiplexing: both results are (sequentially) computed and the relevant output is obtained by demultiplexing these intermediary results based on the conditional. Finally, the massively parallel nature of a bitsliced implementation can be exploited to provide intra-instruction redundancy (encrypting the same data in redundant slices) as well as various forms of temporal redundancy (processing data at distinct rounds in distinct, randomly-chosen slices) \cite{PYGS16, LCFS18}. In a bitsliced setting, these techniques translate into an end-to-end protection, protecting a cipher from the moment the plain text is introduced to the moment the cipher text is produced.

In the following, we demonstrate that, with some hardware support, bitslicing provides a sound basis to generalize some of the systematic protection schemes presented in the literature and gain protection against both attacks.
3 Modular design of countermeasures

In this section, we present the four protection mechanisms we adopt – bitslicing to protect against timing attacks, higher-order masking to protect against side-channels, intra-instruction redundancy to protect against data faults and temporal redundancy to protect against control faults – and exhaustively explore this design space opened up by our ability to compose them together.

Throughout this paper, we focus solely on the AES cipher targeting our 32-bit SKIVA processor. We chose the AES cipher for pedagogical reasons. It provides a yardstick to judge our protection scheme: it is well known in the community at large, both in terms of side-channel analysis, fault attack vectors as well as performance. However, the panel of techniques is not restricted to this cipher nor this processor: they naturally generalize – in a systematic manner – to any cipher in sliced form, for processors of arbitrary width as well as design (RISC as well as CISC, SIMD or not). We leave it to future work to evaluate their effectiveness on a broader range of cryptographic primitives and hardware platforms.

3.1 Constant-time programming

Our implementation of AES is fully bitsliced: the 128-bit input of the cipher is represented with 128 variables. Since each variable stores 32 bits on SKIVA, a single run of our primitive computes 32 parallel instances of AES. In Section 5.1, we show that, despite its register pressure, this implementation is the most efficient one on this RISC processor offering 32 general-purpose registers.

This bitsliced implementation of AES is the cornerstone of our work. The protection mechanisms presented in the following assume the availability of a bitsliced design while themselves producing a bitsliced design (of lesser parallelism) in return. The modularity of our approach lies in this simple observation: as long as there is enough parallelism to compute at least one run of the algorithm, we can chain these program transformations.

3.2 Higher-order masking

We protect our implementation against power analysis attacks by adopting the higher-order masking method of Barthe et al. [BDF+17, Algorithm 3] at order 4 and a Trichina gate at order 2. At order 4 and following Journault and Standaert [JS17] bitsliced implementation of AES, we conservatively refresh the output of every multiplication to achieve composability through strong non-interference (as per [BDF+17, Table 4]).

Our bitsliced version of the (order-2) Trichina gate has a built-in refresh at the output. Specifically, if $x$ and $y$ are two-share inputs and $r$ is a two-share random vector, then the two-share output is obtained as

$$z = x \cdot y \oplus x \cdot \text{rot}(y, 1) \oplus r \oplus \text{rot}(r, 1)$$

Optimizing this masked design by reducing the number of refresh [BBD+16, BGR18] is orthogonal to the present work: our performance results serve as a pessimistic baseline; the SKIVA platform would accommodate optimized implementations – already existing or to come – just as well.

We support masking with 1, 2, and 4 shares leading to respectively unmasked, 1st-order, and 3rd-order masked implementations. By convention, we use the letter $D$ to denote the number of shares ($D \in \{1, 2, 4\}$) of a given implementation. Within a machine word, the $D$ shares encoding the $i^{th}$ bit are grouped together, as illustrated in Figure 2 for ($D \in \{1, 2, 4\}, R_s = 1$). Starting from a bitsliced design, this transformation is systematic: non-linear instructions are expanded into a masked multiplication (followed by a refresh for $D = 4$) while linear instructions are replicated over each share. The parallelism of the resulting bitsliced design is divided by $D$. The overall run-time increases with the number of
latency is left unchanged, hence we expect the throughput of the resulting cipher to be complement operation on their redundant copies. We describe such an instruction set in translated into an instruction that performs this operation on half of the slices and the for equality of the output slices. The complementary-redundancy scheme requires special
forms. One can either implement a faults such as instruction skip: because redundancy is spatial and not temporal, skipping duplicates of the slices agree upon the result. By convention, we use the letter $D = 1$
provides a simpler model to reason about and control interferences across shares.

3.3 Intra-instruction redundancy

We protect our implementation against data faults using intra-instruction redundancy (IIR) [PYGS16, LCFS18, CSN+17]. We may duplicate a single slice into one (i.e. no spatial redundancy), two or four slices, checking at the end of each round that all (redundant) slices agree upon the result. By convention, we use the letter $R_s$ to denote the spatial redundancy ($R_s \in \{1, 2, 4\}$) of a given implementation. Within a machine word, the $R_s$
duplicates of the $i^{th}$ bit are interspersed every $32/R_s$ bits, as illustrated in Figure 2 for
($D = 1, R_s \in \{1, 2, 4\}$). Note that this scheme alone does not protect against control faults such as instruction skip: because redundancy is spatial and not temporal, skipping a (parallel, bitwise) operation would affect all the redundant slices simultaneously.

Starting from a bitsliced design, this transformation is systematic and exists in two forms. One can either implement a direct redundant implementation, in which the duplicated slices contain the same value, or a complementary redundant implementation, in which the duplicated slices are complemented pairwise. For example with $R_s = 4$, we can have 4 exact copies (direct redundancy) or 2 exact copies and 2 complementary copies (complementary redundancy). The direct-redundancy scheme requires no change to the code: we merely have to duplicate the inputs upon calling the protected code and testing for equality of the output slices. The complementary-redundancy scheme requires special support from the processor: a logical instruction in the original bitsliced design must be translated into an instruction that performs this operation on half of the slices and the complement operation on their redundant copies. We describe such an instruction set in Section 4. The parallelism of the resulting bitsliced design is divided by $R_s$. The overall latency is left unchanged, hence we expect the throughput of the resulting cipher to be divided by $R_s$.

In practice, we will favor complementary redundancy instead of direct redundancy. First,
it is less likely for complemented bits to flip to consistent values due to a single fault injection. For instance, timing faults during state transition [ZDCT13] or memory accesses [BGV11] follow a random word corruption or a stuck-at-0 model. Second, following the wave dynamic differential logic (WDDL) approach [TV04], this enables us to expose the same Hamming weight for each individual register throughout the entire execution of the cipher, which has been shown to reduce power leakage compared to direct redundancy [BJHB18].

3.4 Temporal redundancy

We protect our implementation against control faults using temporal redundancy (TR) across rounds [PYGS16]. This technique consists in pipelining the execution of 2 consecutive rounds in 2 aggregated slices (Figure 3a). By convention, we use the letter $R_t$ to distinguish implementations with temporal redundancy ($R_t = 2$) from implementations without ($R_t = 1$). For $R_t = 2$, half of the slices compute round $i$ while the other half compute round $i - 1$. The corresponding pseudo-code is shown in Figure 3b. The function init_round starts the pipeline by filling half of the slices (in state) with the output of the first round of AES, and the other half with the output of the initial AddRoundKey. At the end of round $i + 1$, we have re-computed the output of round $i$ (at a later time): we can therefore compare the two results (using the check procedure) and detect control faults based on the different results they may have produced. If a control fault has impacted the output of round $i$ during iteration $i$ or (exclusively) $i + 1$, it will necessarily be detected. To go unnoticed, the fault must be repeated in both rounds – while not impacting the subsequent round computed at iteration $i + 1$ – so as to yield the same output in both iterations.

Note that unlike usual implementations of temporal redundancy, such as instruction duplication [PHBC17], this technique does not increase code size: the same instructions compute both rounds at the same time. The last round, omitted from Figure 3b, is different from the others as it does not perform MixColumn, and must therefore be computed twice in a non-pipelined fashion (i.e., using instruction duplication), after which a final check is performed.

Whereas pipelining protects the inner round function, faults remain possible on the control path of the loop itself. For instance, one may attempt to sidestep the rounds by (data) faulting the loop counter or (control) faulting the conditional jump to reach the end of the loop earlier or later than desired. We protect against these threats by applying folklore loop hardening techniques: we spatially duplicate the 4-bit counter to protect against data faults and duplicate the control structure of the loop (Figure 3b).

By exploiting the iterative nature of the algorithm and the bitsliced implementation of a round, we obtain a data and control fault protection at minimal expense in code size and execution time. Since the parallelism of the inner round is divided by $R_t$, we expect the overall throughput of the cipher to be divided by $R_t$. Beside throughput, this implementation is also space-efficient (our target platform features only 128 kB of RAM): the protection against control faults piggybacks on the protection against data faults, thus avoiding instruction duplication and keeping code size in check.

3.5 Combining higher-order masking, IIR and TR

The protections described in the previous sections transform bitsliced designs into bitsliced designs, merely reducing parallelism (and thus throughput) in the process. As a result, they naturally compose: given a number of shares $D \in \{1, 2, 4\}$, a spatial redundancy $R_s \in \{1, 2, 4\}$ and a temporal redundancy $R_t \in \{1, 2\}$, we can systematically derive an implementation immune to power analysis and/or data faults and/or control faults, processing $32/(R_t \times R_s \times D)$ blocks at a times. The 9 possible layouts for $(D, R_s, R_t = 1)$ are illustrated in Figure 2.
void AES_secure(uint plain[128], uint keys[11][128],
uint cipher[128]) {
    uint state[128];
    // Aggregated bitslice 'state': plain and first round
    init_round(state, plain, keys[0], keys[1]);
    // Data-duplicated loop counter, increment and guard
    int round_cpt = 1 | (1 << 4);
    const int incr = 1 | (1 << 4);
    const int last_round = 9 | (9 << 4);
    // Duplicated loop structure
    while (1) {
        while (1) {
            // Retrieve key from duplicated round index
            uint[128] round_key = load_key(keys, round_cpt);
            // Compute current and previous round in parallel
            AES_round_bitsliced(state, round_key, plain);
            // Check temporal redundancy
            check(state, plain);
            memcpy_secure(plain, state, 128*sizeof(uint));
            // Increment data-duplicated counter
            round_cpt += incr;
            // Duplicated loop exit
            if (round_cpt == last_round) break;
        }
        if (round_cpt == last_round) break;
    }
    // last round twice, checked for temporal redundancy
    (...)}

Figure 3: Compact and protected AES skeleton

The modularity of our approach paves the way for pay-as-you-go countermeasures: depending on the execution context and security requirements of our cipher, we can decide to adopt a more or less aggressive set of parameters \((D, R_s, R_t)\). Different protections are obtained by combination of the 3 elementary protection mechanisms available. Let us first consider the most secure implementations and justify their relevance. In a setting where multiple cycle-accurate and bit-precise faults are possible \([NHH+16]\), we would recommend an implementation with \((D \in \{2, 4\}, R_s \in \{2, 4\}, R_t = 2)\). If faults cannot be reliably repeated in a cycle-accurate and/or bit-precise manner, then \((D \in \{1, 2, 4\}, R_s = 1, R_t = 2)\) is sufficient. A physically isolated device forbids power analysis but is not necessarily immune to faults \([TSS17, KDK+14]\) altogether: in this setting, one could adopt \((D = 1, R_s \in \{2, 4\}, R_t = 2)\). We may further strengthen our hypothesis about the device and thus relax our security requirements. For example, we may dispense from redundancy altogether if the device is physically protected against probes \([YGD+16, CBD+15, LLF16]\), yielding \((D \in \{2, 4\}, R_s = 1, R_t = 1)\). Or we may assume that the underlying architecture provides hardware support enforcing control-flow integrity \([WWM15, dCKC+16]\), in which case temporal redundancy can be disposed of but not spatial redundancy, covering the cases where \((D \in \{1, 2, 4\}, R_s \in \{2, 4\}, R_t = 1)\). We have thus mapped the entire design space.
Table 1: Proposed ISE. These instructions are added to the standard SPARC-V instruction set, occupying unused opcodes. Symbols in the instruction format - \( rs1, rs2, rd \) are registers. \( \text{imm} \) is an immediate operand. The “Type” column shows what opcode group was used for each instruction. Appendix A lists the functional specification for each instruction.

<table>
<thead>
<tr>
<th>Semantics</th>
<th>Instruction format</th>
<th>Immediate</th>
<th>Type</th>
</tr>
</thead>
<tbody>
<tr>
<td>Normal → Bitslice</td>
<td>TR2 rs1, rs2, rd</td>
<td>logic</td>
<td></td>
</tr>
<tr>
<td>Bitslice → Normal</td>
<td>INVTR2 rs1, rs2, rd</td>
<td>ld/st</td>
<td></td>
</tr>
<tr>
<td>Slice Rotation</td>
<td>SUBRO1 rs, imm, rd</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Redundancy Generation</td>
<td>RED rs, imm, rd</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Redundancy Checking</td>
<td>FTCHK rs, imm, rd</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Redundant AND (( R_s=2 ))</td>
<td>ANDC16 rs1, rs2, rd</td>
<td>logic</td>
<td></td>
</tr>
<tr>
<td>Redundant XOR (( R_s=2 ))</td>
<td>XORC16 rs1, rs2, rd</td>
<td>logic</td>
<td></td>
</tr>
<tr>
<td>Redundant XNOR (( R_s=2 ))</td>
<td>XORNC16 rs1, rs2, rd</td>
<td>ld/st</td>
<td></td>
</tr>
<tr>
<td>Redundant AND (( R_s=4 ))</td>
<td>ANDC8 rs1, rs2, rd</td>
<td>logic</td>
<td></td>
</tr>
<tr>
<td>Redundant XOR (( R_s=4 ))</td>
<td>XORC8 rs1, rs2, rd</td>
<td>logic</td>
<td></td>
</tr>
<tr>
<td>Redundant XNOR (( R_s=4 ))</td>
<td>XORNC8 rs1, rs2, rd</td>
<td>ld/st</td>
<td></td>
</tr>
</tbody>
</table>

4 Implementation aspects

In the previous section, we have introduced several bitslice aggregation schemes providing multiple levels of side-channel resistance and fault-attack resistance, depending on a selected number of shares, level of spatial redundancy, and level of temporal redundancy \((D, R_s, R_t)\). In this section, we present the SKIVA hardware, a custom instruction set extension (ISE) tailored to support an efficient and safe implementation of these schemes. We first lay bare the hardware and software assumptions our design is operating under (Section 4.1) and expound its semantics (Section 4.2).

4.1 Hardware design space

Custom instructions are commonly used as performance-enhancing mechanisms [GGP08] since a single custom instruction can replace multiple standard instructions. Custom ISE is also useful to support hardware-specific side-channel countermeasures, such as mask generation [TG07] or hiding [RCS+09]. Adding a new instruction to a processor requires modification of the processor data-path as well as modification of the processor software toolchain. With the advent of open platforms such as RISC-V and the widespread availability of programmable logic (FPGA), instruction extensions have become a practical methodology. For example, XCrypto [MPP19] defines instruction extensions for RISC-V. CRISP [GGP08] is another effort to add native support for bitslicing in a processor design. CRISP defines three new instructions, each with six operands. Their custom instruction datapath relies on two programmable lookup tables with four input bits and one output bit. However, these instructions only deal with bitslicing, and they do not offer redundancy nor support for countermeasures.

Design. The design of new instructions involves a trade-off between a specialized, application-specific solution and a general-purpose, universal solution. Each custom instruction must serve as many different applications as possible. We added new instructions to SKIVA to support computing on aggregated bitslices in three different areas. First, they help with the conversion from normal representation to bitsliced form and back. Second, they handle subword-operations for the computation of non-linear operations on two or four shares \((D \in \{2, 4\})\). Third, they handle subword-operations for spatially redundant computations and fault checking \((R_s \in \{2, 4\})\). The new instructions
are summarized in Table 1 and will be described in detail in further subsections. Appendix A provides their functional specification. These new instructions are orthogonal; they can be used in a mix-and-match fashion to obtain the desired level of sharing and redundancy. We integrated the new instructions on the SPARC-V instruction set of the open-source LEON3 processor and software toolchain [Res18].

**Hardware integration.** Figure 4 illustrates the integration of the custom datapath into the seven-stage RISC pipeline. The instructions follow a two-inputs, two-outputs operand format, encoded as two source registers, a destination register and an immediate field (\texttt{INS rs1, rs2, rd, imm}). The upper 32-bit output of the custom instruction is transferred to the Y-register, a register which is used for SPARC-V instructions with 64-bit output such as the regular data multiplication. The integration of custom-hardware deep into the pipeline necessitates the use of simple and fast datapath hardware. However, these instructions benefit from the same performance advantages as regular instructions including a typical throughput of 1 instruction per cycle and minimal stall effect thanks to forwarding [PH12].

The new instructions are mapped into unused opcodes of the SPARC-V instruction set [SI92]. This standard instruction set recognizes three different formats. The newly added instructions belong to the third format, sharing the same opcode space as the instructions for load/store, logic, and arithmetic, among others. Because all of the proposed instructions correspond to simple logic manipulations, we integrated them directly into the existing logic/shift group and the load/store group of SPARC. Within the logic/shift group, we identified eight unused opcode locations, and we allocated the most frequently needed instructions in these unused spaces. The remaining three instructions were moved into the load/store group. The last column of Table 1 identifies the allocation for each new instruction. Since we did not replace any existing SPARC instruction, SKIVA is backward binary-compatible with existing LEON applications. The new instructions add minimal overhead to the design. In terms of 180nm standard cell ASIC technology, we added 1250 gate-equivalent to the design, which amounts to 3% of the area of the integer unit of SKIVA.

**Software integration.** We integrated the new instructions into the software toolchain of SKIVA by extending the assembler. The new mnemonics were then integrated into the application in C through inline assembly coding. Because the custom-instruction format is compatible with that of existing, standard SPARC-V instructions, they benefit from off-the-shelf compiler optimizations.

### 4.2 Hardware support for aggregated bitslice operations

In the following, we describe each group of instructions, with emphasis on their semantics. The formal definition of each instruction is given in Appendix A.
Instructions for bitslicing. We introduce two instructions to transpose data into their bitsliced representation (Figure 5a). The first instruction, TR2 rs1, rs2, rd, performs an interleaving of the bits of two source registers into two output registers. This interleaving can be thought of as a 2-bit transposition, as it places bits within the same column of register rs1 and rs2 in adjacent positions of the output registers rd and y. The second instruction, INVTR2 rs1, rs2, rd, performs the inverse operation. Bitslice transposition for an arbitrary number of bits is achieved through repeated application of TR2. For example, Figure 5b shows an 8-bit transposition, that is, a bitslice conversion of 8-bit subwords within the input registers r1 to r8. Twelve applications of TR2 in a butterfly-like diagram yield the desired result. In general, for a \(2^n\)-bit transition, \(n \cdot 2^{n-1}\) applications of TR2 are needed. The reader may notice that the bitslice ordering of the output exhibits the same shuffling effect as for a Fast-Fourier Transform. This effect is dealt with through proper register ordering before transposition.

To create aggregated bitslices \((R > 1 \text{ or } D > 1)\), we pre-process the source registers (in non-bitsliced form) by duplicating them first and then transposing them to bitsliced form. The side-channel protection and fault-detection of SKIVA is not active during bitslice conversion but we check their consistency after transposition and before encryption.

Instructions for higher-order masking. To combat side-channel leakage, SKIVA supports two-share and four-share implementations of bitsliced algorithms, which provides first-order and third-order masked side-channel resistance. The shares are located in adjacent bits of a processor register. We use Boolean masking, so that the XOR of all shares yields the unmasked value. Linear operations on an ensemble of shares are computed as the linear operation on each individual share. Linear operations are done using bitwise operations on the two-share and four-share representation. As presented in Section 3.2, we implement the secure multiplication (AND) using the design of Barthe et al. [BDF+17] for third-order masking and the design of Trichina [Tri03] for first-order masking. The secure OR operation is implemented as the De Morgan’s equivalent of a secure AND.

Computing a secure multiplication over multiple shares requires the computation of the partial share-products. For example, the secure multiplication of the two-share slices \((a_1, a_0)\) with the two-share slices \((b_1, b_0)\) requires the partial products \(a_1.b_1, a_1.b_0, a_0.b_1, a_0.b_0\).
To align the slices for the cross-products, we implement a slice rotation instruction \texttt{SUBROT rs, imm, rd}. This instruction transforms the two-share slices \((a_1, a_0)\) into \((a_0, a_1)\). The same instruction \texttt{SUBROT} can also handle a four-share design, which transforms \((a_3, a_2, a_1, a_0)\) into \((a_2, a_1, a_0, a_3)\). The details of this instruction are given in listing A.3 in Appendix A.

The programming of side-channel protected bitsliced code using \texttt{SUBROT} assumes the following specific programming rules. Attention has to be paid to side-effects of shared storage elements in the architecture. Balasch \textit{et al.} [BGG+14] have shown that a \(d\)-th order security proof against value-based leakage leads to a \(\lfloor \frac{d}{2} \rfloor\)-th proof against transition-based leakage. Papagiannopoulos \textit{et al.} [PV17] identified three practical cases in micro-controller code, where such transition-based leakage occurs. The most obvious source of transition-based leakage is the overwriting of registers, since the power dissipation of overwriting the register is proportional to the Hamming distance between the former and the new value. They also observe transition-based leakage by overwriting of shared memory locations.

Finally, they observe a “neighbour leaking” effect where operations on one register cause leakage from another.

In a bitslice design, different shares are stored in different bits. Transition-based leakage will occur when one bitslice overwrites another, and this can unmask the shares as follows. Assume a two-share bitslice design \((a_0, a_1) = (r \oplus v, r)\), with \(r\) a random bit and \(v\) a secret bit. Then writing the value \((a_1, a_0)\) into a register holding \((a_0, a_1)\) leads to unmasking. In this case, the Hamming distance is \((a_0 \oplus a_1, a_1 \oplus a_0) = (r \oplus v \oplus r, r \oplus v \oplus r) = (v, v)\). This example directly applies to \texttt{SUBROT}, when this instruction would write its output into its own source register.

To avoid these known sources of transition-based leakage, and to minimize the risk of (undesired) unmasking resulting from this leakage, we applied the following conservative strategy. (1) For \(D = 4\), we refresh the masks at the output of every secure multiplication (AND) using Barthe’s parallel refreshing algorithm [BDF+17]. For \(D = 2\), the refresh is implicit due to the construction of the Trichina gate; (2) We avoid reusing registers within the secure multiplication by constraining the set of registers that the compiler is allowed to use. This ensures that \texttt{SUBROT} will never overwrite its own input. In addition, after the result of a \texttt{SUBROT} instruction is used, we clear that register to prevent later overwriting by another instruction. Figure 6 shows an example of a first-order and a third-order secure multiplication. (3) We maintain strict separation between registers used for the masked algorithm (\textit{i.e.} AES), and registers used for mask generation and mask distribution. This ensures that registers containing masked data cannot be overwritten by registers directly related to random masks. A separation for transition-base leakage between two registers \(\texttt{ra}\) and \(\texttt{rb}\) means that neither register is allowed to overwrite the other one.
Instructions for fault redundancy checking. We present the instructions related to fault redundancy in two groups. The first is related to generation and checking of fault-redundant slices, while the second is related to computations. The redundant bits with respect to fault injection are stored in adjacent bytes of half-word. Figure 7(a) shows the example of a halfword operation to generate redundant data, while Figure 7(b) shows the example of a halfword operation to verify redundant data.

The RED rs1, imm, rd instruction generates redundant data. The redundant copy is stored in the upper halfword ($R_s = 2$) or in the three upper bytes ($R_s = 4$). The redundant portion can be either a direct or else a complement of the original data. There are six variants of RED rs1, imm, rd. Two of them support dual redundancy ($R_s = 2$), they duplicate the lower and upper halfword, in direct or complementary form. Four additional variants support quadruple redundancy ($R_s = 4$), and they quadruple the lower two bytes or the upper two bytes, each in direct or complementary form. Listing A.4 gives a formal
The $\text{FTCHK rs1, imm, rd}$ instruction verifies the consistency of the redundant data. This instruction generates a fault-flag in redundant form (over $R_s$ bits, Appendix A.11), which can be used to drive a fault condition test. Figure 7(b) illustrates the case of a dual-redundancy check on direct data. The fault-check is evaluated in a redundant manner, so that the fault-check itself can detect fault injection on its own check. The expected faultless result of the instruction example in Figure 7(b) is $0xFFFF0000$. There are four variants of this instruction, for either dual ($R_s = 2$) or quadruple redundancy ($R_s = 4$), and direct or complementary redundancy.

**Instructions for fault-redundant computations.** Computations on direct-redundant bit-slices can be done using standard bitwise operations. However, for complementary-redundant bit-slices, the bitwise operations have to be adjusted to complement-operations. The complement-redundant data format can be introduced at the halfword boundary ($R_s = 2$) or at the byte boundary ($R_s = 4$). We opted to provide support for bitwise AND, XOR and XNOR on these complement-redundant data formats. Figure 7(c-d) illustrates the case of $\text{ANDC8}$ and $\text{XORC16}$. Their detailed behavior is given in Appendix A.

## 5 Results

This section evaluates the performance and side-channel security of AES on SKIVA. Next, we analyze the fault coverage of applications on SKIVA under the assumed fault model.

### 5.1 Performance evaluation

Our experimental evaluation has been carried on a prototype of SKIVA deployed on the main FPGA (Cyclone IV EP4CE115) of an Altera DE2-115 board. The processor is clocked at 50 MHz and has access to 128 kB of RAM. Our performance results are obtained by running the desired programs on bare metal. We assume that we have access to a TRNG that frequently fills a register with a fresh 32-bit random string. We use a software pseudo-random number generator (32-bit xorshift) to emulate a TRNG refreshed at a rate of our choosing. We checked that our experiments did not overflow the period of the RNG.

Several implementations of AES are available on our 32-bit, SPARC-derivative processor, with varying degrees of performance. A straightforward byte-oriented implementation (supplementary material) takes 77 C/B whereas an optimized 32-bit T-box implementation (supplementary material) takes 23 C/B. Both implementations are prone to timing attacks. The constant-time, byte-sliced implementation (using only 8 variables to represent 128 bits of data) of BearSSL [Por] performs at 48 C/B. Our bitsliced implementation (using 128 variables to represent 128 bits of data) (supplementary material) performs favorably at 44 C/B while weighing 7772B: despite a significant register pressure (128 live variables for 32 machine registers), the rotations of MixColumn and the ShiftRows operations are compiled away. This bitsliced implementation serves as our baseline in the following.

**Micro-benchmarks.** The instructions TR2, INVTR2, RED and SUBROT were introduced solely for performance reasons. We evaluate their associated performance benefits by micro-benchmarking them against an equivalent, purely software emulation. The instructions TR2/INVTR2 improve performance by $\times 3.64$ whereas SUBROT improves performance by $\times 4.2$. The instruction RED improves performance from $\times 2.7$ for $R_s = 2$ to $\times 4.1$ for $R_s = 4$. These results are consistent with the number of instructions necessary to emulate each instruction (Appendix B). The impact of memory transfers (which takes a significant portion of the computation time, independently of the instruction set) somewhat reduces the absolute benefits of TR2/INVTR2 instructions: a full, bitsliced transposition takes 426
cycles with software emulation while it takes 302 cycles with custom instructions, yielding a speedup of $\times 1.4$ with custom instructions.

**Code size (AES).** We measure the impact of our hardware and software design on code size, using our bitsliced implementation of AES (Section 3) as a baseline. Our hardware design provides us with native support for spatial, complementary redundancy ($\text{ANDC}$, $\text{XORC}$, and $\text{XNORC}$). Performing these operations through software emulation would result in an increase of $\times 1.2$ (for $D = 2$) to $\times 1.3$ (for $D = 4$) in code size. One must nonetheless bear in mind that the security provided by emulation is *not* equivalent to the one provided by native support. The temporal redundancy ($R_t = 2$) mechanism comes at the expense of a small increase (less than $\times 1.06$) in code size, due to the loop hardening protections as well as the checks validating results across successive rounds. The higher-order masking comes at a reasonable expense in code size: going from 1 to 2 shares increases code size by $\times 1.4$ whereas going from 1 to 4 shares corresponds to a $\times 1.8$ increase. A fully protected implementation ($D = 4, R_s = 4, R_t = 2$) thus weighs 14048 bytes.

**Throughput (AES).** We report on the impact of our hardware and software design on the performance of our bitsliced implementation of AES (Section 3). To do so, we evaluate the performance of our 18 variants of AES, for each value of $(D \in \{1, 2, 4\}, R_s \in \{1, 2, 4\}, R_t \in \{1, 2\})$. To remove the influence of the TRNG’s throughput from the performance evaluation, we assume that its refill frequency is strictly higher than the rate at which our implementation consumes random bits. In practice, a refill rate of 10 cycles for 32 bits is enough to meet this requirement.

We report our performance results\(^1\) in Table 2. As expected, for $D$ and $R_t$ fixed, the throughput decreases linearly with $R_s$. Comparing Table 2a with Table 2b at fixed $R_s$, we notice that the throughput decreases by a factor $\times 2.5$ ($D = 4$) to $\times 3$ ($D = 1$): temporal redundancy mechanically divides the throughput by a factor 2, on top of which one must account for the overhead of scheduling and checking the redundant slices. Note that this overhead is less acute as $D$ increases since more time is spent computing each AES round (and, thus, relatively less time is spent in the runtime implementing temporal redundancy). At fixed $D$, we also notice that the variant $(D, R_s = 1, R_t = 2)$ (temporal redundancy by a factor 2) exhibits similar performances as $(D, R_s = 2, R_t = 1)$ (spatial redundancy by a factor 2). However and crucially, both implementation are *not* equivalent from a security standpoint: as discussed in Section 5.3, the former offers weaker security guarantees than the latter. Similarly, at fixed $D$ and $R_s$, we may be tempted to run twice the implementation $(D, R_s, R_t = 1)$ rather than running once the implementation $(D, R_s, R_t = 2)$: once again, the security of the former is reduced compared to the latter since temporal redundancy ($R_t = 2$) couples the computation of 2 rounds within each instruction, whereas pure instruction redundancy ($R_t = 1$) does not. At fixed $R_s$ and $R_t$, going from $D = 1$ to $D = 2$ implies a serious performance toll: first, the throughput is mechanically divided by a factor 2; second, non-linear instructions must be expanded into secure ones; third, there is a significant run-time overhead induced by masking, such as creating shares, fetching random numbers, *etc*. Comparatively, going from 2 to 4 shares, is less expensive since these run-time overheads are identical.

In Figure 2c and Figure 2d, we report the speedup offered by the custom instruction set compared to a software emulation of these instructions. For large values of $R_s$ or $D$, the custom instruction set yields a speedup between $\times 1.4$ to $\times 1.6$, which is a reasonable expectation for a fine-grain custom-instruction based hardware acceleration mechanism [IL07]. On the one hand, custom instructions can be emulated in 2 instructions on

\(^1\)To fully account for the 3 dimensions of our design space ($D$, $R_s$, and $R_t$), we present our results in a tabular form – rather than graphical – to avoid biasing the interpretation toward 2 particular dimensions, at the exclusion of the third one.
Table 2: Exhaustive evaluation of the AES design space

<table>
<thead>
<tr>
<th>$R_t = 1$</th>
<th>$D$</th>
<th>$R_t = 2$</th>
<th>$D$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$R_s$ 1</td>
<td>14 C/B</td>
<td>183 C/B</td>
<td>621 C/B</td>
</tr>
<tr>
<td></td>
<td>44 C/B</td>
<td>127 C/B</td>
<td>470 C/B</td>
</tr>
<tr>
<td>$R_s$ 2</td>
<td>89 C/B</td>
<td>447 C/B</td>
<td>1615 C/B</td>
</tr>
<tr>
<td></td>
<td>262 C/B</td>
<td>1122 C/B</td>
<td>3838 C/B</td>
</tr>
<tr>
<td>$R_s$ 4</td>
<td>169 C/B</td>
<td>847 C/B</td>
<td>3042 C/B</td>
</tr>
<tr>
<td></td>
<td>513 C/B</td>
<td>2148 C/B</td>
<td>7272 C/B</td>
</tr>
</tbody>
</table>

(a) Throughput ($R_t = 1$)  
(b) Throughput ($R_t = 2$)  
(c) Speedup w/ custom instructions ($R_t = 1$)  
(d) Speedup w/ custom instructions ($R_t = 2$)

average: at best, our speedup is at most ×2. On the other hand, relatively few custom instructions are used: they appear mostly in the S-box, whereas the remainder of the cipher consists of linear operations and memory transfers. This is consistent with previous custom cryptographical ISE, such as CRISP [GGP08] where a speedup of ×1.36 was reported. Note, once again, that both implementations are not comparable from a security standpoint: the security argument of the former is simpler than the latter while a successful fault against the former requires a more powerful adversary than the latter.

5.2 Side-channel analysis

To show the security of our masking scheme, we test SKIVA on the main FPGA of SAKURA-G board running at 9.8MHz and powered at 5V by an external power generator. We use a LeCroy WaveRunner 610Zi oscilloscope, sampling 250M samples/sec. To limit the noise level, we use a low-pass filter with a cutoff frequency of 81MHz on the power probe. Furthermore, to have more accurate power traces, we set the scope to average five traces and execute each encryption five times on SKIVA. To trigger the scope, we assign one GPIO pin of LEON to a header pin on SAKURA-G board. We program a C implementation of AES on SKIVA. This C code gets a plaintext from a PC through UART and runs the encryption on the plaintext five times. The AES code sets and resets the trigger before and after the encryption steps described in the following subsections. The ciphertext is sent back to the PC for checking and validation of the power trace.

Correlation power analysis. To evaluate our design, we conduct 1st order correlation power analysis (CPA) [BCO04] on power consumption traces of the SubBytes stage of the first round of AES. We use hamming weight of the SubBytes output as the power model. To speed up our attack, we use a sampling rate of 50M samples/sec. In this testcase, we attack a single bitslice out of 32 parallel bitslices; the unused bitslices perform constant encryption of an all-zero plaintext with an all-zero key. Our CPA attack analyzes 50K traces and confirms that 1st order CPA on the unmasked scheme can reveal half of the key with 12K traces while it reveals all the secret key bytes with 24K traces (see Appendix C.1 for specifics). When masking is enabled, no key byte is revealed under any configuration at the maximum number of traces we considered (50K).

Test vector leakage assessment. To show that our 1st and 3rd order masked schemes are immune to power-based attacks of orders up to their masking orders, we use the TVLA methodology [GJJR11, BCD+13] and conduct the 1st and 2nd order t-tests on our 1st
order masked implementation and the $1^{st}$ to $4^{th}$ order t-tests on our $3^{rd}$ order masked encryption. We set the trigger on one S-box in the fourth round of AES based on the observation that the t-test shows more accurate results for the second third part of AES [BCD+13].

For our experiments, following our attacker model discussed in Section 2.1, we conduct the univariate non-specific fixed-vs.-random t-test in which a set of random inputs and a set of fixed inputs are interspersed in a random order and sent to the device. The fixed plaintext is selected such that the output of the SubBytes stage in the $4^{th}$ round of AES is zero. Furthermore, for higher order t-tests, we post-process the traces [SM15b] to calculate the t-scores of the target order. We adopt the histogram methodology [RGV17] to speed up our t-test calculations. Using a threshold value of 4.5 gives us a confidence of 99.999% to test the null hypothesis that the two sets are from the same population, i.e. the device is not leaking information correlated to the secret data.

Figure 8 and Figure 9 show the results of the t-test on our masked implementations. The right column in Figure 8 (resp. Figure 9) indicates that our first (resp. third) order masked scheme shows no leakage of first (resp. first, second, or third) order on 500K fixed vs. 500K random traces while showing second (resp. fourth) order leakage as expected. The left columns show how turning the PRNG off causes the implementations to have leakage of all orders.

We conclude that, by applying the conservative approach mentioned in Section 4.2, our implementation gives $d^{th}$ order security on a $d^{th}$ order masking scheme.

**Power leakage of direct and complementary redundancy.** To compare the effect of the direct and complementary redundancy schemes on side-channel leakage, we run the following test. We make two different versions of our AES C code: (1) 16 parallel aggregated bitslices of the direct ($D = 2, R_s = 1, R_t = 1$) scheme as the input to the first S-box in
Figure 9: Example power trace and 1\textsuperscript{st} to 4\textsuperscript{th} order t-tests of 3\textsuperscript{rd} order masked implementation. Left column: 35K fixed vs. 35K random traces with PRNG off. Right column: 500K fixed vs. 500K random traces with PRNG on.
the fourth round of AES; and (2) 8 parallel aggregated bitslices of the complementary 
\((D = 2, R_s = 2, R_t = 1)\) scheme as the input to the first S-box in the fourth round of AES.

We then measure 5K traces for fixed input and 5K traces for random input and apply 
a second order t-test on the measured traces. To speed up our measurements, the traces 
were collected at 50MS/s. As expected, Figures 10a and 10b show second order leakage for 
both schemes. However, the direct redundancy results in much higher t-values indicating 
a higher probability of leakage than complementary redundancy. To make this point more 
clear, Figure 10c shows the evolution of t-values for the second order t-test with respect to the 
number of traces for both redundancy schemes. We observe that the direct redundancy 
shows leakage with as few as about 200 traces while the complementary redundancy shows 
leakage only after around 2500 traces. From this experiment, we conclude that having 
the complementary redundancy is better than its direct counterpart in hiding secret data 
from the power leakage. Note that despite exhibiting different power leakage profiles, we 
have confirmed that a first order t-test on both implementations shows no leakage for a 
non-specific test of 25K fixed vs. 25K random traces (Figure 11 in Appendix C.2).
5.3 Security analysis of data faults

In the following, we analyze the fault sensitivity of our protected implementations according to the attacker models defined in Section 2.1. Our data protection scheme relies on spatial redundancy ($R_s \in \{2, 4\}$). Faults that cannot be detected are those that affect redundant copies within a single register in a consistent manner, which implies either identical values in case of direct redundancy or negated values in case of complemented redundancy.

Note that this analysis is independent of whether sharing ($D$) is used or not. From the standpoint of redundancy, each share is independently protected: for example, if two shares of the same data are subjected to a bit flip, our redundancy mechanism will report an error, even though the underlying data remains unchanged ($x_1 \oplus x_2 = \overline{x_1} \oplus \overline{x_2}$).

There are different ways to achieve undetected faults, i.e. generate a consistent value: one may skip an instruction whose destination register already holds a consistent value; one may replace an instruction with another (e.g., substitute an ANDC by an XORC); or directly perform a data fault.

If $P$ is the probability for a data fault to result in a consistent value, then the detection rate is $1 - P$. Such a probability depends on the injection technique, its parameters, the target architecture as well as physical properties of the device. In the following, we develop a theoretical analysis based on the assumption that data faults follow a stuck-at 0 or stuck-at 1 model, or uniformly distributed random byte, half-word, and word model. We then complement this analysis by an empirical evaluation of the impact of instruction skip.

Theoretical analysis of spatial redundancy In this analysis, we use the fault coverage ($FC$) metric [GMK12] $FC = 1 - F_{\text{undetected}}/F_{\text{total}}$ where $F_{\text{total}}$ is the total number of faults covered by the fault model and $F_{\text{undetected}}$ is the number of faults that affect the execution while escaping detection by the countermeasure.

By construction, data fault effects such as single bit set, single reset, single bit flip, byte or half-word zeroing, faulty random byte or faulty random half-word are all detected ($FC = 100\%$). Word zeroing or stuck-at 1 on complementary redundant data are also all detected ($FC = 100\%$) but direct redundancy will never detect it ($FC = 0\%$).

If the attacker injects random data faults following a uniform distribution, it means that there are $F_{\text{total}} = 2^{32}$ fault injection possibilities. For $R_s = 2$ and independently of the redundancy (direct or complementary), $2^{16}$ of those values are consistent, including the expected output. Hence $F_{\text{undetected}} = 2^{16} - 1$ and $FC = 99.99\%$. For $R_s = 4$, there are $F_{\text{undetected}} = 2^8 - 1$ faults that are left undetected, thus $FC = 99.99\%$.

For illustrative purposes, we now consider a slightly stronger attacker who may flip $p$ randomly chosen data bits. In practice, such an analysis ought to be tailored to account for the specific distribution of faults of a given injection technique on a given platform. Under this attacker model, there are $F_{\text{total}} = \binom{32}{p}$ fault injection possibilities leading to a $p$-bit flip (with $p$ an even number). For $R_s = 2$, there are $F_{\text{undetected}} = \binom{16}{2}$ faults corresponding to a $p$-bit flip that are left undetected. The lower-bound for $FC$ is reached for $p = 2$ and $p = 30$, where $FC = 96.77\%$. For $R_s = 4$, there are $F_{\text{undetected}} = \binom{8}{2}$ faults corresponding to a $p$-bit flip that are left undetected. The lower-bound for $FC$ is reached for $p = 4$ and $p = 28$, where $FC = 99.97\%$. A $p$-bit set or reset fault model leads to a 100% detection rate if complementary redundancy is used. If direct redundancy is used, then this amounts to the $p$-bit flip model. Either way the detection rate is very high.

Experimental evaluation of temporal redundancy. We have simulated the impact of faults on our implementation of AES. We focus our attention exclusively on control faults (instruction skips) since our above analytical model already predicts the outcome of data faults. To this end, we use a fault injection simulator based on gdb running through the JTAG interface of the FPGA board. We execute our implementation up to a chosen
Table 3: Experimental results of simulated instruction skips

<table>
<thead>
<tr>
<th></th>
<th>With impact</th>
<th></th>
<th>Without impact</th>
<th></th>
<th>Crash</th>
<th># of faults</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Detected</td>
<td>Not detected</td>
<td>Detected</td>
<td>Not detected</td>
<td></td>
<td></td>
</tr>
<tr>
<td>$R_t=1$</td>
<td>0.19%</td>
<td>94.40%</td>
<td>0%</td>
<td>2.56%</td>
<td>2.84%</td>
<td>8507</td>
</tr>
<tr>
<td>$R_t=2$</td>
<td>80.75%</td>
<td>0%</td>
<td>7.74%</td>
<td>7.96%</td>
<td>3.55%</td>
<td>19552</td>
</tr>
</tbody>
</table>

breakpoint, after which we instruct the processor to jump to a given address, hence simulating the effect of an instruction skip. In particular, we have exhaustively targeted every instruction of the first and last round as well as the AES_secure routine (for $R_t = 2$) and its counterpart for $R_t = 1$. Since rounds 2 to 9 use the same code as the first round, the absence of vulnerabilities against instruction skips within the latter means that the former are secure against instruction skip as well. This exposes a total of 1222 injection points for $R_t = 2$ and 1097 injection points for $R_t = 1$. For each such injection point, we perform an instruction skip from 512 random combinations of key and plaintext for $R_t = 2$ and 256 random combinations for $R_t = 1$.

The results are summarized in Table 3. Injecting a fault had one of five effect. A fault may yield an incorrect ciphertext with (1) or without (2) being detected. A fault may yield a correct ciphertext, with (3) or without (4) being detected. Finally, a fault may cause the program or the board to crash (5). According to our attacker model, only outcome (2) witnesses a vulnerability. In every other outcome, the fault either does not produce a faulty ciphertext, or is detected within 2 rounds. For $R_t = 2$, we verify that every instruction skip was either detected (outcome 1 or 3) or had no effect on the output of the corresponding round (outcome 4) or lead to a crash (outcome 5). Comparatively, with $R_t = 1$, nearly 95% of the instruction skips lead to an undetected fault impacting the ciphertext. In 0.19% of the cases, the fault actually impacts the fault-detection mechanism itself, thus triggering a false positive.

5.4 Discussion

SKIVA sets out to provide a platform for implementing cryptographic primitives resilient to combined attacks. In this section, we have evaluated a set of candidate designs for AES in terms of performance (Section 5.1) as well as security. We have carried a theoretical and empirical evaluation of the impact of faults (Section 5.3) on our designs, hence quantifying their adequacy with respect to our “fault attacker model” (Section 2.1). We have also carried out an empirical evaluation of the security of our masking scheme through CPA and the TVLA methodology, hence quantifying its adequacy with respect to our “side-channel attacker model” (Section 2.1). Besides, we have quantified the amplification of side-channel leakage induced by the fault protection mechanism, hence validating our “combined attacker model” (Section 2.1). Admittedly, our combined attacker model exposes a narrow attack surface, excluding an attacker actively mitigating the SCA countermeasures or drawing conclusions from the distribution of masked values (SIFA). As the design and implementation of protections against such attacks mature, we will be able to integrate them in a (software) implementation of AES, leaving SKIVA, the underlying (hardware) platform, untouched. This example thus illustrates the strengths of our approach: thanks to SKIVA’s support for aggregated bitslice operations, we benefit from techniques and advances in the field of hardware (e.g., boolean masking) as well as software (e.g., temporal redundancy) protection mechanisms, while taking full advantage of the flexibility of software.
Conclusion

We have presented SKIVA, a general-purpose 32-bit processor supporting high-throughput, secure block ciphers on embedded devices. Our objective in extending the SPARC instruction set was to provide cryptographers with a manageable programming model for implementing secure ciphers on a general-purpose CPU. On the software side, we advocate an approach centered around bitslicing, where cryptographic primitives are treated as combinational circuits. By design, bitslicing protects an implementation against timing-based side-channel attacks. But it also provides a sound basis for modular protections against fault and/or power-based side-channel attacks, thus paving the way for a pay-as-you-go security approach. In essence, SKIVA can be understood as a Turing machine for efficiently and securely executing combinational circuits in software.

These design choices translate into protection mechanisms that can naturally and systematically be integrated together. To protect against faults, we have shown that intra-instruction redundancy enables a purely analytic security analysis, guaranteeing significant coverage, while we experimental showed that temporal redundancy protects against instruction skips. To protect against side-channel, we crucially rely on the physical isolation of slices thus significantly reducing the risk of involuntary interference due to architectural details invisible to the programmer.

We have demonstrated the benefits of our approach with a bitsliced implementation of AES with 1, 2 and 4 shares, a temporal redundancy of 1 and 2 as well as a spatial redundancy of 1, 2, and 4. In terms of code size, we have shown that all security levels can be implemented in less than 14048B. In terms of performance, we have seen that it scales well with protection levels, dividing the throughput by 163 with all protections enabled at their maximum \(D = 4, R_s = 4, R_t = 2\).

Future work. In this paper, we have studied AES running on the SKIVA platform. To demonstrate the versatility of SKIVA, we intend to evaluate more ciphers at various security levels on this platform, including physical fault injection. Besides, we would like to compare it with alternative platforms, including general-purpose processors – ARM Cortex, AVR, or RISC-V – as well as cryptographic extensions – namely XCrypto [MPP19]. To cover this design space, we plan to invest in automation, integrating our countermeasures into a bitslicing compiler [Por01, MDLM18].

Acknowledgements

The SKIVA design was supported in part by the National Science Foundation Award 1617203, and the National Institute for Standards and Technology Award 70NANDBH17280.
References


A Custom instructions details

A.1 TR2 instruction

\[
\begin{align*}
\text{TR2} & \quad rs1, \quad rs2, \quad rd \\
\text{reg}_{rs1}[31:0] & := \text{CONCAT}(\text{reg}_{rs1}[15], \text{reg}_{rs2}[15], \text{reg}_{rs1}[14], \text{reg}_{rs2}[14], \ldots) \\
& \quad \text{reg}_{rs1}[13], \text{reg}_{rs2}[13], \text{reg}_{rs1}[12], \text{reg}_{rs2}[12], \ldots \\
& \quad \text{reg}_{rs1}[11], \text{reg}_{rs2}[11], \text{reg}_{rs1}[10], \text{reg}_{rs2}[10], \ldots \\
& \quad \text{reg}_{rs1}[9], \text{reg}_{rs2}[9], \text{reg}_{rs1}[8], \text{reg}_{rs2}[8], \ldots \\
& \quad \text{reg}_{rs1}[7], \text{reg}_{rs2}[7], \text{reg}_{rs1}[6], \text{reg}_{rs2}[6], \ldots \\
& \quad \text{reg}_{rs1}[5], \text{reg}_{rs2}[5], \text{reg}_{rs1}[4], \text{reg}_{rs2}[4], \ldots \\
& \quad \text{reg}_{rs1}[3], \text{reg}_{rs2}[3], \text{reg}_{rs1}[2], \text{reg}_{rs2}[2], \ldots \\
& \quad \text{reg}_{rs1}[1], \text{reg}_{rs2}[1], \text{reg}_{rs1}[0], \text{reg}_{rs2}[0]) \\
\text{y}[31:0] & := \text{CONCAT}(\text{reg}_{rs1}[31], \text{reg}_{rs2}[31], \text{reg}_{rs1}[30], \text{reg}_{rs2}[30], \ldots) \\
& \quad \text{reg}_{rs1}[29], \text{reg}_{rs2}[29], \text{reg}_{rs1}[28], \text{reg}_{rs2}[28], \ldots \\
& \quad \text{reg}_{rs1}[27], \text{reg}_{rs2}[27], \text{reg}_{rs1}[26], \text{reg}_{rs2}[26], \ldots \\
& \quad \text{reg}_{rs1}[25], \text{reg}_{rs2}[25], \text{reg}_{rs1}[24], \text{reg}_{rs2}[24], \ldots
\end{align*}
\]
A.2 INVTR2 instruction

INVTR2 rs1, rs2, rd

\[
\begin{align*}
\text{reg}_{rs1}[31:0] & := \text{CONCAT}(\text{reg}_{rs1}[30],\text{reg}_{rs1}[28],\text{reg}_{rs1}[26],\text{reg}_{rs1}[24], \ldots, \\
& \quad \text{reg}_{rs1}[22],\text{reg}_{rs1}[20],\text{reg}_{rs1}[18],\text{reg}_{rs1}[16], \ldots, \\
& \quad \text{reg}_{rs1}[14],\text{reg}_{rs1}[12],\text{reg}_{rs1}[10],\text{reg}_{rs1}[8], \ldots, \\
& \quad \text{reg}_{rs1}[6],\text{reg}_{rs1}[4],\text{reg}_{rs1}[2],\text{reg}_{rs1}[0], \ldots, \\
& \quad \text{reg}_{rs2}[30],\text{reg}_{rs2}[28],\text{reg}_{rs2}[26],\text{reg}_{rs2}[24], \ldots, \\
& \quad \text{reg}_{rs2}[22],\text{reg}_{rs2}[20],\text{reg}_{rs2}[18],\text{reg}_{rs2}[16], \ldots, \\
& \quad \text{reg}_{rs2}[14],\text{reg}_{rs2}[12],\text{reg}_{rs2}[10],\text{reg}_{rs2}[8], \ldots, \\
& \quad \text{reg}_{rs2}[6],\text{reg}_{rs2}[4],\text{reg}_{rs2}[2],\text{reg}_{rs2}[0]) \\
\text{y}[31:0] & := \text{CONCAT}(\text{reg}_{rs1}[31],\text{reg}_{rs1}[29],\text{reg}_{rs1}[27],\text{reg}_{rs1}[25], \ldots, \\
& \quad \text{reg}_{rs1}[23],\text{reg}_{rs1}[21],\text{reg}_{rs1}[19],\text{reg}_{rs1}[17], \ldots, \\
& \quad \text{reg}_{rs1}[15],\text{reg}_{rs1}[13],\text{reg}_{rs1}[11],\text{reg}_{rs1}[9], \ldots, \\
& \quad \text{reg}_{rs1}[7],\text{reg}_{rs1}[5],\text{reg}_{rs1}[3],\text{reg}_{rs1}[1], \ldots, \\
& \quad \text{reg}_{rs2}[31],\text{reg}_{rs2}[29],\text{reg}_{rs2}[27],\text{reg}_{rs2}[25], \ldots, \\
& \quad \text{reg}_{rs2}[23],\text{reg}_{rs2}[21],\text{reg}_{rs2}[19],\text{reg}_{rs2}[17], \ldots, \\
& \quad \text{reg}_{rs2}[15],\text{reg}_{rs2}[13],\text{reg}_{rs2}[11],\text{reg}_{rs2}[9], \ldots, \\
& \quad \text{reg}_{rs2}[7],\text{reg}_{rs2}[5],\text{reg}_{rs2}[3],\text{reg}_{rs2}[1])
\end{align*}
\]

A.3 SUBROT instruction

SUBROT rs, imm, rd

\[
\begin{align*}
\text{IF } \text{imm}[2:0] & = 010 \\
\text{FOR } i := 0:15 \\
\text{\quad j} & := 2\times i \\
\text{\quad reg}_{rd}[j+1:j] := \text{reg}_{rs}[j+1] \\
\text{\quad ENDFOR} \\
\text{ELIF } \text{imm}[2:0] & = 100 \\
\text{FOR } i := 0:7 \\
\text{\quad j} & := 4\times i \\
\text{\quad reg}_{rd}[j+3:j] := \text{CONCAT}(\text{reg}_{rs}[j+2:j],\text{reg}_{rs}[j+3]) \\
\text{\quad ENDFOR} \\
\text{FI}
\end{align*}
\]

A.4 RED instruction

RED rs, imm, rd

\[
\begin{align*}
\text{IF } \text{imm}[2:0] & = 010 \\
\text{\quad reg}_{rd}[15:0] := \text{reg}_{rs}[15:0] \\
\text{\quad reg}_{rd}[31:16] := \text{reg}_{rs}[15:0] \\
& \quad y[15:0] := \text{reg}_{rs}[31:16] \\
& \quad y[31:16] := \text{reg}_{rs}[31:16]
\end{align*}
\]
ELIF imm[2:0] = 011
  regₐ[15:0] := regₐ[15:0]
  regₐ[31:16] := \( \text{NOT} \ regₐ[15:0] \)
  y[15:0] := \( \text{NOT} \ regₐ[31:16] \)
ELIF imm[2:0] = 100
  regₐ[7:0] := regₐ[7:0]
  regₐ[15:8] := regₐ[7:0]
  regₐ[23:16] := regₐ[7:0]
  regₐ[31:24] := regₐ[7:0]
  y[7:0] := regₐ[15:8]
  y[15:8] := \( \text{NOT} \ regₐ[15:8] \)
  y[23:16] := \( \text{NOT} \ regₐ[15:8] \)
  y[31:24] := \( \text{NOT} \ regₐ[15:8] \)
ELIF imm[2:0] = 101
  regₐ[7:0] := regₐ[23:16]
  regₐ[15:8] := \( \text{NOT} \ regₐ[7:0] \)
  regₐ[23:16] := \( \text{NOT} \ regₐ[7:0] \)
  regₐ[31:24] := \( \text{NOT} \ regₐ[7:0] \)
  y[7:0] := rs[15:8]
  y[15:8] := \( \text{NOT} \ regₐ[15:8] \)
  y[23:16] := \( \text{NOT} \ regₐ[15:8] \)
  y[31:24] := \( \text{NOT} \ regₐ[15:8] \)
ELIF imm[2:0] = 110
  regₐ[7:0] := regₐ[23:16]
  y[7:0] := regₐ[31:24]
ELIF imm[2:0] = 111
  regₐ[7:0] := regₐ[23:16]
  regₐ[15:8] := \( \text{NOT} \ regₐ[23:16] \)
  regₐ[23:16] := \( \text{NOT} \ regₐ[23:16] \)
  regₐ[31:24] := \( \text{NOT} \ regₐ[23:16] \)
  y[7:0] := regₐ[31:24]
  y[15:8] := \( \text{NOT} \ regₐ[31:24] \)
  y[23:16] := \( \text{NOT} \ regₐ[31:24] \)
  y[31:24] := \( \text{NOT} \ regₐ[31:24] \)
FI

A.5  ANDC16 instruction

ANDC16 rs1, rs2, rd

  regₐ[15:0] := (regₐ[15:0] AND regₐ[15:0])
  regₐ[31:16] := (regₐ[31:16] OR regₐ[31:16])
A.6 XORC16 instruction

\[
\text{XORC16 } rs1, rs2, rd
\]
\[
\text{reg}_{rd}[15:0] := (\text{reg}_{rs1}[15:0] \text{ XOR } \text{reg}_{rs2}[15:0])
\]
\[
\text{reg}_{rd}[31:16] := (\text{reg}_{rs1}[31:16] \text{ XOR } \text{reg}_{rs2}[31:16])
\]

A.7 XNORC16 instruction

\[
\text{XNORC16 } rs1, rs2, rd
\]
\[
\text{reg}_{rd}[15:0] := (\text{reg}_{rs1}[15:0] \text{ XNOR } \text{reg}_{rs2}[15:0])
\]
\[
\text{reg}_{rd}[31:16] := (\text{reg}_{rs1}[31:16] \text{ XOR } \text{reg}_{rs2}[31:16])
\]

A.8 ANDC8 instruction

\[
\text{ANDC8 } rs1, rs2, rd
\]
\[
\text{reg}_{rd}[7:0] := (\text{reg}_{rs1}[7:0] \text{ AND } \text{reg}_{rs2}[7:0])
\]
\[
\text{reg}_{rd}[15:8] := (\text{reg}_{rs1}[15:8] \text{ OR } \text{reg}_{rs2}[15:8])
\]
\[
\text{reg}_{rd}[23:16] := (\text{reg}_{rs1}[23:16] \text{ AND } \text{reg}_{rs2}[23:16])
\]
\[
\text{reg}_{rd}[31:24] := (\text{reg}_{rs1}[31:24] \text{ OR } \text{reg}_{rs2}[31:24])
\]

A.9 XORC8 instruction

\[
\text{XORC8 } rs1, rs2, rd
\]
\[
\text{reg}_{rd}[7:0] := (\text{reg}_{rs1}[7:0] \text{ XOR } \text{reg}_{rs2}[7:0])
\]
\[
\text{reg}_{rd}[15:8] := (\text{reg}_{rs1}[15:8] \text{ XNOR } \text{reg}_{rs2}[15:8])
\]
\[
\text{reg}_{rd}[23:16] := (\text{reg}_{rs1}[23:16] \text{ XOR } \text{reg}_{rs2}[23:16])
\]
\[
\text{reg}_{rd}[31:24] := (\text{reg}_{rs1}[31:24] \text{ XNOR } \text{reg}_{rs2}[31:24])
\]

A.10 XNORC8 instruction

\[
\text{XNORC8 } rs1, rs2, rd
\]
\[
\text{reg}_{rd}[7:0] := (\text{reg}_{rs1}[7:0] \text{ XNOR } \text{reg}_{rs2}[7:0])
\]
\[
\text{reg}_{rd}[15:8] := (\text{reg}_{rs1}[15:8] \text{ XOR } \text{reg}_{rs2}[15:8])
\]
\[
\text{reg}_{rd}[23:16] := (\text{reg}_{rs1}[23:16] \text{ XNOR } \text{reg}_{rs2}[23:16])
\]
\[
\text{reg}_{rd}[31:24] := (\text{reg}_{rs1}[31:24] \text{ XOR } \text{reg}_{rs2}[31:24])
\]

A.11 FTXCHK instruction

\[
\text{FTCHK } rs, \text{ imm, rd}
\]
\[
\text{IF } \text{imm}[3:0] = 1010
\]
\[
\text{FOR } i := 0:15
\]
\[
\text{reg}_{rd}[i] := (\text{reg}_{rs}[i+16] \text{ XOR } \text{reg}_{rs}[i])
\]
\[
\text{reg}_{rd}[i+16] := (\text{reg}_{rs}[i+16] \text{ XNOR } \text{reg}_{rs}[i])
\]
\[
\text{ENDFOR}
\]
\[
\text{ELIF } \text{imm}[3:0] = 1011
\]
FOR i:=0:15
    regrd[i] := (regrs[i+16] XNOR regrs[i])
    regrd[i+16] := (regrs[i+16] XOR regrs[i])
ENDFOR
ELIF imm[3:0] = 1100
    FOR i:=0:7
        regrd[i] := ((regrs[i+8] XOR regrs[i]) OR ...
                      (regrs[i+16] XOR regrs[i]) OR ...
                      (regrs[i+24] XOR regrs[i]))
        regrd[i+8] := ((regrs[i+8] XNOR regrs[i]) AND ...
                       (regrs[i+16] XNOR regrs[i]) AND ...
                       (regrs[i+24] XNOR regrs[i]))
        regrd[i+16] := ((regrs[i+8] XOR regrs[i]) OR ...
                        (regrs[i+16] XOR regrs[i]) OR ...
                        (regrs[i+24] XOR regrs[i]))
        regrd[i+24] := ((regrs[i+8] XNOR regrs[i]) AND ...
                        (regrs[i+16] XNOR regrs[i]) AND ...
                        (regrs[i+24] XNOR regrs[i]))
    ENDFOR
ELIF imm[3:0] = 1101
    FOR i:=0:7
        regrd[i] := ((regrs[i+8] XNOR regrs[i]) OR ...
                      (regrs[i+16] XOR regrs[i]) OR ...
                      (regrs[i+24] XNOR regrs[i]))
        regrd[i+8] := ((regrs[i+8] XOR regrs[i]) AND ...
                       (regrs[i+16] XNOR regrs[i]) AND ...
                       (regrs[i+24] XNOR regrs[i]))
        regrd[i+16] := ((regrs[i+8] XNOR regrs[i]) OR ...
                        (regrs[i+16] XOR regrs[i]) OR ...
                        (regrs[i+24] XNOR regrs[i]))
        regrd[i+24] := ((regrs[i+8] XOR regrs[i]) AND ...
                        (regrs[i+16] XNOR regrs[i]) AND ...
                        (regrs[i+24] XOR regrs[i]))
    ENDFOR
FI

B Efficient C emulation of the custom instructions

We provide here the C codes for emulating some of the custom instructions. We omitted ftchk, red, tr2 and invtr2, for which the emulation code is the straightforward implementation of the specification.

#define ANDC8(r,a,b) r = (((a) | (b)) & 0xFF00FF00) | (((a) & (b)) & 0x00FF00FF)
#define XORC8(r,a,b) r = (a) ^ (b) ^ 0xFF00FF00
#define XNORC8(r,a,b) r = (a) ^ (b) ^ 0x00FF00FF
#define ANDC16(r,a,b) r = (((a) | (b)) & 0xFFFF0000) | (((a) & (b)) & 0x000FFFF)
#define XORC16(r,a,b) r = (a) ^ (b) ^ 0xFFFF0000
#define XNORC16(r,a,b) r = (a) ^ (b) ^ 0x0000FFFF
C Side-channel analysis results

C.1 CPA results

Table 4: Detailed report of 1st order CPA results on unmasked SubBytes of 1st round AES

<table>
<thead>
<tr>
<th># of traces</th>
<th># of key bytes revealed</th>
</tr>
</thead>
<tbody>
<tr>
<td>3K</td>
<td>1</td>
</tr>
<tr>
<td>4K</td>
<td>3</td>
</tr>
<tr>
<td>9K</td>
<td>5</td>
</tr>
<tr>
<td>10K</td>
<td>6</td>
</tr>
<tr>
<td>11K</td>
<td>7</td>
</tr>
<tr>
<td>12K</td>
<td>8 (half key)</td>
</tr>
<tr>
<td>14K</td>
<td>10</td>
</tr>
<tr>
<td>18K</td>
<td>11</td>
</tr>
<tr>
<td>19K</td>
<td>12</td>
</tr>
<tr>
<td>21K</td>
<td>13</td>
</tr>
<tr>
<td>22K</td>
<td>14</td>
</tr>
<tr>
<td>23K</td>
<td>15</td>
</tr>
<tr>
<td>24K</td>
<td>16 (full key)</td>
</tr>
</tbody>
</table>

C.2 TVLA results

Figure 11: 1st order t-test on 1st order masked AES S-box in complementary and direct redundancy (25K fixed vs. 25K random)