# Attacking the polynomials in the E0 cipher

### Introduction

The short-range wireless technology Bluetooth uses the encryption standard E0. The keystream generator E0 is built using a combination generator with four LFSRs of lengths $L_1 = 25, L_2 = 31, L_3 = 33, L_4 = 39$ and primitive feedback polynomials

$\left\{ \begin{array}{rcl} p_1(x) &=& x^{25} +x^{20} +x^{12} + x^8 +1,\\\\ p_2(x) &=& x^{31} +x^{24} +x^{16} +x^{12} +1,\\\\ p_3(x) &=& x^{33} +x^{28} +x^{24} + x^4 +1,\\\\ p_4(x) &=& x^{39} +x^{36} +x^{28} + x^4 +1, \end{array} \right.$

respectively.

This is a short result on the E0 Core (sometimes called one-level E0), explicitly stating the low-weight polynomials needed in the key-recovery attack proposed in [LV04].

To mount a more efficient attack, the authors of [LV04] developed a maximum-likelihood decoding algorithm based on the fast Walsh transform. The algorithm recovers, given an element in $\mathbb{F}_q^n$ and a linear code over $\mathbb{F}_q$ of dimension $L$ and length $n$, the closest codeword in time $\mathcal{O}(n+L \cdot 2^L)$ and memory $\min(n,2^L)$.

In order to succeed with the attack, they need to find weight-4 polynomial multiples of $p_1(x)p_3(x)p_4(x)$ which has degree 97.

### Parallell implementation

We implement an algorithm based on [LJ14], with multiple threads in mind.

Algorithm $1$

Input: Polynomial $P(x)$ of degree $d_P$, small integer $\alpha$.
Output: Polynomial multiple $K(x)$ of weight 4.

1. Select a random subset $\mathcal{Q} \subset \{1,2\ldots,d_P\}$.
2. From $P(x)$, create all residues $x^i \bmod P(x)$ for $0 \leq i \leq 2^{d_P/3 + \alpha}$ and put $(x^i \bmod P(x), i)$ in list $\mathcal{L}_1$.
3. Create all $\phi_\mathcal{Q}(x^i + x^j) = 0$ for $0 \leq i < j < 2^{d_P/3 + \alpha}$. This is done by merging the list $\mathcal{L}_1$ with it self, keeping only elements $\phi_\mathcal{Q}(x^i + x^j) = 0$.
4. Sort the resulting list $\mathcal{L}_2$ by residue value.
5. Merge the list $\mathcal{L}_2$ by itself, keeping only multiples $x^i + x^j + x^k + x^l \bmod P(x) = 0$. Output a list of all, or the one with smallest degree.

Let the number of worker threads be $m$, where $m$ is not a multiple of 2. We choose $m$ to be a prime number close to the number of max threads. We then create $m^2$ buckets, each containing a hashmap or other $\mathcal{O}(1)$ data structure. We order the buckets as a two-dimensional structure $T_1[i,j]$ with $0 < i,j < m$.

In implementation, every thread generates the full set of monomials $x^0, x^1, \ldots, x^{d_P/3 + \alpha}$, each taken modulo $P(x)$. This computation is very light in terms of performance compared to memory reads and writes.

Define the function

$f : \mathbb{F}_2[x]/P(x) \longrightarrow \mathbb{Z}_m$

as $f(x^i) = \left(2^i \bmod P(2)\right) \bmod m$. Clearly if $\phi(x^i) = \phi(x^j)$, then necessarily $f(\phi(x^i)) = f(\phi(x^j))$ when taken modulo $P(x)$.

The functional value of $f$ determines if the thread should consider it or not, i.e., if the value matches the thread number. This way, we achieve thread safety without using mutex.

A thread $j$ will put element $x^i$ in $T_1[ f(\phi(x^i)), j]$, unless it is already occupied. If so, say with $x^k$ was already occupying that space in $T_1$, then it proceeds to $T_2[f(\phi(x^i + x^k)), j]$ and puts $x^i + x^k$ there.

In the final step, we swap the accessing order. So each thread will only handle elements $x^i + x^j$ which had equal $f$-function values. As equal $f$ values are necessary (but not sufficient) condition for collision to occur, it is impossible to filter out correct candidates by mistake.

Each thread will merge the elements in the corresponding row. Once collision is found, program outputs $i,j,k,l$ such that $x^i + x^j + x^k + x^l = 0 \bmod P(x)$ and terminates. If no collision is found, we re-iterate with a new mask selection $\mathcal{Q}$ for $\phi$.

### Results

The polynomial of degree 72 can be found by setting $\alpha = 1$ and runs in a few seconds using $m = 23$ threads. For the 97-degree polynomial, we need about 560 GB of RAM. While this may sound immensly large, it can actually be done by running it on an AWS instance at relatively low cost.

Another solution is to pick $|\mathcal{Q}| = d_P/3 - u$ for some small integer $u$. This reduces the size of the first collision layer $T_1$. Then, for each thread, we require that $f(x^i + x^j) = 0$ for the element to be added into $T_2$. This reduces the size of the stored data to $\mathcal{O}(2^{d_P/3 - u} + 2^{d_P/3 + u}/m)$, increases the running time by a factor $m$.

$\begin{array}{l|ll} \text{Polynomial} \qquad & \qquad \text{Degree} & \qquad\text{Weight-4 multiple} \\ \\ \hline\hline \\ p_3(x)p_4(x) \qquad & \qquad 72 & \qquad x^{24352899}+x^{10387660}+x^{7833802}+1\\ \\ & & \qquad x^{31807163}+x^{24477832}+x^{4308951}+1\\ \\ p_1(x)p_3(x)p_4(x)\qquad & \qquad 97 & \qquad x^{12647431389}+x^{8008354628}+x^{1277498415}+1\\ \\ & & \qquad x^{12671067191}+x^{6590666154}+x^{1306461382}+1 \end{array}$

This actualizes the attack on the E0 cipher by Lu and Vaudenay.

Verification

sage: Q0 = R(x^12647431389+x^8008354628+x^1277498415+1)
sage: Q1 = R(x^12671067191+x^6590666154+x^1306461382+1)
sage: P = R(x^97+x^92+x^89+x^80+x^76+x^73+x^69+x^68+x^64+x^61+x^59+x^58+x^57+x^56+x^55+x^54+x^53+x^51+x^49+x^44+
....: x^43+x^41+x^39+x^36+x^35+x^34+x^32+x^31+x^30+x^21+x^19+x^15+x^13+x^9+x^5+x^3+1)
sage: Q0 % P
0
sage: Q1 % P
0

### Acknowledgements

Many thanks to klondike (KISec AB) for providing help with computational resources and improving the code.