Attacking the polynomials in the E0 cipher

Introduction

The short-range wireless technology Bluetooth uses the encryption standard E0. The keystream generator E0 is built using a combination generator with four LFSRs of lengths L_1 = 25, L_2 = 31, L_3 = 33, L_4 = 39 and primitive feedback polynomials

\left\{ \begin{array}{rcl} p_1(x) &=& x^{25} +x^{20} +x^{12} + x^8 +1,\\\\ p_2(x) &=& x^{31} +x^{24} +x^{16} +x^{12} +1,\\\\ p_3(x) &=& x^{33} +x^{28} +x^{24} + x^4 +1,\\\\ p_4(x) &=& x^{39} +x^{36} +x^{28} + x^4 +1, \end{array} \right.

respectively.

This is a short result on the E0 Core (sometimes called one-level E0), explicitly stating the low-weight polynomials needed in the key-recovery attack proposed in [LV04].

To mount a more efficient attack, the authors of [LV04] developed a maximum-likelihood decoding algorithm based on the fast Walsh transform. The algorithm recovers, given an element in \mathbb{F}_q^n and a linear code over \mathbb{F}_q of dimension L and length n, the closest codeword in time \mathcal{O}(n+L \cdot 2^L) and memory \min(n,2^L).

In order to succeed with the attack, they need to find weight-4 polynomial multiples of p_1(x)p_3(x)p_4(x) which has degree 97.

Parallell implementation

We implement an algorithm based on [LJ14], with multiple threads in mind.


Algorithm 1

Input: Polynomial P(x) of degree d_P, small integer \alpha.
Output: Polynomial multiple K(x) of weight 4.

1. Select a random subset \mathcal{Q} \subset \{1,2\ldots,d_P\}.
2. From P(x), create all residues x^i \bmod P(x) for 0 \leq i \leq 2^{d_P/3 + \alpha} and put (x^i \bmod P(x), i) in list \mathcal{L}_1.
3. Create all \phi_\mathcal{Q}(x^i + x^j) = 0 for 0 \leq i < j < 2^{d_P/3 + \alpha}. This is done by merging the list \mathcal{L}_1 with it self, keeping only elements \phi_\mathcal{Q}(x^i + x^j) = 0.
4. Sort the resulting list \mathcal{L}_2 by residue value.
5. Merge the list \mathcal{L}_2 by itself, keeping only multiples x^i + x^j + x^k + x^l \bmod P(x) = 0. Output a list of all, or the one with smallest degree.


Let the number of worker threads be m, where m is not a multiple of 2. We choose m to be a prime number close to the number of max threads. We then create m^2 buckets, each containing a hashmap or other \mathcal{O}(1) data structure. We order the buckets as a two-dimensional structure T_1[i,j] with 0 < i,j < m.

In implementation, every thread generates the full set of monomials x^0, x^1, \ldots, x^{d_P/3 + \alpha}, each taken modulo P(x). This computation is very light in terms of performance compared to memory reads and writes.

Define the function

f : \mathbb{F}_2[x]/P(x) \longrightarrow \mathbb{Z}_m

as f(x^i) = \left(2^i \bmod P(2)\right) \bmod m. Clearly if \phi(x^i) = \phi(x^j), then necessarily f(\phi(x^i)) = f(\phi(x^j)) when taken modulo P(x).

The functional value of f determines if the thread should consider it or not, i.e., if the value matches the thread number. This way, we achieve thread safety without using mutex.

Skärmavbild 2020-04-26 kl. 20.23.16

A thread j will put element x^i in T_1[ f(\phi(x^i)), j], unless it is already occupied. If so, say with x^k was already occupying that space in T_1, then it proceeds to T_2[f(\phi(x^i + x^k)), j] and puts x^i + x^k there.

In the final step, we swap the accessing order. So each thread will only handle elements x^i + x^j which had equal f-function values. As equal f values are necessary (but not sufficient) condition for collision to occur, it is impossible to filter out correct candidates by mistake.

Skärmavbild 2020-04-27 kl. 11.14.45

Each thread will merge the elements in the corresponding row. Once collision is found, program outputs i,j,k,l such that x^i + x^j + x^k + x^l = 0 \bmod P(x) and terminates. If no collision is found, we re-iterate with a new mask selection \mathcal{Q} for \phi.

Results

The polynomial of degree 72 can be found by setting \alpha = 1 and runs in a few seconds using m = 23 threads. For the 97-degree polynomial, we need about 560 GB of RAM. While this may sound immensly large, it can actually be done by running it on an AWS instance at relatively low cost.

Another solution is to pick |\mathcal{Q}| = d_P/3 - u for some small integer u. This reduces the size of the first collision layer T_1. Then, for each thread, we require that f(x^i + x^j) = 0 for the element to be added into T_2. This reduces the size of the stored data to \mathcal{O}(2^{d_P/3 - u} + 2^{d_P/3 + u}/m), increases the running time by a factor m.

\begin{array}{l|ll} \text{Polynomial} \qquad & \qquad \text{Degree} & \qquad\text{Weight-4 multiple} \\ \\ \hline\hline \\ p_3(x)p_4(x) \qquad & \qquad 72 & \qquad x^{24352899}+x^{10387660}+x^{7833802}+1\\ \\ & & \qquad x^{31807163}+x^{24477832}+x^{4308951}+1\\ \\ p_1(x)p_3(x)p_4(x)\qquad & \qquad 97 & \qquad x^{12647431389}+x^{8008354628}+x^{1277498415}+1\\ \\ & & \qquad x^{12671067191}+x^{6590666154}+x^{1306461382}+1 \end{array}

This actualizes the attack on the E0 cipher by Lu and Vaudenay.

Verification

sage: Q0 = R(x^12647431389+x^8008354628+x^1277498415+1)
sage: Q1 = R(x^12671067191+x^6590666154+x^1306461382+1) 
sage: P = R(x^97+x^92+x^89+x^80+x^76+x^73+x^69+x^68+x^64+x^61+x^59+x^58+x^57+x^56+x^55+x^54+x^53+x^51+x^49+x^44+
....: x^43+x^41+x^39+x^36+x^35+x^34+x^32+x^31+x^30+x^21+x^19+x^15+x^13+x^9+x^5+x^3+1) 
sage: Q0 % P 
0
sage: Q1 % P 
0

Acknowledgements

Many thanks to klondike (KISec AB) for providing help with computational resources and improving the code.

References

[LV04] Yi Lu and Serge Vaudenay, Faster Correlation Attack on Bluetooth Keystream Generator E0, CRYPTO 2004, LNCS vol. 3152, pp. 407-425, Springer-Verlag, 2004.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s