My notes on the Timing attacks on SSH paper
--------------------------------------------
Author: Atul Prakash

This paper is about collecting the timing intervals between keystrokes and using them to infer the characters being typed for a password. The SSH protocol used to send password characters as they were typed. Also, echo is turned off when a password is being typed, so it is potentially possible to infer when a password is being typed. Measuring the intervals between packets for successive characters can then give clues about the password. We will assume that the network does not add much variance to the timing data.

Methodology:

Let's say we know the timing interval between every pair of characters. Of course, this interval is not going to be exact, so there is likely to be a probability distribution of intervals per character pair. In other words, we know p(latency = y | (x1,x2)), where (x1,x2) is the character pair being typed. We suppose we know this for every pair of characters. Let's call these pairs q1, q2, ..., qn. In short, we will simply say that we know the distribution p(y|q). Usually, this will be a continuous distribution; the paper shows that it is approximately Gaussian.

Now we can ask the inverse question. Suppose we observe a certain latency. What can we say about the probability that it came from a character pair q? That is, we want to calculate P(q | y) for a given latency value y and for each value of the character pair q. It turns out that probability theory gives us the machinery to do that. One of the basic rules of probability theory is

    P(X ^ Y) = P(X|Y) * P(Y)

where P(X|Y) is the conditional probability of occurrence of X, given a value of Y. We can apply that as follows: P(q | y) * P(y) = P(q ^ y) (the probability that the character pair q was typed and the latency y was observed). But P(q ^ y) = P(y | q) * P(q). Therefore,

    P(q | y) = P(y|q) * P(q) / P(y)

In general, such a relationship between conditional probabilities is known as Bayes' Theorem.

We know P(y|q) from the measurements. We also need to know how likely each pair q is among all the character pairs. For now, we will assume that all pairs are equally likely, though that is not necessarily the case. P(y) can now be calculated as follows:

    P(y) = P(y|q1) * P(q1) + P(y|q2) * P(q2) + ...

(y is caused by one of the character pairs; we are adding up all the probabilities.) So P(y) can also be estimated, given all the conditional distributions P(y|qi) and the P(qi). Thus we can determine P(q | y) for each latency value. Of course, the attacker doesn't get complete information about the character pair -- he only gets a ranking, along with probabilities.

Let's consider a simple example with only 4 possible values of q: q1, q2, q3, and q4. Suppose that the attacker observes one of the latencies y1, y2, y3, or y4, according to the following rule:

    P(y1|q1) = 1   (i.e., typing q1 always leads to a latency of y1)
    P(y2|q2) = 0.5
    P(y3|q2) = 0.5
    P(y2|q3) = 0.5
    P(y3|q3) = 0.5
    P(y4|q4) = 1

Further, assume that all the q's are equally likely: P(q1) = P(q2) = P(q3) = P(q4) = 1/4. You should be able to verify that

    P(y1) = 1 * 1/4 = 1/4
    P(y2) = 0.5 * 1/4 + 0.5 * 1/4 = 1/4
    P(y3) = 1/4
    P(y4) = 1/4

And

    P(q1|y1) = P(y1|q1) * P(q1) / P(y1) = 1; rest are 0.
    P(q2|y2) = (0.5 * 1/4) / (1/4) = 0.5; P(q3|y2) = 0.5; rest are 0.
    P(q2|y3) = 0.5; P(q3|y3) = 0.5; rest are 0.
    P(q4|y4) = 1; rest are 0.

In other words, seeing a latency of y1 guarantees that q1 was transmitted.
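Since the posterior calculation is purely mechanical, here is a small Python sketch (mine, not from the paper) that reproduces the 4-pair example: given the likelihoods P(y|q) and the priors P(q), it computes the posteriors P(q|y) with Bayes' theorem. The labels q1..q4 and y1..y4 are just the names from the example above.

    # Reproduce the 4-pair example: posteriors P(q|y) via Bayes' theorem.
    p_y_given_q = {                      # P(y|q) for each pair q
        "q1": {"y1": 1.0},
        "q2": {"y2": 0.5, "y3": 0.5},
        "q3": {"y2": 0.5, "y3": 0.5},
        "q4": {"y4": 1.0},
    }
    p_q = {q: 0.25 for q in p_y_given_q}             # uniform prior over pairs

    def posterior(y):
        """Return P(q|y) for every pair q, given an observed latency y."""
        p_y = sum(p_y_given_q[q].get(y, 0.0) * p_q[q] for q in p_q)    # P(y)
        return {q: p_y_given_q[q].get(y, 0.0) * p_q[q] / p_y for q in p_q}

    for y in ("y1", "y2", "y3", "y4"):
        print(y, posterior(y))
    # y1 pins down q1, y4 pins down q4; y2 and y3 each leave q2 and q3 at 0.5.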
Seeing a latency of y2 or y3 tells us that neither q1 nor q4 was transmitted, and that q2 and q3 were equally likely to have been transmitted. Seeing y4 guarantees that q4 was transmitted.

The next interesting question is roughly how much this knowledge is worth to the attacker in terms of reducing his workload in verifying a guessed character pair. We can first try to give an intuitive answer. Each pair in the original data could be expressed using 2 bits, since there are 4 equally likely possibilities. We say that there are 2 bits of information content in each pair on the average. Thus, in the absence of any other knowledge, the attacker has to figure out 2 bits of information per observation y. (We will assume for now that pairs can come in any order, which is strictly not true for character sequences.) But if the latency is correlated as above, the guesswork is reduced significantly. q1 and q4 are revealed immediately (thus 0 bits to be guessed). The only guesswork is between q2 and q3, where there are 2 choices, or 1 bit of information. Thus:

    25% of the time: 0 bits to be guessed (y1 observed)
    25% of the time: 1 bit to be guessed (one of two choices)
    25% of the time: 1 bit to be guessed (one of two choices)
    25% of the time: 0 bits to be guessed (y4 observed)

Average bits to be guessed per latency value: 2/4 = 0.5. Originally, it was 2 bits. Thus, the above channel has leaked 1.5 bits of information out of 2.

How can we generalize this for arbitrary values of the probabilities? This requires some background in (Shannon's) information theory. The first thing Shannon defined is the notion of "entropy" or "information" in a message. If a message has n possible values, all equally likely, then its entropy is log(n). We will assume that the log is to base 2, since information is being expressed in bits. More generally, if a message y has possible values y1, y2, ..., yn, with probabilities of occurrence p(y1), p(y2), ..., p(yn), then the information or entropy in the message y is:

    H(y) = - sum over i (p(yi) * log(p(yi)))

(If p(yi) = 0, the product is taken to be 0.) Why does this work? Let's see if it matches our intuition when all the probabilities are equal to 1/n. The above value computes to -(1/n log(1/n) + ... + 1/n log(1/n)), or log(n) after you simplify. So it seems to work. Let's consider another case. We have two possible values for a message, y1 and y2, but P(y1) = 1 and P(y2) = 0. In other words, y1 is always sent. The formula gives an entropy of 0. In other words, the sender does not really need to send any bits; the receiver "knows" even without receiving anything that y1 must have been sent.

In an information-leakage situation, the sender is generating messages with a certain inherent entropy (e.g., x bits/character). The information is encrypted, so ideally the attacker just gets what looks like noise. But in practice, the attacker observes a modified signal (here, the latencies) that is correlated with the input. How much uncertainty about the sent data q remains once the attacker has seen the signal? We can apply the same reasoning by measuring the entropy of q conditioned on each observed value:

    H(q | y1) = - sum over i (p(qi|y1) * log(p(qi|y1)))

The average remaining entropy about the sent data q, over all observations y, is:

    H(q | y) = sum over j (H(q|yj) * P(yj))

Original entropy: H(q). Entropy lost: H(q) - H(q|y). This is a measure of "information leakage", i.e., bits leaked per observation y, on the average.
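These definitions are easy to turn into code. The following sketch (again mine, not the paper's) computes H(q), H(q|y), and the leakage for the 4-pair example; the hand calculation in the next section arrives at the same 1.5 bits.

    from math import log2

    # The 4-pair example again: P(y|q) and uniform priors P(q).
    p_y_given_q = {
        "q1": {"y1": 1.0},
        "q2": {"y2": 0.5, "y3": 0.5},
        "q3": {"y2": 0.5, "y3": 0.5},
        "q4": {"y4": 1.0},
    }
    p_q = {q: 0.25 for q in p_y_given_q}

    def entropy(dist):
        """H = -sum p*log2(p), skipping zero-probability outcomes."""
        return -sum(p * log2(p) for p in dist.values() if p > 0)

    def leakage(p_y_given_q, p_q):
        """Information leakage H(q) - H(q|y) for discrete q and y."""
        ys = {y for d in p_y_given_q.values() for y in d}
        p_y = {y: sum(p_y_given_q[q].get(y, 0.0) * p_q[q] for q in p_q) for y in ys}
        h_q_given_y = 0.0
        for y in ys:
            post = {q: p_y_given_q[q].get(y, 0.0) * p_q[q] / p_y[y] for q in p_q}
            h_q_given_y += p_y[y] * entropy(post)   # weight by how often y occurs
        return entropy(p_q) - h_q_given_y

    print(leakage(p_y_given_q, p_q))   # prints 1.5: bits leaked per pair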
Let's apply this to the earlier example with 4 q's and 4 y's.

    H(q|y1) = 0
    H(q|y2) = -(0 + 1/2*log(1/2) + 1/2*log(1/2) + 0) = 1
    H(q|y3) = 1, by a similar calculation
    H(q|y4) = 0

    H(q|y) = average entropy = 0*P(y1) + 1*P(y2) + 1*P(y3) + 0*P(y4)
           = 0 + 1*0.25 + 1*0.25 + 0 = 0.5

Information leakage (also called mutual information) = 2 - 0.5 = 1.5 bits per pair.

In other words, if we had a message of length n, the average entropy in the sent message was 2n bits (2 bits per pair), so the attacker would need to figure out 2n bits. In the received message, using latency measurements, the attacker needs to figure out only 1/4th as many bits, or 0.5n bits. That means trying out 2**(0.5n) possibilities instead of 2**(2n) possibilities. If n is 8, for example, the attacker only needs to explore 2**4 = 16 possibilities instead of 2**16, or about 64K (all on the average). Intuitively, this makes sense: about 4 of the 8 observations would be y1 or y4 and thus definitely known, and the remaining 4 have only 2 possible values each, leading to 2**4 = 16 possibilities in all to explore.

Relation to the paper
---------------------

In Section 3.4 of the paper, the above results are applied.

    Ho(q) = entropy in the source = - sum over i (p(qi) * log(p(qi)))

If all character pairs are equally likely, p(qi) is 1/|Q|, where |Q| is the number of character pairs, and the above reduces to log(|Q|). This is the information content per character pair. So, theoretically, the attacker needs to try out 2**log(|Q|) = |Q| values per pair typed if he has no latency information.

If the attacker sees latency y0, we can calculate

    H1(q | y = y0) = entropy in the input given that we see latency y0
                   = - sum over i (P(qi|y0) * log(P(qi|y0)))

(summing over all possible character pairs). P(qi|y0) can be computed from P(y|qi) as before:

    p(qi|y0) = p(y0|qi) * P(qi) / p(y0)

We can read off p(y0|qi) from the Gaussian distribution graph or from the measured probabilities. P(qi) is assumed to be 1/|Q|. p(y0) can be computed by averaging the probability of seeing latency y0 over all character pairs:

    p(y0) = sum over i (p(qi) * p(y0|qi))

So now we can compute H1(q | y = y0) for a given value y0. We can repeat this for all values of y0 and plot it; that is Figure 6(a) in the paper. Intuitively, notice that if the observed latency is 300 ms, the entropy is low -- in other words, the attacker knows a lot about the message. That makes sense, because Figure 5 shows that there are very few character pairs that can lead to a latency of 300 ms with reasonable odds.

The information leakage for a given latency value y0 is then Ho(q) - H1(q | y0). That is plotted in Figure 6. We can compute the average information leakage over all possible latency values; this is called the "mutual information". Essentially, this is

    sum or integral over y of (information leakage(y) * p(y))

which is equivalent to

    Ho(q) - sum or integral over y of (H1(q|y) * p(y))

(since Ho(q) is a constant and the p(y) sum to 1, it can be pulled out). For a continuous distribution of latencies, this becomes

    Ho(q) - integral (H1(q|y) * p(y) dy)

where p(y) is computed as before. We know all of these quantities. The answer for the data in the paper is that about 1.2 bits per character pair are leaked if the character pairs have a uniform distribution. If we consider 26**2 character pairs, the inherent entropy is log(26**2), or 2*log(26), or about 9 bits, so losing 1.2 bits doesn't seem so bad. But if the passwords come from a dictionary of English words or phrases, the entropy is pretty low -- around 1-2 bits/character, or roughly 2-4 bits per pair. Thus, the impact is pretty high in terms of information loss.
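To make the Section 3.4 computation concrete, here is a sketch of how H1(q | y = y0) and the per-latency leakage could be computed under a Gaussian latency model. The character pairs, means, and standard deviations below are invented for illustration; the paper estimates the real ones from measured inter-keystroke timings and uses the full set of character pairs.

    # Per-latency entropy and leakage under an assumed Gaussian model p(y|q).
    from math import log2, sqrt, pi, exp

    pairs = {          # pair -> (mean ms, std ms); purely illustrative values
        "aa": (120, 30), "ab": (150, 35), "ba": (200, 40), "bb": (300, 30),
    }
    p_q = {q: 1.0 / len(pairs) for q in pairs}        # uniform prior, 1/|Q|

    def gauss(y, mu, sigma):
        return exp(-((y - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

    def h1(y0):
        """H1(q | y = y0): remaining entropy about the pair given latency y0."""
        lik = {q: gauss(y0, *pairs[q]) for q in pairs}      # p(y0|qi)
        p_y0 = sum(lik[q] * p_q[q] for q in pairs)          # p(y0)
        post = {q: lik[q] * p_q[q] / p_y0 for q in pairs}   # p(qi|y0), Bayes
        return -sum(p * log2(p) for p in post.values() if p > 0)

    h0 = log2(len(pairs))                # source entropy, log|Q|
    for y0 in (100, 150, 250, 300):
        print(y0, "ms: H1 =", round(h1(y0), 2), " leak =", round(h0 - h1(y0), 2))

Averaging h0 - h1(y) against the latency density p(y), numerically or by integration, then gives the mutual-information estimate described above (about 1.2 bits per pair for the paper's measured data).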
Inferring character sequences
-----------------------------

Given a latency measurement, we can estimate p(qi), where qi is a character pair. However, q(i) depends on q(i-1). (We will use q(i) and qi interchangeably.) For example, if the first character pair is "ac", the next one can only be "ca", "cb", "cc", etc. -- it must start with the 2nd letter of the previous pair.

The paper makes the simplifying assumption that all character pairs are initially equally likely, and that given a particular ith character pair, all the possible (i+1)st character pairs are equally likely and depend only on the value of the ith character pair. The English language is only going to restrict these possibilities further, making the job of the attacker easier, so this is a reasonable worst-case assumption. We can state it formally as:

    p(q(i+1) | q(i)) = 0      if q(i+1) cannot follow the pair q(i)
    p(q(i+1) | q(i)) = 1/|A|  if q(i+1) can follow q(i), where |A| is the number of possible characters

This can be drawn graphically as what is called a Markov model of the possible state transitions between the various values of q. The problem now is: given a sequence of observations of latencies, what is the most likely sequence of state transitions for the qi's? That is our most likely password. Of course, because the password space is large, we could be wrong. The paper tries to figure out the n most likely state sequences that could lead to the given sequence of latencies.

We will not go into the details of the algorithm in the paper. We will just show how to compute the following for any given sequence q1...qn of state transitions, when we know y1...yn:

    P(q1...qn | y1...yn)

If we can do this, we can theoretically solve the problem (though inefficiently) by doing the computation for every combination of q1...qn, and then choosing the one that maximizes the computed P.

    P(q1...qn | y1...yn) = P(y1...yn | q1...qn) * P(q1...qn) / alpha

where alpha is P(y1...yn), i.e., the odds of observing that sequence of latencies among all possible sequences. The first term is simply P(y1|q1) * P(y2|q2) * ... * P(yn|qn), assuming that the observed latency depends only on the current character pair. This may not strictly be true, but it is a good simplifying assumption and lets us compute the first term. P(q1...qn) can be computed from the Markov chain: it is basically P(q1) * P(q2|q1) * P(q3|q2) * ... * P(qn|q(n-1)), i.e., the product of the transition probabilities along that sequence. The term alpha can be computed as well, as the sum over all sequences of q of P(sequence of q ^ y1 ^ ... ^ yn); we know how to compute that value for any sequence of q and can sum it up.

The problem with the above is that it requires O(|Q|**n) work. The Viterbi algorithm reduces that to O(|Q|**2 * n) steps for guessing a sequence of length n. If we want to guess the m most likely sequences, the Viterbi algorithm doesn't quite address that question, but the algorithm in the paper has a cost of O(|Q|**2 * n * m), i.e., it is like applying the Viterbi algorithm m times.
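For concreteness, here is a generic Viterbi sketch over the pair Markov model with Gaussian emissions. It finds only the single most likely sequence; the paper's algorithm goes further and returns the n best sequences, which this sketch does not attempt. The alphabet, emission parameters, and example timings are all made up for illustration.

    # Generic Viterbi decoding (not the paper's n-best algorithm): most likely
    # sequence of character pairs given observed latencies, using the uniform
    # Markov model above and assumed Gaussian emission probabilities p(y|q).
    from math import log, sqrt, pi

    alphabet = "ab"                                   # toy alphabet, |A| = 2
    pairs = [a + b for a in alphabet for b in alphabet]
    emit = {"aa": (120, 30), "ab": (150, 35), "ba": (200, 40), "bb": (300, 30)}

    def log_emit(y, q):
        mu, sigma = emit[q]                           # log of Gaussian density
        return -((y - mu) ** 2) / (2 * sigma ** 2) - log(sigma * sqrt(2 * pi))

    def log_trans(prev, cur):
        # cur must start with the last character of prev; uniform otherwise.
        return log(1.0 / len(alphabet)) if cur[0] == prev[1] else float("-inf")

    def viterbi(latencies):
        """Return the most likely pair sequence for the observed latencies."""
        # Initialization: uniform prior over the first pair.
        score = {q: log(1.0 / len(pairs)) + log_emit(latencies[0], q) for q in pairs}
        back = []
        for y in latencies[1:]:
            new_score, ptr = {}, {}
            for cur in pairs:
                best_prev = max(pairs, key=lambda p: score[p] + log_trans(p, cur))
                new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                                  + log_emit(y, cur))
                ptr[cur] = best_prev
            back.append(ptr)
            score = new_score
        # Backtrack from the best final state.
        seq = [max(score, key=score.get)]
        for ptr in reversed(back):
            seq.append(ptr[seq[-1]])
        return list(reversed(seq))

    print(viterbi([150, 210, 290]))   # most likely chain of pairs for these made-up timings

The O(|Q|**2 * n) cost mentioned above is visible here: for each of the n observations, the inner loops consider every (previous pair, current pair) combination.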