Tim Black's Blog

How many spy retreats? Information stealers meet Information Theory

Posted October 7, 2018

FiveThirtyEight's Riddler Classic this week is a puzzle about sniffing out spies. It came from Moin Qureshi. Here's the puzzle:

There are N agents at the RIA (that’s the Riddler Intelligence Agency), and K of them are spies. Your job is to identify all the spies. (No, this is not a game of Codenames.)

You can send a given number of agents to a “retreat” on a remote island. If all K spies are present at the retreat, they will meet to strategize. If even one spy is missing, this spy meeting will not take place. The only information you get from a retreat is whether or not the spy meeting happened. You can send as many agents as you like to the retreat, and the retreat can happen as many times as needed. You know the values of N and K.

What’s the minimum number of retreats needed to guarantee you can identify all K spies? If each retreat costs $1,000 per person, what is the total cost to identify all K spies?
To begin with, let’s assume you know that N = 1,024 and K = 17.

If you claim that the minimum number of retreats is $R$, there are two things you need to prove. For one, you need to describe a strategy that you can prove always finishes in at most $R$ retreats (or at least, prove that such a strategy exists). But you also need to prove that every strategy requires at least $R$ retreats in the worst case.

For the setting of $N = 1024$, $K = 17$ we will prove that any succesful strategy requires at least $122$ retreats in the worst case, and we will describe a strategy that is guaranteed to finish in at most $131$ retreats. It is possible that there are solutions that require a number of retreats less than $131$ (but still at least $122$), and I would be excited to see them, especially if there is some underlying principle behind them.

For general $N$ and $K$, will show any strategy requires at least $\newcommand{\ceil}[1]{\lceil #1 \rceil} \ceil{\log_2 \binom{N}{K}}$ retreats, and our strategy will use at most $K - 1$ retreats beyond that; that is, our strategy is guaranteed to find the spies in at most $\ceil{\log_2 \binom{N}{K}} + K - 1$ retreats.

A point of inspiration: Guess my number. And, a lower bound.

The classic game Guess my number is essentially a simpler version of the spy puzzle (actually, it's equivalent to the $K = 1$ case). I'm thinking of a number between $1$ and $100$, and you try to figure it out by asking yes-no questions. You want a strategy that is guaranteed to finish in as few questions as possible.

A standard first question is "Is your number between $1$ and $50$?" This is a good question, because you make good progress whether the answer is yes or no — either way, the number of possibilities cuts down from $100$ to $50$. Compare that to the narrower question "Is your number between $1$ and $10$?" If the answer to this question is yes, you are in good shape, because you've really narrowed down the possibilities. But in the worst case, the answer is no, and there's still $90$ options for my number.

The best strategy is to always ask questions that apply to half of the remaining possibilities. The first question cuts the possibilities from $100$ down to $50$, the second from $50$ to $25$, the third from $25$ to $12$ or $13$, and so on, until after at most $7$ questions, you know what my number is.

Since each question can do no better than cutting the possibilities in half (in the worst case), there is no way to get down from $100$ choices to $1$ in fewer than $7$ questions. For another way to see this, suppose we ask at most $q$ questions. Each question has two possible answers (yes or no), so there are at most $2^q$ possible sequences of answers to all the questions. This is true even if we sometimes don't ask all the question, and it is even true if the later questions may vary depending on the answers to the earlier ones. But $2^q$ possible sequences can only distinuish between $2^q$ possibilities. So, to distinguish between the $100$ numbers I might be thinking of, we need $2^q \geq 100$. So, $q \geq \log_2 100$. Since $q$ is an integer, that means every strategy requires at least $q \geq 7$ questions in the worst case. If you follow the divide-in-half strategy (binary search), you are guaranteed to find the number in at most $7$ questions.

What does this mean for the spy puzzle? The number of ways that $K$ spies can hide among $N$ agents is $\binom{N}{K}$. This is the number of possibilities we need to distinguish between. Each retreats gives us the answer to a yes-no question. Like in Guess my number, the number of questions needed (for any strategy, in the worst case) is at least the log base $2$ of this. That is, every strategy requires a lower bound of $\ceil{\log_2 \binom{N}{K}}$ retreats. When $N = 1024$, $K = 17$, this number is $122$.

Unlike in Guess my number, however, in the spy puzzle you are only allowed to ask a specific kind of question (by sending a certain set of agents on a retreat). You might not always (or even usually) have a question available to you that applies to half the possibilities. So, it's not clear if there actually exists a strategy that achieves the lower bound. We'll see what we can do.

A first attempt at a retreat schedule

Let's try out a strategy. We'll number the agents $1, 2, 3, \dots, N$. We'll try to identify the spies one-by-one, starting with the highest-numbered spy (numbered $H$).

To do this, we'll hold retreats consisting of all spies numbered $1$ through $m$, for some $m$ between $1$ and $m$. If we choose $m \geq H$, then the spies will hold a meeting, while if we choose $m < h$, the spies will not hold a meeting. By holding the retreat, we get an answer to the question "Is $H \leq m$?"

This lets us do a binary search, just like in Guess that number. For our first retreat, we take $m = N / 2$. If the spies hold a meeting at this first retreat, then we learn that $1 \leq H \leq N / 2$, while if they don't, we learn that $N / 2 + 1 \leq H \leq N$. Either way, we cut down the possibilities for $H$ by half, and for the next question. For the second retreat, we pick $m$ to be the median of the remaining possibilities (so, either $m = N / 4$, or $m = 3 N / 4$). This again cuts down the possibilities by half. We continue this way, until after at most $\ceil{\log_2 N}$ retreats, we have determined $H$ — we now know the identity of one spy!

We next want to discover the second-highest-numbered spy. We follow the exact same process that we used for the first spy, with one exception: The spy we've already identified is sent to the island permanently, to participate in every single retreat (not a bad punishment, perhaps?). So every retreat consists of agents $1$ through $m$ for some $m$, along with agent $H$. After at most $\ceil{\log_2 N}$ retreats, we identify this second spy.

We repeat this process again to find the third-highest-numbered spy, then the fourth, and so on, with each retreat consisting of the agents $1$ through $m$ for some $m$, along with all the spies discovered so far. After $K$ iterations of this process (with each iteration consisting of at most $\ceil{\log_2 N}$ retreats), we have discovered the identity of all $K$ spies.

Thus, this strategy is guaranteed to require at most $K \ceil{\log_2 N}$ retreats in total. When $N = 1024$, $K = 17$, this means holding $170$ retreats.

Evaluating our first attempt

When $N = 1024$, $K = 17$, our strategy required holding $170$ retreats, whereas our lower bound was $122$. That's a gap of $48$. We use about 39% more retreats than what the lower bound forces on us.

For general $N$ and $K$, how does the guarantee of $K \ceil{\log_2 N}$ retreats compare to our lower bound of $\ceil{\log_2 \binom{N}{K}}$ from before? Do they continue to be about $50$ apart from each other? Does the first continue to be about 40% more than the second? To address this, let's expand out these two quantities: \[ \begin{align*} K \ceil{\log_2 N} &= \sum_{i = 0}^{K - 1} \ceil{\log_2 N}, \text{ and} \\ \left\lceil \log_2 \binom{N}{K} \right\rceil &= \left\lceil \sum_{i = 0}^{K - 1} \log_2 (N - i) - \log_2 (K!) \right\rceil. \end{align*} \]

The two summation symbols seems to be close together in value. The $\log_2 N$ vs. $\log_2 (N - i)$ is not a big deal (in fact, by being just a little bit more careful with our argument above, we could have a $\log_2 (N - i)$ in the first expression as well). A bigger deal is that in the first summation, every single term has a ceiling around it. These ceilings could increase the value of entire summation by almost $K$ when $N$ is slightly more than a power of $2$. An even bigger difference between the two expressions is that the second one has an final $- \log_2 (K!)$ term (this is approximately $- K \log_2 K + 1.4 K$). For small $K$, this is not too bad, but it could be better.

What's holding us back? If we inspect our strategy, and look back at our very first retreat, we find something suspicious. When $N = 1024$, $K = 17$, there are $\binom{1024}{17} \approx 3.68 \times 10^{36}$ possible teams of spies. Our first retreat consists of agents $1$ through $512$. The spies hold a meeting if (and only if) all $17$ of them are among the first $512$ agents, but this is only the case for $\binom{512}{17} \approx 2.45 \times 10^{31}$ of the possible teams of spies, which is only about a $1 / 150000$ fraction of the possibilities. So this retreat actually represents a very, very narrow question, and therefore is not a very good question to ask.

But we don't have to give up on this idea! We can take this strategy and try to modify it to make it even better. In fact, since we identified a flaw, this gives us some guidance on what we should try to improve.

In our new strategy, we will continue to only hold retreats consisting of agents $1$ through $m$ for some $m$, along with all the spies we have identified so far. The change will be what values we choose for $m$. We now have a sense that we should choose $m$ so that no matter whether the spies end up holding a meeting or not, we will be able to cross off about half the list of possible spy teams. This means, for example, that our first $m$ will not be $N / 2$ but actually something much closer to $N$.

Beefing up binary search: A more sophisticated Guess my number and prefix-free codes

Just as our first strategy for finding spies was based on a strategy for Guess my number, our improved strategy will be based on another game, a modfication of Guess my number. In this section, we will look at Guess my number from a different angle, then we'll introduce our modification of it, and a strategy for that modification.

Let's talk about Guess my number again, but this time for simplicity, I'm thinking of a number between $1$ and $6$. You will determine my number by asking questions of the form "Is the number between $1$ and $m$?", where you can replace $m$ with the number of your choice. For the first question, you could take $m = 3$. Then, depending on my answer, for the second question, you could take either $m = 1$ or $m = 4$. If my number is either $1$ or $4$, you're done at this point, and otherwise you need to ask one more question (taking either $m = 2$ or $m = 5$). Let's say you down the answer to each of my questions, writing down "0" if the answer is no and "1" if it is yes. The next table shows what you end up with: If my number is an entry in the left column, the sequence you end up writing down is the corresponding entry in the right column.

My number	Sequence of answers
6	000
5	001
4	01
3	100
2	101
1	11

Something to notice about these six sequences: No one is a prefix of another; in other words, there's no sequence in the list where you can add something to the end and get a different sequence in the list. So, we say that this list of six sequences is a prefix-free code. This is an important property! To illustrate the point, suppose for a moment there were some sequence that was a prefix of another. For example, let's suppose that the number $4$ were associated with the sequence 01 while the number $5$ were associated with the sequence 010. This would be saying that if the answers to your first two questions were no, yes, then you would know that my number is $4$ and would stop asking questions, but somehow at the same time you would ask a third question, and if that answer was no then you would conclude my number is $5$. But that doesn't make sense!

For any strategy for Guess my number, we can do the same thing we did above: for each possible number $n$ I could be thinking of, we write down the sequence representing the answers you'd get if you followed the strategy when my number is $n$. Just like above, if we write down all the sequences (one for each number that I could be thinking of), they must together form a prefix-free code.

If we ask if there is a strategy to win Guess that number (with the numbers $1$ through $6$) in at most 3 questions, guaranteed, we know the answer is yes, because $\ceil{\log_2 6} = 3$. Now let's make it more complicated.

Problem A. I'm thinking of a number between $1$ and $6$. If my number is $6$, you want to figure it out in 1 question. If my number is $5$, $4$, $3$, or $2$, you want to figure it out in at most 3 questions. And if my number is $1$, you want to determine that in at most 4 questions. Can you do it?

If the answer to Problem A were yes, then we could turn the strategy into a prefix-free code. This prefix-free code would consist of one sequence of length 1, four sequences of length 3, and one sequence of length 4. Is there such a prefix-free code? The following theorem, due to Kraft and McMillan, says no (and so the answer to Problem A is also no).

Theorem 1. If sequences $s_1, s_2, s_3, \dots, s_N$ — with lengths $\ell_1, \ell_2, \ell_3, \dots, \ell_N$, respectively — form a prefix-free code, then, \[ \sum_{i = 1}^N \frac{1}{2^{\ell_i}} \leq 1. \]

Proof. Consider a fair coin with "0" written on one side, and "1" written on the other. We flip this coin repeatedly, recording the outcome of each flip. We stop when the sequence that we've written down is one of $s_1, s_2, \dots, s_N$. The probability that we write down $s_i$ is the probability that $\ell_i$ coin flips in a row go the right way, which is $1 / 2^{\ell_i}$. Note that it's important that this is a prefix-free code; if some other $s_j$ were a prefix of $s_i$, then we would never actually write down all of $s_i$ because we would always stop once we had written down $s_j$. We always write down at most one of the $s_i$, so, \[ \sum_{i = 1}^N \operatorname{Pr}[\text{we write down } s_i] \leq 1. \] That is, \[ \sum_{i = 1}^N \frac{1}{2^{\ell_i}} \leq 1. \square \]

In Problem A, we have that \[ \sum_{i = 1}^N \frac{1}{2^{\ell_i}} = \frac{1}{2^1} + \frac{1}{2^3} + \frac{1}{2^3} + \frac{1}{2^3} + \frac{1}{2^3} + \frac{1}{2^4} > 1, \] so we can conclude that this task is impossible.

The general version of Problem A is as follows.

Problem B. I'm thinking of a number between $1$ and $N$. Given are non-negative integers $\ell_1, \ell_2, \dots, \ell_N$. Your goal is to determine my number by asking yes-no questions of the form "Is the number between $1$ and $m$?", where you may choose the value of $m$. But, you must do it in such a way that if my number is $i$, you ask at most $\ell_i$ questions. Is there strategy to accomplish this?

Theorem 1 gives us a condition under which this problem cannot be solved. Specifically, we have the following.

Corollary 2. If \[ \sum_{i = 1}^N \frac{1}{2^{\ell_i}} > 1, \] then there does not exist a successful strategy for Problem B.

We are also interested in conditions under which Problem B does have a successful solution. Kraft and McMillan proved that if the inequality in Theorem 1 holds for some $\ell_1, \ell_2, \ell_3, \dots, \ell_N$, then there in fact exists a prefix-free code with sequences of those lengths. But, the thing that will be useful for our purposes is slightly different from this, because we want to make sure all of our questions are of the specific form "Is the number between $1$ and $m$?" (for some $m$).

Proposition 3. Let $\ell_1, \ell_2, \ell_3, \dots, \ell_N$ be nonnegative integers. Suppose that $\ell_1 \geq \ell_2 \geq \dots \geq \ell_N$, and \[ \sum_{i = 1}^N \frac{1}{2^{\ell_i}} \leq 1. \] Then, there exists a successful strategy for Problem B.

Proof. To describe our strategy, we will describe the sequence $s_i$ associated with each number $i$ between $1$ and $N$. To have a valid strategy,

we need the sequences $s_1, s_2, \dots, s_N$ to form a prefix-free code;
we need each sequence $s_i$ to have length at most $\ell_i$; and
we need the sequences to be in antilexicographic order — that is, if you write out $s_N, s_{N - 1}, \dots, s_2, s_1$ in that order, they must be in alphabetical order (where we consider "0" to be alphabetically before "1") — this is to satisfy the constraint that questions be of our special form.

Here is how we choose the $s_i$. For $1 \leq i \leq N$, write the number \[ f(i) = \sum_{j = i + 1}^{N} \frac{1}{2^{\ell_j}} \] in binary, and take $s_i$ to be the first $\ell_i$ digits (or rather, bits) after the decimal point (binary point?).

For all $1 \leq i \leq N$, we have that $0 \leq f(i) < 1$. Because of this, and because $f(i)$ is decreasing as a function of $i$, the sequences are in antilexicographic order. By construction, the sequence $s_i$ has length $\ell_i$. So, it only remains to show that this is a prefix-free code.

Notice that in the binary expansion of $f(i)$, all decimal places beyond the first $\ell_i$ are $0$. This is because $f(i)$ is divisible by $1 / 2^{\ell_i}$. Consider any $1 \leq i < j \leq N$. We will check whether $s_{j}$ is a prefix of $s_i$ (since $\ell_{j} \leq \ell_i$, we don't need to check the reverse). If it were, then $f(i)$ and $f(j)$ would have to agree on the $\ell_{j}$ bits after the decimal place, and in this case, we would have $f(i) - f(j) < 1 / 2^{\ell_{j}}$. But actually, $f(i) - f(j) = \sum_{a = i + 1}^{j} 1 / 2^{\ell_a} \geq 1 / 2^{\ell_{j}}$. So, $s_j$ is not a prefix of $s_i$. We conclude that this is indeed a prefix-free code.

This completes the proof. $\square$

Proposition 3 says describes a strategy, but describes it in terms of the sequences associated with it. Let's translate that back into yes-no questions. At each step in the process, do the following. Write "0.". Then after the decimal point, write the answers to each question you've asked so far, where no counts as "0" and yes counts as "1" (if you haven't asked any questions yet, you skip this step). Then, add one more "1" to the end. Interpret this as a number in binary, and call it $b$. Let $m$ be the value such that \[ \sum_{i = m + 1}^N \frac{1}{2^{\ell_i}} = b. \] Your next question is "Is the number between $1$ and $m$?" If no such $m$ exists because $m = M$ makes the sum too small while $m = M - 1$ makes the sum too big, then stop asking questions! The answers you have received so far form the sequence $s_M$, and the number you are seeking is $M$.

With this strategy in hand, we are ready to find some spies!

Our improved strategy for finding spies, with prefix-free codes and recursion

In our first strategy for finding spies, we only held retreats of a special form, where each retreat included all the agents from $1$ through $m$ for some $m$, along with all the spies we have already positively identified with certainty. The new strategy that we develop in this section will still only use retreats of this form, although for different choices of $m$.

For $N \geq K \geq 0$, let $R(N, K)$ be the minimum value for which there exists a strategy that is guaranteed to determine $K$ spies from among $N$ agents using at most $R(N, K)$ retreats. It's nice to have notation for this, because we will develop a recursive formula for $R(N, K)$.

The restriction that retreats must have a special form forces us to identify the spies in order from highest-numbered to lowest. Furthermore, as of the moment we identify the $i$th spy (who is agent number $M$, say), we have no information about the remaining $K - i$ spies, other than that they must all be numbered at most $M - 1$. So, the number of remaining retreats we need to guarantee victory is $R(M - 1, K - i)$.

We note the following.

Lemma 3.5. For fixed $K$, $R(N, K)$ is non-decreasing as a function of $N$.

Proof. Consider any $N < N'$. Any strategy that finds $K$ spies among $N'$ agents can also be used to find $K$ spies among $N$ agents, by simply ignoring any directives for agents numbered $N + 1$ through $N'$. $\square$

Let's find a recursive formula for $R(N, K)$. As a base case, we'll take $K = 0$.

Lemma 4. For all $N \geq 0$, we have that $R(N, K) = 0$.

Proof. There is only one way for there to be $0$ spies, so we don't need to hold any retreats. $\square$

The recursive formula is based on the strategy for Problem B in the previous section.

Lemma 5. For all $N \geq K \geq 1$, we have that $R(N, K$\) is the smallest value of $r$ for which \[ \sum_{i = K}^{N} \frac{1}{2^{r - R(i - 1, K - 1)}} \leq 1. \]

Proof. Let $r$ be a positive integer. We will determine whether there is a strategy for finding $K$ spies among $N$ agents that is guaranteed to use at most $r$ retreats. For any $i$, if the highest-numbered spy is agent number $i$, then a successful strategy will require $R(i - 1, K - 1)$ more retreats to guarantee that it finds the remaining $K - 1$ spies. So, a successful strategy using at most $r$ total retreats needs to identify the highest-numbered spy using at most $r - R(i - 1, K - 1)$ retreats when the highest-numbered spy is agent $i$.

This is possible exactly when Problem B from the previous section, with $\ell_i = r - R(i - 1, K - 1)$, has a successful solution. The retreat consisting of agents $1$ through $m$ exactly corresponds to the question "Is the number between $1$ and $m$?"

So, there there is a strategy that always finds the spies in $r$ questions if and only if \[ \sum_{i = K}^{N} \frac{1}{2^{r - R(i - 1, K - 1)}} \leq 1. \quad \square \]

We combine and rearrange the previous two lemmas to get our recursive formula for $R(N, K)$.

Proposition 7. For $N \geq 0$, we have that $R(N, 0) = 0$. For all $N \geq K \geq 1$, \[ R(N, K) = \left\lceil \log_2 \left( \sum_{i = K}^N 2^{R(i - 1, K - 1)} \right) \right\rceil. \]

Proof. The case of $M = 0$ was Lemma 4.

Consider $K \geq 1$. Multiplying the inequality from Lemma 5 by $2^r$, and taking the log base 2, we find that $R(N, K)$ is the least value of $r$ for which \[ r \geq \log_2 \left( \sum_{i = K}^{N} 2^{R(i - 1, K - 1)} \right). \quad \square \]

Now that we have a recursive formula for $R(N, K)$, it would be nice to have a closed-form expression for it. We are not going to get that right now, but we will find some decent bounds on $R(N, K)$. For one, we have the lower bound from the first section, $R(N, K) \geq \ceil{\log_2 \binom{N}{K}}$. We'll now prove an upper bound that is only $K - 1$ larger than that.

Proposition 8. For all $N \geq K \geq 1$, we have that \[ R(N, K) \leq \left\lceil \log_2 \binom{N}{K} \right\rceil + K - 1. \]

Proof. We preceed by induction on $K$. As our base case, for $K = 1$, the inequality holds with equality, by using the typical binary search strategy that we saw for Guess my number. To be precise, by Proposition 7, \[ R(N, 1) = \left\lceil \log_2 \left( \sum_{i = 1}^N 2^{R(i - 1, 0)} \right) \right\rceil = \ceil{\log_2 N}. \]

Consider any $K \geq 2$, and suppose the proposition holds for smaller values of $K$. Then, \[ \begin{align*} R(N, K) &= \left\lceil \log_2 \left( \sum_{i = K}^N 2^{R(i - 1, K - 1)} \right) \right\rceil \\ &\leq \left\lceil \log_2 \left( \sum_{i = K}^N 2^{\ceil{\log_2 \binom{i - 1}{K - 1}} + K - 2} \right) \right\rceil \\ &\leq \left\lceil \log_2 \left( 2^{K - 1} \sum_{i = K}^N \binom{i - 1}{K - 1} \right) \right\rceil \\ &= \left\lceil \log_2 \left( 2^{K - 1} \binom{N}{K} \right) \right\rceil \\ &= \left\lceil \log_2 \binom{N}{K} \right\rceil + K - 1. \end{align*} \] This completes the inductive step. $\square$

For our case of interest, $N = 1024$, $K = 17$, this tells us that $R(1024, 17) \leq 138$, that is, that our new strategy requires at most 138 retreats, just 16 more than our lower bound. But Proposition 8 only gives a bound. If we are willing to do some computer programming, we can find the exact value of $R(1024, 17)$. Such a program calculates the function $R(N, K)$ using our recursive formula and dynamic programming. It requires only a few lines of code, and runs relatively quickly. We find that $R(1024, 17) = 131$.

So, our strategy from this section is guaranteed to find all $17$ spies from among the $1024$ agents using at most $131$ retreats. This is the best that one can do only using retreats of our special form, though it is possible that there is a strategy to do better that uses retreats not of this form. However, even still, there is no strategy that can guarantee fewer than $122$ retreats.

last updated 10.8.2018