A few days ago I participated in the 2017 Google Hash Code challenge, which was about finding an optimal configuration of videos to cache on different cache servers, given a set of requests and the respective latencies from endpoints that may or may not be connected to individual cache servers. You can download the problem statement and data sets here.

Let me say the most important thing upfront: The solution described below is not the solution that our team came up with within the time limit. In fact, we merely implemented a greedy solution that ended up somewhere midfield. Kudos to all teams that came up with sophisticated algorithms within the four hour time limit! Keep in mind that I took at least twice that amount of time to come up with and implement more successful solutions, and that is on top of the time spent on the actual challenge.

But still, I think that taking this kind of additional time to understand and explore efficient solutions to the problem is a valuable experience. In the following I will describe the approaches that I ended up coding up in the aftermath of the Hash Code challenge.

Hash Code 2017: Streaming Videos

But let me tell you more about the precise problem formulation first. The introduction in the problem statement reads:

Have you ever wondered what happens behind the scenes when you watch a YouTube video? As more and more people watch online videos (and as the size of these videos increases), it is critical that video-serving infrastructure is optimized to handle requests reliably and quickly.

This typically involves putting in place cache servers, which store copies of popular videos. When a user request for a particular video arrives, it can be handled by a cache server close to the user, rather than by a remote data center thousands of kilometers away.

But how should you decide which videos to put in which cache servers?

This is precisely what we will be dealing with. There were several data sets available with different network configurations and different numbers of caches (\(C\) with \(1 \le C \le 1000\)), endpoints (\(E\) with \(1 \le E \le 1000\)), videos (\(V\) with \(1 \le V \le 10000\)) and requests (\(R\) with \(1 \le R \le 1000000\)). The maximum cache capacity (\(X\) with \(1 \le X \le 500000\)) in megabytes was constant within each data set.

The following image shows what a network could look like. At the top we have the data center that holds all videos, the second row from the top shows the cache servers, and the bottom row shows the individual endpoints.

[Figure: Problem Statement: Streaming Videos]

The requested output file basically required you to decide which videos to put on each cache server. Solutions were scored according to the average time saved per request (in microseconds), compared to fetching the video from the data center, averaged over all individual requests and rounded down to the nearest integer, i.e.

$$\mathcal{L} = \left\lfloor \dfrac{1000}{N} \sum\limits_{i=1}^R \left(l^\text{D}_{e_i} - l^\text{C}_{e_i}\right) \cdot k_i \right\rfloor,$$

where \(N = \sum_i k_i\) is the total number of individual requests, \(l^\text{D}_{e_i}\) is the latency from the requesting endpoint \(e_i\) to the data center, \(l^\text{C}_{e_i}\) is the latency of that endpoint's best connection to a cache server holding the video in question, and \(k_i\) is the number of times the video is requested in the \(i^\text{th}\) request description.
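
To make the scoring concrete, here is a minimal sketch of how one could evaluate a solution locally. The request tuples and the dc_latency / cache_latencies / cached_videos lookups are hypothetical stand-ins for whatever data structures you parse the input into, not the official scoring tool:

def score(requests, dc_latency, cache_latencies, cached_videos):
    ''' Average time saved per request (in microseconds), rounded down.

        requests        -- list of (endpoint_id, video_id, n_requests) tuples
        dc_latency      -- dict: endpoint_id -> latency to the data center (ms)
        cache_latencies -- dict: endpoint_id -> {cache_id: latency in ms}
        cached_videos   -- dict: cache_id -> set of video ids stored on that cache
        (all of these names are hypothetical -- adapt them to your own parser) '''
    total_saved, total_requests = 0, 0
    for endpoint_id, video_id, n_requests in requests:
        total_requests += n_requests
        # Lowest latency among connected caches that actually hold the video
        candidates = [latency for cache_id, latency in cache_latencies[endpoint_id].items()
                      if video_id in cached_videos[cache_id]]
        if candidates:
            total_saved += (dc_latency[endpoint_id] - min(candidates)) * n_requests
    return 1000 * total_saved // total_requests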

The Knapsack Problem

The knapsack problem is one of the standard problems in CS and combinatorial optimization: it asks for an optimal selection out of \(n\) items to put into a knapsack of limited capacity \(W\), where each item is associated with a weight \(w_i\) and a value \(v_i\). More precisely, we want to solve the following optimization problem:

$$\operatorname{max} \Big\{ \sum\limits_{i=1}^n v_i k_i \Big\} \text{ s.t. } \sum\limits_{i=1}^n w_i k_i \le W \text{ and } 0 \le k_i \le K_i,$$

where \(k_i\) is the number of times the \(i^\text{th}\) item is included and \(K_i\) items of this kind are available. There are different variants of the problem: The knapsack problem is called unbounded if we have an unlimited number of items of each kind available, i.e. \(K_i \to \infty \; \forall i\); otherwise it is called bounded. A special case of the bounded knapsack problem is the so-called 0/1 knapsack problem, where the upper bound is a single item of each kind, which we can either put in the knapsack once or not at all.
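
Written for the 0/1 case with binary decision variables \(x_i\) that indicate whether the \(i^\text{th}\) item goes into the knapsack, this becomes:

$$\operatorname{max} \Big\{ \sum\limits_{i=1}^n v_i x_i \Big\} \text{ s.t. } \sum\limits_{i=1}^n w_i x_i \le W \text{ and } x_i \in \{0, 1\}.$$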

From Theory to Practice

So you should start to see some parallels here. Obviously, the knapsacks are our caches, and the items are our videos. One way to apply this theory to our problem is to solve \(C\) individual 0/1 knapsack problems, one for each cache server, each time assuming that the configuration of all other cache servers is fixed. Because this assumption does not actually hold, i.e. because we change the configuration of the other cache servers after having obtained the optimal knapsack configuration for one of them, we will ultimately obtain a solution that is not globally optimal. But this is okay for now, let's just keep that in mind.

If we want to drop the above independence assumption and fill the caches simultaneously, things become much harder! We could then think of the problem as a variant of the multiple knapsack problem, with the important difference that the profits depend on the cache server you choose. Furthermore, the problem needs to be considered unbounded, since it might obviously be beneficial to put the same video on different caches. This problem is NP–hard. There are most likely some approximation schemes available; however, we will not tackle this for now but instead work with the above assumption.

Implementation using Dynamic Programming

Now that we have reduced the problem to a standard problem in CS, we can come up with an efficient solution using Dynamic Programming (DP). This is just a fancy term for recursion paired with a mechanism to avoid computing sub–problems multiple times, e.g. using a lookup table or something like an LRU cache.

Since this problem has optimal substructure, we can divide it into smaller sub–problems, limiting the number of items \(n\) and the maximum capacity \(W\), and build up the optimal solution of the full problem step by step by combining the optimal solutions of the sub–problems. There are usually two ways to implement a DP algorithm: a recursive top–down approach with memoization, and an array–based bottom–up approach. In this case I settled for the latter.
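
Concretely, if \(\text{dp}[i, j]\) denotes the best total value achievable using only the first \(i\) items with a capacity of \(j\), the recurrence implemented below is

$$\text{dp}[i, j] = \begin{cases} \text{dp}[i-1, j] & \text{if } w_i > j, \\ \max\big(\text{dp}[i-1, j - w_i] + v_i,\; \text{dp}[i-1, j]\big) & \text{otherwise}, \end{cases}$$

with the base case \(\text{dp}[0, j] = 0\) for all \(j\).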

So here is what it looks like:

def knapsack01(n, W, video_sizes, latency_gains):
    ''' Solves the 0/1–knapsack problem in quadratic time and quadratic space using a bottom–up DP approach. '''

    weights, v = np.hstack(([0], video_sizes)), np.hstack(([0], latency_gains))
    dp = np.zeros((n + 1, W + 1), dtype = np.int64)
    for i in range(1, n + 1):
        for j in range(W + 1):
            if weights[i] > j:
                dp[i, j] = dp[i - 1, j]
            else:
                dp[i, j] = max(dp[i - 1, j - weights[i]] + v[i], dp[i - 1, j])

    # Now we need to do the backtracking to come up with the actual optimal configuration
    i, j, videos, total_weight, total_value = n, W, [], 0, 0
    while i > 0 and j > 0:
        while i > 0 and dp[i, j] == dp[i - 1, j]:  # check i > 0 first so we never wrap around to dp[-1, j]
            i -= 1

        if i <= 0:
            break

        # Video i must be part of the solution
        videos.append(i - 1)
        total_value += v[i]
        total_weight += weights[i]

        j -= weights[i]
        i -= 1

    assert len(videos) == len(set(videos)), 'Invalid solution: Items should only appear at most once in the cache.'
    assert total_value == dp[-1, -1], 'Invalid solution: Optimal total value should be %d, but is %d instead.' % (dp[-1, -1], total_value)
    assert total_weight <= W, 'Invalid solution: Total weight of items can not exceed cache capacity...'

    return videos

Now, what does it do and how does it work? It builds a matrix \(\text{dp}\) with \(n + 1\) rows and \(W + 1\) columns, where \(\text{dp}[i,j]\) will contain the highest possible value you can get by choosing items (each either exactly once or not at all) from the subset of the first \(i\) items and putting them in a knapsack of maximum capacity \(j\). This matrix is constructed from the top–left to the bottom–right, where we use the information from row \((i-1)\) to construct the \(i^\text{th}\) row. Consequently, the value at the lower right, \(\text{dp}[n, W]\), will contain the total value of the optimal set of videos to put on the cache server.

The heart of this solution to the knapsack problem is found at lines 9-11. Every time we take a new item into consideration (each row \(i\)), we check whether the item's weight alone already exceeds the current capacity (each column \(j\)). If it does, we simply can't take the item, no matter how much value it would add, and the best score we can get is the best score we'd have without the item. If, however, the item does fit, then we check whether it is beneficial to include it in the knapsack. This is precisely what happens in line 11: We compare the sum of the item's value \(v_i\) and the best achievable score for the remaining capacity, \(\text{dp}[i - 1, j - w_i]\), to the alternative of not including the item in the knapsack.
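
As a quick sanity check, here is a toy call with made-up numbers: three videos of sizes \(2\), \(3\) and \(4\) MB with latency gains of \(3\), \(4\) and \(5\), and a cache capacity of \(5\) MB. The optimum is to take the first two videos for a total gain of \(7\):

import numpy as np  # the functions in this post assume numpy is imported as np

video_sizes = [2, 3, 4]    # made-up sizes in MB
latency_gains = [3, 4, 5]  # made-up latency gains
videos = knapsack01(n=3, W=5, video_sizes=video_sizes, latency_gains=latency_gains)
print(sorted(videos))      # -> [0, 1], i.e. cache the first two videos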

A Memory Efficient Implementation

The above is an implementation with \(\mathcal{O}(n W)\) time and space complexity, so it is pseudo–polynomial in both time and space. There is a solution that pulls the space complexity down to \(\mathcal{O}(W)\), since you actually only ever need to keep the current row and its predecessor in memory. However, the downside of this memory efficient implementation is that it is much harder to reconstruct the actual optimal solution: When you don't have the full matrix, you can't use the backtracking procedure described below to compile the set of videos to cache. But anyway, here is a possible implementation:

def _knapsack01(n, W, video_sizes, latency_gains):
    ''' Solves the 0/1–knapsack problem in quadratic time and linear space using a bottom–up DP approach. '''

    weights, v = np.hstack(([0], video_sizes)), np.hstack(([0], latency_gains))
    dp = np.zeros((2, W + 1), dtype = np.int64)

    for i in range(1, n + 1):
        for j in range(W + 1):
            if weights[i] > j:
                dp[1, j] = dp[0, j]
            else:
                dp[1, j] = max(dp[0, j - weights[i]] + v[i], dp[0, j])

        dp[0] = dp[-1]  # the freshly computed row becomes the "previous" row

    # Both rows are now identical; the optimal total value sits at index W
    return dp[0, W]
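
As an aside, the same idea also works with just a single row if you sweep the capacities from right to left, so that each item is counted at most once per update. A minimal sketch (note that, unlike the padded arrays above, this uses 0-based weights and values):

import numpy as np

def knapsack01_value(n, W, weights, values):
    ''' Optimal 0/1–knapsack value in O(W) space using a single DP row. '''
    dp = np.zeros(W + 1, dtype=np.int64)
    for i in range(n):
        # Sweep capacities from high to low so that dp[j - weights[i]] still
        # refers to the state before item i was considered
        for j in range(W, weights[i] - 1, -1):
            dp[j] = max(dp[j], dp[j - weights[i]] + values[i])
    return dp[W]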

In the following we will use the first, full–matrix implementation, though, because it allows us to reconstruct the solution via backtracking, as described in the next section.

Backtracking

So far we have only found the highest value we can expect from our optimal solution, but we still need to find the specific set of videos to put on the cache server. This is something we can achieve by inspecting our matrix \(\text{dp}\), as implemented in lines 13-28. The general idea is to walk up the rows (i.e. decrement \(i\)) starting from \(\text{dp}[n,W]\) until \(\text{dp}[i, j] \neq \text{dp}[i-1, j]\). When this is the case, we know that item \(i\) must be included, since the value in the row above is smaller. We add this item to the cache, jump to the left according to the weight of item \(i\), i.e. decrement \(j\) by \(w_i\), and decrement \(i\) by one. Then we simply repeat this procedure until we reach the first row.

We end up with a list videos that contains the IDs of the videos to cache on the server. Lines 24-25 and 30-32 are optional; they merely accumulate the totals and perform a sanity check on the solution using assert statements.

How to Compute the Latency Gains

To solve the 0/1 knapsack problem we need both the weights (video sizes) and the values (latency gains) of all videos. The weights are given, of course, but we still need to compute \(v_{ij}\) for all cache–video pairs \((c_i, v_j)\). The following function compute_latency_gains() computes, for every video \(j\), the overall latency gain that would result from caching video \(j\) on cache \(i\), assuming that the configuration of all other caches is fixed.

def compute_latency_gains(i, requests, endpoints, caches, n_videos, p = 1.0):

    latency_gains = np.zeros(n_videos, dtype = np.int64)
    # We iterate over all requests
    for endpoint_id, video_id, n_requests in requests:

        # Flip a coin...
        if np.random.rand() > p:
            continue

        # We only need to consider requests from endpoints that are actually connected to the cache
        if i in endpoints[endpoint_id].caches:

            # If the video is already cached in a cache that the endpoint is connected to, the gain is zero
            if any([caches[c].video_in_cache(video_id) for c in endpoints[endpoint_id].caches]):
                continue

            latency_gains[video_id] += (endpoints[endpoint_id].dc_latency - endpoints[endpoint_id].caches[i]) * n_requests

    return latency_gains

Note that lines 7-9 are optional. I introduced the coin flip parameter p to speed up the computation of the latency gains for very large \(R\), i.e. a large number of requests. In every iteration we draw a random number from the interval \([0, 1]\), compare it to the specified value of p, and simply skip the request if the number drawn is larger than p. In other words: p is nothing other than the probability of taking each request into account. For a large number of requests, this is essentially identical to a sampling approach, i.e. only iterating over a random subsample of the requests in the first place (with an expected sample size of \(R \cdot p\)). I found that this approximation has negligible effects on the final score for the largest data sets.

Variations

One could immediately come up with slight variations of the above that will most likely yield different scores:

  • The actual order in which we fill the caches/solve the knapsack problems matters, so we can try different procedures for deciding which caches to fill first. The simplest idea is to just choose them in random order. Another idea would be to sort the caches by something like their endpoint density, i.e. by how important a specific cache is in the network. One could probably come up with more sophisticated ordering criteria by digging deeper into the data sets.

  • We could potentially improve our solution in different ways after having obtained a reasonably good solution. This would involve coming up with some local search mechanism. Another way could be to use an evolutionary algorithm.

  • We could improve our solution by continuing to solve the knapsack problems for each cache independently until the score doesn't improve any further, where in each iteration we randomly select a cache server, empty it, and refill it with the knapsack solution; a minimal sketch of such a refill loop follows below. This sounds trivial, but actually provides very good results!
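
Here is a rough sketch of what such a refill loop could look like on top of the two functions above. The Cache interface (empty(), add_video()) and the bookkeeping are hypothetical placeholders for my actual data structures, and for simplicity the sketch just runs a fixed number of iterations instead of checking for score convergence:

import numpy as np

def refill_caches(caches, endpoints, requests, video_sizes, X, n_iterations=1000):
    ''' Repeatedly flush a random cache server and refill it with its 0/1–knapsack
        optimum while all other cache servers stay fixed. The Cache interface used
        here (empty(), add_video()) is a hypothetical placeholder. '''
    n_videos = len(video_sizes)
    for _ in range(n_iterations):
        i = np.random.randint(len(caches))   # pick a random cache server ...
        caches[i].empty()                    # ... and flush it
        gains = compute_latency_gains(i, requests, endpoints, caches, n_videos)
        candidates = np.where(gains > 0)[0]  # only videos with a positive gain matter
        if candidates.size == 0:
            continue
        chosen = knapsack01(len(candidates), X,
                            [video_sizes[v] for v in candidates],
                            [gains[v] for v in candidates])
        for idx in chosen:                   # refill with the knapsack solution
            caches[i].add_video(candidates[idx])
    return caches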

Results

To speed up the execution I made use of the Numba just–in–time (JIT) compiler to compile the two functions knapsack01() and compute_latency_gains(), since this is where the bottleneck is. When marking the compute_latency_gains() function for compilation with the @jit decorator, you have to convert the list comprehension in line 15 to an explicit for–loop, since Numba doesn't support list comprehensions.
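
For reference, this is roughly what that looks like; the helper function and the exact way the functions are wrapped are an illustrative sketch rather than a verbatim excerpt of my code:

from numba import jit

@jit
def video_already_cached(video_id, connected_cache_ids, caches):
    ''' Explicit loop that replaces the list comprehension in line 15
        (hypothetical helper, not shown in the listing above). '''
    for c in connected_cache_ids:
        if caches[c].video_in_cache(video_id):
            return True
    return False

# With line 15 rewritten to call the helper, both hot functions can simply be wrapped:
knapsack01 = jit(knapsack01)
compute_latency_gains = jit(compute_latency_gains)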

If we use the algorithm described above paired with the JIT compiler, the program runs quite fast for the smaller data sets and a bit longer for the larger data sets, but still terminates in a reasonable time. Here are the best scores achieved:

Data Set | \(C\) | \(V\) | \(E\) | \(R\) | \(X\) (MB) | Score
Me at the zoo | \(10\) | \(100\) | \(10\) | \(100\) | \(100\) | \(503121\)
Videos worth spreading | \(100\) | \(10000\) | \(100\) | \(100000\) | \(10000\) | \(596044\)
Trending today | \(100\) | \(10000\) | \(100\) | \(100000\) | \(50000\) | \(499999\)
Kittens | \(500\) | \(10000\) | \(1000\) | \(200000\) | \(6000\) | \(847603\)

This leads, or rather would have led, to a total score of \(2446767\) in the competition, which already would have been enough to beat \(95 \%\) of all solutions. Yay!

The best results are in fact achieved by repeatedly flushing and refilling random cache servers, recomputing the knapsack solution each time. I think the reason is the following: Remember our assumption from above that we solve the knapsack problems independently, assuming that all other cache servers are fixed? Now, if we have a large number of cache servers that all have a reasonably good configuration, emptying and refilling any single cache server only introduces a small change to the overall configuration. The scores get better and better in each iteration, and I suspect that the actual changes made by refilling the cache servers become smaller and smaller as well, i.e. we end up putting an almost identical set of videos back on the cache server that we have just emptied. In other words, the assumption made above gets closer and closer to reality in each iteration, and we slowly converge to some local optimum.

Interestingly enough, the score for the smallest data set, Me at the zoo, does not improve with this procedure. This is probably because its \(10\) cache servers are simply too few: every time we flush a cache server we destroy a good configuration, since the changes we make to the overall configuration are not small perturbations but significant enough to render the other local solutions useless.

Visualizing the Requests

Google provides a way to visualize the hit ratios. All endpoints are shown on the left together with their hit ratios, and lines going to a cache server or the data center (on the right) represent requests where the colors indicate the cost of the request.

Let's see what it looks like for the Me at the zoo data set:

[Figure: Visualization for the first data set]

In contrast, this is what it looks like for the Trending today data set:

[Figure: Visualization for the second data set]

Apparently we have obtained a close to optimal solution for this data set! Finally, have a look at the Videos worth spreading data set:

[Figure: Visualization for the third data set]

So we can see that we still have lots of room for improvement here, although we can't tell for sure that there exists a perfect solution in which every request is served by a favorable cache server.

Further Thoughts

  • The solution presented in this post is of course only one out of many possible solutions, although I suspect that a good part of the top–scoring submissions use at least some knapsack variation, most likely followed by local search. The scoreboard of the extended round closes this Sunday, and I'm excited to see what other solutions pop up in the coming weeks!
  • Maybe I will write a follow–up post if I figure out how to deal with the multiple knapsack problem and fill the caches simultaneously. The problem is most likely computationally hard enough to prevent us from obtaining exact solutions, but I guess there are some approximation schemes available.
  • Make sure to check out the problems from the Hash Code competitions of previous years. They are fun, too!
  • Good luck to all teams participating in the final round in Paris!