Software Engineering Challenge: Recipe Database

Tags: programming.

There exists the term “Software Engineer” and “Computer Scientist”. There exists almost no consensus on what these terms mean and what the distinction between the two of them is. I’m going to offer my personal definition. I’m not saying it’s the one true definition, and I don’t even think that it’s the most widely used definition. I have no idea which definition is most widely used, but I am very confident that that most widely used definition is also a minority; i.e. for all definitions D, the size of the population that uses D is (much) less than 50% of the population of all people who use those terms.

So here’s my definition: A computer scientist is more likely to be interested in the theory, whereas a computer engineer is more likely to be interested in the practical implementations. A computer scientist would be able to come up with a proof showing that a particular algorithm is the optimal way to solve a given problem, when run on an idealized abstraction of computation. A computer engineer would then be able to take that algorithm and implement it on actual (not idealized) computers with finite memories, and may need to deviate from the supposedly “proven optimal” design in order to increase performance due to issues like cache misses or finite word sizes (dealing with integers smaller than 64 bits is trivial. As soon as you cross that threshold, you have a whole other story).

For what it’s worth, I self-identify more as a computer scientist than as a software engineer (though my official job title these days is “software development engineer”).

With that in mind, here’s a little challenge for you (regardless of which one you self identify as). I was prompted to write this challenge from a StackOverflow post (which I won’t link to until the end of the post ‘cause I don’t want you to be tempted to cheat and look at the answer), but I’ve seen variations of this problem many times over the years.

Imagine we’re a new start up and we want to build a recipe database, where the user can enter in what ingredients they have available to them, and we return the set of all recipes that can be followed using only those ingredients.

For the purposes of this challenge, a Recipe is merely a Set<Ingredient> i.e. we don’t care how long it takes to prepare, what the steps are, and so on; all we care about is whether or not we have the ingredients necessary to follow the recipe. The database is some sort of collection of Recipes (it might be a list, a hashset, a map, or whatever; that’s part of what you need to figure out for your design). And when the user issues a query to you, they will do so by providing a Set<ingredients> (It’s a set because ordering doesn’t matter).

Also note that for simplicity, we don’t care how much of an ingredient is necessary. For example, if a recipe needs milk, and the user says they have milk, then we should include that recipe (assuming the user also has all the other ingredients necessary for that recipe). We don’t attempt to have the user specify how much milk they have and compare that against how much milk the recipe say it needs.

Finally, to give you an idea of any scaling issues we may someday face, we want loads of about 100’000 distinct possible ingredients and about 1’000’000 distinct recipes (although of course, right now we only have about 1000 or so recipes, we’re adding more every day). We expect the typical recipe to require somewhere between 5 and 50 ingredients, and we expect the user to on average submit queries with about 100 ingredients (don’t worry, they don’t have to type them all in; we’re going to make an app that takes care of keeping track of what the user has available, the user just has to press a single button and the app populates the query). And ideally, we’re hoping to one day get to 1’000’000 queries issued per day.

Let’s look at a concrete toy example. In this example, there exists only 3 ingredients: bread, egg and milk. Here are all the recipes that exist in the DB:

  • Glass of milk. Required ingredients: milk.
  • Fried egg. Required ingredients: egg.
  • Omelet. Required ingredients: egg, milk.
  • Toast. Required ingredients: bread.
  • French Toast. Required ingredients: bread, egg, milk.

And here’s the test cases:

  • If a user submits the query {milk}, we should return {Glass of milk}.
  • If the user submitthe query {egg, milk}, we should return {Glass of milk, Fried Egg, Omelet}.
  • If a user submits the query {bread, milk}, we should return {Glass of milk, Toast}.

Go ahead and try to come up with a design now, and then come back to this blog post once you feel you’ve got a clean, simple design that you could give to a development team to implement.

Spoiler Alert! Care before scrolling further!
Original Wikipe-tan image by Kasuga; CC BY-SA 3.0

Done? Great.

If you’re like most professional software developers, you probably started thinking about data structures and algorithms, and possibly about their running time and memory requirements.

Perhaps you tried to construct some sort of 20-questions style decision tree, along with an algorithm to rebalance the trees as new recipes/ingredients are added, and some sort of heuristic for determining the optimal order in which to ask those questions to reduce the space of possible recipe-sets as quickly as possible. E.g. Does the query include bread? If not we can eliminate Toast and French Toast, but now we need to ask a question to determine whether or not to keep Glass of milk, Fried egg and Omelet.

Or perhaps you tried to set up a key-value-Map-like datastructure where the keys are internally organized into a Hasse Diagram so that when a user submitted their list of available ingredients, you’d fetch the recipes that exactly match the user’s available ingredients, and then travel down the “subset edges” in the Hasse diagram to also include recipes that don’t necessarily require every single ingredient the user owns. E.g. The user has egg and milk? Excellent, then they can definitely make Omelet. But now let’s look up all possible subsets of their available ingredients to see if there are other things they might be able to milk.

Maybe you thought of using a distributed network of actors, each actor perhaps responsible for some cluster of ingredients that commonly appear together, and when a request comes in, it gets passed around with a result list appended to.

Or maybe you went with an eventual consistency CQRS route where when a new recipe is submitted, it gets batched into some sort of MapReduce batch process where we precompute the answer to every possible query the user might issue into a caching layer. Queries are instantaneous, and we just recompute the caches every now and then when we have new recipes to integrate in.

Almost nobody thinks “I’ll just brute force it.”

Here’s a JavaScript program that creates a database of with 10 ingredients, 10 recipes, 2 ingredients per recipe, and assumes a typical user query will say they have 5 ingredients available to them.

var i, j;
var numIngredients = 10;
var numRecipes = 10;
var numIngredientsPerRecipe = 2;
var numIngredientsInQuery = 5;

function containsAll(needles, haystack){ 
  var i, len;
  for(i = 0 , len = needles.length; i < len; i++){
      if(haystack.indexOf(needles[i]) == -1) {
          return false;
      }
  }
  return true;
}

// Set up a fake DB of recipes
var ingredients = [];
for (i = 0; i < numIngredients; i++) {
    ingredients.push(i);
}
console.log('Here are the ingredients:', ingredients);

var recipes = [];
for (i = 0; i < numRecipes; i++) {
    var neededIngredients = [];
    for (j = 0; j < numIngredientsPerRecipe; j++) {
        neededIngredients.push(Math.floor(Math.random() * numRecipes));
    }
    recipes.push({ recipeId: i, needed: neededIngredients});
}
console.log('Here are the recipes:', recipes);

// Set up a fake query
var ingredientsAvailable = [];
for (i = 0; i < numIngredientsInQuery; i++) {
    ingredientsAvailable.push(Math.floor(Math.random() * numRecipes));
}

console.log("Here's a query:", ingredientsAvailable);

//Time how long brute force takes
var start = Date.now();
var result = [];
for (i = 0; i < numRecipes; i++) {
    var candidateRecipe = recipes[i];
    if (containsAll(candidateRecipe.needed, ingredientsAvailable)) {
        result.push(candidateRecipe);
    }
}
var end = Date.now();
console.log('Found ' + result.length + ' recipes in ' + (end - start) + ' milliseconds.');
console.log(result);

You can run it in your browser’s debug console, or save it to a local file and run it in Node.js. Either way, it typically finishes in 0 milliseconds.

I’ve chosen these small numbers so you can run it and inspect the output and convince yourself it does the right thing and is bug free.

Once you’re satisfied the program is free from bugs, change numIngredients to 1000000, numRecipes to 1000000, numIngredientsPerRecipe to 50, and numIngredientsInQuery to 100 as per the original “ideal case” projections. I don’t recommend running this in the browser (Chromium and Firefox both don’t appreciate creating a couple of arrays with a million entries), but it should run fine in Node.js, where it completes it reports the brute force algorithm takes 125 milliseconds.

Okay, So What?

So the brute force solution is pretty face. The smarter solutions are probably faster. And faster is better, right?

Well, maybe… It depends on the cost-benefit analysis.

The brute force solution is dead simple. The code snippet I pasted above is 53 lines long, but the majority of those lines are setting up an environment to test the solution in. The solution itself is only 17 lines long, but 10 of those lines implement the containsAll function and could be eliminated if you’re already using a library like lodash.js:

 1 function containsAll(needles, haystack){ 
 2   var i, len;
 3   for(i = 0 , len = needles.length; i < len; i++){
 4       if(haystack.indexOf(needles[i]) == -1) {
 5           return false;
 6       }
 7   }
 8   return true;
 9 }
10 
11 var result = [];
12 for (i = 0; i < numRecipes; i++) {
13     var candidateRecipe = recipes[i];
14     if (containsAll(candidateRecipe.needed, ingredientsAvailable)) {
15         result.push(candidateRecipe);
16     }
17 }

In fact, if we were really code-golfing and had “reasonable” libraries available to us, we could probably write a one liner, and it wouldn’t even suffer in terms of readability:

recipes.filter(recipe => ingredientsAvailable.containsAll(recipe.needed))

How many lines of code would it take to implement an Akka actor based solution? A MapReduce cluster? Actually, never mind lines, how many files would it take to implement such a system? After it’s all done, how confident are you that it’s free of bugs? And if there were a bug, how difficult would it be to track down the bug and fix it? Did everyone on your team understand the Hasse solution? If you break up into groups and work on different parts of the problem, what are the odds that when you assemble all your modules together, everything will work?

How long would it take you to implement the “more sophisticated” solutions? How long would it have taken you to write the brute force solution? 120 seconds? An hour or two? Do you think you could implement the decision tree solution in a day, or is it looking more like a week or two? Are you assuming you working solo on the problem, or are you gonna split the work with another developer or two? What are your combined salaries? How much more did it cost to implement the smarter solution? Assuming the more sophisticated solution was twice as fast, was the added cost worth bringing the query time down from 125ms to 60ms? Hell, assuming the more sophisticated solution was infinitely fast, was the price worth removing 125ms from the user’s wait time? Would the user even notice a 125ms speed up in the app, or is the latency from their cell phone negotiating packets with their wireless carrier more likely to be the bottleneck there?

If you’re a startup, you don’t even have a million recipes in your database yet. Those were just projections for an ideal case. You’d only have like 1000 recipes, and maybe 10 users a day if you’re lucky. Do you want to pay 3 developers for 2 weeks to launch this site, and then find out whether or not it’s a successful venture? Or you do want to pay 1 developer for an hour’s worth of work and then launch and find out whether there’s any demand for this app? (BTW, feel free to use 1 liner your recipe start up. I’m releasing it to the public domain.)

I think this is a huge case of YAGNI and premature optimization.

All that said, let me admit that I spent a day or two thinking about this problem, and the majority of that time was spent trying to work out the Hasse solution.

It was only when I worked out that the running time for my Hasse-based datastructure was on the order of that I decide to write the above JavaScript code just to find out how slow the brute force solution was, and whether it’d be within tolerable levels. Imagine my surprise when it turned out that the brute force solution offered near instantaneous performance levels.

I don’t personally know any people who self-identify as software engineers (in my sense of the term; maybe they’ll utter the syllables “software engineer” while giving a description of themselves, but when asked to elaborate, it won’t be split along the same axis as I have), so I don’t know how a typical software engineer would have answered this question.

That said, I think the culture associated with (my definition of) computer scientists would push us to immediately start thinking about fancy data structures. This cultural baggage does one step worse than have us immediately dismiss the “brute force solution”, instead it typically doesn’t even occur to us to consider it for dismissal.

At work the other day, an intern got bogged down in a piece of software, and I tried to push him towards “Just get something working that solves the core problem. If we need to, we can always come back and improve it later.” Seems like I need to learn the same lesson.

Originally published on Jul 4, 2015