What Is the Next Best Ingredient to Add

database-designrecursive

I am building a program that allows users to optimize their grocery shopping so they can make the most recipes using the fewest ingredients.

One of the features of this program is a function I’m calling the “NEXT BEST INGREDIENT” or NBI. For example, if you already have salt, oil, and rosemary, what NBI will unlock the most additional recipes?

Let’s say the answer is beef. If you buy beef, you’ll be able to create more new recipes than if you bought any other ingredient.

After buying beef, what is the NEXT BEST INGREDIENT one should add after that? And so on.

The feature will allow a user to enter any number of starting ingredients (including zero). So someone could start with 0, 3, or even 50 ingredients. What should they add next?

I’m using a database of recipes (about 500K recipes in all) to compile results.

I can map out the individuals steps for writing out a specific algorithm. But I need help writing a more generic algorithm.

Here’s a specific algorithm:

A user enters in 3 ingredients A, B, and C

Isolate all 4-ingredient recipes that use use A+B+C + BLANK. Note the BLANK with highest frequency.
Isolate all 3-ingredient recipes that use A+B + BLANK… A+C + BLANK… B+C + BLANK. Note the BLANK with highest frequency.
Isolate all 2-ingredient recipes that use A + BLANK….B + BLANK…C+ BLANK. Note the BLANK with highest frequency.
Isolate all 1-ingredient recipes that simply use BLANK. Note the BLANK with highest frequency.
Calculate the BLANK with the highest TOTAL frequency. This is the NEXT BEST INGREDIENT.

But I need a way to write a more generic algorithm for when a user enters N number of ingredients (anything from 0 ingredients to 100).

I can write out the rules in plain English – but my programmer needs a way to write a generic rule using coding logic.

Here it is in plain English.

When user searches N ingredients (including zero), instantly ignore any recipes that require N+2 ingredients or more. And then isolate ALL remaining recipes that use ALL, SOME, or NONE of those ingredients such that there is 1 (and only 1) vacant slot for an “unsearched” ingredient.
Tally up the frequency of all “unsearched” ingredients. Whichever “unsearched” ingredient has the highest frequency becomes the NEXT BEST INGREDIENT.

The goal is to make this search feature accurate (or as accurate as my database is) and fast. A few ingredients is easy. But when looking at 10+ ingredients, it may slow down a bit.

Any thoughts?

Best Answer

If you assume a 3NF design in the entity relationship diagram below:

Entity Relationship Diagram with 4 tables

Recipes are stored in a Recipe table, a recipe's ingredients are stored in RecipeIngredients table, and the MyIngredients table contains an entry for each ingredient on hand.

Your Next Best Ingredient(s) is found with the following query:

SELECT Ingredients.IngredientID 
  , Ingredients.IngredientName
  , count() RecipesCompleted
FROM Ingredients 
 INNER JOIN RecipeIngredients on Ingredients.IngredientID = RecipeIngredients.IngredientID
 INNER JOIN (
   SELECT RecipeID
     , count() RemainingIngredients
   FROM Recipes
     INNER JOIN RecipeIngredients on Recipes.RecipeID = RecipeIngredients.IngredientID
     LEFT JOIN MyIngredients on RecipeIngredients.IngredientID = MyIngredients.IngredientID
   WHERE MyIngredients.IngredientID is null
   GROUP BY RecipeID
   HAVING count(*) = 1) RecipesWOneIngLeft 
      on RecipeIngredients.RecipeID = RecipesWOneIngLeft.RecipeID
 LEFT JOIN MyIngredients on RecipIngredients.IngredientID = MyIngredients.IngredientID
WHERE MyIngredients.IngredientID is null
GROUP BY Ingredients.IngredientID
 , Ingredients.IngredientName
ORDER BY RecipesCompleted DESC

Let's break the query down a bit. It works using the following theories:

By inserting your ingredients into the MyIngredients table allows, you're allowed to have N ingredients.
The next step is to use your algorithm to identify all recipes which are missing just one ingredient. This is done in the RecipesWOneIngLeft sub-query
The final step counts the number of recipes which are completed by each ingredient and then orders the result so the NBI is first.

Consider rewording your plain english algorithm to the following:

When user searches N ingredients, isolate ALL remaining recipes such that there is 1 (and only 1) vacant slot for an “unsearched” ingredient. Tally up the frequency of all “unsearched” ingredients. Whichever “unsearched” ingredient has the highest frequency becomes the NEXT BEST INGREDIENT.

Related Solutions

Mysql – Appropriate multi-language schema, or overkill

We've done a lot of this, and (administrative) users were allowed to fix the translations live. (You still might want a caching layer, but I'm totally down with driving this with a real database and not resource files - it gives you a ton of power to query and find things which need to be translated, etc). I think your schema is probably fine, so I'll just pass on some stuff we learned in the hope that it's useful.

One thing you have left out is phrases with insertion points. In the example below, the order is reversed and the language is still English, but this could very easily be two different languages - pretend this is just two languages who normally put things in a different order.

Hello, <username> you have <x> points!

You've got <x> points to spend, <username>!

In our pre-.NET, we had a routine which did the insertion so the phrases would look like this:

Hello, {0:username} you have {1:points} points!

You've got {1:points} points to spend, {0:username}!

This would obviously simply be used in your code as <%= String.Format(phrase, username, points); %> or similar

Which helped the translator a bit. But .NET String.FOrmat doesn't support commenting within the format string, unfortunately.

As you say, you would not want to handle that in your php with locale awareness or meta phrases.

So what we had was a master phrase table:

phraseid, english, supplemental info

and a localized table:

phraseid, localeid, translation

You've also assumed with INNER JOINS that the localized versions exist - we tended to leave them out until they were translated, so that query of yours would end up returning nothing at first (not even the default)

If a translation didn't exist, ours defaulted to English, then fellback to code-provided (in case the database didn't have the ID, and it was also clear from the code what phrase identifier "TXT_LNG_WRNNG_INV_LOW" was actually trying to get) - so the equivalent of this query is what we used:

SELECT COALESCE(localized.translation, phrase.english, @fallback)
FROM DUAL
LEFT JOIN phrase
    ON phrase.phraseid = @phraseid
LEFT JOIN localized
    ON localized.phraseid = phrase.phraseid
    AND localized.localeid = @localeid

Obviously, you might get all the things at one time using your page system.

We tended not to link things to the page because they were re-used a lot between pages (and not just in page-fragments or controls), but that's certainly fine.

In the case of our Windows-native apps, we used reflection and a mapping file from control to translation tag so that translation did not require re-compiles (in pre-.NET apps we had to tag the controls using the Tag or other special properties). This is probably a little more problematic in PHP or ASP.NET MVC, but possible in ASP.NET where there is a full-blown server-side page model.

For testing, you can obviously query to find missing translations very easily. To find places which need to be tagged, translate the entire phrase dictionary using pig-latin or Klingon or something like replace every non-space character with ? - the English should stand out and let you know that some naked plaintext has crept into your HTML.

Database Design – How to Implement Different Object Types as Similar Tables

Assuming that the app will be treating these various shop-types in the same manner for some operations (operations that do not care what type of shop that it is), then my preferred approach is to use a subclass / inheritance model. I have detailed this out in a couple of other answers here:

The basic concept would be as follows (using Microsoft SQL Server T-SQL semantics).

First you need the main entity table. It represents various "types" so those need to be defined in a lookup table (and the lookup table should have a matching enum in the app layer). Please note that I am using a TINYINT here which holds values from 0 to 255. Use the smaller numeric type that will meet your needs.

CREATE TABLE dbo.ShopType
(
  ShopTypeID TINYINT NOT NULL PRIMARY KEY, -- NO IDENTITY / auto-increment!
  ShopType   VARCHAR(50) NOT NULL
);

CREATE TABLE dbo.Shop
(
  ShopID     INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
  ShopTypeID TINYINT NOT NULL REFERENCES dbo.ShopType (ShopTypeID), -- FK to ShopType
  ShopName   NVARCHAR(100) NOT NULL,
  VisitorsLastWeek INT NOT NULL DEFAULT (0),
  -- {OtherCommonFields},...
  CreateDate DATETIME NOT NULL DEFAULT (GETDATE()), -- or GETUTCDATE()
  ModifiedDate DATETIME NOT NULL DEFAULT (GETDATE()) -- or GETUTCDATE()
);

Next you create a subclass table for each particular "type". It is sometimes appropriate to share a subclass table for multiple "types", but just be careful when doing that so one type doesn't evolve differently than the other type(s) that might be sharing the same subclass table. Please note that the PK of the subclass tables is the same field as the PK in the parent class (i.e. the [Shop] table) and even FKs to it. The reason is that the row in both tables (e.g. [Shop] and [ShopRepair] that have the same value for [ShopID] are together a complete entity; the row in the subclass table does not represent anything by itself and doesn't need a separate ID.

CREATE TABLE dbo.ShopFood
(
  ShopID INT NOT NULL PRIMARY KEY REFERENCES dbo.Shop(ShopID),
  MostPopularItem  NVARCHAR(50) NULL,
  ModifiedDate DATETIME NOT NULL DEFAULT (GETDATE()) -- or GETUTCDATE()
);

CREATE TABLE dbo.ShopRepair
(
  ShopID INT NOT NULL PRIMARY KEY REFERENCES dbo.Shop(ShopID),
  CurrentComputerRepairs SMALLINT NOT NULL DEFAULT (0),
  RepairEstimatedHours TINYINT NOT NULL DEFAULT (0),
  ModifiedDate DATETIME NOT NULL DEFAULT (GETDATE()) -- or GETUTCDATE()
);

CREATE TABLE dbo.ShopRetail
(
  ShopID INT NOT NULL PRIMARY KEY REFERENCES dbo.Shop(ShopID),
  NextPromotionDate DATETIME NULL,
  CurrentWeekSalesEstimate MONEY NULL,
  ModifiedDate DATETIME NOT NULL DEFAULT (GETDATE()) -- or GETUTCDATE()
);

To simplify SELECT operations, you can create Views against each type, if you like:

CREATE VIEW dbo.FoodShop AS
SELECT sh.ShopID, sh.ShopTypeID, sf.MostPopularItem
FROM   dbo.Shop sh
INNER JOIN dbo.ShopFood sf
        ON sf.ShopID = sh.ShopID;

CREATE VIEW dbo.RepairShop AS
SELECT sh.ShopID, sh.ShopTypeID, sr.CurrentComputerRepairs, sr.RepairEstimatedHours
FROM   dbo.Shop sh
INNER JOIN dbo.ShopRepair sr
        ON sr.ShopID = sh.ShopID;

CREATE VIEW dbo.RetailShop AS
SELECT sh.ShopID, sh.ShopTypeID, sr.NextPromotionDate, sr.CurrentWeekSalesEstimate
FROM   dbo.Shop sh
INNER JOIN dbo.ShopRetail sr
        ON sr.ShopID = sh.ShopID;

EDIT:
I forgot to add in the "audit" fields of [CreateDate] and [ModifiedDate]. I have now added them to the tables above. Please note that:

these fields are not in the [ShopType] table as there really is no purpose here, but that is a bit subjective and some people still like to know at least when the records were created, so maybe [CreateDate] is fine there

the [CreateDate] field is not in the subclass tables as the assumption is that those rows are created at the same time as the parent record. But when it comes to updates, those might just touch one of the two tables, depending on which fields are being updated. If using SQL Server, it is possible to create the records for both tables in a single statement even, using the OUTPUT clause as follows:

CREATE PROCEDURE dbo.Shop_CreateRepairShop
(
  @ShopName NVARCHAR(100),
  @CurrentComputerRepairs SMALLINT,
  @RepairEstimatedHours TINYINT,
  @ShopID INT OUTPUT = -1
)
AS
SET NOCOUNT ON;

INSERT INTO dbo.Shop (ShopName, ShopTypeID)
  OUTPUT INSERTED.ShopID, @CurrentComputerRepairs, @RepairEstimatedHours
  INTO   dbo.ShopRepair (ShopID, CurrentComputerRepairs, RepairEstimatedHours)
  VALUES (@ShopName, 2); -- assuming ShopTypeID for "Repair Shop" = 2

SET @ShopID = SCOPE_IDENTITY();

Best Answer

Related Solutions

Mysql – Appropriate multi-language schema, or overkill

Database Design – How to Implement Different Object Types as Similar Tables

Related Question