My next challenge in building the Whiskey Recommender is to add attributes to the data table that will become the core of the information shown on the 'baseball card' for the recommended whiskies.
I'm going to start with 'price'. I already know I won't be adding actual prices to the table because I've decided to stop my decomposition process at the 'brand' level. Since all the variants of a brand, including bottle size, will have a significant impact on the actual price I need to aggregate that pricing into something more conceptual, like 'price level'.
I start by searching the internet for a downloadable source of whiskey prices. I found one pretty close to home: I live in Northern Virginia, and the liquor stores are state run, or at least state controlled. I found a downloadable file with all the current liquor prices. It looks like this:
This type of file is called a "csv" file, which stands for 'comma separated values'. The first row in a csv file usually contains a description of the data that follows. In this case, I learned that each row in the table contains seven items of information: the description of the item, an identifying code of some sort, the brand, the bottle size, the age of the liquor, its proof, and price.
It's pretty simple to import this kind of file into Excel, which will create a table using the first row for column headings and placing the data in the correct cells beneath.
The next thing I did is eliminate all the rows that did not contain whiskey information. The VA liquor stores sell many different alcoholic products and I needed to pull the irrelevant ones out of the analysis. I also eliminated columns that weren't relevant to my analysis: code, age and proof. If I were decomposing my whiskey brands to a lower level, like different ages, I would need this information, but since I have made the decision to aggregate all the variants into a single price level I can eliminate these columns. Now I was down to description, brand, bottle size and price.
One variant I can't ignore is the impact of bottle size.To make sure I am aggregating 'apples to apples' prices, I need to average prices for one bottle size only; this is called 'controlling' for bottle size. I notice by looking at the data that 750 ml is the most common bottle size, so I used the Excel data functions to filter the data and keep only the rows for 750 ml bottles. This is now the data set I will use for my analysis.
I then used Excel's average function to calculate averages for each description. For example, the data shown above would be included in the 'scotch' average. It's clear from inspection that there are some outliers -- some bottles are very expensive! But our hope, as data analysts, is that the existence of outliers is spread out across different categories, and if there are clusters of them (like in scotch) that is a fair representation of the relative price for that category. Remember, we are trying to calculate a price rating of some sort, not an actual price. Scotch, and in particular, single malt scotch, is relatively expensive.
Once I had calculated an average for each description, I created a 5 point scale and a price range for each, with "5" being the most expensive. I used the description to map as closely as possible into the classification structure I had created in my data table and entered the price rating. To keep with our scotch example, I set the price level for every scotch in the table to "4". Now is where judgment comes in.
If I were building this application for a client, I would take the time to breakout sub-classes like 'single malt' in the Virginia price list so I could actually calculate the average price for that group. But since I am building this app as a demonstration piece, I was comfortable with sampling the single malts (I mean statistically sample, not literally - though I have done some of that in the course of my research) and giving single malt scotch a '5' on price level.
As a final pass of the data, I looked for brands that were priced significantly lower than other brands in their cohort. Different companies will have portfolios of brands in which one is used as their entry price point and is priced lower than the rest. If I can call those out without too much additional effort, I will. So I gave Dewar's a '3' and a scotch called Passport a '2'. I'm not sure that I caught all of these special cases, because the Virginia liquor stores don't sell every whiskey that's in my database, but I did the best I could with the data that is readily available.
Here's a sample of the whiskey data table with the price attribute added:
Notice Speyburn is a single malt scotch that I gave a '3' price rating, based on the actual price being significantly lower than other single malt scotches in the Virginia price list.