Edgar Pino

Wine Recommendations using Nearest Neighbors

April 24, 2019


Wine Recommendations

import pandas as pd
import numpy as np
import string
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors
%matplotlib inline
def toWordList(text):
    text_array = []
    word_list= []
    tokenizer = RegexpTokenizer(r'\w+')
    for idx in range(len(text)):
        text_array.append(np.array(tokenizer.tokenize(text[idx].lower())))
    for words in text_array:
        for word in words:
            word_list.append(word)
    return np.array(word_list)
def text_process(text):
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in vectorizer.get_stop_words()]
def get_recommended_wines(search_term, model, vectorizer):
    # Use the model passed in as an argument rather than the global nbrs
    distances, indices = model.kneighbors(vectorizer.transform([search_term]), n_neighbors=5)
    distances = distances.flatten()
    indices = indices.flatten()
    descriptions = reviewsDf.iloc[indices]['description']
    titles = reviewsDf.iloc[indices]['title']
    return distances, np.array(descriptions), np.array(titles)
def showResults(closest):
    for i in range(len(closest[0])):
        print(f'Score: {closest[0][i]}')
        print(f'Title: {closest[2][i]}')
        print(f'Description: {closest[1][i]}')
        print('\n')

Data Prep

reviewsDf = pd.read_csv('./data/winemag-data-130k-v2.csv')
reviewsDf = reviewsDf[['description','title']]
descriptions = reviewsDf.description.tolist()
all_words = toWordList(descriptions)
wordFreq = FreqDist(all_words)

We want to remove the 20 most common words, since words that appear in nearly every description do little to distinguish one wine from another.

plt.figure(figsize=(10, 5))
wordFreq.plot(20, cumulative=False) # Top 20 most-used words
plt.show()

[Figure: frequency plot of the 20 most common words]

stop_words = [word[0] for word in wordFreq.most_common(20)]
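
The same idea can be tried on a tiny corpus without NLTK. This is only a sketch: the three descriptions are made up for illustration, and `collections.Counter` stands in for `FreqDist`, with a regular expression standing in for `RegexpTokenizer(r'\w+')`.

```python
from collections import Counter
import re

# Hypothetical toy descriptions, standing in for the real review texts
toy_descriptions = [
    "This wine is sweet and fruity",
    "This wine is dry and tannic",
    "A sweet wine with vanilla",
]

# Lowercase each description and split it into runs of word characters
tokens = [
    word
    for text in toy_descriptions
    for word in re.findall(r"\w+", text.lower())
]

# Count every token and keep the most frequent ones as stop words
counts = Counter(tokens)
top_words = [word for word, _ in counts.most_common(2)]
print(top_words)
```

Here "wine" appears in every description, so it lands at the top of the list, which is exactly the kind of word that carries no signal for telling wines apart.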

Create Nearest Neighbors Model

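# With a callable analyzer, CountVectorizer does not apply stop_words itself;
# text_process does the filtering through vectorizer.get_stop_words().
vectorizer = CountVectorizer(stop_words = stop_words, analyzer=text_process)
# Despite its name, this matrix holds raw term counts, not TF-IDF weights.
tfidf_matrix = vectorizer.fit_transform(descriptions)
nbrs = NearestNeighbors(n_neighbors=5, metric='cosine').fit(tfidf_matrix)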
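
To see what this step does in isolation, here is a minimal sketch on a made-up corpus of three descriptions. The names `toy_descriptions`, `toy_vectorizer`, and `toy_model` are hypothetical, and the default `CountVectorizer` tokenizer is used instead of the custom `text_process` analyzer to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus standing in for the wine descriptions
toy_descriptions = [
    "sweet white wine with vanilla and honey",
    "dry red wine with cherry and tobacco",
    "crisp white wine with citrus and green apple",
]

# Turn each description into a vector of word counts
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

# Index the vectors so we can look up neighbors by cosine distance
toy_model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(toy_matrix)

# Vectorize a query with the same vocabulary and find its closest descriptions
query = toy_vectorizer.transform(["red wine with cherry"])
distances, indices = toy_model.kneighbors(query)

nearest = indices[0][0]
print(toy_descriptions[nearest], distances[0][0])
```

The query shares the most words with the second description, so that one comes back first. Neighbors are returned in order of increasing distance.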

Let’s query for “white and sweet wine with a vanilla flavor”. Each score is a cosine distance, so lower values mean a closer match.

closest = get_recommended_wines('white and sweet wine with a vanilla flavor', nbrs, vectorizer)
showResults(closest)
Score: 0.5256583509747431
Title: The White Knight 2011 Riesling (Lake County)
Description: This is a sweet wine with flavors of white sugar, orange, honey and vanilla, all brightened by crisp acidity.


Score: 0.5256583509747431
Title: Rexford 2010 Regan Vineyard Pinot Gris (Santa Cruz Mountains)
Description: This seems heavy and sweet for a Pinot Gris, with flavors of white sugar, orange and vanilla.


Score: 0.5256583509747431
Title: The White Knight 2011 Riesling (Lake County)
Description: This is a sweet wine with flavors of white sugar, orange, honey and vanilla, all brightened by crisp acidity.


Score: 0.5285954792089682
Title: Lander-Jenkins 2010 Spirit Hawk Chardonnay (California)
Description: Very sweet in white sugared orange and vanilla flavors, like a honey-nut candy bar. Will satisfy Chard lovers with a sweet tooth


Score: 0.5669872981077806
Title: Bougetz 2013 Sauvignon Blanc (Napa Valley)
Description: Barrel-fermented and blended with a splash of Sémillon, this is a creamy, rounded and concentrated white, intense in vanilla and apricot flavor.
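
Since the model was built with metric='cosine', each score is one minus the cosine similarity between the query's count vector and the description's count vector. The sketch below works that out by hand for a few made-up count vectors; the four-word vocabulary and the three "wines" are assumptions for illustration, not values from the dataset.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical counts over the vocabulary ["sweet", "white", "vanilla", "cherry"]
query = [1, 1, 1, 0]
wine_a = [1, 1, 1, 0]  # uses exactly the query's words
wine_b = [1, 0, 1, 1]  # shares two of the query's three words
wine_c = [0, 0, 0, 1]  # shares no words with the query

d_a = cosine_distance(query, wine_a)
d_b = cosine_distance(query, wine_b)
d_c = cosine_distance(query, wine_c)
print(d_a, d_b, d_c)
```

A distance near 0 means the two texts use the same words in the same proportions, and a distance of 1 means they share no words at all. Scores around 0.5, like the ones above, indicate partial overlap with the query.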

Now let’s query for “red and dry with cherry flavor”

closest = get_recommended_wines('red and dry with cherry flavor', nbrs, vectorizer)
showResults(closest)
Score: 0.4429139854688444
Title: Gantenbein 2011 Pinot Noir (Switzerland)
Description: This wine is cherry red with soft brown tinges, offering a nose of cherry with notes of summer farmstand. The predominant flavor is of tart cherry, interlaced with hints of red raspberry and bell pepper.


Score: 0.4787139648573131
Title: Cramele Recas 2014 Dreambird Merlot (Viile Timisului)
Description: This easy-drinking red wine has aromas of cherry, black plum and eucalyptus. Flavors of red cherry, cherry turnover and vanilla remain on the palate through the soft finish.


Score: 0.4787139648573131
Title: Camille Giroud 2008 Charmes Chambertin  (Charmes-Chambertin)
Description: A solid structure and layers of dense fruit characterize this wine. Rich acidity, red cherry flavor and a dry, dark tannic core are surrounded by juicy red berry fruits and the freshest finish.


Score: 0.4836022205056778
Title: Merriam 2008 SNED Red (Sonoma County)
Description: A rustic red blend, with tobacco, herb, cherry and red currant flavors that finish dry and spicy. Useful as a bistro-style wine.


Score: 0.4896896369201712
Title: Kovács Nimród 2012 Monopole Rhapsody Red (Eger)
Description: This Hungarian red blend is garnet in color, with aromas of red cherry and black berry and flavors of red cherry and pomegranate. The tannins are soft with a medium-length finish.

Edgar Pino

Software Engineer @Pluralsight. Interested in distributed systems, machine learning, and the web. Follow me on Twitter.