Finding a New Pet Store Location in St. Paul, MN
Using k-means clustering to find a new pet store location with the Yelp and FourSquare APIs. _The original analysis was done in [September 2019](https://github.com/atunanggara/Coursera_Capstone/tree/master/Capstone-PetStore) and updated in May 2020 for blog publication._
- Summary
- Introduction
- Target Audience
- Data and Methodology
- FourSquare data from FourSquare Places API.
- Cleaning up the stpaul_venues data
- Separate the unique and duplicated values into two:
- In unique dataframe, fill in the Venue Zipcode column with the Zipcode column
- In duplicated dataframe, separate them into zero Venue Zipcode and non-zero Venue Zipcode dataframe:
- for non-zero Venue Zipcode dataframe, only keep the values where Venue Zipcode and Zipcode column matches
- for zero Venue Zipcode dataframe, search manually and fill in the missing zipcode value
- Combining them all back
- Ensuring the venues are all located in St. Paul region zipcode
- Let's save the queries into a csv file:
- Let's inspect the different categories
- Yelp API
- Cleaning up the petstpaul_venues data
- Based on the Venue column, separate the unique and duplicated values into two
- In unique dataframe, they are all set!
- In duplicated dataframe, separate them into rows where Venue Zipcode and Zipcode are the same and rows where they are not:
- Keep all the values when Venue Zipcode and Zipcode are the same
- Remove the duplicates when Venue Zipcode and Zipcode are not the same
- Combining all the values
- resetting the index
- Combining stpaul_venues and justpetyelp and cleaning up the combined_df data
- Separate the dataframe in two based on the categories in the stpaul_venues and justpetyelp
- In check_one dataframe, separate them into rows where there are duplicates in Venue and rows where they are not (duplicated_check_one and unique_check_one)
- Check manually and remove the duplicates on duplicated_check_one
- Methodology for k-means clustering machine learning
- Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
- Cluster Neighborhoods
- Examine Clusters
Summary
I recommend that Juan Doe open a new pet store in one of three zip codes around the St. Paul, MN region:

- First is the Macalester-Groveland neighborhood (55105). It has a high median income with many pet services and only one other pet store in the area. To succeed, Juan Doe needs to make sure that the items sold in his pet store are different enough from the ones sold in the grocery stores and the other pet store.
- Second is the Falcon Heights area (55108). It does not have a pet store, but veterinarian hospitals and animal shelters are in the area.
- Third is the Battle Creek neighborhood (55119). It has a similar makeup to Falcon Heights and a dog park nearby.
Sometime in March 2019, my wife and I decided to adopt a 10-year-old dog named Filbert from the Animal Humane Society. Here he is in his natural habitat:
To ensure his happiness, we go to pet stores in our area quite regularly for food and treats. At the same time, I have been brushing up on my understanding of the k-means clustering algorithm through the IBM data science specialization. So I took it upon myself to do some research, and settled on finding a new pet store location in the St. Paul, MN area for my capstone project.
Unbeknownst to me, US residents spent more than $36 billion on pet food and treats in 2019. Pet stores, then, are a lucrative business to get into.
Our friend, Juan Doe, wants to open a new pet store in the St. Paul, MN area and has asked me to help find a location for it. The cost of doing business in a metropolis like St. Paul can be stratospheric, so a market share analysis can help identify good locations and reduce the risk of opening a store in a saturated area.
- Zipcode data from uszipcode Python package.
It provides the latitude and longitude of every zip code in the St. Paul, MN area, along with the median household income.
stpaul_zip_merge.shape
json_zip_stpaul = stpaul_zip_merge.to_json(orient='split')
To visualize the data, we are going to use a choropleth map.
There are thirty zip codes that can be used with the FourSquare Places API.
We will use the folium Python package to visualize the data:
First, we use the geopy library to get the latitude and longitude of St. Paul, MN.
# map_stPaul = folium.Map(location=[latitude, longitude], zoom_start=10)
# folium.Choropleth(
# geo_data=mn_zipcode,
# name='MN zip code',
# ).add_to(map_stPaul)
# folium.LayerControl().add_to(map_stPaul)
# map_stPaul
# zip codes outside the St. Paul city proper, dropped from the analysis
dropziplist = ['55109','55110','55111','55112',
'55115', '55118', '55120', '55121',
'55122', '55123', '55124', '55125',
'55126','55127','55128', '55129']
zipStPaul = zipStPaul[~zipStPaul['zipcode'].isin(dropziplist)]
# saving the dataframe for presentation
zipStPaul.to_csv('zipStPaul.csv',index=False)
FourSquare data from FourSquare Places API
I will utilize the explore endpoint to get venue recommendations and zone in on the pet-related categories (e.g., pet store, dog park, dog-friendly restaurants).
# get the ID and secret from the obfuscated file
CLIENT_ID = pd.read_csv('../../Coursera_Capstone/FSclientID.txt',header=None)[0][0] # your Foursquare ID
CLIENT_SECRET = pd.read_csv('../../Coursera_Capstone/FSclientSecret.txt',header=None)[0][0] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
# modified from previous work:
def getNearbyVenues(zipcode, latitudes, longitudes, radius=500):
    venues_list = []
    for zipcode, lat, lng in zip(zipcode, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()
        results2 = results["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            results['response']['headerLocation'],
            zipcode,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['location'].get('postalCode', 0),
            v['venue']['categories'][0]['name']) for v in results2])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Zipcode',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Zipcode',
                             'Venue Category']
    return nearby_venues
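The nested lookups in the list comprehension above can be checked offline against a mock explore response (the structure mirrors the v2 payload used in the function; all values here are made up):

```python
# minimal mock of a Foursquare v2 explore response (values are invented)
mock = {
    "response": {
        "headerLocation": "Macalester-Groveland",
        "groups": [{"items": [{
            "venue": {
                "name": "Example Pet Store",
                "location": {"lat": 44.94, "lng": -93.17},  # note: no postalCode key
                "categories": [{"name": "Pet Store"}],
            }
        }]}],
    }
}

items = mock["response"]["groups"][0]["items"]
rows = [(
    mock["response"]["headerLocation"],
    v["venue"]["name"],
    v["venue"]["location"].get("postalCode", 0),  # .get() defaults to 0 when the key is missing
    v["venue"]["categories"][0]["name"],
) for v in items]
print(rows[0])  # ('Macalester-Groveland', 'Example Pet Store', 0, 'Pet Store')
```

This is why some rows end up with a `Venue Zipcode` of 0: Foursquare does not always return a `postalCode`, and the `.get()` default marks those rows for the cleanup below.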
LIMIT = 100
stpaul_venues = getNearbyVenues(zipcode=zipStPaul['zipcode'],latitudes=zipStPaul['latitude'],
longitudes=zipStPaul['longitude'],
radius=4000)
stpaul_venues.head()
stpaul_venues.shape
Cleaning up the `stpaul_venues` data
I clean up the `stpaul_venues` data through the following methods:
- Based on the `Venue` column, separate the unique and duplicated values into two
- In the unique dataframe, fill in the `Venue Zipcode` column with the `Zipcode` column
- In the duplicated dataframe, separate them into zero `Venue Zipcode` and non-zero `Venue Zipcode` dataframes:
  - for the non-zero `Venue Zipcode` dataframe, only keep the rows where the `Venue Zipcode` and `Zipcode` columns match
  - for the zero `Venue Zipcode` dataframe, search manually and fill in the missing zipcode value
unique_stpaul_venues = stpaul_venues[~stpaul_venues.duplicated(subset='Venue',keep=False)]
duplicated_stpaul_venues = stpaul_venues[stpaul_venues.duplicated(subset='Venue',keep=False)]
unique_stpaul_venues = unique_stpaul_venues.reset_index(drop=True)
for i in range(len(unique_stpaul_venues)):
    if unique_stpaul_venues.loc[i,'Venue Zipcode'] == 0:
        unique_stpaul_venues.loc[i,'Venue Zipcode'] = unique_stpaul_venues.loc[i,'Zipcode']
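The split relies on `duplicated(subset='Venue', keep=False)`, where `keep=False` flags every occurrence of a repeated `Venue`, not just the repeats, so the two subsets partition the dataframe. A toy check on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({'Venue': ['Petco', 'Petco', 'Como Zoo'],
                    'Zipcode': ['55116', '55105', '55103']})

# keep=False marks *all* rows of a duplicated Venue
dups = toy[toy.duplicated(subset='Venue', keep=False)]
uniq = toy[~toy.duplicated(subset='Venue', keep=False)]

print(len(dups), len(uniq))  # 2 1
```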
duplicated_stpaul_venues = duplicated_stpaul_venues.sort_values(by='Venue')
zero_duplicated_stpaul_venues = duplicated_stpaul_venues[duplicated_stpaul_venues['Venue Zipcode']==0]
nonzero_duplicated_stpaul_venues = duplicated_stpaul_venues[~(duplicated_stpaul_venues['Venue Zipcode']==0)]
nonzero_duplicated_stpaul_venues = nonzero_duplicated_stpaul_venues.reset_index(drop=True)
nonzero_duplicated_stpaul_venues = nonzero_duplicated_stpaul_venues[
nonzero_duplicated_stpaul_venues['Zipcode']==nonzero_duplicated_stpaul_venues['Venue Zipcode']]
zero_duplicated_stpaul_venues = zero_duplicated_stpaul_venues.reset_index(drop=True)
zero_duplicated_stpaul_venues.shape
zero_duplicated_stpaul_venues.loc[29:33]
# Carl's Gizmo
zero_duplicated_stpaul_venues.at[1,'Venue Zipcode'] = '55108'
# Falcon Heights Community Park
zero_duplicated_stpaul_venues.at[3,'Venue Zipcode'] = '55113'
# Leinie Lodge Bandshell
zero_duplicated_stpaul_venues.at[6,'Venue Zipcode'] = '55108'
# Lulu's Public House
zero_duplicated_stpaul_venues.at[8,'Venue Zipcode'] = '55108'
# machinery hills
zero_duplicated_stpaul_venues.at[13,'Venue Zipcode'] = '55108'
# mighty midway
zero_duplicated_stpaul_venues.at[17, 'Venue Zipcode'] = '55113'
# minneaple pie
zero_duplicated_stpaul_venues.at[19, 'Venue Zipcode'] = '55108'
# real meal deli
zero_duplicated_stpaul_venues.at[24, 'Venue Zipcode'] = '55101'
# redbox - Maplewood
zero_duplicated_stpaul_venues.at[26, 'Venue Zipcode'] = '55117'
# redbox - St Paul
zero_duplicated_stpaul_venues.at[27, 'Venue Zipcode'] = '55119'
# redbox - Maplewood
zero_duplicated_stpaul_venues.at[28, 'Venue Zipcode'] = '55117'
# subtext books
zero_duplicated_stpaul_venues.at[29, 'Venue Zipcode'] = '55102'
# subtext books
zero_duplicated_stpaul_venues.at[29, 'Zipcode'] = '55102'
zero_duplicated_stpaul_venues = zero_duplicated_stpaul_venues[
zero_duplicated_stpaul_venues['Zipcode']==zero_duplicated_stpaul_venues['Venue Zipcode']]
zero_duplicated_stpaul_venues
stpaul_venues = pd.concat([nonzero_duplicated_stpaul_venues,zero_duplicated_stpaul_venues, unique_stpaul_venues])
# set Venue Zipcode column into integer so that I can sort it by values
stpaul_venues['Venue Zipcode'] = stpaul_venues['Venue Zipcode'].astype(int)
# sort by Venue Zipcode value
stpaul_venues = stpaul_venues.sort_values(by='Venue Zipcode')
# ensure that venues are located in the zipStPaul['zipcode']
stpaul_venues = stpaul_venues[stpaul_venues['Venue Zipcode'].isin(zipStPaul['zipcode'])]
# reset the index
stpaul_venues = stpaul_venues.reset_index(drop=True)
# see what the shape looks like
stpaul_venues.shape
stpaul_venues.to_csv('stpaul_venues.csv',index=False)
stpaul_venues['Venue Category'].unique()
Categories of interest are:
- Dog Run
- Pet Store
- Veterinarian
How many of them are there in the dataset?
# listing all the category of interest:
catInt = ['Dog Run','Pet Store','Veterinarian']
# subsetting the data
justpetfoursquare= stpaul_venues[stpaul_venues['Venue Category'].isin(catInt)]
# resetting the index:
justpetfoursquare = justpetfoursquare.reset_index(drop=True)
justpetfoursquare
There are only 8 values, and there should be more. So, I will augment the values from the FourSquare API with values from the Yelp API.
# get the ID and secret from the obfuscated file
APIkey = pd.read_csv('../../brainstorming/yelp-cred/APIkey',header=None)[0][0] # my yelp APIkey
headers = {'Authorization': 'Bearer %s' % APIkey}
# yelp has a limit of 50 for their query
LIMIT = 50
# modified from the earlier function to match the yelp API
def getQueryVenues(query, zipcode, latitudes, longitudes, radius=500):
    venues_list = []
    for zipcode, lat, lng in zip(zipcode, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.yelp.com/v3/businesses/search?term={}&latitude={}&longitude={}&radius={}&limit={}'.format(
            query,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url, headers=headers).json()['businesses']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            zipcode,
            lat,
            lng,
            v['name'],
            v['location']['zip_code'],
            v['coordinates']['latitude'],
            v['coordinates']['longitude'],
            v['categories'][0]['title']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['zip_code',
                             'Zipcode Latitude',
                             'Zipcode Longitude',
                             'Venue',
                             'Venue Zipcode',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
petstpaul_venues = getQueryVenues(query='pet',
zipcode=zipStPaul['zipcode'],
latitudes=zipStPaul['latitude'],
longitudes=zipStPaul['longitude'],
radius=4000)
petstpaul_venues.shape
# limit only to the zipcode from what we queried in zipStPaul:
petstpaul_venues = petstpaul_venues[petstpaul_venues['Venue Zipcode'].isin(zipStPaul['zipcode'])]
petstpaul_venues
Cleaning up the `petstpaul_venues` data
I clean up the `petstpaul_venues` data through the following methods:
- Based on the `Venue` column, separate the unique and duplicated values into two
- In the unique dataframe, they are all set!
- In the duplicated dataframe, separate them into rows where `Venue Zipcode` and `Zipcode` are the same and rows where they are not:
  - Keep all the rows where `Venue Zipcode` and `Zipcode` are the same
  - Remove the duplicates where `Venue Zipcode` and `Zipcode` are not the same
unique_petstpaul_venues = petstpaul_venues[~petstpaul_venues.duplicated(subset='Venue',keep=False)]
duplicated_petstpaul_venues = petstpaul_venues[petstpaul_venues.duplicated(subset='Venue',keep=False)]
# see the unique values:
unique_petstpaul_venues
duplicated_petstpaul_venues = duplicated_petstpaul_venues.sort_values(by='Venue')
duplicated_petstpaul_venues = duplicated_petstpaul_venues.reset_index(drop=True)
value_duplicated_petstpaul_venues = duplicated_petstpaul_venues[
duplicated_petstpaul_venues['zip_code']==duplicated_petstpaul_venues['Venue Zipcode']]
value_duplicated_petstpaul_venues = value_duplicated_petstpaul_venues.reset_index(drop=True)
value_duplicated_petstpaul_venues.shape
pickled_duplicated_petstpaul_venues = duplicated_petstpaul_venues[
~(duplicated_petstpaul_venues['Venue'].isin(value_duplicated_petstpaul_venues['Venue']))]
pickled_duplicated_petstpaul_venues = pickled_duplicated_petstpaul_venues.drop_duplicates(subset='Venue')
pickled_duplicated_petstpaul_venues
petstpaul_venues = pd.concat([unique_petstpaul_venues,
value_duplicated_petstpaul_venues,
pickled_duplicated_petstpaul_venues])
petstpaul_venues.shape
petstpaul_venues['Venue Zipcode'] = petstpaul_venues['Venue Zipcode'].astype(int)
petstpaul_venues = petstpaul_venues.sort_values(by='Venue Zipcode')
petstpaul_venues = petstpaul_venues.reset_index(drop=True)
petstpaul_venues['Venue Category'].unique()
# saving the dataframe for presentation
petstpaul_venues.to_csv('petstpaul_venues.csv',index=False)
To make life a bit easier, we are going to focus on dog and cat services, so the categories we will consider are:
- Dog Walkers
- Veterinarians
- Pet Groomers
- Pet Sitting
- Pet Stores
- Pet Training
- Pet Services
- Animal Shelters
catInt2 = ['Dog Walkers','Veterinarians','Pet Groomers','Pet Sitting',
'Pet Stores','Pet Training','Pet Services','Animal Shelters']
justpetyelp= petstpaul_venues[petstpaul_venues['Venue Category'].isin(catInt2)]
justpetyelp.shape
justpetyelp.head()
Combining `stpaul_venues` and `justpetyelp` and cleaning up the `combined_df` data
I combine the two dataframes and name the result `combined_df`. I clean up the `combined_df` data through the following methods:
- Separate the dataframe in two based on the categories in `stpaul_venues` and `justpetyelp`: `check_one` and `check_two`
- Keep the dataframe without the categories of interest (`check_two`) intact
- In the `check_one` dataframe, separate the rows that have duplicates in `Venue` from the rows that do not (`duplicated_check_one` and `unique_check_one`):
  - Check manually and remove the duplicates in `duplicated_check_one`
  - Remove the duplicates where `Venue Zipcode` and `Zipcode` are not the same
# combining the two dataframes:
onecomb = stpaul_venues.loc[:,['Venue','Venue Zipcode', 'Venue Latitude','Venue Longitude','Venue Category']]
twocomb = justpetyelp.loc[:,['Venue','Venue Zipcode', 'Venue Latitude','Venue Longitude','Venue Category']]
combined_df = pd.concat([onecomb,twocomb])
combined_df = combined_df.sort_values(by='Venue Zipcode')
combined_df = combined_df.reset_index(drop=True)
combined_df.shape
check_one = combined_df[combined_df['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
check_two = combined_df[~combined_df['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
# keep check_two intact
duplicated_check_one = check_one[check_one.duplicated(subset='Venue',keep=False)].sort_values(by='Venue')
unique_check_one = check_one[~check_one.duplicated(subset='Venue',keep=False)]
strtodel = ['Chuck','Como','Pet Supplies Plus','Twin Cities Reptiles']
for i in strtodel:
    subsetvalue = duplicated_check_one[duplicated_check_one['Venue'].str.contains(i)]
    duplicated_check_one.drop(subsetvalue[subsetvalue.duplicated(subset='Venue Zipcode')].index, inplace=True)
duplicated_check_one = duplicated_check_one.reset_index(drop=True)
# there are two Petco in 55116, delete index 6
duplicated_check_one.drop(6,inplace=True)
duplicated_check_one
combined_df = pd.concat([duplicated_check_one,unique_check_one,check_two])
combined_df = combined_df.sort_values(by='Venue Zipcode')
combined_df = combined_df.reset_index(drop=True)
combined_df.shape
# saving the dataframe for presentation
combined_df.to_csv('combined_df.csv',index=False)
# one hot encoding
stpaul_onehot = pd.get_dummies(combined_df[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
stpaul_onehot['Zipcode'] = combined_df['Venue Zipcode']
# move neighborhood column to the first column
fixed_columns = [stpaul_onehot.columns[-1]] + list(stpaul_onehot.columns[:-1])
stpaul_onehot = stpaul_onehot[fixed_columns]
stpaul_onehot.head()
stpaul_grouped = stpaul_onehot.groupby('Zipcode').mean().reset_index()
stpaul_grouped.head()
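On toy data, taking the mean of the one-hot columns per zip code yields exactly the within-group frequency of each category (a sketch with invented venues):

```python
import pandas as pd

toy = pd.DataFrame({'Venue Category': ['Pet Store', 'Dog Run', 'Pet Store'],
                    'Venue Zipcode': [55105, 55105, 55119]})

# one-hot encode the categories, then average them per zip code
onehot = pd.get_dummies(toy[['Venue Category']], prefix="", prefix_sep="")
onehot['Zipcode'] = toy['Venue Zipcode']
grouped = onehot.groupby('Zipcode').mean().reset_index()

# 55105 saw one Pet Store and one Dog Run -> each at frequency 0.5
print(grouped)
```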
Borrow the function to get the most common venues from previous work:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
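As a quick sanity check, the helper can be run on a single made-up row (the function is restated here so the snippet is self-contained):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the Zipcode entry in position 0, then rank categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# hypothetical category frequencies for one zip code
row = pd.Series({'Zipcode': 55105, 'Pet Store': 0.2, 'Dog Run': 0.5, 'Veterinarian': 0.3})
print(return_most_common_venues(row, 2))  # ['Dog Run' 'Veterinarian']
```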
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Zipcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Zipcode'] = stpaul_grouped['Zipcode']
for ind in np.arange(stpaul_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(stpaul_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
After combining them, let's run k-means over a range of K values to find the optimal number of clusters for this particular dataset:
stpaul_grouped_clustering = stpaul_grouped.drop('Zipcode', axis=1)
K = range(1, 14)
kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(stpaul_grouped_clustering)
for k in K]
Sum_of_squared_distances = [model.inertia_ for model in kmeans_per_k]
# plot the K
optimal_k= pd.DataFrame(data= Sum_of_squared_distances, columns = ['Sum_of_squared_distances'], index = K)
optimal_k.rename_axis('K', axis = 'columns', inplace = True)
optimal_k.plot(kind = 'line', figsize = (10, 5), marker = '.')
plt.annotate('Elbow',
xy=(5, Sum_of_squared_distances[4]),
xytext=(0.55, 0.55),
textcoords='figure fraction',
arrowprops=dict(facecolor='black', shrink=0.1)
)
plt.xlabel('$k$')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Another approach is to use silhouette score (mean of the silhouette coefficient over all the instances).
silhouette_scores = [silhouette_score(stpaul_grouped_clustering,
model.labels_)
for model in kmeans_per_k[1:]]
plt.figure(figsize=(8, 3))
plt.plot(range(2, 14), silhouette_scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.axis([1.8, 7.5, 0.03, 0.25])
plt.show()
The best K value seems to be 5 for our dataset.
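The same silhouette logic can be sanity-checked on synthetic data where the true cluster count is known (a sketch using scikit-learn's `make_blobs`; the data here is artificial):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# three well-separated blobs, so the silhouette score should peak at k=3
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```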
Run k-means with 5 clusters:
# set number of clusters
kclusters = 5
stpaul_grouped_clustering = stpaul_grouped.drop('Zipcode', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(stpaul_grouped_clustering)
neighborhoods_venues_sorted['Zipcode']=neighborhoods_venues_sorted['Zipcode'].astype(int)
stpaul_merged = zipStPaul
stpaul_merged['zipcode']=stpaul_merged['zipcode'].astype(int)
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
# merge zipStPaul with neighborhoods_venues to add latitude/longitude for each neighborhood
stpaul_merged = stpaul_merged.join(neighborhoods_venues_sorted.set_index('Zipcode'), on='zipcode')
stpaul_merged.head()
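The merge step above relies on `DataFrame.join` against an index; on toy frames (invented values) it looks like:

```python
import pandas as pd

left = pd.DataFrame({'zipcode': [55105, 55108]})
right = pd.DataFrame({'Zipcode': [55105, 55108],
                      'Cluster Labels': [1, 4]})

# join matches left['zipcode'] against right's index (set to Zipcode)
merged = left.join(right.set_index('Zipcode'), on='zipcode')
print(merged['Cluster Labels'].tolist())  # [1, 4]
```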
Visualizing the clusters:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(stpaul_merged['latitude'], stpaul_merged['longitude'], stpaul_merged['zipcode'], stpaul_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
Now we can examine each cluster and determine the discriminating venue categories that distinguish it. Based on the defining categories, we can then assign a name to each cluster.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 0, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 1 is located in the 55114 zip code, which can be categorized as the St. Anthony neighborhood. It does not seem to be a good place to open a new pet store: the median income is low, and there is no sign of pet-related categories among the 10 most common venues.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 1, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 2 has Dog Walkers, Veterinarians, Pet Sitting, Pet Groomers, and Pet Stores among its top 10 most common categories.
From this list, I can recommend opening a new pet store in the 55105 zip code, which covers the Macalester-Groveland neighborhood. It has a high median income, and there is no mention of a pet store among the top 10 most common venues in the area.
The second place I would recommend is the 55108 zip code, near the Falcon Heights area. It has a slightly lower median income, but veterinarians already appear among the top 10 most common venues in this neighborhood.
The third place I would recommend is the 55119 zip code, in the Battle Creek neighborhood. It has a similar profile to the 55108 location.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 2, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 3 is located in the 55130 zip code, which can be categorized as the Payne-Phalen neighborhood. It does not seem to be a good place to open a new pet store: the median income is low, and Pet Training, Pet Sitting, and Dog Run are already among the top 10 most common venues. These signal a saturated neighborhood for a pet store.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 3, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 4 is located in the 55107 zip code, which can be categorized as the West End neighborhood. It does not seem to be a good place to open a new pet store, since both Pet Training and Pet Groomers are already among the top 10 most common venues. Pet groomers usually sell items similar to those in pet stores on top of their grooming business.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 4, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 5 is located in the 55103 zip code, which can be categorized as the North End neighborhood. It does not seem to be a good place to open a new pet store, given its low median income.
I picked 55105, 55108, and 55119 as the three best zip codes to open a new pet store. Let's look at what is available in each area.
- 55105: Macalester-Groveland neighborhood
zipcode55105 = combined_df[combined_df['Venue Zipcode']==55105]
zipcode55105[zipcode55105['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
Check on the grocery stores and supermarkets in the area:
zipcode55105[zipcode55105['Venue Category'].isin(['Grocery Store','Supermarket'])]
There is no pet store competitor in this area, though two veterinary / animal medical centers might sell pet-related items. There are also two grocery stores that carry pet-related items on their shelves. This area can be a very good location for a new pet store, provided the items sold are different enough from the ones in the grocery stores.
- 55108: Falcon Heights area
zipcode55108 = combined_df[combined_df['Venue Zipcode']==55108]
zipcode55108[zipcode55108['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
Check on the grocery and supermarket category:
zipcode55108[zipcode55108['Venue Category'].isin(['Grocery Store','Supermarket'])]
We did not see any pet store in this zip code. Plus, Tim & Tom's Speedy Market is a specialty grocery store that does not sell pet-related items. A close contender for a good location to open a new pet store.
- 55119: Battle Creek neighborhood
zipcode55119 = combined_df[combined_df['Venue Zipcode']==55119]
zipcode55119[zipcode55119['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
Check on the grocery and supermarket category:
zipcode55119[zipcode55119['Venue Category'].isin(['Grocery Store','Supermarket'])]
We did not see any pet store in this zip code, but there are two veterinary hospitals, which can potentially sell food and toys to their patients' owners. There is also an Aldi grocery store in the area. It is still a great area for a pet store, since the Battle Creek Dog Park is in this neighborhood.