Finding a New Pet Store Location in St. Paul, MN
Using k-means clustering to find a new pet store location with the Yelp and FourSquare APIs. _The original analysis was done in [September 2019](https://github.com/atunanggara/Coursera_Capstone/tree/master/Capstone-PetStore) and updated in May 2020 for blog publication._
- Summary
- Introduction
- Target Audience
- Data and Methodology
- FourSquare data from FourSquare Places API.
- Cleaning up the stpaul_venues data
- Separate the unique and duplicated values into two:
- In unique dataframe, fill in the Venue Zipcode column with the Zipcode column
- In duplicated dataframe, separate them into zero Venue Zipcode and non-zero Venue Zipcode dataframe:
- for non-zero Venue Zipcode dataframe, only keep the values where Venue Zipcode and Zipcode column matches
- for zero Venue Zipcode dataframe, search manually and fill in the missing zipcode value
- Combining them all back
- Ensuring the venues are all located in St. Paul region zipcode
- Let's save the queries into a csv file:
- Let's inspect the different categories
- Yelp API
- Cleaning up the petstpaul_venues data
- Based on the Venue column, separate the unique and duplicated values into two
- In unique dataframe, they are all set!
- In duplicated dataframe, separate them into rows where Venue Zipcode and Zipcode are the same and rows where they are not:
- Keep all the values when Venue Zipcode and Zipcode are the same
- Remove the duplicates when Venue Zipcode and Zipcode are not the same
- Combining all the values
- resetting the index
- Combining stpaul_venues and justpetyelp and cleaning up the combined_df data
- Separate the dataframe in two based on the categories in the stpaul_venues and justpetyelp
- In check_one dataframe, separate them into rows where there are duplicates in Venue and rows where they are not (duplicated_check_one and unique_check_one)
- Check manually and remove the duplicates on duplicated_check_one
- Methodology for k-means clustering machine learning
- Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
- Cluster Neighborhoods
- Examine Clusters
Summary
I recommend that Juan Doe open a new pet store in one of three zip codes around the St. Paul, MN region:

- First is the Macalester-Groveland neighborhood (55105). It has a high median income with many pet services and only one other pet store in the area. To succeed, Juan Doe needs to make sure that the items sold in his pet store are different enough from the ones sold in the grocery stores and the other pet store.
- Second is the Falcon Heights area (55108). It does not have a pet store, but veterinarian hospitals and animal shelters are in the area.
- Third is the Battle Creek neighborhood (55119). It has a similar makeup to Falcon Heights and a dog park nearby.
Sometime in March 2019, my wife and I decided to adopt a 10-year-old dog named Filbert from the Animal Humane Society. Here he is in his natural habitat:
To ensure his happiness, we go to pet stores in our area quite regularly for food and treats. At the same time, I have been brushing up on my understanding of the k-means clustering algorithm through the IBM data science specialization. So I took it upon myself to do some research, and settled on finding a new pet store location in the St. Paul, MN area for my capstone project.
Unbeknownst to me, US residents spent more than $36 billion on pet food and treats in 2019. Pet stores, then, are a lucrative business to get into.
Our friend, Juan Doe, wants to open a new pet store in the St. Paul, MN area and has asked me to help find a location for it. The cost of doing business in a metropolis like St. Paul can be stratospheric, so a market share analysis can help identify good locations and reduce the risk of opening a store in a saturated area.
- Zipcode data from uszipcode Python package.
It provides the latitude and longitude of every zip code in the St. Paul, MN area, along with the median household income.
stpaul_zip_merge.shape
json_zip_stpaul = stpaul_zip_merge.to_json(orient='split')
To visualize the data, we are going to use a choropleth map.
There are thirty zip codes that can be used with the FourSquare Places API.
We will use the folium Python package to visualize the data:
First, we use the geopy library to get the latitude and longitude of St. Paul, MN.
# map_stPaul = folium.Map(location=[latitude, longitude], zoom_start=10)
# folium.Choropleth(
# geo_data=mn_zipcode,
# name='MN zip code',
# ).add_to(map_stPaul)
# folium.LayerControl().add_to(map_stPaul)
# map_stPaul
# zip codes outside the St. Paul city proper, dropped from the analysis
dropziplist = ['55109','55110','55111','55112',
'55115', '55118', '55120', '55121',
'55122', '55123', '55124', '55125',
'55126','55127','55128', '55129']
zipStPaul = zipStPaul[~zipStPaul['zipcode'].isin(dropziplist)]
# saving the dataframe for presentation
zipStPaul.to_csv('zipStPaul.csv',index=False)
FourSquare data from FourSquare Places API
I will utilize the explore endpoint to get venue recommendations and zone in on the pet-related categories (e.g., pet store, dog park, dog-friendly restaurants).
# get the ID and secret from the obfuscated file
CLIENT_ID = pd.read_csv('../../Coursera_Capstone/FSclientID.txt',header=None)[0][0] # your Foursquare ID
CLIENT_SECRET = pd.read_csv('../../Coursera_Capstone/FSclientSecret.txt',header=None)[0][0] # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
# modified from previous work:
def getNearbyVenues(zipcode, latitudes, longitudes, radius=500):
    venues_list = []
    for zipcode, lat, lng in zip(zipcode, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()
        results2 = results["response"]['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            results['response']['headerLocation'],
            zipcode,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['location'].get('postalCode', 0),
            v['venue']['categories'][0]['name']) for v in results2])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Zipcode',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Zipcode',
                             'Venue Category']
    return nearby_venues
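The nested lookups in the list comprehension above can be checked offline against a mock explore response (the structure mirrors the v2 payload used in the function; all values here are made up):

```python
# minimal mock of a Foursquare v2 explore response (values are invented)
mock = {
    "response": {
        "headerLocation": "Macalester-Groveland",
        "groups": [{"items": [{
            "venue": {
                "name": "Example Pet Store",
                "location": {"lat": 44.94, "lng": -93.17},  # note: no postalCode key
                "categories": [{"name": "Pet Store"}],
            }
        }]}],
    }
}

items = mock["response"]["groups"][0]["items"]
rows = [(
    mock["response"]["headerLocation"],
    v["venue"]["name"],
    v["venue"]["location"].get("postalCode", 0),  # .get() defaults to 0 when the key is missing
    v["venue"]["categories"][0]["name"],
) for v in items]
print(rows[0])  # ('Macalester-Groveland', 'Example Pet Store', 0, 'Pet Store')
```

This is why some rows end up with a `Venue Zipcode` of 0: Foursquare does not always return a `postalCode`, and the `.get()` default marks those rows for the cleanup below.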
LIMIT = 100
stpaul_venues = getNearbyVenues(zipcode=zipStPaul['zipcode'],latitudes=zipStPaul['latitude'],
longitudes=zipStPaul['longitude'],
radius=4000)
stpaul_venues.head()
stpaul_venues.shape
Cleaning up the `stpaul_venues` data
I clean up the `stpaul_venues` data through the following methods:
- Based on the `Venue` column, separate the unique and duplicated values into two
- In the unique dataframe, fill in the `Venue Zipcode` column with the `Zipcode` column
- In the duplicated dataframe, separate them into zero `Venue Zipcode` and non-zero `Venue Zipcode` dataframes:
  - for the non-zero `Venue Zipcode` dataframe, only keep the rows where the `Venue Zipcode` and `Zipcode` columns match
  - for the zero `Venue Zipcode` dataframe, search manually and fill in the missing zipcode value
unique_stpaul_venues = stpaul_venues[~stpaul_venues.duplicated(subset='Venue',keep=False)]
duplicated_stpaul_venues = stpaul_venues[stpaul_venues.duplicated(subset='Venue',keep=False)]
unique_stpaul_venues = unique_stpaul_venues.reset_index(drop=True)
for i in range(len(unique_stpaul_venues)):
    if unique_stpaul_venues.loc[i,'Venue Zipcode'] == 0:
        unique_stpaul_venues.loc[i,'Venue Zipcode'] = unique_stpaul_venues.loc[i,'Zipcode']
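The split relies on `duplicated(subset='Venue', keep=False)`, where `keep=False` flags every occurrence of a repeated `Venue`, not just the repeats, so the two subsets partition the dataframe. A toy check on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({'Venue': ['Petco', 'Petco', 'Como Zoo'],
                    'Zipcode': ['55116', '55105', '55103']})

# keep=False marks *all* rows of a duplicated Venue
dups = toy[toy.duplicated(subset='Venue', keep=False)]
uniq = toy[~toy.duplicated(subset='Venue', keep=False)]

print(len(dups), len(uniq))  # 2 1
```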
duplicated_stpaul_venues = duplicated_stpaul_venues.sort_values(by='Venue')
zero_duplicated_stpaul_venues = duplicated_stpaul_venues[duplicated_stpaul_venues['Venue Zipcode']==0]
nonzero_duplicated_stpaul_venues = duplicated_stpaul_venues[~(duplicated_stpaul_venues['Venue Zipcode']==0)]
nonzero_duplicated_stpaul_venues = nonzero_duplicated_stpaul_venues.reset_index(drop=True)
nonzero_duplicated_stpaul_venues = nonzero_duplicated_stpaul_venues[
nonzero_duplicated_stpaul_venues['Zipcode']==nonzero_duplicated_stpaul_venues['Venue Zipcode']]
zero_duplicated_stpaul_venues = zero_duplicated_stpaul_venues.reset_index(drop=True)
zero_duplicated_stpaul_venues.shape
zero_duplicated_stpaul_venues.loc[29:33]
# Carl's Gizmo
zero_duplicated_stpaul_venues.at[1,'Venue Zipcode'] = '55108'
# Falcon Heights Community Park
zero_duplicated_stpaul_venues.at[3,'Venue Zipcode'] = '55113'
# Leinie Lodge Bandshell
zero_duplicated_stpaul_venues.at[6,'Venue Zipcode'] = '55108'
# Lulu's Public House
zero_duplicated_stpaul_venues.at[8,'Venue Zipcode'] = '55108'
# machinery hills
zero_duplicated_stpaul_venues.at[13,'Venue Zipcode'] = '55108'
# mighty midway
zero_duplicated_stpaul_venues.at[17, 'Venue Zipcode'] = '55113'
# minneaple pie
zero_duplicated_stpaul_venues.at[19, 'Venue Zipcode'] = '55108'
# real meal deli
zero_duplicated_stpaul_venues.at[24, 'Venue Zipcode'] = '55101'
# redbox - Maplewood
zero_duplicated_stpaul_venues.at[26, 'Venue Zipcode'] = '55117'
# redbox - St Paul
zero_duplicated_stpaul_venues.at[27, 'Venue Zipcode'] = '55119'
# redbox - Maplewood
zero_duplicated_stpaul_venues.at[28, 'Venue Zipcode'] = '55117'
# subtext books
zero_duplicated_stpaul_venues.at[29, 'Venue Zipcode'] = '55102'
# subtext books
zero_duplicated_stpaul_venues.at[29, 'Zipcode'] = '55102'
zero_duplicated_stpaul_venues = zero_duplicated_stpaul_venues[
zero_duplicated_stpaul_venues['Zipcode']==zero_duplicated_stpaul_venues['Venue Zipcode']]
zero_duplicated_stpaul_venues
stpaul_venues = pd.concat([nonzero_duplicated_stpaul_venues,zero_duplicated_stpaul_venues, unique_stpaul_venues])
# set Venue Zipcode column into integer so that I can sort it by values
stpaul_venues['Venue Zipcode'] = stpaul_venues['Venue Zipcode'].astype(int)
# sort by Venue Zipcode value
stpaul_venues = stpaul_venues.sort_values(by='Venue Zipcode')
# ensure that venues are located in the zipStPaul['zipcode']
stpaul_venues = stpaul_venues[stpaul_venues['Venue Zipcode'].isin(zipStPaul['zipcode'])]
# reset the index
stpaul_venues = stpaul_venues.reset_index(drop=True)
# see what the shape looks like
stpaul_venues.shape
stpaul_venues.to_csv('stpaul_venues.csv',index=False)
stpaul_venues['Venue Category'].unique()
Categories of interest are:
- Dog Run
- Pet Store
- Veterinarian
How many of them are there in the dataset?
# listing all the category of interest:
catInt = ['Dog Run','Pet Store','Veterinarian']
# subsetting the data
justpetfoursquare= stpaul_venues[stpaul_venues['Venue Category'].isin(catInt)]
# resetting the index:
justpetfoursquare = justpetfoursquare.reset_index(drop=True)
justpetfoursquare
There are only 8 values, and there should be more. So, I will augment the values from the FourSquare API with values from the Yelp API.
# get the ID and secret from the obfuscated file
APIkey = pd.read_csv('../../brainstorming/yelp-cred/APIkey',header=None)[0][0] # my yelp APIkey
headers = {'Authorization': 'Bearer %s' % APIkey}
# yelp has a limit of 50 for their query
LIMIT = 50
# modified from the earlier function to match the yelp API
def getQueryVenues(query, zipcode, latitudes, longitudes, radius=500):
    venues_list = []
    for zipcode, lat, lng in zip(zipcode, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.yelp.com/v3/businesses/search?term={}&latitude={}&longitude={}&radius={}&limit={}'.format(
            query,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url, headers=headers).json()['businesses']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            zipcode,
            lat,
            lng,
            v['name'],
            v['location']['zip_code'],
            v['coordinates']['latitude'],
            v['coordinates']['longitude'],
            v['categories'][0]['title']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['zip_code',
                             'Zipcode Latitude',
                             'Zipcode Longitude',
                             'Venue',
                             'Venue Zipcode',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
petstpaul_venues = getQueryVenues(query='pet',
zipcode=zipStPaul['zipcode'],
latitudes=zipStPaul['latitude'],
longitudes=zipStPaul['longitude'],
radius=4000)
petstpaul_venues.shape
# limit only to the zipcode from what we queried in zipStPaul:
petstpaul_venues = petstpaul_venues[petstpaul_venues['Venue Zipcode'].isin(zipStPaul['zipcode'])]
petstpaul_venues
Cleaning up the `petstpaul_venues` data
I clean up the `petstpaul_venues` data through the following methods:
- Based on the `Venue` column, separate the unique and duplicated values into two
- In the unique dataframe, they are all set!
- In the duplicated dataframe, separate them into rows where `Venue Zipcode` and `Zipcode` are the same and rows where they are not:
  - Keep all the rows where `Venue Zipcode` and `Zipcode` are the same
  - Remove the duplicates where `Venue Zipcode` and `Zipcode` are not the same
unique_petstpaul_venues = petstpaul_venues[~petstpaul_venues.duplicated(subset='Venue',keep=False)]
duplicated_petstpaul_venues = petstpaul_venues[petstpaul_venues.duplicated(subset='Venue',keep=False)]
# see the unique values:
unique_petstpaul_venues
duplicated_petstpaul_venues = duplicated_petstpaul_venues.sort_values(by='Venue')
duplicated_petstpaul_venues = duplicated_petstpaul_venues.reset_index(drop=True)
value_duplicated_petstpaul_venues = duplicated_petstpaul_venues[
duplicated_petstpaul_venues['zip_code']==duplicated_petstpaul_venues['Venue Zipcode']]
value_duplicated_petstpaul_venues = value_duplicated_petstpaul_venues.reset_index(drop=True)
value_duplicated_petstpaul_venues.shape
pickled_duplicated_petstpaul_venues = duplicated_petstpaul_venues[
~(duplicated_petstpaul_venues['Venue'].isin(value_duplicated_petstpaul_venues['Venue']))]
pickled_duplicated_petstpaul_venues = pickled_duplicated_petstpaul_venues.drop_duplicates(subset='Venue')
pickled_duplicated_petstpaul_venues
petstpaul_venues = pd.concat([unique_petstpaul_venues,
value_duplicated_petstpaul_venues,
pickled_duplicated_petstpaul_venues])
petstpaul_venues.shape
petstpaul_venues['Venue Zipcode'] = petstpaul_venues['Venue Zipcode'].astype(int)
petstpaul_venues = petstpaul_venues.sort_values(by='Venue Zipcode')
petstpaul_venues = petstpaul_venues.reset_index(drop=True)
petstpaul_venues['Venue Category'].unique()
# saving the dataframe for presentation
petstpaul_venues.to_csv('petstpaul_venues.csv',index=False)
To make life a bit easier, we are going to focus on dog and cat services, so the categories we will consider are:
- Dog Walkers
- Veterinarians
- Pet Groomers
- Pet Sitting
- Pet Stores
- Pet Training
- Pet Services
- Animal Shelters
catInt2 = ['Dog Walkers','Veterinarians','Pet Groomers','Pet Sitting',
'Pet Stores','Pet Training','Pet Services','Animal Shelters']
justpetyelp= petstpaul_venues[petstpaul_venues['Venue Category'].isin(catInt2)]
justpetyelp.shape
justpetyelp.head()
Combining `stpaul_venues` and `justpetyelp` and cleaning up the `combined_df` data
I combine the two dataframes and name the result `combined_df`. I clean up the `combined_df` data through the following methods:
- Separate the dataframe in two based on the categories in `stpaul_venues` and `justpetyelp`: `check_one` and `check_two`
- Keep the dataframe without the categories of interest (`check_two`) intact
- In the `check_one` dataframe, separate the rows that have duplicates in `Venue` from the rows that do not (`duplicated_check_one` and `unique_check_one`):
  - Check manually and remove the duplicates in `duplicated_check_one`
  - Remove the duplicates where `Venue Zipcode` and `Zipcode` are not the same
# combining the two dataframes:
onecomb = stpaul_venues.loc[:,['Venue','Venue Zipcode', 'Venue Latitude','Venue Longitude','Venue Category']]
twocomb = justpetyelp.loc[:,['Venue','Venue Zipcode', 'Venue Latitude','Venue Longitude','Venue Category']]
combined_df = pd.concat([onecomb,twocomb])
combined_df = combined_df.sort_values(by='Venue Zipcode')
combined_df = combined_df.reset_index(drop=True)
combined_df.shape
check_one = combined_df[combined_df['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
check_two = combined_df[~combined_df['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
# keep check_two intact
duplicated_check_one = check_one[check_one.duplicated(subset='Venue',keep=False)].sort_values(by='Venue')
unique_check_one = check_one[~check_one.duplicated(subset='Venue',keep=False)]
strtodel = ['Chuck','Como','Pet Supplies Plus','Twin Cities Reptiles']
for i in strtodel:
    subsetvalue = duplicated_check_one[duplicated_check_one['Venue'].str.contains(i)]
    duplicated_check_one.drop(subsetvalue[subsetvalue.duplicated(subset='Venue Zipcode')].index, inplace=True)
duplicated_check_one = duplicated_check_one.reset_index(drop=True)
# there are two Petco in 55116, delete index 6
duplicated_check_one.drop(6,inplace=True)
duplicated_check_one
combined_df = pd.concat([duplicated_check_one,unique_check_one,check_two])
combined_df = combined_df.sort_values(by='Venue Zipcode')
combined_df = combined_df.reset_index(drop=True)
combined_df.shape
# saving the dataframe for presentation
combined_df.to_csv('combined_df.csv',index=False)
# one hot encoding
stpaul_onehot = pd.get_dummies(combined_df[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
stpaul_onehot['Zipcode'] = combined_df['Venue Zipcode']
# move neighborhood column to the first column
fixed_columns = [stpaul_onehot.columns[-1]] + list(stpaul_onehot.columns[:-1])
stpaul_onehot = stpaul_onehot[fixed_columns]
stpaul_onehot.head()
stpaul_grouped = stpaul_onehot.groupby('Zipcode').mean().reset_index()
stpaul_grouped.head()
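On toy data, taking the mean of the one-hot columns per zip code yields exactly the within-group frequency of each category (a sketch with invented venues):

```python
import pandas as pd

toy = pd.DataFrame({'Venue Category': ['Pet Store', 'Dog Run', 'Pet Store'],
                    'Venue Zipcode': [55105, 55105, 55119]})

# one-hot encode the categories, then average them per zip code
onehot = pd.get_dummies(toy[['Venue Category']], prefix="", prefix_sep="")
onehot['Zipcode'] = toy['Venue Zipcode']
grouped = onehot.groupby('Zipcode').mean().reset_index()

# 55105 saw one Pet Store and one Dog Run -> each at frequency 0.5
print(grouped)
```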
Borrow the function to get the most common venues from previous work:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
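As a quick sanity check, the helper can be run on a single made-up row (the function is restated here so the snippet is self-contained):

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the Zipcode entry in position 0, then rank categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# hypothetical category frequencies for one zip code
row = pd.Series({'Zipcode': 55105, 'Pet Store': 0.2, 'Dog Run': 0.5, 'Veterinarian': 0.3})
print(return_most_common_venues(row, 2))  # ['Dog Run' 'Veterinarian']
```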
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Zipcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Zipcode'] = stpaul_grouped['Zipcode']
for ind in np.arange(stpaul_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(stpaul_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
After combining them, let's run k-means over a range of K values to find the optimal number of clusters for this particular dataset:
stpaul_grouped_clustering = stpaul_grouped.drop('Zipcode', axis=1)
K = range(1, 14)
kmeans_per_k = [KMeans(n_clusters=k, random_state=42).fit(stpaul_grouped_clustering)
for k in K]
Sum_of_squared_distances = [model.inertia_ for model in kmeans_per_k]
# plot the K
optimal_k= pd.DataFrame(data= Sum_of_squared_distances, columns = ['Sum_of_squared_distances'], index = K)
optimal_k.rename_axis('K', axis = 'columns', inplace = True)
optimal_k.plot(kind = 'line', figsize = (10, 5), marker = '.')
plt.annotate('Elbow',
xy=(5, Sum_of_squared_distances[4]),
xytext=(0.55, 0.55),
textcoords='figure fraction',
arrowprops=dict(facecolor='black', shrink=0.1)
)
plt.xlabel('$k$')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
Another approach is to use silhouette score (mean of the silhouette coefficient over all the instances).
silhouette_scores = [silhouette_score(stpaul_grouped_clustering,
model.labels_)
for model in kmeans_per_k[1:]]
plt.figure(figsize=(8, 3))
plt.plot(range(2, 14), silhouette_scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Silhouette score", fontsize=14)
plt.axis([1.8, 7.5, 0.03, 0.25])
plt.show()
The best K value seems to be 5 for our dataset.
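The same silhouette logic can be sanity-checked on synthetic data where the true cluster count is known (a sketch using scikit-learn's `make_blobs`; the data here is artificial):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# three well-separated blobs, so the silhouette score should peak at k=3
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```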
Run k-means with 5 clusters:
# set number of clusters
kclusters = 5
stpaul_grouped_clustering = stpaul_grouped.drop('Zipcode', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(stpaul_grouped_clustering)
neighborhoods_venues_sorted['Zipcode']=neighborhoods_venues_sorted['Zipcode'].astype(int)
stpaul_merged = zipStPaul
stpaul_merged['zipcode']=stpaul_merged['zipcode'].astype(int)
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
# merge zipStPaul with neighborhoods_venues to add latitude/longitude for each neighborhood
stpaul_merged = stpaul_merged.join(neighborhoods_venues_sorted.set_index('Zipcode'), on='zipcode')
stpaul_merged.head()
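The merge step above relies on `DataFrame.join` against an index; on toy frames (invented values) it looks like:

```python
import pandas as pd

left = pd.DataFrame({'zipcode': [55105, 55108]})
right = pd.DataFrame({'Zipcode': [55105, 55108],
                      'Cluster Labels': [1, 4]})

# join matches left['zipcode'] against right's index (set to Zipcode)
merged = left.join(right.set_index('Zipcode'), on='zipcode')
print(merged['Cluster Labels'].tolist())  # [1, 4]
```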
Visualizing the clusters:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(stpaul_merged['latitude'], stpaul_merged['longitude'], stpaul_merged['zipcode'], stpaul_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster+1), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
Now we can examine each cluster and determine the discriminating venue categories that distinguish it. Based on the defining categories, we can then assign a name to each cluster.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 0, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 1 is located in the 55114 zip code, which can be categorized as the St. Anthony neighborhood. It does not seem to be a good place to open a new pet store: the median income is low, and there is no sign of pet-related categories among the 10 most common venues.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 1, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 2 has Dog Walkers, Veterinarians, Pet Sitting, Pet Groomers, and Pet Stores among its top 10 most common categories.
From this list, I can recommend opening a new pet store in the 55105 zip code, which covers the Macalester-Groveland neighborhood. It has a high median income, and there is no mention of a pet store among the top 10 most common venues in the area.
The second place I would recommend is the 55108 zip code, near the Falcon Heights area. It has a slightly lower median income, but veterinarians already appear among the top 10 most common venues in this neighborhood.
The third place I would recommend is the 55119 zip code, in the Battle Creek neighborhood. It has a similar profile to the 55108 location.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 2, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 3 is located in the 55130 zip code, which can be categorized as the Payne-Phalen neighborhood. It does not seem to be a good place to open a new pet store: the median income is low, and Pet Training, Pet Sitting, and Dog Run are already among the top 10 most common venues. These signal a saturated neighborhood for a pet store.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 3, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 4 is located in the 55107 zip code, which can be categorized as the West End neighborhood. It does not seem to be a good place to open a new pet store, since both Pet Training and Pet Groomers are already among the top 10 most common venues. Pet groomers usually sell items similar to those in pet stores on top of their grooming business.
stpaul_merged.loc[stpaul_merged['Cluster Labels'] == 4, stpaul_merged.columns[
[0,1] + list(range(5, stpaul_merged.shape[1]))]]
Cluster 5 is located in the 55103 zip code, which can be categorized as the North End neighborhood. It does not seem to be a good place to open a new pet store, given its low median income.
I picked 55105, 55108, and 55119 as the three best zip codes to open a new pet store. Let's look at what is available in each area.
- 55105: Macalester-Groveland neighborhood
zipcode55105 = combined_df[combined_df['Venue Zipcode']==55105]
zipcode55105[zipcode55105['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
Check on the grocery stores and supermarkets in the area:
zipcode55105[zipcode55105['Venue Category'].isin(['Grocery Store','Supermarket'])]
There is no pet store competitor in this area, though two veterinary / animal medical centers might sell pet-related items. There are also two grocery stores that carry pet-related items on their shelves. This area can be a very good location for a new pet store, provided the items sold are different enough from the ones in the grocery stores.
- 55108: Falcon Heights area
zipcode55108 = combined_df[combined_df['Venue Zipcode']==55108]
zipcode55108[zipcode55108['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
Check on the grocery and supermarket category:
zipcode55108[zipcode55108['Venue Category'].isin(['Grocery Store','Supermarket'])]
We did not see any pet store in this zip code. Plus, Tim & Tom's Speedy Market is a specialty grocery store that does not sell pet-related items. A close contender for a good location to open a new pet store.
- 55119: Battle Creek neighborhood
zipcode55119 = combined_df[combined_df['Venue Zipcode']==55119]
zipcode55119[zipcode55119['Venue Category'].isin(catInt + list(set(catInt2) - set(catInt)))]
Check on the grocery and supermarket category:
zipcode55119[zipcode55119['Venue Category'].isin(['Grocery Store','Supermarket'])]
We did not see any pet store in this zip code, but there are two veterinary hospitals, which can potentially sell food and toys to their patients' owners. There is also an Aldi grocery store in the area. It is still a great area for a pet store, since the Battle Creek Dog Park is in this neighborhood.