Summary

Visualizing Covid-19 patients number in the State of Minnesota using altair. The county data is taken from New York Times repository.

Introductions

I enjoy browsing through the data provided by the Minnesota Department of Health on Covid-19 cases daily. It provides the most up-to-date number of confirmed positive cases in the state. However, to get the historical data is a little circuitous.

So, I was thrilled to learn that the New York Times published their county-by-county Covid-19 data historically. It motivated me to create an interactive map of Minnesota and their historical data on a similar fashion as shown on the covid-19 dashboards.

In summary, using a Jupyter notebook, I want to showcase fastpages' capacity to visualize interactive map and bar graph in Altair using Covid-19 data.

Results

Click on County to Filter Chart Below

Code Explanations

Here are the process to create the visuals shown above. Feel free to hit the badges above in order to see the code, or directly here.

Importing modules

I start by importing all the necessary components:

%reload_ext autoreload
%autoreload 2
import pandas as pd
import altair as alt
import requests

I used %reload_ext autoreload Python magic in order to reload all changed modules before executing a new line. Pandas and Altair are used for the data formatting and data visualizations along with requests to obtain the json formatted data.

Data Extraction and Cleaning

New York Times County-by-county Data

I am getting the data from the New York Times repository and converting it into a pandas DataFrame:

# data from NY-Times:
url="https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
df = pd.read_csv(url)
dfmn = df[df.state=="Minnesota"]
dfmn['date'] = pd.to_datetime(dfmn['date'])

As mentioned on the their github repository:

Many state health departments choose to report cases separately when the patient’s county of residence is unknown or pending determination. In these instances, we record the county name as “Unknown.” As more information about these cases becomes available, the cumulative number of cases in “Unknown” counties may fluctuate.

I dropped the unknown number of confirmed cases because the number of confirmed cases is higher than reported on the Minnesota Department of Health web site if I include them.

# drop the null values by first finding the index of the unknown, then use list 
# comprehension to drop the values 
unknown = dfmn[dfmn.isnull().any(axis="columns")].index
dfmn = dfmn.drop([i for i in unknown], axis=0)
del unknown
# now that there is no null value:
# convert the FIPS as string in integer format to match with the json county data for Minnesota
dfmn.fips = dfmn.fips.astype(int).astype(str)

However, that is not the case for death counts. The death count discrepancies are on the 27th and the 26th of March. By checking between the Minnesota Department of Health web archive and our New York Times dataset, I decided to add the unknown death count data to Hennepin county. This might not reflect the reality, but adding the data to Hennepin county makes the most sense in order to maintain anonimity and accuracy of the data due to the county having the highest number of cases in the State of Minnesota.

# dealing with discrepancy with the death count: 
# use: https://web.archive.org/web/*/https://www.health.state.mn.us/diseases/coronavirus/situation.html
# and see the difference between the two dates of 03/28 and 03/27
# 03-28
# dfmn[(dfmn.date=="2020-03-28")&(dfmn.deaths != 0)]
# 03-27
# dfmn[(dfmn.date=="2020-03-27")&(dfmn.deaths != 0)]
#hide
# let's add the death data into Hennepin county for 03-27:
dfmn.loc[dfmn[(dfmn.county=="Hennepin")&(dfmn.date=="2020-03-27")].index[0],'deaths'] = 2
# new 2020-03-27:
# dfmn[(dfmn.date=="2020-03-27")&(dfmn.deaths != 0)]
# 2020-03-26
# dfmn[(dfmn.date=="2020-03-26")&(dfmn.deaths != 0)]
# I added 1 death data into Hennepin county for 2020-03-26
dfmn.loc[dfmn[(dfmn.county=="Hennepin")&(dfmn.date=="2020-03-26")].index[0],'deaths'] = 1

County data

The topographic and FIPS data is taken from David Eldersved's github repository.

# getting minnesota county data
url_counties = "https://raw.githubusercontent.com/deldersveld/topojson/master/countries/us-states/MN-27-minnesota-counties.json"
resp = requests.get(url_counties)
data= resp.json()
del resp
df_county = pd.json_normalize(data['objects']['cb_2015_minnesota_county_20m']['geometries'])
df_county = df_county[["properties.GEOID","properties.NAME"]]
df_county.columns = ['fips','county']
del data

Combining Case count and County Data

We combine the two data sources using for loop:

# Remove state column since it is all in Minnesota:
df3mneasy = dfmn.drop('state',1)
# grabbing all the dates in the data for the for loop
uniquedate = df3mneasy.date.unique()
# an empty DataFrame for the new data
countycase_df = pd.DataFrame()
# add 0 cases and 0 deaths for counties that do not have cases
values = {'cases': 0, 'deaths': 0}

for i in uniquedate: 
    df = df3mneasy[df3mneasy.date == i] #filter for each date
    df = df_county.merge(df,'left') # we want to keep all the counties 
    df['date'] = df['date'].fillna(i) # fill the appropriate date
    df = df.fillna(value=values) # fill the zero counts
    countycase_df = countycase_df.append(df, ignore_index=True) # new DataFrame

# remove df3mneasy and uniquedate
del df3mneasy, uniquedate
# getting the data for the latest date
dfmnlatest = countycase_df[countycase_df.date==max(countycase_df.date)]

Counting New Cases

In order to get the number of new cases, I used pandas dataframe diff to calculate the difference of consecutive rows.

# getting data to see the increment of new cases per day
newcases = countycase_df.sort_values(["county","date"],ascending=(True,True))
# grabbing all the county in minnesota
uniquecounty = newcases.county.unique()
# empty_df for 
newcases_df = pd.DataFrame()

for i in uniquecounty: 
    df = newcases[newcases.county == i]
    df['newcases'] = df.cases.diff()
    df['newdeaths'] = df.deaths.diff()
    newcases_df = newcases_df.append(df, ignore_index=True)

del newcases, uniquecounty
# I drop all the NAs because that means that on the first day, there were no differences to count
newcases_df = newcases_df.dropna()

Visualization Code

First off, I created a few variables needed in order to render the visuals correctly:

# inspired by 
# https://github.com/github/covid19-dashboard/blob/master/_notebooks/2020-03-15-us-growth-by-state-map.ipynb
# to fix the altair transformation
alt.data_transformers.disable_max_rows()

#https://github.com/altair-viz/altair/issues/1005#issuecomment-403237407
def to_altair_datetime(dt):
    return alt.DateTime(year=dt.year, month=dt.month, date=dt.day,
                        hours=dt.hour, minutes=dt.minute, seconds=dt.second,
                        milliseconds=0.001 * dt.microsecond)

# getting the maximum and minimum date
dmax = (countycase_df.date.max() + pd.DateOffset(days=3))
dmin = countycase_df.date.min()

# getting the current_date
curr_date = countycase_df.date.max().date().strftime('%Y-%m-%d')
# getting the topographic data
url_counties = "https://raw.githubusercontent.com/deldersveld/topojson/master/countries/us-states/MN-27-minnesota-counties.json"
counties = alt.topo_feature(url=url_counties, feature='cb_2015_minnesota_county_20m')

In order to connect the three graphs, I created a variable selector:

# use selector
selector = alt.selection_single(empty='none', fields=['county'], init={'county':'Hennepin'})

I chose the empty field to be equal to 'none', in order to clear up the binding when I do not select any county. I also picked Hennepin county as the initial choice due to it being the highest number of cases. Altair's documentation provides a good read on all the different arguments.

Here I created a variable background for the topographic map of Minnesota:

background = alt.Chart(counties).mark_geoshape(
    stroke='black'
).transform_lookup(
    lookup='properties.GEOID',
    from_=alt.LookupData(data=dfmnlatest,key='fips',
                         fields=['county','cases','deaths','date'])
).encode(
    color=alt.Color('cases:Q',scale=alt.Scale(scheme='greenblue')),
    tooltip=['county:N','cases:Q','deaths:Q'],
).properties(
    width=500,
    height=350,
    title=f'Total Confirmed Cases by County as of {curr_date}'
).add_selection(
    selector
).project('albersUsa')

I combined the topojson data with the dfmnlatest DataFrame using transform_lookup. I added the selector in the add_selection statement to connect it with the other graph. Vega provided a nice selection of color schemes that can be used in the graph.

Here I created a variable newcases to graph the new cases throughout the duration of the Covid-19 data:

newcases = alt.Chart(newcases_df).mark_bar().properties(
    width=500,
    height=350,
    title="New Confirmed Cases by County",
).transform_filter(
    selector
).add_selection(
    selector
).encode(
    x=alt.X('date:T', title='Date', timeUnit='yearmonthdate',
            axis=alt.Axis(format='%y/%m/%d', labelAngle=-30,titleColor='black'), 
            scale=alt.Scale(domain=[to_altair_datetime(dmin), to_altair_datetime(dmax)])),
    y=alt.Y('newcases',
             axis=alt.Axis(title='# of New Confirmed Cases',titleColor='black'),
           ),
    color=alt.Color('county',scale=alt.Scale(scheme='viridis')),
    tooltip=['county:N','date:T',
             alt.Tooltip('newcases:Q',title='# of new cases'), 
             alt.Tooltip('newdeaths:Q', title='# of new deaths')]
)

The transform_filter tranform the filter based on the selection of our click. I used Tooltip wrapper for renaming the title of the column being grabbed from the newcases_df DataFrame.

Here I created a chart variable to show the breakdown of total confirmed cases in Minnesota:

chart = alt.Chart(dfmn).mark_bar().properties(
    width=500,
    height=350,
    title="Total Confirmed Cases by County",
).add_selection(
    selector
).encode(
    x=alt.X('date:T', title='Date', timeUnit='yearmonthdate',
            axis=alt.Axis(format='%y/%m/%d', labelAngle=-30,titleColor='black'), 
            scale=alt.Scale(domain=[to_altair_datetime(dmin), to_altair_datetime(dmax)])),
    y=alt.Y('sum(cases)',
             axis=alt.Axis(title='# of Total Cases',titleColor='black'),
           ),
    color=alt.Color('county',scale=alt.Scale(scheme='viridis')),
    order=alt.Order(
        'cases'
    ),
    opacity=alt.condition(selector, alt.value(1), alt.value(0.5)),
    tooltip=['county:N','date:T','cases:Q', 'deaths:Q']
)

Finally, I created final_chart to combine all the three graphs together:

final_chart = alt.vconcat(
    background, 
    newcases,
    chart, 
).resolve_scale(
    shape='independent',
).configure(
    padding={'left':10, 'bottom':0}
).configure_axis(
    labelFontSize=10,
    labelPadding=10,
    titleFontSize=12,
).configure_view(
     stroke=None
)

final_chart

Concluding thoughts

I hope you enjoy the visualization and the breakdown summary on the process that I take to create this Altair visualization. The module is full of wonderful ways to create visualization that can enhance your analysis. Feel free to explore their gallery for more awesome graphs.

If you have any comments or would like to connect, add me on my LinkedIn profile!