Simple Skills Sunday! We're writing pandas queries.
I have a dataset of Walmart's sales data from 2011 to 2016, split across 3 tables:
- calendar.csv: a calendar for the period of the dataset; it contains the events
- sales_train_validation.csv: historical daily unit sales, per product and store
- sell_prices.csv: weekly mean prices of the products sold in each store
Let's start with the imports, and load sales_train_validation.csv, sell_prices.csv, and calendar.csv:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sales = pd.read_csv("dataset/sales_train_validation.csv")
price = pd.read_csv("dataset/sell_prices.csv")
calendar = pd.read_csv("dataset/calendar.csv")
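Before going further, a quick sanity check on what we loaded; a minimal sketch that just prints each table's shape and first few column names:

for name, df in [("sales", sales), ("price", price), ("calendar", calendar)]:
    print(name, df.shape, df.columns[:6].tolist())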
Now that we have a sense of what's in our tables, let's start asking some questions and answer them one by one.
1. Sales trend comparison between states
states = ["WI","CA","TX"]
total_sales = pd.DataFrame()
for state in states:
total_sales[state] = sales[sales["state_id"]==state].sum(axis=0)
total_sales.drop(total_sales.index[range(0,6)],axis=0,inplace=True)
We can plot the trends with a 25-day simple moving average (rolling(25).mean()).
sns.set()
# the summed columns come out as object dtype, so cast them to int first
total_sales = total_sales.astype(int)
total_sales_ma = total_sales.apply(lambda col: col.rolling(25).mean(), axis=0)
# turn the index from "d_1", "d_2", ... into "1", "2", ...
total_sales_ma.index = total_sales_ma.index.str.split("_").str[1]
total_sales_ma.plot(kind="line")
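If rolling windows are new to you, here is the idea on toy data (unrelated to the Walmart set); the first window - 1 values are NaN because the window is not full yet:

s = pd.Series([1, 2, 3, 4, 5])
s.rolling(3).mean()
# 0    NaN
# 1    NaN
# 2    2.0
# 3    3.0
# 4    4.0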
This is a good time to take advantage of the calendar data frame. I replace d_1 to d_1941 with real dates, which we can do by merging on the d column.
total_sales = pd.merge(total_sales, calendar.set_index("d")[["date"]],
                       left_index=True, right_index=True, how="left")
total_sales.set_index("date", inplace=True)

sns.set()
fig, ax = plt.subplots(figsize=(10,5))
total_sales = total_sales.astype(int)
total_sales_ma = total_sales.apply(lambda col: col.rolling(25).mean(), axis=0)
total_sales_ma.plot(kind="line", ax=ax)
California sells the most in total. Wisconsin and Texas are more stable than California. It looks like Wisconsin had been on the verge of surpassing Texas for a while, and it finally does. Now we should investigate the trend by looking at the sales per store in each state.
2. Sales trend comparison for stores in all states
Group the sales data by state_id and store_id, and then transpose it. This gives us multi-level columns, which we should get rid of; it makes things simpler.
sales_st_store = sales.groupby(["state_id","store_id"]).sum(numeric_only=True).loc[:,"d_1":]
sales_st_store = sales_st_store.T
# keep only the store level of the (state_id, store_id) column MultiIndex
sales_st_store.columns = sales_st_store.columns.get_level_values(1)
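As a side note, an equivalent way to flatten here is to drop the state level instead of selecting the store level; use one or the other:

sales_st_store.columns = sales_st_store.columns.droplevel(0)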
Now let’s make a list of stores and states and plot the data for each of them.
states = sales["state_id"].unique().tolist()
stores = sales["store_id"].unique().tolist()
fig, axes = plt.subplots(1, 3, figsize=(20,5))
fig.suptitle("Sales trend of stores in states", fontsize=20)
for i, state in enumerate(states):
    ax = axes[i]
    ax.set_xlabel(state)
    for col in sales_st_store.columns:
        # store ids embed the state, e.g. "CA_1", so this picks out the state's stores
        if state in col:
            sales_st_store[col].rolling(50).mean().plot(kind="line", ax=ax)
fig.legend(sales_st_store.columns.tolist(), fontsize=15)
3. Sales trend comparison within the categories of products
We need to find all the categories and then calculate the total daily sales for each category.
catgs = sales["cat_id"].unique().tolist()
# ['HOBBIES', 'HOUSEHOLD', 'FOODS']
We iterate over the 3 categories and get a daily sum of sales for each one of them; then we put everything in a new data frame, hist_sales. We also need days, the list of day columns the inner loop runs over.
# days: the day columns of the sales table ("d_1", "d_2", ...)
days = [c for c in sales.columns if c.startswith("d_")]

hist_sales = pd.DataFrame()
for cat in catgs:
    cat_df = sales[sales["cat_id"] == cat]
    for day in days:
        hist_sales.loc[day, cat] = cat_df[day].sum()
hist_sales.head()
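A side note on performance: the nested loop is easy to read but slow. Assuming the day columns are the only numeric columns in sales (true for this dataset), a vectorized sketch that builds the same frame:

hist_sales_fast = sales.groupby("cat_id").sum(numeric_only=True).T
hist_sales_fast = hist_sales_fast[catgs]  # match the column order of the loop version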
hist_sales = pd.merge(hist_sales,calendar.set_index("d")[["date"]],
left_index=True,right_index=True,how="left")
hist_sales.set_index("date",inplace=True)
Let's plot it:
sns.set()
fig, ax = plt.subplots(figsize=(10,5))
# cast the summed values to int before the rolling mean
hist_sales = hist_sales.astype(int)
hist_sales_ma = hist_sales.apply(lambda col: col.rolling(15).mean(), axis=0)
hist_sales_ma.plot(kind="line", ax=ax)
The Foods category generates most of the sales. It also has a seasonal trend, which we clearly don't see in Household and Hobbies.
4. Sales trend of different categories in each state
Here I flatten the multi-level columns using join.
sales_cat_st = sales.groupby(["state_id","cat_id"]).sum(numeric_only=True).T.loc["d_1":,:]
sales_cat_st.columns = ["_".join(c) for c in sales_cat_st.columns]
fig, axes = plt.subplots(3, 3, figsize=(20,10))
for i, state in enumerate(states):
    for j, col in enumerate(sales_cat_st.columns):
        if state in col:
            ax = axes[i][j % 3]
            ax.set_ylabel(state)
            ax.set_xlabel(col.split("_")[1])
            sales_cat_st[col].rolling(50).mean().plot(ax=ax)
5. Total sales of Walmart over days of the week and months of the year
Starting with the sales data frame, get a sum over days (axis=0). Then get weekday and month from calendar, and finally merge them together on the index.
days_sales = pd.DataFrame()
days_sales["sales"] = sales.sum(numeric_only=True)
days_sales = days_sales.loc["d_1":,:]

days = calendar.set_index("d")
year_sales = pd.merge(days_sales, days[["weekday","month"]], right_index=True, left_index=True)
Now that we have them together, we can groupby on month and weekday, and aggregate sales on each group. We also need to sort the weekdays, and translate the month numbers using a dictionary. The trick of turning the weekdays into an ordered Categorical before sorting them is a good one!
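Here is that trick in isolation, on toy data; an ordered Categorical makes sort_index() follow the declared order instead of alphabetical order:

s = pd.Series([3, 1, 2], index=["Wednesday", "Monday", "Tuesday"])
s.index = pd.Categorical(s.index, categories=["Monday", "Tuesday", "Wednesday"], ordered=True)
s.sort_index()
# Monday       1
# Tuesday      2
# Wednesday    3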
week_month = year_sales.groupby(["month","weekday"]).agg({"sales": "sum"}).unstack("weekday")
monthDict = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun',
             7:'Jul', 8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}
week_month.index = week_month.index.map(monthDict)
# keep only the weekday level of the ("sales", weekday) column MultiIndex
week_month.columns = week_month.columns.get_level_values(1)
week_month = week_month.T
week_month.index = pd.Categorical(week_month.index,
                                  categories=["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"],
                                  ordered=True)
week_month = week_month.sort_index()
Now that we have everything in order, it’s time for a beautiful seaborn heatmap:
plt.subplots(figsize=(15,5))
sns.heatmap(week_month, annot=False, cmap="YlGnBu")
6. Aggregate sales of categories, monthly comparison and all time trend
# numeric_only keeps just the day columns, so no cast to int is needed afterwards
cat_sales = sales.groupby(["cat_id"]).sum(numeric_only=True).T.loc["d_1":,:]
cat_sales = pd.merge(cat_sales, days[["month"]], right_index=True, left_index=True)
cat_sum_sales_monthly = cat_sales.groupby(["month"]).sum()
cat_median_sales_monthly = cat_sales.groupby(["month"]).agg({
    "FOODS": "median",
    "HOBBIES": "median",
    "HOUSEHOLD": "median",
})
fig, ax = plt.subplots(1, figsize=(10,10))
cat_sum_sales_monthly.plot(kind="bar", ax=ax)
_ = ax.set_xticks(range(0,13))
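We also computed the monthly medians above but haven't looked at them yet; they can be plotted the same way, for example:

fig, ax = plt.subplots(1, figsize=(10,10))
cat_median_sales_monthly.plot(kind="bar", ax=ax)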
cat_sales = pd.merge(cat_sales,calendar.set_index("d")[["date"]],left_index=True,right_index=True,how="left")
cat_sales.set_index("date",inplace=True)
fig, axes = plt.subplots(3, 1, figsize=(15,10))
for i, cat in enumerate(catgs):
    ax = axes[i]
    ax.set_xlabel(cat)
    cat_sales[cat].rolling(50).mean().plot(ax=ax, kind="line", label=cat)
    ax.legend([cat])
7. Distribution of yearly mean item prices in different categories
I would like to see if we can find any interesting changes in the mean prices over time, across categories and states. For that we need to merge and groupby across sales, price and calendar.
Start by adding the year to the prices, by merging on wm_yr_wk.
price_year = pd.merge(price, calendar[["wm_yr_wk","year"]], on="wm_yr_wk", how="left")
# numeric_only keeps the mean from choking on the string store_id column
price_year_mean = price_year.groupby(["item_id","year"]).mean(numeric_only=True)
Then we can groupby and get a mean of the price of each item over the years.
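One thing to watch out for: calendar has one row per day, so the merge above repeats each weekly price seven times. The mean is unaffected, since identical values are repeated, but if you want to keep the intermediate frame small you can deduplicate the weeks first; a sketch:

week_year = calendar[["wm_yr_wk","year"]].drop_duplicates()
price_year = pd.merge(price, week_year, on="wm_yr_wk", how="left")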
Drop the wm_yr_wk column and move the year from the index to a column. Do the merge, and then we can use kdeplot to visualize the distributions sharing the same axes.
price_year_mean = price_year_mean.drop("wm_yr_wk", axis=1)
price_year_mean.reset_index(level="year", inplace=True)
price_dept = pd.merge(price_year_mean,sales.set_index("item_id")[["dept_id","store_id","state_id"]],on="item_id",how="left")
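A related caveat: sales has one row per item per store, so this merge replicates each (item, year) mean once per store. For the per-category plots below that is harmless, because every item is repeated the same number of times, but if you only need the department you could merge against a deduplicated mapping instead; a sketch (price_dept_slim is a hypothetical alternative frame):

item_dept = sales[["item_id","dept_id"]].drop_duplicates()
price_dept_slim = pd.merge(price_year_mean, item_dept, on="item_id", how="left")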
First I looked into the mean prices for all departments in all states, and the result is too messy to draw any conclusions from. Still, if you look closely, you can see how the distributions of HOBBIES_2 and HOUSEHOLD_2 shift toward bimodal from 2011 to 2016.
We can make our figure more useful by drawing the same plots for categories, instead of states.
catgs = ["FOODS","HOBBIES","HOUSEHOLD"]

# cat_year_price is price_dept plus a category column derived from the
# department id (e.g. "HOBBIES_1" -> "HOBBIES")
cat_year_price = price_dept.copy()
cat_year_price["cat_id"] = cat_year_price["dept_id"].str.split("_").str[0]

fig, axes = plt.subplots(6, 3, figsize=(20,20))
fig.tight_layout()
for y, year in enumerate(cat_year_price["year"].unique().tolist()):
    for i, cat in enumerate(catgs):
        ax = axes[y % 6][i]
        cat_price = cat_year_price[(cat_year_price["cat_id"] == cat) & (cat_year_price["year"] == year)]
        plt.setp(ax, xlabel=cat)
        plt.setp(ax, ylabel=year)
        ax.set_xscale('log', base=10)
        ax.set_xlim(0.4, 30)
        ax.set_xticks([0.5, 1, 2, 5, 10, 25])
        ax.set_xticklabels(["$0.5","$1","$2","$5","$10","$25"])
        for dept in cat_price["dept_id"].unique().tolist():
            sns.kdeplot(cat_price[cat_price["dept_id"] == dept]["sell_price"],
                        ax=ax, shade=True, label=dept, legend=True)
This one's much better. Again, take a look at the HOBBIES_2 price distribution and how it changes over the years: it is centered around $3 in 2011, then reforms into a bimodal shape, and slowly centers around $1.