時間:2023-03-15 23:04:02 | 來源:電子商務(wù)
時間:2023-03-15 23:04:02 來源:電子商務(wù)
項(xiàng)目背景#導(dǎo)入包import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsplt.style.use("bmh")plt.rc('font',family='SimHei', size=13)%matplotlib inline#導(dǎo)入數(shù)據(jù)data_df = pd.read_excel('OnlineRetail.xlsx', sheet_name='Online Retail',dtype=str )
4.2查看數(shù)據(jù)#查看數(shù)據(jù)信息data_df.info()data_df.shape#查看前五行data_df.head(3)
可以發(fā)現(xiàn):#列名重命名data_df.rename(columns={'InvoiceDate':'InvoiceTime'},inplace=True)
4.4刪除重復(fù)值#刪除重復(fù)值rows_before = data_df.shape[0]data_df.drop_duplicates(inplace = True)rows_after = data_df.shape[0]print('原行數(shù):{0},現(xiàn)行數(shù):{1},刪除行數(shù):{2}'.format(rows_before,rows_after,rows_before-rows_after))#刪除后,重設(shè)索引data_df.reset_index(drop = True, inplace = True)
可以發(fā)現(xiàn):原始數(shù)據(jù)541909條記錄數(shù),刪除5268記錄數(shù),即有5268條重復(fù)記錄#查看缺失值data_df.isnull().sum().sort_values(ascending=False)#統(tǒng)計(jì)數(shù)據(jù)各列缺失數(shù)量和比例def missing_value_table(data): #統(tǒng)計(jì)缺失數(shù)量 mis_val = data.isnull().sum() #缺失占比百分?jǐn)?shù) mis_val_percent = mis_val/len(data) #結(jié)果制作一個表格 mis_val_table = pd.concat([mis_val,mis_val_percent], axis = 1) #給類名重命名 mis_val_table_rename = mis_val_table.rename(columns={0:'Missing Values',1:'% of Total Values'}) #對結(jié)果進(jìn)行排序 mis_val_table_rename = mis_val_table_rename[mis_val_table_rename.iloc[:,1] != 0].sort_values('% of Total Values',ascending=False).round(1) return mis_val_table_rename#統(tǒng)計(jì)缺失值missing_value_table(data_df)
可以發(fā)現(xiàn):#查詢數(shù)據(jù)中是否存在0號客戶data_df[data_df['CustomerID']=='0'].shape[0]#用0填充CustomerIDdata_df['CustomerID'].fillna('0',inplace=True)#再次統(tǒng)計(jì)缺失值data_df.isnull().sum().sort_values(ascending = False)
可以發(fā)現(xiàn),客戶單號已經(jīng)不存在缺失值了#一致化處理data_df['InvoiceTime'] = pd.to_datetime(data_df['InvoiceTime'],errors = 'coerce')#新增Date特征data_df['Date'] = pd.to_datetime(data_df['InvoiceTime'].dt.date,errors = 'coerce')#新增特征Month特征data_df['Month'] = data_df['InvoiceTime'].astype('datetime64[M]')#查看特征信息data_df.info()
4.6.2數(shù)據(jù)類型一致化處理#類型轉(zhuǎn)換data_df['Quantity'] = data_df['Quantity'].astype('int32')data_df['UnitPrice'] = data_df['UnitPrice'].astype('float')data_df['CustomerID'] = data_df['CustomerID'].astype('int32')#添加總價SumPrice特征sales_df['SumPrice'] = sales_df['Quantity'] * sales_df['UnitPrice']
4.7異常值處理data_df.describe()
可以發(fā)現(xiàn),特征Quantity、UnitPrice、SumPrice存在負(fù)值,且絕對值的數(shù)量較大,SumPrice特征是根據(jù)Quantity和UnitPrice相乘得到,二者之一為負(fù)值。#查看Quantity、UnitPrice、SumPrice的記錄數(shù)data_df[(data_df['Quantity'] <= 0)|(data_df['UnitPrice'] <= 0)].shape[0]#查看前5條記錄data_df[(data_df['Quantity'] <= 0)|(data_df['UnitPrice'] <= 0)].head()
可以發(fā)現(xiàn):數(shù)據(jù)有兩類(1)C字頭被取消的訂單(2)單價為0的訂單query_c = data_df['InvoiceNo'].str.contains('C')#只含取消訂單data_cancel = data_df.loc[query_c,:].copy()#只含成功訂單data_success = data_df.loc[-query_c,:].copy()#為sales_cancel增加字段SrcInvoiceNo,用于存放去掉“C”的發(fā)票編號data_cancel['SrcInvoiceNo'] = data_cancel['InvoiceNo'].str.split('C',expand=True)[1]print('原始訂單記錄:{0},取消訂單記錄數(shù):{1},成功訂單記錄數(shù):{2}'.format(data_df.shape[0],data_cancel.shape[0],data_success.shape[0]))
可以發(fā)現(xiàn),取消訂單數(shù)和成功訂單數(shù)沒有重合query_free = data_success['UnitPrice'] == 0# 只含免費(fèi)訂單data_free = data_success.loc[query_free,:].copy()# 只含普通訂單data_success = data_success.loc[-query_free,:]#查看處理后的描述統(tǒng)計(jì)data_success.describe()
可以發(fā)現(xiàn),還有異常值時,單價為負(fù)值的訂單#查看單價為負(fù)值的訂單query_minus = data_success['UnitPrice'] < 0#正常訂單數(shù)據(jù)data_success = data_success.loc[-query_minus,:]data_success.shape
可以發(fā)現(xiàn),經(jīng)過數(shù)據(jù)清洗之后,數(shù)據(jù)還剩data_success還剩524878條記錄。#data_success根據(jù)訂單號進(jìn)行分組,對Quantity的商品數(shù)量和SumPrice的總價進(jìn)行分組求和:innovice_grouped = data_success.groupby('InvoiceNo')[['Quantity','SumPrice']].sum()innovice_grouped.describe()
可以發(fā)現(xiàn):#訂單總交易分布圖f,[ax1, ax2] = plt.subplots(2,1,figsize=(12,10))sns.distplot(innovice_grouped['SumPrice'],bins=100,kde=False,ax=ax1, hist_kws={'alpha':1,'color':'g'})ax1.set_title('SumPrice Distribution of Orders')ax1.set_ylabel('Frequency')ax1.set_xlabel('SumPrice')sns.distplot(innovice_grouped[innovice_grouped.SumPrice<1000]['SumPrice'],bins=100,kde=True,color='r',ax=ax2, hist_kws={'alpha':0.8,'color':'g'})ax2.set_title('SumPrice Distribution of Orders (Below 1000)')ax2.set_ylabel('Frequency')ax2.set_xlabel('SumPrice')plt.savefig('1-1.png')
可以發(fā)現(xiàn),400英鎊以內(nèi)的訂單金額占比較大,有三個峰值,分別是20、100-220、300-330英鎊,這些區(qū)域的訂單數(shù)量比較多。#全部訂單數(shù)量分布f,[ax1, ax2] = plt.subplots(2,1,figsize=(12,10))sns.distplot(innovice_grouped['Quantity'],bins=100,kde=False,ax=ax1, hist_kws={'alpha':1,'color':'g'})ax1.set_title('SumPrice Distribution of Orders')ax1.set_ylabel('Frequency')ax1.set_xlabel('SumPrice')#單筆訂單數(shù)量小于2000分布sns.distplot(innovice_grouped[innovice_grouped['Quantity']<2000]['Quantity'],bins=100,kde=True,color='r',ax=ax2, hist_kws={'alpha':0.8,'color':'g'})ax2.set_title('SumPrice Distribution of Orders (Below 1000)')ax2.set_ylabel('Frequency')ax2.set_xlabel('SumPrice')
可以發(fā)現(xiàn),在總體訂單分布中存在大量訂單使得數(shù)據(jù)較大,訂單量小于2000分布時,商品數(shù)量呈現(xiàn)長尾 分布,大部分訂單的商品數(shù)量集中在300以內(nèi),同時,訂單商品數(shù)量越多,訂單越少。plt.figure(figsize=(14,4))plt.subplot(121)plt.scatter(innovice_grouped['Quantity'], innovice_grouped['SumPrice'], color='g')plt.title('SumPrice & Quantity')plt.ylabel('SumPrice')plt.xlabel('Quantity')# 篩去商品件數(shù)在20000及以上的訂單plt.subplot(122)plt.scatter(innovice_grouped[innovice_grouped.Quantity < 20000]['Quantity'], innovice_grouped[innovice_grouped.Quantity < 20000]['SumPrice'], color = 'g')plt.title('SumPrice & Quantity (Quantity < 20000)')plt.ylabel('SumPrice')plt.xlabel('Quantity')
可以發(fā)現(xiàn),訂單的交易金額和訂單商品的數(shù)量在總體趨勢上呈正比,即訂單商品數(shù)量越多,定金額越多。同時在數(shù)量接近0的位置,存在高價訂單。#僅對含有CustomerID的客戶進(jìn)行分析:data_customer = data_success[data_success['CustomerID'] != 0].copy()#按照客戶ID和訂單編號分組customer_grouped = data_customer.groupby(['CustomerID','InvoiceNo'])[['Quantity','SumPrice']].sum()#重設(shè)索引customer_grouped = customer_grouped.reset_index()#統(tǒng)計(jì)每個客戶的訂單數(shù)、商品數(shù)量和訂單金額customer_grouped = customer_grouped.groupby('CustomerID').agg({'InvoiceNo': np.size, 'Quantity': np.sum, 'SumPrice': np.sum})描述性統(tǒng)計(jì)customer_grouped.describe()
可以發(fā)現(xiàn):#客戶消費(fèi)金額分布f,[ax1, ax2] = plt.subplots(2,1,figsize=(12,10))sns.distplot(customer_grouped['SumPrice'],bins=50,kde=False,ax=ax1, hist_kws={'alpha':1,'color':'r'})ax1.set_title('SumPrice Distribution of Customers')ax1.set_ylabel('Frequency')ax1.set_xlabel('SumPrice')#消費(fèi)金額低于5000的客戶分布sns.distplot(customer_grouped[customer_grouped['SumPrice']<5000]['SumPrice'],bins=60,kde=True,color='g',ax=ax2, hist_kws={'alpha':0.8,'color':'r'})ax2.set_title('SumPrice Distribution of Customers (Below 5000)')ax2.set_ylabel('Frequency')ax2.set_xlabel('SumPrice')
可以看出,大部分客戶的消費(fèi)能力不高,金額更為集中在1000英鎊以內(nèi)。與訂單金額的多峰分布相比,客戶消費(fèi)金的分布呈現(xiàn)單峰長尾狀。#繪制客戶消費(fèi)金額與消費(fèi)件數(shù)的散點(diǎn)圖plt.figure(figsize=(14,4))plt.subplot(121)plt.scatter(customer_grouped['Quantity'], customer_grouped['SumPrice'], color = 'r')plt.title('SumPrice & Quantity')plt.ylabel('SumPrice')plt.xlabel('Quantity')plt.subplot(122)plt.scatter(customer_grouped[customer_grouped['Quantity'] < 25000]['Quantity'], customer_grouped[customer_grouped.Quantity < 25000]['SumPrice'], color = 'r')plt.title('SumPrice & Quantity (Quantity<25000)')plt.ylabel('SumPrice')plt.xlabel('Quantity')
可以發(fā)現(xiàn),客戶群體的消費(fèi)規(guī)律性更強(qiáng),客戶消費(fèi)主要集中低商品數(shù)量上,同時存在一定消費(fèi)能力比較強(qiáng)的客戶??傮w上講,消費(fèi)金額和客戶購買數(shù)量呈正相關(guān),與實(shí)際相符。data_success.loc[data_success['StockCode']=='21484',:]['UnitPrice'].value_counts()
#按照商品編號對商品數(shù)量Quantity和商品總價SumPrice進(jìn)行分組goods_grouped = data_success.groupby('StockCode')[['Quantity','SumPrice']].sum()#計(jì)算商品的均價goods_grouped['AvgPrice'] = goods_grouped['SumPrice']/goods_grouped['Quantity']
查看商品的均價分布:#所有商品均價分布f,[ax1, ax2] = plt.subplots(2,1,figsize=(12,10))sns.distplot(goods_grouped['AvgPrice'],bins=100,kde=False,ax=ax1, hist_kws={'alpha':1,'color':'b'})ax1.set_title('AvgPrice Distribution')ax1.set_ylabel('Frequency')ax1.set_xlabel('SumPrice')#均價小于100英鎊的商品分布sns.distplot(goods_grouped[goods_grouped['AvgPrice']<100]['AvgPrice'],bins=100,kde=True,color='r',ax=ax2, hist_kws={'alpha':0.8,'color':'b'})ax2.set_title('AvgPrice Distribution (Below 100)')ax2.set_ylabel('Frequency')ax2.set_xlabel('SumPrice')plt.show()
可以發(fā)現(xiàn),總體上來上商品的均價都是出于低價位,高價商品很少。商品價位低于100時,商品銷售數(shù)量高峰的價位是1-5英鎊,高于10英鎊的商品銷量已經(jīng)很低,可知,該電商的低昂為低價的小商品市場。#總體商品單價和商品數(shù)量散點(diǎn)圖plt.figure(figsize=(14,4))plt.subplot(121)plt.scatter(goods_grouped['AvgPrice'],goods_grouped['Quantity'],color='b')plt.title('AvgPrice & Quantity')plt.ylabel('Quantity')plt.xlabel('AvgPrice')#商品價位低于50的單價和商品數(shù)量分布圖plt.subplot(122)plt.scatter(goods_grouped[goods_grouped.AvgPrice<50]['AvgPrice'], goods_grouped[goods_grouped.AvgPrice<50]['Quantity'],color='b')plt.title('AvgPrice & Quantity (AvgPrice < 50)')plt.ylabel('Quantity')plt.xlabel('AvgPrice')
可以發(fā)現(xiàn),前面分析低價位商品更受歡迎是正正確的。#總體商品單價和銷售額的散點(diǎn)圖plt.figure(figsize=(14,4))plt.subplot(121)plt.scatter(goods_grouped['AvgPrice'], goods_grouped['SumPrice'], color = 'y')plt.title('AvgPrice & SumPrice')plt.ylabel('SumPrice')plt.xlabel('AvgPrice')#商品價格低于50的單價和銷售額分布圖plt.subplot(122)plt.scatter(goods_grouped[goods_grouped.AvgPrice < 50]['AvgPrice'], goods_grouped[goods_grouped.AvgPrice < 50]['SumPrice'], color = 'y')plt.title('AvgPrice & SumPrice (AvgPrice < 50)')plt.ylabel('SumPrice')plt.xlabel('AvgPrice')
可以發(fā)現(xiàn),低價區(qū)的商品缺失是銷售額的主要構(gòu)成部分,高價商品銷量低,并沒有帶來多少的銷售額,建議采購部門可以多采購低價位商品進(jìn)行銷售。time_grouped = data_success.groupby('InvoiceNo').agg({'Date': np.min, 'Month': np.min, 'Quantity': np.sum, 'SumPrice': np.sum}).reset_index()
5.4.1月份信息分析#根據(jù)月份進(jìn)行分組統(tǒng)計(jì)month_grouped = time_grouped.groupby('Month').agg({'Quantity': np.sum, 'SumPrice': np.sum, 'InvoiceNo': np.size})#畫雙軸折線圖month = month_grouped.plot(secondary_y = 'InvoiceNo', x_compat=True,figsize=(12,4))month.set_ylabel('Quantity & SumPrice')month.right_ax.set_ylabel('Order quantities')
可以發(fā)現(xiàn),三條折線的趨勢較為相似,除了2011.2和2011.4外,2010.12-2011.8銷售趨勢較為平穩(wěn),9-11月銷售趨勢增加,這有可能和節(jié)假日活動有關(guān),經(jīng)調(diào)研,發(fā)現(xiàn)該區(qū)間有節(jié)日:#將日期設(shè)為索引,按日畫折線圖time_grouped = time_grouped.set_index('Date')day = time_grouped.groupby('Date').agg({'Quantity': np.sum, 'SumPrice': np.sum, 'InvoiceNo': np.size}).plot(secondary_y = 'InvoiceNo', figsize = (14, 5))day.set_ylabel('Quantity & SumPrice')day.right_ax.set_ylabel('Order quantities')
可以發(fā)現(xiàn):銷售額和銷量趨勢較為一致,但是最后一天2.11.12.9銷量和銷售額明顯提高。# 取2011年10月1日至2011年12月9日day_part = time_grouped['2011-10-01':'2011-12-09'].groupby('Date').agg({'Quantity': np.sum, 'SumPrice': np.sum, 'InvoiceNo': np.size}).plot(secondary_y = 'InvoiceNo', figsize = (14, 5))day_part.set_ylabel('Quantity & SumPrice')day_part.right_ax.set_ylabel('Order quantities')
可以發(fā)現(xiàn),12月份前8天的三條這些趨勢較為一致,最后一天的銷量和銷售額明顯提高,把這天信息顯示出來查驗(yàn)。data_success[data_success.Date == '2011-12-09'].sort_values(by = 'SumPrice', ascending = False).head()
可以發(fā)現(xiàn),以為英國客戶,一次性購買8萬余件的紙品工藝,拉高了銷售量。#提取一張客戶ID及其對應(yīng)國家的關(guān)系表data_country = data_success.drop_duplicates(subset=['CustomerID', 'Country'])#按客戶分組,計(jì)算消費(fèi)總額country_grouped = data_customer.groupby('CustomerID')[['SumPrice']].sum()#將上述兩張表合并data_country = data_country.drop({'SumPrice'},axis=1)country_grouped = country_grouped.merge(data_country, on=['CustomerID'])#按國家再次分組,計(jì)算出各國客戶消費(fèi)總額和客戶總數(shù)country_grouped2 = country_grouped.groupby('Country').agg({'SumPrice':np.sum,'CustomerID': np.size})新增AvgAmount字段,存放該國家客戶的人均消費(fèi)金額country_grouped2['AvgAmount'] = country_grouped2['SumPrice']/country_grouped2['CustomerID']對消費(fèi)總額降序排列country_grouped2.sort_values(by='SumPrice',ascending=False).head()
可以發(fā)現(xiàn),絕大部分客戶來自英國本土,主要境外收入來源是英國周邊的國家,基本上符合以英國為中心向外輻射的情況。#提取數(shù)據(jù)select_customer = data_success[data_success['CustomerID'] != 0].copy()#查看用戶初次與末次消費(fèi)時間#客戶的初次消費(fèi)時間mindate = data_customer.groupby('CustomerID')[['Date']].min()#客戶的末次消費(fèi)時間maxdate = data_customer.groupby('CustomerID')[['Date']].max()#計(jì)算用戶消費(fèi)的生命周期life_time = maxdate - mindate#生命周期描述性統(tǒng)計(jì)分析life_time.describe()
可以發(fā)現(xiàn):#新增life_times特征life_time['life_times'] = life_time['Date'].dt.days#繪制總體客戶生命周期柱狀圖f,[ax1, ax2] = plt.subplots(2,1,figsize=(12,10))sns.distplot(life_time['life_times'],bins=20,kde=False,ax=ax1, hist_kws={'alpha':1,'color':'g'})ax1.set_title('Life Time Distribution')ax1.set_ylabel('Customer number')ax1.set_xlabel('Life time (days)')#繪制不止一次消費(fèi)的客戶生命周期柱狀圖sns.distplot(life_time[life_time['life_times']>0]['life_times'],bins=100,kde=True,color='r',ax=ax2, hist_kws={'alpha':0.8,'color':'g'})ax2.set_title('Life Time Distribution without One-time Deal Hunters')ax2.set_ylabel('Customer number')ax2.set_xlabel('Life time (days)')plt.savefig('16-1.png')plt.show()
可以發(fā)現(xiàn):#sales_customer新增字段用戶首次消費(fèi)日期mindatecustomer_retention=select_customer.merge(mindate,on='CustomerID',how='inner',suffixes=('','Min'))#新增字段DateDiff,用于存放本次消費(fèi)日期與首次消費(fèi)日期的時間差,并轉(zhuǎn)為數(shù)值:customer_retention['DateDiff'] = (customer_retention.Date-customer_retention.DateMin).dt.days#新增字段DateDiffBin存放時間分段date_bins = [0, 3, 7, 30, 60, 90, 180]customer_retention['DateDiffBin'] = pd.cut(customer_retention.DateDiff, bins = date_bins)customer_retention['DateDiffBin'].value_counts()#畫柱狀圖customer_counts = customer_retention['DateDiffBin'].value_counts()customer_counts.plot.bar()#畫餅圖labels = customer_counts.keys().categories.sort_values(ascending=False)values = customer_counts.valuesexplode = (0.1,0,0,0,0,0)plt.pie(values,explode=explode,labels=labels,autopct='%1.1f%%',shadow=False,startangle=150)plt.title("客戶留存分布")plt.show()
可以發(fā)現(xiàn):#排除客戶在同一天購買商品記錄customer_cycle = customer_retention.drop_duplicates(subset=['CustomerID', 'Date'], keep='first')#按照日期進(jìn)行排序customer_cycle.sort_values(by = 'Date',ascending = True) #定義函數(shù)diff,用于計(jì)算相鄰兩次消費(fèi)的時間差def diff(group): d = group.DateDiff - group.DateDiff.shift() return d#先按客戶編碼分組,在應(yīng)用diff函數(shù):last_diff = sales_cycle.groupby('CustomerID').apply(diff)last_diff.head(10)
按照訂單統(tǒng)計(jì)的購買日期分析last_diff.hist(bins = 70, figsize = (12, 6), color = 'r')
可以發(fā)現(xiàn),大部分購買行為的消費(fèi)間隔比較短。但這是所有訂單的購買周期分布,并不是對客戶個體為統(tǒng)計(jì)單位的購買周期分布。last_diff_customer = last_diff.groupby('CustomerID').mean()last_diff_customer.hist(bins = 70, figsize = (12, 6), color = 'r')
可以發(fā)現(xiàn),購物周期的峰值在15-70天范圍內(nèi),該電商平臺可以以30天為周期推出優(yōu)惠活動,吸引客戶,提高客戶購買周期。關(guān)鍵詞:數(shù)據(jù),分析,銷售,用戶
客戶&案例
營銷資訊
關(guān)于我們
微信公眾號
版權(quán)所有? 億企邦 1997-2025 保留一切法律許可權(quán)利。