Cris's Python Data Analysis Notes 06: Common Data Preprocessing with Pandas
阿新 • Published: 2018-11-30
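1. Sorting by a specified column with Pandas
import pandas as pd
'''
sort_values sorts by the specified column. If the inplace parameter is True, the original DataFrame is sorted
in place; otherwise a new, sorted DataFrame is returned. NaN denotes a missing value, and rows with NaN are
placed last by default. The default order is ascending; use the ascending parameter to change it.
'''
data = pd.read_csv('food_info.csv')
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)', inplace=True)
print(data['Sodium_(mg)'])
data.sort_values('Sodium_(mg)', inplace=True, ascending=False)
print(data['Sodium_(mg)'])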
0 643.0
1 659.0
2 2.0
3 1146.0
4 560.0
5 629.0
6 842.0
7 690.0
8 644.0
9 700.0
10 604.0
11 364.0
12 344.0
13 372.0
14 308.0
15 406.0
16 365.0
17 812.0
18 917.0
19 800.0
20 600.0
21 819.0
22 714.0
23 800.0
24 600.0
25 627.0
26 710.0
27 619.0
28 682.0
29 628.0
...
8588 2.0
8589 2.0
8590 7.0
8591 564.0
8592 464.0
8593 490.0
8594 1.0
8595 199.0
8596 297.0
8597 16.0
8598 486.0
8599 0.0
8600 2.0
8601 1297.0
8602 1435.0
8603 2838.0
8604 10.0
8605 2.0
8606 12.0
8607 0.0
8608 3326.0
8609 1765.0
8610 3750.0
8611 29.0
8612 58.0
8613 4450.0
8614 667.0
8615 58.0
8616 70.0
8617 68.0
Name: Sodium_(mg), Length: 8618, dtype: float64
760 0.0
758 0.0
405 0.0
761 0.0
2269 0.0
763 0.0
764 0.0
770 0.0
774 0.0
396 0.0
395 0.0
6827 0.0
394 0.0
393 0.0
391 0.0
390 0.0
787 0.0
788 0.0
2270 0.0
2231 0.0
407 0.0
748 0.0
409 0.0
747 0.0
702 0.0
703 0.0
704 0.0
705 0.0
706 0.0
707 0.0
...
8153 NaN
8155 NaN
8156 NaN
8157 NaN
8158 NaN
8159 NaN
8160 NaN
8161 NaN
8163 NaN
8164 NaN
8165 NaN
8167 NaN
8169 NaN
8170 NaN
8172 NaN
8173 NaN
8174 NaN
8175 NaN
8176 NaN
8177 NaN
8178 NaN
8179 NaN
8180 NaN
8181 NaN
8183 NaN
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
276 38758.0
5814 27360.0
6192 26050.0
1242 26000.0
1245 24000.0
1243 24000.0
1244 23875.0
292 17000.0
1254 11588.0
5811 10600.0
8575 9690.0
291 8068.0
1249 8031.0
5812 7893.0
1292 7851.0
293 7203.0
4472 7027.0
4836 6820.0
1261 6580.0
3747 6008.0
1266 5730.0
4835 5586.0
4834 5493.0
1263 5356.0
1553 5203.0
1552 5053.0
1251 4957.0
1257 4843.0
294 4616.0
8613 4450.0
...
8153 NaN
8155 NaN
8156 NaN
8157 NaN
8158 NaN
8159 NaN
8160 NaN
8161 NaN
8163 NaN
8164 NaN
8165 NaN
8167 NaN
8169 NaN
8170 NaN
8172 NaN
8173 NaN
8174 NaN
8175 NaN
8176 NaN
8177 NaN
8178 NaN
8179 NaN
8180 NaN
8181 NaN
8183 NaN
8184 NaN
8185 NaN
8195 NaN
8251 NaN
8267 NaN
Name: Sodium_(mg), Length: 8618, dtype: float64
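The output above shows the NaN rows landing at the end in both sort directions. Here is a minimal sketch of that behaviour, and of the na_position parameter that changes it, on a tiny made-up frame (the values are illustrative and do not come from food_info.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: four rows, one missing value.
df = pd.DataFrame({"Sodium_(mg)": [643.0, np.nan, 2.0, 1146.0]})

# By default NaN rows are placed last, for ascending and descending sorts alike.
asc = df.sort_values("Sodium_(mg)")
desc = df.sort_values("Sodium_(mg)", ascending=False)

# na_position="first" moves the NaN rows to the top instead.
nan_first = df.sort_values("Sodium_(mg)", na_position="first")

print(asc.index.tolist())        # [2, 0, 3, 1]
print(desc.index.tolist())       # [3, 0, 2, 1]
print(nan_first.index.tolist())  # [1, 2, 0, 3]
```

Each call here returns a new DataFrame because inplace is left at its default of False, so df itself keeps its original order.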
2. The classic Titanic introductory example
import numpy as np
'''
isnull checks a column for missing values: it returns True for NaN and False for a valid value
'''
titanic_survival = pd.read_csv('titanic_train.csv')
titanic_survival.head()
age = titanic_survival['Age']
age_top_10 = age[0:10]
age_is_null = pd.isnull(age_top_10)
print(age_is_null)
# Index with the boolean mask to keep only the entries whose value is missing
age_null = age_top_10[age_is_null]
print(age_null)
age_null_count = len(age_null)
print(age_null_count)
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
Name: Age, dtype: bool
5 NaN
Name: Age, dtype: float64
1
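As a possible shortcut for the count above: a boolean mask can be summed directly, because True counts as 1. The sketch below rebuilds the first ten Age values from the outputs printed in this article instead of reading titanic_train.csv:

```python
import numpy as np
import pandas as pd

# First ten Age values as printed in this article; index 5 is the missing one.
age = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0], name="Age")

mask = age.isna()                 # same result as pd.isnull(age)
null_count = int(mask.sum())      # True is counted as 1
null_positions = age.index[mask].tolist()

print(null_count)      # 1
print(null_positions)  # [5]
```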
3. Common Pandas data preprocessing functions
3.1 Handling missing values
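'''
If NaN values are not handled, the result of the calculation is nan
'''
average_age = sum(titanic_survival['Age'])/len(titanic_survival['Age'])
print(average_age)
'''
A handy way to handle missing values: index with a boolean mask to keep only the values that are not NaN
'''
# First use isnull to get a boolean Series for the column: True where the value is missing, False otherwise
all_age_null = pd.isnull(titanic_survival['Age'])
print(all_age_null)
# Then use the inverted mask as an index to select all the valid values
good_ages = titanic_survival['Age'][~all_age_null]
print(good_ages)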
age_average = sum(good_ages)/len(good_ages)
# 29.69911764705882
print(age_average)
nan
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 True
18 False
19 True
20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 True
29 True
...
861 False
862 False
863 True
864 False
865 False
866 False
867 False
868 True
869 False
870 False
871 False
872 False
873 False
874 False
875 False
876 False
877 False
878 True
879 False
880 False
881 False
882 False
883 False
884 False
885 False
886 False
887 False
888 True
889 False
890 False
Name: Age, Length: 891, dtype: bool
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
6 54.0
7 2.0
8 27.0
9 14.0
10 4.0
11 58.0
12 20.0
13 39.0
14 14.0
15 55.0
16 2.0
18 31.0
20 35.0
21 34.0
22 15.0
23 28.0
24 8.0
25 38.0
27 19.0
30 40.0
33 66.0
34 28.0
35 42.0
37 21.0
38 18.0
...
856 45.0
857 51.0
858 24.0
860 41.0
861 21.0
862 48.0
864 24.0
865 42.0
866 27.0
867 31.0
869 4.0
870 26.0
871 47.0
872 33.0
873 47.0
874 28.0
875 15.0
876 20.0
877 19.0
879 56.0
880 25.0
881 33.0
882 22.0
883 28.0
884 25.0
885 39.0
886 27.0
887 19.0
889 26.0
890 32.0
Name: Age, Length: 714, dtype: float64
29.69911764705882
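The nan result comes from Python's built-in sum, which simply adds every element, and any arithmetic involving NaN yields NaN. A small sketch on made-up values (not the Titanic data) contrasts it with the Series methods, which skip NaN by default:

```python
import math

import numpy as np
import pandas as pd

# Illustrative values only: one of the three entries is missing.
ages = pd.Series([22.0, np.nan, 26.0])

builtin_total = sum(ages)     # plain Python addition, so NaN propagates
pandas_total = ages.sum()     # Series.sum skips NaN by default (skipna=True)

# Manual approach from this section: filter first, then average.
valid = ages[ages.notna()]
manual_mean = sum(valid) / len(valid)

print(math.isnan(builtin_total))  # True
print(pandas_total)               # 48.0
print(manual_mean)                # 24.0
```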
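3.2 Pandas methods that automatically skip missing values
# missing data is so common that many pandas methods automatically filter for it
# Although Pandas provides functions that filter out missing values, relying on them is still not really
# recommended, because data should not be discarded lightly; the usual practice is to fill in a computed default value instead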
mean_age = titanic_survival['Age'].mean()
print(mean_age)
29.69911764705882
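The comment above recommends filling missing values with a computed default instead of discarding them. One way this might look, using fillna with the column mean on a small made-up Series (the numbers are illustrative, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 36.0], name="Age")

mean_age = ages.mean()          # NaN is skipped: (22 + 26 + 36) / 3
filled = ages.fillna(mean_age)  # returns a new Series; `ages` is unchanged

print(mean_age)         # 28.0
print(filled.tolist())  # [22.0, 28.0, 26.0, 28.0, 36.0]
```

Filling with the mean keeps every row and leaves the column mean unchanged, although it does reduce the variance of the column.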
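3.3 Manually computing the average fare for each passenger class
Pclass = [1, 2, 3]
Pclass_avg_price = {}
for this_pclass in Pclass:
    # First select the rows (samples) whose Pclass matches, then sum the chosen column (feature)
    # over those rows and divide by the number of rows; the result is the mean of that feature
    prices = titanic_survival[titanic_survival['Pclass'] == this_pclass]
    # Pclass_avg_price[this_pclass] = sum(prices['Fare'])/len(prices)
    # The mean can also be computed with the built-in Pandas method shown in section 3.2!
    Pclass_avg_price[this_pclass] = prices['Fare'].mean()
print(Pclass_avg_price)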
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
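The loop above filters the frame once per class. As an alternative sketch, groupby can produce the same kind of per-class mean in a single call; the frame below is a made-up stand-in for titanic_survival:

```python
import pandas as pd

# Hypothetical fares for six passengers in three classes.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare": [80.0, 90.0, 20.0, 8.0, 7.0, 9.0],
})

# Split the rows by Pclass, then average Fare inside each group.
avg = df.groupby("Pclass")["Fare"].mean()
avg_dict = avg.to_dict()

print(avg_dict)  # {1: 85.0, 2: 20.0, 3: 8.0}
```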
3.4 Simplifying the computation in section 3.3 with a Pandas built-in function
'''
index tells the method which column to group by
values is the column that we want to apply the calculation to
aggfunc specifies the calculation we want to perform
'''
passenger_survival = titanic_survival.pivot_table(index='Pclass', values='Survived', aggfunc=np.mean)
print(passenger_survival)
# Note: if the aggfunc parameter is omitted, the default is to compute the mean
avg_age = titanic_survival.pivot_table(index='Pclass', values='Age')
print(avg_age)
age = titanic_survival.pivot_table(index='Pclass', values='Age', aggfunc=np.mean)
print(age)
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
              Age
Pclass
1       38.233441
2       29.877630
3       25.140620
              Age
Pclass
1       38.233441
2       29.877630
3       25.140620
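The last two tables are identical because the mean is the default aggregation. A short check of that on made-up data follows; it passes the aggregation as the string "mean", which pivot_table also accepts (as far as I know, some newer pandas releases warn when a NumPy function such as np.mean is passed instead):

```python
import pandas as pd

# Hypothetical survival flags for five passengers.
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3],
    "Survived": [1, 0, 1, 0, 0],
})

default = df.pivot_table(index="Pclass", values="Survived")
explicit = df.pivot_table(index="Pclass", values="Survived", aggfunc="mean")

print(default["Survived"].tolist())  # [0.5, 1.0, 0.0]
print(default.equals(explicit))      # True
```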
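3.5 Grouped computation of the relationship between specified columns
# Group by port of embarkation, then compute the total fare and the total number of survivors for each group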
Fare_survived = titanic_survival.pivot_table(index='Embarked', values=['Fare', 'Survived'], aggfunc=np.sum)
print(Fare_survived)
                Fare  Survived
Embarked
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
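When the columns need different aggregations, aggfunc can also take a dict that maps each column to its own function. A minimal sketch on made-up data (the mean fare and the survivor count per port are my choice of illustration, not something the original computes):

```python
import pandas as pd

# Hypothetical passengers from two ports.
df = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "S"],
    "Fare": [10.0, 30.0, 5.0, 7.0, 9.0],
    "Survived": [1, 0, 1, 1, 0],
})

# Average the fare but sum the survival flags within each port.
summary = df.pivot_table(
    index="Embarked",
    values=["Fare", "Survived"],
    aggfunc={"Fare": "mean", "Survived": "sum"},
)

print(summary.loc["C", "Fare"])      # 20.0
print(summary.loc["S", "Survived"])  # 2
```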
# specifying axis = 1 or axis = 'columns' will drop any columns that have null values
drop_col = titanic_survival.dropna(axis=1)
print(drop_col.head())
# Drop a row (sample) if its Age or Sex value is missing
new_data = titanic_survival.dropna(axis=0, subset=['Age','Sex'])
print(new_data.head())
# The corresponding fillna function fills null values instead of dropping them
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex SibSp Parch \
0 Braund, Mr. Owen Harris male 1 0
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0
2 Heikkinen, Miss. Laina female 0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 1 0
4 Allen, Mr. William Henry male 0 0
Ticket Fare
0 A/5 21171 7.2500
1 PC 17599 71.2833
2 STON/O2. 3101282 7.9250
3 113803 53.1000
4 373450 8.0500
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
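The dropna variants used above, together with the fillna counterpart mentioned in the last comment, can be compared on a three-row made-up frame (the column names mirror the Titanic data, the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
    "Sex": ["male", "female", None],
    "Cabin": [np.nan, "C85", np.nan],
})

# axis=1 drops every column that contains at least one missing value.
complete_cols = df.dropna(axis=1).columns.tolist()

# Default how="any": a row is dropped if Age OR Sex is missing.
any_rows = df.dropna(axis=0, subset=["Age", "Sex"]).index.tolist()

# how="all": a row is dropped only if Age AND Sex are both missing.
all_rows = df.dropna(axis=0, subset=["Age", "Sex"], how="all").index.tolist()

# fillna keeps the rows and substitutes a placeholder instead.
cabin_filled = df["Cabin"].fillna("Unknown").tolist()

print(complete_cols)  # ['Pclass']
print(any_rows)       # [0]
print(all_rows)       # [0, 1, 2]
print(cabin_filled)   # ['Unknown', 'C85', 'Unknown']
```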
3.6 Locating data
# Pandas locates a specific value by row label (here the default integer index) and column name
print(titanic_survival.loc[12,'Age'])
print(titanic_survival.loc[342,'Pclass'])
20.0
2
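Note that loc looks rows up by index label, which only coincides with the row position while the default index is untouched. A small sketch on a made-up frame shows how loc and the position-based iloc diverge after a sort:

```python
import pandas as pd

# Hypothetical ages with the default index 0, 1, 2.
df = pd.DataFrame({"Age": [22.0, 80.0, 35.0]})

# Sorting reorders the rows but keeps their labels: the index becomes [1, 2, 0].
sorted_df = df.sort_values("Age", ascending=False)

by_label = sorted_df.loc[0, "Age"]   # the row whose label is 0
by_position = sorted_df.iloc[0, 0]   # the first row, first column

print(by_label)     # 22.0
print(by_position)  # 80.0
```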
3.7 Resetting the index after sorting
new_data = titanic_survival.sort_values('Age', ascending=False)
# Discard the old index and renumber the sorted rows from 0; inplace=True modifies new_data directly
new_data.reset_index(drop=True,inplace=True)
print(new_data.head())
PassengerId Survived Pclass Name Sex \
0 631 1 1 Barkworth, Mr. Algernon Henry Wilson male
1 852 0 3 Svensson, Mr. Johan male
2 494 0 1 Artagaveytia, Mr. Ramon male
3 97 0 1 Goldschmidt, Mr. George B male
4 117 0 3 Connors, Mr. Patrick male
Age SibSp Parch Ticket Fare Cabin Embarked
0 80.0 0 0 27042 30.0000 A23 S
1 74.0 0 0 347060 7.7750 NaN S
2 71.0 0 0 PC 17609 49.5042 NaN C
3 71.0 0 0 PC 17754 34.6542 A5 C
4 70.5 0 0 370369 7.7500 NaN Q
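To see what drop=True changes, the sketch below resets the index of a small made-up frame both ways. Without drop=True, the old labels are kept as a new column named index:

```python
import pandas as pd

# Hypothetical ages; after sorting, the index is [1, 2, 0].
df = pd.DataFrame({"Age": [22.0, 80.0, 35.0]})
sorted_df = df.sort_values("Age", ascending=False)

kept = sorted_df.reset_index()              # old labels saved in an "index" column
dropped = sorted_df.reset_index(drop=True)  # old labels discarded

print(kept.columns.tolist())     # ['index', 'Age']
print(kept["index"].tolist())    # [1, 2, 0]
print(dropped.columns.tolist())  # ['Age']
print(dropped.index.tolist())    # [0, 1, 2]
```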
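3.8 Custom functions
# Define a function that returns the value in the 100th row (index label 99)
def hundredth_data(column):
    data = column.loc[99]
    return data
# By default (axis=0) apply calls the function once per column
data = titanic_survival.apply(hundredth_data)
print(data)
# Count the missing values in each column
def null_count(column):
    col_null = pd.isnull(column)
    null = column[col_null]
    return len(null)
count = titanic_survival.apply(null_count)
print('----------')
print(count)
# help() prints the documentation itself and returns None, so the outer print adds a final "None"
print(help(pd.isnull))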
PassengerId 100
Survived 0
Pclass 2
Name Kantor, Mr. Sinai
Sex male
Age 34
SibSp 1
Parch 0
Ticket 244367
Fare 26
Cabin NaN
Embarked S
dtype: object
----------
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Help on function isna in module pandas.core.dtypes.missing:
isna(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indictates
whether values are missing (``NaN`` in numeric arrays, ``None`` or ``NaN``
in object arrays, ``NaT`` in datetimelike).
Parameters
----------
obj : scalar or array-like
Object to check for null or missing values.
Returns
-------
bool or array-like of bool
For scalar input, returns a scalar boolean.
For array input, returns an array of boolean indicating whether each
corresponding element is missing.
See Also
--------
notna : boolean inverse of pandas.isna.
Series.isna : Detetct missing values in a Series.
DataFrame.isna : Detect missing values in a DataFrame.
Index.isna : Detect missing values in an Index.
Examples
--------
Scalar arguments (including strings) result in a scalar boolean.
>>> pd.isna('dog')
False
>>> pd.isna(np.nan)
True
ndarrays result in an ndarray of booleans.
>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> array
array([[ 1., nan, 3.],
[ 4., 5., nan]])
>>> pd.isna(array)
array([[False, True, False],
[False, False, True]])
For indexes, an ndarray of booleans is returned.
>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
... "2017-07-08"])
>>> index
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
dtype='datetime64[ns]', freq=None)
>>> pd.isna(index)
array([False, False, True, False])
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
0 1 2
0 ant bee cat
1 dog None fly
>>> pd.isna(df)
0 1 2
0 False False False
1 False True False
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
None
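The per-column null count does not strictly need a custom function: isna().sum() should give the same numbers. A quick comparison on a made-up frame (values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.92],
})

def null_count(column):
    # Same logic as in this section: keep the missing entries and count them.
    return len(column[pd.isnull(column)])

via_apply = df.apply(null_count)
via_isna = df.isna().sum()  # True counts as 1 when summed column by column

print(via_apply.tolist())  # [2, 2, 0]
print(via_isna.tolist())   # [2, 2, 0]
```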
3.9 Row-wise iteration and data transformation
ages = titanic_survival['Age']
print(ages.head())
def which_class(row):
    pclass = row['Pclass']
    if pd.isnull(pclass):
        return 'Unknown'
    elif pclass == 1:
        return 'First Class'
    elif pclass == 2:
        return 'Second Class'
    else:
        return 'Third Class'
# With axis=1, apply calls the function once per row, i.e. it iterates over the rows
result = titanic_survival.apply(which_class, axis=1)
print(result.head())
# Labels: 年輕人 = young, 中年人 = middle-aged, 老年人 = elderly
def age_class(row):
    age = row['Age']
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return '年輕人'
    elif age < 40:
        return '中年人'
    else:
        return '老年人'
age_lable = titanic_survival.apply(age_class, axis=1)
print(age_lable.tail())
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
0 Third Class
1 First Class
2 Third Class
3 First Class
4 Third Class
dtype: object
886 中年人
887 中年人
888 Unknown
889 中年人
890 中年人
dtype: object
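Calling a Python function once per row can be slow on large frames. As an alternative sketch (not the approach used above), the same thresholds can be expressed with NumPy's np.select, which evaluates the conditions on the whole column at once. The English labels and the sample ages are stand-ins chosen for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including both boundary values and a missing entry.
ages = pd.Series([5.0, 18.0, 39.9, 40.0, np.nan], name="Age")

def age_class(age):
    # Element-wise version with the same thresholds as in this section.
    if pd.isna(age):
        return "Unknown"
    elif age < 18:
        return "young"
    elif age < 40:
        return "middle-aged"
    else:
        return "elderly"

via_apply = ages.apply(age_class)

# np.select takes the first condition that is True for each element.
conditions = [ages.isna(), ages < 18, ages < 40]
choices = ["Unknown", "young", "middle-aged"]
via_select = pd.Series(
    np.select(conditions, choices, default="elderly"),
    index=ages.index,
)

print(via_select.tolist())
# ['young', 'middle-aged', 'middle-aged', 'elderly', 'Unknown']
```

The isna condition has to come first: comparisons with NaN are always False, so a missing age would otherwise fall through to the default label.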
3.10 Using grouping to compute relationships in the data
# Add a new column to the DataFrame
titanic_survival['age_label'] = age_lable
result = titanic_survival.pivot_table(index='age_label', values='Survived')
print(result)
Survived
age_label
Unknown 0.293785
中年人 0.383562
年輕人 0.539823
老年人 0.374233
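A survival rate is easier to interpret next to the size of its group. One possible way to get both at once is groupby with agg, sketched here on made-up labels and outcomes (not the Titanic results above):

```python
import pandas as pd

# Hypothetical labelled passengers.
df = pd.DataFrame({
    "age_label": ["young", "young", "adult", "adult", "adult", "Unknown"],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# One row per label, with the survival rate and the group size side by side.
stats = df.groupby("age_label")["Survived"].agg(["mean", "count"])

print(stats.loc["young", "mean"])   # 0.5
print(stats.loc["adult", "count"])  # 3
```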