Pandas缺失值

1
2
3
import numpy as np
import pandas as pd
import re,os,random
1
2
3
4
5
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0

检查缺失值

  1. isnull()
  2. isna
1
2
3
df.isnull().sum()

df.isna().sum()
name     0
score    3
pop      2
dtype: int64






name     0
score    3
pop      2
dtype: int64

处理缺失值

平均值填充缺失值

  1. fillna(Series.mean())
  2. fillna(Series.interpolate())
1
2
3
4
# 填充缺失值:该列的平均值
df['score'] = df['score'].fillna(df['score'].mean())
df

name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ 66.0 65.5
3 Sql 66.0 88.0
4 Html 78.0 89.0
5 Css 66.0 90.0
1
2
3
4
5
6
7
8
9
10
11
12
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df

# 填充缺失值:缺失值上下的数字平均值。
df['score'] = df['score'].fillna(df['score'].interpolate())
# df

df['pop'] = df['pop'].fillna(df['pop'].interpolate())
df
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ 46.0 65.5
3 Sql 62.0 88.0
4 Html 78.0 89.0
5 Css 78.0 90.0

缺失值前后的值填充

  1. fillna(axis=0/1,method = ‘ffill/bfill/pad’)
  2. axis : 0: 纵向, 1:横向
  3. method: ffill: 填充前值,bfill:填充后值,pad:优先取前值,否则取后值
1
2
3
4
5
6
7
8
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df

df.fillna(axis=0,method='ffill')

name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ 30.0 43.0
3 Sql 30.0 88.0
4 Html 78.0 88.0
5 Css 78.0 90.0
1
2
3
4
5
6
7
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df

df.fillna(axis=0,method='bfill')
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ 78.0 88.0
3 Sql 78.0 88.0
4 Html 78.0 90.0
5 Css NaN 90.0
1
2
3
4
5
6
7
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df

df.fillna(axis=0,method='pad')
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ 30.0 43.0
3 Sql 30.0 88.0
4 Html 78.0 88.0
5 Css 78.0 90.0

删除缺失值

  1. df.dropna(axis,how,threst,subset,inplace)
  2. axis:0为行 1为列,default 0,数据删除维度
  3. how:{‘any’, ‘all’}, default ‘any’,any:删除带有nan的行;all:删除全为nan的行
  4. threst:int,保留至少 int 个非nan行
  5. subset:list,在特定列缺失值处理
  6. inplace:bool,是否修改源文件
1
2
3
4
5
6
7
8
9
10
11
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df

# 删除缺失值所在的行
df.dropna(axis=0)

# 删除缺失值所在的列
df.dropna(axis=1)
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
name
0 Python
1 Java
2 C++
3 Sql
4 Html
5 Css
1
2
3
4
5
# thresh 和 how 不能同时使用
df.dropna(axis=0,
# how='any',
thresh=2,
)
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
1
2
3
4
df.dropna(axis=0,
how='any',
subset=['score'],
)
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
4 Html 78.0 NaN

na_action

  1. 主要应用在 map函数下
  2. None:默认值 忽略NA
  3. ignore:NA值将不会被传递给映射函数
  4. raise:如果存在缺失值,则抛出异常
1
2
3
4
5
6
7
8
9
10
dct = {'name':['Python','Java','C++','Sql','Html','Css'],
'score':[90,30,np.nan,np.nan,78,np.nan],'pop':[65,43,np.nan,88,np.nan,90]}

df = pd.DataFrame(dct)
df

df['score'] = df['score'].map(lambda x: f'{x:.2f}', na_action=None)

df['pop'] = df['pop'].map(lambda x: f'{x:g}', na_action='ignore')
df
name score pop
0 Python 90.0 65.0
1 Java 30.0 43.0
2 C++ NaN NaN
3 Sql NaN 88.0
4 Html 78.0 NaN
5 Css NaN 90.0
name score pop
0 Python 90.00 65
1 Java 30.00 43
2 C++ nan NaN
3 Sql nan 88
4 Html 78.00 NaN
5 Css nan 90