Python Data Preprocessing Pipeline
Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, Normalizer, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
Load the data
df = pd.read_csv("data.csv")  # replace with the actual file path
df.head()
Data visualization
- Distribution plots for numeric variables
sns.histplot(df.select_dtypes(include="number"), kde=True)  # numeric columns only
plt.show()
- Count plots for categorical variables
sns.countplot(x='category', data=df)  # replace 'category' with an actual column
plt.show()
- Box plots to inspect outliers
sns.boxplot(data=df.select_dtypes(include="number"))
plt.xticks(rotation=90)
plt.show()
Handle missing values
- Drop rows with missing values
df.dropna(inplace=True)
- Fill in missing values
- Mean imputation (for numeric data)
imputer = SimpleImputer(strategy='mean')
df.iloc[:, :] = imputer.fit_transform(df)
- Median imputation (for data with many outliers)
imputer = SimpleImputer(strategy='median')
df.iloc[:, :] = imputer.fit_transform(df)
- Most-frequent-value imputation (for categorical variables)
imputer = SimpleImputer(strategy='most_frequent')
df.iloc[:, :] = imputer.fit_transform(df)
- KNN imputation (for data with clear patterns)
knn_imputer = KNNImputer(n_neighbors=5)
df.iloc[:, :] = knn_imputer.fit_transform(df)
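A self-contained sketch of the imputation step on a hypothetical toy frame. Note that mean/median strategies only apply to numeric columns, so the two column types are imputed separately rather than passing the whole frame at once.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values in both a numeric and a categorical column.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["A", "B", np.nan, "B"],
})

# Numeric column: median imputation (robust to outliers).
num_imputer = SimpleImputer(strategy="median")
toy[["age"]] = num_imputer.fit_transform(toy[["age"]])

# Categorical column: most-frequent-value imputation.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy[["city"]] = cat_imputer.fit_transform(toy[["city"]])
```

The median of [25, 40, 35] is 35, and "B" is the most frequent city, so both gaps are filled with those values.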
Handle outliers (remove extreme values)
- Remove outliers using the interquartile range (IQR)
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
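Applying the same IQR filter to a toy single-column frame (hypothetical values) shows the effect: the extreme value falls outside Q3 + 1.5·IQR, so its row is dropped while the ordinary values survive.

```python
import pandas as pd

# Hypothetical data: five ordinary values and one obvious outlier.
toy = pd.DataFrame({"x": [10, 12, 11, 13, 12, 100]})

Q1 = toy.quantile(0.25)
Q3 = toy.quantile(0.75)
IQR = Q3 - Q1

# Keep rows where no column lies outside the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] fence.
mask = ((toy < (Q1 - 1.5 * IQR)) | (toy > (Q3 + 1.5 * IQR))).any(axis=1)
filtered = toy[~mask]
```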
Handle categorical data
- Label encoding (for binary or ordinal categories)
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])
- One-hot encoding (for unordered, nominal categories)
df = pd.get_dummies(df, columns=['category'], drop_first=True)
- Frequency encoding (for high-cardinality variables, where one-hot encoding would create too many columns)
freq_encoding = df['category'].value_counts(normalize=True)
df['category_encoded'] = df['category'].map(freq_encoding)
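The three encodings can be compared side by side on a hypothetical toy column. LabelEncoder assigns integer codes in alphabetical order of the classes, get_dummies with drop_first drops the alphabetically first level, and frequency encoding maps each category to its relative frequency.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical data.
toy = pd.DataFrame({"category": ["red", "blue", "red", "green"]})

# Label encoding: blue -> 0, green -> 1, red -> 2 (alphabetical order).
le = LabelEncoder()
toy["label"] = le.fit_transform(toy["category"])

# One-hot encoding; drop_first removes the redundant "blue" column.
onehot = pd.get_dummies(toy[["category"]], columns=["category"], drop_first=True)

# Frequency encoding: each category mapped to its share of the rows.
freq = toy["category"].value_counts(normalize=True)
toy["freq"] = toy["category"].map(freq)
```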
Data standardization / normalization
- Min-max scaling (rescales values into the [0, 1] range; suitable for data with a stable distribution)
scaler = MinMaxScaler()
df.iloc[:, :] = scaler.fit_transform(df)
- Standardization (mean = 0, standard deviation = 1; suitable for normally distributed data)
scaler = StandardScaler()
df.iloc[:, :] = scaler.fit_transform(df)
- Robust scaling (based on the median and IQR; suitable for data containing outliers)
scaler = RobustScaler()
df.iloc[:, :] = scaler.fit_transform(df)
- Euclidean normalization (Normalizer) (rescales each sample to a unit vector; suitable for vector data)
scaler = Normalizer()
df.iloc[:, :] = scaler.fit_transform(df)
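A quick sketch contrasting two of these scalers on a tiny hypothetical numeric array: min-max scaling maps the values into [0, 1], while standardization centers them at zero with unit variance.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

mm = MinMaxScaler().fit_transform(x)     # rescaled into [0, 1]
std = StandardScaler().fit_transform(x)  # mean 0, unit variance
```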
Feature engineering
- Drop highly correlated features
correlation_matrix = df.corr().abs()
upper = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))
drop_cols = [column for column in upper.columns if any(upper[column] > 0.9)]
df.drop(columns=drop_cols, inplace=True)
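A toy illustration of the correlation filter above, on hypothetical data where column "b" is an exact multiple of "a": only the upper triangle of the correlation matrix is scanned, so exactly one of each correlated pair is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is perfectly correlated with "a"; "c" is unrelated.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
drop_cols = [col for col in upper.columns if (upper[col] > 0.9).any()]
toy = toy.drop(columns=drop_cols)
```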
- Dimensionality reduction with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df)
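On synthetic data with correlated features, PCA with two components should capture almost all of the variance. The example below generates hypothetical data for illustration; `explained_variance_ratio_` reports how much variance each component retains.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 3 features; the second is nearly a
# multiple of the first, the third is independent noise.
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    base * 2 + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(size=(100, 1)),
])

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # shape (100, 2)
```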
Split into training and test sets
X = df.drop(columns=['target'])  # replace 'target' with the actual target column
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
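The split step, sketched on hypothetical toy arrays. `stratify=y` is an optional addition (not in the snippet above) that keeps the class balance identical in both splits, which matters for small or imbalanced datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, balanced binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 reserves 2 of the 10 samples; stratify preserves class ratios.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```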
Save the processed data
X_train.to_csv("X_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)