Data Science Fundamentals 1: Data Handling with Pandas

Data Preparation

This walkthrough uses the Iris dataset. Source:

Fisher, R. (1936). Iris [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C56C76.

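The dataset's description file lists the attribute names shown below. The data file itself has no header row, so the column names are supplied when the file is read.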

Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa
-- Iris Versicolour
-- Iris Virginica

Reading the Data

Pass the column names identified above to read_csv():

import pandas as pd
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv("../Dataset/iris/iris.data", header=None, names=columns)
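
To illustrate why header=None matters for a header-less file, here is a minimal sketch. It uses a few inline rows in the same comma-separated layout as iris.data (an assumption made so the snippet runs without the file); the names `raw`, `without_header_arg`, and `with_names` are illustrative only.

```python
import io
import pandas as pd

# Inline stand-in for a header-less CSV file (illustrative sample rows).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# Default behaviour: the first data row is consumed as the header.
without_header_arg = pd.read_csv(io.StringIO(raw))

# header=None plus names=... keeps every row as data and labels the columns.
with_names = pd.read_csv(io.StringIO(raw), header=None, names=columns)

print(without_header_arg.shape)  # one row fewer, because it became the header
print(with_names.shape)
print(list(with_names.columns))
```

With the default settings one record is silently lost to the header, which is easy to miss on a 150-row file.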

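Checking Rows and Columns

Check the number of rows and columns, which usually correspond to the number of records and the number of fields. Here there are 150 records and 5 columns (4 measurements plus the class label).

# Show the number of rows and columns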
print(df.shape)

(150, 5)
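
Since shape is a plain (rows, columns) tuple, it can be unpacked directly. A small sketch on a toy DataFrame (the frame and the variable names are made up for illustration):

```python
import pandas as pd

# Toy frame: 3 records, 2 fields.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a tuple, so it unpacks into row and column counts.
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```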

Viewing the First Row

Confirm what the data currently looks like.

# Show the first row of the data
print(df.head(1))

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa

Viewing Data Information

Inspect the column names, non-null counts, and dtypes of the DataFrame.

# Data types and columns
# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
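
The same facts that info() reports can also be checked programmatically, which is handy in scripts. The sketch below assumes a small inline sample in the iris.data layout (so it runs without the file); `numeric_cols` and `missing_per_col` are illustrative names.

```python
import io
import pandas as pd

# Inline stand-in rows in the same layout as iris.data (illustrative sample).
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Columns pandas parsed as numeric.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Number of missing values in each column.
missing_per_col = df.isna().sum()

print(numeric_cols)
print(missing_per_col)
```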

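Inspecting a Column

The info() output shows that class has the object dtype. In pandas, object is how string columns are typically stored, and here it holds the class labels. Use value_counts() to see how many records each label has.

# Class distribution (class column)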
print(df["class"].value_counts())

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
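
As a possible follow-up, value_counts(normalize=True) gives proportions instead of counts, and astype("category") stores the labels as a categorical column. This is a sketch on a toy Series standing in for df["class"], not output from the full dataset.

```python
import pandas as pd

# Toy labels standing in for df["class"] (illustrative values only).
labels = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    name="class",
)

# Proportion of each label instead of the raw count.
shares = labels.value_counts(normalize=True)

# Store the labels as a categorical dtype; categories are inferred from the data.
as_category = labels.astype("category")

print(shares)
print(as_category.cat.categories.tolist())
```

A balanced dataset such as Iris would show equal proportions for every class.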

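Describing the Feature Values

describe() summarizes the distribution of each numeric column, which helps reveal extreme values or tightly concentrated values.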

print(df.describe())

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
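
One common way to turn the 25% and 75% rows into an extreme-value check is the 1.5 × IQR rule. This rule is a general convention brought in here for illustration, not something the walkthrough above prescribes, and the sketch runs on made-up values rather than on the Iris columns.

```python
import pandas as pd

# Made-up measurements with one deliberately extreme value (not Iris data).
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 50.0], name="toy_feature")

# The same quartiles that describe() reports as 25% and 75%.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Flag anything more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```

The same mask could be applied to a column such as df["sepal_width"] once the real data is loaded.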