Recommended: 4 Ways to Implement Feature Selection for Machine Learning in Python (with Code)

Author: Sugandha Lahoti; Translator: Li Jie (李潔); Proofreader: Yang Guang (楊光)

This article is about 3,500 characters long; the suggested reading time is 13 minutes.

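Note: This article is an excerpt from the book Ensemble Machine Learning by Ankit Dixit. The book combines powerful machine learning algorithms to build optimized models and can serve as a guide for beginners.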

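In this article, we look at different methods for selecting features from a dataset and discuss the types of feature selection algorithms, with implementations in Python using the Scikit-learn (sklearn) library: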

Univariate selection
Recursive feature elimination (RFE)
Principal component analysis (PCA)
Choosing important features (feature importance)

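We briefly cover the first three algorithms and their implementations. We then discuss in detail how to choose important features (feature importance), a technique widely used in the data science community.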

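Univariate selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable.

The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.

The example below uses the chi-squared (chi²) statistical test for non-negative features to select the four best features from the Pima Indians onset of diabetes dataset: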

# Feature extraction with univariate statistical tests (chi-squared for classification)
# Import the required packages
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import SelectKBest
# Import chi2 to perform the chi-squared test
from sklearn.feature_selection import chi2
# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:, 0:8]
Y = array[:, 8]
# Select the four best features using the chi-squared test
test = SelectKBest(score_func=chi2, k=4)
# Fit the selector so that it scores every feature
fit = test.fit(X, Y)
# Summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
# Apply the transformation to the dataset
features = fit.transform(X)
# Summarize selected features
print(features[0:5, :])

You can see the score for each attribute and the four attributes chosen (those with the highest scores): plas, test, mass, and age.

The score for each feature is:

[111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]

The selected features are:

[[148. 0. 33.6 50. ]
 [85. 0. 26.6 31. ]
 [183. 0. 23.3 32. ]
 [89. 94. 28.1 21. ]
 [137. 168. 43.1 33. ]]
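The printed array does not say which columns were kept. A minimal sketch of mapping the selection back to column names with get_support(), assuming a small made-up table of non-negative counts in place of the Pima data (the values and the names list are hypothetical and serve only as illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 6 samples, 4 non-negative count features
X = np.array([
    [1, 10, 3, 0],
    [2, 12, 3, 1],
    [1, 11, 3, 0],
    [9, 30, 3, 1],
    [8, 29, 3, 0],
    [9, 31, 3, 1],
])
y = np.array([0, 0, 0, 1, 1, 1])
names = ["a", "b", "c", "d"]  # hypothetical column names

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# get_support() returns a boolean mask over the input columns
mask = selector.get_support()
selected = [name for name, keep in zip(names, mask) if keep]
print(selected)
```

In this toy table the first two columns differ strongly between the classes, so they are the ones the mask keeps.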

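Recursive feature elimination (RFE)

RFE works by recursively removing attributes and building a model on those that remain. It uses model accuracy to identify which attributes (and combinations of attributes) contribute the most to predicting the target attribute. You can learn more about the RFE class in the scikit-learn documentation.

The example below uses RFE with the logistic regression algorithm to select the top three features. The choice of algorithm does not matter much, as long as it is skillful and consistent: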

# Import the required packages
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's feature selection algorithm
from sklearn.feature_selection import RFE
# Import LogisticRegression to use as the estimator inside RFE
from sklearn.linear_model import LogisticRegression
# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# Create a pandas data frame by loading the data from the URL
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:, :8]
Y = array[:, 8]
# Feature extraction
model = LogisticRegression()
# Pass the number of features by keyword, as recent scikit-learn versions require
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

After running the code above, we get:

Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 3 5 6 1 1 4]

You can see that RFE chose the top three features: preg, mass, and pedi. These are marked True in the support_ array and ranked first (marked 1) in the ranking_ array.
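To make the elimination loop concrete, here is a rough hand-written sketch of the idea. It is not scikit-learn's implementation: as far as we know, the library's RFE ranks features by the estimator's coef_ or feature_importances_, and this sketch simply drops the feature with the smallest absolute logistic regression coefficient on each pass. The data comes from make_classification and is a synthetic stand-in, not the Pima dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the Pima dataset)
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

remaining = list(range(X.shape[1]))  # indices of features still in play
while len(remaining) > 3:
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # Find the remaining feature with the smallest absolute coefficient
    weakest = int(np.argmin(np.abs(model.coef_[0])))
    # Drop that feature; the next pass refits on the rest
    remaining.pop(weakest)

print("Kept feature indices:", remaining)
```

Refitting after every removal matters: the coefficients of the surviving features can shift once a correlated feature is gone.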

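Principal component analysis (PCA)

PCA uses linear algebra to transform the dataset into a compressed form. It is generally regarded as a data reduction technique. One property of PCA is that you can choose the number of dimensions, or principal components, in the transformed result.

In the following example, we use PCA and select three principal components: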

# Import the required packages
# Import pandas to read the CSV file
import pandas
# Import numpy for array-related operations
import numpy
# Import sklearn's PCA algorithm
from sklearn.decomposition import PCA
# URL for loading the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# Define the attribute names
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
# Create an array from the data values
array = dataframe.values
# Split the data into input and target
X = array[:, 0:8]
Y = array[:, 8]
# Feature extraction
pca = PCA(n_components=3)
fit = pca.fit(X)
# Summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

You can see that the transformed dataset (three principal components) bears little resemblance to the source data:

Explained Variance: [ 0.88854663 0.06159078 0.02579012]
[[ -2.02176587e-03 9.78115765e-02 1.60930503e-02 6.07566861e-02 9.93110844e-01 1.40108085e-02 5.37167919e-04 -3.56474430e-03]
 [ -2.26488861e-02 -9.72210040e-01 -1.41909330e-01 5.78614699e-02 9.46266913e-02 -4.69729766e-02 -8.16804621e-04 -1.40168181e-01]
 [ -2.24649003e-02 1.43428710e-01 -9.22467192e-01 -3.07013055e-01 2.09773019e-02 -1.32444542e-01 -6.39983017e-04 -1.25454310e-01]]
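The linear algebra behind these numbers can be sketched with NumPy alone. The snippet below is an illustration on synthetic random data (an assumption, not the Pima dataset): it centers the data, takes a singular value decomposition, and checks that the resulting variance ratios agree with what PCA reports.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in data: 100 samples, 5 features with different scales
X = rng.normal(size=(100, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

# PCA by hand: center the data, then take the singular value decomposition
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_variance = S ** 2 / (X.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()

# Compare with scikit-learn on the first three components
pca = PCA(n_components=3).fit(X)
print(ratio[:3])
print(pca.explained_variance_ratio_)

# Projecting onto the components gives the reduced dataset
reduced = pca.transform(X)
print(reduced.shape)
```

Each row of components_ is a direction in the original feature space, which is why the transformed columns no longer correspond to any single input feature.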

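Choosing important features (feature importance)

Feature importance is a technique for selecting features using a trained supervised classifier. When we train a classifier such as a decision tree, we evaluate each attribute to create splits; we can use this measure as a feature selector. Let's look at it in detail.

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness, and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

A random forest consists of a number of decision trees. Every node in a decision tree is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure by which the (locally) optimal condition is chosen is called impurity. For classification it is typically Gini impurity or information gain/entropy, and for regression trees it is variance. Thus, when training a tree, we can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged, and the features ranked according to this measure.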
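As a small worked example of the impurity measure described above, the sketch below computes Gini impurity and the size-weighted impurity decrease of one candidate split. The labels are made up for illustration, and the helper names gini and impurity_decrease are our own, not part of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

# Hypothetical node with 4 samples of class 0 and 4 of class 1
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# A candidate split that leaves one misplaced sample on each side
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print(gini(parent))                            # 0.5
print(impurity_decrease(parent, left, right))  # 0.125
```

A feature's importance in a tree is then the total of such decreases over all nodes that split on it, and a forest averages that total across its trees.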

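Let's see how to do feature selection using a random forest classifier and evaluate the accuracy of the classifier before and after feature selection. We will use the Otto dataset, which is available for free from Kaggle (you need to sign up to Kaggle to download it). You can download the training set, train.csv.zip, from https://www.kaggle.com/c/otto-group-product-classification-challenge/data and place the unzipped train.csv file in your working directory.

This dataset describes 93 obfuscated details of more than 61,000 products grouped into 10 product categories (for example, fashion, electronics, and so on). The input attributes are counts of different events of some kind.

The goal is to make predictions for new products as an array of probabilities for each of the 10 categories; models are evaluated using multiclass logarithmic loss (also called cross-entropy).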
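For readers unfamiliar with the metric, multiclass logarithmic loss is the mean negative log of the probability a model assigns to the true class. A minimal sketch, assuming made-up predictions for three products over three classes (the function name multiclass_log_loss and the clipping constant eps are our own choices):

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    probs = np.clip(probs, eps, 1 - eps)
    # Renormalize so each row still sums to 1 after clipping
    probs = probs / probs.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return -np.mean(np.log(probs[rows, y_true]))

# Hypothetical predictions for 3 products over 3 classes
y_true = np.array([0, 2, 1])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
])
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))
```

Confident wrong predictions are penalized heavily, which is why the probabilities are clipped away from exactly 0 before taking the logarithm.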

We will start by importing all the libraries:

# Import the supporting libraries
# Import pandas to load the dataset from the CSV file
from pandas import read_csv
# Import numpy for array-based operations and calculations
import numpy as np
# Import the random forest classifier class from sklearn
from sklearn.ensemble import RandomForestClassifier
# Import the SelectFromModel feature selector class from sklearn
from sklearn.feature_selection import SelectFromModel
np.random.seed(1)

Next, we define a function that splits our dataset into training and testing data; we train the model on the training part and use the testing part to evaluate the trained model:

# Function to create train and test sets from the original dataset
def getTrainTestData(dataset, split):
    np.random.seed(0)
    training = []
    testing = []
    np.random.shuffle(dataset)
    shape = np.shape(dataset)
    # Use a plain int; np.uint16 cannot hold row counts above 65,535
    trainlength = int(np.floor(split * shape[0]))
    for i in range(trainlength):
        training.append(dataset[i])
    for i in range(trainlength, shape[0]):
        testing.append(dataset[i])
    training = np.array(training)
    testing = np.array(testing)
    return training, testing
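The same split can be written more compactly by shuffling row indices instead of appending rows one at a time. This is an alternative sketch, not the book's code; the function name split_rows and its seed parameter are assumptions, and unlike the function above it leaves the input array unshuffled.

```python
import numpy as np

def split_rows(dataset, split, seed=0):
    """Shuffle row indices and cut them at the given fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(dataset))
    cut = int(np.floor(split * len(dataset)))
    return dataset[order[:cut]], dataset[order[cut:]]

# Hypothetical 20-row dataset with 3 columns
data = np.arange(60).reshape(20, 3)
train, test = split_rows(data, 0.75)
print(train.shape, test.shape)
```

Every row ends up in exactly one of the two parts, and a fixed seed keeps the split reproducible between runs.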

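We also need a function to evaluate the accuracy of the model; it takes the predicted and actual outputs as input and returns the fraction of correct predictions, which is later multiplied by 100 to get a percentage:

# Function to evaluate model performance
def getAccuracy(pre, ytest):
    count = 0
    for i in range(len(ytest)):
        if ytest[i] == pre[i]:
            count += 1
    acc = float(count) / len(ytest)
    return acc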
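The counting loop can also be expressed as a single vectorized comparison. A minimal equivalent sketch, with made-up labels (the function name accuracy is our own):

```python
import numpy as np

def accuracy(pre, ytest):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(pre) == np.asarray(ytest)))

# Hypothetical labels: 3 of the 4 predictions are correct
print(accuracy(["a", "b", "c", "a"], ["a", "b", "c", "c"]))  # 0.75
```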

Now we import the dataset. We will import the train.csv file, which contains more than 61,000 training instances. Our example uses 50,000 instances, of which 35,000 are used to train the classifier and 15,000 to test its performance:

# Load the dataset as a pandas data frame
data = read_csv('train.csv')
# Extract attribute names from the data frame
feat = data.keys()
# Index.get_values() is unavailable in recent pandas releases; .values returns the same array
feat_labels = feat.values
# Extract data values from the data frame
dataset = data.values
# Shuffle the dataset
np.random.shuffle(dataset)
# We will select 50000 instances to train the classifier
inst = 50000
# Extract 50000 instances from the dataset
dataset = dataset[0:inst, :]
# Create training and testing data for performance evaluation
train, test = getTrainTestData(dataset, 0.7)
# Split the data into input and output variables
Xtrain = train[:, 0:94]
ytrain = train[:, 94]
shape = np.shape(Xtrain)
print("Shape of the dataset ", shape)
# Print the size of the data in MB
print("Size of Data set before feature selection: %.2f MB" % (Xtrain.nbytes / 1e6))

26.

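Note the size of the data here: with about 35,000 training instances and 94 attributes, our dataset is quite large. Let's take a look:

Shape of the dataset (35000, 94)
Size of Data set before feature selection: 26.32 MB

As you can see, our dataset has 35,000 rows and 94 columns, and its size is more than 26 MB.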
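The reported size follows directly from the array shape. A quick check of the arithmetic, assuming every element occupies 8 bytes (true for float64 values and for object pointers on a 64-bit system):

```python
import numpy as np

rows, cols = 35000, 94
# Assumption: every element occupies 8 bytes
bytes_per_element = np.dtype(np.float64).itemsize

size_mb = rows * cols * bytes_per_element / 1e6
print("%.2f MB" % size_mb)  # 26.32 MB
```

This is the same quantity that nbytes reports: it counts only the array buffer, not any Python objects the elements may point to.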

In the next code block, we configure our random forest classifier; we use 250 trees with a maximum depth of 30 and 7 random features. The other hyperparameters keep sklearn's default values:

# Select the test data for model evaluation
Xtest = test[:, 0:94]
ytest = test[:, 94]
# Create a random forest classifier with the following parameters
trees = 250
max_feat = 7
max_depth = 30
min_sample = 2
clf = RandomForestClassifier(n_estimators=trees, max_features=max_feat, max_depth=max_depth, min_samples_split=min_sample, random_state=0, n_jobs=-1)
# Train the classifier and measure the training time
import time
start = time.time()
clf.fit(Xtrain, ytrain)
end = time.time()
# Note down the model training time
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
# Output reported for this step:
# Execution time for building the Tree is: 2.913641
pre = clf.predict(Xtest)
# Evaluate the model performance on the test data
acc = getAccuracy(pre, ytest)
print("Accuracy of model before feature selection is %.2f" % (100 * acc))

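The accuracy of the model is:

Accuracy of model before feature selection is 98.82

As you can see, we get very good accuracy: almost 99% of the test data is classified into the correct category. That means about 14,823 of the 15,000 instances are classified correctly.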

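So now the question is: should we improve further? Well, why not? We should certainly go for more improvement if we can; here, we will use feature importance to select features. As you know, during tree building we use an impurity measure to select nodes: the attribute value with the least impurity is chosen as a node in the tree. We can use a similar criterion for feature selection, giving more importance to features with less impurity. This can be done using the feature_importances_ attribute of the trained sklearn classifier. Let's look at the importance of each feature: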

# Once the model is trained, list the importance of every feature
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)

Output (truncated):

('id', 0.33346650420175183)
('feat_1', 0.0036186958628801214)
('feat_2', 0.0037243050888530957)
('feat_3', 0.011579217472062748)
('feat_4', 0.010297382675187445)
('feat_5', 0.0010359139416194116)
('feat_6', 0.00038171336038056165)
('feat_7', 0.0024867672489765021)
('feat_8', 0.0096689721610546085)
('feat_9', 0.007906150362995093)
('feat_10', 0.0022342480802130366)

As you can see, each feature has a different importance depending on how much it contributes to the final prediction.
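With many features, it is often easier to read the importances sorted from largest to smallest. A minimal sketch on synthetic data from make_classification (an assumption; not the Otto dataset), with hypothetical feature names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Otto dataset)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
labels = ["feat_%d" % i for i in range(1, X.shape[1] + 1)]  # hypothetical names

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Sort feature indices from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
for idx in order:
    print(labels[idx], round(clf.feature_importances_[idx], 4))
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity decrease.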

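We will use these importance scores to rank our features; in the following section, we select the features with an importance greater than 0.01 for model training: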

# Select the features that contribute most to the final prediction
sfm = SelectFromModel(clf, threshold=0.01)
sfm.fit(Xtrain, ytrain)
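What the threshold does can be checked by hand: the selector keeps the features whose importance reaches the threshold. The sketch below illustrates this on synthetic data (an assumption; not the Otto dataset), with a threshold of 0.05 chosen only for this toy example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption; not the Otto dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
sfm = SelectFromModel(forest, threshold=0.05).fit(X, y)

# The selector keeps every feature whose importance reaches the threshold
importances = sfm.estimator_.feature_importances_
manual_mask = importances >= 0.05
print(sfm.get_support())
print(manual_mask)
print(sfm.transform(X).shape)
```

Raising the threshold keeps fewer features, so in practice it is a knob to tune against held-out accuracy rather than a fixed constant.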

Here, we transform the input dataset according to the selected features. In the next code block, we transform the dataset and then check the size and shape of the new dataset:

# Transform the input dataset
Xtrain_1 = sfm.transform(Xtrain)
Xtest_1 = sfm.transform(Xtest)
# Check the size and shape of the new dataset
print("Size of Data set after feature selection: %.2f MB" % (Xtrain_1.nbytes / 1e6))
shape = np.shape(Xtrain_1)
print("Shape of the dataset ", shape)

Size of Data set after feature selection: 5.60 MB
Shape of the dataset (35000, 20)

12.

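See the shape of the dataset? After feature selection we are left with only 20 features, which reduces the size of the dataset from 26 MB to 5.60 MB, about 80% smaller than the original.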

In the next code block, we train a random forest classifier with the same hyperparameters as before and test it on the test set. Let's see what accuracy we get after modifying the training set:

# Model training time
start = time.time()
clf.fit(Xtrain_1, ytrain)
end = time.time()
print("Execution time for building the Tree is: %f" % (float(end) - float(start)))
# Evaluate the model on the test data
pre = clf.predict(Xtest_1)
acc2 = getAccuracy(pre, ytest)
print("Accuracy after feature selection %.2f" % (100 * acc2))

Execution time for building the Tree is: 1.711518
Accuracy after feature selection 99.97

12.

See that! With the modified dataset we get 99.97% accuracy, which means 14,996 instances are classified correctly, compared with 14,823 before.

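This is a big improvement from the feature selection process; we can summarize all the results in the following table:

Evaluation criterion	Before feature selection	After feature selection
Number of features	94	20
Dataset size	26.32 MB	5.60 MB
Training time	2.91 s	1.71 s
Accuracy	98.82%	99.97%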

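The table above shows the practical advantages of feature selection. We reduced the number of features significantly, which reduces the model complexity and the dimensionality of the dataset. With fewer dimensions, training takes less time, and in the end we overcame overfitting and obtained higher accuracy than before.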
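The relative savings can be worked out from the figures in the table above. A short arithmetic sketch (the dictionary keys are our own labels):

```python
# Figures taken from the table above: (before, after)
results = {
    "features": (94, 20),
    "size_mb": (26.32, 5.60),
    "train_time_s": (2.91, 1.71),
}

reduction = {}
for name, (before, after) in results.items():
    # Relative reduction as a percentage of the original value
    reduction[name] = 100.0 * (before - after) / before
    print("%s: %.1f%% smaller" % (name, reduction[name]))
```

The feature count and the dataset size both shrink by roughly 79%, while the training time drops by roughly 41%.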

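In this article, we explored four ways to do feature selection in machine learning.

If you found this article useful, read the book Ensemble Machine Learning to learn more about stacked generalization and other techniques.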

Original title:

4 ways to implement feature selection in Python for machine learning

Original link:

https://hub.packtpub.com/4-ways-implement-feature-selection-python-machine-learning/

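About the translator: Li Jie (李潔) holds a master's degree in telecommunications from the Hong Kong University of Science and Technology and is currently a teaching assistant in the Department of Data Science at Beijing Normal University-Hong Kong Baptist University United International College. She enjoys data science, reading, studying code, and handicrafts.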
