Scikit-learn releases a new version: upgrade with one line of code
Shisan, from Aofeisi
QbitAI report | WeChat official account QbitAI
Scikit-learn, the powerful Python package, has long been a favorite of machine learning practitioners.
Recently, the scikit-learn team released the final version of 0.22.
This update fixes many bugs from older versions and ships a number of new features.
Installing the latest scikit-learn is also simple.
With pip:
pip install --upgrade scikit-learn
With conda:
conda install scikit-learn
Below are the ten highlights of this release.
A brand-new plotting API
For visualization tasks, scikit-learn introduces a brand-new plotting API.
The new API lets you quickly adjust the visuals of a plot without any recomputation.
It is also possible to add different plots to the same figure.
For example:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import plot_roc_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svc = SVC(random_state=42)
svc.fit(X_train, y_train)
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)

# Draw both ROC curves on the same axes by reusing svc_disp.ax_
svc_disp = plot_roc_curve(svc, X_test, y_test)
rfc_disp = plot_roc_curve(rfc, X_test, y_test, ax=svc_disp.ax_)
rfc_disp.figure_.suptitle("ROC curve comparison")
plt.show()
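The "no recomputation" part comes from the Display objects that plot_roc_curve returns: they store the computed curve, so it can be redrawn with different styling without refitting or rescoring anything. A minimal sketch of this, reusing the svc_disp and rfc_disp objects from the block above (this extra step is our addition, not part of the original example):

fig, ax = plt.subplots()
svc_disp.plot(ax=ax, linestyle='--')  # redraws the stored curve, nothing is recomputed
rfc_disp.plot(ax=ax)
plt.show()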
StackingClassifier and StackingRegressor
StackingClassifier and StackingRegressor allow users to have a stack of estimators with a final classifier/regressor.
Stacked generalization stacks the outputs of the individual estimators and uses a classifier to compute the final prediction.
The base estimators are fitted on the full X, while the final estimator is trained on the cross-validated predictions of the base estimators, obtained with cross_val_predict.
For example:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
clf.fit(X_train, y_train).score(X_test, y_test)
Output: 0.9473684210526315.
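The regression counterpart works the same way. Here is a minimal sketch with StackingRegressor on the diabetes dataset (our own illustration, with arbitrarily chosen base estimators, not part of the original post):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR

X, y = load_diabetes(return_X_y=True)
reg = StackingRegressor(
    estimators=[('svr', LinearSVR(random_state=42)),
                ('rf', RandomForestRegressor(n_estimators=10, random_state=42))],
    final_estimator=RidgeCV())
print(reg.fit(X, y).score(X, y))  # R^2 of the stacked model on the training data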
Permutation-based feature importance
inspection.permutation_importance can be used to estimate the importance of each feature, for any fitted estimator:
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(random_state=0, n_features=5, n_informative=3)
rf = RandomForestClassifier(random_state=0).fit(X, y)
# Permute each feature 10 times and measure the resulting drop in score
result = permutation_importance(rf, X, y, n_repeats=10,
                                random_state=0, n_jobs=-1)

fig, ax = plt.subplots()
sorted_idx = result.importances_mean.argsort()
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=range(X.shape[1]))
ax.set_title("Permutation Importance of each feature")
ax.set_ylabel("Features")
fig.tight_layout()
plt.show()
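One caveat worth noting: importances computed on the training set can be misleading for models that overfit, and computing them on held-out data is usually more informative. A minimal sketch of that variant (our addition, reusing the imports from the block above):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
# Score drops are now measured on data the model has never seen
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)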
Native support for missing values in gradient boosting
ensemble.HistGradientBoostingClassifier and ensemble.HistGradientBoostingRegressor now have native support for missing values (NaNs). This means there is no need to impute the data when training or predicting.
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
import numpy as np

X = np.array([0, 1, 2, np.nan]).reshape(-1, 1)
y = [0, 0, 1, 1]

gbdt = HistGradientBoostingClassifier(min_samples_leaf=1).fit(X, y)
print(gbdt.predict(X))
Output: [0 0 1 1].
Precomputed sparse nearest neighbors graph
Most estimators based on nearest neighbors graphs now accept precomputed sparse graphs as input, so the same graph can be reused for multiple estimator fits.
To use this feature in a pipeline, use the memory parameter together with one of the two new transformers, neighbors.KNeighborsTransformer and neighbors.RadiusNeighborsTransformer.
The precomputation can also be performed by custom estimators.
from tempfile import TemporaryDirectory
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsTransformer
from sklearn.manifold import Isomap
from sklearn.pipeline import make_pipeline

X, y = make_classification(random_state=0)

with TemporaryDirectory(prefix="sklearn_cache_") as tmpdir:
    estimator = make_pipeline(
        KNeighborsTransformer(n_neighbors=10, mode='distance'),
        Isomap(n_neighbors=10, metric='precomputed'),
        memory=tmpdir)
    estimator.fit(X)

    # We can decrease the number of neighbors and the graph will not be
    # recomputed.
    estimator.set_params(isomap__n_neighbors=5)
    estimator.fit(X)
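Outside of a pipeline, a precomputed sparse graph can also be passed directly to an estimator that accepts metric='precomputed'. A minimal sketch with DBSCAN as the downstream estimator (our own example; the estimator and the eps value are arbitrary choices, not from the original post):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.neighbors import kneighbors_graph

X, _ = make_classification(random_state=0)
# Build the sparse distance graph once...
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')
# ...and feed it to the estimator instead of the raw features
clustering = DBSCAN(eps=2.0, metric='precomputed').fit(graph)
print(clustering.labels_[:10])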
KNN-based imputation
scikit-learn now supports using k-nearest neighbors to fill in missing values: each missing entry is imputed with the mean of that feature over the sample's nearest neighbors, using a NaN-aware Euclidean distance by default.
import numpy as np
from sklearn.impute import KNNImputer

X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
Output:
[[1. 2. 4. ]
[3. 4. 3. ]
[5.5 6. 5. ]
[8. 8. 7. ]]
Tree pruning
Most tree-based estimators can now be pruned once the trees are built. The pruning is based on minimal cost-complexity and is controlled by the ccp_alpha parameter:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)

rf = RandomForestClassifier(random_state=0, ccp_alpha=0).fit(X, y)
print("Average number of nodes without pruning {:.1f}".format(
    np.mean([e.tree_.node_count for e in rf.estimators_])))

rf = RandomForestClassifier(random_state=0, ccp_alpha=0.05).fit(X, y)
print("Average number of nodes with pruning {:.1f}".format(
    np.mean([e.tree_.node_count for e in rf.estimators_])))
Output:
Average number of nodes without pruning 22.3
Average number of nodes with pruning 6.4
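To pick a sensible ccp_alpha, decision trees expose cost_complexity_pruning_path, which computes the effective alphas at which nodes get pruned. A minimal sketch on a single tree (our addition, reusing X and y from the block above):

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)
# Larger alphas prune more aggressively; total leaf impurity grows accordingly
print(path.ccp_alphas)
print(path.impurities)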
Retrieving dataframes from OpenML
datasets.fetch_openml can now return a pandas dataframe, and thus properly handle datasets with heterogeneous data:
from sklearn.datasets import fetch_openml

titanic = fetch_openml('titanic', version=1, as_frame=True)
print(titanic.data.head()[['pclass', 'embarked']])
Output:
pclass embarked
0 1.0 S
1 1.0 S
2 1.0 S
3 1.0 S
4 1.0 S
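A quick way to see the heterogeneity being preserved (our addition, not in the original post): the returned DataFrame keeps per-column dtypes, numeric and categorical alike.

print(titanic.data.dtypes[['pclass', 'embarked']])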
Checking an estimator's scikit-learn compatibility
Developers can check the compatibility of their scikit-learn-compatible estimators using check_estimator.
scikit-learn now also provides a pytest-specific decorator, which lets pytest run all checks independently and report the ones that fail.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils.estimator_checks import parametrize_with_checks


@parametrize_with_checks([LogisticRegression, DecisionTreeRegressor])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
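Saved in a file such as test_estimators.py (the file name here is just an example), this runs under a plain pytest test_estimators.py, with each check reported as its own test case.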
ROC AUC now supports multiclass classification
The roc_auc_score function can also be used in multiclass classification.
Two averaging strategies are currently supported:
the one-vs-one algorithm computes the average of the pairwise ROC AUC scores;
the one-vs-rest algorithm computes the average of the ROC AUC scores of each class against all other classes.
In both cases, the multiclass ROC AUC scores are computed from the model's probability estimates that a sample belongs to a particular class.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_classes=4, n_informative=16)
clf = SVC(decision_function_shape='ovo', probability=True).fit(X, y)
print(roc_auc_score(y, clf.predict_proba(X), multi_class='ovo'))
Output: 0.9957333333333332 (the exact value varies, since make_classification is not seeded here).
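The one-vs-rest strategy is selected the same way. A minimal variant of the last line (our addition), optionally weighting each class by its prevalence:

print(roc_auc_score(y, clf.predict_proba(X), multi_class='ovr', average='weighted'))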
Links
Twitter:
https://twitter.com/scikit_learn/status/1201847227561529346
Blog:
https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_0_22_0.html#new-plotting-api
User guide:
https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics