[python] Python에서 ROC 곡선을 그리는 방법

Question 1

로지스틱 회귀 패키지를 사용하여 Python에서 개발 한 예측 모델의 정확성을 평가하기 위해 ROC 곡선을 그리려고합니다. 나는 참 양성율과 거짓 양성율을 계산했습니다. 그러나 matplotlibAUC 값을 사용하여 올바르게 플롯 하고 계산하는 방법을 알아낼 수 없습니다 . 어떻게 할 수 있습니까?

Question 2

다음은 modelsklearn 예측 변수 라고 가정하여 시도 할 수있는 두 가지 방법입니다 .

import sklearn.metrics as metrics
# calculate the fpr and tpr for all thresholds of the classification
probs = model.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

# method I: plt
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

# method II: ggplot
from ggplot import *
df = pd.DataFrame(dict(fpr = fpr, tpr = tpr))
ggplot(df, aes(x = 'fpr', y = 'tpr')) + geom_line() + geom_abline(linetype = 'dashed')

또는 시도

ggplot(df, aes(x = 'fpr', ymin = 0, ymax = 'tpr')) + geom_line(aes(y = 'tpr')) + geom_area(alpha = 0.2) + ggtitle("ROC Curve w/ AUC = %s" % str(roc_auc))

Question 3

이것은 일련의 Ground Truth 레이블과 예측 된 확률을 고려할 때 ROC 곡선을 그리는 가장 간단한 방법입니다. 가장 중요한 부분은 모든 클래스에 대한 ROC 곡선을 플로팅하므로 여러 개의 깔끔한 곡선도 얻을 수 있습니다.

import scikitplot as skplt
import matplotlib.pyplot as plt

y_true = # ground truth labels
y_probas = # predicted probabilities generated by sklearn classifier
skplt.metrics.plot_roc_curve(y_true, y_probas)
plt.show()

다음은 plot_roc_curve에 의해 생성 된 샘플 곡선입니다. scikit-learn의 샘플 숫자 데이터 세트를 사용 했으므로 10 개의 클래스가 있습니다. 각 클래스에 대해 하나의 ROC 곡선이 그려져 있습니다.

면책 조항 : 이것은 내가 만든 scikit-plot 라이브러리를 사용합니다 .

Question 4

여기서 문제가 무엇인지는 전혀 명확하지 않지만 배열 true_positive_rate과 배열이있는 경우 false_positive_rateROC 곡선을 플로팅하고 AUC를 얻는 것은 다음과 같이 간단합니다.

import matplotlib.pyplot as plt
import numpy as np

x = # false_positive_rate
y = # true_positive_rate 

# This is the ROC curve
plt.plot(x,y)
plt.show()

# This is the AUC
auc = np.trapz(y,x)

Question 5

matplotlib를 사용한 이진 분류에 대한 AUC 곡선

from sklearn import svm, datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

유방암 데이터 세트 불러 오기

breast_cancer = load_breast_cancer()

X = breast_cancer.data
y = breast_cancer.target

데이터 세트 분할

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33, random_state=44)

모델

clf = LogisticRegression(penalty='l2', C=0.1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

정확성

print("Accuracy", metrics.accuracy_score(y_test, y_pred))

AUC 곡선

y_pred_proba = clf.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Question 6

다음은 ROC 곡선을 계산하기위한 Python 코드입니다 (산점도).

import matplotlib.pyplot as plt
import numpy as np

score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.54, 0.53, 0.52, 0.51, 0.505, 0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.30, 0.1])
y = np.array([1,1,0, 1, 1, 1, 0, 0, 1, 0, 1,0, 1, 0, 0, 0, 1 , 0, 1, 0])

# false positive rate
fpr = []
# true positive rate
tpr = []
# Iterate thresholds from 0.0, 0.01, ... 1.0
thresholds = np.arange(0.0, 1.01, .01)

# get number of positive and negative examples in the dataset
P = sum(y)
N = len(y) - P

# iterate through all thresholds and determine fraction of true positives
# and false positives found at this threshold
for thresh in thresholds:
    FP=0
    TP=0
    for i in range(len(score)):
        if (score[i] > thresh):
            if y[i] == 1:
                TP = TP + 1
            if y[i] == 0:
                FP = FP + 1
    fpr.append(FP/float(N))
    tpr.append(TP/float(P))

plt.scatter(fpr, tpr)
plt.show()

Question 7

from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt

y_true = # true labels
y_probas = # predicted results
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_probas, pos_label=0)

# Print ROC curve
plt.plot(fpr,tpr)
plt.show()

# Print AUC
auc = np.trapz(tpr,fpr)
print('AUC:', auc)

Question 8

이전 답변은 실제로 TP / Sens를 직접 계산했다고 가정합니다. 이 작업을 수동으로 수행하는 것은 나쁜 생각입니다. 계산에 실수를하기 쉽기 때문에이 모든 작업에 라이브러리 함수를 사용하십시오.

scikit_lean의 plot_roc 함수는 필요한 작업을 정확히 수행합니다.
http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

코드의 필수 부분은 다음과 같습니다.

  for i in range(n_classes):
      fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
      roc_auc[i] = auc(fpr[i], tpr[i])