[python] List the highest correlation pairs from a large correlation matrix in Pandas?

Question 1

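How do you find the top correlations in a correlation matrix with Pandas? There are many answers on how to do this with R ("Show correlations as an ordered list, not as a large matrix" or "Efficient way to get highly correlated pairs from large data set in Python or R"), but I am wondering how to do it with pandas. In my case the matrix is 4460×4460, so I can't do it visually.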

Question 2

You can use DataFrame.values to get a numpy array of the data and then use NumPy functions such as argsort() to get the most correlated pairs.
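As a minimal sketch of that NumPy route (the small example frame, its column names, and the variable names are assumptions made for illustration, not part of the original answer), one could sort only the strict upper triangle so that each pair appears once and the diagonal is skipped:

```python
import numpy as np
import pandas as pd

# Hypothetical example data; column "b" is built to track column "a" closely.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 6)), columns=list("abcdef"))
df["b"] += 3 * df["a"]

# Absolute correlation matrix as a plain NumPy array.
corr = df.corr().abs().values

# Positions of the strict upper triangle: each pair once, no diagonal.
rows, cols = np.triu_indices_from(corr, k=1)
values = corr[rows, cols]

# argsort is ascending, so reverse it and keep the first n positions.
n = 3
order = np.argsort(values)[::-1][:n]

top_pairs = [(df.columns[rows[i]], df.columns[cols[i]], values[i]) for i in order]
for left, right, value in top_pairs:
    print(left, right, round(value, 3))
```

Working on the flattened upper triangle halves the number of values to sort compared with sorting the full matrix.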

If you want to do this in pandas, however, you can unstack the correlation DataFrame and sort the result:

import pandas as pd
import numpy as np

shape = (50, 4460)

data = np.random.normal(size=shape)

data[:, 1000] += data[:, 2000]

df = pd.DataFrame(data)

c = df.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")

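# The last 4460 entries are the self-correlations (1.0) on the diagonal,
# so take the 10 entries just before them.
print(so[-4470:-4460])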

Here is the output:

2192  1522    0.636198
1522  2192    0.636198
3677  2027    0.641817
2027  3677    0.641817
242   130     0.646760
130   242     0.646760
1171  2733    0.670048
2733  1171    0.670048
1000  2000    0.742340
2000  1000    0.742340
dtype: float64

Question 3

@HYRY's answer is perfect. This just builds on that answer by adding a bit more logic to avoid duplicates and self-correlations and to sort properly:

import pandas as pd
d = {'x1': [1, 4, 4, 5, 6],
     'x2': [0, 0, 8, 2, 4],
     'x3': [2, 8, 8, 10, 12],
     'x4': [-1, -4, -4, -4, -5]}
df = pd.DataFrame(data = d)
print("Data Frame")
print(df)
print()

print("Correlation Matrix")
print(df.corr())
print()

def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df, 3))

That gives the following output:

Data Frame
   x1  x2  x3  x4
0   1   0   2  -1
1   4   0   8  -4
2   4   8   8  -4
3   5   2  10  -4
4   6   4  12  -5

Correlation Matrix
          x1        x2        x3        x4
x1  1.000000  0.399298  1.000000 -0.969248
x2  0.399298  1.000000  0.399298 -0.472866
x3  1.000000  0.399298  1.000000 -0.969248
x4 -0.969248 -0.472866 -0.969248  1.000000

Top Absolute Correlations
x1  x3    1.000000
x3  x4    0.969248
x1  x4    0.969248
dtype: float64

Question 4

A solution in a few lines, without redundant pairs of variables:

import numpy as np

corr_matrix = df.corr().abs()

#the matrix is symmetric so we need to extract upper triangle matrix without diagonal (k = 1)

sol = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))

#first element of sol series is the pair with the biggest correlation

Then you can iterate over the names of the variable pairs (the pandas.Series multi-index) and their values like this:

for index, value in sol.items():
    # index is a (row_label, column_label) tuple; do some stuff with it
    print(index, value)

Question 5

Combining some features of @HYRY's and @arun's answers, you can print the top correlations for a dataframe df in a single line using:

df.corr().unstack().sort_values().drop_duplicates()

Note: one downside is that if you have 1.0 correlations that are not a variable with itself, adding drop_duplicates() would remove them.
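One way to sidestep that downside is to remove redundant entries by position rather than by value. The sketch below reuses the toy data from the earlier answer, where x3 is exactly twice x1, so the x1/x3 pair has a correlation of 1.0 and is the kind of entry drop_duplicates() can discard along with the diagonal. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data from the earlier answer: x3 is exactly 2 * x1.
df = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                   'x2': [0, 0, 8, 2, 4],
                   'x3': [2, 8, 8, 10, 12],
                   'x4': [-1, -4, -4, -4, -5]})

corr = df.corr()

# Select the strict upper triangle by position: each pair once, no diagonal.
rows, cols = np.triu_indices_from(corr.values, k=1)
pairs = pd.Series(
    corr.values[rows, cols],
    index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
).sort_values()

# All six distinct pairs survive, including x1/x3 and pairs with equal values.
print(pairs)
```

Because nothing is dropped by value, two different pairs that happen to share the same correlation (such as x1/x4 and x3/x4 here) are both kept.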

Question 6

Use the code below to view the correlations in descending order.

# See the correlations in descending order

corr = df.corr() # df is the pandas dataframe
c1 = corr.abs().unstack()
c1.sort_values(ascending = False)

Question 7

You can do it graphically with this simple code, substituting your own data.

import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr()

kot = corr[corr>=.9]
plt.figure(figsize=(12,8))
sns.heatmap(kot, cmap="Greens")

Question 8

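I liked Addison Klinke's post the most, as it is the simplest, but used Wojciech Moszczyńsk's suggestion for filtering and charting, and extended the filter to avoid absolute values. So, given a large correlation matrix, filter it, chart it, and then flatten it: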

Create, filter, and chart

import matplotlib.pyplot as plt
import seaborn as sn

dfCorr = df.corr()
filteredDf = dfCorr[((dfCorr >= .5) | (dfCorr <= -.5)) & (dfCorr !=1.000)]
plt.figure(figsize=(30,10))
sn.heatmap(filteredDf, annot=True, cmap="Reds")
plt.show()

Function

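In the end, I created a small function to create the correlation matrix, filter it, and then flatten it. As an idea, it could easily be extended (e.g., asymmetric upper and lower bounds).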

def corrFilter(x: pd.DataFrame, bound: float):
    xCorr = x.corr()
    xFiltered = xCorr[((xCorr >= bound) | (xCorr <= -bound)) & (xCorr !=1.000)]
    xFlattened = xFiltered.unstack().sort_values().drop_duplicates()
    return xFlattened

corrFilter(df, .7)
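As a rough sketch of the asymmetric-bounds idea, the variant below takes separate negative and positive thresholds. The function name corr_filter_asymmetric and its lower/upper parameters are hypothetical and not part of the original post. It also reads only the strict upper triangle instead of calling drop_duplicates(), so unlike corrFilter it keeps genuine 1.0 correlations between different variables.

```python
import numpy as np
import pandas as pd

def corr_filter_asymmetric(x: pd.DataFrame, lower: float, upper: float) -> pd.Series:
    """Hypothetical variant of corrFilter with separate lower and upper bounds."""
    corr = x.corr()
    # Read only the strict upper triangle: each pair once, no self-correlations.
    rows, cols = np.triu_indices_from(corr.values, k=1)
    flat = pd.Series(
        corr.values[rows, cols],
        index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
    )
    # Keep correlations at or below `lower`, or at or above `upper`.
    return flat[(flat <= lower) | (flat >= upper)].sort_values()

# Toy data from the earlier answer in this thread.
df = pd.DataFrame({'x1': [1, 4, 4, 5, 6],
                   'x2': [0, 0, 8, 2, 4],
                   'x3': [2, 8, 8, 10, 12],
                   'x4': [-1, -4, -4, -4, -5]})

result = corr_filter_asymmetric(df, lower=-0.9, upper=0.5)
print(result)
```

With these bounds the toy data keeps the x1/x3, x1/x4 and x3/x4 pairs and drops the weaker ones.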