[python] Python에서 Pearson 상관 관계 및 의미 계산

두 목록을 입력으로 받아 Pearson correlation 과 상관 관계 의 중요성을 반환하는 함수를 찾고 있습니다.

답변

당신은 볼 수 있습니다 scipy.stats:

from pydoc import help
from scipy.stats.stats import pearsonr
help(pearsonr)

>>>
Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
 Calculates a Pearson correlation coefficient and the p-value for testing
 non-correlation.

 The Pearson correlation coefficient measures the linear relationship
 between two datasets. Strictly speaking, Pearson's correlation requires
 that each dataset be normally distributed. Like other correlation
 coefficients, this one varies between -1 and +1 with 0 implying no
 correlation. Correlations of -1 or +1 imply an exact linear
 relationship. Positive correlations imply that as x increases, so does
 y. Negative correlations imply that as x increases, y decreases.

 The p-value roughly indicates the probability of an uncorrelated system
 producing datasets that have a Pearson correlation at least as extreme
 as the one computed from these datasets. The p-values are not entirely
 reliable but are probably reasonable for datasets larger than 500 or so.

 Parameters
 ----------
 x : 1D array
 y : 1D array the same length as x

 Returns
 -------
 (Pearson's correlation coefficient,
  2-tailed p-value)

 References
 ----------
 http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

답변

Pearson 상관 관계는 numpy ‘s로 계산할 수 있습니다 corrcoef.

import numpy
numpy.corrcoef(list1, list2)[0, 1]

답변

대안은 다음 을 계산하는 linregress 의 기본 scipy 함수일 수 있습니다 .

기울기 : 회귀선의 기울기

가로 채기 : 회귀선 가로 채기

r- 값 : 상관 계수

p-value : 귀무 가설이 기울기가 0이라는 가설 검정의 양면 p- 값

stderr : 추정치의 표준 오차

그리고 여기 예가 있습니다 :

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3]
b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15]
from scipy.stats import linregress
linregress(a, b)

당신을 반환합니다 :

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

답변

scipy를 설치하고 싶지 않다면 Programming Collective Intelligence 에서 약간 수정 된이 빠른 해킹을 사용했습니다 .

(정확성을 위해 편집되었습니다.)

from itertools import imap

def pearsonr(x, y):
  # Assume len(x) == len(y)
  n = len(x)
  sum_x = float(sum(x))
  sum_y = float(sum(y))
  sum_x_sq = sum(map(lambda x: pow(x, 2), x))
  sum_y_sq = sum(map(lambda x: pow(x, 2), y))
  psum = sum(imap(lambda x, y: x * y, x, y))
  num = psum - (sum_x * sum_y/n)
  den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
  if den == 0: return 0
  return num / den

답변

다음 코드는 정의에 대한 간단한 해석입니다 .

import math

def average(x):
    assert len(x) > 0
    return float(sum(x)) / len(x)

def pearson_def(x, y):
    assert len(x) == len(y)
    n = len(x)
    assert n > 0
    avg_x = average(x)
    avg_y = average(y)
    diffprod = 0
    xdiff2 = 0
    ydiff2 = 0
    for idx in range(n):
        xdiff = x[idx] - avg_x
        ydiff = y[idx] - avg_y
        diffprod += xdiff * ydiff
        xdiff2 += xdiff * xdiff
        ydiff2 += ydiff * ydiff

    return diffprod / math.sqrt(xdiff2 * ydiff2)

테스트:

print pearson_def([1,2,3], [1,5,7])

보고

0.981980506062

이것은 엑셀, 동의 이 계산기 , SciPy (도 NumPy와 각각 0.981980506 및 0.9819805060619657 및 0.98198050606196574을 반환).

R :

> cor( c(1,2,3), c(1,5,7))
[1] 0.9819805

편집 : 댓글 작성자가 지적한 버그를 수정했습니다.

답변

당신도 이것을 할 수 있습니다 pandas.DataFrame.corr:

import pandas as pd
a = [[1, 2, 3],
     [5, 6, 9],
     [5, 6, 11],
     [5, 6, 13],
     [5, 3, 13]]
df = pd.DataFrame(data=a)
df.corr()

이것은 준다

          0         1         2
0  1.000000  0.745601  0.916579
1  0.745601  1.000000  0.544248
2  0.916579  0.544248  1.000000

답변

numpy / scipy에 의존하기보다는 Pearson Correlation Coefficient (PCC) 계산 단계 를 이해하고 코딩하는 것이 가장 쉽다고 생각합니다 .

import math

# calculates the mean
def mean(x):
    sum = 0.0
    for i in x:
         sum += i
    return sum / len(x)

# calculates the sample standard deviation
def sampleStandardDeviation(x):
    sumv = 0.0
    for i in x:
         sumv += (i - mean(x))**2
    return math.sqrt(sumv/(len(x)-1))

# calculates the PCC using both the 2 functions above
def pearson(x,y):
    scorex = []
    scorey = []

    for i in x:
        scorex.append((i - mean(x))/sampleStandardDeviation(x))

    for j in y:
        scorey.append((j - mean(y))/sampleStandardDeviation(y))

# multiplies both lists together into 1 list (hence zip) and sums the whole list   
    return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)

PCC 의 중요성 은 기본적으로 두 변수 / 목록이 얼마나 강한 상관 관계 가 있는지 보여줍니다 . PCC 값의 범위 는 -1 ~ 1 입니다. 0에서 1 사이의 값은 양의 상관 관계를 나타냅니다. 0의 값은 가장 높은 변동 (상관 관계 없음)입니다. -1에서 0 사이의 값은 음의 상관 관계를 나타냅니다.