[python] Detect and exclude outliers in a Pandas DataFrame

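I have a pandas DataFrame with a few columns.

I know that certain rows are outliers based on the value in a particular column.

For instance:

In the column 'Vol', all values are around 12xx and one value is 4000 (an outlier).

I would like to exclude the rows whose 'Vol' value is an outlier like this.

So, essentially, I need to put a filter on the DataFrame that selects all rows where the value of a certain column is within, say, 3 standard deviations of the mean.

What is an elegant way to achieve this?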

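Answer

If your DataFrame has multiple columns and you want to remove every row that has an outlier in at least one column, the following expression does that in one shot.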

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randn(100, 3))

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

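Description:

For each column, it first computes the Z-score of each value in the column, relative to the column mean and standard deviation.
It then takes the absolute value of the Z-score, because the direction does not matter, only whether it is below the threshold.
all(axis=1) ensures that, for each row, all columns satisfy the constraint.
Finally, the result of this condition is used to index the DataFrame.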
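
The four steps above can be sketched one at a time on data shaped like the question's example. The column names and values here are made up for illustration (99 'Vol' values in the 12xx range plus a single 4000), not taken from the asker's data:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 99 'Vol' values in the 12xx range plus one value of 4000.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Vol": np.append(rng.integers(1200, 1300, size=99), 4000),
    "Price": rng.normal(50, 5, size=100),
})

# Steps 1 and 2: absolute z-score of every value, computed column by column.
z = np.abs(np.asarray(stats.zscore(df)))

# Step 3: a row passes only if all of its columns are below the threshold.
mask = (z < 3).all(axis=1)

# Step 4: boolean indexing keeps the rows that passed.
filtered = df[mask]

print(len(df), len(filtered))
print(filtered["Vol"].max())
```

With 100 rows, the z-score of the value 4000 is roughly 10, so that row is clearly rejected. In very small samples (about ten rows or fewer) a single extreme value cannot reach |z| = 3, because it inflates the standard deviation it is measured against, so a lower threshold or a quantile-based rule may be needed there.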

Answer

Use boolean indexing, as you would with a numpy.array:

df = pd.DataFrame({'Data':np.random.normal(size=200)})
# example dataset of normally distributed data. 

df[np.abs(df.Data-df.Data.mean()) <= (3*df.Data.std())]
# keep only the ones that are within +3 to -3 standard deviations in the column 'Data'.

df[~(np.abs(df.Data-df.Data.mean()) > (3*df.Data.std()))]
# or if you prefer the other way around

For a Series it is similar:

S = pd.Series(np.random.normal(size=200))
S[~((S-S.mean()).abs() > 3*S.std())]
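
The two spellings above agree on complete data, but they differ when the column contains missing values, because every comparison with NaN evaluates to False. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value.
s = pd.Series([1.0, 2.0, 3.0, np.nan, 2.5])

dev = (s - s.mean()).abs()   # distance from the mean (NaN stays NaN)
limit = 3 * s.std()          # pandas skips NaN when computing mean and std

kept_le = s[dev <= limit]        # NaN <= limit is False, so the NaN row is dropped
kept_not_gt = s[~(dev > limit)]  # NaN > limit is False, negated to True, so it is kept

print(len(kept_le), len(kept_not_gt))
```

Which form is preferable depends on whether rows with a missing value should survive the outlier filter.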

Answer

For each of your DataFrame columns you can get a quantile with:

q = df["col"].quantile(0.99)

and then filter with:

df[df["col"] < q]

If you need to remove both lower and upper outliers, combine the two conditions with an AND (&):

q_low = df["col"].quantile(0.01)
q_hi  = df["col"].quantile(0.99)

df_filtered = df[(df["col"] < q_hi) & (df["col"] > q_low)]
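
As a rough illustration, the two-sided filter can be wrapped in a small helper and tried on data like the question's. The function name `filter_by_quantiles` and the sample values are assumptions made for this sketch:

```python
import pandas as pd

def filter_by_quantiles(df, col, low=0.01, high=0.99):
    """Keep rows whose `col` value lies strictly between the two quantiles."""
    q_low = df[col].quantile(low)
    q_hi = df[col].quantile(high)
    return df[(df[col] > q_low) & (df[col] < q_hi)]

# Hypothetical data: 1200..1298 plus one outlier of 4000 (100 rows).
df = pd.DataFrame({"Vol": list(range(1200, 1299)) + [4000]})
df_filtered = filter_by_quantiles(df, "Vol")

print(len(df), len(df_filtered))
print(df_filtered["Vol"].min(), df_filtered["Vol"].max())
```

Note that a quantile filter always trims a fixed share of the data: here it removes 4000, but also 1200, the smallest value, even though that value is not unusual.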

Answer

This answer is similar to the one provided by @tanemaki, but uses a lambda expression instead of scipy stats.

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]

To filter the DataFrame so that only ONE column (e.g. 'B') must be within three standard deviations:

df[((df.B - df.B.mean()) / df.B.std()).abs() < 3]
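
One detail worth checking when swapping between the pandas and scipy versions: pandas' `std()` defaults to the sample standard deviation (ddof=1), while `scipy.stats.zscore` defaults to the population standard deviation (ddof=0). The following sketch, on random data generated only for illustration, compares the two:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=list("ABC"))

# pandas: sample standard deviation (ddof=1) by default
z_pandas = ((df - df.mean()) / df.std()).to_numpy()

# scipy: population standard deviation (ddof=0) by default
z_scipy = np.asarray(stats.zscore(df))

# scipy with ddof=1 reproduces the pandas result
z_scipy_ddof1 = np.asarray(stats.zscore(df, ddof=1))

n = len(df)
print(np.allclose(z_pandas, z_scipy_ddof1))
print(np.allclose(z_pandas, z_scipy * np.sqrt((n - 1) / n)))
```

With 100 rows the two z-scores differ by a factor of about 0.995, so the filters only disagree for values sitting almost exactly on the threshold. The gap grows for small samples.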

For how to apply this z-score on a rolling basis, see: Rolling Z-score applied to pandas dataframe
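
The linked question is not reproduced here. As a rough sketch of one way a rolling z-score could be computed, the following compares each value with the mean and standard deviation of the values just before it. The function name, the window size, and the choice of a trailing window are all assumptions for illustration:

```python
import pandas as pd

def rolling_zscore(s, window):
    """z-score of each value relative to the `window` values that precede it."""
    prev = s.shift(1).rolling(window=window)  # trailing window, excluding the current value
    return (s - prev.mean()) / prev.std()

# Made-up series with a jump at the end.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
z = rolling_zscore(s, window=3)

# The first `window` rows have no full history, so their z-score is NaN;
# writing the filter as ~(... > 3) keeps those warm-up rows.
kept = s[~(z.abs() > 3)]
print(kept.tolist())
```

If the preceding window is constant, its standard deviation is zero and the z-score becomes infinite or NaN, so real data may need an extra guard for that case.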

Answer

#------------------------------------------------------------------------------
# accept a dataframe, remove outliers, return cleaned data in a new dataframe
# see http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm
#------------------------------------------------------------------------------
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
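
A short usage sketch for the function above, repeated here so the example runs on its own. The sample data (values 1200 to 1298 plus a single 4000) is invented to mirror the question:

```python
import pandas as pd

def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1                      # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    return df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]

# Hypothetical data: 99 ordinary values and one outlier.
df = pd.DataFrame({"Vol": list(range(1200, 1299)) + [4000]})
cleaned = remove_outlier(df, "Vol")

print(len(df), len(cleaned))
print(cleaned["Vol"].max())
```

Because the fences are built from quartiles, the single extreme value barely moves them, and only the 4000 row falls outside. The original DataFrame is left unchanged.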

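Answer

Since I haven't seen an answer that deals with both numerical and non-numerical attributes, here is a complementary answer.

You may want to drop outliers only on the numerical attributes (categorical variables can hardly be outliers).

Function definition

I have extended @tanemaki's suggestion to handle data in which non-numeric attributes are also present: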

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # `constraints` is True for rows whose numeric values all have |z-score| below the threshold.
    constraints = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows that fail the constraint
    df.drop(df.index[~constraints], inplace=True)

Usage

drop_numerical_outliers(df)
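
The function above modifies `df` in place. A minimal sketch of a variant that returns a new DataFrame instead, tried on a small mixed-type table; the name `without_numerical_outliers` and the columns `Alley`, `LotArea` and `SalePrice` with their values are invented for this illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

def without_numerical_outliers(df, z_thresh=3):
    """Return the rows whose numeric columns all have |z-score| < z_thresh."""
    numeric = df.select_dtypes(include=[np.number])   # ignore non-numeric columns
    z = np.abs(np.asarray(stats.zscore(numeric)))     # column-wise z-scores
    keep = (z < z_thresh).all(axis=1)
    return df[keep]

# Hypothetical housing data with one categorical and two numeric columns.
houses = pd.DataFrame({
    "Alley": ["Grvl", "Pave"] * 50,
    "LotArea": list(range(8000, 8099)) + [200000],         # one extreme lot
    "SalePrice": [150000 + 100 * (i % 10) for i in range(100)],
})

cleaned = without_numerical_outliers(houses)
print(len(houses), len(cleaned))
print(cleaned.columns.tolist())
```

The categorical column is carried through untouched and only takes part in the row selection. Note that `stats.zscore` propagates NaN by default, so a numeric column containing missing values would yield NaN z-scores and cause every row to be rejected unless the missing values are handled first.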

Example

Imagine a dataset df with some values about houses: alley, land contour, sale price, and so on. E.g.: Data Documentation

First, you want to visualise the data on a scatter plot (with z-score threshold = 3):

# Plot data before dropping those greater than z-score 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

# Drop the outliers on every numerical attribute
drop_numerical_outliers(df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers as in the first plot, but new outliers computed from the new data frame.
scatterAreaVsPrice(df)

Answer

For each series in the DataFrame, you can use between and quantile to remove outliers.

x = pd.Series(np.random.normal(size=200)) # with outliers
x = x[x.between(x.quantile(.25), x.quantile(.75))] # without outliers: keeps only values between the 25th and 75th percentiles
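
To see how much data this rule keeps, here is a small sketch on an evenly spaced series (values chosen only for illustration), together with a wider pair of quantiles for comparison. The 1% and 99% bounds are an assumption, not part of the answer:

```python
import pandas as pd

x = pd.Series(range(100), dtype=float)

# Rule from the answer: keep values between the 25th and 75th percentiles.
middle_half = x[x.between(x.quantile(.25), x.quantile(.75))]

# A gentler alternative: keep values between the 1st and 99th percentiles.
middle_98 = x[x.between(x.quantile(.01), x.quantile(.99))]

print(len(x), len(middle_half), len(middle_98))
```

The 25%/75% bounds discard roughly half of the data regardless of whether it contains outliers, so wider quantiles are usually more appropriate when the goal is only to trim extreme values.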