[python] FutureWarning: elementwise comparison failed; returning scalar, but in the future will perform elementwise comparison
I am using Pandas 0.19.1 on Python 3, and I get a warning on the line of code below. I am trying to get a list containing all the row numbers where the string Peter appears in the column Unnamed: 5.
df = pd.read_excel(xls_path)
myRows = df[df['Unnamed: 5'] == 'Peter'].index.tolist()
It produces the warning:
"\Python36\lib\site-packages\pandas\core\ops.py:792: FutureWarning: elementwise
comparison failed; returning scalar, but in the future will perform
elementwise comparison
result = getattr(x, name)(y)"
What is this FutureWarning, and should I just ignore it since it seems to work?
Answer
This FutureWarning isn't from Pandas; it comes from numpy, and the bug also affects matplotlib and others. Here's how to reproduce the warning closer to the source of the trouble:
import numpy as np
print(np.__version__) # Numpy version '1.12.0'
'x' in np.arange(5) #Future warning thrown here
FutureWarning: elementwise comparison failed; returning scalar instead, but in the
future will perform elementwise comparison
False
Another way to reproduce this bug, using the double equals operator:
import numpy as np
np.arange(5) == np.arange(5).astype(str) #FutureWarning thrown here
An example of Matplotlib affected by this FutureWarning under their quiver plot implementation: https://matplotlib.org/examples/pylab_examples/quiver_demo.html
What’s going on here?
There is a disagreement between Numpy and native Python on what should happen when you compare a string to numpy's numeric types. Notice the left operand is Python's turf, a primitive string, and the operator in the middle is Python's turf, but the right operand is numpy's turf. Should you return a Python-style scalar or a Numpy-style ndarray of booleans? Numpy says ndarray of bool, Pythonic developers disagree. Classic standoff.
Should it be an elementwise comparison, or a scalar saying whether the item exists in the array?
If your code or library is using the in or == operators to compare a Python string to numpy ndarrays, they aren't compatible, so if you try it, it returns a scalar, but only for now. The warning indicates that this behavior might change in the future, so your code pukes all over the carpet if Python/numpy decide to adopt the Numpy style.
Submitted Bug reports:
Numpy and Python are in a standoff; for now the operation returns a scalar, but in the future it may change.
https://github.com/numpy/numpy/issues/6784
https://github.com/pandas-dev/pandas/issues/7830
Two workaround solutions:
Either lock down your version of Python and numpy, ignore the warnings and expect the behavior not to change, or convert both the left and right operands of == and in to be from a numpy type or a primitive Python numeric type.
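As a minimal sketch of the second option, reusing the np.arange reproducer from above (casting the numeric array to str is just one way to make the operand types match; pick whatever cast fits your data):
import numpy as np

a = np.arange(5)

# Mismatched types ('x' is a Python str, a holds numpy ints) is what triggers
# the FutureWarning. Casting the array to str makes both operands string-like,
# so the comparison is a well-defined elementwise one and no warning is raised.
print('3' in a.astype(str))      # True
print(a.astype(str) == '3')      # [False False False  True False]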
Suppress the warning globally:
import warnings
import numpy as np
warnings.simplefilter(action='ignore', category=FutureWarning)
print('x' in np.arange(5)) #returns False, without Warning
Suppress the warning on a line-by-line basis:
import warnings
import numpy as np
with warnings.catch_warnings():
    warnings.simplefilter(action='ignore', category=FutureWarning)
    print('x' in np.arange(2)) #returns False, warning is suppressed

print('x' in np.arange(10)) #returns False, Throws FutureWarning
Just suppress the warning by name, then put a loud comment next to it mentioning the current versions of Python and numpy, saying this code is brittle, requires these versions, and linking back here. Kick the can down the road.
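A hedged sketch of suppressing only this particular warning by its message rather than the whole FutureWarning category (the message text is matched as a regular expression against the start of the warning; not part of the original answer):
import warnings

# BRITTLE: written against numpy 1.12 / pandas 0.19 behavior. Only warnings
# whose message starts with this text are silenced; other FutureWarnings still show.
warnings.filterwarnings('ignore', message='elementwise comparison failed')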
TLDR: pandas are Jedi; numpy are the Hutts; and python is the galactic empire. https://youtu.be/OZczsiCfQQk?t=3
Answer
I get the same error when I try to set the index_col while reading a file into a Pandas data-frame:
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=['0']) ## or same with the following
df = pd.read_csv('my_file.tsv', sep='\t', header=0, index_col=[0])
I have never encountered such an error previously. I am still trying to figure out the reason behind it (using @Eric Leschinski's explanation and others).
Anyhow, the following approach solves the problem for now, until I figure out the reason:
df = pd.read_csv('my_file.tsv', sep='\t', header=0) ## not setting the index_col
df.set_index(['0'], inplace=True)
I will update this as soon as I figure out the reason for such behavior.
Answer
In my experience, the same warning message was caused by a TypeError.
TypeError: invalid type comparison
So, you may want to check the data type of the Unnamed: 5 column:
for x in df['Unnamed: 5']:
    print(type(x))  # are they 'str'?
Here is how I can replicate the warning message:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3
df.loc[df['num3'] == '3', 'num3'] = 4 # TypeError and the Warning
df.loc[df['num3'] == 3, 'num3'] = 4 # No Error
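If the elementwise check really does need to be against a string, a hedged alternative is to cast the column first so both sides of == are the same type; the astype call below is my addition, not part of the original answer:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 2), columns=['num1', 'num2'])
df['num3'] = 3

# Cast the column to str so the elementwise comparison is str == str.
df['num3'] = df['num3'].astype(str)
df.loc[df['num3'] == '3', 'num3'] = '4'   # no TypeError, no FutureWarning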
Hope it helps.
Answer
Can't beat Eric Leschinski's awesomely detailed answer, but here's a quick workaround to the original question that I don't think has been mentioned yet: put the string in a list and use .isin instead of ==. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Peter", "Joe"], "Number": [1, 2]})
# Raises warning using == to compare different types:
df.loc[df["Number"] == "2", "Number"]
# No warning using .isin:
df.loc[df["Number"].isin(["2"]), "Number"]
Answer
A quick workaround for this is to use numpy.core.defchararray. I also faced the same warning message and was able to resolve it using the above module.
import numpy.core.defchararray as npd
resultdataset = npd.equal(dataset1, dataset2)
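For reference, a minimal sketch of what that call does with plain string arrays (the arrays below are made up for illustration; both sides must already be string-typed):
import numpy as np
import numpy.core.defchararray as npd

dataset1 = np.array(['Peter', 'Joe', 'Anna'])
dataset2 = np.array(['Peter', 'Peter', 'Peter'])

# Elementwise comparison of two string arrays; no FutureWarning is raised.
print(npd.equal(dataset1, dataset2))   # [ True False False]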
Answer
Eric’s answer helpfully explains that the trouble comes from comparing a Pandas Series (containing a NumPy array) to a Python string. Unfortunately, his two workarounds both just suppress the warning.
To write code that doesn't cause the warning in the first place, explicitly compare your string to each element of the Series and get a separate bool for each. For example, you could use map and an anonymous function.
myRows = df[df['Unnamed: 5'].map( lambda x: x == 'Peter' )].index.tolist()
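For instance, on a made-up DataFrame mirroring the question's column (the data below is an assumption), this yields a clean boolean mask:
import pandas as pd

df = pd.DataFrame({'Unnamed: 5': ['Peter', 'Anna', 7]})   # assumed, mixed-type column

# The lambda runs once per element, so each comparison is plain Python and
# returns a bool; the result is a boolean Series usable as a mask.
mask = df['Unnamed: 5'].map(lambda x: x == 'Peter')
print(mask.tolist())              # [True, False, False]
print(df[mask].index.tolist())    # [0]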
Answer
If your arrays aren't too big or you don't have too many of them, you might be able to get away with forcing the left hand side of == to be a string:
myRows = df[str(df['Unnamed: 5']) == 'Peter'].index.tolist()
But this is ~1.5 times slower if df['Unnamed: 5'] is a string, 25-30 times slower if df['Unnamed: 5'] is a small numpy array (length = 10), and 150-160 times slower if it's a numpy array with length 100 (times averaged over 500 trials).
import time
from numpy import linspace, zeros, mean

a = linspace(0, 5, 10)
b = linspace(0, 50, 100)
n = 500
string1 = 'Peter'
string2 = 'blargh'
times_a = zeros(n)
times_str_a = zeros(n)
times_s = zeros(n)
times_str_s = zeros(n)
times_b = zeros(n)
times_str_b = zeros(n)
for i in range(n):
    t0 = time.time()
    tmp1 = a == string1
    t1 = time.time()
    tmp2 = str(a) == string1
    t2 = time.time()
    tmp3 = string2 == string1
    t3 = time.time()
    tmp4 = str(string2) == string1
    t4 = time.time()
    tmp5 = b == string1
    t5 = time.time()
    tmp6 = str(b) == string1
    t6 = time.time()
    times_a[i] = t1 - t0
    times_str_a[i] = t2 - t1
    times_s[i] = t3 - t2
    times_str_s[i] = t4 - t3
    times_b[i] = t5 - t4
    times_str_b[i] = t6 - t5
print('Small array:')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_a), mean(times_str_a)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_a)/mean(times_a)))
print('\nBig array')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_b), mean(times_str_b)))
print(mean(times_str_b)/mean(times_b))
print('\nString')
print('Time to compare without str conversion: {} s. With str conversion: {} s'.format(mean(times_s), mean(times_str_s)))
print('Ratio of time with/without string conversion: {}'.format(mean(times_str_s)/mean(times_s)))
Result:
Small array:
Time to compare without str conversion: 6.58464431763e-06 s. With str conversion: 0.000173756599426 s
Ratio of time with/without string conversion: 26.3881526541
Big array
Time to compare without str conversion: 5.44309616089e-06 s. With str conversion: 0.000870866775513 s
159.99474375821288
String
Time to compare without str conversion: 5.89370727539e-07 s. With str conversion: 8.30173492432e-07 s
Ratio of time with/without string conversion: 1.40857605178