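[python] Replace non-ASCII characters with a single space

I need to replace all non-ASCII characters (anything outside \x00-\x7F) with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters: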

def remove_non_ascii_1(text):
    return ''.join(i for i in text if ord(i)<128)

And this one replaces each non-ASCII character with as many spaces as there are bytes in the character's encoding (i.e. the – character is replaced with 3 spaces):

import re

def remove_non_ascii_2(text):
    return re.sub(r'[^\x00-\x7F]',' ', text)

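How can I replace all non-ASCII characters with a single space?

Of the myriad of similar SO questions, none address character replacement as opposed to stripping, and additionally address all non-ASCII characters rather than one specific character.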

Answer

Your ''.join() expression is filtering, so it removes anything that is non-ASCII. You could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

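This handles characters one by one and still uses one space per character replaced.

Your regular expression should replace each run of consecutive non-ASCII characters with a single space: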

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
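
To make the difference between the two approaches concrete, here is a minimal sketch. The function names `replace_each` and `replace_runs` are made up for illustration and do not come from the question.

```python
import re

def replace_each(text):
    # One space per non-ASCII character.
    return ''.join(c if ord(c) < 128 else ' ' for c in text)

def replace_runs(text):
    # One space per run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ABC马克def'
print(repr(replace_each(sample)))  # 'ABC  def' (two spaces)
print(repr(replace_runs(sample)))  # 'ABC def'  (one space)
```

Which of the two you want depends on whether the output has to keep the same length as the input.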

Answer

To get the closest ASCII representation of your original string, I recommend the unidecode module:

from unidecode import unidecode

def remove_non_ascii(text):
    # Python 3 str is already Unicode; on Python 2, decode UTF-8 bytes first.
    return unidecode(text)

Then you can use it on a string:

remove_non_ascii("Ceñía")
Cenia

Answer

For character processing, use Unicode strings:

PythonWin 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32.
>>> s='ABC马克def'
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r' ',s)   # Each char is a Unicode codepoint.
'ABC  def'
>>> b = s.encode('utf8')
>>> re.sub(rb'[^\x00-\x7f]',rb' ',b) # Each char is a 3-byte UTF-8 sequence.
b'ABC      def'

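But note that you will still have a problem if your string contains decomposed Unicode characters (for example, a base letter followed by a separate combining accent mark):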

>>> s = 'mañana'
>>> len(s)
6
>>> import unicodedata as ud
>>> n=ud.normalize('NFD',s)
>>> n
'mañana'
>>> len(n)
7
>>> re.sub(r'[^\x00-\x7f]',r' ',s) # single codepoint
'ma ana'
>>> re.sub(r'[^\x00-\x7f]',r' ',n) # only combining mark replaced
'man ana'
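
One possible way around this, sketched here under the assumption that composing to NFC first is acceptable for your data, is to normalize before replacing. The helper name `replace_non_ascii_nfc` is hypothetical. Not every base-plus-mark sequence has a precomposed form, so some combining marks can still end up as their own space.

```python
import re
import unicodedata as ud

def replace_non_ascii_nfc(text):
    # Compose base letter + combining mark into one code point where possible.
    composed = ud.normalize('NFC', text)
    # Then replace each remaining non-ASCII code point with one space.
    return re.sub(r'[^\x00-\x7f]', ' ', composed)

decomposed = ud.normalize('NFD', 'mañana')   # 7 code points
print(repr(replace_non_ascii_nfc(decomposed)))  # 'ma ana'
```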

Answer

If the replacement character can be '?' instead of a space, then I'd suggest result = text.encode('ascii', 'replace').decode():

"""Test the performance of different non-ASCII replacement methods."""


import re
from timeit import timeit


# 10_000 is typical in the project that I'm working on and most of the text
# is going to be non-ASCII.
text = 'Æ' * 10_000


print(timeit(
    """
result = ''.join([c if ord(c) < 128 else '?' for c in text])
    """,
    number=1000,
    globals=globals(),
))

print(timeit(
    """
result = text.encode('ascii', 'replace').decode()
    """,
    number=1000,
    globals=globals(),
))

Results (in seconds):

0.7208260721400134
0.009975979187503592

Answer

What about this one?

def replace_trash(unicode_string):
    chars = list(unicode_string)
    for i, ch in enumerate(chars):
        try:
            ch.encode("ascii")
        except UnicodeEncodeError:
            # Not encodable as ASCII, so replace it with a single space.
            chars[i] = " "
    return "".join(chars)

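Answer

As a native and efficient approach, you don't need to use ord or loop over the characters. Just encode to ascii and ignore the errors.

The following just removes the non-ASCII characters (the result is a bytes object):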

new_string = old_string.encode('ascii',errors='ignore')

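Now, if you want to make up for the removed characters, do the following. Note that this appends one space per removed character to the end of the result; it does not put the spaces at the original positions: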

final_string = new_string + b' ' * (len(old_string) - len(new_string))
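
If the spaces should land where the removed characters were, one option is a custom codec error handler. This is only a sketch: the handler name `space_replace` and the function `replace_in_place` are made up for illustration, while `codecs.register_error` itself is part of the standard library.

```python
import codecs

def _space_handler(exc):
    # Called by the codec for each unencodable span of the input.
    if isinstance(exc, UnicodeEncodeError):
        # Emit one space per offending character, then resume after the span.
        return (' ' * (exc.end - exc.start), exc.end)
    raise exc

# 'space_replace' is an arbitrary, hypothetical handler name.
codecs.register_error('space_replace', _space_handler)

def replace_in_place(text):
    # Encode with the custom handler, then decode back to str.
    return text.encode('ascii', errors='space_replace').decode('ascii')

print(repr(replace_in_place('ABC马克def')))  # 'ABC  def'
```

This keeps the encode-based approach and still avoids a Python-level loop over every character.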

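Answer

This is potentially for a different question, but I'm providing my version of @Alvero's answer (using unidecode). I want to do a "regular" strip on my strings, i.e. strip whitespace characters from the beginning and end of the string, and then replace only the other whitespace characters with a "regular" space, i.e. go from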

"Ceñíaㅤmañanaㅤㅤㅤㅤ"

to

"Ceñía mañana"

from unidecode import unidecode

def safely_stripped(s: str):
    return ' '.join(
        stripped for stripped in
        (bit.strip() for bit in
         ''.join((c if unidecode(c) else ' ') for c in s).strip().split())
        if stripped)

First, we replace every character for which unidecode returns an empty string with a regular space, and join the result back together:

''.join((c if unidecode(c) else ' ') for c in s)

Then we split that again with Python's normal split and strip each "bit":

(bit.strip() for bit in s.split())

Lastly, we join the bits back together, keeping only the strings that pass the if test (i.e. are non-empty):

' '.join(stripped for stripped in s if stripped)

And with that, safely_stripped('ㅤㅤㅤㅤCeñíaㅤmañanaㅤㅤㅤㅤ') correctly returns 'Ceñía mañana'.