[python] Split a string by spaces in Python, preserving quoted substrings

I have a string like the following:

this is "a test"

I'm trying to write something in Python that splits it on spaces while ignoring spaces inside quotes. The result I'm looking for is:

['this','is','a test']

P.S. You may ask, "What happens if there are quotes within the quotes in your application?"

Answer

You want split, from the built-in shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.
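On the question's P.S. about quotes inside quotes, here is a small sketch of how shlex.split treats escaped and mixed quotes, plus the posix=False mode that keeps the quote characters. The sample strings are made up for illustration.

```python
import shlex

# Default (POSIX) mode: quotes group words and are then removed.
plain = shlex.split('this is "a test"')
print(plain)    # ['this', 'is', 'a test']

# A backslash-escaped double quote inside double quotes stays as a literal quote.
escaped = shlex.split(r'say "he said \"hi\""')
print(escaped)  # ['say', 'he said "hi"']

# Single quotes can wrap double quotes.
mixed = shlex.split("""it 'has "inner" quotes'""")
print(mixed)    # ['it', 'has "inner" quotes']

# posix=False keeps the surrounding quote characters in the token.
kept = shlex.split('this is "a test"', posix=False)
print(kept)     # ['this', 'is', '"a test"']

# An unbalanced quote is reported as a ValueError rather than guessed at.
try:
    shlex.split('this is "a test')
except ValueError as exc:
    print("unbalanced quote:", exc)
```

If the input is never supposed to contain nested or escaped quotes, the default call is enough; the other cases only show what happens when it does.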

Answer

Take a look at the shlex module, in particular shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']

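Answer

The regex approaches here look complex and wrong. That surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you are going to use regexes, why not just say exactly what you mean?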

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]
# pieces == ['this', 'is', '"a test"']  (the quote characters are kept)

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything (greedy); .*? = anything, stopping at the first closing quote
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
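Because the re.split version keeps the quote characters in each piece, a short sketch of removing them afterwards may help. The names split_keep_quoted and unquote are hypothetical helpers, not part of the answer above, and the pattern assumes quotes are never nested or escaped.

```python
import re

# Same idea as above: split on a single space or on a quoted run.
QUOTED_OR_SPACE = re.compile(r"""( |".*?"|'.*?')""")

def unquote(piece):
    """Drop one pair of matching outer quotes, if present."""
    if len(piece) >= 2 and piece[0] == piece[-1] and piece[0] in "\"'":
        return piece[1:-1]
    return piece

def split_keep_quoted(text, strip_quotes=False):
    # The capturing group makes re.split return the separators too;
    # filtering on p.strip() drops the spaces and empty strings.
    pieces = [p for p in QUOTED_OR_SPACE.split(text) if p.strip()]
    return [unquote(p) for p in pieces] if strip_quotes else pieces

print(split_keep_quoted('this is "a test"'))                     # ['this', 'is', '"a test"']
print(split_keep_quoted('this is "a test"', strip_quotes=True))  # ['this', 'is', 'a test']
```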

Answer

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
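One caveat worth sketching: with delimiter=" ", every single space is a field separator, so consecutive spaces produce empty fields. The helper name split_with_csv and the sample line are made up for illustration, and dropping empty fields is only safe if an intentionally empty "" field can never occur in the data.

```python
import csv

def split_with_csv(line):
    """Split one line on single spaces, honouring double-quoted fields."""
    return next(csv.reader([line], delimiter=" "))

row = split_with_csv('this  is "a string"')  # note the two spaces after "this"
print(row)                    # ['this', '', 'is', 'a string']
print([f for f in row if f])  # ['this', 'is', 'a string']
```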

Answer

I use shlex.split to process 70,000,000 lines of squid log, and it is too slow. So I switched to re.

If you have performance problems with shlex, try this function.

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
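To check the speed claim on your own data, a rough timing sketch with timeit could look like the following. The compare helper, the precompiled QUOTED_OR_WORD name, and the sample line are assumptions for illustration; actual timings depend on the machine and on the input. Note that line_split keeps the quote characters, so they are stripped here before comparing against shlex.split.

```python
import re
import shlex
import timeit

# Same pattern as line_split above, compiled once up front.
QUOTED_OR_WORD = re.compile(r'[^"\s]\S*|".+?"')

def line_split(line):
    return QUOTED_OR_WORD.findall(line)

def compare(line, number=10_000):
    """Return (seconds for shlex.split, seconds for line_split) over `number` calls."""
    t_shlex = timeit.timeit(lambda: shlex.split(line), number=number)
    t_regex = timeit.timeit(lambda: line_split(line), number=number)
    return t_shlex, t_regex

sample = 'this is "a test"'
tokens = line_split(sample)
print(tokens)                          # ['this', 'is', '"a test"']
print([t.strip('"') for t in tokens])  # ['this', 'is', 'a test']

t_shlex, t_regex = compare(sample, number=1_000)
print(f"shlex.split: {t_shlex:.4f}s  line_split: {t_regex:.4f}s")
```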

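Answer

Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces in the quoted parts with \x00, then split on whitespace, then replace each \x00 with a space again in every part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.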

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))

Answer

For performance reasons, re seems to be faster. Here is a solution using a non-greedy operator that preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

Constructs such as aaa"bla blub"bbb are left together, because these tokens are not separated by whitespace. If the string contains escaped characters, you can match them like this:

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Note that this also matches the empty string "" by means of the \S part of the pattern.
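As a variation on the pattern above, here is a sketch that spells out escapes explicitly: inside quotes, a backslash consumes the character after it. This is a commonly used alternative and not the pattern from this answer; the names TOKEN and tokenize are hypothetical.

```python
import re

# A quoted run is an opening quote, then any mix of backslash-escaped
# characters or characters that are neither a quote nor a backslash,
# then a closing quote. Outside quotes, any non-space, non-quote
# character extends the token.
TOKEN = re.compile(r'(?:"(?:\\.|[^"\\])*"|[^\s"])+')

def tokenize(text):
    return TOKEN.findall(text)

a = 'She said "He said, \\"My name is Mark.\\""'
for token in tokenize(a):
    print(token)

print(tokenize('aaa"bla blub"bbb'))  # ['aaa"bla blub"bbb']
print(tokenize('x "" y'))            # ['x', '""', 'y']
```

With this form an empty "" is matched by the quoted branch itself, and a quoted part that ends in an escaped backslash, such as "a b\\", still closes at the right quote. An unmatched quote character is skipped rather than reported.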