[java] RegEx to split camelCase or TitleCase (advanced)

Question 1

I found a brilliant RegEx to extract the parts of a camelCase or TitleCase expression.

 (?<!^)(?=[A-Z])

It works as expected:

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value

For example, in Java:

String s = "loremIpsum";
String[] words = s.split("(?<!^)(?=[A-Z])");
// words now equals new String[]{"lorem", "Ipsum"}

My problem is that it does not work in some cases:

Case 1: VALUE -> V / A / L / U / E
Case 2: eclipseRCPExt -> eclipse / R / C / P / Ext
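
The failing cases are easy to reproduce. The following is only a minimal sketch: the class name `NaiveSplitDemo` and the helper `split` are made up for illustration and do not come from the question.

```java
// Hypothetical demo class (not from the original post). It shows that the
// pattern splits before EVERY uppercase letter except one at the string start.
final class NaiveSplitDemo {

    // The regex from the question, unchanged.
    static final String NAIVE = "(?<!^)(?=[A-Z])";

    static String[] split(String s) {
        return s.split(NAIVE);
    }

    public static void main(String[] args) {
        String[] samples = {"camelValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            // Prints:
            //   camelValue -> camel / Value
            //   VALUE -> V / A / L / U / E
            //   eclipseRCPExt -> eclipse / R / C / P / Ext
            System.out.println(s + " -> " + String.join(" / ", split(s)));
        }
    }
}
```

The lookahead `(?=[A-Z])` has no notion of a run of uppercase letters, which is why every letter of an acronym ends up in its own group.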

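To my mind, the result should be:

Case 1: VALUE
Case 2: eclipse / RCP / Ext

In other words, given n uppercase chars:

If the n chars are followed by lowercase chars, the groups should be: (n-1 chars) / (n-th char + lowercase chars)
If the n chars are at the end, the group should be: (n chars)

Any idea on how to improve this regex?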

Question 2

The following regex works for all of the above examples:

public static void main(String[] args)
{
    for (String w : "camelValue".split("(?<!(^|[A-Z]))(?=[A-Z])|(?<!^)(?=[A-Z][a-z])")) {
        System.out.println(w);
    }
}

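It works by forcing the negative lookbehind to ignore not only matches at the start of the string, but also matches where an uppercase letter is preceded by another uppercase letter. This handles cases like "VALUE".

The first part of the regex on its own fails on "eclipseRCPExt" because it does not split between "RCP" and "Ext". This is the purpose of the second clause: (?<!^)(?=[A-Z][a-z]). This clause allows a split before every uppercase letter that is followed by a lowercase letter, except at the start of the string.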
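
To see what each clause contributes, the two halves of the pattern can be tried separately. This is a rough sketch; the class `TwoClauseSplitDemo`, its constants, and its helper methods are illustrative names, not part of the answer.

```java
// Hypothetical demo class: applies the answer's regex as a whole and
// its first clause alone, to show why the second clause is needed.
final class TwoClauseSplitDemo {

    // Clause 1: split before an uppercase letter unless it is at the start
    // of the string or directly preceded by another uppercase letter.
    static final String CLAUSE_1 = "(?<!(^|[A-Z]))(?=[A-Z])";

    // Clause 2: split before an uppercase letter that is followed by a
    // lowercase letter, except at the start of the string.
    static final String CLAUSE_2 = "(?<!^)(?=[A-Z][a-z])";

    static String[] splitFirstClauseOnly(String s) {
        return s.split(CLAUSE_1);
    }

    static String[] split(String s) {
        return s.split(CLAUSE_1 + "|" + CLAUSE_2);
    }

    public static void main(String[] args) {
        // Clause 1 alone keeps the acronym glued to the next word.
        System.out.println(String.join(" / ", splitFirstClauseOnly("eclipseRCPExt"))); // eclipse / RCPExt
        // Both clauses together give the grouping asked for in the question.
        System.out.println(String.join(" / ", split("eclipseRCPExt"))); // eclipse / RCP / Ext
        System.out.println(String.join(" / ", split("VALUE")));         // VALUE
    }
}
```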

Question 3

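You seem to be making this more complicated than it needs to be. For camelCase, the split location is simply anywhere an uppercase letter immediately follows a lowercase letter: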

(?<=[a-z])(?=[A-Z])

Here is how this regex splits your example data:

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCPExt

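The only difference from your desired output is with eclipseRCPExt, which I would argue is correctly split here.

Addendum - Improved version

Note: This answer recently got an upvote, and I realized that there is a better way.

By adding a second alternative to the above regex, all of the OP's test cases are correctly split.

(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])

Here is how the improved regex splits the example data: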

value -> value
camelValue -> camel / Value
TitleValue -> Title / Value
VALUE -> VALUE
eclipseRCPExt -> eclipse / RCP / Ext

Edit 20130824: Added the improved version to handle the RCPExt -> RCP / Ext case.
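
The two patterns from this answer can be compared side by side in Java. This is a small sketch; the class `LookbehindSplitDemo` and the constant names `SIMPLE` and `IMPROVED` are assumptions made for illustration.

```java
// Hypothetical demo class: compares the simple and the improved pattern.
final class LookbehindSplitDemo {

    // Split between a lowercase letter and the uppercase letter after it.
    static final String SIMPLE = "(?<=[a-z])(?=[A-Z])";

    // Additionally split inside an uppercase run, just before the last
    // uppercase letter when that letter starts a lowercase word.
    static final String IMPROVED = SIMPLE + "|(?<=[A-Z])(?=[A-Z][a-z])";

    static String[] split(String regex, String s) {
        return s.split(regex);
    }

    public static void main(String[] args) {
        String[] samples = {"value", "camelValue", "TitleValue", "VALUE", "eclipseRCPExt"};
        for (String s : samples) {
            System.out.println(s
                    + " | simple: " + String.join(" / ", split(SIMPLE, s))
                    + " | improved: " + String.join(" / ", split(IMPROVED, s)));
        }
    }
}
```

Only the last sample differs between the two: the simple pattern yields eclipse / RCPExt, while the improved one yields eclipse / RCP / Ext.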

Question 4

Another solution would be to use a dedicated method in commons-lang: StringUtils#splitByCharacterTypeCamelCase

Question 5

I couldn't get aix's solution to work (and it doesn't work on RegExr either), so I came up with my own, which I've tested and which seems to do exactly what you're looking for:

((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))

Here is an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms.
;   (^[a-z]+)                       Match against any lower-case letters at the start of the string.
;   ([A-Z]{1}[a-z]+)                Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($))))", "$1 ")
newString := Trim(newString)

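Here I'm separating each word with a space, so here are some examples of how the string is transformed:

ThisIsATitleCASEString => This Is A Title CASE String
andThisOneIsCamelCASE => and This One Is Camel CASE

The solution above does what the original post asks for, but I also needed a regex to find camel and pascal strings that include numbers, so I also came up with this variation that includes numbers: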

((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))

And an example of using it:

; Regex Breakdown:  This will match against each word in Camel and Pascal case strings, while properly handling acronyms and including numbers.
;   (^[a-z]+)                               Match against any lower-case letters at the start of the string.
;   ([0-9]+)                                Match against one or more consecutive numbers (anywhere in the string, including at the start).
;   ([A-Z]{1}[a-z]+)                        Match against Title case words (one upper case followed by lower case letters).
;   ([A-Z]+(?=([A-Z][a-z])|($)|([0-9])))    Match against multiple consecutive upper-case letters, leaving the last upper case letter out the match if it is followed by lower case letters, and including it if it's followed by the end of the string or a number.
newString := RegExReplace(oldCamelOrPascalString, "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))", "$1 ")
newString := Trim(newString)

And here are some examples of how a string with numbers is transformed with this regex:

myVariable123 => my Variable 123
my2Variables => my 2 Variables
The3rdVariableIsHere => The 3 rdVariable Is Here
12345NumsAtTheStartIncludedToo => 12345 Nums At The Start Included Too
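
The snippets above are not Java, so here is a rough Java counterpart for this thread's language, using `String.replaceAll` with the same pattern. The class `SpaceWordsDemo` and the method `spaceWords` are hypothetical names; the sketch assumes Java's regex engine handles this pattern the same way for inputs like the ones listed.

```java
// Hypothetical demo class: a Java take on the replace-then-trim approach.
final class SpaceWordsDemo {

    // The number-aware pattern from this answer, unchanged.
    static final String WORDS =
            "((^[a-z]+)|([0-9]+)|([A-Z]{1}[a-z]+)|([A-Z]+(?=([A-Z][a-z])|($)|([0-9]))))";

    // Appends a space after every matched word, then trims the trailing one.
    static String spaceWords(String s) {
        return s.replaceAll(WORDS, "$1 ").trim();
    }

    public static void main(String[] args) {
        System.out.println(spaceWords("myVariable123"));          // my Variable 123
        System.out.println(spaceWords("my2Variables"));           // my 2 Variables
        System.out.println(spaceWords("ThisIsATitleCASEString")); // This Is A Title CASE String
        // A lowercase run that is not at the start ("rd") matches no
        // alternative, so it gets no space of its own.
        System.out.println(spaceWords("The3rdVariableIsHere"));   // The 3 rdVariable Is Here
    }
}
```

Because the first alternative is anchored with `^`, lowercase letters that follow a digit are never matched as a word. If that matters, splitting with a lookaround-based pattern may be a better fit than matching words.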

Question 6

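To handle more characters than just `A-Z`:

s.split("(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})");

Either:

Split after any lowercase letter that is followed by an uppercase letter.

E.g. parseXML -> parse, XML.

or

Split after any letter that is followed by an uppercase letter and then a lowercase letter.

E.g. XMLParser -> XML, Parser.

In a more readable form: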

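import static java.util.stream.Collectors.joining;
import static org.junit.Assert.assertEquals; // assuming JUnit 4

import java.util.regex.Pattern;
import org.junit.Test; // assuming JUnit 4

public class SplitCamelCaseTest {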

    static String BETWEEN_LOWER_AND_UPPER = "(?<=\\p{Ll})(?=\\p{Lu})";
    static String BEFORE_UPPER_AND_LOWER = "(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    static Pattern SPLIT_CAMEL_CASE = Pattern.compile(
        BETWEEN_LOWER_AND_UPPER +"|"+ BEFORE_UPPER_AND_LOWER
    );

    public static String splitCamelCase(String s) {
        return SPLIT_CAMEL_CASE.splitAsStream(s)
                        .collect(joining(" "));
    }

    @Test
    public void testSplitCamelCase() {
        assertEquals("Camel Case", splitCamelCase("CamelCase"));
        assertEquals("lorem Ipsum", splitCamelCase("loremIpsum"));
        assertEquals("XML Parser", splitCamelCase("XMLParser"));
        assertEquals("eclipse RCP Ext", splitCamelCase("eclipseRCPExt"));
        assertEquals("VALUE", splitCamelCase("VALUE"));
    }
}
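
The test cases above use ASCII input only. To illustrate what the Unicode categories add, here is a small sketch with a non-ASCII identifier, written with Unicode escapes so it does not depend on the source file encoding. The class `UnicodeSplitDemo` and the sample word are invented for illustration.

```java
// Hypothetical demo class: contrasts the Unicode-aware pattern with an
// ASCII-only equivalent on input containing non-ASCII letters.
final class UnicodeSplitDemo {

    // Pattern from this answer: Ll = lowercase letter, Lu = uppercase letter,
    // L = any letter.
    static final String UNICODE =
            "(?<=\\p{Ll})(?=\\p{Lu})|(?<=\\p{L})(?=\\p{Lu}\\p{Ll})";

    // ASCII-only pattern of the same shape, for comparison.
    static final String ASCII_ONLY =
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    static String[] split(String regex, String s) {
        return s.split(regex);
    }

    public static void main(String[] args) {
        // A lowercase word with o-umlaut and sharp s, followed by a word
        // that starts with A-umlaut.
        String sample = "gr\u00f6\u00dfe\u00c4nderung";
        System.out.println(split(UNICODE, sample).length);    // 2
        System.out.println(split(ASCII_ONLY, sample).length); // 1
    }
}
```

The ASCII-only classes do not treat the accented capital as an uppercase letter, so no split point is found there.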

Question 7

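Brief

The two top answers here provide code using positive lookbehinds, which are not supported by all regex flavors. The regex below captures both PascalCase and camelCase and can be used in multiple languages.

Note: I do realize this question is about Java, but I also see multiple mentions of this post in other questions tagged for different languages, as well as some comments on this question to the same effect.

Code

See regex in use here

([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\b)

Results

Sample Input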

eclipseRCPExt

SomethingIsWrittenHere

TEXTIsWrittenHERE

VALUE

loremIpsum

Sample Output

eclipse
RCP
Ext

Something
Is
Written
Here

TEXT
Is
Written
HERE

VALUE

lorem
Ipsum

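Explanation

Match one or more uppercase alpha characters [A-Z]+
Or match zero or one uppercase alpha character [A-Z]?, followed by one or more lowercase alpha characters [a-z]+
Ensure what follows is an uppercase alpha character [A-Z] or a word boundary \b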
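
Since this pattern matches the words themselves rather than the positions between them, in Java it would be used with `Matcher.find()` instead of `String.split`. The following is a minimal sketch under that assumption; the class `WordMatchDemo` and the method `words` are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: collects every match of the lookbehind-free pattern.
final class WordMatchDemo {

    // Group 1 is the word; the lookahead requires an uppercase letter or a
    // word boundary right after it.
    static final Pattern WORD = Pattern.compile("([A-Z]+|[A-Z]?[a-z]+)(?=[A-Z]|\\b)");

    static List<String> words(String s) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("eclipseRCPExt"));     // [eclipse, RCP, Ext]
        System.out.println(words("TEXTIsWrittenHERE")); // [TEXT, Is, Written, HERE]
        System.out.println(words("VALUE"));             // [VALUE]
    }
}
```

Note that anything the pattern does not match, such as digits, is simply skipped by this approach rather than kept in the output.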

Question 8

You can use StringUtils.splitByCharacterTypeCamelCase("loremIpsum") from Apache Commons Lang.
