[.net] 문자열 중간에서 문화에 민감한 "시작"작업을 수행하려면 어떻게해야합니까?

나는 상대적으로 불분명 요구 사항을 가지고 있지만, 그것은 것처럼 느낀다 해야 BCL을 사용 가능하다.

컨텍스트를 위해 Noda Time 에서 날짜 / 시간 문자열을 구문 분석하고 있습니다 . 입력 문자열 내 위치에 대한 논리적 커서를 유지합니다. 따라서 전체 문자열은 “2013 년 1 월 3 일”일 수 있지만 논리적 커서는 ‘J’에있을 수 있습니다.

이제 문화에 대해 알려진 모든 월 이름과 비교하여 월 이름을 구문 분석해야합니다.

문화에 민감한
대소 문자를 구분하지 않음
커서 지점에서 (나중이 아님, 커서가 후보 월 이름을 “보고”있는지 확인하고 싶습니다)
빨리
… 그리고 나중에 사용 된 문자 수를 알아야합니다.

이 작업을 수행 하는 현재 코드 는 일반적으로 CompareInfo.Compare. 이는 사실상 다음과 같습니다 (일치하는 부분에 대해서만-실제에 더 많은 코드가 있지만 일치와 관련이 없음).

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length,
                               CompareOptions.IgnoreCase) == 0;
}

그러나 그것은 후보와 우리가 비교하는 지역이 같은 길이에 달려 있습니다. 대부분 괜찮지 만 일부 특별한 경우에는 좋지 않습니다 . 다음과 같은 것이 있다고 가정합니다.

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

이제 내 비교는 실패 할 것입니다. 나는 사용할 수 있습니다 IsPrefix:

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

그러나:

그것은 내가 정말로 피하고 싶은 부분 문자열을 생성 할 것을 요구한다. (나는 Noda Time을 효과적으로 시스템 라이브러리로보고 있으며, 일부 클라이언트에게는 구문 분석 성능이 중요 할 수 있습니다.)
나중에 커서를 얼마나 멀리 이동할지 알려주지 않습니다.

실제로, 나는 강하게이 매우 자주 오지 않습니다 의심 …하지만 난 정말 것 처럼 여기 옳은 일을 할 수 있습니다. 유니 코드 전문가가 아니거나 직접 구현하지 않고도 할 수 있기를 바랍니다. 🙂

( 누군가가 궁극적 인 결론을 따르기를 원할 경우 Noda Time에서 버그 210으로 발생 했습니다.)

나는 정규화라는 아이디어를 좋아합니다. a) 정확성 및 b) 성능에 대해 자세히 확인해야합니다. 제대로 작동 할 수 있다고 가정 할 때 , 모든 것을 바꿀 가치가 있는지는 아직 확실하지 않습니다 . 실제 생활에서 실제로는 나오지 않을 것이지만 모든 사용자의 성능을 저하시킬 수 있습니다. (

나는 또한 BCL을 확인했는데,이 역시 제대로 처리되지 않는 것 같습니다. 샘플 코드 :

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

사용자 정의 월 이름을 “bEd”텍스트 값으로 “bed”로 변경하면 잘 구문 분석됩니다.

좋습니다. 몇 가지 데이터 포인트가 더 있습니다.

사용하는 비용 Substring과 IsPrefix큰하지만 끔찍한하지 않습니다. 내 개발 노트북의 “Friday April 12 2013 20:28:42″샘플에서 1 초에 실행할 수있는 구문 분석 작업 수를 약 460K에서 약 400K로 변경합니다. 가능하면 속도 저하를 피하고 싶지만 그렇게 나쁘지는 않습니다.
정규화는 이식 가능한 클래스 라이브러리에서 사용할 수 없기 때문에 생각보다 덜 실현 가능합니다. 나는 잠재적 으로 비 PCL 빌드 에만 사용할 수 있으므로 PCL 빌드가 약간 덜 정확할 수 있습니다. 정규화 테스트 ( string.IsNormalized)의 성능 저하로 인해 성능이 초당 약 44 만 5 천 호출로 저하되며,이를 감안할 수 있습니다. 예를 들어, “ß”를 포함하는 월 이름은 많은 문화권에서 “ss”와 일치해야한다고 생각합니다. 정규화는 그렇게하지 않습니다.

답변

여러 정규화 형식을 처리하는 것과는 별도로 여러 <-> 하나 / 다 케이스 매핑 문제를 먼저 고려하겠습니다.

예를 들면 :

x heiße y
  ^--- cursor

일치 heisse하지만 커서 1을 너무 많이 이동합니다. 과:

x heisse y
  ^--- cursor

일치 heiße하지만 커서 1을 너무 적게 이동합니다.

이것은 단순한 일대일 매핑이없는 모든 캐릭터에 적용됩니다.

실제로 일치 된 부분 문자열의 길이를 알아야합니다. 하지만 Compare, IndexOf거리 정보를 던지 .. 등. 그것은 정규 표현식 가능할 수도 있지만 구현은 전체 사례 폴딩을하지 않는 등 일치하지 않는 경우
ß에 ss/SS비록 대소 문자를 구분 모드 .Compare와 .IndexOf않습니다. 그리고 어쨌든 모든 후보자에 대해 새로운 정규식을 만드는 데는 비용이 많이 듭니다.

이에 대한 가장 간단한 해결책은 문자열을 케이스 폴딩 형식으로 내부적으로 저장하고 케이스 폴딩 후보와 이진 비교를 수행하는 것입니다. 그러면 .Length커서가 내부 표현 용이므로 커서를 올바르게 이동할 수 있습니다 . 또한을 사용하지 않아도 대부분의 손실 된 성능을 되 찾을 수 있습니다 CompareOptions.IgnoreCase.

더 – 불행하게도 내장에는 전체 케이스 매핑이 없기 때문에 가난한 사람의 경우 접이식 중 하나가 작동하지 않는 어떠한 경우 배 기능이없는 ToUpper방법이 켜지지 않습니다 ß으로는 SS.

예를 들어, 이것은 일반 형식 C의 문자열이 주어지면 Java (및 Javascript에서도)에서 작동합니다.

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

Java의 대소 문자 무시 비교는 C #의 CompareOptions.IgnoreCase. 따라서이 점에서 반대입니다. Java는 전체 케이스 매핑을 수행하지만 간단한 케이스 접기-C #은 단순한 케이스 매핑을 수행하지만 전체 케이스 접기는 수행합니다.

따라서 문자열을 사용하기 전에 케이스를 접을 수있는 타사 라이브러리가 필요할 수 있습니다.

어떤 작업을하기 전에 문자열이 일반 형식 C인지 확인해야합니다. 라틴 스크립트에 최적화 된이 예비 빠른 검사를 사용할 수 있습니다.

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

이것은 거짓 긍정을 제공하지만 거짓 부정은 제공하지 않습니다. 모든 문자열에 대해 수행해야하지만 라틴 스크립트 문자를 사용할 때 460k 구문 분석 / 초가 전혀 느려지지 않을 것으로 예상합니다. 거짓 양성을 사용 IsNormalized하면 참 음성 / 양성을 얻고 필요한 경우 정규화 한 후에 만 사용할 수 있습니다.

결론적으로, 처리는 먼저 정규형 C를 확인한 다음 케이스 폴드를 보장하는 것입니다. 처리 된 문자열과 이진 비교를 수행하고 현재 이동하면서 커서를 이동합니다.

답변

이것이 요구 사항을 충족하는지 확인하십시오 .. :

public static partial class GlobalizationExtensions {
    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)!=startIndex)
            return ~0;
        else
            // source is started with prefix
            // therefore the loop must exit
            for(int length2=0, length1=prefix.Length; ; )
                if(0==compareInfo.Compare(
                        prefix, 0, length1,
                        source, startIndex, ++length2, options))
                    return length2;
    }
}

compareInfo.Compare로 source시작 하면 한 번만 수행합니다 prefix. 그렇지 않은 경우 다음을 IsPrefix반환합니다 -1. 그렇지 않으면에서 사용되는 문자의 길이입니다 source.

그러나 다음과 같은 경우 증가 length2하는 것 외에는 전혀 모릅니다 1.

var candidate="ßssß\u00E9\u0302";
var text="abcd ssßss\u0065\u0301\u0302sss";

var count=
    culture.CompareInfo.IsPrefix(text, candidate, 5, CompareOptions.IgnoreCase);

업데이트 :

나는 약간의 성능을 개선하려고 노력했지만 다음 코드에 버그가 있는지 여부는 입증되지 않았습니다.

public static partial class GlobalizationExtensions {
    public static int Compare(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, ref int length2,
        CompareOptions options) {
        int length1=prefix.Length, v2, v1;

        if(0==(v1=compareInfo.Compare(
            prefix, 0, length1, source, startIndex, length2, options))
            ) {
            return 0;
        }
        else {
            if(0==(v2=compareInfo.Compare(
                prefix, 0, length1, source, startIndex, 1+length2, options))
                ) {
                ++length2;
                return 0;
            }
            else {
                if(v1<0||v2<0) {
                    length2-=2;
                    return -1;
                }
                else {
                    length2+=2;
                    return 1;
                }
            }
        }
    }

    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)
                !=startIndex)
            return ~0;
        else
            for(int length2=
                    Math.Min(prefix.Length, source.Length-(1+startIndex)); ; )
                if(0==compareInfo.Compare(
                        source, prefix, startIndex, ref length2, options))
                    return length2;
    }
}

나는 특정 케이스로 테스트했고 비교는 약 3으로 줄였습니다.

답변

이것은 실제로 정규화없이 IsPrefix.

동일한 수의 문자 가 아닌 동일한 수의 텍스트 요소 를 비교해야 하지만 여전히 일치하는 문자 수를 반환합니다.

Noda Time의 ValueCursor.cs에서MatchCaseInsensitive 메서드 복사본을 만들고 정적 컨텍스트에서 사용할 수 있도록 약간 수정했습니다.

// Noda time code from MatchCaseInsensitive in ValueCursor.cs
static int IsMatch_Original(string source, int index, string match, CompareInfo compareInfo)
{
    unchecked
    {
        if (match.Length > source.Length - index)
        {
            return 0;
        }

        // TODO(V1.2): This will fail if the length in the input string is different to the length in the
        // match string for culture-specific reasons. It's not clear how to handle that...
        if (compareInfo.Compare(source, index, match.Length, match, 0, match.Length, CompareOptions.IgnoreCase) == 0)
        {
            return match.Length;
        }

        return 0;
    }
}

(참조 용으로 만 포함되어 있습니다. 아시다시피 제대로 비교되지 않는 코드입니다.)

해당 메서드의 다음 변형은 프레임 워크에서 제공하는 StringInfo.GetNextTextElement 를 사용합니다 . 아이디어는 텍스트 요소별로 텍스트 요소를 비교하여 일치를 찾고 발견되면 소스 문자열에서 일치하는 실제 문자 수를 반환하는 것입니다.

// Using StringInfo.GetNextTextElement to match by text elements instead of characters
static int IsMatch_New(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < source.Length && matchIndex < match.Length)
    {
        // Get text elements at the current positions of source and match
        // Normally that will be just one character but may be more in case of Unicode combining characters
        string sourceElem = StringInfo.GetNextTextElement(source, sourceIndex);
        string matchElem = StringInfo.GetNextTextElement(match, matchIndex);

        // Compare the current elements.
        if (compareInfo.Compare(sourceElem, matchElem, CompareOptions.IgnoreCase) != 0)
        {
            return 0; // No match
        }

        // Advance in source and match (by number of characters)
        sourceIndex += sourceElem.Length;
        matchIndex += matchElem.Length;
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != match.Length)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

그 방법은 (: 기본적으로 그냥 문자열의 변종 몇 당신이 제공 한 테스트 내 테스트 케이스에 따라 적어도 잘 작동 "b\u00e9d"과 "be\u0301d").

그러나 GetNextTextElement 메서드는 각 텍스트 요소에 대한 하위 문자열을 생성 하므로이 구현에는 많은 하위 문자열 비교가 필요하므로 성능에 영향을 미칩니다.

따라서 GetNextTextElement를 사용하지 않고 대신 유니 코드 조합 문자를 건너 뛰어 문자에서 실제 일치 길이 를 찾는 다른 변형을 만들었습니다 .

// This should be faster
static int IsMatch_Faster(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceLength = source.Length;
    int matchLength = match.Length;
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < sourceLength && matchIndex < matchLength)
    {
        sourceIndex += GetTextElemLen(source, sourceIndex, sourceLength);
        matchIndex += GetTextElemLen(match, matchIndex, matchLength);
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != matchLength)
    {
        return 0; // No match
    }

    // Check if we've found a match
    if (compareInfo.Compare(source, index, sourceIndex - index, match, 0, matchIndex, CompareOptions.IgnoreCase) != 0)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

이 방법은 다음 두 도우미를 사용합니다.

static int GetTextElemLen(string str, int index, int strLen)
{
    bool stop = false;
    int elemLen;

    for (elemLen = 0; index < strLen && !stop; ++elemLen, ++index)
    {
        stop = !IsCombiningCharacter(str, index);
    }

    return elemLen;
}

static bool IsCombiningCharacter(string str, int index)
{
    switch (CharUnicodeInfo.GetUnicodeCategory(str, index))
    {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;

        default:
            return false;
    }
}

나는 벤치 마킹을 한 적이 없기 때문에 더 빠른 방법이 실제로 더 빠른지 실제로 알지 못합니다. 확장 된 테스트도하지 않았습니다.

그러나 이것은 유니 코드 결합 문자를 포함 할 수있는 문자열에 대해 문화적 민감한 부분 문자열 일치를 수행하는 방법에 대한 귀하의 질문에 답할 것입니다.

다음은 내가 사용한 테스트 사례입니다.

static Tuple<string, int, string, int>[] tests = new []
{
    Tuple.Create("x b\u00e9d y", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d y", 2, "b\u00e9d", 4),

    Tuple.Create("x b\u00e9d", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d", 2, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d y", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d y", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9", 0, "be\u0301d", 0),
    Tuple.Create("be\u0301", 0, "b\u00e9d", 0),
};

튜플 값은 다음과 같습니다.

소스 문자열 (haystack)
소스의 시작 위치입니다.
일치 문자열 (바늘)입니다.
예상되는 일치 길이입니다.

세 가지 방법으로 이러한 테스트를 실행하면 다음 결과가 생성됩니다.

Test #0: Orignal=BAD; New=OK; Faster=OK
Test #1: Orignal=BAD; New=OK; Faster=OK
Test #2: Orignal=BAD; New=OK; Faster=OK
Test #3: Orignal=BAD; New=OK; Faster=OK
Test #4: Orignal=BAD; New=OK; Faster=OK
Test #5: Orignal=BAD; New=OK; Faster=OK
Test #6: Orignal=BAD; New=OK; Faster=OK
Test #7: Orignal=BAD; New=OK; Faster=OK
Test #8: Orignal=OK; New=OK; Faster=OK
Test #9: Orignal=OK; New=OK; Faster=OK

마지막 두 테스트는 소스 문자열이 일치 문자열보다 짧은 경우를 테스트합니다. 이 경우 원래 (Noda 시간) 방법도 성공합니다.