[string-matching] 가변 길이 문자열에 대한 더 나은 유사성 순위 알고리즘

나는 일반적으로 제안되는 것 (levenshtein distance, soundex 등)보다 가변 길이 문자열에서 더 나은 결과를 얻는 문자열 유사성 알고리즘을 찾고 있습니다.

예를 들어

주어진 문자열 A : “Robert”,

그런 다음 문자열 B : “Amy Robertson”

보다 더 나은 일치

문자열 C : “리차드”

또한이 알고리즘은 언어에 구애받지 않아야합니다 (영어 이외의 언어에서도 작동).

답변

Catalysoft의 Simon White는 내 목적에 잘 맞는 인접한 문자 쌍을 비교하는 매우 영리한 알고리즘에 대한 기사를 작성했습니다.

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon은 Java 버전의 알고리즘을 가지고 있으며 그 아래에 PL / Ruby 버전을 작성했습니다 (Mark Wong-VanHaren의 관련 포럼 항목 주석에서 수행 된 일반 루비 버전에서 가져옴). 내 PostgreSQL 쿼리에서 사용할 수 있습니다.

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase!
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase!
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size
intersection = 0
pairs1.each do |p1|
  0.upto(pairs2.size-1) do |i|
    if p1 == pairs2[i]
      intersection += 1
      pairs2.slice!(i)
      break
    end
  end
end
(2.0 * intersection) / union

' LANGUAGE 'plruby';

매력처럼 작동합니다!

답변

marzagao의 답변 은 훌륭합니다. 나는 그것을 C #으로 변환 했으므로 여기에 게시 할 것이라고 생각했습니다.

Pastebin Link

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
        List<string> pairs1 = WordLetterPairs(str1.ToUpper());
        List<string> pairs2 = WordLetterPairs(str2.ToUpper());

        int intersection = 0;
        int union = pairs1.Count + pairs2.Count;

        for (int i = 0; i < pairs1.Count; i++)
        {
            for (int j = 0; j < pairs2.Count; j++)
            {
                if (pairs1[i] == pairs2[j])
                {
                    intersection++;
                    pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

                    break;
                }
            }
        }

        return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
        List<string> AllPairs = new List<string>();

        // Tokenize the string and put the tokens/words into an array
        string[] Words = Regex.Split(str, @"\s");

        // For each word
        for (int w = 0; w < Words.Length; w++)
        {
            if (!string.IsNullOrEmpty(Words[w]))
            {
                // Find the pairs of characters
                String[] PairsInWord = LetterPairs(Words[w]);

                for (int p = 0; p < PairsInWord.Length; p++)
                {
                    AllPairs.Add(PairsInWord[p]);
                }
            }
        }

        return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
        int numPairs = str.Length - 1;

        string[] pairs = new string[numPairs];

        for (int i = 0; i < numPairs; i++)
        {
            pairs[i] = str.Substring(i, 2);
        }

        return pairs;
    }
}

답변

다음은 marzagao의 답변 의 다른 버전입니다.이 버전은 Python으로 작성되었습니다.

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

if __name__ == "__main__":
    """
    Run a test using the example taken from:
    http://www.catalysoft.com/articles/StrikeAMatch.html
    """
    w1 = 'Healed'
    words = ['Heard', 'Healthy', 'Help', 'Herded', 'Sealed', 'Sold']

    for w2 in words:
        print('Healed --- ' + w2)
        print(string_similarity(w1, w2))
        print()

답변

Simon White가 제안한 StrikeAMatch 알고리즘의 PHP 구현은 다음과 같습니다. 장점은 (링크에 나와있는 것처럼) 다음과 같습니다.

어휘 유사성의 진정한 반영 -차이가 작은 문자열은 비슷한 것으로 인식되어야합니다. 특히, 중요한 부분 문자열 겹침은 문자열 사이의 높은 수준의 유사성을 가리켜 야합니다.
단어 순서 변경에 대한 견고성 – 동일한 단어를 포함하지만 다른 순서로 된 두 개의 문자열은 유사한 것으로 인식되어야합니다. 반면에, 한 문자열이 다른 문자열에 포함 된 문자의 임의의 아나그램 인 경우 (보통) 다른 것으로 인식해야합니다.
언어 독립성 -알고리즘은 영어뿐만 아니라 여러 언어로 작동해야합니다.

<?php
/**
 * LetterPairSimilarity algorithm implementation in PHP
 * @author Igal Alkon
 * @link http://www.catalysoft.com/articles/StrikeAMatch.html
 */
class LetterPairSimilarity
{
    /**
     * @param $str
     * @return mixed
     */
    private function wordLetterPairs($str)
    {
        $allPairs = array();

        // Tokenize the string and put the tokens/words into an array

        $words = explode(' ', $str);

        // For each word
        for ($w = 0; $w < count($words); $w++)
        {
            // Find the pairs of characters
            $pairsInWord = $this->letterPairs($words[$w]);

            for ($p = 0; $p < count($pairsInWord); $p++)
            {
                $allPairs[] = $pairsInWord[$p];
            }
        }

        return $allPairs;
    }

    /**
     * @param $str
     * @return array
     */
    private function letterPairs($str)
    {
        $numPairs = mb_strlen($str)-1;
        $pairs = array();

        for ($i = 0; $i < $numPairs; $i++)
        {
            $pairs[$i] = mb_substr($str,$i,2);
        }

        return $pairs;
    }

    /**
     * @param $str1
     * @param $str2
     * @return float
     */
    public function compareStrings($str1, $str2)
    {
        $pairs1 = $this->wordLetterPairs(strtoupper($str1));
        $pairs2 = $this->wordLetterPairs(strtoupper($str2));

        $intersection = 0;

        $union = count($pairs1) + count($pairs2);

        for ($i=0; $i < count($pairs1); $i++)
        {
            $pair1 = $pairs1[$i];

            $pairs2 = array_values($pairs2);
            for($j = 0; $j < count($pairs2); $j++)
            {
                $pair2 = $pairs2[$j];
                if ($pair1 === $pair2)
                {
                    $intersection++;
                    unset($pairs2[$j]);
                    break;
                }
            }
        }

        return (2.0*$intersection)/$union;
    }
}

답변

John Rutledge의 대답 의 짧은 버전 :

def get_bigrams(string):
    '''
    Takes a string and returns a list of bigrams
    '''
    s = string.lower()
    return {s[i:i+2] for i in xrange(len(s) - 1)}

def string_similarity(str1, str2):
    '''
    Perform bigram comparison between two strings
    and return a percentage match in decimal form
    '''
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    return (2.0 * len(pairs1 & pairs2)) / (len(pairs1) + len(pairs2))

답변

이 토론은 정말 도움이되었습니다. Excel과 함께 사용하기 위해 알고리즘을 VBA로 변환하고 한 쌍의 문자열을 간단하게 비교하고 다른 하나를 문자열의 범위 / 배열과 비교하기위한 몇 가지 버전의 워크 시트 함수를 작성했습니다. strSimLookup 버전은 마지막으로 가장 일치하는 것을 문자열, 배열 인덱스 또는 유사성 메트릭으로 반환합니다.

이 구현은 Simon White 웹 사이트의 Amazon 예제에 나열된 것과 동일한 결과를 얻습니다. 차이가 어디에서 발생하는지 확실하지 않으면 VBA의 Split 기능이 될 수 있지만 내 목적에 잘 작동하는지 조사하지 않았습니다.

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function

답변

죄송합니다. 저자가 답을 만들지 않았습니다. 이것은 Digital Equipment Corporation에 의해 처음으로 알려진 알고리즘이며 종종 shingling이라고합니다.

http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-TN-1997-015.pdf