[nlp] Detecting syllables in a word

I need to find a fairly efficient way to detect the syllables in a word. For example:

invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

* where V is a vowel and C is a consonant. For example:

pronunciation (5 syllables: pro-nun-ci-a-tion; CV-CVC-CV-V-CVC)
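To apply patterns like these, a word first has to be reduced to its C/V skeleton. Here is a minimal Python sketch of that step, assuming English and treating only a, e, i, o, u as vowels (the name `cv_skeleton` is made up for illustration). It works on letters rather than sounds, so it will not always line up with a pronunciation-based split.

```python
VOWELS = set("aeiou")  # assumption: only these five letters count as vowels


def cv_skeleton(word: str) -> str:
    """Map each letter to 'V' (vowel) or 'C' (any other letter)."""
    return "".join(
        "V" if ch in VOWELS else "C"
        for ch in word.lower()
        if ch.isalpha()
    )


print(cv_skeleton("pronunciation"))  # prints CCVCVCCVVCVVC
```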

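I've tried a few methods: regular expressions (which only help if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved to be very inefficient), and finally a finite state automaton (which did not produce anything useful).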
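For context, the regex approach usually boils down to counting runs of vowel letters, which gives a count but no syllable boundaries. A rough Python sketch, assuming English and treating y as a vowel (the function name is hypothetical):

```python
import re

# Assumption: a run of consecutive vowel letters approximates one syllable nucleus.
VOWEL_RUN = re.compile(r"[aeiouy]+")


def count_vowel_groups(word: str) -> int:
    """Count runs of vowel letters as a crude syllable estimate."""
    return len(VOWEL_RUN.findall(word.lower()))


print(count_vowel_groups("invisible"))  # prints 4
```

Note that it gets "pronunciation" wrong (4 instead of 5), because "ia" is counted as a single run.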

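The purpose of my application is to build a dictionary of all the syllables in a given language. This dictionary will later be used for spell checking (using Bayesian classifiers) and for text-to-speech synthesis.

I would appreciate any tips on other ways to solve this problem besides the approaches I've already tried.

I work in Java, but tips in C/C++, C#, Python, or Perl would work for me too.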

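Answer

Read about the TeX approach to this problem for the purposes of hyphenation. In particular, see Frank Liang's thesis, Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and it includes a small exceptions dictionary for the cases where the algorithm does not work.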
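To give a feel for how the pattern approach works, here is a minimal Python sketch. Each pattern is a letter string with digits between letters; for every pattern that matches somewhere in the dot-padded word, the digits are laid over the gaps between letters, the maximum wins at each gap, and an odd value permits a break. The handful of patterns below is a toy set chosen to cover one word, not the real TeX pattern file, and the function names are made up. The sketch also omits the minimum prefix/suffix lengths that TeX enforces.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and per-gap weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)  # weights[k] is the gap before letters[k]
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, exceptions=None):
    """Return word with '-' at every gap whose highest pattern weight is odd."""
    exceptions = exceptions or {}
    if word in exceptions:  # exception list takes priority over patterns
        return exceptions[word]
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."  # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)   # scores[j] is the gap before padded[j]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    pieces, last = [], 0
    for i in range(1, len(word)):
        if scores[i + 1] % 2 == 1:  # gap between word[i-1] and word[i]
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


# Toy pattern set for illustration only.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

print(hyphenate("hyphenation", TOY_PATTERNS))  # prints hy-phen-ation
```

Keep in mind that hyphenation points are not always syllable boundaries, so this gives an approximation for syllabification.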

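Answer

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here:
https://github.com/mnater/hyphenator or its successor: https://github.com/mnater/Hyphenopoly

That is, unless you're the type who enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. 🙂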

Answer

Here is a solution using NLTK:

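from nltk.corpus import cmudict

d = cmudict.dict()  # needs the corpus installed: nltk.download('cmudict')

def nsyl(word):
    # One count per pronunciation; vowel phonemes end in a stress digit (0-2).
    return [len([p for p in pron if p[-1].isdigit()]) for pron in d[word.lower()]]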

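Answer

I'm trying to tackle this problem for a program that calculates the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this web site (http://www.howmanysyllables.com/howtocountsyllables.html), and it gets reasonably close. It still has trouble with complicated words like "invisible" and "hyphenation", but I've found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It's a gamble, but I decided to remove the es in my algorithm.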

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        string currentWord = word;
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                //don't count diphthongs
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            //if full cycle and no vowel found, set lastWasVowel to false;
            if (!foundVowel)
                lastWasVowel = false;
        }
        //remove es, it's _usually? silent
        if (currentWord.Length > 2 &&
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent e
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        return numVowels;
    }
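For reference, this is roughly how a syllable count like the one above feeds the two scores. A Python sketch using the standard published coefficients; the function names and the choice to pass raw counts are assumptions for illustration.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Example: 100 words, 5 sentences, 150 syllables in total.
print(round(flesch_reading_ease(100, 5, 150), 3))   # prints 59.635
print(round(flesch_kincaid_grade(100, 5, 150), 2))  # prints 9.91
```

Both formulas depend on the average syllables per word, so small per-word errors in the count tend to average out over a longer text.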

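Answer

This is a particularly difficult problem that is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).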

Answer

Why calculate it? Every online dictionary has this information. http://dictionary.reference.com/browse/invisible
in·vis·i·ble

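Answer

Thanks, Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects your method works fine.

Here is your code in Java, along with test cases: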

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove es, it's _usually? silent
    // (endsWith compares contents; == on Strings compares references, not contents)
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // remove silent e
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2