[nlp] How do I do word stemming or lemmatization?

I've tried PorterStemmer and Snowball, but neither works on all words, and both miss some very common ones.

My test words are: "cats running ran cactus cactuses cacti community communities", and both stemmers get fewer than half of them right.

Answer

If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet.

Note that if you are using this lemmatizer for the first time, you must download the corpus before using it. This can be done as follows:

>>> import nltk
>>> nltk.download('wordnet')

You only have to do this once. Assuming that you have now downloaded the corpus, it works like this:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized','v')
'fantasize'

There are other lemmatizers in the nltk.stem module, but I haven't tried them myself.

Answer

I use Stanford NLP to perform lemmatization. I ran into a similar problem over the last few days, and Stack Overflow helped me solve it.

import java.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.CoreMap;

public class example
{
    public static void main(String[] args)
    {
        // Enable only the annotators that lemmatization depends on.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props, false);

        // Sample input taken from the question; replace with the string you want.
        String text = "cats running ran cactus cactuses cacti community communities";
        Annotation document = pipeline.process(text);

        // Walk every token of every sentence and print its lemma.
        for(CoreMap sentence: document.get(SentencesAnnotation.class))
        {
            for(CoreLabel token: sentence.get(TokensAnnotation.class))
            {
                String word = token.get(TextAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println("lemmatized version of " + word + ": " + lemma);
            }
        }
    }
}

It might also be a good idea to use stopwords to trim the output lemmas if they are later used in a classifier. Take a look at the coreNlp extension written by John Conwell.
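The stopword filtering mentioned above can be sketched in a few lines. This is a minimal illustration in Python rather than Java, and it does not use the coreNlp extension: the `STOPWORDS` set and the `drop_stopwords` helper are hypothetical names, and the stopword list is hand-picked for the example, not a standard list.

```python
# Hand-picked stopwords for illustration only (an assumption, not a standard list).
STOPWORDS = {"the", "to", "a", "an", "of", "and", "in", "around", "between"}


def drop_stopwords(lemmas, stopwords=STOPWORDS):
    """Keep lemmas that are alphabetic and not stopwords, preserving order."""
    return [lemma for lemma in lemmas
            if lemma.isalpha() and lemma.lower() not in stopwords]


# A sample list of lemmas, as a lemmatizer might produce for one sentence.
lemmas = ["the", "cactus", "run", "to", "the", "community", "to", "see",
          "the", "cat", "run", "around", "cactus", "between", "community", "."]

filtered = drop_stopwords(lemmas)
print(filtered)
```

In practice you would probably load a larger stopword list suited to your corpus and classifier instead of hard-coding one.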

Answer

I tried your list of terms on this Snowball demo site and the results look okay...

cats -> cat
running -> run
ran -> ran
cactus -> cactus
cactuses -> cactus
community -> communiti
communities -> communiti

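A stemmer is supposed to reduce inflected forms of a word to some common root. It's not really a stemmer's job to make that root a "proper" dictionary word. For that you need to look at morphological/orthographic analysers.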

I think this question is about more or less the same thing, and Kaarel's answer to that question is where I took the second link from.

Answer

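The stemmer vs. lemmatizer debate goes on. It's a matter of preferring precision over efficiency. You should lemmatize to get linguistically meaningful units, and stem to use minimal computing juice while still indexing a word and its variations under the same key.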

See Stemmers vs Lemmatizers

Here's an example with Python NLTK:

>>> sent = "cats running ran cactus cactuses cacti community communities"
>>> from nltk.stem import PorterStemmer, WordNetLemmatizer
>>>
>>> port = PorterStemmer()
>>> " ".join([port.stem(i) for i in sent.split()])
'cat run ran cactu cactus cacti commun commun'
>>>
>>> wnl = WordNetLemmatizer()
>>> " ".join([wnl.lemmatize(i) for i in sent.split()])
'cat running ran cactus cactus cactus community community'
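The point about indexing a word and its variations under the same key can be sketched without NLTK. The following is only an illustration under stated assumptions: `crude_stem` is a toy suffix stripper standing in for a real stemmer, `LEMMA_TABLE` is a tiny hand-written table standing in for a dictionary such as WordNet, and `build_index` is a hypothetical helper name.

```python
from collections import defaultdict


def crude_stem(word):
    """Toy rule-based suffix stripper; a hypothetical stand-in for a real stemmer."""
    for suffix, replacement in (("ies", "y"), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


# Hand-written lemma table standing in for a real dictionary (an assumption).
LEMMA_TABLE = {"cats": "cat", "communities": "community", "ran": "run"}


def lookup_lemma(word):
    """Dictionary lookup: slower to build and maintain, but handles irregular forms."""
    return LEMMA_TABLE.get(word, word)


def build_index(words, normalize):
    """Group words under the key produced by the normalize function."""
    index = defaultdict(list)
    for word in words:
        index[normalize(word)].append(word)
    return dict(index)


words = ["cat", "cats", "community", "communities", "run", "ran"]
stem_index = build_index(words, crude_stem)
lemma_index = build_index(words, lookup_lemma)
print(stem_index)
print(lemma_index)
```

With these toy rules, both indexes merge the regular plurals, but only the dictionary-based one files the irregular form "ran" under "run". That is the precision the lemmatizer buys at the cost of needing a lexicon.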

Answer

Martin Porter's official page contains a Porter Stemmer in PHP as well as in other languages.

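If you're really serious about good stemming, you will need to start with something like the Porter algorithm, refine it by adding rules that fix the incorrect cases common in your dataset, and then finally add a lot of exceptions to the rules. This can be easily implemented with key/value pairs (dbm/hash/dictionaries), where the key is the word to look up and the value is the stemmed word that replaces the original. A commercial search engine I worked on ended up with a number of exceptions to a modified Porter algorithm.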
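The key/value exception approach described above might look roughly like this. It is a sketch, not the search engine's actual code: `base_stem` is a toy suffix stripper standing in for a (modified) Porter stemmer, and the entries in `STEM_EXCEPTIONS` are made-up examples of corrections you might collect from your own data.

```python
def base_stem(word):
    """Toy suffix stripper standing in for a Porter-style stemmer (an assumption)."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Exception table: key = word to look up, value = stem that replaces the original.
# The entries are illustrative; a real table grows out of errors seen in your data.
STEM_EXCEPTIONS = {
    "cactus": "cactus",  # the rules alone would strip the final "s"
    "cacti": "cactus",   # irregular plural that no suffix rule catches
    "ran": "run",        # irregular verb form
}


def stem(word, exceptions=STEM_EXCEPTIONS):
    """Consult the exception table first, then fall back to the rule-based stemmer."""
    word = word.lower()
    if word in exceptions:
        return exceptions[word]
    return base_stem(word)


words = "cats running ran cactus cactuses cacti".split()
stems = [stem(w) for w in words]
print(stems)
```

Checking the table before the rules keeps the algorithm untouched, so corrections can be added or removed as data without changing code. A dbm file or database table could replace the in-memory dict if the list gets large.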

Answer

http://wordnet.princeton.edu/man/morph.3WN

In many of my projects, I prefer the lexicon-based WordNet lemmatizer over the more aggressive Porter stemming.

http://wordnet.princeton.edu/links#PHP has a link to a PHP interface to the WN APIs.

Answer

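Based on various answers on Stack Overflow and blogs I've come across, this is the method I'm using, and it seems to return real words quite well. The idea is to split the incoming text into an array of words (using whichever method you like), then find the part of speech (POS) of each word and use that to help stem and lemmatize it.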

Your sample above doesn't work too well, because the POS can't be determined from a bare list of words. If we use a real sentence, however, things work much better.

import nltk
from nltk.corpus import wordnet

lmtzr = nltk.WordNetLemmatizer().lemmatize


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN


def normalize_text(text):
    word_pos = nltk.pos_tag(nltk.word_tokenize(text))
    lemm_words = [lmtzr(sw[0], get_wordnet_pos(sw[1])) for sw in word_pos]

    return [x.lower() for x in lemm_words]

print(normalize_text('cats running ran cactus cactuses cacti community communities'))
# ['cat', 'run', 'ran', 'cactus', 'cactuses', 'cacti', 'community', 'community']

print(normalize_text('The cactus ran to the community to see the cats running around cacti between communities.'))
# ['the', 'cactus', 'run', 'to', 'the', 'community', 'to', 'see', 'the', 'cat', 'run', 'around', 'cactus', 'between', 'community', '.']