[python] Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer …

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

… a LookupError was raised.

> LookupError:
>     *********************************************************************
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

Answer

I had the same problem. Go into a Python shell and type:

>>> import nltk
>>> nltk.download()

An installation window then appears. Go to the 'Models' tab and select 'punkt' in the 'Identifier' column. Click Download and the necessary files will be installed. After that it should work!

Answer

You can do it like this:

import nltk
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize

You can download the tokenizer by passing punkt as an argument to the download function. The word and sentence tokenizers are then available from nltk.

If you want to download everything (chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers), do not pass any argument:

nltk.download()

For more details, see https://www.nltk.org/data.html
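
To avoid downloading on every run, the lookup and the download can be combined. The following is a minimal sketch, not part of the original answer: `ensure_punkt` is a hypothetical helper name, and its `find` and `download` parameters exist only so the logic can be tried without NLTK or a network connection. By default it falls back to `nltk.data.find` and `nltk.download`.

```python
def ensure_punkt(find=None, download=None):
    """Download punkt only if it cannot be found. Returns True if a download ran."""
    if find is None or download is None:
        import nltk  # lazy import: only needed when the defaults are used
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # Same resource name that appears in the LookupError message.
        find('tokenizers/punkt/english.pickle')
        return False
    except LookupError:
        download('punkt')
        return True
```

Calling `ensure_punkt()` once at the top of a script should then be enough before importing and using `word_tokenize` or `sent_tokenize`.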

Answer

This is what worked for me just now.

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

sentences_tokenized is a list of lists of tokens:

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

The sentences were taken from the example ipython notebook that accompanies the book "Mining the Social Web, 2nd Edition".

Answer

From the bash command line, run:

$ python -c "import nltk; nltk.download('punkt')"
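
For scripted setups, the same download can be pointed at a specific directory. This is a rough sketch, assuming the `download_dir` and `quiet` keyword arguments of `nltk.download` and the `nltk.data.path` search list. `download_punkt_to` is a hypothetical name, and its optional parameters are only there so the helper can be exercised without NLTK installed.

```python
def download_punkt_to(directory, download=None, search_path=None):
    """Download punkt into `directory` and make NLTK search that directory."""
    if download is None or search_path is None:
        import nltk  # lazy import: only needed when the defaults are used
        if download is None:
            download = nltk.download
        if search_path is None:
            search_path = nltk.data.path
    ok = download("punkt", download_dir=directory, quiet=True)
    # Assumption: download() returns a truthy value on success.
    if ok and directory not in search_path:
        search_path.append(directory)
    return bool(ok)
```

The directory only needs to be added to the search path when it is not one of the locations listed under "Searched in:" in the error message.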

Answer

This works for me:

>>> import nltk
>>> nltk.download()

On Windows this also brings up the NLTK Downloader.

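Answer

A simple nltk.download() will not solve this issue. I tried the following and it worked for me:

In the nltk folder, create a tokenizers folder and copy your punkt folder into that tokenizers folder.

This will work! The folder structure needs to be as shown in the picture.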
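
The copy step can also be scripted. The sketch below assumes the punkt folder has already been unpacked somewhere, and that the data directory you pass in is one of the locations listed under "Searched in:" in the error message. `install_punkt` is a hypothetical helper name. The resulting layout, `<data dir>/tokenizers/punkt/english.pickle`, matches the resource name in the LookupError.

```python
import shutil
from pathlib import Path

def install_punkt(punkt_dir, data_dir):
    """Copy an unpacked punkt folder to <data_dir>/tokenizers/punkt."""
    target = Path(data_dir) / "tokenizers" / "punkt"
    # Create <data_dir>/tokenizers if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok needs Python 3.8+; existing files are overwritten.
    shutil.copytree(punkt_dir, target, dirs_exist_ok=True)
    return target
```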

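Answer

nltk has pre-trained tokenizer models. A model is downloaded from internally predefined web sources and stored under the path of the installed nltk package when one of the following function calls is executed.

E.g. 1: tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

E.g. 2: nltk.download('punkt')

If you call either of the above statements in your code, make sure you have an internet connection that is not blocked by a firewall.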

I would like to share an alternative way to resolve the above issue, one that also gives a deeper understanding of it.

Follow the steps below and enjoy English word tokenization with nltk.

Step 1: First download the "english.pickle" model from the web.

Go to "http://www.nltk.org/nltk_data/" and click "download" at the entry "107. Punkt Tokenizer Models".

Step 2: Extract the downloaded "punkt.zip" file, find the "english.pickle" file in it, and place it on the C drive.

Step 3: Copy, paste, and run the following code.

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend([token for token in treebank_word_tokenize(subSent)])

    wordToken.append(subSentToken)

for token in wordToken:
    print(token)
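
Step 2 can also be done from Python instead of by hand. This is a sketch under assumptions, not part of the original answer: `extract_english_pickle` is a hypothetical helper, and the internal layout of punkt.zip is not taken from the source, so the function simply takes the first archive member whose file name is english.pickle.

```python
import zipfile
from pathlib import Path

def extract_english_pickle(zip_path, dest_dir):
    """Copy english.pickle out of punkt.zip into dest_dir and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Zip member names always use forward slashes.
            if name.rsplit("/", 1)[-1] == "english.pickle":
                target = Path(dest_dir) / "english.pickle"
                target.write_bytes(archive.read(name))
                return target
    raise FileNotFoundError("english.pickle not found in %s" % zip_path)
```

The returned path can then be passed to load with the file: prefix, as in Step 3. If the archive contains more than one english.pickle, check that the one picked here is the one you want.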

Let me know if you run into any problems.