[python] Download and unzip a .zip file without writing to disk

Question 1

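I have managed to get my first Python script working. It downloads a list of .ZIP files from a URL, then extracts the ZIP files and writes them to disk.

I am now at a loss as to how to achieve the next step.

My primary goal is to download and extract the zip file and pass the contents (CSV data) on via a TCP stream. I would prefer not to write any of the zip or extracted files to disk if I can avoid it.

Here is my current script, which works but unfortunately has to write the files to disk.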

import urllib, urllister
import zipfile
import urllib2
import os
import time
import pickle

# check for extraction directories existence
if not os.path.isdir('downloaded'):
    os.makedirs('downloaded')

if not os.path.isdir('extracted'):
    os.makedirs('extracted')

# open logfile for downloaded data and save to local variable
if os.path.isfile('downloaded.pickle'):
    downloadedLog = pickle.load(open('downloaded.pickle'))
else:
    downloadedLog = {'key':'value'}

# remove entries older than 5 days (to maintain speed)

# path of zip files
zipFileURL = "http://www.thewebserver.com/that/contains/a/directory/of/zip/files"

# retrieve list of URLs from the webservers
usock = urllib.urlopen(zipFileURL)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()

# only parse urls
for url in parser.urls:
    if "PUBLIC_P5MIN" in url:

        # download the file
        downloadURL = zipFileURL + url
        outputFilename = "downloaded/" + url

        # check if file already exists on disk
        if url in downloadedLog or os.path.isfile(outputFilename):
            print "Skipping " + downloadURL
            continue

        print "Downloading ",downloadURL
        response = urllib2.urlopen(downloadURL)
        zippedData = response.read()

        # save data to disk
        print "Saving to ",outputFilename
        output = open(outputFilename,'wb')
        output.write(zippedData)
        output.close()

        # extract the data
        zfobj = zipfile.ZipFile(outputFilename)
        for name in zfobj.namelist():
            uncompressed = zfobj.read(name)

            # save uncompressed data to disk
            outputFilename = "extracted/" + name
            print "Saving extracted file to ",outputFilename
            output = open(outputFilename,'wb')
            output.write(uncompressed)
            output.close()

            # send data via tcp stream

            # file successfully downloaded and extracted store into local log and filesystem log
            downloadedLog[url] = time.time();
            pickle.dump(downloadedLog, open('downloaded.pickle', "wb" ))

Question 2

My suggestion would be to use a StringIO object. It emulates a file but resides in memory, so you could do something like this:

# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'

import zipfile
from StringIO import StringIO

zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()

# output: "hey, foo"

Or more simply (apologies to Vishal):

myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
    [ ... ]

In Python 3, use BytesIO instead of StringIO:

import zipfile
from io import BytesIO

filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]
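
To connect this back to the question's goal of passing the CSV data on via a TCP stream, here is a minimal Python 3 sketch. The helper names build_sample_zip and send_zip_members are hypothetical, the sample archive stands in for get_zip_data(), and socket.socketpair() stands in for a real TCP connection so the example runs without a network.

```python
import io
import socket
import zipfile


def build_sample_zip():
    """Hypothetical stand-in for a downloaded archive: zip bytes with one CSV."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.csv", "id,value\n1,foo\n2,bar\n")
    return buffer.getvalue()


def send_zip_members(zip_bytes, sock):
    """Decompress each member in memory and write it to the socket.

    Returns the total number of payload bytes sent.
    """
    sent = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            payload = archive.read(name)  # bytes, never touches the disk
            sock.sendall(payload)
            sent += len(payload)
    return sent


# Two connected sockets; in real use, `sender` would be a connected TCP socket.
sender, receiver = socket.socketpair()
with sender, receiver:
    bytes_sent = send_zip_members(build_sample_zip(), sender)
    sender.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
    received = b""
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
```

archive.read(name) loads a whole member into memory at once, which is fine for small CSV files. For large members, reading from archive.open(name) in chunks would probably be the safer choice.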

Question 3

Below is a code snippet I used to fetch a zipped CSV file. Please have a look.

Python 2:

from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen

resp = urlopen("http://www.test.com/file.zip")
zipfile = ZipFile(StringIO(resp.read()))
for line in zipfile.open(file).readlines():
    print line

Python 3:

from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
# or: requests.get(url).content

resp = urlopen("http://www.test.com/file.zip")
zipfile = ZipFile(BytesIO(resp.read()))
for line in zipfile.open(file).readlines():
    print(line.decode('utf-8'))

Here file is a string. To get the actual string that you want to pass, you can use zipfile.namelist(). For instance:

resp = urlopen('http://mlg.ucd.ie/files/datasets/bbc.zip')
zipfile = ZipFile(BytesIO(resp.read()))
zipfile.namelist()
# ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']
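
As a rough sketch of how namelist() can be used to pick the member to open: the archive below is built in memory as a stand-in for resp.read(), and the member names readme.txt and prices.csv are made up for illustration.

```python
import io
import zipfile

# Build a small archive in memory (stand-in for the downloaded bytes).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", "not the data\n")
    archive.writestr("prices.csv", "symbol,price\nABC,1.5\n")

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    # Keep only the members whose name ends in .csv.
    csv_names = [name for name in names if name.lower().endswith(".csv")]
    with archive.open(csv_names[0]) as member:
        # Members are read as bytes, so decode each line explicitly.
        lines = [line.decode("utf-8").rstrip("\n") for line in member]
```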

Question 4

I'd like to offer an updated Python 3 version of Vishal's excellent answer, which used Python 2, along with some explanation of the adaptations and changes, which may already have been mentioned.

from io import BytesIO
from zipfile import ZipFile
import urllib.request

url = urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/loc162txt.zip")

with ZipFile(BytesIO(url.read())) as my_zip_file:
    for contained_file in my_zip_file.namelist():
        # with open(("unzipped_and_read_" + contained_file + ".file"), "wb") as output:
        for line in my_zip_file.open(contained_file).readlines():
            print(line)
            # output.write(line)

Necessary changes:

There is no StringIO module in Python 3 (it has been moved to io.StringIO). Instead, I use io.BytesIO, because we will be handling a bytestream (Docs, also this thread).
urlopen:
- "The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen.", Docs and this thread.

Note:

In Python 3, the printed output lines will look like this: b'some text'. This is expected, as they aren't strings. Remember, we're reading a bytestream. Have a look at Dan04's excellent answer.
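
If you want str rather than bytes, one option is to wrap the member in io.TextIOWrapper, which also lets the csv module read it directly. This is only a sketch: the member name cities.csv, its contents, and the UTF-8 encoding are assumptions made for illustration, and the real archive may use a different encoding.

```python
import csv
import io
import zipfile

# Build a tiny archive in memory so the example runs without a download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("cities.csv", "code,name\nNL AMS,Amsterdam\n")
buffer.seek(0)

with zipfile.ZipFile(buffer) as archive:
    # Reading the member directly yields bytes.
    with archive.open("cities.csv") as raw:
        first_raw_line = raw.readline()

    # Wrapping it in TextIOWrapper yields str, which csv.reader expects.
    with archive.open("cities.csv") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        rows = list(csv.reader(text))
```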

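A few minor changes I made:

I use with ... as instead of zipfile = ..., according to the Docs.
The script now uses .namelist() to cycle through all the files in the zip and print their contents.
I moved the creation of the ZipFile object into the with statement, although I'm not sure if that's better.
In response to NumenorForLife's comment, I added (and commented out) an option to write the bytestream to a file (one per file in the zip). It adds "unzipped_and_read_" to the beginning of the filename and a ".file" extension (I prefer not to use ".txt" for files containing byte strings). The indentation of the code will, of course, need to be adjusted if you want to use it.
- You need to be careful here. Because we have a byte string, we use binary mode, hence "wb". I have a feeling that writing binary opens a can of worms anyway.
I am using an example file, the UN/LOCODE text archive.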

What I didn't do:

NumenorForLife asked about saving the zip to disk. I'm not sure what he meant by that. Downloading the zip file? That's a different task. See Oleh Prypin's excellent answer.

Here's a way:

import urllib.request
import shutil

with urllib.request.urlopen("http://www.unece.org/fileadmin/DAM/cefact/locode/2015-2_UNLOCODE_SecretariatNotes.pdf") as response, open("downloaded_file.pdf", 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

Question 5

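Write to a temporary file that resides in RAM

It turns out the tempfile module ( http://docs.python.org/library/tempfile.html ) has just the thing:

tempfile.SpooledTemporaryFile([max_size=0[, mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]])

This function operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file's fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().

The resulting file has one additional method, rollover(), which causes the file to roll over to an on-disk file regardless of its size.

The returned object is a file-like object whose _file attribute is either a StringIO object or a true file object, depending on whether rollover() has been called. This file-like object can be used in a with statement, just like a normal file.

New in version 2.6.

Or, if you're lazy and you have a tmpfs-mounted /tmp on Linux, you can just create a file there, but you have to delete it yourself and deal with naming.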
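
A minimal Python 3 sketch of how this might look in practice: the extracted member is held in a SpooledTemporaryFile, which stays in memory as long as it remains under max_size. The archive contents and the 1 MiB threshold are arbitrary choices for illustration.

```python
import io
import tempfile
import zipfile

# Build a tiny archive in memory (stand-in for the downloaded zip bytes).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "a,b\n1,2\n")

# max_size is an arbitrary threshold for this sketch (1 MiB); data below
# that size is kept in memory instead of being written to disk.
with tempfile.SpooledTemporaryFile(max_size=1024 * 1024) as spool:
    with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
        spool.write(archive.read("data.csv"))
    spool.seek(0)  # rewind before reading the data back
    extracted = spool.read()
```

The advantage over a plain BytesIO is the automatic fallback to disk when a file turns out to be larger than expected.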

Question 6

I'd like to add my Python 3 answer for completeness:

from io import BytesIO
from zipfile import ZipFile
import requests

def get_zip(file_url):
    url = requests.get(file_url)
    zipfile = ZipFile(BytesIO(url.content))
    zip_names = zipfile.namelist()
    if len(zip_names) == 1:
        file_name = zip_names.pop()
        extracted_file = zipfile.open(file_name)
        return extracted_file
    return [zipfile.open(file_name) for file_name in zip_names]

Question 7

Adding on to the other answers, using requests:

# download from web

import requests
url = 'http://mlg.ucd.ie/files/datasets/bbc.zip'
content = requests.get(url)

# unzip the content
from io import BytesIO
from zipfile import ZipFile
f = ZipFile(BytesIO(content.content))
print(f.namelist())

# outputs ['bbc.classes', 'bbc.docs', 'bbc.mtx', 'bbc.terms']

Use help(f) to get details on more functions, e.g. extractall(), which extracts the contents of the zip file to disk so that they can later be opened with with open.
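
For illustration, a small sketch of extractall() followed by with open. Note that, unlike the in-memory approaches, this does write to disk. A temporary directory is used here so that nothing is left behind, and the archive is built in memory with made-up contents instead of being downloaded.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny archive in memory (stand-in for content.content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("bbc.terms", "ad\nsale\n")

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    # The directory and everything in it is removed when the block exits.
    with tempfile.TemporaryDirectory() as target_dir:
        archive.extractall(target_dir)
        extracted_names = sorted(os.listdir(target_dir))
        path = os.path.join(target_dir, "bbc.terms")
        with open(path, encoding="utf-8") as handle:
            terms = handle.read().splitlines()
```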

Question 8

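Vishal's example, however great, is confusing when it comes to the file name, and I do not see the merit of redefining 'zipfile'.

Here is my example, which downloads a zip containing some files, one of which is a CSV file that I subsequently read into a pandas DataFrame: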

from StringIO import StringIO
from zipfile import ZipFile
from urllib import urlopen
import pandas

url = urlopen("https://www.federalreserve.gov/apps/mdrm/pdf/MDRM.zip")
zf = ZipFile(StringIO(url.read()))
for item in zf.namelist():
    print("File in zip: "+  item)
# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall be ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])

(Note: I use Python 2.7.13.)

This is the exact solution that worked for me. I just tweaked it a little for Python 3 by removing StringIO and using the io library instead.

Python 3 version

from io import BytesIO
from zipfile import ZipFile
import pandas
import requests

url = "https://www.nseindia.com/content/indices/mcwb_jun19.zip"
content = requests.get(url)
zf = ZipFile(BytesIO(content.content))

for item in zf.namelist():
    print("File in zip: "+  item)

# find the first matching csv file in the zip:
match = [s for s in zf.namelist() if ".csv" in s][0]
# the first line of the file contains a string - that line shall be ignored, hence skiprows
df = pandas.read_csv(zf.open(match), low_memory=False, skiprows=[0])