[python] Pandas: Looking up the list of sheets in an Excel file

Newer versions of Pandas use the following interface to load Excel files:

read_excel('path_to_file.xls', 'Sheet1', index_col=None, na_values=['NA'])

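But what if I don't know which sheets are available?

For example, I am working with Excel files that have the following sheets:

Data 1, Data 2 …, Data N, foo, bar

but I don't know N a priori.

Is there any way to get the list of sheets from an Excel document in Pandas?

Answer

You can still use the ExcelFile class and its sheet_names attribute: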

xl = pd.ExcelFile('foo.xls')

xl.sheet_names  # see all sheet names

xl.parse(sheet_name)  # read a specific sheet to DataFrame

See the docs for parse for more options…
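
For the case in the question, where the sheets are called Data 1 to Data N and N is not known in advance, the list returned by sheet_names can be filtered with a regular expression. This is only a minimal sketch, assuming the sheets really follow the "Data <number>" pattern; the helper name data_sheets is made up for illustration and is not part of pandas.

```python
import re

def data_sheets(sheet_names, prefix="Data"):
    """Return the names that look like '<prefix> <number>', sorted by number."""
    pattern = re.compile(rf"^{re.escape(prefix)}\s*(\d+)$")
    matches = []
    for name in sheet_names:
        m = pattern.match(name)
        if m:
            # Keep the numeric part so that 'Data 10' sorts after 'Data 2'.
            matches.append((int(m.group(1)), name))
    return [name for _, name in sorted(matches)]

# Hypothetical sheet list, standing in for xl.sheet_names.
names = ["Data 2", "foo", "Data 10", "Data 1", "bar"]
print(data_sheets(names))
```

With a real workbook you would pass xl.sheet_names to the helper and then call xl.parse(name) for each name it returns.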

Answer

You need to explicitly specify the second parameter (the sheet name) as None, like this:

df = pandas.read_excel("/yourPath/FileName.xlsx", sheet_name=None)

"df" is then a dictionary of DataFrames, one per sheet, keyed by sheet name. You can verify this by running:

df.keys()

which gives a result like this:

[u'201610', u'201601', u'201701', u'201702', u'201703', u'201704', u'201705', u'201706', u'201612', u'fund', u'201603', u'201602', u'201605', u'201607', u'201606', u'201608', u'201512', u'201611', u'201604']

See the pandas docs for more details: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

Answer

This is the fastest way I have found, inspired by @divingTobi's answer. The answers based on xlrd, openpyxl or pandas are slow because they all load the whole file first.

from zipfile import ZipFile
from bs4 import BeautifulSoup  # you also need to install "lxml" for the XML parser

with ZipFile(file) as zipped_file:
    summary = zipped_file.open(r'xl/workbook.xml').read()
soup = BeautifulSoup(summary, "xml")
sheets = [sheet.get("name") for sheet in soup.find_all("sheet")]

Answer

Building on @dhwanil_shah's answer: you do not need to extract the whole file. With zf.open you can read directly from the zipped file.

import xml.etree.ElementTree as ET
import zipfile

def xlsxSheets(f):
    zf = zipfile.ZipFile(f)

    f = zf.open(r'xl/workbook.xml')

    l = f.readline()
    l = f.readline()
    root = ET.fromstring(l)
    sheets=[]
    for c in root.findall('{http://schemas.openxmlformats.org/spreadsheetml/2006/main}sheets/*'):
        sheets.append(c.attrib['name'])
    return sheets

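The two consecutive readline calls are ugly, but the content is only on the second line of the text, so there is no need to parse the whole file.

This solution appears to be much faster than the read_excel version, and most likely also faster than the full-extract version.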
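
If relying on the XML sitting on the second line feels fragile, the same idea can be written by parsing all of xl/workbook.xml, which normally holds only workbook metadata and is small. The following is a sketch, assuming the workbook uses the namespace URI shown above; xlsx_sheet_names is a made-up name, and the in-memory workbook at the end exists only so that the snippet runs without a real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main SpreadsheetML part (same URI as in the answer above).
NS = {"main": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

def xlsx_sheet_names(path_or_file):
    """Return sheet names from an .xlsx file by reading only xl/workbook.xml."""
    with zipfile.ZipFile(path_or_file) as zf:
        with zf.open("xl/workbook.xml") as f:
            root = ET.parse(f).getroot()
    return [s.attrib["name"] for s in root.findall("main:sheets/main:sheet", NS)]

# Build a tiny stand-in archive in memory (not a complete, openable workbook).
workbook_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<sheets><sheet name="Data 1" sheetId="1"/><sheet name="foo" sheetId="2"/></sheets>'
    '</workbook>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", workbook_xml)
buf.seek(0)

demo_names = xlsx_sheet_names(buf)
print(demo_names)
```

Because zipfile.ZipFile accepts either a path or a file-like object, the same function works on a file name such as 'foo.xlsx'. It does not work on the old binary .xls format, which is not a zip archive.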

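Answer

I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to take exponentially more time as the file size grows, because they read the entire file. The other solutions mentioned above that use 'on_demand' did not work for me. If you only want to get the sheet names up front, the following function works for xlsx files.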

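import os
import shutil
import zipfile

import xmltodict  # third-party package
from django.conf import settings  # assumption: `settings` below is the Django settings object

def get_sheet_details(file_path):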
    sheets = []
    file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
    # Make a temporary directory with the file name
    directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
    os.mkdir(directory_to_extract_to)

    # Extract the xlsx file as it is just a zip file
    zip_ref = zipfile.ZipFile(file_path, 'r')
    zip_ref.extractall(directory_to_extract_to)
    zip_ref.close()

    # Open the workbook.xml which is very light and only has meta data, get sheets from it
    path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
    with open(path_to_workbook, 'r') as f:
        xml = f.read()
        dictionary = xmltodict.parse(xml)
        for sheet in dictionary['workbook']['sheets']['sheet']:
            sheet_details = {
                'id': sheet['@sheetId'],
                'name': sheet['@name']
            }
            sheets.append(sheet_details)

    # Delete the extracted files directory
    shutil.rmtree(directory_to_extract_to)
    return sheets

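Since every xlsx file is basically a zip archive, we extract the underlying XML data and read the sheet names directly from the workbook, which takes a fraction of a second compared to the library functions.

Benchmarking (on a 6 MB xlsx file with 4 sheets):

Pandas, xlrd: 12 seconds

openpyxl: 24 seconds

Proposed method: 0.4 seconds

Since my requirement was only to read the sheet names, the unnecessary overhead of reading the whole file was bugging me, so I took this route instead.

Answer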

from openpyxl import load_workbook

sheets = load_workbook(excel_file, read_only=True).sheetnames

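For my 5 MB Excel file, load_workbook without the read_only flag took 8.24 s. With the read_only flag it took only 39.6 ms. If you still want to use an Excel library rather than drop down to an XML solution, this is much faster than the methods that parse the whole file.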
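
The timings quoted in this thread depend on the machine and the file, so it may be worth measuring on your own workbook. Below is a minimal timing sketch using only the standard library; time_call is a hypothetical helper name, and the sorted call is just a stand-in workload so that the snippet runs on its own.

```python
import time

def time_call(func, *args, repeat=3, **kwargs):
    """Call func `repeat` times; return (last result, best elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in workload; swap in the loader you want to measure.
result, seconds = time_call(sorted, [3, 1, 2])
print(result, f"{seconds:.6f} s")
```

To compare the approaches above, you could pass something like lambda: load_workbook(excel_file, read_only=True).sheetnames as func, where excel_file is the path to your own file.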