[python] Get the last n lines of a file, similar to tail

I'm writing a log file viewer for a web application and want to paginate through the lines of the log file. The items in the file are line based, with the newest item at the bottom.

So I need a tail() method that can read n lines from the bottom and supports an offset. This is what I came up with:

def tail(f, n, offset=0):
    """Reads a n lines from f with an offset of offset lines."""
    avg_line_length = 74
    to_read = n + offset
    while 1:
        try:
            # int() because avg_line_length becomes a float after the *= 1.3 below
            f.seek(-int(avg_line_length * to_read), 2)
        except IOError:
            # woops.  apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None]
        avg_line_length *= 1.3

Is this a reasonable approach? What is the recommended way to tail log files with offsets?

Answer

This may be quicker than yours. It makes no assumptions about line length. It backs through the file one block at a time until it has found the right number of '\n' characters.

def tail( f, lines=20 ):
    total_lines_wanted = lines

    BLOCK_SIZE = 1024
    f.seek(0, 2)
    block_end_byte = f.tell()
    lines_to_go = total_lines_wanted
    block_number = -1
    blocks = [] # blocks of size BLOCK_SIZE, in reverse order starting
                # from the end of the file
    while lines_to_go > 0 and block_end_byte > 0:
        if (block_end_byte - BLOCK_SIZE > 0):
            # read the last block we haven't yet read
            f.seek(block_number*BLOCK_SIZE, 2)
            blocks.append(f.read(BLOCK_SIZE))
        else:
            # file too small, start from beginning
            f.seek(0,0)
            # only read what was not read
            blocks.append(f.read(block_end_byte))
        lines_found = blocks[-1].count('\n')
        lines_to_go -= lines_found
        block_end_byte -= BLOCK_SIZE
        block_number -= 1
    all_read_text = ''.join(reversed(blocks))
    return '\n'.join(all_read_text.splitlines()[-total_lines_wanted:])

As a practical matter, I don't like tricky assumptions about line length when you can't know things like that.

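Generally, this will locate the last 20 lines on the first or second pass through the loop. If your 74-character figure is actually accurate, make the block size 2048 and you'll tail 20 lines almost immediately.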

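Also, I don't burn a lot of brain calories trying to finesse alignment with physical OS blocks. With these high-level I/O packages, I doubt you'll see any performance consequence of trying to align on OS block boundaries. If you use lower-level I/O, you might see a speedup.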

Update

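In Python 3.2 and later, work on bytes instead: in text files (those opened without a "b" in the mode string), only seeks relative to the beginning of the file are allowed (the one exception is seeking to the very end of the file with seek(0, 2)):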

e.g. f = open('C:/.../../apache_logs.txt', 'rb')

def tail(f, lines=20):
    total_lines_wanted = lines

    BLOCK_SIZE = 1024
    f.seek(0, 2)
    block_end_byte = f.tell()
    lines_to_go = total_lines_wanted
    block_number = -1
    blocks = []
    while lines_to_go > 0 and block_end_byte > 0:
        if (block_end_byte - BLOCK_SIZE > 0):
            f.seek(block_number*BLOCK_SIZE, 2)
            blocks.append(f.read(BLOCK_SIZE))
        else:
            f.seek(0,0)
            blocks.append(f.read(block_end_byte))
        lines_found = blocks[-1].count(b'\n')
        lines_to_go -= lines_found
        block_end_byte -= BLOCK_SIZE
        block_number -= 1
    all_read_text = b''.join(reversed(blocks))
    return b'\n'.join(all_read_text.splitlines()[-total_lines_wanted:])
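
The question also asks for an offset, which the function above does not take. A minimal sketch of one way to add it, assuming a file opened in binary mode; the name tail_with_offset and the block_size parameter are my own and not part of the original answer:

```python
import io


def tail_with_offset(f, n, offset=0, block_size=1024):
    """Return up to n lines (as bytes) that end `offset` lines before EOF.

    f is assumed to be opened in binary mode ('rb').
    """
    wanted = n + offset
    if n <= 0 or wanted <= 0:
        return []
    f.seek(0, io.SEEK_END)
    pos = f.tell()
    blocks = []      # blocks in reverse order, last block of the file first
    newlines = 0
    # Read backwards until we have seen one newline more than we need,
    # so the first line we keep is guaranteed to be complete.
    while pos > 0 and newlines <= wanted:
        read_size = min(block_size, pos)
        pos -= read_size
        f.seek(pos)
        block = f.read(read_size)
        blocks.append(block)
        newlines += block.count(b"\n")
    lines = b"".join(reversed(blocks)).splitlines()
    # Use explicit indices: a slice end of -0 would give an empty list.
    end = max(len(lines) - offset, 0)
    start = max(end - n, 0)
    return lines[start:end]
```

It returns a list of bytes lines, so decode them (for example with line.decode('utf-8')) if you need text.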

Answer

Assumes a Unix-like system on Python 2.

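import os
def tail(f, n, offset=0):
  # f is a file name here, not a file object
  stdin,stdout = os.popen2("tail -n %d %s" % (n + offset, f))
  stdin.close()
  lines = stdout.readlines(); stdout.close()
  # lines[:-0] would be empty, so only slice when offset is non-zero
  return lines[:-offset] if offset else lines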

For Python 3, you can do:

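import subprocess
def tail(f, n, offset=0):
    # f is a file name; every argument passed to Popen must be a string
    proc = subprocess.Popen(['tail', '-n', str(n + offset), f], stdout=subprocess.PIPE)
    lines = proc.stdout.readlines()
    proc.wait()
    # lines[:-0] would be empty, so only slice when offset is non-zero
    return lines[:-offset] if offset else lines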

Answer

Here is my answer. Pure Python. Going by timeit, it seems pretty fast. Tailing 100 lines of a log file that has 100,000 lines:

>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=10)
0.0014600753784179688
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=100)
0.00899195671081543
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=1000)
0.05842900276184082
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=10000)
0.5394978523254395
>>> timeit.timeit('tail.tail(f, 100, 4098)', 'import tail; f = open("log.txt", "r");', number=100000)
5.377126932144165

Here is the code:

import os


def tail(f, lines=1, _buffer=4098):
    """Tail a file and get X lines from the end"""
    # place holder for the lines found
    lines_found = []

    # block counter will be multiplied by buffer
    # to get the block size from the end
    block_counter = -1

    # loop until we find X lines
    while len(lines_found) < lines:
        try:
            f.seek(block_counter * _buffer, os.SEEK_END)
        except IOError:  # either file is too small, or too many lines requested
            f.seek(0)
            lines_found = f.readlines()
            break

        lines_found = f.readlines()

        # we found enough lines, get out
        # Removed this line because it was redundant the while will catch
        # it, I left it for history
        # if len(lines_found) > lines:
        #    break

        # decrement the block counter to get the
        # next X bytes
        block_counter -= 1

    return lines_found[-lines:]

Answer

If reading the whole file is acceptable, use a deque:

from collections import deque
deque(f, maxlen=n)
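
As a self-contained sketch of the same idea (the helper name last_lines is hypothetical): iterating the file feeds every line through the deque, which keeps only the most recent n. The whole file is still read, but no more than n lines are held in memory at a time.

```python
from collections import deque


def last_lines(f, n):
    """Return the last n lines of an open file (or any iterable of lines)."""
    # deque discards the oldest entry each time it grows past maxlen
    return list(deque(f, maxlen=n))
```

Lines keep their trailing newline, just as they do when iterating over the file directly.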

Prior to 2.6, deques didn't have a maxlen option, but it's easy enough to implement:

import itertools
from collections import deque
def maxque(items, size):
    items = iter(items)
    q = deque(itertools.islice(items, size))
    for item in items:
        del q[0]
        q.append(item)
    return q

If you have to read the file from the end, use a gallop (a.k.a. exponential) search:

def tail(f, n):
    assert n >= 0
    pos, lines = n+1, []
    while len(lines) <= n:
        try:
            f.seek(-pos, 2)
        except IOError:
            f.seek(0)
            break
        finally:
            lines = list(f)
        pos *= 2
    return lines[-n:]

Answer

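S.Lott's answer above almost works for me but ends up giving me partial lines. It turns out that it corrupts data on block boundaries because data holds the read blocks in reversed order. When ''.join(data) is called, the blocks are in the wrong order. This fixes that.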

def tail(f, window=20):
    """
    Returns the last `window` lines of file `f` as a list.
    f - a byte file-like object
    """
    if window == 0:
        return []
    BUFSIZ = 1024
    f.seek(0, 2)
    bytes = f.tell()
    size = window + 1
    block = -1
    data = []
    while size > 0 and bytes > 0:
        if bytes - BUFSIZ > 0:
            # Seek back one whole BUFSIZ
            f.seek(block * BUFSIZ, 2)
            # read BUFFER
            data.insert(0, f.read(BUFSIZ))
        else:
            # file too small, start from beginning
            f.seek(0,0)
            # only read what was not read
            data.insert(0, f.read(bytes))
        linesFound = data[0].count('\n')
        size -= linesFound
        bytes -= BUFSIZ
        block -= 1
    return ''.join(data).splitlines()[-window:]

Answer

The code I ended up using. I think this is the best so far:

def tail(f, n, offset=None):
    """Reads a n lines from f with an offset of offset lines.  The return
    value is a tuple in the form ``(lines, has_more)`` where `has_more` is
    an indicator that is `True` if there are more lines in the file.
    """
    avg_line_length = 74
    to_read = n + (offset or 0)

    while 1:
        try:
            # int() because avg_line_length becomes a float after the *= 1.3 below
            f.seek(-int(avg_line_length * to_read), 2)
        except IOError:
            # woops.  apparently file is smaller than what we want
            # to step back, go to the beginning instead
            f.seek(0)
        pos = f.tell()
        lines = f.read().splitlines()
        if len(lines) >= to_read or pos == 0:
            return lines[-to_read:offset and -offset or None], \
                   len(lines) > to_read or pos > 0
        avg_line_length *= 1.3
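
The slice `lines[-to_read:offset and -offset or None]` relies on the and/or idiom because a plain `-offset` would give an empty result when offset is 0 (`lines[-n:-0]` is `lines[-n:0]`). A small sketch of the same selection with explicit indices, operating on an already-read list; the name page_from_end is hypothetical and only meant to illustrate the arithmetic:

```python
def page_from_end(lines, n, offset=0):
    """Return (page, has_more): n lines ending `offset` lines before the end.

    has_more is True when there are older lines before the returned page.
    """
    end = max(len(lines) - offset, 0)   # index just past the last line of the page
    start = max(end - n, 0)             # clamp so short inputs do not wrap around
    return lines[start:end], start > 0
```

With ten lines, page_from_end(lines, 3) gives the last three, and page_from_end(lines, 3, 3) gives the three before those.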

Answer

A simple and fast solution with mmap:

import mmap
import os

def tail(filename, n):
    """Returns last n lines from the filename. No exception handling"""
    size = os.path.getsize(filename)
    with open(filename, "rb") as f:
        # for Windows the mmap parameters are different
        fm = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
        try:
            for i in xrange(size - 1, -1, -1):
                if fm[i] == '\n':
                    n -= 1
                    if n == -1:
                        break
            return fm[i + 1 if i else 0:].splitlines()
        finally:
            fm.close()
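
The version above is written for Python 2 (xrange, and indexing the map yields a one-character string). A rough Python 3 sketch of the same idea, using mmap's rfind to jump from newline to newline instead of stepping byte by byte; the name tail_mmap is hypothetical, and access=mmap.ACCESS_READ is used on the assumption that it is the more portable way to open the map:

```python
import mmap


def tail_mmap(filename, n):
    """Return the last n lines of filename as a list of bytes objects."""
    if n <= 0:
        return []
    with open(filename, "rb") as f:
        # mmap cannot map an empty file, so handle that case first
        f.seek(0, 2)
        if f.tell() == 0:
            return []
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as fm:
            end = len(fm)
            # a newline that terminates the last line does not start a new one
            if fm[end - 1:end] == b"\n":
                end -= 1
            pos = end
            # walk backwards over n newlines, or stop at the start of the file
            for _ in range(n):
                pos = fm.rfind(b"\n", 0, pos)
                if pos == -1:
                    break
            return fm[pos + 1:].splitlines()
```

If fewer than n lines exist, the whole file is returned.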