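[python] Web scraping with Python

I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? Which modules are used? Is there a tutorial available?

Answer

Use urllib2 in combination with the brilliant BeautifulSoup library: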

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise
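
On Python 3, urllib2 no longer exists (its functionality moved to urllib.request) and print is a function. The following is a minimal sketch of the same idea, assuming BeautifulSoup 4 is installed. The sample HTML, the helper name `extract_rows`, and the column order (date, then sunrise) are illustrative assumptions, not taken from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the downloaded page.
SAMPLE_HTML = """
<table class="spad">
  <tbody>
    <tr><td>1 Dec</td><td>08:18</td><td>16:10</td></tr>
    <tr><td>2 Dec</td><td>08:19</td><td>16:10</td></tr>
  </tbody>
</table>
"""


def extract_rows(html):
    """Return (date, sunrise) pairs from the first table with class 'spad'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="spad")
    if table is None:
        return []
    rows = []
    for row in table.find_all("tr"):
        tds = row.find_all("td")
        # Skip header or malformed rows that lack the two cells we need.
        if len(tds) >= 2:
            rows.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))
    return rows


for date, sunrise in extract_rows(SAMPLE_HTML):
    print(date, sunrise)
```

For a live page you would pass the downloaded markup instead of the sample, for example the bytes returned by `urllib.request.urlopen(url).read()`.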

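Answer

I'd really recommend Scrapy.

Quote from a deleted answer:

Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).

Scrapy has better and faster support for parsing (x)html, built on top of libxml2.

Scrapy is a mature framework with full unicode support; it handles redirections, gzipped responses, odd encodings, an integrated http cache, etc.

Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails and exports the extracted data directly to csv or json.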

Answer

I collected scripts from my web scraping work into this Bitbucket library.

Example script for your case:

from webscraping import download, xpath
D = download.Download()

html = D.get('http://example.com')
for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
    cols = xpath.search(row, '/td')
    print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])

Output:

Sunrise: 08:39, Sunset: 16:08
Sunrise: 08:39, Sunset: 16:09
Sunrise: 08:39, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:11
Sunrise: 08:40, Sunset: 16:12
Sunrise: 08:40, Sunset: 16:13
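
If you only want to try the XPath idea without installing anything, the standard library's xml.etree.ElementTree supports a limited XPath subset. This is a different tool from the library above, and the sketch assumes the markup is well-formed XML, which real-world HTML often is not. The sample markup and the helper name `sun_times` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample standing in for the downloaded page.
SAMPLE_XHTML = """
<html><body>
<table class="spad">
  <tbody>
    <tr><td>1 Dec</td><td>08:39</td><td>16:08</td></tr>
    <tr><td>2 Dec</td><td>08:39</td><td>16:09</td></tr>
  </tbody>
</table>
</body></html>
"""


def sun_times(markup):
    """Return (sunrise, sunset) pairs found via an XPath-style query."""
    root = ET.fromstring(markup)
    pairs = []
    for row in root.findall(".//table[@class='spad']/tbody/tr"):
        cols = [td.text for td in row.findall("td")]
        # Column 0 is the date; columns 1 and 2 are sunrise and sunset.
        pairs.append((cols[1], cols[2]))
    return pairs


for sunrise, sunset in sun_times(SAMPLE_XHTML):
    print("Sunrise: %s, Sunset: %s" % (sunrise, sunset))
```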

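Answer

I would strongly suggest checking out pyquery. It uses jquery-like (aka css-like) syntax, which makes things really easy for those coming from that background.

For your case, it would be something like: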

from pyquery import PyQuery

html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')

for tr in trs:
  tds = tr.getchildren()
  print tds[1].text, tds[2].text

Output:

5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM

Answer

You can use urllib2 to make the HTTP requests, and then you'll have the web content.

You can get it like this:

import urllib2
response = urllib2.urlopen('http://example.com')
html = response.read()
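
urllib2 is a Python 2 module; on Python 3 the equivalent call lives in urllib.request. A minimal sketch follows. The helper name `fetch_html`, the 10-second default timeout, and the UTF-8 fallback are assumptions chosen for illustration.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a URL and return its body decoded as text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when no charset is declared in the headers.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html('http://example.com')` would then return the page as a string, ready to hand to an HTML parser.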

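Beautiful Soup is a Python HTML parser that is supposed to be good for screen scraping.

In particular, here is their tutorial on parsing an HTML document.

Good luck!

Answer

I use a combination of Scrapemark (finding urls, py2) and httplib2 (downloading images, py2+3). scrapemark.py is only 500 lines of code, but it uses regular expressions, so it may not be very fast; I did not test that.

Example for scraping your website: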

import sys
from pprint import pprint
from scrapemark import scrape

pprint(scrape("""
    <table class="spad">
        <tbody>
            {*
                <tr>
                    <td>{{[].day}}</td>
                    <td>{{[].sunrise}}</td>
                    <td>{{[].sunset}}</td>
                    {# ... #}
                </tr>
            *}
        </tbody>
    </table>
""", url=sys.argv[1] ))

Usage:

python2 sunscraper.py http://www.example.com/

Result:

[{'day': u'1. Dez 2012', 'sunrise': u'08:18', 'sunset': u'16:10'},
 {'day': u'2. Dez 2012', 'sunrise': u'08:19', 'sunset': u'16:10'},
 {'day': u'3. Dez 2012', 'sunrise': u'08:21', 'sunset': u'16:09'},
 {'day': u'4. Dez 2012', 'sunrise': u'08:22', 'sunset': u'16:09'},
 {'day': u'5. Dez 2012', 'sunrise': u'08:23', 'sunset': u'16:08'},
 {'day': u'6. Dez 2012', 'sunrise': u'08:25', 'sunset': u'16:08'},
 {'day': u'7. Dez 2012', 'sunrise': u'08:26', 'sunset': u'16:07'}]

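Answer

Make your life easier by using CSS Selectors

I know I have come late to the party, but I have a nice suggestion for you.

Using BeautifulSoup has already been suggested; I would rather use CSS Selectors to scrape data inside the HTML.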

import random
import time

import requests
from bs4 import BeautifulSoup

main_url = "http://www.example.com"

# Placeholder values (assumptions); adjust them to your needs.
header = [{"User-Agent": "Mozilla/5.0"}]  # pool of request headers to choose from
timeout_time = 10  # request timeout in seconds

# This method fetches the URL and, if that fails, waits 20 seconds and then tries again. (I use it because my internet connection sometimes disconnects.)
def tryAgain(passed_url):
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while 1:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue

main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Scrape all TDs from TRs inside the table
for tr in main_page_soup.select("table.class_of_table tr"):
    for td in tr.select("td#id"):
        print(td.text)
        # For anchors inside the TD
        print(td.select("a")[0].text)
        # Value of the href attribute
        print(td.select("a")[0]["href"])
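
The retry helper above loops forever if the site never comes back. A bounded variant might look like the sketch below. The function name `fetch_with_retry`, the default of 5 attempts, and the injectable `fetch` and `sleep` callables are assumptions added for illustration, so the logic can be tried without a network connection.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, delay=20, sleep=time.sleep):
    """Call fetch(url) until it succeeds or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose, mirroring the helper above
            last_error = error
            print("Attempt %d failed for %s" % (attempt, url))
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                sleep(delay)
    raise RuntimeError("giving up on %s" % url) from last_error
```

With requests installed, it could be used as `fetch_with_retry(lambda u: requests.get(u, timeout=10).text, main_url)`.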