[python] Retrieve links from a web page using Python and BeautifulSoup

How can I retrieve the links on a web page and copy their URL addresses using Python?

Answer

Here is a short snippet using the SoupStrainer class in BeautifulSoup:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])

The BeautifulSoup documentation is actually quite good and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: Note that I used the SoupStrainer class because it is a bit more efficient (in memory and speed) if you know in advance what you are parsing.

Answer

For completeness' sake, here is the BeautifulSoup 4 version, which also makes use of the encoding supplied by the server:

from bs4 import BeautifulSoup
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

Or the Python 2 version:

from bs4 import BeautifulSoup
import urllib2

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

And a version using the requests library, which as written works in both Python 2 and 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

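The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.

BeautifulSoup 3 stopped development in March 2012; new projects should always use BeautifulSoup 4.

Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can tell BeautifulSoup which character set was found in the HTTP response headers to assist in decoding, but that value can be wrong and can conflict with the <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.

With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* MIME type, even if no character set was returned. This is consistent with the HTTP RFCs but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.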
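The precedence described above can be sketched with the standard library alone. This is only an illustration of the rule (an encoding declared in the HTML wins, the HTTP header counts only when it explicitly names a charset); the function name choose_encoding and the two regular expressions are assumptions made for this example and are a much cruder stand-in for what EncodingDetector.find_declared_encoding() does.

```python
import re

# Simplified patterns, assumed for this sketch only.
HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.I)
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.I)


def choose_encoding(content_type, body):
    """Pick an encoding hint for an HTML document.

    content_type: value of the HTTP Content-Type header (str, may be empty)
    body: the raw, undecoded response bytes
    Returns the encoding declared in a <meta> tag if there is one, else the
    charset named explicitly in the header, else None.
    """
    header_match = HEADER_CHARSET.search(content_type or '')
    http_encoding = header_match.group(1) if header_match else None

    # Only look near the top of the document, where <meta> normally lives.
    meta_match = META_CHARSET.search(body[:2048])
    html_encoding = meta_match.group(1).decode('ascii') if meta_match else None

    # The embedded declaration wins over the server header.
    return html_encoding or http_encoding
```

The result could then be passed as from_encoding; when it is None, BeautifulSoup is left to detect the encoding from the bytes on its own.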

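Answer

Others have recommended BeautifulSoup, but it is much better to use lxml. Despite its name, it is also meant for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (which is BeautifulSoup's claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Blicking agrees.

There is no reason to use BeautifulSoup anymore, unless you are on Google App Engine or somewhere else where anything that is not pure Python is not allowed.

lxml.html also supports CSS3 selectors, so this sort of task is trivial.

An example with lxml and xpath looks like this (Python 2):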

import urllib
import lxml.html
connection = urllib.urlopen('http://www.nytimes.com')

dom =  lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'): # select the url in href for all a tags(links)
    print link

Answer

import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
for a in soup.findAll('a', href=True):  # skip anchors without an href (avoids KeyError below)
  if 'national-park' in a['href']:
    print 'found a url with national-park in the link'

Answer

The following code retrieves all the links available on a web page using urllib2 and BeautifulSoup4:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://www.espncricinfo.com/").read()  # raw page bytes
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly

for line in soup.find_all('a'):
    print(line.get('href'))

Answer

Under the hood, BeautifulSoup can now use lxml as its parser (when it is installed). Requests, lxml and list comprehensions make a killer combo.

import requests
import lxml.html

dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)

[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]

In the list comprehension, the "if '//' in x and 'nytimes.com' not in x" condition is a simple way to scrub the URL list of the site's "internal" navigation URLs and the like.
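The substring test above is quick, but it also drops any external URL that merely contains the text "nytimes.com" somewhere, and it keeps look-alike hosts out by accident only. A slightly stricter variant, sketched here with urllib.parse from the Python 3 standard library, compares host names instead. The function name external_links, its parameters, and the sample URLs are assumptions made for illustration, not part of the answer.

```python
from urllib.parse import urljoin, urlsplit


def external_links(hrefs, base_url, site_domain):
    """Return absolute http(s) URLs from hrefs that point outside site_domain."""
    result = []
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(base_url, href)
        parts = urlsplit(absolute)
        if parts.scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and similar schemes
        host = parts.hostname or ''
        # Internal means the domain itself or any of its subdomains.
        if host == site_domain or host.endswith('.' + site_domain):
            continue
        result.append(absolute)
    return result


sample = [
    '/section/world',
    '//static.nytimes.com/x.js',
    'https://www.nytimes.com/a',
    'https://example.org/page',
    'mailto:someone@example.org',
    'http://nytimes.com.evil.example/x',
]
found = external_links(sample, 'http://www.nytimes.com', 'nytimes.com')
```

The hrefs argument can be the list produced by dom.xpath('//a/@href') in the snippet above.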

Answer

Just for getting the links, without B.soup or regular expressions (Python 2):

import urllib2
url="http://www.somewhere.com"
page=urllib2.urlopen(url)
data=page.read().split("</a>")
tag="<a href=\""
endtag="\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item=item[ind+len(tag):]
            end=item.index(endtag)
        except: pass
        else:
            print item[:end]

For more complex operations, BSoup is of course still preferred.
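The string-splitting approach only finds tags that start exactly with <a href=" and end with ">, so it misses links with single quotes, extra attributes, or uppercase tag names. If the goal is still to avoid third-party packages, one possible middle ground is html.parser from the Python 3 standard library. The following is a minimal sketch; the class name LinkCollector, the helper extract_links, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unescaped.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all non-empty href values of <a> tags in html_text (a str)."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


sample_html = (
    '<p><a href="/one">1</a> '
    "<A HREF='two.html' class=\"x\">2</A> "
    '<a name="anchor">3</a> '
    '<a class="y" href="http://example.com/?a=1&amp;b=2">4</a></p>'
)
links = extract_links(sample_html)
```

The input has to be decoded text, so bytes read from the network would need to be decoded first.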