[python] How can I retrieve the page title of a webpage using Python?

Question 1

How can I retrieve the page title (the title HTML tag) of a webpage using Python?

Question 2

I always use lxml for tasks like this. You could also use BeautifulSoup.

import lxml.html

url = "http://example.com/"  # lxml fetches the URL itself; this may fail for HTTPS
t = lxml.html.parse(url)
print(t.find(".//title").text)

Edit, following the comments:

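from urllib.request import urlopen  # Python 3 home of urllib2.urlopen
from lxml.html import parse

url = "https://www.google.com"
page = urlopen(url)  # fetch the page ourselves, then hand the file-like object to lxml
p = parse(page)
print(p.find(".//title").text)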
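
When the HTML is already in memory, the same lookup can be tried without any network access. The following is only a minimal sketch: the helper name title_from_html and the sample markup are made up for illustration, and it assumes lxml is installed.

```python
import lxml.html


def title_from_html(markup):
    """Return the <title> text of an HTML string, or None when it is absent."""
    root = lxml.html.fromstring(markup)
    # findtext returns None when no element matches the path.
    return root.findtext(".//title")


sample = "<html><head><title>Example Domain</title></head><body></body></html>"
print(title_from_html(sample))
```

Using findtext instead of find(...).text avoids an AttributeError on pages that have no title element.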

Question 3

Here is a simplified version of @Vinko Vrsalovic's answer:

# Python 2 script (urllib2, print statement, BeautifulSoup 3.x import)
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string

Notes:

soup.title finds the first title element anywhere in the HTML document.
title.string assumes the title has only one child node, and that this child node is a string.

For BeautifulSoup 4.x, use a different import:

from bs4 import BeautifulSoup
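
As a rough illustration of the notes above, here is one way the lookup might be written with BeautifulSoup 4 on Python 3. The function name page_title and the sample strings are hypothetical, and the sketch parses a string rather than fetching a URL so that it can be run offline.

```python
from bs4 import BeautifulSoup


def page_title(markup):
    """Return the text of the first <title> in markup, or None if it has none."""
    soup = BeautifulSoup(markup, "html.parser")  # parser backed by the standard library
    if soup.title is None:
        return None
    # get_text() joins all text inside the tag, so it does not depend on
    # the tag having exactly one string child the way .string does.
    return soup.title.get_text(strip=True)


print(page_title("<html><head><title> Example Domain </title></head></html>"))
print(page_title("<p>no title here</p>"))
```

Checking for None first matters because soup.title is None on a page without a title element, and accessing .string on it would raise an AttributeError.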

Question 4

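There is no need to import any other library. With Requests you can pull the title straight out of the response text using plain string methods.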

>>> import requests
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
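
The slice above only works when the opening tag is written exactly as <title>, in lower case and without attributes. A slightly more forgiving variant, still using only the standard library, could look like the sketch below. The helper name extract_title and the sample markup are assumptions made for illustration, and a regular expression is still not a real HTML parser.

```python
import html
import re

# Matches <title>, <TITLE>, or <title attr="...">, capturing everything up to </title>.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(page_text):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(page_text)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace and newlines.
    return " ".join(html.unescape(match.group(1)).split())


print(extract_title('<html><head><TITLE lang="en">\n  Friends &amp; Co\n</TITLE></head></html>'))
```

With Requests, the response text (n.text in the session above) would be passed as page_text.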

Question 5

The mechanize Browser object has a title() method, so the code from this post can be rewritten as:

from mechanize import Browser

br = Browser()
br.open("http://www.google.com/")
print(br.title())

Question 6

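This is probably overkill for such a simple task, but if you plan to do more than that, it is saner to start from these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to get the content, plus regular expressions or some other parser to parse the HTML).

Links:
BeautifulSoup
mechanize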

#!/usr/bin/env python
#coding:utf-8
# Python 2 script (print statement, BeautifulSoup 3.x import)

from BeautifulSoup import BeautifulSoup
from mechanize import Browser

#This retrieves the webpage content
br = Browser()
res = br.open("https://www.google.com/")
data = res.get_data()

#This parses the content
soup = BeautifulSoup(data)
title = soup.find('title')

#This outputs the content :)
print title.renderContents()

Question 7

Use soup.select_one to target the title tag.

import requests
from bs4 import BeautifulSoup as bs

url = 'http://example.com/'  # placeholder; substitute the page you want
r = requests.get(url)
soup = bs(r.content, 'lxml')
print(soup.select_one('title').text)

Question 8

Using HTMLParser:

from urllib.request import urlopen
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.match = False
        self.title = ''

    def handle_starttag(self, tag, attributes):
        self.match = tag == 'title'

    def handle_data(self, data):
        if self.match:
            self.title = data
            self.match = False

url = "http://example.com/"
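# Decode the bytes; str(bytes) would yield the "b'...'" repr instead of the page text.
html_string = urlopen(url).read().decode('utf-8', errors='replace')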

parser = TitleParser()
parser.feed(html_string)
print(parser.title)  # prints: Example Domain
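
The parser above keeps only one chunk of data after the title start tag. A variant that collects everything between the start and end tags, and that can be tried offline on a string, might look like the following sketch. The class name TitleCollector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of the first <title> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.in_title = False  # True while we are between <title> and </title>
        self.done = False      # True once the first title has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Data may arrive in several chunks, so accumulate rather than overwrite.
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        return "".join(self.parts).strip()


sample = "<html><head><title>Example Domain</title></head><body><h1>Hi</h1></body></html>"
collector = TitleCollector()
collector.feed(sample)
collector.close()
print(collector.title)  # prints: Example Domain
```

Ignoring later title tags once the first one has closed keeps elements such as an SVG title further down the page from overwriting the result.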