Twelve Python Crawler Methods: N Ways to Write a Python Crawler

2024-12-29 20:23

Origin of the Problem

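A few days ago, a reader of the WeChat official account "Python爬虫及算法" asked me how to write a crawler for the following task. The page to crawl is shown below (URL: https://www.wikidata.org/w/in...:WhatlinksHere/Q5&limit=500&from=0):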

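The task is to scrape the names of the famous people inside the red box (500 records in total; the screenshot shows only some of them) together with their descriptions. A person's description is on the page reached by clicking that person's name, as shown in the figure below: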

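This means we need to crawl 500 such pages, that is, send 500 HTTP requests (let us assume so for now), and then extract the name and description from each page. Some entries are not famous people or have no description, and those can be skipped. Finally, the URL of each page can be found next to the corresponding entry on the first page; for example, the page for George Washington has the URL suffix Q23.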
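As a small illustration of how a suffix such as Q23 becomes a full page address, here is a minimal sketch. The helper name `entity_url` and the `/wiki/` path prefix are assumptions made for this example; the article only states that the suffix for George Washington is Q23.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.wikidata.org"


def entity_url(href):
    """Turn a relative link (assumed to look like '/wiki/Q23') into an absolute URL."""
    return urljoin(BASE_URL, href)


# Hypothetical relative link for the George Washington entry.
print(entity_url("/wiki/Q23"))
```

Using `urljoin` instead of plain string concatenation also leaves a link untouched if it happens to be absolute already.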

That is roughly the requirement for the crawler.

Four Ways to Write the Crawler

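First, let us work out the crawler's approach: from the first page (https://www.wikidata.org/w/index.php?title=Special:WhatlinksHere/Q5&limit=500&from=0) we obtain the URLs of the 500 people's pages, then we scrape the name and description from each of those 500 pages, skipping any page that has no description.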

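Next, we introduce four ways to implement this crawler and analyze the strengths and weaknesses of each, in the hope of giving readers a better feel for crawlers. The four implementations are: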

The general method (synchronous, requests+BeautifulSoup)

Concurrency (the concurrent.futures module together with requests+BeautifulSoup)

Asynchronous (aiohttp+asyncio+requests+BeautifulSoup)

The Scrapy framework
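To make the difference between the first two approaches concrete, here is a rough sketch that avoids real network traffic. Everything in it is an assumption for illustration: `fake_fetch` stands in for an HTTP request by sleeping briefly, and `crawl_sequential` and `crawl_threaded` are hypothetical names, not the author's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for an HTTP request: wait a little, then return a label."""
    time.sleep(0.05)
    return "page for " + url


def crawl_sequential(urls):
    """Fetch the pages one after another, as the synchronous method does."""
    return [fake_fetch(u) for u in urls]


def crawl_threaded(urls, max_workers=10):
    """Fetch the pages with a thread pool; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_fetch, urls))


if __name__ == "__main__":
    urls = ["Q%d" % i for i in range(1, 21)]
    for crawl in (crawl_sequential, crawl_threaded):
        start = time.perf_counter()
        pages = crawl(urls)
        elapsed = time.perf_counter() - start
        print("%s: %d pages in %.2fs" % (crawl.__name__, len(pages), elapsed))
```

Because the simulated work is pure waiting, the threaded version should finish noticeably sooner here; with real requests the gain depends on network latency and on how many simultaneous connections the server tolerates.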

The General Method

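The general method is the synchronous one: it mainly uses requests+BeautifulSoup and sends the requests one after another. The complete Python code is as follows: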

import time

import requests
from bs4 import BeautifulSoup

# Start time
t1 = time.time()
print('#' * 50)

url = "http://www.wikidata.org/w/index.php?title=Special:WhatlinksHere/Q5&limit=500&from=0"
# Request headers
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Send the HTTP request
req = requests.get(url, headers=headers)
# Parse the page
soup = BeautifulSoup(req.text, "lxml")
# Find the records that hold the name and description
human_list = soup.find(id='mw-whatlinkshere-list')('li')

urls = []
# Collect the URL of each record
for human in human_list:
    url = human.find('a')['href']
    urls.append('https://www.wikidata.org'+url)

# Fetch the name and description from one page
def parser(url):
    req = requests.get(url)
    # Parse the response text as HTML with BeautifulSoup
    soup = BeautifulSoup(req.text, "lxml")
    # Extract the name and description
    name = soup.find('span', class_="wikibase-title-label")
    desc = soup.find('span', class_="wikibase-descriptionview-text")
    # Skip pages that lack either field
    if name is not None and desc is not None:
        print('%-40s, %s'%(name.text, desc.text))

for url in urls:
    parser(url)

t2 = time.time() # End time
print('General method, total time: %s' % (t2 - t1))
print('#' * 50)
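One variation on the listing above, offered as a sketch rather than as the author's code: separating the parsing step from the HTTP request makes it possible to check the extraction logic without touching the network. The function name `extract_name_and_description` and the sample HTML are assumptions for this example; the two class names are the ones used in the listing. The sketch uses the built-in `html.parser` backend instead of lxml so that it needs only bs4.

```python
from bs4 import BeautifulSoup


def extract_name_and_description(html):
    """Return (name, description), or None if either span is missing."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("span", class_="wikibase-title-label")
    desc = soup.find("span", class_="wikibase-descriptionview-text")
    if name is None or desc is None:
        return None
    return name.text.strip(), desc.text.strip()


# Hand-written fragment that only imitates the two spans the parser looks for.
sample = (
    '<h1><span class="wikibase-title-label">Douglas Adams</span></h1>'
    '<span class="wikibase-descriptionview-text">British author and humorist</span>'
)

print(extract_name_and_description(sample))
print(extract_name_and_description("<p>no label here</p>"))
```

Returning a value instead of printing inside the function also lets the caller decide what to do with each record, for example collecting the results in a list.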

The output is as follows (the middle of the output is omitted and replaced with ......):

##################################################

George Washington , first President of the United States

Douglas Adams , British author and humorist (1952–2001)
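The gap between each name and the comma comes from the `'%-40s, %s'` format string in the print call: `%-40s` left-justifies the name in a field 40 characters wide. A minimal sketch of that behavior, with `format_record` as a hypothetical helper name:

```python
def format_record(name, desc):
    """Format one record the way the crawler's print call does."""
    return '%-40s, %s' % (name, desc)


line = format_record('George Washington', 'first President of the United States')
# repr() makes the padding spaces visible.
print(repr(line))
# The comma always lands at index 40 when the name is at most 40 characters long.
print(line.index(','))
```

Names longer than 40 characters are not truncated; they simply push the comma further to the right.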
