Python Web Scraping

Posted 2020-03-23, updated 2023-12-18, filed under Computer Applications, python

Required modules:

1. requests          2.23.0 
2. urllib3           1.25.8 
3. urlopen           1.0.0
4. beautifulsoup4    4.8.2
5. lxml              4.5.0

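Walkthrough by example

Using regular expressions:

Steps for parsing a web page:

Pick the URL to scrape
Open that URL from Python (with urlopen or similar)
Read the page content (with read())

Getting the page's HTML:

html = urlopen('url').read().decode('utf-8')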

exp:

# Read the site's HTML. After the import, urlopen fetches the HTML of the given page.
from urllib.request import urlopen
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)

Matching the page title in the page content already fetched:

res = re.findall(r"<title>(.+?)</title>", html)

exp:

from urllib.request import urlopen
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')

import re
res = re.findall(r"<title>(.+?)</title>", html)
print("\nPage title is: ", res[0])

Matching the href values in the page content already fetched:

res = re.findall(r'href="(.*?)"', html)

exp:

res = re.findall(r'href="(.*?)"', html)
print("\nAll links: ", res)
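The two patterns above can also be tried without a network connection. The following is a minimal sketch that assumes a small hand-written HTML string in place of the downloaded page; the string and its URLs are made up for illustration.

```python
import re

# Hypothetical page source, standing in for urlopen(...).read().decode('utf-8')
html = """
<html>
<head><title>Scraping demo</title></head>
<body>
<a href="https://example.com/a">first</a>
<a href="/b.html">second</a>
</body>
</html>
"""

# Non-greedy (.+?) stops at the first closing </title>
title = re.findall(r"<title>(.+?)</title>", html)
# Capture whatever sits between the quotes of every href attribute
links = re.findall(r'href="(.*?)"', html)

print("Page title is:", title[0])
print("All links:", links)
```

findall returns a list of the captured groups, which is why the title is read with title[0].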

Parsing a page with BeautifulSoup:

Import the modules:

import requests
from urllib.request import urlopen
from bs4 import BeautifulSoup

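Steps for parsing a web page:

Pick the URL to scrape
Open that URL from Python (with urlopen or similar)
Read the page content (with read())
Load what was read into BeautifulSoup
Use BeautifulSoup to select tags and their content (instead of regular expressions)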

exp:

html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
soup = BeautifulSoup(html,features='lxml')
print('First-level heading:', soup.h1)
print('Paragraph contents:', soup.p.contents)
print('First link:', soup.a)

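How BeautifulSoup works:

After reading the page, we load it into BeautifulSoup using the lxml parser. Other parsers are available besides lxml, but lxml is the one most commonly recommended. The soup object then holds all the information in the HTML. To print the <h1> heading, use soup.h1 directly.
If the page contains several tags of the same kind, such as <a> links, find_all() returns all of them. The actual link is not the text between <a> and </a> but the value inside the opening tag, <a href="link">, so it can be treated as an attribute of <a>. It can be read by key, as with a Python dictionary: l["href"].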
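A minimal sketch of find_all() together with dictionary-style attribute access. The HTML string is made up for illustration, and the built-in html.parser is used here only so the snippet runs without lxml installed; the article's own examples use features='lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one heading and two links
html = """
<html><body>
<h1>Demo page</h1>
<p>Two links: <a href="https://example.com/one">one</a> and
<a href="https://example.com/two">two</a>.</p>
</body></html>
"""

# Assumption: html.parser instead of lxml, to avoid the extra dependency
soup = BeautifulSoup(html, features='html.parser')

# find_all returns every <a> tag; l['href'] reads the attribute like a dict key
all_href = [l['href'] for l in soup.find_all('a')]

print(soup.h1.get_text())
print(all_href)
```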

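The above parses by HTML tag; the page's CSS classes can be used as well.

Matching by class

Matching by class is straightforward. For example, find every element with class=month and print the text inside each tag.

Or find the element with class=jan, then search inside that <ul> for its <li> items. Nested information like this is easy to reach layer by layer.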

exp:

soup = BeautifulSoup(html, features='lxml')
month = soup.find_all('li', {"class": "month"})
for m in month:
    print(m.get_text())

jan = soup.find('ul', {"class": 'jan'})
d_jan = jan.find_all('li')            
for d in d_jan:
    print(d.get_text())
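The snippet above expects html to hold a page that is already downloaded. As a self-contained sketch, the same lookups can be run against a small hand-written list; the markup below is an assumption made for illustration, not the page the article scrapes, and html.parser stands in for lxml. The last line shows the equivalent CSS selector through select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: month entries, with the days of January nested in a <ul>
html = """
<ul>
  <li class="month">January</li>
  <ul class="jan">
    <li>Jan 1</li>
    <li>Jan 2</li>
  </ul>
  <li class="month">February</li>
</ul>
"""

soup = BeautifulSoup(html, features='html.parser')

# Every <li> whose class is "month"
months = [m.get_text() for m in soup.find_all('li', {"class": "month"})]

# First the <ul class="jan">, then the <li> items inside it
jan = soup.find('ul', {"class": "jan"})
jan_days = [d.get_text() for d in jan.find_all('li')]

# The same nested lookup written as a CSS selector
jan_days_css = [d.get_text() for d in soup.select('ul.jan li')]

print(months, jan_days, jan_days_css)
```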

Matching with regular expressions:

First we read the page and import the regular expression module re.

exp:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')

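Take images: they all sit inside <img> tags, but each tag's src (the image address) may differ, and so may the image format: JPG, PNG, WEBP and so on. What if only one kind of image is needed? A regular expression handles that: r'.*?\.webp' matches all links to webp images. Pass the compiled pattern to BeautifulSoup and it selects the image links that meet the requirement.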

exp:

soup = BeautifulSoup(html, features='lxml')
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.webp')})
for link in img_links:
    print(link['src'])

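Or suppose we want to pick out some course links, and these links share a common form: they all begin with https://morvan. We make that a regular-expression rule and let BeautifulSoup find the links that match it.

exp: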

course_links = soup.find_all('a', {'href': re.compile('https://morvan.*')})
for link in course_links:
    print(link['href'])
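To see what the two patterns accept, they can be checked with re alone. This is a sketch over made-up attribute values, not BeautifulSoup itself; it uses search() because, per the Beautiful Soup documentation, a compiled pattern passed as a filter is tested with its search() method.

```python
import re

# Hypothetical attribute values, as they might appear in src and href attributes
srcs = ["/img/a.webp", "/img/b.png", "/img/c.jpg", "/img/d.webp"]
hrefs = [
    "https://morvanzhou.github.io/static/scraping/table.html",
    "https://example.com/",
    "/about.html",
]

webp = re.compile(r'.*?\.webp')
course = re.compile(r'https://morvan.*')

# Keep only the values the pattern matches somewhere in the string
webp_links = [s for s in srcs if webp.search(s)]
course_links = [h for h in hrefs if course.search(h)]

print(webp_links)
print(course_links)
```

Because search() matches anywhere in the string, a pattern meant to apply only at the start of a link can be anchored with ^.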

Requests

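Ways of fetching a page: a page can be loaded with several kinds of request, and the kind you use decides how the page is opened. The most important types (methods) are get and post (there are others, such as head and delete). Readers new to how the web works may find this confusing. How do these request methods differ, and what is each for?

We will cover the two important ones, get and post. Some 95% of the time, these are the two you use to request a page.

1. post
   1. Logging in to an account
   2. Submitting a search
   3. Uploading an image
   4. Uploading a file
   5. Sending data to the server, and so on
2. get
   1. Opening a page normally
   2. Sending no data to the server

Seen this way, get is enough for many pages. Every page on 莫烦Python, for example, is requested with get only. With post, we send the server a personalized request, such as an account name and password, and it returns HTML containing our personal information.

In terms of active versus passive: post means "send" and is the more active one, since you control what the server returns. get means "fetch" and is passive: you send the server no personalized information, so it does not return different HTML based on it.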
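The difference can be seen without sending anything. The sketch below builds one request of each kind with the standard library's urllib.request.Request (not the requests package) and inspects them; the example.com URLs and the credentials are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# GET: no request body; any parameters would travel in the URL itself
get_req = Request("https://example.com/page")

# POST: form fields are encoded and carried in the request body
form = {"username": "alice", "password": "secret"}  # made-up credentials
post_req = Request("https://example.com/login",
                   data=urlencode(form).encode("utf-8"))

# Nothing is sent over the network; the objects are only inspected
print(get_req.get_method(), get_req.data)
print(post_req.get_method(), post_req.data)
```

urllib picks the method from the presence of data: a Request with a body defaults to POST, one without defaults to GET.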

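GET requests with requests

With requests we can send requests using various methods, get among them. Let's simulate a Baidu search. First we look at how Baidu's search URLs are formed. Typing "莫烦python" into the Baidu search box produces a very long URL.

On closer inspection, only a small part near the front ("s?wd=莫烦python") relates to "莫烦Python"; the rest is of no use to us. So what happens if we drop the "useless" tail of the URL? The search for "莫烦python" still works.

So "s?wd=莫烦python" is the key information the search needs. We can pair get with custom search keywords to run personalized searches from Python. The fixed part of the URL is "http://www.baidu.com/s", and everything after the ? is a set of parameters, so we put those parameters in a Python dictionary and pass it to requests.get(). We can then use Python's webbrowser module to open the default browser and check that we land on the Baidu search page.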

exp:

import requests
import webbrowser
param = {"wd": "莫烦Python"}  # the search term
r = requests.get('http://www.baidu.com/s', params=param)
print(r.url)
webbrowser.open(r.url)
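The URL printed by r.url has the parameter dictionary encoded into its query string. As a rough offline illustration of that encoding step, the sketch below uses the standard library's urllib.parse instead of requests, so it shows the general mechanism and is not a trace of what requests does internally.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

param = {"wd": "莫烦Python"}  # the search term, as in the example above

# Non-ASCII characters are percent-encoded as UTF-8 bytes
url = "http://www.baidu.com/s?" + urlencode(param)
print(url)

# Decoding the query string recovers the original parameter
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```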
