1. Environment Setup
1.1 Installing requests
# Install with pip (recommended)
pip install requests
# Verify the installation
python -c "import requests; print(requests.__version__)"
# Should print something like: 2.31.0
1.2 Development Environment Recommendations
Use Python 3.8 or later
Recommended IDEs (pick one; VS Code is preferred):
VS Code (with the Python extension)
PyCharm (the Community Edition is enough)
Create a virtual environment (optional):
python -m venv myenv
source myenv/bin/activate # Linux/Mac
myenv\Scripts\activate # Windows
2. HTTP Protocol Basics
2.1 Key Concepts
(The short sketch after this list shows where each of these appears on a requests response.)
Request methods: GET (retrieve data), POST (submit data)
Status codes:
200: success
404: not found
500: server error
Request headers:
User-Agent: identifies the client
Content-Type: the format of the data being sent
Cookie: keeps the session alive
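A minimal sketch, using httpbin.org purely as an echo service, of where these concepts surface on a requests response (the User-Agent value is just a placeholder):
import requests

resp = requests.get('https://httpbin.org/get',
                    headers={'User-Agent': 'docs-example/1.0'})  # client identification
print(resp.request.method)            # request method: GET
print(resp.status_code)               # status code, e.g. 200 on success
print(resp.headers['Content-Type'])   # format of the response body
print(resp.cookies.get_dict())        # any cookies the server set (empty here)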
2.2 Request Flow
(The sketch below walks through these steps explicitly.)
The client builds the request
The request is sent to the server
The server processes the request
The server returns a response
The client processes the response
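To make the five steps concrete, here is a small sketch using requests' Session/PreparedRequest API, which separates building the request from sending it (httpbin.org is again just an echo service):
import requests

with requests.Session() as s:
    # 1. The client builds the request
    req = requests.Request('GET', 'https://httpbin.org/get', params={'q': 'demo'})
    prepared = s.prepare_request(req)
    # 2-4. The request is sent; the server processes it and returns a response
    resp = s.send(prepared)
    # 5. The client processes the response
    print(resp.status_code, resp.json()['args'])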
3. Basic Usage
3.1 GET Requests
import requests
# A basic GET request
response = requests.get('https://www.example.com')
# Inspect the response
print(response.text)          # body as text
print(response.status_code)   # status code
print(response.headers)       # response headers
3.2 GET Requests with Parameters
# Option 1: build the query string into the URL by hand
response = requests.get('https://httpbin.org/get?name=Alice&age=25')
# Option 2: use the params argument (recommended)
params = {
    'page': 1,
    'per_page': 20,
    'search': 'python'
}
response = requests.get('https://httpbin.org/get', params=params)
3.3 Handling Responses
if response.status_code == 200:
    # Handle different content types differently
    content_type = response.headers.get('Content-Type', '')
    if 'application/json' in content_type:
        data = response.json()
        print(data['args'])          # access the parsed JSON
    elif 'text/html' in content_type:
        print(response.text[:500])   # print the first 500 characters
    else:
        print(response.content)      # raw bytes
else:
    print(f"Request failed with status code {response.status_code}")
4. Advanced Features
4.1 Setting Request Headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Referer': 'https://www.google.com/'
}
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers'])
4.2 POST Requests
# Form submission
data = {
    'username': 'admin',
    'password': 'secret'
}
response = requests.post('https://httpbin.org/post', data=data)
# Submitting JSON
json_data = {
    'title': 'Hello World',
    'content': 'This is a test post'
}
response = requests.post('https://httpbin.org/post', json=json_data)
4.3 File Uploads
# Open the file in binary mode; the with-block closes the handle after the upload
with open('report.xlsx', 'rb') as f:
    files = {'file': f}
    response = requests.post('https://httpbin.org/post', files=files)
print(response.json()['files'])
4.4 Working with Cookies
# Reading cookies from a response
response = requests.get('https://www.example.com')
print(response.cookies.get_dict())
# Sending cookies with a request
cookies = {'session_id': 'abc123'}
response = requests.get('https://httpbin.org/cookies', cookies=cookies)
4.5 Timeouts
try:
    # 3-second connect timeout, 5-second read timeout
    response = requests.get('https://httpbin.org/delay/10', timeout=(3, 5))
except requests.exceptions.Timeout:
    print("Request timed out!")
4.6 Proxies
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('https://httpbin.org/ip', proxies=proxies)
5. Exception Handling
try:
    response = requests.get('https://invalid-url', timeout=5)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as errh:
    print(f"HTTP error: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Connection error: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout error: {errt}")
except requests.exceptions.RequestException as err:
    print(f"Other request error: {err}")
6. Practical Examples
Example 1: Scraping Web Content
from bs4 import BeautifulSoup

url = 'https://books.toscrape.com/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
book_titles = [h3.a['title'] for h3 in soup.find_all('h3')]
print(f"Found {len(book_titles)} books")
Example 2: API Interaction
# Fetch a GitHub user's public profile
username = 'torvalds'
url = f'https://api.github.com/users/{username}'
response = requests.get(url)
data = response.json()
print(f"""
Login: {data['login']}
Name: {data.get('name', 'N/A')}
Company: {data.get('company', 'N/A')}
Followers: {data['followers']}
Public repos: {data['public_repos']}
""")
Example 3: Downloading a File
def download_file(url, save_path):
    response = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in response.iter_content(chunk_size=128):
            fd.write(chunk)
    print(f"File saved to: {save_path}")

# Example: download the Python logo
download_file(
    'https://www.python.org/static/community_logos/python-logo.png',
    'python_logo.png'
)
7. Best Practices and Caveats
Follow the rules in the site's robots.txt
Throttle your requests (an interval of at least 2 seconds between requests is a common guideline); a combined sketch of these two points follows
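A minimal sketch of both points, assuming a small placeholder list of pages on one site (urllib.robotparser is in the standard library; books.toscrape.com is the demo site from Example 1):
import time
import urllib.robotparser
import requests

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://books.toscrape.com/robots.txt')
rp.read()

pages = ['https://books.toscrape.com/', 'https://books.toscrape.com/index.html']  # placeholder list
for page in pages:
    if not rp.can_fetch('My Crawler 1.0', page):
        continue                      # skip anything robots.txt disallows
    resp = requests.get(page, headers={'User-Agent': 'My Crawler 1.0'}, timeout=10)
    print(page, resp.status_code)
    time.sleep(2)                     # wait at least 2 seconds between requests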
Use a Session to reuse connections (well suited to making many requests)
with requests.Session() as s:
    s.headers.update({'User-Agent': 'My Crawler 1.0'})
    s.get('https://www.example.com/login', auth=('user', 'pass'))
    # Later requests on this session automatically carry the cookies it has collected
Handle redirects (followed automatically by default; this can be disabled)
response = requests.get(url, allow_redirects=False)
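A short sketch of the default behaviour, using httpbin's redirect endpoint; response.history records the intermediate responses:
resp = requests.get('https://httpbin.org/redirect/2')
print(resp.status_code)                        # 200, the final response
print([r.status_code for r in resp.history])   # the redirects that were followed (302s here)
print(resp.url)                                # the final URL after redirects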
Configure a retry strategy
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # requests.packages.urllib3 is a deprecated alias

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=0.3,
    status_forcelist=[500, 502, 503, 504]
)
session.mount('https://', HTTPAdapter(max_retries=retries))
8. Debugging Tips
Inspect the URL that was actually requested:
print(response.request.url)
Inspect the request headers that were actually sent:
print(response.request.headers)
Use an online echo service (httpbin.org, used throughout the examples above, reflects back exactly what you send)
Use a traffic-capture tool (or the pure-Python logging sketch after this list):
Chrome开发者工具(F12)
Fiddler
Wireshark
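If you prefer to stay inside Python, a minimal sketch that turns on low-level HTTP logging; this is generic standard-library http.client/logging configuration, not a requests-specific API:
import logging
import http.client
import requests

http.client.HTTPConnection.debuglevel = 1   # print raw request/response header lines
logging.basicConfig(level=logging.DEBUG)    # surface urllib3's connection-pool logs

requests.get('https://httpbin.org/get')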
9. Recommended Learning Resources
HTTP standards: RFC 7230-7235
More advanced crawling frameworks:
Scrapy
Selenium (for JavaScript-rendered pages)
Parsing libraries:
BeautifulSoup
lxml
parsel
10. FAQ
Q: What should I do about SSL certificate errors?
A: Verification can be disabled temporarily (not really recommended):
requests.get(url, verify=False)
Q: How do I fix garbled Chinese text?
A: Set the encoding manually:
response.encoding = 'gbk'  # or 'utf-8'
Q: How do I stay logged in across requests?
A: Use a Session object:
session = requests.Session()
session.post(login_url, data=credentials)
session.get(protected_page_url)
Q: ModuleNotFoundError: the module cannot be found?
A: Check the following (the sketch after this list helps with the first two points):
Have you selected a Python interpreter for this project? PyCharm usually requires you to configure one explicitly; VS Code falls back to the interpreter found on your system PATH.
Is the module actually installed into the interpreter you selected?
Did you install the right package name? Some packages are installed under one name but imported under another (e.g. pip install beautifulsoup4, then import bs4).
Import names are case-sensitive, and the name you install from PyPI may not match the import name's casing (e.g. pip install Pillow, then import PIL), so check the package's documentation for the exact import name rather than renaming folders inside Python's lib directory.
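A tiny sketch for checking the first two points from inside the script that fails to import:
import sys
print(sys.executable)       # which Python interpreter is actually running this script

import requests             # replace with the module that cannot be found
print(requests.__file__)    # where the interpreter loaded it from (raises ModuleNotFoundError if absent)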