Python Web Scraping Tutorial for Beginners: Build Your First Web Scraper from Scratch

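Python is one of the most popular languages for writing web scrapers. This article walks you step by step through a complete web scraper in Python: setting up the environment, sending requests, parsing HTML, and storing data.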

Step 1: Install the dependencies

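First you need two core libraries: requests, which sends HTTP requests to fetch pages, and beautifulsoup4, which parses HTML and extracts data. The command below also installs lxml, the parser used in Step 3.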

pip install requests beautifulsoup4 lxml

Step 2: Send a request and fetch the page

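import requests

url = "https://example.com"
# Send a browser-like User-Agent; some sites reject the default one used by requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
# A timeout keeps a stalled connection from hanging the script
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    html = response.text
else:
    print(f"Request failed: {response.status_code}")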
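
A single request like the one above gives up on the first timeout or transient server error. The sketch below wraps the request in a small retry loop. The names fetch_html, get, retries, and backoff are assumptions made for illustration, not part of requests; get is any callable that takes a URL and returns an object with status_code and text attributes, so the requests call from this step can be passed in.

```python
import time


def fetch_html(url, get, retries=3, backoff=1.0, sleep=time.sleep):
    """Return the page text, or None if every attempt fails.

    `get` is a hypothetical hook: any callable that takes a URL and returns
    an object with `status_code` and `text` attributes.
    """
    for attempt in range(retries):
        try:
            response = get(url)
            if response.status_code == 200:
                return response.text
            print(f"Attempt {attempt + 1}: got status {response.status_code}")
        except OSError as exc:
            # requests.RequestException derives from OSError, so connection
            # errors and timeouts raised by requests are caught here.
            print(f"Attempt {attempt + 1}: {exc}")
        if attempt < retries - 1:
            # Wait a little longer after each failure: backoff, 2 * backoff, ...
            sleep(backoff * (attempt + 1))
    return None
```

With requests, this could be called as fetch_html(url, get=lambda u: requests.get(u, headers=headers, timeout=10)). Retrying on every non-200 status is a simplification; a 404, for example, will not succeed on a second attempt.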

Step 3: Parse the HTML and extract data

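from bs4 import BeautifulSoup

# html is the page source fetched in Step 2; "lxml" selects the parser installed in Step 1
soup = BeautifulSoup(html, "lxml")

# Extract every <h2> heading
titles = soup.find_all("h2")
for title in titles:
    print(title.text.strip())

# Extract with a CSS selector
items = soup.select(".product-item .price")
for item in items:
    print(item.text.strip())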
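
The loops above only print what they extract. To keep the results, the extracted text can be written to a CSV file with the standard-library csv module. This is a minimal sketch: save_rows, the file name products.csv, and the column names title and price are assumptions made for illustration, and the sample values are made up.

```python
import csv


def save_rows(rows, path, fieldnames):
    """Write a list of dicts to `path` as CSV with a header row."""
    # newline="" lets the csv module control line endings;
    # utf-8 keeps non-ASCII text intact.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Hypothetical rows, shaped like the output of the parsing step
    rows = [
        {"title": "Widget", "price": "9.99"},
        {"title": "Gadget", "price": "19.50"},
    ]
    save_rows(rows, "products.csv", fieldnames=["title", "price"])
```

With the soup object from this step, the rows could be built as [{"title": t.text.strip()} for t in titles] and saved with fieldnames=["title"].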

Step 4: Pagination and recursion

# Scrape pages 1 through 10
for page in range(1, 11):
    url = f"https://example.com/page/{page}"
    response = requests.get(url, headers=headers, timeout=10)
    # ... parsing logic ...
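
The loop above covers pagination when the URL pattern is known in advance. The "recursion" half of this step, following links discovered on each page, can be sketched as a breadth-first crawl with a visited set. The names crawl, fetch, extract_links, and max_pages are assumptions made for illustration: fetch returns a page's HTML (or None on failure) and extract_links returns the href values found in that HTML, so both can be backed by the code from Steps 2 and 3.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(start_url, fetch, extract_links, max_pages=50):
    """Visit pages reachable from `start_url` on the same host, breadth-first.

    Returns the list of URLs visited, in order.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip pages that failed to download
        visited.append(url)
        for href in extract_links(html):
            # Resolve relative links and drop any #fragment
            link, _ = urldefrag(urljoin(url, href))
            # Stay on the starting host and never queue a URL twice
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A BeautifulSoup-based extract_links could be lambda html: [a["href"] for a in BeautifulSoup(html, "lxml").select("a[href]")]. The max_pages cap and the visited set are what keep the crawl from running forever on sites whose pages link back to each other.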

Step 5: Add a proxy

When scraping many pages, consider routing your requests through a proxy:

# user, pass, proxy_ip, and port are placeholders for your own proxy's details
proxies = {
    "http": "http://user:pass@proxy_ip:port",
    "https": "http://user:pass@proxy_ip:port"
}
response = requests.get(url, headers=headers, proxies=proxies)
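
The proxy URLs above embed the username and password directly. If either contains characters such as @, : or /, they need to be percent-encoded, or the URL will be split in the wrong place. A small helper could assemble the dictionary; build_proxies is a hypothetical name, and the host, port, and credentials passed to it are placeholders.

```python
from urllib.parse import quote


def build_proxies(user, password, host, port):
    """Return a proxies dict in the shape used in the example above."""
    # quote() already percent-encodes "@" and ":"; safe="" makes it encode "/" too
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    # One HTTP proxy serves both plain and HTTPS traffic, as in the example above
    return {"http": proxy_url, "https": proxy_url}
```

The result can be passed straight to requests.get(url, headers=headers, proxies=build_proxies(...)).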

Going further: the Scrapy framework

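For small projects, Requests plus BeautifulSoup is enough. For large-scale scraping, consider the Scrapy framework, which comes with asynchronous requests, middleware, item pipelines, and more built in.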

斜杠青年
