How to Scrape Data with Python

舞姬之光 2025-11-17 00:00:00

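To scrape data, first fetch the page with requests, then parse the HTML with BeautifulSoup to extract the information you need. Use Selenium to simulate a browser for dynamic content, and finally clean the data and save it as CSV, JSON, or to a database.

Scraping is a common task in Python. It mainly consists of sending an HTTP request to get a page's content and then parsing out the required information. A few core libraries usually do the work: requests, BeautifulSoup, re (regular expressions), and lxml, with Selenium sometimes used for dynamic pages.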

1. Send a request to get the page content

The requests library makes it easy to fetch a page's HTML source.

Example:

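import requests
# Target page (URL kept exactly as given in the original post)
url = 'https://www./link/b05edd78c294dcf6d960190bf5bde635'
# Browser-like User-Agent; some sites reject the library default
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}
# timeout keeps the call from hanging forever on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    html = response.text
else:
    print("Request failed, status code:", response.status_code)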

Set a User-Agent header so the request is less likely to be blocked by anti-scraping measures. Some sites validate the request headers.
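As a minimal sketch of the request step, the helper below wraps the same call with a timeout and error handling. The name fetch_html, the DEFAULT_HEADERS constant, and the optional session parameter are illustrative assumptions, not part of the original example.

```python
import requests

# Assumed default headers, reusing the User-Agent from the example above.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper; pass a requests.Session to reuse
    connections and cookies across several pages.
    """
    get = session.get if session is not None else requests.get
    try:
        response = get(url, headers=DEFAULT_HEADERS, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception instead of checking 200 by hand.
        response.raise_for_status()
    except requests.RequestException as exc:
        print("Request failed:", exc)
        return None
    return response.text
```

Calling fetch_html(url) then gives either the HTML string or None, so the parsing step can be skipped cleanly when the download fails.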

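2. Parse the HTML and extract data

BeautifulSoup is the usual choice for parsing the HTML structure. Content is extracted by tag name or with CSS selectors.

Example: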

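from bs4 import BeautifulSoup
# html is the page source obtained in step 1
soup = BeautifulSoup(html, 'html.parser')
# The keyword is class_ (trailing underscore) because class is reserved in Python
titles = soup.find_all('h2', class_='title')  # all h2 tags whose class is "title"
for title in titles:
    print(title.get_text(strip=True))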

You can also use the select() method with CSS selectors:

soup.select('div.content p') returns every p tag nested inside an element matching div.content.
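To make the selector behavior concrete, here is a small sketch that runs select() and select_one() on an inline HTML snippet. The SAMPLE_HTML markup is invented for illustration and does not come from any real page.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to demonstrate the selectors.
SAMPLE_HTML = """
<div class="content">
  <h2 class="title">First post</h2>
  <p>Intro paragraph.</p>
  <p>Second paragraph.</p>
  <a href="/more">Read more</a>
</div>
<p>Footer text outside div.content.</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# select() returns a list of every match; the footer p is outside div.content.
paragraphs = [p.get_text(strip=True) for p in soup.select("div.content p")]

# select_one() returns the first match, or None when nothing matches.
first_link = soup.select_one("div.content a")
link_target = first_link["href"] if first_link is not None else None

print(paragraphs)
print(link_target)
```

Checking for None before reading an attribute avoids a TypeError on pages where the expected element is missing.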

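3. Handle dynamically loaded content (JavaScript rendering)

If the page content is generated by JavaScript, requests only receives the initial HTML and not the rendered data. In that case use Selenium or Playwright.

Example (Selenium):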

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
try:
    driver.get('https://www./link/b05edd78c294dcf6d960190bf5bde635')
    # Wait for the elements to load (WebDriverWait can be used here)
    elements = driver.find_elements(By.CLASS_NAME, 'item')
    for elem in elements:
        print(elem.text)
finally:
    driver.quit()  # always close the browser, even if an error occurs

Selenium drives a real browser, so it suits SPAs (single-page applications) and scenarios that require logging in or clicking through pages.
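The summary at the top also mentions cleaning the data and saving it as CSV or JSON. A minimal sketch of that last step, using only the standard library, might look like the following. The field names title and url and the function names are assumptions chosen for illustration.

```python
import csv
import json

# Assumed column layout for the scraped records.
FIELDS = ["title", "url"]


def clean_rows(rows):
    """Strip whitespace from each field and drop rows with no title."""
    cleaned = []
    for row in rows:
        title = (row.get("title") or "").strip()
        if not title:
            continue
        cleaned.append({"title": title, "url": (row.get("url") or "").strip()})
    return cleaned


def save_csv(rows, path):
    """Write rows to a UTF-8 CSV file with a header line."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def save_json(rows, path):
    """Write rows to a JSON file, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
```

For example, rows = clean_rows(scraped) followed by save_csv(rows, "output.csv") writes one line per record. Passing ensure_ascii=False keeps Chinese text as-is in the JSON file instead of escaping it.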