JavaScript Web Scraping Tutorial: Collecting Data from Dynamic Pages with JS

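Python is not the only language for web scraping: JavaScript on Node.js is just as capable. If you are already a front-end developer, writing scrapers in JS means you do not have to switch languages at all.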

Basic toolchain

npm install axios cheerio puppeteer

axios sends HTTP requests; cheerio is a server-side implementation of the jQuery API for parsing HTML; puppeteer controls headless Chrome.

Scraping static pages

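const axios = require('axios');
const cheerio = require('cheerio');

// Fetch a static page and print the text of every <h2>.
async function scrape(url) {
  // Send a browser-like User-Agent; some sites reject the default one.
  const { data } = await axios.get(url, {
    headers: {'User-Agent': 'Mozilla/5.0...'}
  });
  // Parse the returned HTML into a jQuery-style queryable document.
  const $ = cheerio.load(data);
  $('h2').each((i, el) => console.log($(el).text()));
}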

Scraping dynamic pages (Puppeteer)

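const puppeteer = require('puppeteer');

// Render a JavaScript-driven page in headless Chrome and return its final HTML.
async function scrapeDynamic(url) {
  const browser = await puppeteer.launch({headless: true});
  try {
    const page = await browser.newPage();
    // 'networkidle0' waits until there have been no network connections for 500 ms.
    await page.goto(url, {waitUntil: 'networkidle0'});
    return await page.content();
  } finally {
    // Always close the browser, even if navigation throws.
    await browser.close();
  }
}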
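
Returning the whole HTML string works, but it is often simpler to extract the values inside the browser once the content has rendered. Here is a small sketch of that approach, using Puppeteer's `page.waitForSelector` and `page.$$eval`. The helper name `extractTexts`, the selector argument, and the 10-second default timeout are assumptions for illustration.

```javascript
// Sketch: read text from a rendered page instead of returning raw HTML.
// `page` is expected to be a Puppeteer Page; `selector` is any CSS selector.
async function extractTexts(page, selector, timeoutMs = 10000) {
  // Wait until at least one matching element exists in the DOM.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // $$eval runs the callback inside the browser against all matches
  // and sends the serializable result back to Node.
  return page.$$eval(selector, els => els.map(el => el.textContent.trim()));
}
```

Inside a function such as `scrapeDynamic`, this could be called as `await extractTexts(page, 'h2')` after `page.goto` and before the browser is closed.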

JS vs. Python for scraping

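  • JS is a natural fit for web data: JSON is native to the language, so parsing and reshaping it needs no extra library
  • Puppeteer controls Chrome natively, which gives it an edge over Selenium in Python
  • Python's ecosystem is more mature, and Scrapy is hard to replace
  • Pick the language you know best; either one can do the job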
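
To illustrate the point about JSON, here is a minimal sketch that cleans up an API-style response using only built-in language features. The payload, its field names, and the `normalizeItems` helper are hypothetical.

```javascript
// Hypothetical API payload; the field names are assumptions for illustration.
const payload =
  '{"items":[{"title":"A","price":"12.50"},{"title":"B","price":null}]}';

// Parse the JSON text, drop items without a price, and convert prices to numbers.
function normalizeItems(jsonText) {
  const { items = [] } = JSON.parse(jsonText);
  return items
    .filter(item => item.price !== null)
    .map(item => ({ title: item.title, price: Number(item.price) }));
}

console.log(JSON.stringify(normalizeItems(payload), null, 2));
```

Parsing, destructuring, and serializing all come from the language itself, so no extra dependency is needed for this step.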

chcrazy