Browser Agent | AI Full-Stack Architect

Estimated time

2 weeks

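The front-end advantage

Your working knowledge of the DOM, selectors, the page lifecycle, and asynchronous operations gives you a natural head start on Browser Agents that back-end-only developers usually lack.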

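Learning objectives

Automate web page interactions with Playwright
Understand how an LLM and a browser work together
Implement an automated flow: log in → search → fill in a form → submit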

1. Playwright basics

1.1 Installation

bash

npm init -y
npm install playwright
npx playwright install chromium  # download the browser binary

1.2 Launching the browser

javascript

import { chromium } from 'playwright';

const browser = await chromium.launch({
  headless: false,  // false = show the browser window, useful while debugging
  slowMo: 100,      // slow each operation down by 100 ms so you can follow along
});

const context = await browser.newContext({
  viewport: { width: 1280, height: 720 },
  locale: 'zh-CN',
});

const page = await context.newPage();

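2. Locating elements

javascript

// ===== Most reliable approaches =====

// 1. By text content (recommended: visible text rarely changes)
await page.click('text=Log in');
await page.click('button:has-text("Submit")');

// 2. By role (the most semantic)
await page.click('role=button[name="Log in"]');
await page.fill('role=textbox[name="Username"]', 'admin');

// 3. By placeholder
await page.fill('input[placeholder="Enter your email"]', 'test@test.com');

// 4. By test id (the most stable, but needs cooperation from the front-end team)
await page.click('[data-testid="submit-btn"]');

// ===== Less reliable, use with caution =====

// 5. CSS selector (breaks when the page is redesigned)
await page.click('#login-btn');

// 6. XPath (the most fragile)
await page.click('//button[contains(text(), "Log in")]');

Selector priority

text

role > text > placeholder > test-id > CSS > XPath

Rule of thumb: pick the approach least likely to change when the front end is modified.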
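
To make the priority order concrete, here is a minimal sketch that turns a description of an element into an ordered list of selector strings. The function name `buildSelectorCandidates` and the shape of its input object are assumptions made for illustration; only the selector syntax comes from the examples above.

```javascript
// Hypothetical helper: order selector candidates by the priority
// role > text > placeholder > test-id > CSS > XPath.
function buildSelectorCandidates(el) {
  const candidates = [];
  if (el.role && el.name) candidates.push(`role=${el.role}[name="${el.name}"]`);
  if (el.text) candidates.push(`text=${el.text}`);
  if (el.placeholder) candidates.push(`input[placeholder="${el.placeholder}"]`);
  if (el.testId) candidates.push(`[data-testid="${el.testId}"]`);
  if (el.css) candidates.push(el.css);
  if (el.xpath) candidates.push(el.xpath);
  return candidates;
}

// Example: a login button described three ways, passed in arbitrary order.
const loginButton = buildSelectorCandidates({
  css: '#login-btn',
  text: 'Log in',
  role: 'button',
  name: 'Log in',
});
console.log(loginButton);
```

Whatever order the properties are supplied in, the role selector comes out first and the CSS selector last, so code that tries the candidates in sequence reaches the fragile ones only when the sturdier ones fail.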

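3. Waiting strategies

javascript

// ❌ Avoid fixed sleeps
await page.waitForTimeout(3000);

// ✅ Wait for a specific element to appear
await page.waitForSelector('.result-list', { state: 'visible' });

// ✅ Wait for network activity to settle
// (can hang on pages that poll or stream; prefer element waits where possible)
await page.waitForLoadState('networkidle');

// ✅ Wait for specific text to appear
await page.waitForSelector('text=Operation successful');

// ✅ Wait for navigation to finish
await page.waitForURL('**/dashboard');

// ✅ Combine them
await page.click('text=Submit');
await page.waitForURL('**/success', { timeout: 10000 });
await page.waitForSelector('text=Submitted successfully');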
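
A submit can end in more than one way, so it is sometimes useful to wait for whichever outcome appears first instead of only the success case. The sketch below is one way to do that with `page.waitForSelector`; the helper name `waitForFirst` and the selectors in the usage comment are assumptions, not part of Playwright.

```javascript
// Hypothetical helper: wait for whichever of several outcomes shows up first.
// `page` is a Playwright Page; `outcomes` maps a label to a selector.
async function waitForFirst(page, outcomes, timeout = 10000) {
  const waits = Object.entries(outcomes).map(([label, selector]) =>
    page.waitForSelector(selector, { state: 'visible', timeout }).then(() => label)
  );
  // Promise.any fulfills with the first wait that succeeds and rejects
  // only if every wait fails (for example, all of them time out).
  return Promise.any(waits);
}

// Usage sketch (selectors are placeholders for your own page):
// const outcome = await waitForFirst(page, {
//   success: 'text=Submitted successfully',
//   error: '.error-message',
// });
// if (outcome === 'error') { /* handle the failed submit */ }
```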

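4. Automated login

javascript

async function autoLogin(page, url, username, password) {
  await page.goto(url);

  // Use one selector for both waiting and filling, so a form matched only
  // by its placeholder does not pass the wait and then fail on fill
  const emailInput = 'input[type="email"], input[name="email"], input[placeholder*="Email"]';

  // Wait for the login form to appear
  await page.waitForSelector(emailInput);

  // Fill in the form
  await page.fill(emailInput, username);
  await page.fill('input[type="password"]', password);

  // Click the login button
  await page.click('button[type="submit"], button:has-text("Log in"), button:has-text("Sign in")');

  // Wait for login to succeed (did we land on the dashboard?)
  try {
    await page.waitForURL('**/dashboard', { timeout: 10000 });
  } catch {
    console.log('Login may have failed');
    return false; // do not save state from a failed login
  }

  // Save the login state (skip the login next time)
  await page.context().storageState({ path: 'auth.json' });
  return true;
}

// Reuse the saved state next time
const context = await browser.newContext({
  storageState: 'auth.json',
});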
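
Saved state goes stale once the session cookie expires, so it can help to check the file before reusing it. The sketch below inspects the parsed `auth.json`; the helper name `isAuthStateFresh` and the cookie name `session_id` are assumptions, and you would need to look up the real session cookie name for your site.

```javascript
// Hypothetical helper: decide whether a saved storage state is still usable.
// `state` is the parsed content of auth.json; `cookieName` is the name of the
// site's session cookie (an assumption you must fill in for your own site).
function isAuthStateFresh(state, cookieName, nowSeconds = Date.now() / 1000) {
  const cookie = (state.cookies ?? []).find((c) => c.name === cookieName);
  if (!cookie) return false;
  // Playwright stores `expires` as Unix time in seconds; -1 marks a session cookie.
  if (cookie.expires === -1) return true;
  return cookie.expires > nowSeconds;
}

// Usage sketch:
// import fs from 'node:fs';
// const state = JSON.parse(fs.readFileSync('auth.json', 'utf8'));
// const context = isAuthStateFresh(state, 'session_id')
//   ? await browser.newContext({ storageState: 'auth.json' })
//   : await browser.newContext(); // then call autoLogin again
```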

5. Complete form-filling example

javascript

import { chromium } from 'playwright';

async function autoFillForm() {
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();

  // 1. Open the page
  await page.goto('https://example.com/apply');

  // 2. Fill in the basic information
  await page.fill('input[name="name"]', 'Zhang San');
  await page.fill('input[name="email"]', 'zhang@example.com');
  await page.fill('input[name="phone"]', '13800138000');

  // 3. Choose an option from the dropdown
  await page.selectOption('select[name="department"]', 'engineering');

  // 4. Tick the checkbox
  await page.check('input[name="agree_terms"]');

  // 5. Upload a file
  await page.setInputFiles('input[type="file"]', '/path/to/resume.pdf');

  // 6. Fill in the text area
  await page.fill('textarea[name="reason"]', 'I am very interested in this position...');

  // 7. Take a screenshot as a record
  await page.screenshot({ path: 'form-filled.png', fullPage: true });

  // 8. Submit
  await page.click('button[type="submit"]');

  // 9. Wait for the success message
  await page.waitForSelector('text=Submitted successfully', { timeout: 10000 });

  await browser.close();
}
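
When the same script has to fill several different forms, one option is to describe the fields as data and loop over them. This is a sketch under that assumption: `fillForm`, the `kind` values, and the field list are illustrative names, while `fill`, `selectOption`, `check`, and `setInputFiles` are the Playwright calls used above.

```javascript
// Hypothetical helper: drive the same Playwright calls from a field list.
async function fillForm(page, fields) {
  for (const field of fields) {
    switch (field.kind) {
      case 'text':
        await page.fill(field.selector, field.value);
        break;
      case 'select':
        await page.selectOption(field.selector, field.value);
        break;
      case 'check':
        await page.check(field.selector);
        break;
      case 'file':
        await page.setInputFiles(field.selector, field.value);
        break;
      default:
        // Fail loudly on typos instead of silently skipping a field.
        throw new Error(`Unknown field kind: ${field.kind}`);
    }
  }
}

// Illustrative field list for the application form.
const applicationFields = [
  { kind: 'text', selector: 'input[name="name"]', value: 'Zhang San' },
  { kind: 'select', selector: 'select[name="department"]', value: 'engineering' },
  { kind: 'check', selector: 'input[name="agree_terms"]' },
];
// await fillForm(page, applicationFields);
```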

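6. LLM + Playwright collaboration

Core idea

text

The LLM makes the decisions: it looks at the page and chooses the next step
Playwright does the execution: it drives the browser

Loop:
  1. Capture a screenshot of the page
  2. Send the screenshot and the task description to the LLM
  3. The LLM analyzes the screenshot → outputs the next action
  4. Playwright executes the action
  5. Go back to step 1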
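
The loop can be sketched as a small function with its three responsibilities passed in as callbacks, which also makes it testable without a browser or an API key. The names `runAgentLoop`, `observe`, `decide`, and `act`, and the `maxSteps` budget, are assumptions introduced here for illustration.

```javascript
// Hypothetical skeleton of the observe → decide → act loop.
// observe(): returns the current page state (screenshot, URL, text, ...)
// decide(task, observation, history): asks the LLM for the next action
// act(action): executes the action with Playwright
async function runAgentLoop({ task, observe, decide, act, maxSteps = 20 }) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const observation = await observe();
    const action = await decide(task, observation, history);
    history.push(action);
    if (action.type === 'DONE') {
      return { finished: true, steps: history.length, history };
    }
    await act(action);
  }
  // The step budget ran out before the LLM reported completion.
  return { finished: false, steps: history.length, history };
}
```

The step budget is a design choice rather than a requirement: without some cap, an LLM that keeps issuing the same action would keep the loop (and the API bill) running indefinitely.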

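Simplified implementation

javascript

import { chromium } from 'playwright';
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function browserAgent(task, startUrl = 'https://example.com', maxSteps = 20) {
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto(startUrl);

  const messages = [
    { role: 'system', content: `You are a web automation assistant. On each turn I give you information about the current page,
and you decide the next action. Reply with exactly one action in one of these formats:
- CLICK <element description>
- TYPE <input field description> <content>
- SCROLL_DOWN
- DONE <summary of the completed task>` },
    { role: 'user', content: `Task: ${task}` },
  ];

  let done = false;
  // Cap the number of steps so an unparseable reply cannot loop forever
  for (let step = 0; step < maxSteps && !done; step++) {
    // Capture the page state
    const title = await page.title();
    const url = page.url();
    const bodyText = await page.evaluate(() => document.body.innerText.substring(0, 3000));
    const screenshot = await page.screenshot({ type: 'png' });

    // Ask the LLM to decide
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        ...messages,
        {
          role: 'user',
          content: [
            { type: 'text', text: `Current page: ${url}\nTitle: ${title}\nContent summary: ${bodyText}\nDecide the next action.` },
            { type: 'image_url', image_url: { url: `data:image/png;base64,${screenshot.toString('base64')}` } },
          ],
        },
      ],
    });

    const instruction = (response.choices[0].message.content ?? '').trim();
    console.log(`LLM: ${instruction}`);

    // Parse and execute the instruction
    if (instruction.startsWith('CLICK ')) {
      const target = instruction.slice(6);
      // Fall back from any matching text, to a button, to a link
      await page.click(`text=${target}`).catch(() =>
        page.click(`button:has-text("${target}")`).catch(() =>
          page.click(`a:has-text("${target}")`)
        )
      );
    } else if (instruction.startsWith('TYPE ')) {
      // The field description must be a single word; the rest is the content
      const [_, selector, ...contentParts] = instruction.split(' ');
      const content = contentParts.join(' ');
      await page.fill(`input[placeholder*="${selector}"], input[name="${selector}"]`, content);
    } else if (instruction.startsWith('SCROLL_DOWN')) {
      await page.mouse.wheel(0, 600);
    } else if (instruction.startsWith('DONE')) {
      done = true;
    }

    messages.push({ role: 'assistant', content: instruction });
  }

  await browser.close();
}

// Usage
await browserAgent('Search GitHub for "mcp server" and find the one with the most stars', 'https://github.com');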
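
String matching with `startsWith` is easy to trip up when the LLM adds whitespace or commentary. As a hedged alternative, the sketch below parses the four instruction formats from the system prompt into plain objects; the function name `parseInstruction` and the returned object shapes are assumptions for illustration.

```javascript
// Hypothetical parser for the four instruction formats in the system prompt.
// Returns { type, ... } or { type: 'UNKNOWN', raw } when nothing matches.
function parseInstruction(raw) {
  // Only look at the first non-empty line, in case the LLM adds commentary.
  const line = (raw ?? '')
    .split('\n')
    .map((l) => l.trim())
    .find((l) => l.length > 0) ?? '';

  if (line === 'SCROLL_DOWN') return { type: 'SCROLL_DOWN' };

  const done = line.match(/^DONE\b\s*(.*)$/);
  if (done) return { type: 'DONE', summary: done[1] };

  const click = line.match(/^CLICK\s+(.+)$/);
  if (click) return { type: 'CLICK', target: click[1] };

  // TYPE <field> <content>: the field is one word, the rest is the content.
  const type = line.match(/^TYPE\s+(\S+)\s+(.+)$/);
  if (type) return { type: 'TYPE', field: type[1], content: type[2] };

  return { type: 'UNKNOWN', raw: line };
}

console.log(parseInstruction('TYPE search mcp server'));
// → { type: 'TYPE', field: 'search', content: 'mcp server' }
```

Returning an explicit `UNKNOWN` lets the agent loop decide what to do with a malformed reply, for example by asking the LLM to restate its action, rather than silently doing nothing.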

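7. Browser Use

Browser Use is a ready-made Browser Agent library:

python

# Python example (a conceptual overview is enough)
# Note: the LLM wrapper import has varied across browser-use releases;
# check the documentation for the version you install
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task="Search GitHub for playwright and open the repository with the most stars",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    result = await agent.run()
    print(result)

asyncio.run(main())

If you work mainly in the JS ecosystem, hand-writing the agent loop with Playwright and an LLM is flexible enough.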

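8. Challenges and solutions

Challenge	Solution
Selectors break	Prefer text/role selectors and add fallback logic (try text → then CSS → finally XPath)
Dynamic content	`waitForSelector` + `networkidle`
Pop-ups / CAPTCHAs	Detect the pop-up → close or skip it; CAPTCHAs need a human in the loop
Expired login	Detect the login page → automatically reload the saved auth state
Slow network	Increase timeouts and add retries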
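
Two of the rows above, selector fallback and retries, lend themselves to small helpers. The following is a minimal sketch, not a definitive implementation: `clickWithFallback` and `withRetry` are hypothetical names, and the timeout and delay defaults are arbitrary starting points.

```javascript
// Selector fallback: try each selector in order with a short timeout.
async function clickWithFallback(page, selectors, timeout = 2000) {
  let lastError;
  for (const selector of selectors) {
    try {
      await page.click(selector, { timeout });
      return selector; // report which selector worked
    } catch (error) {
      lastError = error;
    }
  }
  throw new Error(`No selector matched: ${selectors.join(', ')}`, { cause: lastError });
}

// Retry with exponential backoff for slow or flaky networks.
// `retries` counts the extra attempts made after the first one fails.
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= retries) throw error;
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage sketch:
// await withRetry(() => page.goto('https://example.com', { timeout: 15000 }));
// await clickWithFallback(page, ['text=Log in', '#login-btn', '//button[contains(text(), "Log in")]']);
```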

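Practice

Write a Playwright script that logs in to GitHub automatically
Write a script that runs a search and opens the first result
Plug in an LLM so the agent makes decisions from page screenshots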
