Building an Autonomous Testing Agent With Playwright + LLMs: From Scripts to Self-Directed Exploration

Anant Jain, Principal QA Engineer at a Series-C fintech company, dropped a line during a testing architecture review that rewired how I think about automation:

“We stopped writing test scripts. We started building an agent that writes its own test scripts while exploring the application. Our escaped defect rate dropped 62% in one quarter.”

That statement sounds like exaggeration until you see the architecture behind it. For the past six months, I have been building, iterating, and battle-testing an autonomous testing agent that combines Playwright for browser control, a locally-hosted LLM via Ollama for decision making, and a memory layer that prevents the agent from going in circles. The result is a system that lands on a page, reads the DOM, decides what to interact with, executes actions, captures evidence, and generates bug reports — all without a single hardcoded test flow.

This is not theoretical. This is running in CI pipelines today. And in this article, I am going to show you exactly how to build one from scratch.

If you have been following the evolution from Playwright-based test agents and the broader shift toward AI agent evaluation in QA, you already know the direction the industry is heading. This article gives you the implementation.


The Conceptual Leap: Scripted Tests vs. Autonomous Agents

Traditional test automation is deterministic. You write a script that says: go to this URL, click this button, fill this form, assert this value. The script does exactly what you told it. Nothing more. Nothing less. If the UI changes, the script breaks. If a new feature appears, the script ignores it. If there is a critical bug hiding behind an interaction path you did not anticipate, the script will never find it.

An autonomous testing agent inverts this model entirely. Instead of following a predetermined path, the agent:

  1. Observes the current page state by reading the DOM
  2. Reasons about what actions are possible and which are most valuable to explore
  3. Decides what to click, fill, or interact with next
  4. Executes the action via Playwright
  5. Records the result, takes screenshots, and updates its memory
  6. Repeats until it has explored the application sufficiently or found anomalies

The difference is not incremental. It is categorical. A scripted test validates what you already know. An autonomous agent discovers what you do not know. Anant Jain put it bluntly: “Scripted tests confirm your assumptions. Autonomous agents challenge them.”
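The six-step loop above can be sketched as a schematic, with the observe/decide/execute components stubbed out (the real implementations appear later in this article; the toy "decider" here is purely illustrative):

```python
# The observe -> reason -> act loop, as a schematic sketch with stubbed
# components. Real DOM reading, LLM decisions, and Playwright execution
# are plugged in later in the article.
def explore(observe, decide, execute, max_steps=10):
    history = []
    for _ in range(max_steps):
        state = observe()                   # 1. read the current page state
        decision = decide(state, history)   # 2-3. reason and pick an action
        if decision["action"] == "done":
            break
        result = execute(decision)          # 4. act via the browser
        history.append((decision, result))  # 5. update memory
    return history                          # 6. stop when done or budget spent

# Toy run: a "decider" that clicks three elements, then declares done
states = iter([{"elements": 3}] * 10)
clicks = iter([0, 1, 2, None])

def toy_decide(state, history):
    idx = next(clicks)
    return {"action": "done"} if idx is None else {"action": "click", "element_index": idx}

history = explore(lambda: next(states), toy_decide, lambda decision: "ok")
print(len(history))  # 3
```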

The Full Technical Stack

Before we dive into code, here is the complete stack and what each component does:

Component                  | Role                                    | Why This Choice
Playwright (Python)        | Browser automation and DOM interaction  | Best cross-browser support, native async, excellent selector engine
Ollama (Llama 3 / Mistral) | Local LLM for decision making           | No API costs, data stays local, fast inference for action selection
Python 3.11+               | Orchestration layer                     | Rich ecosystem, async support, easy Playwright and LLM integration
Allure                     | Test reporting and evidence collection  | Industry-standard reporting with screenshot and step attachment support
Jenkins                    | CI/CD pipeline execution                | Widely adopted, plugin ecosystem, scheduled autonomous runs
openpyxl                   | Excel test case generation              | Stakeholder-friendly output format for generated test cases

The architecture is deliberately modular. You can swap Ollama for OpenAI, replace Jenkins with GitHub Actions, or use pytest-html instead of Allure. The patterns remain the same.
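Because only `llm_brain.py` talks to the model, swapping providers is a one-module change. One way to make that seam explicit is a small protocol (this is an assumed refactoring for illustration, not code the article's modules define):

```python
# A pluggable "brain" interface so Ollama can be swapped for OpenAI or any
# other provider. This Protocol is an assumed refactoring of llm_brain.py,
# shown here only to illustrate the modularity claim.
from typing import Protocol


class DecisionBackend(Protocol):
    def decide(self, page_summary: dict, memory_context: str) -> dict: ...


class ScriptedBackend:
    """A deterministic stand-in, handy for testing the explorer itself."""
    def __init__(self, decisions: list[dict]):
        self.decisions = iter(decisions)

    def decide(self, page_summary: dict, memory_context: str) -> dict:
        # Fall back to "done" once the scripted decisions run out
        return next(self.decisions, {"action": "done", "element_index": None,
                                     "value": None, "reasoning": "script exhausted"})


backend: DecisionBackend = ScriptedBackend([
    {"action": "click", "element_index": 0, "value": None, "reasoning": "first link"},
])
print(backend.decide({}, "")["action"])  # click
print(backend.decide({}, "")["action"])  # done
```

Any object with a matching `decide` method satisfies the protocol, so the explorer never needs to know which LLM is behind it.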

Project Structure and Setup

Here is the directory layout for the autonomous agent:

autonomous-test-agent/
├── agent/
│   ├── __init__.py
│   ├── explorer.py          # Core exploration loop
│   ├── dom_reader.py        # DOM extraction and simplification
│   ├── llm_brain.py         # LLM integration via Ollama
│   ├── memory.py            # State tracking and deduplication
│   ├── auth_handler.py      # Smart login/auth flow handling
│   ├── bug_reporter.py      # Auto-generated bug reports
│   ├── visual_analyzer.py   # Screenshot comparison for UI anomalies
│   └── excel_generator.py   # Test case export to Excel
├── config/
│   ├── settings.yaml         # URLs, credentials, LLM config
│   └── selectors.yaml        # Optional known selectors for auth flows
├── reports/
│   ├── allure-results/
│   └── generated-tests/
├── screenshots/
├── run_agent.py              # Entry point
├── Jenkinsfile
└── requirements.txt

Install the dependencies:

pip install playwright ollama allure-pytest openpyxl pillow pyyaml
playwright install chromium

How the Agent Reads the DOM to Understand Page Context

The foundation of the entire system is DOM reading. The agent cannot make intelligent decisions if it does not understand what is on the page. But sending the full DOM to an LLM is wasteful and slow. We need a simplified, actionable representation.

# agent/dom_reader.py
# Extracts a simplified, LLM-friendly representation of the current page DOM.
# Focuses on interactive elements that the agent can act upon.

from playwright.async_api import Page


async def extract_interactive_elements(page: Page) -> list[dict]:
    # JavaScript runs in browser context to find all interactive elements
    elements = await page.evaluate('''
        () => {
            const interactable = [];
            const selectors = 'a, button, input, select, textarea, [role="button"], [role="link"], [onclick]';
            const nodes = document.querySelectorAll(selectors);

            nodes.forEach((el, index) => {
                const rect = el.getBoundingClientRect();
                // Skip elements that are not visible
                if (rect.width === 0 || rect.height === 0) return;
                if (window.getComputedStyle(el).display === 'none') return;

                interactable.push({
                    index: index,
                    tag: el.tagName.toLowerCase(),
                    type: el.type || null,
                    text: (el.textContent || '').trim().slice(0, 100),
                    placeholder: el.placeholder || null,
                    name: el.name || null,
                    id: el.id || null,
                    href: el.href || null,
                    aria_label: el.getAttribute('aria-label') || null,
                    classes: (typeof el.className === 'string' && el.className) ? el.className.split(' ').slice(0, 3).join(' ') : null,
                    value: el.value || null
                });
            });
            return interactable;
        }
    ''')
    return elements


async def get_page_summary(page: Page) -> dict:
    # Builds a complete page context object for the LLM
    title = await page.title()
    url = page.url
    elements = await extract_interactive_elements(page)

    # Extract visible text headings for additional context
    headings = await page.evaluate('''
        () => {
            return Array.from(document.querySelectorAll('h1, h2, h3'))
                .map(h => h.textContent.trim())
                .filter(t => t.length > 0)
                .slice(0, 10);
        }
    ''')

    # Detect forms on the page
    form_count = await page.evaluate("() => document.querySelectorAll('form').length")

    return {
        "title": title,
        "url": url,
        "headings": headings,
        "form_count": form_count,
        "interactive_elements": elements,
        "element_count": len(elements)
    }

This gives the LLM a compact, structured view of the page. Instead of parsing thousands of DOM nodes, it sees a list of actionable elements with their labels, types, and attributes. The get_page_summary function becomes the agent’s eyes.

LLM-Based Decision Making: No Fixed Flows

Here is where the magic happens. The LLM receives the page context and decides what to do next. There are no hardcoded flows. The agent reasons about the page and picks the most valuable next action.

# agent/llm_brain.py
# The decision-making core of the autonomous agent.
# Sends page context to a local Ollama LLM and receives structured action decisions.

import json
import ollama


SYSTEM_PROMPT = '''You are an autonomous QA testing agent exploring a web application.
Given the current page context (URL, title, headings, interactive elements), decide the
single best next action to take. Your goal is to explore the application thoroughly,
find bugs, and test edge cases.

Rules:
- Prioritize unexplored elements and pages
- Test forms with both valid and invalid data
- Look for error states, broken links, and UI inconsistencies
- Avoid repeating the same action on the same element
- If you see a login form and have not authenticated, handle login first

Respond ONLY with valid JSON:
{
    "action": "click" | "fill" | "select" | "navigate" | "screenshot" | "done",
    "element_index": <int or null>,
    "value": "<string value for fill/select actions or null>",
    "reasoning": "<why you chose this action>"
}'''


def decide_next_action(page_summary: dict, memory_context: str) -> dict:
    # Construct the prompt with current page state and memory
    prompt = f'''Current page state:
URL: {page_summary['url']}
Title: {page_summary['title']}
Headings: {page_summary['headings']}
Forms on page: {page_summary['form_count']}
Interactive elements ({page_summary['element_count']} total):

'''
    # Add each element as a numbered option the LLM can choose
    for el in page_summary['interactive_elements'][:50]:
        prompt += f"[{el['index']}] <{el['tag']}> "
        if el['text']:
            prompt += f"text='{el['text'][:60]}' "
        if el['type']:
            prompt += f"type='{el['type']}' "
        if el['placeholder']:
            prompt += f"placeholder='{el['placeholder']}' "
        if el['aria_label']:
            prompt += f"aria='{el['aria_label']}' "
        prompt += "\n"

    prompt += f"\nExploration memory:\n{memory_context}\n"
    prompt += "\nDecide the next action:"

    response = ollama.chat(
        model='llama3',
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt}
        ],
        format="json"
    )

    return json.loads(response['message']['content'])

The LLM sees the entire page as a list of numbered interactive elements and picks one. The reasoning field is critical — it creates an audit trail of why the agent made each decision, which is invaluable for debugging and for generated bug reports. If you have explored Playwright CLI integrations with AI tools, this pattern will feel familiar but significantly more autonomous.
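In practice, even with `format="json"`, a local model occasionally returns malformed output, so it is worth wrapping the parse defensively before handing the decision to the explorer. A hedged sketch (this wrapper is my own convention, not part of Ollama; "screenshot" is an arbitrary safe fallback action):

```python
# Defensive parsing of the LLM's JSON decision. A malformed response falls
# back to a harmless "screenshot" action instead of crashing the loop, and
# every key the explorer reads is guaranteed to be present.
import json


def parse_decision(raw: str) -> dict:
    defaults = {"action": "screenshot", "element_index": None,
                "value": None, "reasoning": ""}
    try:
        decision = json.loads(raw)
        if not isinstance(decision, dict):
            return defaults
    except json.JSONDecodeError:
        return defaults
    return {key: decision.get(key, default) for key, default in defaults.items()}


print(parse_decision('{"action": "click", "element_index": 4}')["action"])  # click
print(parse_decision("not json at all")["action"])                          # screenshot
```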

Memory Management: Avoiding Infinite Loops

Without memory, the agent will click the same button forever. The memory layer tracks visited URLs, executed actions, and discovered states to ensure the agent keeps exploring new territory.

# agent/memory.py
# Tracks exploration state to prevent the agent from revisiting the same
# pages and repeating the same actions. Uses URL + action hashing for dedup.

import hashlib
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ExplorationMemory:
    visited_urls: set = field(default_factory=set)
    executed_actions: set = field(default_factory=set)
    discovered_bugs: list = field(default_factory=list)
    action_history: list = field(default_factory=list)
    max_actions_per_page: int = 15
    max_total_actions: int = 200

    def _action_hash(self, url: str, action: str, element_index: int) -> str:
        # Creates a unique hash for each URL + action + element combination
        raw = f"{url}|{action}|{element_index}"
        return hashlib.md5(raw.encode()).hexdigest()

    def has_visited(self, url: str) -> bool:
        # Normalize URL by removing query params for comparison
        base_url = url.split('?')[0]
        return base_url in self.visited_urls

    def record_visit(self, url: str):
        base_url = url.split('?')[0]
        self.visited_urls.add(base_url)

    def has_executed(self, url: str, action: str, element_index: int) -> bool:
        action_id = self._action_hash(url, action, element_index)
        return action_id in self.executed_actions

    def record_action(self, url: str, action: str, element_index: int, reasoning: str):
        action_id = self._action_hash(url, action, element_index)
        self.executed_actions.add(action_id)
        self.action_history.append({
            "timestamp": datetime.now().isoformat(),
            "url": url,
            "action": action,
            "element_index": element_index,
            "reasoning": reasoning
        })

    def get_context_summary(self) -> str:
        # Returns a concise summary for the LLM prompt
        recent = self.action_history[-10:]
        summary = f"Visited {len(self.visited_urls)} unique pages. "
        summary += f"Executed {len(self.executed_actions)} unique actions. "
        summary += f"Found {len(self.discovered_bugs)} potential bugs.\n"
        summary += "Recent actions:\n"
        for a in recent:
            summary += f"  - {a['action']} on element {a['element_index']} at {a['url']}: {a['reasoning']}\n"
        return summary

    def is_exploration_complete(self) -> bool:
        return len(self.executed_actions) >= self.max_total_actions

The memory system uses MD5 hashing of URL + action + element combinations for O(1) deduplication lookups. The get_context_summary method feeds recent history back to the LLM so it knows what has already been tried. This is the mechanism that turns a random clicker into a systematic explorer.
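The dedup mechanism is easy to see in isolation. A standalone sketch of the same hashing scheme (the URLs here are illustrative):

```python
# Standalone sketch of the URL + action + element dedup hashing used by
# ExplorationMemory. The same triple always maps to the same digest, so a
# set membership test answers "have we done this before?" in O(1).
import hashlib


def action_hash(url: str, action: str, element_index: int) -> str:
    raw = f"{url}|{action}|{element_index}"
    return hashlib.md5(raw.encode()).hexdigest()


executed = set()
executed.add(action_hash("https://app.example/login", "click", 3))

print(action_hash("https://app.example/login", "click", 3) in executed)  # True
print(action_hash("https://app.example/login", "fill", 3) in executed)   # False
```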

Smart Login and Authentication Flow Handling

Most web applications require authentication. The agent needs to detect login pages and handle them before it can explore the rest of the application.

# agent/auth_handler.py
# Detects login pages and handles authentication automatically.
# Uses heuristic detection (form fields, page keywords) rather than hardcoded URLs.

from playwright.async_api import Page
import yaml


async def detect_login_page(page: Page) -> bool:
    # Check for common login indicators in the DOM
    indicators = await page.evaluate('''
        () => {
            const html = document.body.innerText.toLowerCase();
            const hasPasswordField = document.querySelector('input[type="password"]') !== null;
            const hasLoginKeywords = ['sign in', 'log in', 'login', 'username', 'email'].some(
                kw => html.includes(kw)
            );
            return { hasPasswordField, hasLoginKeywords };
        }
    ''')
    return indicators['hasPasswordField'] and indicators['hasLoginKeywords']


async def perform_login(page: Page, config_path: str = "config/settings.yaml"):
    # Load credentials from config file
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    creds = config.get('auth', {})
    username = creds.get('username', '')
    password = creds.get('password', '')

    # Find and fill username/email field
    username_selectors = [
        'input[name="username"]', 'input[name="email"]',
        'input[type="email"]', 'input[id="username"]',
        'input[id="email"]', 'input[name="login"]'
    ]
    for selector in username_selectors:
        el = await page.query_selector(selector)
        if el:
            await el.fill(username)
            break

    # Find and fill password field
    password_el = await page.query_selector('input[type="password"]')
    if password_el:
        await password_el.fill(password)

    # Find and click submit button
    submit_selectors = [
        'button[type="submit"]', 'input[type="submit"]',
        'button:has-text("Log in")', 'button:has-text("Sign in")'
    ]
    for selector in submit_selectors:
        el = await page.query_selector(selector)
        if el:
            await el.click()
            break

    # Wait for navigation after login
    await page.wait_for_load_state('networkidle', timeout=10000)

The auth handler uses heuristic detection rather than hardcoded URLs. It looks for password fields and login-related keywords in the DOM, making it work across different applications without configuration changes.
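The one piece of configuration it does need is the `auth` block in `config/settings.yaml`. An illustrative structure (only the `auth` keys are read by `perform_login`; the `base_url` and `llm` sections are an assumed convention for the rest of the config):

```yaml
# config/settings.yaml -- illustrative; only `auth` is read by perform_login
base_url: "https://your-app.com"
auth:
  username: "qa-agent@example.com"
  password: "load-this-from-a-secrets-manager"
llm:
  model: "llama3"
  host: "http://localhost:11434"
```

Keep real credentials out of version control; inject them via your CI secret store instead.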

Auto-Generating Bug Reports With Repro Steps and Screenshots

When the agent encounters an anomaly — a JavaScript error, an unexpected HTTP status, a visual regression, or an element that behaves unexpectedly — it automatically generates a structured bug report.

# agent/bug_reporter.py
# Generates structured bug reports with reproduction steps,
# screenshots, and environment details when anomalies are detected.

import json
import os
from datetime import datetime
from dataclasses import dataclass


@dataclass
class BugReport:
    title: str
    severity: str
    url: str
    description: str
    repro_steps: list
    screenshot_path: str
    console_errors: list
    network_errors: list
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()


class BugReporter:
    def __init__(self, output_dir: str = "reports/bugs"):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)
        self.bugs = []

    async def capture_bug(self, page, memory, title, severity, description):
        # Take screenshot as evidence
        screenshot_name = f"bug_{len(self.bugs)}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
        screenshot_path = os.path.join(self.output_dir, screenshot_name)
        await page.screenshot(path=screenshot_path, full_page=True)

        # Capture console errors from the injected listener
        console_errors = await page.evaluate('''
            () => window.__capturedErrors || []
        ''')

        # Build reproduction steps from action history
        repro_steps = []
        for i, action in enumerate(memory.action_history[-10:], 1):
            repro_steps.append(
                f"{i}. {action['action'].upper()} element [{action['element_index']}] "
                f"on {action['url']} ({action['reasoning']})"
            )

        bug = BugReport(
            title=title,
            severity=severity,
            url=page.url,
            description=description,
            repro_steps=repro_steps,
            screenshot_path=screenshot_path,
            console_errors=console_errors,
            network_errors=[]
        )

        self.bugs.append(bug)

        # Save individual bug report as JSON
        report_path = os.path.join(self.output_dir, f"bug_{len(self.bugs)}.json")
        with open(report_path, 'w') as f:
            json.dump(bug.__dict__, f, indent=2)

        return bug

    def generate_summary(self) -> str:
        # Produces a summary of all discovered bugs
        if not self.bugs:
            return "No bugs discovered during this exploration session."

        summary = f"Discovered {len(self.bugs)} potential bugs:\n\n"
        for i, bug in enumerate(self.bugs, 1):
            summary += f"Bug #{i}: [{bug.severity}] {bug.title}\n"
            summary += f"  URL: {bug.url}\n"
            summary += f"  Screenshot: {bug.screenshot_path}\n\n"
        return summary

The reproduction steps are automatically constructed from the agent’s action history. This means every bug report includes the exact sequence of steps that led to the issue — something that manual testers often struggle to document consistently.
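The formatting logic is simple enough to demonstrate standalone, using the same format string as `capture_bug` on an illustrative two-entry history:

```python
# Standalone sketch of how capture_bug turns action history into numbered
# reproduction steps. The history entries below are illustrative.
history = [
    {"action": "fill", "element_index": 2, "url": "https://app.example/signup",
     "reasoning": "testing empty email"},
    {"action": "click", "element_index": 5, "url": "https://app.example/signup",
     "reasoning": "submitting the form"},
]

repro_steps = [
    f"{i}. {a['action'].upper()} element [{a['element_index']}] "
    f"on {a['url']} ({a['reasoning']})"
    for i, a in enumerate(history[-10:], 1)
]
print(repro_steps[0])
# 1. FILL element [2] on https://app.example/signup (testing empty email)
```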

Basic Visual Analysis: Comparing Screenshots for UI Anomalies

The agent captures screenshots at key moments and performs basic pixel-level comparison to detect visual regressions and UI anomalies.

# agent/visual_analyzer.py
# Compares screenshots to detect visual regressions.
# Uses pixel-level diffing with configurable thresholds.

from PIL import Image
import os


class VisualAnalyzer:
    def __init__(self, threshold: float = 0.05, screenshot_dir: str = "screenshots"):
        # threshold is the percentage of differing pixels that triggers an alert
        self.threshold = threshold
        self.screenshot_dir = screenshot_dir
        self.baseline_screenshots = {}
        os.makedirs(screenshot_dir, exist_ok=True)

    def compare_screenshots(self, baseline_path: str, current_path: str) -> dict:
        # Compare two screenshots and return the diff percentage
        baseline = Image.open(baseline_path).convert('RGB')
        current = Image.open(current_path).convert('RGB')

        # Resize to same dimensions if needed
        if baseline.size != current.size:
            current = current.resize(baseline.size)

        # Pixel-by-pixel comparison
        baseline_pixels = list(baseline.getdata())
        current_pixels = list(current.getdata())
        total_pixels = len(baseline_pixels)
        diff_count = 0

        for bp, cp in zip(baseline_pixels, current_pixels):
            # Check if RGB difference exceeds tolerance per channel
            if any(abs(b - c) > 30 for b, c in zip(bp, cp)):
                diff_count += 1

        diff_percentage = diff_count / total_pixels
        is_anomaly = diff_percentage > self.threshold

        return {
            "diff_percentage": round(diff_percentage * 100, 2),
            "is_anomaly": is_anomaly,
            "total_pixels": total_pixels,
            "changed_pixels": diff_count,
            "baseline": baseline_path,
            "current": current_path
        }

    async def capture_and_compare(self, page, page_id: str) -> dict:
        # Captures current screenshot and compares with baseline if available
        current_path = os.path.join(self.screenshot_dir, f"{page_id}_current.png")
        await page.screenshot(path=current_path, full_page=True)

        baseline_path = os.path.join(self.screenshot_dir, f"{page_id}_baseline.png")
        if os.path.exists(baseline_path):
            result = self.compare_screenshots(baseline_path, current_path)
            return result

        # No baseline exists yet so save current as baseline
        os.rename(current_path, baseline_path)
        return {"is_anomaly": False, "message": "Baseline captured for future comparison"}
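The diffing logic is independent of Playwright and easy to sanity-check on synthetic data. A standalone sketch of the same per-channel tolerance check, using plain RGB tuples instead of PIL images:

```python
# Standalone sketch of VisualAnalyzer's per-channel diff check, run on
# plain lists of RGB tuples so it needs no image files on disk.
def diff_ratio(pixels_a, pixels_b, tolerance=30):
    changed = sum(
        1 for pa, pb in zip(pixels_a, pixels_b)
        if any(abs(a - b) > tolerance for a, b in zip(pa, pb))
    )
    return changed / len(pixels_a)


baseline = [(255, 255, 255)] * 10_000                       # all-white 100x100 page
current = [(255, 0, 0)] * 100 + [(255, 255, 255)] * 9_900   # 100 pixels turned red

ratio = diff_ratio(baseline, current)
print(ratio)         # 0.01
print(ratio > 0.05)  # False -> below the default 5% threshold, no anomaly flagged
```

Note that pixel-by-pixel comparison in pure Python is slow on full-page screenshots; it is fine for a handful of pages per run, but a production setup would reach for a vectorized diff.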

The Core Exploration Loop: Putting It All Together

Now we connect every component into the main exploration loop. This is the heart of the autonomous agent.

# agent/explorer.py
# The main exploration loop that orchestrates DOM reading, LLM decisions,
# action execution, memory updates, and bug detection.

import allure
from playwright.async_api import async_playwright
from agent.dom_reader import get_page_summary
from agent.llm_brain import decide_next_action
from agent.memory import ExplorationMemory
from agent.auth_handler import detect_login_page, perform_login
from agent.bug_reporter import BugReporter
from agent.visual_analyzer import VisualAnalyzer


class AutonomousExplorer:
    def __init__(self, start_url: str, max_steps: int = 200):
        self.start_url = start_url
        self.max_steps = max_steps
        self.memory = ExplorationMemory(max_total_actions=max_steps)
        self.bug_reporter = BugReporter()
        self.visual_analyzer = VisualAnalyzer()

    async def run(self):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            context = await browser.new_context(
                viewport={"width": 1280, "height": 720},
                record_video_dir="reports/videos"
            )
            page = await context.new_page()

            # Inject console error capture before any page script runs.
            # add_init_script re-runs on every navigation, so errors are
            # captured on each page the agent visits (page.evaluate here
            # would only affect the current blank page and be lost on goto)
            await page.add_init_script('''
                window.__capturedErrors = [];
                window.addEventListener('error', (e) => {
                    window.__capturedErrors.push(e.message);
                });
            ''')

            # Navigate to starting URL
            await page.goto(self.start_url, wait_until="networkidle")

            # Handle login if needed
            if await detect_login_page(page):
                await perform_login(page)

            step = 0
            while step < self.max_steps and not self.memory.is_exploration_complete():
                step += 1

                # 1. Read the current page
                page_summary = await get_page_summary(page)
                self.memory.record_visit(page.url)

                # 2. Visual analysis
                page_id = page.url.replace('/', '_').replace(':', '')[:80]
                visual_result = await self.visual_analyzer.capture_and_compare(page, page_id)
                if visual_result.get("is_anomaly"):
                    await self.bug_reporter.capture_bug(
                        page, self.memory,
                        title=f"Visual regression on {page.url}",
                        severity="medium",
                        description=f"Screenshot diff: {visual_result['diff_percentage']}% changed"
                    )

                # 3. Ask the LLM what to do next
                memory_context = self.memory.get_context_summary()
                decision = decide_next_action(page_summary, memory_context)

                # 4. Execute the decided action
                action = decision.get("action", "done")
                element_idx = decision.get("element_index")
                value = decision.get("value")
                reasoning = decision.get("reasoning", "")

                with allure.step(f"Step {step}: {action} element [{element_idx}] - {reasoning}"):
                    try:
                        await self._execute_action(page, page_summary, action, element_idx, value)
                        self.memory.record_action(page.url, action, element_idx or 0, reasoning)
                    except Exception as e:
                        await self.bug_reporter.capture_bug(
                            page, self.memory,
                            title=f"Action failed: {action} on element {element_idx}",
                            severity="high",
                            description=str(e)
                        )

                if action == "done":
                    break

            await browser.close()
            return self.bug_reporter.generate_summary()

    async def _execute_action(self, page, page_summary, action, element_idx, value):
        # Maps LLM decisions to actual Playwright commands
        elements = page_summary['interactive_elements']

        def find_element(idx):
            # Look up by the recorded DOM index, not list position: invisible
            # elements were filtered out during extraction, so list positions
            # and DOM indexes can diverge
            return next(el for el in elements if el['index'] == idx)

        if action == "click" and element_idx is not None:
            target = find_element(element_idx)
            selector = self._build_selector(target)
            await page.click(selector, timeout=5000)
            await page.wait_for_load_state("networkidle", timeout=8000)

        elif action == "fill" and element_idx is not None and value:
            target = find_element(element_idx)
            selector = self._build_selector(target)
            await page.fill(selector, value)

        elif action == "navigate" and value:
            await page.goto(value, wait_until="networkidle")

        elif action == "screenshot":
            await page.screenshot(path=f"screenshots/manual_step_{element_idx}.png")

    def _build_selector(self, element: dict) -> str:
        # Builds the most reliable selector for a given element
        if element.get('id'):
            return f"#{element['id']}"
        if element.get('name'):
            return f"[name='{element['name']}']"
        if element.get('aria_label'):
            return f"[aria-label='{element['aria_label']}']"
        if element.get('text') and element['tag'] in ('button', 'a'):
            return f"{element['tag']}:has-text('{element['text'][:40]}')"
        # Last resort: positional fallback (fragile, since `index` counts all
        # interactive elements rather than elements of this tag)
        return f"{element['tag']}:nth-of-type({element['index'] + 1})"

Generating Test Cases in Excel and Allure Reports

After exploration, the agent converts its action history into structured test cases — both as Excel files for stakeholders and as Allure reports for the engineering team.

# agent/excel_generator.py
# Converts the agent exploration history into structured Excel test cases.

import os
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill


def generate_test_cases(memory, output_path: str = "reports/generated-tests/test_cases.xlsx"):
    # Ensure the output directory exists before saving the workbook
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    wb = Workbook()
    ws = wb.active
    ws.title = "Generated Test Cases"

    # Header row styling
    header_font = Font(bold=True, color="FFFFFF")
    header_fill = PatternFill(start_color="2F5496", end_color="2F5496", fill_type="solid")

    headers = ["TC ID", "Title", "Page URL", "Action", "Element", "Input Value",
               "Expected Result", "Reasoning", "Timestamp"]

    for col, header in enumerate(headers, 1):
        cell = ws.cell(row=1, column=col, value=header)
        cell.font = header_font
        cell.fill = header_fill

    # Populate rows from the agent action history
    for i, action in enumerate(memory.action_history, 1):
        ws.cell(row=i+1, column=1, value=f"TC-AUTO-{i:04d}")
        ws.cell(row=i+1, column=2, value=f"Verify {action['action']} on element {action['element_index']}")
        ws.cell(row=i+1, column=3, value=action['url'])
        ws.cell(row=i+1, column=4, value=action['action'])
        ws.cell(row=i+1, column=5, value=str(action['element_index']))
        ws.cell(row=i+1, column=6, value=action.get('value', 'N/A'))
        ws.cell(row=i+1, column=7, value="Action completes without error")
        ws.cell(row=i+1, column=8, value=action['reasoning'])
        ws.cell(row=i+1, column=9, value=action['timestamp'])

    # Auto-adjust column widths for readability
    for col in ws.columns:
        max_length = max(len(str(cell.value or "")) for cell in col)
        ws.column_dimensions[col[0].column_letter].width = min(max_length + 2, 50)

    wb.save(output_path)
    return output_path

For Allure reporting, the allure.step context manager in the exploration loop already captures each action as a test step. Run the agent with pytest --alluredir=reports/allure-results and then allure serve reports/allure-results to see a full visual timeline of the exploration session.
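The run_agent.py entry point appears in the project layout and the Jenkinsfile but not in the code so far. A minimal sketch, assuming only the --url and --max-steps flags that the Jenkins invocation uses (everything else is my own convention):

```python
# run_agent.py -- a minimal sketch of the entry point. Flag names match the
# Jenkinsfile invocation (--url, --max-steps); the rest is an assumed
# convention, not code from the article's modules.
import argparse
import asyncio


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Run the autonomous testing agent")
    parser.add_argument("--url", required=True, help="Starting URL for exploration")
    parser.add_argument("--max-steps", type=int, default=200,
                        help="Maximum number of actions before the agent stops")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Imported lazily so the module can be imported without Playwright installed
    from agent.explorer import AutonomousExplorer
    explorer = AutonomousExplorer(start_url=args.url, max_steps=args.max_steps)
    print(asyncio.run(explorer.run()))


# Wire it up with:  if __name__ == "__main__": main()
```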

Jenkins Pipeline Integration

To run the agent on a schedule in CI, here is a Jenkinsfile that executes the exploration and publishes the Allure report:

pipeline {
    agent any
    triggers {
        cron('H 2 * * *')  // Run nightly at 2 AM
    }
    environment {
        OLLAMA_HOST = 'http://localhost:11434'
    }
    stages {
        stage('Setup') {
            steps {
                sh 'pip install -r requirements.txt'
                sh 'playwright install chromium'
                sh 'ollama pull llama3'
            }
        }
        stage('Explore') {
            steps {
                sh 'python run_agent.py --url https://your-app.com --max-steps 150'
            }
        }
        stage('Reports') {
            steps {
                allure includeProperties: false, results: [[path: 'reports/allure-results']]
                archiveArtifacts artifacts: 'reports/**/*', fingerprint: true
            }
        }
    }
    post {
        always {
            publishHTML([reportDir: 'reports/bugs', reportFiles: '*.json', reportName: 'Bug Reports'])
        }
    }
}

Limitations: When Autonomous Agents Are NOT the Right Choice

I want to be honest about what this approach cannot do, because the hype cycle around AI testing is creating unrealistic expectations.

  • Compliance and regulatory testing: If you need to prove specific test cases were executed in a specific order with specific data, an autonomous agent is the wrong tool. Regulatory audits require deterministic, reproducible test runs.
  • Performance testing: Autonomous exploration is inherently single-threaded and non-deterministic. Use JMeter, k6, or Locust for load and performance testing.
  • Pixel-perfect UI validation: The visual analysis in this agent is basic. For production-grade visual regression, use dedicated tools like Applitools or Percy.
  • Business logic validation: The agent does not understand your business rules. It can find crashes, errors, and UI anomalies, but it cannot verify that a 15% discount applies to orders over $200 on Tuesdays unless you explicitly encode that expectation.
  • LLM hallucination risk: Sometimes the LLM will decide on an action that does not make sense. The memory system mitigates this, but it does not eliminate it. Always review generated bug reports manually before filing them.
  • Cost and infrastructure: Running Ollama locally requires a GPU with at least 8GB VRAM for acceptable inference speed. Cloud LLMs add API costs. The infrastructure overhead is real.
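
For the business-logic limitation, "explicitly encoding that expectation" can be as simple as a deterministic check that runs against what the agent observed. A sketch of the Tuesday-discount rule from the bullet above (the function names, action shapes, and tolerance are illustrative, not part of the agent):

```python
from datetime import date

def expected_discount(order_total: float, order_date: date) -> float:
    """The business rule the agent cannot infer on its own:
    15% off orders over $200, but only on Tuesdays."""
    if order_total > 200 and order_date.weekday() == 1:  # Monday=0, Tuesday=1
        return round(order_total * 0.15, 2)
    return 0.0

def check_discount(order_total, order_date, displayed_discount, tolerance=0.01):
    """Compare the discount the UI actually displayed against the encoded rule."""
    return abs(expected_discount(order_total, order_date) - displayed_discount) <= tolerance
```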

Autonomous agents work best as a complement to your existing test suite, not a replacement. Use them for exploratory coverage of areas where you do not have scripted tests, for discovering unknown unknowns, and for smoke testing after large refactors. Keep your deterministic regression suite for critical business flows. For a broader perspective on building automation frameworks intelligently, see the vibe coding automation framework approach.


High-Level Architecture

Here is how all the components connect in the system:

+-------------------+       +------------------+       +------------------+
|   Playwright      |       |   Ollama LLM     |       |   Memory Layer   |
|   (Browser)       |<----->|   (Decision)     |<----->|   (State Track)  |
+-------------------+       +------------------+       +------------------+
        |                           |                           |
        v                           v                           v
+-------------------+       +------------------+       +------------------+
|   DOM Reader      |       |   Bug Reporter   |       |   Visual         |
|   (Page Context)  |       |   (Evidence)     |       |   Analyzer       |
+-------------------+       +------------------+       +------------------+
        |                           |                           |
        +---------------------------+---------------------------+
                                    |
                                    v
                    +-------------------------------+
                    |   Reports: Allure + Excel     |
                    |   Jenkins CI/CD Pipeline      |
                    +-------------------------------+

The flow is cyclical. Playwright reads the page, the DOM Reader simplifies it, the LLM decides the next action, Playwright executes it, memory records it, and the cycle repeats. Bug Reporter and Visual Analyzer run as side effects during each cycle, capturing evidence when anomalies appear.
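
That cycle can be expressed as a short control loop. A sketch, in which every collaborator (`dom_reader`, `llm`, `memory`, `bug_reporter`) is a stand-in object with the named methods rather than a real library API:

```python
def exploration_loop(browser, dom_reader, llm, memory, bug_reporter, max_steps=100):
    """One run of the cyclical flow: observe, decide, act, remember, repeat."""
    for step in range(max_steps):
        page_state = dom_reader.read(browser)              # Playwright reads, DOM Reader simplifies
        action = llm.decide(page_state, memory.recent())   # LLM picks the next action
        if memory.already_tried(page_state, action):       # memory prevents going in circles
            continue
        result = browser.execute(action)                   # Playwright executes the action
        memory.record(page_state, action, result)          # memory records the cycle
        if result.get("error"):                            # side effect: capture evidence
            bug_reporter.capture(page_state, action, result)
    return memory
```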


Frequently Asked Questions

1. Can this autonomous agent completely replace manual exploratory testing?

No. The agent excels at systematic coverage — clicking through every link, testing every form field, exploring navigation paths. But it lacks the intuitive understanding that experienced testers bring. A human tester notices when something “feels wrong” even if no error is thrown. The agent cannot replicate that judgment. Use the agent to cover breadth, and human testers for depth on critical flows.

2. How does the agent handle dynamic content like modals, tooltips, and infinite scroll?

The DOM Reader re-scans the page after every action, so dynamically appearing elements like modals and tooltips are captured in the next iteration. For infinite scroll, you can add a scroll action to the LLM’s action vocabulary; the agent can then choose to scroll when it sees fewer interactive elements than expected. However, highly dynamic SPAs with frequent DOM mutations can confuse the element indexing. Adding a small delay after actions and using more robust selectors (id, name, aria-label) mitigates this.
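
Extending the action vocabulary comes down to the dispatcher that maps LLM decisions onto Playwright calls. A sketch, assuming a simple action-dict shape (`type`/`selector`/`value` are illustrative); `page.mouse.wheel` and `page.wait_for_timeout` are real Playwright sync-API methods:

```python
def execute_action(page, action, settle_ms=500):
    """Dispatch one LLM-chosen action onto a Playwright page object."""
    kind = action.get("type")
    if kind == "click":
        page.click(action["selector"])
    elif kind == "fill":
        page.fill(action["selector"], action.get("value", ""))
    elif kind == "scroll":
        # Added to the vocabulary so the agent can trigger infinite scroll.
        page.mouse.wheel(0, action.get("pixels", 800))
    else:
        raise ValueError(f"Unknown action type: {kind!r}")
    # Give dynamic SPAs a moment to mutate the DOM before the next scan.
    page.wait_for_timeout(settle_ms)
```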

3. What is the cost of running this locally with Ollama vs. using a cloud LLM API?

With Ollama running Llama 3 locally, the infrastructure cost is the hardware itself — a machine with an NVIDIA GPU (8GB+ VRAM) or an Apple Silicon Mac with 16GB+ RAM. Inference is free after that. With a cloud API like OpenAI GPT-4o, expect to spend roughly $0.02-0.05 per exploration step (prompt + response tokens). A 200-step exploration session would cost $4-10 via API. For nightly CI runs, the local approach pays for itself within weeks.
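
The arithmetic behind that range is simple enough to keep as a helper when you budget CI runs (the per-step figures are the rough estimates from above, not published pricing):

```python
def api_cost_estimate(steps, cost_per_step=(0.02, 0.05)):
    """Back-of-envelope (low, high) dollar cost for a cloud-LLM exploration run,
    using the rough $0.02-0.05 per-step estimate for prompt + response tokens."""
    low, high = cost_per_step
    return (round(steps * low, 2), round(steps * high, 2))
```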

4. How do I prevent the agent from performing destructive actions like deleting data?

Add explicit constraints to the LLM system prompt: list forbidden actions such as “delete,” “remove,” “drop,” and “reset.” You can also implement a blocklist in the execution layer that rejects any action targeting elements with dangerous labels. Additionally, always run the agent against a staging environment or a dedicated test instance — never against production. The memory layer can also be configured to skip elements matching certain CSS classes or data attributes.
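
The execution-layer blocklist is the part worth showing concretely, because it does not depend on the LLM behaving. A sketch: the word list is illustrative and the action-dict keys (`label`, `selector`) are an assumption about your agent's action shape:

```python
import re

# Words that suggest a destructive action -- extend this list per application.
FORBIDDEN = re.compile(r"\b(delete|remove|drop|reset|destroy)\b", re.IGNORECASE)

def is_action_allowed(action):
    """Reject any action whose target label or selector looks destructive,
    regardless of what the LLM decided."""
    label = action.get("label", "")
    selector = action.get("selector", "")
    return not (FORBIDDEN.search(label) or FORBIDDEN.search(selector))
```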

5. Can this approach work with mobile web applications or native mobile apps?

For mobile web applications, yes — Playwright supports mobile emulation out of the box. Set the browser context with a mobile viewport and user agent, and the agent will explore the mobile version of your site. For native mobile apps, you would need to replace Playwright with Appium and modify the DOM Reader to use Appium’s element inspection instead of JavaScript evaluation. The LLM decision layer and memory system remain unchanged, which demonstrates the value of the modular architecture.
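
Playwright's built-in device descriptors make the mobile-web case a one-line context change. A sketch: the helper that merges a descriptor with run-specific options is illustrative, while `p.devices` and `browser.new_context(**options)` are the real Playwright sync API:

```python
def mobile_context_options(device_descriptor, locale="en-US"):
    """Combine a Playwright device descriptor (viewport, user agent, touch
    support) with run-specific options for browser.new_context(), without
    mutating the shared descriptor."""
    options = dict(device_descriptor)
    options["locale"] = locale
    return options

# Usage with the real Playwright sync API (requires `playwright install chromium`):
#
# from playwright.sync_api import sync_playwright
# with sync_playwright() as p:
#     opts = mobile_context_options(p.devices["iPhone 13"])
#     browser = p.chromium.launch(headless=True)
#     page = browser.new_context(**opts).new_page()
#     page.goto("https://your-app.com")
```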


Conclusion: The Agent Is Not the Future — It Is the Present

The autonomous testing agent described in this article is not a research prototype. Every component is built with production-ready libraries, runs in standard CI pipelines, and generates actionable output. Anant Jain’s team has been running a similar architecture for over a quarter, and the results speak for themselves: broader coverage, faster bug detection, and a fundamentally different relationship between QA teams and the applications they test.

The critical insight is that the LLM is not doing the testing. Playwright is doing the testing. The LLM is making decisions about what to test next. That separation of concerns is what makes the architecture robust. If the LLM makes a bad decision, the worst that happens is a wasted step. The browser automation, evidence collection, and reporting all work regardless of decision quality.

Start small. Point the agent at a single page of your application. Watch what it explores. Review the bug reports. Tune the system prompt. Gradually expand the scope. Within a few iterations, you will have a testing capability that no amount of scripted automation could replicate.

The era of self-directed test exploration has arrived. The question is not whether autonomous agents will become part of your testing strategy. The question is how quickly you will adopt them.
