
What Happens When You Let AI Test Your App for a Week

Research · Testing

AI mobile testing is either the future of QA or an expensive way to generate false bug reports, depending on who you ask. I decided to find out for myself. I pointed Drengr's OODA-loop agent at three different apps — a calculator, a weather app, and a social media client — and let it run autonomously for a week. Here's what happened, including the parts that didn't work.

This wasn't a controlled experiment in any scientific sense. The sample size is tiny, the apps are specific, and the results may not generalize. I'm sharing this as a data point, not a proof. Autonomous mobile testing is genuinely new territory and I think honest reporting matters more than impressive claims.

The Setup

Each app got the same treatment:

  • 10 autonomous exploration runs per day, each with a different high-level goal prompt
  • Each run capped at 50 actions to limit token costs
  • Goals ranged from specific ("calculate 15% tip on $47.50") to open-ended ("explore the app and report anything that seems broken")
  • Claude Sonnet 4 as the decision-making model, chosen for the balance of capability and cost
  • Android emulator, Pixel 7 image, API 34

Total over the week: 210 runs across the three apps, approximately 2.1 million tokens consumed, about $14 in API costs.
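Stripped of the actual model calls, each run has a simple shape: loop until the goal is reached or the action budget is spent. Here's a minimal sketch of that loop. None of these names are Drengr's real API; the observe/decide/act steps are elided to comments.

```java
import java.util.function.Supplier;

public class RunLoop {
    static final int MAX_ACTIONS = 50;  // per-run cap to bound token costs

    /** Runs one exploration episode; returns how many actions were taken. */
    static int explore(String goal, Supplier<Boolean> goalReached) {
        int actions = 0;
        while (actions < MAX_ACTIONS && !goalReached.get()) {
            // observe: capture a screenshot and the UI hierarchy (omitted)
            // orient/decide: ask the model for the next action given the goal (omitted)
            // act: tap, type, or scroll on the emulator (omitted)
            actions++;
        }
        return actions;
    }

    public static void main(String[] args) {
        // An open-ended goal that never completes exhausts the 50-action budget.
        System.out.println(explore("explore the app", () -> false));  // 50
    }
}
```

The cap is what makes the cost math predictable: a run can never burn more than 50 decision cycles' worth of tokens.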

The Prompts

I learned quickly that prompt design matters enormously. "Test the calculator" produced aimless tapping. "Verify that the calculator handles edge cases in arithmetic operations, including negative numbers, decimal precision, division by zero, and very large numbers" produced useful, targeted exploration.

The sweet spot was specific enough to guide the agent but open enough to let it discover things I hadn't anticipated.

What It Found

App 1: Calculator — The Negative Number Bug

The calculator app was a personal project, something I'd built and considered "done" for months. The agent found a bug on the second day that I'd never noticed: entering a negative number, then pressing the percent button, then pressing equals produced NaN instead of a numeric result.

I'd never tested that sequence manually. Why would I? Negative percent of a number isn't a common operation. But the agent, exploring combinations I wouldn't think to try, stumbled into it. The underlying issue was a missing absolute value check in the percentage calculation path.
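I can't show the app's real code here, but the failure shape is a common one on the JVM: somewhere in the computation, a math function like log10 or sqrt silently returns NaN for negative input, and the fix is to operate on the absolute value and reapply the sign. The snippet below is a purely hypothetical reconstruction (a straight division by 100 wouldn't produce NaN), meant only to show that shape.

```java
public class PercentBug {
    // Hypothetical buggy path: the percent result is routed through
    // log10/pow. Math.log10 of a negative number is NaN, and NaN
    // silently poisons everything downstream.
    static double percentBuggy(double x) {
        return Math.pow(10, Math.log10(x) - 2);
    }

    // Fixed: take the magnitude, compute, then reapply the sign.
    static double percentFixed(double x) {
        if (x == 0) return 0;
        return Math.signum(x) * Math.pow(10, Math.log10(Math.abs(x)) - 2);
    }

    public static void main(String[] args) {
        System.out.println(percentBuggy(-47.5));  // NaN
        System.out.println(percentFixed(-47.5));  // ≈ -0.475
    }
}
```

The insidious part is that nothing throws: NaN just propagates until it reaches the display, which is exactly why a manual tester who never tries a negative input would never see it.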

That alone made the experiment worthwhile for me. It's a trivial bug, but it had shipped. A real user could have hit it.

App 2: Weather App — The Broken Deep Link

The weather app supported deep links for sharing forecast URLs. The agent, when given the goal "navigate to the settings page using every available path," discovered that the deep link weather://settings/notifications crashed the app. The crash was caught by Drengr's logcat monitoring before the agent even had to report it — the situation engine flagged a fatal exception.

The root cause was a missing null check on a fragment argument. The notifications settings fragment assumed a bundle argument would always be present because, in normal navigation, the parent activity always supplied it. The deep link path constructed the fragment without that argument, and nothing ever checked for its absence.
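Android's Fragment and Bundle classes don't run outside a device, so here's the shape of that bug with a plain Map standing in for the argument bundle. The key name and fallback are hypothetical.

```java
import java.util.Map;

public class DeepLinkHandler {
    // Buggy: assumes the parent activity always supplied "channelId".
    // A deep link reaches this code with no such argument -> NullPointerException.
    static String channelNameBuggy(Map<String, String> fragmentArgs) {
        return fragmentArgs.get("channelId").trim();
    }

    // Fixed: treat the argument as optional and fall back to a default,
    // since more than one entry point can construct this screen.
    static String channelNameFixed(Map<String, String> fragmentArgs) {
        String id = fragmentArgs.get("channelId");
        return id == null ? "default" : id.trim();
    }

    public static void main(String[] args) {
        System.out.println(channelNameFixed(Map.of()));  // default
    }
}
```

The general lesson transfers beyond this app: any fragment reachable by deep link has at least two constructors in practice, and arguments guaranteed by one path are merely optional on the other.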

App 3: Social Media Client — The Accessibility Issue

This was the most interesting finding. The social media client had several icon buttons — like, share, bookmark — that had no content descriptions. The agent reported them as "unlabeled interactive elements" because the UI hierarchy showed clickable views with no text and no accessibility labels.
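The check the agent effectively performed is easy to state: flag any clickable node that has neither text nor a content description. A sketch of that heuristic, with a simple record standing in for a UI-hierarchy node (real agents read this from a uiautomator dump or the accessibility tree; the field names here are mine):

```java
import java.util.ArrayList;
import java.util.List;

public class A11yCheck {
    // Minimal stand-in for one node of the UI hierarchy.
    record Node(String id, boolean clickable, String text, String contentDescription) {}

    // A clickable node with neither text nor a contentDescription is
    // unidentifiable to the agent -- and to a screen reader.
    static List<String> unlabeled(List<Node> nodes) {
        List<String> flagged = new ArrayList<>();
        for (Node n : nodes) {
            boolean hasText = n.text() != null && !n.text().isBlank();
            boolean hasDesc = n.contentDescription() != null && !n.contentDescription().isBlank();
            if (n.clickable() && !hasText && !hasDesc) flagged.add(n.id());
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Node> screen = List.of(
            new Node("btn_like", true, null, null),           // flagged
            new Node("btn_share", true, null, "Share post"),  // labeled
            new Node("tv_body", false, "Hello", null));       // not clickable
        System.out.println(unlabeled(screen));  // [btn_like]
    }
}
```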

The agent wasn't doing accessibility testing on purpose. It was trying to describe what it saw, and it couldn't identify those buttons. The same problem that confused the AI would confuse a screen reader. Inaccessible UI is ambiguous UI, and ambiguity hurts both automated agents and human users who rely on assistive technology.

What It Missed

Equally important is what the agent did not catch.

A timing-sensitive race condition. The weather app had a bug where rapidly switching between cities while forecasts were loading could display the wrong city's data. This required specific timing — switching during the 200-400ms window between the API response arriving and the UI updating. The agent's action cycle was too slow (3-5 seconds between actions) to ever trigger this window.
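For what it's worth, the bug itself is the classic stale-response race, and the standard guard is a request generation counter: only the response matching the most recent request may touch the UI. A sketch of that fix (class and field names are mine, not the weather app's):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ForecastLoader {
    private final AtomicInteger latestRequest = new AtomicInteger();
    String displayedForecast = "";

    // User selected a city: bump the generation and (in real code) start the fetch.
    int load(String city) {
        return latestRequest.incrementAndGet();
    }

    // A network response arrived, possibly out of order.
    void onResponse(int requestId, String forecast) {
        if (requestId == latestRequest.get()) {  // drop stale responses
            displayedForecast = forecast;
        }
    }

    public static void main(String[] args) {
        ForecastLoader loader = new ForecastLoader();
        int oslo = loader.load("Oslo");      // request 1
        int bergen = loader.load("Bergen");  // request 2 supersedes it
        loader.onResponse(bergen, "Bergen: 8°C"); // current: displayed
        loader.onResponse(oslo, "Oslo: 3°C");     // stale: ignored
        System.out.println(loader.displayedForecast);  // Bergen: 8°C
    }
}
```

Catching the absence of such a guard dynamically requires hammering the switch inside that 200-400ms window, which is exactly what a 3-5-second action cycle can't do.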

Visual alignment issues. The social media client had a layout bug where long usernames caused text to overlap the timestamp at certain screen widths. The UI hierarchy reported sensible element bounds; the overlap happened at render time, so the elements were "correctly positioned" as far as the layout engine was concerned while visually colliding on screen. The agent, which relies on the UI tree more than on pixel-level analysis, didn't notice.

Subtle UX problems. The calculator's history feature was confusing — it showed results in reverse chronological order with no clear timestamps, and old entries looked identical to new ones. A human tester would flag this as a usability issue. The agent, which has no concept of "confusing," saw a functioning list and moved on.

False Positives

The agent reported 23 "issues" across the week. After manual review, 14 were genuine findings and 9 were false positives. That's a 39% false positive rate — high enough to require human review of every report.

The most common false positive: interpreting slow loads as crashes. The agent would tap a button, wait for the screen to change, and if nothing happened within its patience window (about 8 seconds), report a failure. Several of these were just slow network responses on the emulator.
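A cheap heuristic that would cut this class of false positive: before declaring a timeout a failure, check whether a loading indicator is still visible and extend the wait if so. This isn't something Drengr did during this experiment; it's a sketch of the triage logic I'd want, with hypothetical inputs.

```java
public class PatienceWindow {
    // Classify a screen that hasn't changed after a tap.
    //   changed: the UI hierarchy differs from the pre-tap snapshot
    //   loadingVisible: a spinner or progress bar is on screen
    static String classify(boolean changed, boolean loadingVisible,
                           long waitedMs, long patienceMs) {
        if (changed) return "ok";
        if (loadingVisible) return "still-loading";  // slow network, not a crash: extend the wait
        return waitedMs >= patienceMs ? "possible-failure" : "keep-waiting";
    }

    public static void main(String[] args) {
        // Past the ~8s patience window but a spinner is showing: don't report yet.
        System.out.println(classify(false, true, 9000, 8000));   // still-loading
        // Past the window with no sign of progress: worth a report.
        System.out.println(classify(false, false, 9000, 8000));  // possible-failure
    }
}
```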

The second most common: misinterpreting intentional UI states as errors. A dismissed bottom sheet was reported as "content disappeared unexpectedly." An empty search results page was reported as "app failed to load content." These are correct observations — the content did disappear, the page is empty — but the agent's interpretation was wrong.

Cost Analysis

Across 210 runs:

  • Total tokens: ~2.1 million (input + output)
  • Total API cost: ~$14
  • Average per run: ~10,000 tokens, ~$0.07
  • Average run time: 3-4 minutes
  • Human review time: ~5 hours total to evaluate all reports

For comparison, manual QA testing of those three apps at a similar depth would have taken me roughly 15-20 hours. The AI testing took about 5 hours of my time (setup, prompt design, and report review) plus $14 in API costs.

That's a meaningful efficiency gain, but it's not zero-effort. The human is still in the loop, reviewing reports and separating signal from noise.

My Honest Take

AI QA testing is not a replacement for human QA. It finds different kinds of bugs through different kinds of exploration. A human tester applies domain knowledge, aesthetic judgment, and intuition about what "feels wrong." An AI agent applies exhaustive combinatorial exploration, patience for repetitive tasks, and zero assumptions about how the app "should" work.

The most valuable bugs the agent found were the ones I'd never have thought to test for. The most valuable bugs it missed were the ones that required human judgment to even recognize as bugs.

The two approaches are complementary. The agent explores the spaces I wouldn't think to explore. I evaluate the findings with context the agent doesn't have. Together, that coverage is better than either alone.

I plan to keep running these experiments with more apps and more sophisticated prompting strategies. The 39% false positive rate is the number I most want to bring down — that's where the agent goes from "interesting research tool" to "practical QA assistant."