Giving Claude a Phone: How I Built an MCP Server for Mobile Devices


I was watching Claude write perfect Kotlin code — an elegant RecyclerView adapter with DiffUtil callbacks, proper coroutine scoping, the works — but it couldn't tap a button on the emulator running right next to it. The code was flawless. The app was sitting there, compiled and launched. And the AI that wrote it had absolutely no way to interact with it.

That disconnect felt like a problem worth solving, and MCP was the missing piece. If Claude could connect to Android and iOS devices through a standard protocol, it could finally close the loop between writing code and verifying that it works. This is the story of how I built Drengr, an MCP server that gives AI agents eyes and hands on real mobile devices.

The Frustration That Started It

If you've used Claude or any capable LLM for mobile development, you've hit this wall. The AI helps you write code, debug layouts, even architect entire features. But the moment you need to verify something on an actual device, you're on your own. Copy the code, build, deploy, tap around, find the bug, go back to the AI, describe what you saw in words.

It's 2026, and the feedback loop between AI and mobile devices is still mediated entirely by human hands and human descriptions. That felt wrong to me. Not because automation is always better, but because the information loss in that loop is enormous. I can describe a broken layout to Claude, but Claude seeing the broken layout is fundamentally different.

The Insight: MCP as the Bridge

Anthropic's Model Context Protocol gave me the architecture I needed. MCP defines a standard way for AI models to discover and invoke tools — a JSON-RPC protocol over stdio or HTTP. Instead of building a bespoke integration, I could build an MCP server that exposes mobile device capabilities as tools that any MCP-compatible client can call.
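Concretely, every tool invocation travels as a JSON-RPC 2.0 request using the protocol's tools/call method. A minimal sketch in Python (the drengr_do payload here is just an illustration):

```python
import json

def make_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Frame an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Over the stdio transport this is one line written to the server's stdin;
# the server answers with a response carrying the same "id".
msg = make_tool_call(1, "drengr_do", {"action": "tap", "element": 14})
```

That framing is the whole integration surface: any client that can speak this gets every tool the server advertises.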

The key insight was constraint. I didn't need to expose every possible device operation. I needed exactly three tools that would give an AI agent enough capability to understand and interact with any mobile app.

Three Tools, Three Verbs

Drengr exposes exactly three MCP tools:

  • drengr_look — Observes the current screen. Captures a screenshot, extracts the UI hierarchy, and returns an annotated view where every interactive element is numbered. The agent sees what a user would see, but with machine-readable structure.
  • drengr_do — Executes an action. Tap element 3, type "hello world", swipe up, press back. These are the hands.
  • drengr_query — Reads device state without side effects. Check if an element exists, read text content, get the current activity name. This is the quiet observer — it never changes anything.

That's it. Three tools. Every mobile interaction I've needed — from opening apps to navigating complex flows to filling forms — reduces to sequences of look, do, and query.
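For a concrete picture of the interface, here is a simplified sketch of how the three tools could be declared in the server's tools/list response. The schemas are condensed and the field names inside them are illustrative, not an exact copy of what Drengr reports:

```python
# Sketch of the three tool declarations, in the JSON Schema shape that
# MCP servers return from tools/list. Schemas condensed for illustration.
TOOLS = [
    {
        "name": "drengr_look",
        "description": "Capture and annotate the current screen.",
        "inputSchema": {"type": "object", "properties": {}},
    },
    {
        "name": "drengr_do",
        "description": "Execute an action on the device.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ["tap", "type", "swipe", "back"]},
                "element": {"type": "integer"},
                "text": {"type": "string"},
            },
            "required": ["action"],
        },
    },
    {
        "name": "drengr_query",
        "description": "Read device state without side effects.",
        "inputSchema": {
            "type": "object",
            "properties": {"check": {"type": "string"}},
            "required": ["check"],
        },
    },
]
```

The point of the small surface is that the agent never has to learn a per-app API; everything it can do fits in three schemas.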

What Claude Actually Does With a Phone

Let me describe a real session. I asked Claude, through Drengr, to "open YouTube and find a video about the Model Context Protocol."

Claude called drengr_look first. It received an annotated screenshot showing the home screen with numbered elements — the app drawer, status bar icons, and the YouTube icon labeled as element 14. Claude then called drengr_do with {"action": "tap", "element": 14}.

YouTube opened. Claude called drengr_look again. Now it could see the YouTube home feed with a search icon at element 2. It tapped that, got a keyboard and search field, typed "Model Context Protocol MCP", and hit enter. Results appeared. Claude called drengr_look one more time, identified the first relevant result, and tapped it.

Total time: about 40 seconds. Total human intervention: zero. Claude navigated an app it had never been configured to use, adapting to whatever UI state it encountered.
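Stripped of narrative, that session reduces to a short call sequence. The action names here are illustrative, and element numbers are only meaningful relative to the drengr_look that produced them:

```python
# The YouTube session as a flat list of (tool, arguments) pairs.
# Element numbers are whatever the annotator assigned at each look;
# the last element number is left unspecified, as in the session above.
session = [
    ("drengr_look", {}),                                      # home screen
    ("drengr_do", {"action": "tap", "element": 14}),          # YouTube icon
    ("drengr_look", {}),                                      # home feed
    ("drengr_do", {"action": "tap", "element": 2}),           # search icon
    ("drengr_do", {"action": "type", "text": "Model Context Protocol MCP"}),
    ("drengr_do", {"action": "press", "key": "enter"}),       # submit search
    ("drengr_look", {}),                                      # results
    ("drengr_do", {"action": "tap", "element": ...}),         # first relevant result
]
```

Three looks, five actions: the agent re-observes only when the screen has meaningfully changed, which keeps both latency and token spend down.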

Setting It Up

The MCP configuration is minimal. Here's what goes in your claude_desktop_config.json:

```json
{
  "mcpServers": {
    "drengr": {
      "command": "drengr",
      "args": ["mcp"],
      "env": {
        "DRENGR_PLATFORM": "android"
      }
    }
  }
}
```

That's the entire integration. Drengr ships as a single binary — no Python virtualenv, no npm dependencies, no Docker container. You install it, point your MCP client at it, and Claude gains the ability to interact with whatever device is connected.

Honest Limitations

I want to be transparent about where this breaks down, because it does break down.

Vision isn't perfect. The UI hierarchy doesn't always capture everything visible on screen. Custom-drawn views, game canvases, and some Flutter widgets can appear as opaque rectangles. The agent can see the screenshot, but without structured element data, it's guessing at tap coordinates.

Some gestures are hard to express. A simple tap or swipe works reliably. But complex gestures — pinch to zoom, long-press-then-drag, multi-finger interactions — are difficult to represent in a tool call. I've implemented the common ones, but there's a long tail of interactions that don't map cleanly.

Latency adds up. Each look-do cycle involves capturing a screenshot, extracting the UI tree, sending it to the AI, waiting for a decision, and executing the action. On a fast local setup, each cycle takes 3-5 seconds. Over a network to a cloud device, it can be 8-12 seconds. For a 20-step flow, that's minutes of wall time.

Token costs are real. Screenshots and UI trees are not small. A single drengr_look response can be several thousand tokens. A complex navigation flow might consume 50,000-100,000 tokens. This isn't free, and it's something I think about when designing how much context to include in each response.
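Putting those two costs together with the rough figures above gives a back-of-envelope budget for a 20-step flow. The tokens_per_look value is an assumed round number standing in for "several thousand":

```python
# Rough budget for a 20-step flow, using the figures quoted above.
steps = 20
cycle_local_s = (3, 5)     # seconds per look-do cycle, fast local setup
cycle_cloud_s = (8, 12)    # seconds per cycle, networked cloud device
tokens_per_look = 4_000    # assumption: "several thousand" tokens per look

local_s = tuple(steps * s for s in cycle_local_s)   # 60 to 100 seconds
cloud_s = tuple(steps * s for s in cycle_cloud_s)   # 160 to 240 seconds
total_tokens = steps * tokens_per_look              # 80,000 tokens
```

That token figure lands inside the 50,000-100,000 range I see in practice, which is why trimming what each drengr_look response includes matters so much.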

What This Changes

The immediate application is testing — give Claude a goal, let it explore the app, report what it finds. But I think the more interesting implication is broader. MCP mobile support means AI agents can participate in workflows that were previously human-only. Filing bug reports with actual screenshots. Verifying that a deployment worked on a real device. Walking through a user flow to understand it before writing code.

The gap between "AI that understands code" and "AI that understands the product" has always been the device. Drengr is my attempt to close that gap.

What's Next

I'm working on a dashboard for visualizing test runs, real-time network monitoring so the agent can correlate UI actions with API calls, and a steering system that lets you redirect the agent mid-run. The core — three tools, one binary, MCP-native — won't change. Everything else is about making that core more useful.

If you want to try it: curl -fsSL https://drengr.dev/install.sh | bash. It takes about 10 seconds. I'd genuinely appreciate feedback on what works, what doesn't, and what you'd want it to do that it can't yet.