I Shipped a macOS App Built Entirely by Claude Code

July 1, 2025


I recently shipped Context, a native macOS app for debugging MCP servers. The goal was to build a useful developer tool that feels at home on the platform, powered by Apple's SwiftUI framework. I've been building software for the Mac since 2008, but this time was different: Context was almost 100% built by Claude Code¹. There is still skill and iteration involved in helping Claude build software, but of the 20,000 lines of code in this project, I estimate that I wrote fewer than 1,000 lines by hand².

This is a long post explaining my journey, how I chose my tools, what those tools are good at and bad at (for now), and how you can leverage them to maximize the quality of your generated code output, especially if you're building a native app like I am.

  1. Copilot to Claude Code
  2. Starting Yet Another Side Project
  3. Claude Code Is Good at Writing Code
  4. Claude Code Is Okay at Swift and Good at SwiftUI
  5. You Can Just Say "Make it More Beautiful"
  6. Context Engineering Is Key
  7. Priming the Agent
  8. Agents Can’t Read Your Mind, They Need Specs
  9. "Ultrathink and Make a Plan"
  10. Set Up Feedback Loops
  11. Claude Code Can Do More Than Write Code
  12. Building High Quality Automation Is (Almost) Free Now
  13. IDEs of the Future Will Look Very Different
  14. I Can Ship Side Projects Again

1. Copilot to Claude Code

ASCII art generated by Claude Code

My first experience with AI coding tools came when I tried GitHub Copilot, built into VS Code. This was the first tool of its kind, and I was pretty amazed: at the time, it was just autocomplete, but it was surprisingly effective—instead of only autocompleting symbol names or function signatures like a typical editor, it could autocomplete entire function implementations based on the surrounding context. This was a great productivity boost, but it still felt like you were doing most of the work.

Then things started to move fast: Cursor took off, they added Agent Mode, and new competitors like Windsurf entered the space. All of the products were leaning into the "agentic" mode of development, where instead of using one-shot LLM responses for autocomplete, an LLM calls various tools in a loop to accomplish more complex tasks: gathering context on your code base, reading web pages and documentation, compiling your program, running tests, iterating on build/test failures, etc.

I had not tried any of these new tools extensively because I wasn't actively working on a side project at the time, but in February 2025, an interesting contender emerged out of nowhere: Claude Code was not a VS Code fork like the others, but a tool designed to be used entirely in the terminal. It had no traditional code editing capabilities and no overwhelming, feature-packed UI; it put the agentic loop front and center, with a text box to enter a prompt and not much else. Instead of augmenting your IDE with AI, it replaced your IDE. I wasn't entirely convinced that this was the ideal UX, but the idea was refreshing enough compared to what already existed that I decided I had to give it a try.

2. Starting Yet Another Side Project

Side projects I never shipped

Like many engineers who have demanding day jobs, I have a large graveyard of side projects that never shipped. Building working prototypes is doable, but the last 20% takes so much time and effort that I had not been able to ship a side project for 6 years.

At this point, I was starting to play around with Claude Code and its support for MCP (Model Context Protocol) servers. Anthropic designed MCP as an open standard to allow agents to access tools and other context to accomplish specific tasks. For example, the Sentry MCP server exposes tools that allow an agent to fetch issues containing stack traces and other useful debugging context, and even invoke Sentry's own bug fixing agent.

However, the experience of building and testing MCP servers was cumbersome: MCP servers communicate with clients over standard input/output streams, or over HTTP with Server-Sent Events (SSE) to give servers the ability to stream responses to clients. It wasn't as simple as invoking a CLI or using curl to send requests to a service. There is a first-party tool called MCP Inspector that allows developers to test server functionality, but as a long-time macOS & iOS developer, I wanted to try building a native app to solve this problem. I figured it would be a great learning experience to push the boundaries of AI agents, and hoped to come out of it with a useful product.
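To make that concrete: even sending a single request to a stdio-based server means spawning the server process and exchanging newline-delimited JSON-RPC messages over its pipes. Here's a minimal Swift sketch of what that involves (the server path and client name are hypothetical, and error handling is omitted):

```swift
import Foundation

// Spawn a hypothetical MCP server binary that speaks JSON-RPC over stdio.
// The path and message below are illustrative, not from Context's code.
let process = Process()
process.executableURL = URL(fileURLWithPath: "/usr/local/bin/example-mcp-server")
let input = Pipe()
let output = Pipe()
process.standardInput = input
process.standardOutput = output
try process.run()

// MCP stdio messages are newline-delimited JSON-RPC 2.0, so a plain
// curl invocation doesn't work; you have to manage the process and pipes.
let initialize = #"{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-03-26","capabilities":{},"clientInfo":{"name":"scratch-client","version":"0.1"}}}"#
input.fileHandleForWriting.write(Data((initialize + "\n").utf8))

// Block until the server writes its response to stdout, then print it.
let response = output.fileHandleForReading.availableData
print(String(decoding: response, as: UTF8.self))
```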

3. Claude Code Is Good at Writing Code

Let me just start by saying that Claude Code (with the latest Sonnet 4 and Opus 4 models) is genuinely good at writing code. It's certainly not a top 1% programmer, but I would say that Claude's outputs are significantly better than those of the average developer. Given a description of the feature you're trying to implement, Claude can:

  • Locate and read existing source code in your project relevant to the feature
  • Understand code style and design patterns
  • Read additional documentation or specifications that you provide
  • Generate code to implement the feature
  • Generate tests to validate the behavior of the feature
  • Build your program and run the tests
  • Iterate on compiler failures and test failures until the build and tests pass
  • Look at screenshots or console logs, identify bugs, and fix them (more on this later)

Claude writing Swift code for my app

The really incredible thing is that it does this in a fraction of the time that it would take a person to implement the whole thing. Imagine onboarding a new employee with zero context on your project and having them ship a complete feature a few minutes later.

4. Claude Code Is Okay at Swift and Good at SwiftUI

I decided to build my app using the latest Apple developer technologies: Swift 6.1 and SwiftUI on macOS 15.5. I was curious to see how Claude would perform at writing Swift since there is significantly less Swift code in the training data for the model compared to a more ubiquitous language like Python or JavaScript.

The good news is that Claude is competent with most Swift language features up to Swift 5.5, when Swift Concurrency was introduced. Swift Concurrency was a drastic change to the language and, in my opinion, is hard for even humans to use correctly. Claude also gets confused when picking between modern frameworks and their legacy equivalents: it will often reach for legacy Objective-C APIs when a more modern Swift replacement is available, or use AppKit/UIKit in place of SwiftUI.
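To illustrate the kind of drift I mean (this is my own contrived example, not Claude's actual output): given a simple networking task, Claude will sometimes reach for the completion-handler API when the async/await equivalent is the idiomatic choice today:

```swift
import Foundation

// The legacy pattern Claude sometimes produces: completion handlers.
func fetchLegacy(url: URL, completion: @escaping (Data?) -> Void) {
    URLSession.shared.dataTask(with: url) { data, _, _ in
        completion(data)
    }.resume()
}

// The modern Swift Concurrency equivalent it should prefer.
func fetchModern(url: URL) async throws -> Data {
    let (data, _) = try await URLSession.shared.data(from: url)
    return data
}
```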

The SwiftUI code that it produces works fairly well: it is typically an accurate (but somewhat ugly) representation of the UI, and further iteration can turn it into something that genuinely feels well designed and usable.

My macOS app, Context

A problem that Claude constantly runs into when generating UI code is fundamentally a problem with Swift itself: the type expressions for UI code often become so complex that the compiler fails with the dreaded "The compiler is unable to type-check this expression in reasonable time" error. The solution is to refactor view bodies into smaller expressions, which, thankfully, Claude is excellent at doing without breaking the implementation—it sometimes even does this on its own when it sees that compiler error in the output.
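Here's a contrived sketch of what that refactor looks like: pulling pieces of a large view body out into smaller computed properties so the compiler can type-check each piece independently (the view itself is illustrative, not from Context):

```swift
import SwiftUI

struct ServerRow: View {
    let name: String
    let isConnected: Bool

    // Instead of one giant expression in `body`, split it into small,
    // independently type-checked pieces.
    var body: some View {
        HStack {
            statusIcon
            labels
            Spacer()
        }
        .padding(.vertical, 4)
    }

    private var statusIcon: some View {
        Image(systemName: isConnected ? "circle.fill" : "circle")
            .foregroundStyle(isConnected ? .green : .secondary)
    }

    private var labels: some View {
        VStack(alignment: .leading) {
            Text(name).font(.headline)
            Text(isConnected ? "Connected" : "Disconnected")
                .font(.caption)
                .foregroundStyle(.secondary)
        }
    }
}
```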

You can get Claude to avoid common pitfalls by creating a CLAUDE.md file with basic instructions on using modern APIs. Here's a snippet from the CLAUDE.md file for my Context project:

* Aim to build all functionality using SwiftUI unless there is a feature that is only supported in AppKit.
* Design UI in a way that is idiomatic for the macOS platform and follows Apple Human Interface Guidelines.
* Use SF Symbols for iconography.
* Use the most modern macOS APIs. Since there is no backward compatibility constraint, this app can target the latest macOS version with the newest APIs.
* Use the most modern Swift language features and conventions. Target Swift 6 and use Swift concurrency (async/await, actors) and Swift macros where applicable.

Even this relatively low-effort set of rules produces reasonable results, but you can go further: for example, Peter Steinberger's agent-rules repository contains rules you can add to your agent, both for general coding guidelines and specifically for writing better Swift code.

If you're interested in judging the code quality for yourself, see these examples from my project:

5. You Can Just Say "Make it More Beautiful"

If Claude doesn't produce a well-designed UI the first time, you can just tell it to "make it more beautiful/elegant/usable". I've found that the results are surprisingly good for such little effort. You can also do this more methodically by asking it first to "come up with suggestions for how to make this UI more beautiful", which will generate a list of design tweaks that you can choose from.

If you find a UI bug or a UI element you want to tweak, you can take a screenshot and drag and drop it (or ⌘+V paste it) directly into Claude Code. There will likely be better automation for this at some point, but for now, this works well and is universal, no matter what frontend platform you're building for.

6. Context Engineering Is Key

With the advent of mainstream AI, the industry was quick to define a new discipline: prompt engineering. Prompt engineering was the idea that you had to carefully craft prompts to extract the best quality outputs from a model. This may have been true then, but in my experience, I've found that prompt engineering is the wrong thing to focus on when using more recent models.

Today's models are much better at taking imperfect inputs and understanding your intent, both because the models are better and because they incorporate chain of thought (CoT) prompting. You can prompt the model with vague descriptions, incomplete sentences, and poor spelling and grammar, and it still does a reasonably good job of understanding what you're asking for and breaking down the problem into a series of steps.

The limitation you're going to constantly fight against when using Claude Code or a similar tool is the context window. The two newest Anthropic models (Sonnet 4 and Opus 4) both have 200k context windows, meaning they can operate on 200k tokens worth of text at a time. Every prompt and response consumes more context, and the model tends to perform worse toward the end of the context window.

Claude Code's auto compaction indicator

Claude even helpfully displays an indicator showing the amount of context you have left, after which it will proceed to "compact" the conversation. Compaction means that it summarizes the current conversation and uses that summary to seed a fresh context window so that you can continue prompting. Compaction is not perfect—it may miss important details from the prior conversation or seed the new window with low-quality context carried over from previous mistakes.

Producing the highest quality outputs using the limited number of context tokens you have, or in other words, context engineering, is the primary challenge in using coding agents effectively.

7. Priming the Agent

There's a process that I call "priming" the agent, where instead of having the agent jump straight to performing a task, I have it read additional context upfront to increase the chances that it will produce good outputs.

By default, it will read what's in both the user-scoped and project-scoped CLAUDE.md files, but you can pull in additional task-specific context by asking it to read specific documentation or source code. This is a prompt that I used recently to get it to read some existing source code and a spec from the web:

Read DXTTransport.swift, DXTManifest.swift, DXTManifestView.swift, DXTConfigurationView.swift, DXTUserConfiguration.swift, AddServerFeature.swift, and AddServerView.swift to learn how adding servers from DXT packages is implemented.

Then read the documentation for the manifest.json format here: https://raw.githubusercontent.com/anthropics/dxt/refs/heads/main/MANIFEST.md

After reading these sources, summarize what you've learned.

Claude will then use the Search and Read tools to find and read the source files, and use the Fetch tool to download the Markdown file from GitHub. Asking it to summarize forces it to think through what it understood from the sources, and having that summary in context improves performance on subsequent tasks.

Priming is especially important when your code uses third-party dependencies or new APIs that might have been introduced after the knowledge cutoff date for the model. Tools like Context7 and llm.codes exist to solve the problem of formatting documentation into a plain-text format that is consumable by the model.

8. Agents Can’t Read Your Mind, They Need Specs

When asking Claude to build a feature, a detailed spec is essential for steering the model. Claude will not be able to build any non-trivial feature without you putting in the effort. It's customary for AI product demos to highlight one-sentence prompts that create "entire apps", but if you want more than a prototype, you need a real spec.

The spec doesn't need to be well-written. You could even ramble over voice dictation (I still prefer typing, but anything works). Here's an example of a spec I gave Claude to build a new feature in my app:

A spec I gave Claude to implement support for Anthropic's DXT package format

This seems like a lot, but I was able to type this out much faster than I would've been able to implement the feature.

9. "Ultrathink and Make a Plan"

Claude tends to jump straight into implementation without gathering sufficient background, which produces poor-quality results. Another tactic for priming the agent is asking Claude to use its extended thinking mode and make a plan first. Extended thinking is activated by a set of magic keywords: "think" < "think hard" < "think harder" < "ultrathink". These are not just suggestions to the model—they are specific phrases that activate increasing levels of extended thinking. Ultrathink burns the most tokens but yields the best results. If you want to iterate on the plan, it helps to explicitly instruct Claude not to proceed with implementation until you've accepted the plan.

In general, I would highly recommend reading Anthropic's Claude Code: Best practices for agentic coding article. Many of the techniques I've discussed here are covered in the article, and it should be considered essential reading for getting the most out of Claude Code or any coding agent.

10. Set Up Feedback Loops

Claude is most useful when it's capable of independently driving feedback loops that allow it to make a change, test the change, and gather context on what failed to try another iteration. The key loops are:

  • Build. Claude should know how to compile your app. Claude knows how to compile Swift packages via swift build, but for my macOS application target, it often failed to figure out the right xcodebuild invocation. XcodeBuildMCP solves the problem by giving the model a simplified set of tools for building and running apps.
  • Test. Claude should be able to build and run your tests and see the test output. Again, Claude was able to do this out of the box for Swift packages via swift test. I have not yet tested whether it can run application/UI tests, but I suspect XcodeBuildMCP may be necessary for that, too.
  • Fix Bugs. Claude already knows how to debug problems by adding debug logging. The problem is that it cannot interact with the app like a user would to get the app into a state where it emits the right logs. You will have to manually interact with the app and copy/paste logs from the console into Claude. This works fine, but it means Claude can't fix problems completely autonomously unless you write unit tests or UI tests upfront that encapsulate the behavior (see the test sketch after this list). There are automation solutions like playwright-mcp for browser apps, but I'm not aware of a well-tested equivalent for native development.
  • Fix UX Issues. I mentioned earlier that you can paste screenshots into Claude to have it iterate on UI. You may be able to use tools like Peekaboo to automate taking screenshots, but you still run into the issue where you need to manually interact with the app to first get it into the right state.
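For the unit test route, here's a minimal sketch using the Swift Testing framework; the MCPRequest type is a hypothetical stand-in, not Context's real API. A test like this gives Claude a fully autonomous loop: run swift test, read the failure, fix the code, and repeat:

```swift
import Foundation
import Testing

// Hypothetical request type standing in for the app's real model.
struct MCPRequest: Codable, Equatable {
    let jsonrpc: String
    let id: Int
    let method: String
}

// A behavior-encapsulating test that Claude can run and iterate on
// via `swift test`, without any human interaction.
@Test func requestRoundTripsThroughJSON() throws {
    let request = MCPRequest(jsonrpc: "2.0", id: 1, method: "tools/list")
    let data = try JSONEncoder().encode(request)
    let decoded = try JSONDecoder().decode(MCPRequest.self, from: data)
    #expect(decoded == request)
}
```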

11. Claude Code Can Do More Than Write Code

Because Claude Code is an agent wrapping a general-purpose model, you can still use it to help with non-coding tasks as you iterate on the app itself: editing copy, for example, or planning future versions by asking the model for suggestions on how to improve the app's functionality.

One small thing I found useful was the ability to generate mock data before I had a way to get real data into the app. While building Context, I had a partially built implementation of a Swift MCP client library, but I wanted to switch gears and do some UI prototyping. Normally, the process of generating realistic mock data would've been so tedious that I never would've attempted it, but Claude generated great mock data in a matter of seconds. The first screenshots of the app that I shared with friends as I dialed in the UI were backed by mock data, but it looked real enough that you could get a good sense of how the app would look when rendering data from real MCP servers.

Context app backed by mock data generated by Claude

For MCP in particular, the mock data was even more important because most MCP servers at the time used few features of the spec beyond tools, but I still needed a way to validate the UI for those features.
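To give a sense of what this looks like, here's a trimmed-down sketch of the kind of mock data involved (the types are illustrative stand-ins, not Context's actual models):

```swift
import Foundation

// Illustrative stand-in for a tool definition in the app's UI layer.
struct MockTool: Identifiable {
    let id = UUID()
    let name: String
    let description: String
}

// Realistic-looking fixtures like these let the UI be prototyped
// before the MCP client library was finished.
let mockTools: [MockTool] = [
    MockTool(name: "search_issues",
             description: "Search for issues matching a query string"),
    MockTool(name: "get_stack_trace",
             description: "Fetch the stack trace for a crash event"),
    MockTool(name: "list_projects",
             description: "List all projects accessible to the current user"),
]
```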

12. Building High Quality Automation Is (Almost) Free Now

Part of the painful last 20% of shipping is automating the process of releasing the app, especially on macOS, where you have to navigate the complicated maze of code signing, notarization, and packaging. In earlier projects, this is the point at which I would fiddle around trying to get fastlane set up correctly and then build some bare bones Python automation around it. Not this time.

With a few hours of iteration, I had Claude write me a release script that does the following:

  • Check if the environment is set up correctly with the right tools installed
  • Generate change log entries from git commits, combine them with the handwritten change log entries, and generate HTML release notes
  • Build the app, codesign it, notarize it, and package it into a DMG
  • Generate a Sparkle appcast to deliver automatic updates to existing users
  • Tag the release and publish the release to GitHub
  • Upload debug symbols to Sentry for crash report symbolication

After the script was fully functional, I used a simple one-line prompt to beautify the CLI output, and I ended up with this:

Running my build & release automation script generated by Claude

The script is 2,000 lines of Python code. Had I written it manually, I never would've bothered to automate more than the most critical steps, and I certainly would not have put the effort into making the output look this nice. This script will save me tens of minutes of manual work on every release I publish, and all it took was a few paragraphs of natural-language spec and having Claude debug and fix some issues I found while running the script.

13. IDEs of the Future Will Look Very Different

It occurred to me as I worked on this project that the only two tools I used throughout were Claude Code and GitHub Desktop for viewing diffs. The vast majority of the time, I didn't need any of the typical editor features: file tree, source code editor, extensions, etc. I occasionally used Xcode to make edits by hand, but this was rare, and I still didn't use most of the Xcode-specific features (SwiftUI Previews, View Debugger, etc.). Since this is the worst coding agents will ever be, I have to imagine that there is a future in which IDEs look nothing like they do today.

Cursor, Windsurf, and Copilot all started with VS Code and diverged in various ways, but they were all bolting AI onto an editor that was designed pre-AI. Fundamentally, VS Code does not look very different from a JetBrains IDE from 20 years ago. I also see projects like Warp that are attempting to pivot from being a modernized terminal emulator into an agentic development environment, but I don't believe a terminal is necessarily the ideal UX either, despite how much I enjoy Claude Code.

I believe that the IDEs of the future will focus on enabling developers to prime the agent's context and set up the feedback loops that are essential to helping an agent succeed with its task. The UX for this will look very different—I can't predict exactly how, but I don't think a source code editor will be the centerpiece.

14. I Can Ship Side Projects Again

The most exciting thing about this entire journey for me is not the app I built, but that I am now able to scratch my coding itch and ship polished side projects again. It's like I found an extra 5 hours every day, and all it cost me was $200 a month.

1: I have no affiliation with Anthropic, and this is not sponsored content in any way; I just really like the tool.
2: In contrast, 0% of this blog post was written by AI.