Harmony: LLM-Powered Merge Conflict Resolution for Faster, Cheaper Software Updates

TL;DR: Preview of Harmony — our homegrown model pipeline that automatically resolves merge conflicts, reducing the time and cost of software updates

The Merge Pain Behind Every Software Update

We’re sharing an early preview of Project Harmony — our research into using AI to automatically resolve merge conflicts, one of the most tedious, repetitive, and costly parts of maintaining and updating large device codebases.

Device software is never static. Upstream Android and Linux incorporate bug fixes and security patches daily, and regularly add new features. Device makers take these codebases and extensively modify them to support their hardware and unique features. With each new upstream release, these changes must be rebased, inevitably leading to merge conflicts.

For a codebase of hundreds of millions of lines of code, changing by 10–15% each year, the volume and complexity of merge conflicts are enormous, and each resolution carries the risk of introducing new bugs.

Today, device makers face two choices: dedicate entire engineering teams to resolving conflicts continuously, or limit software updates to avoid the cost and resource drain. Upcoming regulatory changes mean the second option is no longer viable.

The first option isn’t great either. Aside from the labour cost, resolving merge conflicts is repetitive work that offers little reward. Imagine being dropped into the middle of two developers’ conflicting changes, with little context for what each was trying to achieve.

Harmony aims to change this by automatically resolving, validating, and explaining merge conflicts, freeing developers to focus on building great software. The challenge is doing this at a fraction of the cost of manual resolution while avoiding the new or hard-to-detect bugs that large language models (LLMs) can sometimes introduce.

Challenge accepted – and here’s how we do it.

Why Do We Use Smaller, Specialised Models?

LLMs exhibit strong generalisation and emergent reasoning capabilities, making them powerful tools for broad, open-ended tasks. However, for well-defined problems and specialised domains, small language models (SLMs) offer a more practical and efficient alternative. Not only can SLMs be 10–30× cheaper to run and deploy than LLMs, but recent studies also show that specialised SLMs (e.g., Phi-3.5-mini (3.8B), DeepSeek-Coder (6.7B)) can achieve comparable performance to models exceeding 70B parameters on coding benchmarks like HumanEval and MBPP.

In our work on automated code conflict resolution, we developed a family of domain-specialised SLMs, which we call Harmony models. They are built on smaller backbones such as Llama-3.1-8B and Qwen3-4B, and fine-tuned on a high-quality dataset of code-merging examples derived from the Android Open Source Project (AOSP).
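
To make the setup concrete, here is a rough sketch of how such a training example could be constructed from a three-way merge conflict and its human-authored resolution. The parsing logic, prompt format, and helper names below are illustrative assumptions, not our actual data pipeline.

```python
# Illustrative only: one plausible way to turn a three-way (diff3-style) merge
# conflict into a fine-tuning record. Names here are hypothetical.
import re

# Matches diff3-style conflict blocks: <<<<<<< ours / ||||||| base / ======= / >>>>>>> theirs
CONFLICT_RE = re.compile(
    r"<<<<<<<[^\n]*\n(?P<ours>.*?)"      # "ours": the device maker's change
    r"\|\|\|\|\|\|\|[^\n]*\n(?P<base>.*?)"  # merge base (requires diff3 conflict style)
    r"=======\n(?P<theirs>.*?)"          # "theirs": the new upstream change
    r">>>>>>>[^\n]*\n?",
    re.DOTALL,
)

def build_examples(conflicted_text: str, block_resolutions: list[str]) -> list[dict]:
    """Pair each conflict block with the resolution a human eventually committed.

    block_resolutions[i] is assumed to hold the resolved text of the i-th block,
    e.g. recovered by replaying a historical merge and diffing against the result.
    """
    examples = []
    for match, resolution in zip(CONFLICT_RE.finditer(conflicted_text), block_resolutions):
        prompt = (
            "Resolve the following merge conflict.\n"
            f"<ours>\n{match.group('ours')}</ours>\n"
            f"<base>\n{match.group('base')}</base>\n"
            f"<theirs>\n{match.group('theirs')}</theirs>"
        )
        examples.append({"prompt": prompt, "completion": resolution})
    return examples
```

In practice, records like these would be mined at scale from AOSP merge history, with the ground-truth resolution recovered by replaying each historical merge.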

We found that fine-tuned, domain-specialised SLMs can match or even outperform leading general-purpose LLMs, despite being more than 20× smaller. As a result, our Harmony models deliver high precision and consistency without the computational overhead of broad, general-purpose reasoning, dramatically improving inference speed and reducing cost, which is key to delivering software updates at scale.

An additional advantage is that SLMs can be quickly retrained on new codebase versions, allowing them to learn and adapt to the large-scale code migrations common in device development. These migrations often originate upstream, the same place our training data is sourced from.

Combined with the rapid progress of open-source SLMs, this adaptability makes them the ideal foundation for Harmony. It also allows the models to run in isolated enterprise cloud environments and to be regularly fine-tuned for the specific software updates being performed.

Beyond Single Models

While a single fine-tuned model can already deliver strong performance, in practice, developers can be presented with multiple plausible merge options as part of the merging process. To mirror this workflow, we explored how ensembles of small, specialised models perform when tasked with proposing multiple candidate solutions.

To this end, we evaluated the top-3 accuracy (i.e., the likelihood that the correct resolution appears among the top three outputs), using a mixture of Harmony models built on different backbones (e.g., Llama-3.1-8B, Qwen3-4B) and trained on distinct subsets of AOSP-derived data.
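
As a concrete illustration, top-3 accuracy over an ensemble can be computed roughly as follows. The whitespace-based normalisation and the one-candidate-per-model setup are simplifying assumptions; a real evaluation needs a stronger notion of when two resolutions are equivalent.

```python
# Minimal sketch of top-k accuracy for an ensemble of resolver models.
# `models` is a list of callables, each returning one candidate resolution.
def normalise(code: str) -> str:
    # Simplification: compare whitespace-normalised text. A real evaluation would
    # need a semantic or AST-level notion of an equivalent resolution.
    return "\n".join(line.strip() for line in code.strip().splitlines())

def top_k_accuracy(conflicts, models, k: int = 3) -> float:
    """conflicts: iterable of (conflict_text, ground_truth_resolution) pairs."""
    hits = 0
    total = 0
    for conflict, ground_truth in conflicts:
        # Each ensemble member proposes one candidate; keep the first k distinct ones.
        candidates = []
        for model in models:
            cand = normalise(model(conflict))
            if cand not in candidates:
                candidates.append(cand)
        if normalise(ground_truth) in candidates[:k]:
            hits += 1
        total += 1
    return hits / total
```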

Our ensemble of Harmony models achieved a 27.11-point improvement in top-3 accuracy over an ensemble of general-purpose LLMs (Claude + Gemini + GPT). This result highlights another key advantage of small language models. Because SLMs are lightweight and cost-efficient, we can afford to train and deploy multiple specialised variants, each capturing slightly different coding patterns or merge heuristics, and then combine their strengths through ensembling.

Agentic Approach to Merge Conflict Resolution

Automated conflict resolution is not a single-step problem; it involves a sequence of interdependent tasks, each posing distinct challenges. For example, resolving a single conflict block may involve analysing just a few files, while verifying the correctness of the final resolution often requires compiling and testing the entire codebase.

Rather than relying on a single monolithic model to handle this entire pipeline, we believe the future of intelligent code automation lies in coordination among many specialised SLMs, each an expert in a focused part of the workflow. To this end, we developed the Harmony Orchestrator, an agentic model equipped with robust tools that extend its capabilities in context retrieval, structured reasoning, and validation of proposed resolutions.
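
The sketch below illustrates the general shape of such an orchestration loop. The tool names, retry policy, and reasoning-budget handling are hypothetical simplifications, not the actual Harmony Orchestrator implementation.

```python
# Highly simplified sketch of an agentic orchestration loop. Tool names and the
# decision policy are illustrative assumptions; the real Orchestrator plans
# dynamically and weighs tool cost against expected benefit.
from dataclasses import dataclass

@dataclass
class Resolution:
    text: str
    explanation: str

def resolve_conflict(conflict, orchestrator, tools, max_attempts: int = 3):
    # Cheap first: gather surrounding code and commit history for context.
    context = tools["retrieve_context"](conflict)
    for attempt in range(max_attempts):
        # A domain-specialised Harmony SLM proposes a candidate resolution.
        candidate = tools["harmony_resolver"](conflict, context)
        # An LLM-as-a-judge check catches obvious problems early...
        verdict = tools["judge"](conflict, context, candidate)
        if verdict.ok:
            # ...so the expensive build-and-test validation runs only on
            # candidates the judge already accepts.
            if tools["build_and_test"](candidate):
                explanation = orchestrator.explain(conflict, candidate)
                return Resolution(candidate, explanation)
        # Otherwise retry with a larger context/reasoning budget.
        context = tools["retrieve_context"](conflict, budget=attempt + 1)
    return None  # escalate to a human reviewer along with the gathered context
```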

As part of our research into developing the Harmony Orchestrator, we focused on four key aspects that make agentic coordination both efficient and reliable:

  1. Efficient planning and tool-calling. The Orchestrator is optimised for high-level reasoning and dynamic task decomposition. Built on top of GPT-OSS, an LLM designed for agentic workflows, it intelligently decides when and how to invoke auxiliary tools. Some tools (e.g., context retrieval or code validation) can be computationally expensive, so the Orchestrator is tuned to balance accuracy and resource efficiency, ensuring high-quality results at minimal cost.
  2. High-quality specialised tools. An agentic system is only as effective as the tools it employs. We invested heavily in building fast, reliable, and domain-optimised tools. For example, integrating with SourceFS, our high-performance virtual filesystem that accelerates code checkouts and builds by 10×, enables rapid validation cycles and significantly improves overall performance.
  3. Validating intermediate results. We use an LLM-as-a-judge validator to verify intermediate results at various stages of the workflow (a minimal sketch follows this list). This approach allows the system to detect and correct potential issues early, reducing error propagation. Flagged cases can be automatically re-run with a higher reasoning budget or escalated for human review.
  4. Explainability. We put a lot of effort into explainable reasoning. For instances that require human intervention, the Orchestrator provides structured reasoning explanations to support decision-making. We found that presenting clear, interpretable justifications for each proposed resolution substantially reduces the time engineers spend reviewing and finalising merges.
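
As a rough illustration of the validation step in point 3, an LLM-as-a-judge check can be sketched as follows. The judge prompt, pass/fail protocol, and model interface are illustrative assumptions, not the production validator.

```python
# Minimal sketch of LLM-as-a-judge validation of a proposed resolution.
from dataclasses import dataclass

@dataclass
class Verdict:
    ok: bool
    reasons: str

# Hypothetical judge prompt; the real prompt and rubric would be more detailed.
JUDGE_PROMPT = """You are reviewing an automated merge conflict resolution.
Conflict:
{conflict}

Proposed resolution:
{candidate}

Does the resolution preserve the intent of both sides without introducing bugs?
Answer PASS or FAIL, followed by a short justification."""

def judge(conflict: str, candidate: str, judge_model) -> Verdict:
    """judge_model is any callable that maps a prompt string to a reply string."""
    reply = judge_model(JUDGE_PROMPT.format(conflict=conflict, candidate=candidate))
    ok = reply.strip().upper().startswith("PASS")
    return Verdict(ok=ok, reasons=reply)

# FAIL verdicts can be re-run with a higher reasoning budget or routed to a
# human reviewer, as described above.
```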

Next Steps

We’re rapidly productising the Harmony models on real-world codebases and expanding their capabilities even further. Early results have been outstanding and we are making great progress towards wider availability.

Our results also reinforce a growing consensus in the research community: smaller, specialised models, when trained on focused, high-quality data, can outperform much larger general-purpose models in targeted applications. 

For us, this translates to higher accuracy, lower latency, and dramatically reduced deployment costs — the key to keeping devices updated cost-effectively throughout their lifespan.