Embedded AI Does Not Fail Where You Think It Does
There is no shortage of enthusiasm around AI. Models perform well on workstations. Demos look convincing. Early results are promising. And then the model is moved onto the target platform.
That is where things begin to drift.
Not because the model is wrong. Not because the data science is flawed. But because running a model inside an embedded system is a different problem altogether.
At Embedded Expertise, this is where we operate.
The Gap No One Owns
In most organizations, roles are clearly defined. Data scientists build the model. Embedded engineers build the system.
But between the two, there is a gap. This gap often remains invisible until late in the project.
A trained model is not yet a deployable component. Especially not on a heterogeneous SoC where CPU, GPU, and NPU share memory, bandwidth, and constraints.
Here’s the typical flow: the model is designed and tested on a workstation. When it’s good enough, it’s exported to TensorFlow Lite (or similar). The NPU is enabled. Expectations are high. Then the measurements come in:
- Performance is disappointing.
- CPU usage is higher than expected.
- Latency is unstable.
- The NPU is either underused, or fully used without delivering the expected gains.
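"Latency is unstable" is a claim worth making measurable before any debate about causes. A minimal sketch of how we would quantify it, using only the standard library: the mean hides instability, the tail percentiles expose it. The `run_inference` callable is a placeholder for whatever invokes your model (for a TFLite interpreter it might be `interpreter.invoke`, but nothing here depends on that).

```python
import statistics
import time

def latency_profile(run_inference, n_warmup=10, n_runs=200):
    """Collect a latency distribution for a single-inference callable.

    A healthy deployment shows p99 close to p50; a large gap usually
    points at contention (memory bandwidth, CPU fallback, scheduling).
    """
    for _ in range(n_warmup):  # let caches, clocks, and delegates settle
        run_inference()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - t0) * 1e3)  # milliseconds
    samples.sort()
    return {
        "p50_ms": samples[len(samples) // 2],
        "p99_ms": samples[int(len(samples) * 0.99)],
        "mean_ms": statistics.fmean(samples),
    }
```

Comparing p50 against p99 across runs is often the first concrete evidence that the problem is systemic rather than a property of the model.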
At that point, the discussion becomes uncertain.
Is it the model? The runtime interpreter? The delegate? The platform?
The Illusion of Simplicity
Part of the confusion comes from how embedded AI is introduced.
Ready-to-use models such as MobileNet are excellent demonstrators. They show how a relatively modest platform can deliver impressive results. With a bit of fine-tuning, they can even be adapted to concrete use cases, such as gesture recognition or simple visual control systems.
They create a strong impression: this looks accessible, almost straightforward.
But these models are the visible end of a long process. Architecture design, dataset preparation, training strategies, quantization, optimization, benchmarking, iteration. What appears as “ready-to-use” is in fact the result of extensive engineering and repeated refinement.
When teams bring their own models to the same platform, the experience is usually different.
It is common to see models that technically run on the NPU, sometimes even fully delegated, yet fail to deliver the expected performance. Execution becomes dominated by data movement. CPU fallback appears in unexpected places. Latency behaves inconsistently.
These situations are not edge cases. They are typical first outcomes.
The difference is not in the intent or the competence of the team. The issue is not the model itself, nor the NPU. It lies in the maturity of the model relative to the constraints of the target system.
Reference models hide that complexity. Your own model will expose it. This is where most projects begin.
Running on an NPU Is Not a Checkbox
“Use the NPU” suggests a simple switch. In practice, it is an architecture partitioning problem.
A model is split across execution domains. Some operators are delegated, others fall back to the CPU. Tensors are constantly moved, reshaped, and reinterpreted. Memory bandwidth becomes a limiting factor.
Even a fully delegated graph can perform poorly if it is dominated by data movement rather than computation.
This is why two models with similar accuracy can behave completely differently once deployed.
The behavior is not just defined by the model. It is defined by the interaction between the model and the system.
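The distinction between compute-bound and memory-bound execution can be sketched with a simple roofline-style check: compare an operator's arithmetic intensity (FLOPs per byte moved) against the platform's ridge point (peak compute divided by peak bandwidth). The hardware numbers below are illustrative, not the specification of any particular SoC.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte of data moved: the key ratio in a roofline model."""
    return flops / bytes_moved

def bottleneck(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Classify an operator as compute-bound or memory-bound.

    Below the ridge point (peak compute / peak bandwidth), the operator
    cannot keep the accelerator busy: it is limited by data movement,
    and a faster NPU will not make it faster.
    """
    ridge = peak_flops_per_s / peak_bytes_per_s
    if arithmetic_intensity(flops, bytes_moved) >= ridge:
        return "compute-bound"
    return "memory-bound"

# Illustrative platform: 2.3 TFLOPS peak compute, 8 GB/s memory bandwidth.
RIDGE = 2.3e12 / 8e9  # ~287 FLOPs per byte
```

A dense 3x3 convolution typically lands well above the ridge point, while an element-wise or reshape-heavy operator lands far below it, which is why two graphs with similar FLOP counts can behave completely differently once deployed.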
Architecture With an AI Twist
We are not data scientists. We do not design models. We do not run training pipelines.
But we are not external to that world either.
We understand how models are structured, how they are trained, how they are quantized, and how they are transformed into deployable artifacts such as TensorFlow Lite graphs.
This allows us to operate at the interface. What we bring is system architecture, applied to AI workloads.
The same discipline used to design reliable embedded platforms, extended to a new class of computation.
What We Actually Do
We take the model you already have, and we help you understand how far it can go on your target, and what it would take to go further.
That starts with a precise analysis of the TensorFlow Lite graph and its execution behavior on your platform.
We look at the system as it is, not as it is expected to behave.
We identify:
- what is truly delegated to the NPU, one op at a time
- where and why CPU fallback occurs
- whether the execution is compute-bound or memory-bound
- which operators dominate runtime
- how data movement impacts performance
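The questions above reduce to a few aggregations over a per-operator execution trace. A minimal sketch, assuming a hypothetical trace format (in practice the timings would come from the runtime's own op-level profiler; the dict layout here is illustrative, not any tool's actual output):

```python
from collections import defaultdict

def summarize_trace(trace):
    """Aggregate a per-operator execution trace.

    `trace` is a list of dicts: {"op": str, "device": "NPU" or "CPU",
    "ms": float}. This format is an assumption for the sketch.
    """
    total = sum(entry["ms"] for entry in trace)
    time_by_device = defaultdict(float)
    time_by_op = defaultdict(float)
    fallback_ops = set()
    for entry in trace:
        time_by_device[entry["device"]] += entry["ms"]
        time_by_op[entry["op"]] += entry["ms"]
        if entry["device"] == "CPU":
            fallback_ops.add(entry["op"])  # candidate for investigation
    return {
        "npu_share": time_by_device["NPU"] / total,   # fraction truly on NPU
        "cpu_fallback_ops": sorted(fallback_ops),     # where fallback occurs
        "dominant_op": max(time_by_op, key=time_by_op.get),
    }
```

Even this crude summary usually shifts the conversation: instead of "the NPU is slow", the question becomes "why does one transpose account for a third of the runtime".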
To support this, we have developed internal analysis tools that inspect TFLite models and expose these characteristics clearly.
These tools are made available to our clients.
Because the objective is not to promise a result at any cost, but to make the situation understandable and actionable.
From Diagnosis to Direction
Initial deployments rarely meet expectations. That is not a failure. It is a starting point. From there, the work becomes actionable.
Sometimes the solution is to refine the model so that it aligns better with the NPU: reduce unnecessary tensor movements, reorder operations, enable better op fusion. Sometimes the right decision is to rebalance execution between CPU and NPU. Sometimes a part of the pipeline simply does not benefit from acceleration and should remain where it is. Sometimes the conclusion is simpler: the model, in its current form, is not a good fit for the target.
These are engineering decisions, grounded in measurable behavior. We help teams make them with clarity.
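One such decision, whether a given operator should be offloaded at all, can be sketched as a cost comparison that includes the data movement the delegation implies. The timings are illustrative placeholders, not measurements from a specific platform:

```python
def worth_offloading(cpu_ms, npu_ms, transfer_in_ms, transfer_out_ms):
    """Offloading pays only if accelerator time plus the cost of moving
    tensors across the execution boundary beats staying on the CPU."""
    return npu_ms + transfer_in_ms + transfer_out_ms < cpu_ms
```

The point of writing it down is that the transfer terms are exactly what naive "use the NPU" reasoning leaves out: for small or memory-bound operators they routinely dominate, and the right answer is to leave the operator where it is.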
Transmission Matters Too
This type of work cannot remain implicit: teams need to understand what is happening, not just receive a result.
For that reason, we also provide targeted training on platforms such as i.MX8M Plus.
Not as a generic introduction to AI, but as a practical guide to how models behave on real hardware, how delegation works, and where performance is won or lost.
The objective is simple: after the engagement, your team should be able to reason about these systems independently.
When to Involve Us
If your model works on a workstation but not on your target, you are already in this situation. If your NPU is enabled but performance does not follow, you are already in this situation. If your team cannot clearly explain where time is spent during inference, you are already in this situation.
We step in at that point.
Embedded AI does not fail because the models are weak. It fails because the system is treated as a black box.
Running a model on an embedded platform is not just about inference. It is about data movement, memory pressure, operator compatibility, and execution boundaries. This is system engineering.
We bring a clear diagnosis, practical options, and the engineering discipline needed to get the most out of the system.
Key Takeaways
AI in embedded systems does work, but not by default. It works when the model and the platform are aligned through careful engineering.
That alignment is not accidental.
If you are facing this situation, we can help you make sense of it. Start with a model, a platform, and a measurement that does not match expectations. We will help you understand why.
Enjoyed this article?
Embedded Notes is an occasional, curated selection of similar content, delivered to you by email. No strings attached, no marketing noise.