On-Device vs Cloud AI Scribes: The Privacy Tradeoff

Most AI scribes look the same from the outside: you tap record, you talk to your patient, and a structured note appears a few seconds later. What that smooth surface hides is a decision the vendor made for you — whether your patient's voice is processed on the phone in your hand, or shipped across the internet to a server you'll never see. That single architectural choice is the most important privacy fact about any scribe you'll evaluate, and it almost never appears on the marketing page.

The note you sign at the end may be word-for-word identical either way. But the risk you've taken on to get there is not. To compare scribes honestly, you have to stop looking at the output and start asking where the audio lives.

What "on-device" actually means

"On-device" means the transcription — turning the spoken visit into text — happens on the phone itself, using the device's own processor. The audio is captured, converted to a note locally, and never uploaded anywhere. Modern phones are fast enough to run speech-to-text models without a server, so the recording of your patient's voice simply stays on the hardware you control.

The practical consequence is that there's no copy of the visit sitting on a vendor's server, no audio in transit to be intercepted, and no third party holding a recording of a conversation you're legally obligated to protect. If a tool processes audio on-device and discards it, the breach surface for that recording shrinks to one thing: the phone in your pocket. That's privacy by architecture: it holds because of how the system is built, not because someone promised to behave.

The safest copy of a patient's voice is the one that was never made. On-device transcription is how you get a usable note without creating that copy in the first place.

What the cloud buys you — and costs you

Cloud scribes send the audio (or a real-time stream of it) to the vendor's servers, where larger, more powerful models do the transcription and note-drafting. There are real upsides: server-class models can be more accurate on messy audio, handle long or multi-speaker visits more gracefully, and improve over time without you updating an app. Convenience and raw capability genuinely favor the cloud.

The cost is that your patient's data leaves the building. Once audio reaches a third party's servers, that vendor is handling protected health information on your behalf — which means you need a signed BAA (a business associate agreement: the contract that legally binds a vendor to safeguard patient data and report breaches). Without one, using a cloud scribe on real patients isn't just risky, it's a compliance problem. We walk through exactly why in HIPAA & AI: A Practical Guide.

Stripped down, the tradeoff looks like this:

On-device: audio never leaves the phone · smaller breach surface · often works offline · no third party holding the recording · model capability bounded by the device
Cloud: bigger, often more accurate models · improves without app updates · easier multi-speaker handling · but audio leaves the building · requires a signed BAA · adds a vendor to trust

The honest framing

On-device isn't automatically "better" — it's a different risk posture. It trades some model power for the strongest possible privacy guarantee. Cloud trades that guarantee for capability and convenience, and asks you to backstop it with a contract. Neither is wrong; you just need to know which one you've chosen.

How to tell which one you're using

Vendors rarely say "we upload your audio" in plain language, so you have to ask directly. A scribe's architecture is knowable — push until you get a straight answer to these:

Does the transcription happen on the device, or on your servers? If they hedge, assume it's the cloud.
Is the audio ever uploaded — even temporarily? "Processed and deleted" still means it left the phone.
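Does it work with no internet connection? If a finished note appears while the phone is in airplane mode, the core transcription is almost certainly on-device. If the app only records offline and delivers the note once you reconnect, it may be queuing the audio for upload.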
If anything is sent to your servers, will you sign a BAA? A confident "yes" is the floor, not a bonus.
How long is any recording retained, and where? The best answer is "we never store the audio."
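
If you're comparing several vendors, it can help to record their answers in one consistent shape. The sketch below is one way to do that, not a compliance tool: the VendorAnswers fields, the classify and concerns functions, and the two example vendors are all hypothetical names made up for illustration. It assumes a hedged or missing answer (None) counts as the riskier one, in line with the questions above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VendorAnswers:
    """A vendor's answers to the five questions. None means they hedged."""
    transcribes_on_device: Optional[bool]
    audio_ever_uploaded: Optional[bool]
    works_offline: Optional[bool]
    will_sign_baa: Optional[bool]
    audio_retention_days: Optional[int]  # 0 means the audio is never stored


def classify(a: VendorAnswers) -> str:
    """Return "on-device" only on two clear answers; any hedge counts as cloud."""
    if a.transcribes_on_device is True and a.audio_ever_uploaded is False:
        return "on-device"
    return "cloud"


def concerns(a: VendorAnswers) -> List[str]:
    """List the follow-ups this set of answers should trigger."""
    issues: List[str] = []
    architecture = classify(a)
    if architecture == "cloud" and a.will_sign_baa is not True:
        issues.append("audio leaves the phone but no BAA is confirmed")
    if architecture == "on-device" and a.works_offline is not True:
        issues.append("on-device claim not backed by an offline test")
    if a.audio_retention_days is None:
        issues.append("audio retention is unknown")
    elif a.audio_retention_days > 0:
        issues.append(f"audio is retained for {a.audio_retention_days} day(s)")
    return issues


# Two made-up vendors: one with clear on-device answers, one that hedges.
local_scribe = VendorAnswers(True, False, True, None, 0)
vague_scribe = VendorAnswers(None, True, False, False, 30)

print(classify(local_scribe), concerns(local_scribe))
print(classify(vague_scribe), concerns(vague_scribe))
```

The design choice worth noting is the strictness of classify: a vendor earns the "on-device" label only with a clear "yes" to local transcription and a clear "no" to uploads. Everything else is treated as cloud until proven otherwise.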

Once you know the architecture, the rest of your evaluation gets simpler. When you're ready to compare specific products against this bar, we rank them in The Best AI Medical Scribes for 2026.

Where to start

You don't have to pick a side in the abstract. Decide what your practice can tolerate: if the strongest privacy guarantee is worth a small ceiling on model power, start on-device. If you need maximum accuracy on hard audio and you're prepared to manage a vendor relationship with a signed BAA, the cloud is a defensible choice. Either way, the tools we'd actually deploy are on the shortlist.

Disclosure: Voti, Phiclaw and Ratatui are built by the team behind this publication. We recommend them because we'd run them ourselves; see our editorial standards.