
Building memory for an always-on AI that listens to your kitchen

Tuesday, 17 February 2026 · Adam Juhasz

Last Tuesday, Juno heard my family say over 800 things. It remembered 12.

Juno is an ambient AI assistant in a smart display that sits on our kitchen counter, with everything happening on-device and no cloud dependencies. It listens to the room all day, no wake word, and tries to extract useful memories (appointments, shopping lists, events, etc.) to build household context. The other 800 or so utterances from Tuesday? Jokes about burnt toast. A debate with my 11-month-old about which foods she was not going to throw on the floor. A full episode of some cooking show. The system heard all of that and decided, correctly, that none of it mattered. What mattered was my wife and I going back and forth scheduling who could take our daughter to her one-year checkup (it remembered the date and time of the checkup).

We've built the memory system across roughly three generations, and the thing that surprised me most is where the actual engineering difficulty lives. It's not in storing memories or retrieving them (KISS and use markdown). It's in figuring out what to throw away.

The problem with kitchen audio

Every other AI memory system I'm aware of (ChatGPT's memory, Mem0, MemGPT) processes typed text. And typed text has a nice property: if someone took the trouble to type a message to an AI, it's probably worth remembering. The intent signal comes for free.

Ambient audio gives you no such luxury. Our kitchen on a Tuesday evening is a wall of noise. I'd estimate maybe 5% of transcribed speech contains actual household information. The rest is TV commentary, small talk, emotional venting, and my personal favorite category: vague half-plans that sound like commitments but aren't. "We should probably clean the garage one day." Sure we should.

So the extraction criteria start with this, right at the top:

Precision is more important than recall. False positives (random conversations saved as memories) are worse than missing a borderline memory.

If the system isn't sure, it saves nothing. A ghost appointment created from a sports broadcast ("game at 7 on Sunday" looks a LOT like a real plan to an LLM) is worse than missing a borderline shopping item.

We also reject credentials outright. Passwords, PINs, credit card numbers, SSNs. Even if someone reads a card number aloud, the system drops it. An always-on listener that stored that kind of thing would be a nightmare.

In practice, the ignore/extract taxonomy looks like this. Ignored:

| Ignored | Reason |
| --- | --- |
| How are you, Juno? Nice weather today. | Small talk |
| That game was insane, what a comeback. | Sports commentary |
| I hated that movie ending. | Opinion |
| Maybe we should go somewhere sometime. | Ambiguous |
| I am so stressed and tired of everything. | Venting |
| These tacos are amazing. | Reactive comment |
| The forecast says rain on Sunday. | Media |
| We should clean the garage one Saturday. | No commitment |

Extracted:

| Transcript | Extracted Memory |
| --- | --- |
| Add oat milk and dish soap to the list | Shopping list:\n- oat milk\n- dish soap |
| Dentist next Tuesday at 9am | Dentist appointment at 2026-01-21T09:00:00-08:00 |
| The parent-teacher conference is Thursday at 6 PM | Parent-teacher conference at 2026-01-22T18:00:00-08:00 |
| Mom is visiting next weekend | Mom visiting 2026-01-18T00:00:00-08:00 to 2026-01-19T00:00:00-08:00 |

The prompt also handles the case where noise and signal show up in the same sentence. "Did you see that touchdown? Also, remind us to call Grandma tomorrow at 5." Touchdown: dropped. Grandma: saved.

There's also room-aware filtering baked into the extraction. The kitchen gets stricter filtering for casual meal comments. The living room assumes most audio is TV. The bedroom is treated as private unless a concrete plan is being discussed. Today, the room isn't detected automatically; you set it when you name the device during setup ("Kitchen Display"). It's crude. But even as a static hint, it shifts behavior in useful ways. We have ~5 live prototypes running in homes, each with a static room label.

How memories live and die

We landed on two memory "types", permanent and temporary, and rely on the LLM's world understanding to interpret everything else. The only difference between them is that temporary memories get an expires_at timestamp, and a cleanup job deletes them once they expire. Permanent memories stick around until they are explicitly archived.

We first experimented with sophisticated structured memories with explicit types and fields ({"type": "shopping_item", "item": "milk"}) but found the architecture brittle. We got much better consistency in practice from simple markdown, such as "Shopping list:\n- milk\n- eggs". This is especially true with on-device models, which routinely had issues with complex JSON but handle the equivalent markdown just fine.

Appointments expire at event time. "Dentist on Friday at 10" gets an expires_at timestamp, and once Friday at 10 passes, it gets cleaned up. Events (broader things like "mom visiting next weekend") also expire. Household facts like the home address stick around forever.
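Conceptually, the cleanup job is just a filter over expiry timestamps. A minimal sketch (field names are illustrative; the real job runs against the database):

```typescript
interface Mem {
  memory: string;
  expiresAt?: Date; // absent for permanent memories
}

// Keep permanent memories and any temporary memory that hasn't expired yet.
function sweepExpired(memories: Mem[], now: Date): Mem[] {
  return memories.filter((m) => !m.expiresAt || m.expiresAt > now);
}
```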

Every relative date gets converted to an absolute ISO 8601 timestamp at extraction time. "Tomorrow at 3pm" becomes 2026-01-16T15:00:00-08:00. We experimented with a "human style" format ("Friday, January 16, 2026, 3:00 PM PST") but didn't find it more reliable for model understanding, so we stuck with ISO 8601, which also parses deterministically in non-LLM workflows.

My favorite detail in the whole system, which came out of our user research and testing, is the midnight grace period:

If the current time is between midnight and 3am, interpret "tomorrow" as the current calendar day (people still consider it "today" until ~3am). After 3am, "tomorrow" refers to the next calendar day.

We didn't design this upfront. We found it. At 12:30am someone said "the plumber is coming tomorrow morning" and they obviously meant later that same day, not the day after. It's the kind of thing you only discover by running the system in actual kitchens with actual families who stay up too late.
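The rule itself is tiny. A sketch (illustrative names, not Juno's actual code):

```typescript
// Resolve the word "tomorrow" to a calendar day, applying the midnight
// grace period: before 3am, "tomorrow" still means the current day.
function resolveTomorrow(now: Date): Date {
  const day = new Date(now);
  if (now.getHours() >= 3) {
    day.setDate(day.getDate() + 1); // after 3am: the next calendar day
  }
  day.setHours(0, 0, 0, 0); // normalize to midnight local time
  return day;
}
```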

The schema:

CREATE TABLE memories (
  id UUID PRIMARY KEY,
  household_id UUID NOT NULL REFERENCES households(id) ON DELETE CASCADE,
  memory TEXT NOT NULL,
  transcript TEXT NOT NULL,
  device_id UUID REFERENCES devices(id) ON DELETE SET NULL,
  expires_at TIMESTAMPTZ,
  archived BOOLEAN NOT NULL DEFAULT FALSE,
  embedding vector(1024),
  embedding_model TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Notice the transcript column. We keep the original audio transcription next to every extracted memory. Because of their history with voice assistants, our testers started out with very low trust in the system's ability to understand them. Showing the transcript alongside the memory (we currently show a toast when a memory is saved) gave them confidence: "oh, it heard me correctly, I can trust it to remember this." People actually scroll the memory list just to read the transcripts of what the system heard, even for memories they don't care about. It's a small thing, but I think it makes a big difference in building trust in an always-on system.

Deduplication

Here's what happens in a real morning. 8am: "we need milk." Saved ("Shopping list:\n- milk"). 8:45am: "oh and grab eggs." Saved ("Shopping list:\n- eggs"). 9:15am: "oh man we're out of milk again." Saved ("Shopping list:\n- milk"). Now you have three shopping list fragments floating around. At 9:20 the dedup job runs and merges them into "Shopping list:\n- milk\n- eggs."

Why every 20 minutes? We tried a few intervals. Daily and hourly were too slow; the assistant would still answer "what's on the shopping list?" correctly (it could merge the three fragments in its context window), but latency went up as more tokens piled into the context. Twenty minutes was the interval where the cost felt reasonable and the UX wasn't noticeably stale. A nice side effect of on-device inference is that we don't have to worry about inference cost and can treat LLM calls as "free".

The dedup job pages through memories in chunks:

const MEMORY_CHUNK_SIZE = 100;
const MEMORY_CHUNK_OVERLAP = 10;
const RECENT_MEMORY_WINDOW_MS = 24 * 60 * 60 * 1000; // 24 hours

For each household, we grab the last 24 hours of memories, then walk through the entire history in batches of 100 with a 10-memory overlap. The overlap is because related memories can land right on a chunk boundary and you'd miss the merge without it.
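The overlap logic is easy to sketch (illustrative, not the production job): advance through the list in windows of 100 while stepping forward by only 90, so the last 10 memories of each chunk reappear at the start of the next one.

```typescript
// Yield overlapping chunks: a window of `size`, advancing by `size - overlap`.
function* chunksWithOverlap<T>(items: T[], size: number, overlap: number): Generator<T[]> {
  const step = size - overlap;
  for (let i = 0; i < items.length; i += step) {
    yield items.slice(i, i + size);
    if (i + size >= items.length) break; // this window already covered the tail
  }
}
```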

Each chunk goes to an LLM with structured output. The model gets recent and older memories and returns a JSON array saying which ones to merge and what the merged text should be. We never delete or mutate memories; instead we "supersede" them. This lets us keep a full history of what Juno knows. We have some interesting ideas for UX around this context graph, and we think it could be used to build automated workflows based on the observed history of memories.

We also use superseding to power some of the UI. When a user marks a todo item as done, we create a new memory that supersedes the old one with the same text but wrapped in strikethrough formatting. "Pick up the dry cleaning" becomes "~~Pick up the dry cleaning~~".
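A sketch of that supersede-on-complete flow (the `supersedes` field here is hypothetical shorthand for however the history link is actually stored):

```typescript
interface MemoryRecord {
  id: string;
  memory: string;
  supersedes?: string; // id of the memory this record replaces, if any
}

// Completing a todo never mutates the original record: we append a new
// strikethrough version that points back at the memory it supersedes.
function completeTodo(old: MemoryRecord, newId: string): MemoryRecord {
  return { id: newId, memory: `~~${old.memory}~~`, supersedes: old.id };
}
```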

The most common merges we see are exact duplicates (milk mentioned twice) and fragments of the same list, like the milk-and-eggs morning above.

Search

Memories get embedded with Qwen3-Embedding-8B at 1024 dimensions. The model supports anywhere from 32 to 4096 dimensions via Matryoshka Representation Learning. We picked 1024 because pgvector's HNSW index doesn't support vectors above 2,000 dimensions, and 1024 still retrieves well.
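Under MRL, choosing a smaller dimension is (roughly) a matter of truncating the vector and re-normalizing. A sketch under that assumption:

```typescript
// Truncate an MRL-trained embedding to its leading `dims` values and
// re-normalize to unit length so cosine similarity still behaves.
function truncateEmbedding(full: number[], dims: number): number[] {
  const v = full.slice(0, dims);
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0)) || 1;
  return v.map((x) => x / norm);
}
```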

The assistant talks to memory through MCP tools, plus a heuristic that attaches memories to the initial prompt directly (so Juno can answer a query with actual information without an extra round-trip). We embed the user query, run a similarity search against the memory corpus, and attach the top 5 most similar memories to the prompt as immediate context.
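Conceptually that retrieval step is just cosine top-k. A brute-force sketch (production uses pgvector's HNSW index rather than a full scan):

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Return the k memories whose embeddings are most similar to the query.
function topK(
  query: number[],
  memories: { text: string; embedding: number[] }[],
  k = 5
): string[] {
  return [...memories]
    .sort((a, b) => cosine(query, b.embedding) - cosine(query, a.embedding))
    .slice(0, k)
    .map((m) => m.text);
}
```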

What we got wrong

The "TV problem" is still not solved and it bugs me. A sportscaster saying "game at 7 on Sunday" has exactly the same structure as a real household plan. Prompt engineering catches it most of the time, but not always. We're starting the design and dataset collection for a custom speech-to-text model that can do ASR, speaker diarization (separating speakers' fragments from each other: "Speaker A: foo. Speaker B: bar"), and speaker identification (naming the speaker given a set of voice profiles: "Jane: foo. Speaker A: bar"). We're also hoping to add sound identification (glass breaking, dog barking, etc.), but that's a stretch goal.

The shared household memory pool creates privacy situations we're still working through. In the current design, everyone in the family shares the same memory corpus. Should a child be able to see a memory their parents created? Our current answer is to deliberately keep memory extraction household-wide with no per-person scoping, because a kitchen device hears everyone equally. But "deliberately chose" doesn't mean "solved." We're hoping our in-house STT will enable per-person memory tagging, and then we can experiment with scoping memories to specific people or groups within the household.

The 5-second debounce between transcribing and extracting came from trial and error. Our initial spec was 500ms. In practice, 500ms was a disaster: the LLM got half-thoughts as input, because people routinely pause for more than half a second mid-sentence. We kept bumping the debounce up until extraction quality stopped improving, landing at 5 seconds. We've also prototyped LLM-based end-of-thought classifiers, which we still believe would beat a heuristic timer, but that thread is ongoing.

Forgetting as a feature

Every AI memory product we see is optimizing for the same direction: remember more stuff. Bigger context windows. Better retrieval. More comprehensive knowledge graphs.

Working on Juno taught me the opposite thing. When your AI sits in a room and hears everything, the valuable engineering is in what you throw away. Those 800 utterances Juno ignored last Tuesday weren't a bug. That was the whole point.

This thing runs on a real device in a real kitchen where someone walks by from six feet away and glances at it to check if there's anything they need to know. The plumber at 8. The dish soap. That's it. Everything else is noise, and Juno's job is to know the difference.

If you've worked on the TV-vs-conversation classification problem, or dealt with multi-person memory conflicts, I want to hear what you tried. Especially the things that didn't work. Email me at adam@juno-labs.com.