Mr. Bones was implemented as a distributed system of four specialized processes running on a Raspberry Pi, communicating through POSIX message queues and pipes to enable real-time conversation and physical animation.
System Architecture
The system is composed of four independent processes that work in concert:
Jaw Controller
This process handles the physical actuation of the animatronic's jaw. It opens a POSIX message queue at /jaw-control and listens for single-byte commands in a simple binary protocol:
0x01 → Start talking (jaw movement)
0x00 → Stop talking (jaw at rest)
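A minimal sketch of what that receive loop might look like, assuming the queue is created with one-byte messages; the servo calls are hypothetical stand-ins for the actual actuation code (build with -lrt):

```c
#include <fcntl.h>
#include <mqueue.h>
#include <stdint.h>
#include <stdio.h>

#define JAW_QUEUE "/jaw-control"

int main(void) {
    struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 1 };
    mqd_t q = mq_open(JAW_QUEUE, O_CREAT | O_RDONLY, 0644, &attr);
    if (q == (mqd_t)-1) { perror("mq_open"); return 1; }

    uint8_t cmd;
    for (;;) {
        /* Block until the speech process sends a one-byte command. */
        if (mq_receive(q, (char *)&cmd, sizeof cmd, NULL) == -1)
            continue;
        if (cmd == 0x01)
            ; /* start_jaw_motion();  hypothetical servo call */
        else if (cmd == 0x00)
            ; /* stop_jaw_motion();   hypothetical servo call */
    }
}
```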
Neck Controller
This process manages the neck's motion states. It listens on a message queue at /neck-control and responds to the following single-byte commands:
0x00 → Idle: glance around at random angles (with added noise) every 2-8 seconds
0x01 → Listening: face straight ahead, with slight movement every 5 seconds
0x02 → Thinking: look upward in thought, with subtle head movements
Speech Process
This is the central coordination process that uses Vosk for speech-to-text recognition. It manages a state machine with two primary states:
- Wake Word Detection: Listens continuously for "Hey Mr Bones"
- Conversation Mode: Captures user input after wake word detection
When speech is detected, it sends the transcribed text to the LLM process through a pipe and coordinates the physical response by sending the appropriate commands to the jaw and neck controllers. When the LLM responds, it synthesizes the reply with Piper text-to-speech and plays the audio through the speakers.
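A sketch of what that coordination step might look like, assuming the LLM child's stdin/stdout are already wired to llm_in/llm_out and both queues are open; speak_with_piper() is a hypothetical stand-in for Piper synthesis and playback, and the single blocking fgets() is a simplification of the token-by-token streaming described under Performance Optimization:

```c
#include <mqueue.h>
#include <stdint.h>
#include <stdio.h>

static void send_cmd(mqd_t q, uint8_t cmd) {
    /* One-byte commands, matching the jaw/neck protocols above. */
    mq_send(q, (const char *)&cmd, sizeof cmd, 0);
}

void handle_user_utterance(const char *text, FILE *llm_in, FILE *llm_out,
                           mqd_t jaw_q, mqd_t neck_q) {
    send_cmd(neck_q, 0x02);            /* Thinking pose while the LLM works */
    fprintf(llm_in, "%s\n", text);     /* Hand the transcript to the LLM */
    fflush(llm_in);

    char reply[4096];
    if (fgets(reply, sizeof reply, llm_out) != NULL) {
        send_cmd(jaw_q, 0x01);         /* Jaw moves while audio plays */
        /* speak_with_piper(reply);       hypothetical TTS + playback */
        send_cmd(jaw_q, 0x00);         /* Jaw back at rest */
    }
    send_cmd(neck_q, 0x01);            /* Return to listening pose */
}
```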
LLM Backend
Built using llama.cpp with calme-3.1-llamaloi-3b-Q6_K.gguf as the backing pretrained model, this process reads conversation input on stdin (redirected from the speech process via a pipe) and streams its response token-by-token on stdout. The model was chosen for its balance of response quality and resource usage, which fits the Raspberry Pi's limited CPU and memory.
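One plausible way to wire this up from the speech process's side is a classic fork/exec with two pipes. The binary name and flags below are assumptions, since llama.cpp's CLI varies by version:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Spawns the LLM child process; on success, *in is the write end feeding
   its stdin and *out is the read end of its stdout token stream. */
pid_t spawn_llm(FILE **in, FILE **out) {
    int to_child[2], from_child[2];
    if (pipe(to_child) == -1 || pipe(from_child) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {
        dup2(to_child[0], STDIN_FILENO);    /* child reads prompts on stdin */
        dup2(from_child[1], STDOUT_FILENO); /* child streams tokens on stdout */
        close(to_child[1]);
        close(from_child[0]);
        /* Hypothetical invocation; the actual binary name and flags depend
           on the llama.cpp build being used. */
        execlp("./llama-cli", "llama-cli",
               "-m", "calme-3.1-llamaloi-3b-Q6_K.gguf", (char *)NULL);
        _exit(127);
    }
    close(to_child[0]);
    close(from_child[1]);
    *in  = fdopen(to_child[1], "w");
    *out = fdopen(from_child[0], "r");
    return pid;
}
```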
Performance Optimization
To achieve real-time conversation with minimal latency, the system employs several optimization strategies:
- Decoupled speech recognition and LLM generation into separate processes
- Token-by-token streaming from the LLM to enable overlapping speech synthesis (see the sketch after this list)
- Lightweight binary protocols for inter-process communication
- Predefined personality constraints to reduce LLM computation complexity
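As an illustration of the streaming point, the reader side can flush each completed sentence to TTS as soon as it arrives, rather than waiting for the full reply; speak_with_piper() is again a hypothetical stand-in for the Piper call:

```c
#include <stdio.h>

/* Reads the LLM's stdout one character at a time and hands each completed
   sentence to TTS immediately, so speech playback can begin while later
   tokens are still being generated. */
void stream_reply_to_tts(FILE *llm_out) {
    char buf[1024];
    size_t len = 0;
    int c;
    while ((c = fgetc(llm_out)) != EOF) {
        if (len < sizeof buf - 1)
            buf[len++] = (char)c;
        /* Flush at sentence boundaries. */
        if (c == '.' || c == '!' || c == '?' || c == '\n') {
            buf[len] = '\0';
            /* speak_with_piper(buf);  hypothetical synthesis + playback */
            len = 0;
        }
    }
}
```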
The complete system successfully ran on Halloween, engaging in real-time conversations with my friends and family. It was pretty awesome :)