Two-hand pose transport codec · zpe-xr v0.3.1 · github.com/Zer0pa/ZPE-XR
In a VR session your hands are always moving — picking up, pointing, reaching across a room. Today that motion is a stream of raw floats, expensive on the network and erased the moment the session ends.
ZPE-XR is a different answer: a sealed 25.9-byte packet for two complete hands per frame, encoded and decoded in 0.057 ms, with byte-identical output on any machine, any year. The transport works on ContactPose. Unity and Meta runtime integration is still external; float16+zlib still wins raw fidelity by 0.2 mm.
Scope: ContactPose transport. The fidelity comparator stands at 0/5 and runtime closure is pending; byte-identical transport is not a fidelity win.
01 · THE GAP · ARRIVED WRONG
VR hands arrive late or too large — the experience breaks before the scene does.
02 · MARKETS · ADJACENT FORECASTS
Release posture: BLOCKED
Hand tracking solutions: $10.9B '33
Extended Reality market: $59.2B '31
Spatial computing: $280B '28
Ultraleap ref revenue: ~$30M
Hand tracking 19.7% CAGR through 2033; XR 41% CAGR to 2031. Transport is the wire all of it runs on.
03 · VALUE
23.9×
Smaller than a raw two-hand frame · 6.63× smaller than Ultraleap VectorHand
04 · INSIGHT
Hand motion data needs to travel.
05.1 · CURRENT TECH · FLOAT STREAM AND ZLIB
XR developers ship hand motion as raw float streams or float16+zlib. Both move bytes. Neither is a transport: no sealed packet, no sequence numbering, no loss recovery, no byte-identical replay, no record after the session ends.
05.2 · OUR TECH · SEALED PACKETS
ZPE-XR encodes two complete hands — 21 joints each — as a sealed, CRC32-checked packet at 25.9 bytes per frame, 23.9× smaller than raw. A backup sequence number recovers from drops without a keyframe stall. Encode plus decode runs in 0.057 ms, and every recorded ContactPose stream plays back the same hands on any machine, any year.
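As a rough illustration of what a sealed, CRC32-checked packet can look like, the sketch below frames an opaque payload with a sequence number, a backup sequence number, and a CRC32 tail. The field widths, the byte order, and the `seal` and `unseal` names are assumptions made for illustration; the source does not document the actual ZPE-XR wire layout or what the backup sequence number references.

```python
import struct
import zlib
from typing import Tuple

# Hypothetical header: sequence number and backup sequence number,
# both unsigned 16-bit little-endian. Not the documented ZPE-XR layout.
HEADER = struct.Struct("<HH")
CRC = struct.Struct("<I")


def seal(seq: int, backup_seq: int, payload: bytes) -> bytes:
    """Prefix the payload with the header and append a CRC32 tail."""
    body = HEADER.pack(seq & 0xFFFF, backup_seq & 0xFFFF) + payload
    return body + CRC.pack(zlib.crc32(body))


def unseal(packet: bytes) -> Tuple[int, int, bytes]:
    """Verify the CRC32 tail and return (seq, backup_seq, payload)."""
    if len(packet) < HEADER.size + CRC.size:
        raise ValueError("packet too short")
    body, tail = packet[:-CRC.size], packet[-CRC.size:]
    (expected,) = CRC.unpack(tail)
    if zlib.crc32(body) != expected:
        raise ValueError("CRC32 mismatch")
    seq, backup_seq = HEADER.unpack_from(body)
    return seq, backup_seq, body[HEADER.size:]
```

A receiver built this way rejects any packet whose bytes changed in transit, and it can read both sequence fields before touching the payload.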
05.3 · BENCHMARKS · CONTACTPOSE MEASURED
Compression: 23.9× vs raw
Enc+dec: 0.057 ms
MPJPE: 0.479 mm
Comparator: 0/5 fidelity
Transport size: PASS
Round-trip speed: PASS
Fidelity: MISS
Scope: ContactPose 5-sequence, 3,500 frames. Transport passes. Fidelity comparator 0/5.
06 · MEASUREMENT · TRANSPORT VS FIDELITY
Transport evidence pairs speed with fidelity.
06.1 · COMPARATIVE PERFORMANCE · CONTACTPOSE BYTES PER FRAME
ZPE-XR: 25.9 bytes/frame
float16+zlib: ~110 bytes/frame
raw float32: 619.5 bytes/frame
comparator fidelity: 0/5
ContactPose five-sequence run, 3,500 frames. ZPE-XR ships 25.9 bytes per two-hand frame — 6.63× under Ultraleap, 1.47× under Photon Fusion. float16+zlib still wins raw fidelity: 0.277 mm vs 0.479 mm MPJPE — comparator 0/5.
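The ratios above follow from the measured bytes-per-frame figures. The sketch below reproduces the 23.9× figure and back-computes the comparator frame sizes implied by the 6.63× and 1.47× ratios; those implied sizes are derived here for illustration and are not quoted from the source.

```python
ZPE_XR_BYTES = 25.9    # measured, per two-hand frame
RAW_F32_BYTES = 619.5  # measured, raw float32

# Compression ratio against the raw float32 frame.
ratio_vs_raw = RAW_F32_BYTES / ZPE_XR_BYTES

# Frame sizes implied by the quoted ratios (derived, not measured here).
implied_ultraleap = ZPE_XR_BYTES * 6.63
implied_photon = ZPE_XR_BYTES * 1.47

print(f"vs raw float32: {ratio_vs_raw:.1f}x")
print(f"implied Ultraleap VectorHand frame: {implied_ultraleap:.1f} bytes")
print(f"implied Photon Fusion frame: {implied_photon:.1f} bytes")
```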
07 · KEY METRICS · MEASURED RESULTS
07.1 · VS RAW
23.9×
vs raw float32 · ContactPose two-hand comparator
07.2 · BYTES / FRAME
25.9 B
two complete hands · 6.63× smaller than Ultraleap
07.3 · ENC + DEC
0.057 ms
encode + decode mean · 3,500-frame ContactPose run
07.4 · MPJPE
0.479 mm
ZPE-XR vs 0.277 mm float16+zlib · fidelity comparator 0/5
07.5 · LOSS @ 10%
0.399%
pose error at 10% loss · 9.5× more resilient than Ultraleap proxy
On the measured ContactPose surface — five sequences, 3,500 frames — every ZPE-XR packet carries a CRC32 tail and a backup sequence number. The recorded stream decodes byte-for-byte the same on any machine, any year. The checksum is a provenance anchor, not just an error detector.
The determinism claim is bounded to the encoded stream — not to the sensor estimating the hand or the engine smoothing the output. float16+zlib still wins raw fidelity: 0.277 mm versus 0.479 mm MPJPE. Comparator 0/5; closing that gap is active research.
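MPJPE (mean per-joint position error) is the average Euclidean distance between each decoded joint and its reference position. A minimal sketch, assuming joints arrive as (x, y, z) tuples in millimetres; the `mpjpe` helper and the sample values are illustrative and not taken from the ZPE-XR code base.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]


def mpjpe(reference: Sequence[Joint], decoded: Sequence[Joint]) -> float:
    """Mean Euclidean distance between matching joints, in the input units."""
    if len(reference) != len(decoded) or not reference:
        raise ValueError("joint lists must be non-empty and the same length")
    total = sum(math.dist(r, d) for r, d in zip(reference, decoded))
    return total / len(reference)


# Illustrative values only: one joint is off by 5 mm, the other is exact.
reference = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
decoded = [(3.0, 4.0, 0.0), (10.0, 0.0, 0.0)]
print(mpjpe(reference, decoded))
```

Averaging this value over all 42 joints and all frames of a sequence would give a single figure comparable in kind to the 0.479 mm and 0.277 mm results quoted above.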
08.2 · THE FIDELITY GAP
Honest Blocker · float16+zlib wins fidelity (0.277 mm vs 0.479 mm). Comparator 0/5. Unity and Meta runtime closure is externally dependent. Photon Fusion semantic parity remains open as a secondary item. Replay-error corpus evidence beyond ContactPose is unresolved. The PyPI zpe-xr 0.3.1 package is stale; 0.3.2 is pending.
09
Hands become persistent data, packet by packet.
09.1 · THE AMBITION
Embodiment in XR has been disposable. ZPE-XR makes it the opposite: a sealed packet small enough to network at chat-app bandwidth, faithful enough to play back as the same hands every time, and structured enough to search across recordings. Headsets, robots, archives, and training corpora share one transport for motion.
09.2 · WHAT WORKS NOW
A ContactPose-bounded transport: 25.9 bytes per two-hand frame, 0.057 ms round-trip, byte-identical replay under packet loss.
09.3 · WHAT'S STILL OPEN
Raw fidelity against float16+zlib, Unity/Meta runtime closure, Photon Fusion parity, broader corpora, and the 0.3.2 release all stay open.
09.4 · TELEPRESENCE · NEAR-TERM (12–24 MO)
Multiplayer hands at messaging-app bandwidth
A four-player social session at 90 fps fits inside 6.84 KB/s — the bandwidth budget of a chat app, not a video call. Social-VR studios stop paying a voice-call price just to render fingers, and continuous embodied presence becomes a default rather than a feature.
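The bandwidth figure can be checked against the measured packet size. The sketch below assumes each client receives the streams of the three other players and that KB means 1,024 bytes; both are assumptions, chosen because they land close to the quoted figure, and the source does not state how 6.84 KB/s was derived.

```python
BYTES_PER_FRAME = 25.9  # measured mean, per two-hand frame
FPS = 90
PLAYERS = 4

# Bytes per second for one player's two hands.
per_stream = BYTES_PER_FRAME * FPS

# Assumption: a client only downloads the three remote players.
downstream = per_stream * (PLAYERS - 1)

# Alternative reading: count all four streams.
all_streams = per_stream * PLAYERS

print(f"one stream:      {per_stream:.0f} B/s")
print(f"3 remote:        {downstream / 1024:.2f} KiB/s")
print(f"all 4 streams:   {all_streams / 1024:.2f} KiB/s")
```

Under those assumptions the three-remote-stream reading comes to about 6.83 KiB/s, within rounding of the quoted figure, while counting all four streams gives about 9.1 KiB/s.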
09.5 · ARCHIVES · NEAR-TERM (12–24 MO)
Embodied sessions become persistent records
A two-hour session compresses to roughly 49 MB with no fidelity drift on replay. Coaching reviews, surgical rehearsal, factory walkthroughs, and forensic playback stop ending when the headset comes off. Embodiment graduates from disposable runtime state into a scrubbable, hash-addressable record.
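One way to picture a hash-addressable record is to derive the address from the recorded packet bytes themselves, so that two byte-identical recordings resolve to the same key. A minimal sketch using SHA-256 over length-prefixed packets; the `record_address` helper and the choice of SHA-256 are assumptions for illustration, not a documented ZPE-XR archive format.

```python
import hashlib
from typing import Iterable


def record_address(packets: Iterable[bytes]) -> str:
    """Return a hex digest that identifies a recorded packet stream."""
    digest = hashlib.sha256()
    for packet in packets:
        # The length prefix makes packet boundaries part of the hashed content.
        digest.update(len(packet).to_bytes(4, "little"))
        digest.update(packet)
    return digest.hexdigest()


# Illustrative payloads only: identical streams share one address.
first = record_address([b"frame-0", b"frame-1"])
second = record_address([b"frame-0", b"frame-1"])
print(first == second)
```

Because the address depends only on the stored bytes, it stays stable across machines for as long as the recorded stream itself is unchanged.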
09.6 · MOTION SEARCH · MID-TERM (24–48 MO)
Hand motion becomes a queryable corpus
Once every frame is hashed and every gesture fingerprinted, recorded sessions become a search surface. “Find every clip where two hands hand off a mug” turns into a tractable query. Coaches, ergonomists, and rehab clinicians get a search bar over embodied behavior.
09.7 · HUMAN-ROBOT REPLAY · MID-TERM (24–48 MO)
Headset hands and robot arms share a clock
When a human demonstration and a robot re-run share one packet format, imitation-learning pipelines and teleoperation review collapse into a single timeline with one parity hash. Human-in-the-loop robotics gets a common ground truth where today it has two stacks talking past each other.
09.8 · PHYSICAL AI · PARADIGM (48 MO+)
Embodiment becomes network infrastructure
The same 26-byte envelope that carries a human hand can carry a robot manipulator across headsets, simulators, training agents, and forensic archives without rewriting at each boundary. Spatial computing stops treating presence as a per-engine reconstruction problem; the network itself carries embodiment as a first-class signal.