I’m kind of preoccupied today by the current events in Ukraine, but I figured I’d write down my thoughts on last night’s (my time) WWDC announcements anyway. Starting with the “one more thing”, of course.
The first of the first-impressions posts have come out, and the people who got to try it seem impressed. Apple is uniquely positioned as practically the only company in the world that can pull off both the industrial design to make a headset and the chip design to make it completely self-contained, instead of requiring a tether to an expensive rendering station. But despite those advantages, I’m going to go out on a limb and say Vision Pro will disappoint, because the content for it will be strictly consumer, while the headset will fall short of being actually pro. Nevertheless, it will wind up with some kind of market, because the truly pro devices cost 2–5 times as much, even if “Starting at $3499” will be very much a “starting at” price. We shall see what the add-on pricing will be, but at least the prescription lenses will cost a pretty penny (not because they have to, but because they can).
Foveated rendering. It was barely mentioned (well, it was in an architecture diagram, but I didn’t catch it being said out loud), but it is the key to everything. Eye tracking and saccade synchronization. There’s also a huge amount of tech specs and impressively miniaturized computing embedded into the headset, but those three keywords are the ones you have to understand in order to understand the whole. I’ve come to understand these by having read a whole bunch of stuff on them, and having talked to the people at Varjo (though about nothing secret, which is why I can write this).
A “4K display for each eye” statement may sound impressive, and I’m sure it looks impressive to someone whose exposure to VR/AR devices so far has been something in the class of Oculus. But it both falls short of what human vision needs and is way too much to compute. So a quick and simplified explanation is in order:
We have a very narrow, but very precise field of focal acuity. It’s around 5 degrees, but for the purposes of simplifying technical integration, let’s say we can see details in a 10-degree-wide circle. Within this foveal circle, our acuity is about 50 arcseconds, or just under 0.014 degrees. To have that much resolution inside a 10-degree circle, you have to have 720x720 pixels cover that angle. Not much, huh?
We also have a wide peripheral field of vision of about 200 degrees horizontally and 130 degrees vertically. In this field, our acuity is far, far less, in both spatial and color resolution, but on the other hand, our vision is tuned to see rapid movement and flicker in this range. In order for an AR device to not suffer from “disappearing” objects (like the Hololens did), you have to have pixels covering this field (or most of it) as well, but the resolution there doesn’t have to be nearly as high. Outside the foveal circle, the resolution is just 13 arcminutes, or roughly 0.2 degrees, so to cover that entire field, you need another 1000x650 pixels. Again, not much.
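For anyone who wants to check the arithmetic, here is a minimal sketch of it in Python. The function name is mine and purely illustrative; the angles and acuity figures are the simplified ones from the two paragraphs above, with the peripheral acuity rounded to 0.2 degrees (720 arcseconds) as in the text.

```python
ARCSEC_PER_DEGREE = 3600


def pixels_across(field_deg: float, acuity_arcsec: float) -> int:
    """Pixels needed to span field_deg when one pixel covers acuity_arcsec."""
    return round(field_deg * ARCSEC_PER_DEGREE / acuity_arcsec)


# Foveal circle: 10 degrees wide at 50 arcseconds per pixel.
foveal = pixels_across(10, 50)

# Peripheral field: 200 x 130 degrees at 0.2 degrees (720 arcseconds) per pixel.
peripheral = (pixels_across(200, 720), pixels_across(130, 720))

print(foveal, peripheral)  # 720 (1000, 650)
```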
Actual vision is much more complicated, with the foveal circle being far smaller and surrounded by paracentral, macular, near peripheral, far peripheral and so on ranges, with the acuity dropping in each increment, but I promised a simplified explanation.
So, basically, our vision has one crappy-resolution wide-angle view and another crappy-resolution telescopic view out of each eye. How come we can see in so much detail, then? All the rest of it is neural processing. Our eyes jump their focus point all the time, and our brains stitch all those images together into a field where we think we see a lot wider area in high detail. I’ll come back to that in a moment.
Computing for vision
For purposes of computing, a 4K display for each eye is then both radical overkill and not at all sufficient. We didn’t once hear what the field of view of the Vision Pro is, but Oculus devices reach around 100 degrees, while the leading consumer devices have reached 120 degrees. For our purposes, let’s assume Vision Pro is also 120 degrees. That is, it can render virtual scenes into our mid-peripheral vision, but not into the far peripheral, so you couldn’t simulate playing football with it, but you can certainly watch football. Apple didn’t quote the real resolution, but did mention a total of 23 million pixels, which I’m guessing is roughly two 4K internal displays and one lower-resolution display on the outside for rendering your face to other people (because the goggles certainly aren’t transparent). That would come to approximately 120°/4000 = 108 arcseconds per pixel across the entire field of view. Again, impressive, but about half of 20/20 acuity, and simultaneously massive overkill for everything beyond the foveal center.
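The same back-of-the-envelope estimate as a small Python sketch. Both inputs are my guesses from above (a 120-degree field of view and 4000 pixels across), not figures Apple has published, and the 50-arcsecond acuity is the simplified number used earlier in this post.

```python
def arcsec_per_pixel(fov_deg: float, pixels_across: int) -> float:
    """Angle covered by one pixel, in arcseconds, for a given field of view."""
    return fov_deg * 3600 / pixels_across


# Assumed values: 120-degree field of view, 4000 pixels across.
display_arcsec = arcsec_per_pixel(120, 4000)

# Compare against the roughly 50 arcseconds of foveal acuity used earlier.
fraction_of_foveal_acuity = 50 / display_arcsec

print(display_arcsec)                       # 108.0
print(round(fraction_of_foveal_acuity, 2))  # 0.46
```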
And here is where computing comes into what was so far an optical discussion. That peripheral vision requires rendering at no less than 120 frames per second, but rendering 23 million pixels 120 times per second is too much work even for the heaviest GPU rigs, let alone a half-kilo wearable device, Apple’s custom silicon notwithstanding. What do you do?
You don’t. Remember, you need trivially little resolution to cover our foveal circle of acuity. The problem is figuring out where that foveal circle lies, because our eyes move independently of our head position. Which is where eye tracking comes into play. If you know where the wearer’s vision is centered, you render that area in full resolution (a 500x500 area on Apple’s displays covers, according to our assumptions so far, a circle of about 15 degrees), and all the rest of the display in much lower resolution (full HD 1920x1080 also being massive overkill). This is foveated rendering.
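To put rough numbers on the savings, here is a sketch of the pixel budget under the assumptions in this post: Apple’s quoted 23 million pixels in total, 120 frames per second, and a 500x500 full-resolution inset plus a full HD periphery per eye. The function names are made up for illustration, and this only counts pixels; it says nothing about how the rendering pipeline actually works.

```python
FPS = 120
TOTAL_PIXELS = 23_000_000  # Apple's quoted total across all displays


def pixels_per_second(pixels_per_frame: int, fps: int = FPS) -> int:
    """Raw pixel throughput for a given frame size and frame rate."""
    return pixels_per_frame * fps


def foveated_frame(inset_px: int, periphery_w: int, periphery_h: int,
                   eyes: int = 2) -> int:
    """Pixels per frame: a square full-resolution inset plus a low-res periphery.

    The inset overlaps part of the periphery, so this slightly overcounts.
    """
    return eyes * (inset_px * inset_px + periphery_w * periphery_h)


naive = pixels_per_second(TOTAL_PIXELS)
foveated = pixels_per_second(foveated_frame(500, 1920, 1080))

print(naive)                        # 2760000000
print(foveated)                     # 557664000
print(round(naive / foveated, 1))   # 4.9
```

So even with a periphery that is itself overkill, the work drops roughly fivefold. With the 1000x650 periphery estimated earlier, the same formula gives 1.8 million pixels per frame for both eyes.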
But the eye moves constantly. And this is where eye tracking and saccade sync are critical. Not much was discussed about eye tracking, but it’s probably one of the most crucial parts of the whole Vision Pro rig, and a minefield of patents. The US defense industry and their pilot head-mounted displays have contributed most of the early research, and Varjo (which I mentioned before) has done a lot to miniaturize the tech into something that can approach, if not quite reach, consumer deployments. We’re typically talking about IR projectors aimed at the eye and cameras looking at the reflection off the pupil, and that appears to be what Apple is doing, too.
The other part, saccade synchronization, is what allows a fast computer to react to the eye moving. A saccade is the very rapid period during which the eye is in movement, lasting from 20 to 200 milliseconds, while the foveal center moves anywhere from a few degrees to several dozen degrees in one direction (as compared to our general head direction, or where the headset and its displays would be pointing). And during this movement, we are blind. Literally so. If something was presented in front of you only during the period while your eye is moving its focal point in any direction (as opposed to adjusting its focal depth), but taken away before the eye stops, your brain would not register the object in your field of vision. At all.
Similarly, the headset won’t have to render. But it will have to predict where your eye will stop its movement, and be prepared with a new, high-resolution render in that position in time for the eye to stop there. So eye tracking on its own isn’t enough. You also have to have excellent predictive methods to tell where the saccade will stop.
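To make the “eye is in movement” part concrete, here is a toy sketch of the simplest approach I know of from eye-tracking literature: flag a saccade whenever the gaze moves faster than a velocity threshold. I have no idea whether Apple does anything like this. The threshold, the 200 Hz sample rate and the gaze samples are all made-up assumptions, gaze angles are treated as flat x/y degrees (fine only for small movements), and the much harder part, predicting where the saccade will land, is not attempted here.

```python
import math

# Assumed threshold in degrees per second; a real tracker would tune this.
SACCADE_THRESHOLD_DEG_PER_S = 30.0


def angular_speed(p0, p1, dt):
    """Speed in deg/s between two gaze samples given as (x_deg, y_deg)."""
    return math.hypot(p1[0] - p0[0], p1[1] - p0[1]) / dt


def label_saccades(samples, sample_rate_hz):
    """Return one bool per sample interval: True while the eye is mid-saccade."""
    dt = 1.0 / sample_rate_hz
    return [
        angular_speed(a, b, dt) > SACCADE_THRESHOLD_DEG_PER_S
        for a, b in zip(samples, samples[1:])
    ]


# Hypothetical gaze trace: fixation, a jump of about 9 degrees, fixation again.
gaze = [(0.0, 0.0), (0.01, 0.0), (2.0, 0.5), (6.0, 1.5), (9.0, 2.0), (9.02, 2.0)]
flags = label_saccades(gaze, sample_rate_hz=200)

print(flags)  # [False, True, True, True, False]
```

The intervals flagged True are the ones where, per the above, the wearer is effectively blind and the renderer gets a short window to prepare the next high-resolution region.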
Fortunately, we tend to focus on objects of interest, which the headset would have explicitly rendered, and thus knows precisely about. Except if you’re focusing on something in the real world, in which case the device must try to interpret what’s in front of you (and its forward-facing cameras), and what there is interesting enough to focus on in the direction your eyes are moving towards (that’d probably be a face, a moving object, and so forth).
And that takes us to the next topic entirely, but before that, a kind of conclusion for an early understanding of the Apple Vision Pro: a 10/20 acuity headset, the best consumer AR/VR device presented so far, paired with consumer-focused content and a really hefty price tag; a weird chimera of a value proposition. Personally, I wouldn’t pay $3500-plus for an inflight entertainment and FaceTime device. It won’t be replacing the real pro devices, but it will be extending the reach of pro applications to a not-quite-as-pro audience: auto and air industry executives instead of the auto and air engineers, and so forth.
But what about AI?
Several commentators have expressed disappointment that Apple didn’t talk about AI solutions at all. But they did. They just intentionally chose to not use that word, and I respect them for that choice.
That saccade prediction I just mentioned? A machine learning, vision modeling problem. As is the interesting-objects vision analysis. The upgraded Autocomplete keyboard that turns from a word replacement system into a contextually aware sentence completion engine? That’s an LLM. The Vision Pro FaceTime avatar that must replace a regular video image of a person wearing the Vision Pro goggles, because normal video can’t be captured of them? Several models right there, interpreting expressions (from cameras worn right at your face) and projecting them onto a virtual representation of your face. That’s all spatial computing, and it may be the harder engineering part of the entire keynote.
These applications were sprinkled throughout the event, not once called AI, several times called Machine Learning, and many other times just mentioned, without calling them either, leaving it for the audience to understand.
The most practical new announcement
Safari 17’s dockable web apps, which will no longer group with other Safari windows in Cmd-Tab navigation, but instead present themselves as the application itself. The demo showed Confluence and Pinterest, but I’m excited about several installable apps (like Slack and Figma) I can ditch for just using the web site in a Safari window with its own name, icon and task switching. We’ve had these on mobile devices for ages, and now they’re finally coming to the Mac desktop, too. About damn time.
The most superfluous and annoying bit
I don’t know what it is with Craig Federighi and widgets, but it seems all of the operating systems will now be filled with widgets. I hope I can just ignore them all. Well, I only use macOS anyway, and I’ve ignored widgets so far, too. Did I mention widgets?