How Microsoft’s Surface Pro X uses driver-based software and dedicated AI hardware to make video chat more people-friendly.
With so many of us working from home, we’ve shifted into a world where video conferencing has become the main way we connect with colleagues. We spend hours in one-on-ones and group meetings, looking at faces in little boxes on our screens. It is, to be blunt, hard. The cognitive load that comes with trying to parse faces on screens is high, leading to what’s become known as ‘Zoom fatigue’. It’s not limited to Zoom, of course — the same problems are there with whatever you use, be it Google Meet, WebEx, Skype or Microsoft’s Teams.
Microsoft has been working on ways to reduce this strain. One approach is Teams’ Together mode, which changes the way we view faces on a screen. Another relies on specialised machine-learning hardware built into the Arm-based Surface Pro X.
Introducing Eye Contact
Now available to everyone with a Pro X, Eye Contact is designed to work with any app that uses the tablet’s front camera. All you need to do is install Microsoft’s Surface app, switch to the Eye Contact tab and click enable. A preview option lets you compare the processed and unprocessed images: switch the function on and off while looking down at the preview, and you’ll see a slight change in eye position between the two.
Eye Contact doesn’t make huge changes to your image — there’s no shift in head position or in room lighting. All it does is slightly change the position and look of your eyes, making them a little wider and slightly altering the position of your gaze, so it looks as though you’re looking into the camera even if you’re actually focused on the on-screen faces below you.
The resulting effect makes you appear more engaged in the conversation, as if you’re looking into the eyes of the other people in the video meeting. It’s quite subtle, but it does make conversations that little bit more comfortable, as the person you’re talking to is no longer subconsciously trying to make eye contact with you while you peer at your screen.
It’s an oddly altruistic piece of machine learning. You yourself won’t see any benefit from it (unless you’re talking to someone who’s also using a Surface Pro X), but they will see you as more engaged in the call and as a result will be more relaxed and less overloaded. Still, those secondary effects aren’t to be underestimated. The better a call is for some of the participants, the better it is for everyone else.
Using the device hardware
Eye Contact uses the custom AI engine in the Surface Pro X’s SQ1 SoC, so you shouldn’t see any performance degradation: much of the complex real-time computational photography is handed off to it and to the integrated GPU. Everything is handled at the device-driver level, so it works with any app that uses the front-facing camera. It doesn’t matter if you’re using Teams, Skype, Slack or Zoom; they all get the benefit.
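To see why driver-level processing benefits every app at once, here is a minimal Python sketch. It is an illustration only, not Microsoft's driver code: the names CameraDriver, VideoApp and gaze_correction are hypothetical, frames are modelled as short lists of integers, and the "model" is a placeholder that simply adds 1 to each value.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stand-in for image data; a real frame would be a pixel buffer.
Frame = List[int]


def gaze_correction(frame: Frame) -> Frame:
    """Hypothetical placeholder for the ML model: adds 1 to every value."""
    return [value + 1 for value in frame]


@dataclass
class CameraDriver:
    """Hypothetical driver that runs its filters before any app sees a frame."""
    filters: List[Callable[[Frame], Frame]] = field(default_factory=list)

    def capture(self, raw: Frame) -> Frame:
        processed = list(raw)  # copy so the raw sensor data is left untouched
        for apply_filter in self.filters:
            processed = apply_filter(processed)
        return processed


class VideoApp:
    """Hypothetical conferencing app; it only ever talks to the driver."""

    def __init__(self, name: str, driver: CameraDriver) -> None:
        self.name = name
        self.driver = driver

    def next_frame(self, raw: Frame) -> Frame:
        # The app has no knowledge of the filter; it just asks for a frame.
        return self.driver.capture(raw)


if __name__ == "__main__":
    driver = CameraDriver(filters=[gaze_correction])
    raw = [10, 20, 30]
    for app in (VideoApp("Teams", driver), VideoApp("Zoom", driver)):
        print(app.name, app.next_frame(raw))
```

Because the filter sits below the apps, neither hypothetical app needs to change to receive the corrected frame, and switching the filter off in one place switches it off everywhere.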
There’s only one constraint: the Surface Pro X must be in landscape mode, as the machine learning model used in Eye Contact won’t work if you hold the tablet vertically. In practice that shouldn’t be much of an issue, as most video-conferencing apps assume that you’re using a standard desktop monitor rather than a tablet PC, and so are optimised for landscape layouts.
The question for the future is whether this machine-learning approach can be brought to other devices. Sadly it’s unlikely to be a general-purpose solution for some time: it needs to be built into the camera drivers, and here Microsoft has the advantage of owning both the camera software and the processor architecture in the Surface Pro X. Microsoft has plenty of experience in designing and developing the Deep Neural Network (DNN) hardware used in the custom silicon in both generations of HoloLens, and it’s reasonable to assume that some of that learning went into the design of the Surface Pro X silicon (especially as the same team appears to have been involved with the design of both chipsets).
For the rest of the Intel- and AMD-based Surface line, we’ll probably have to wait either for a new generation of processors with improved machine-learning support, or for Microsoft to unbundle its custom AI engine from its Arm-based SQ1 processor into a standalone AI accelerator like Google’s TPUs.
Real-time AI needs specialised silicon
The AI engine is a powerful piece of compute hardware in its own right, able to deliver 9 TFLOPs. It’s here that Microsoft runs the Eye Contact machine-learning model, calling it from a computational photography model in the Surface Pro X’s camera driver. Without dedicated hardware like this available across all Windows PCs it’s hard to imagine a generic Eye Contact service available to any internal or external camera — even with Windows 10’s support for portable ONNX machine-learning models.
Even though Intel’s latest Tiger Lake processors (due in November 2020) add DL Boost instructions to improve ML performance, they don’t offer the DNN capabilities of the SQ1’s dedicated AI silicon. We’re probably two to three silicon generations away from these capabilities being available in general-purpose CPUs. There is the possibility that next-generation GPUs could support DNNs like Eye Contact’s, but you’re likely to be looking at expensive, high-end hardware designed for scientific workstations.
For now, it’s perhaps best to think of Eye Contact as an important proof-of-concept tool for future AI-based cameras, whether they use SoC AI engines like the SQ1’s, general-purpose GPU compute on discrete graphics via OpenCL or CUDA, or processor ML inferencing instruction sets. By building AI models into device drivers, we can provide advanced capabilities to users who simply plug in a new device. And if new machine-learning techniques deliver new features, they can be shipped with an updated device driver. Until then, we’ll have to take advantage of every little bit of power in the hardware we have to make video conferences better for as many people as possible.
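As a rough illustration of how a future camera driver might choose between those three options, the Python sketch below picks the first available backend from a fixed preference list. The Backend names, the pick_backend function and the preference order itself are assumptions made for this example; they do not describe any shipping driver.

```python
from enum import Enum
from typing import Iterable, Optional


class Backend(Enum):
    """Hypothetical labels for the three kinds of hardware mentioned above."""
    SOC_AI_ENGINE = "SoC AI engine"
    DISCRETE_GPU = "discrete GPU (OpenCL or CUDA)"
    CPU_ML_INSTRUCTIONS = "CPU ML instructions"


# Assumed order: dedicated silicon first, general-purpose CPU last.
PREFERENCE = (
    Backend.SOC_AI_ENGINE,
    Backend.DISCRETE_GPU,
    Backend.CPU_ML_INSTRUCTIONS,
)


def pick_backend(available: Iterable[Backend]) -> Optional[Backend]:
    """Return the most preferred backend that is present, or None."""
    present = set(available)
    for backend in PREFERENCE:
        if backend in present:
            return backend
    return None  # no suitable hardware: the feature stays switched off


if __name__ == "__main__":
    choice = pick_backend([Backend.CPU_ML_INSTRUCTIONS, Backend.DISCRETE_GPU])
    print(choice.value if choice else "feature unavailable")
```

Returning None rather than raising an error reflects one possible design choice: on hardware with no suitable accelerator, the camera would simply deliver unprocessed frames.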