This image is of me looking into one end of a system we built to explore the question of just how good conventional video conferencing can be made. We used a two-way mirror to allow us to mount a high-quality camera behind a glass surface on which you see projected the image from an LCD monitor (lying flat on the table). By connecting two of these boxes together, you are able to make normal eye contact while video conferencing: because the camera sits right ‘behind’ the image of the other person, you are able to look right at them.
Using Skype or FaceTime is never quite right, because you can’t make eye contact – the camera is above the edge of the screen, but the image you are looking at is below it. We also used HDMI 1080p cameras and monitors to create a very high resolution, low latency connection.
What we found using this system was that although real eye contact is very compelling, you still don’t enjoy it as much as you’d expect… as with typical video conferencing, it still feels very far from ‘being there’.
Digging into this, a few problems stand out:
Video latency - As surprising as it may seem (and John Carmack has written at length on the details behind this), a camcorder with HDMI output connected directly to an LCD TV monitor at 60Hz still exhibits a video delay of typically 100 ms! This is really frustrating, as the 60Hz camera/screen rate would lead you to expect a delay of around 16 ms, roughly one frame, and yet you see more than 5x that delay.
Lighting - Even when you aren’t lit by the ugly white glare of a laptop monitor, it is almost impossible to create a lighting setup that simultaneously makes you look good to the other person while enabling you to see them. The warm white ‘magic hour’ light that makes YOU look best is shining right in the corner of your eye while you are trying to make out the other person. It’s like stage lighting – you can’t look good on stage, AND see the audience.
Audio - We used high quality XLR-linked sets of high-exclusion headphones, but a big factor that is missing when using video conferencing as compared to face-to-face is the spatialization of audio. For example, if you turn your head while talking to me so that you are no longer pointing right at me, I can easily hear the effect – the frequencies and timing of the sound change. This information is missing when we are on headphones.
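To put rough numbers on the latency point, here is a quick sketch of the arithmetic in Python. It uses only the 60Hz rate and the 100 ms figure mentioned above; the function names are made up for illustration and this is not a measurement of our actual setup.

```python
def frame_period_ms(refresh_hz: float) -> float:
    """Duration of a single frame, in milliseconds, at the given rate."""
    return 1000.0 / refresh_hz


def delay_in_frames(delay_ms: float, refresh_hz: float) -> float:
    """Express a delay as a number of frame periods at the given rate."""
    return delay_ms / frame_period_ms(refresh_hz)


# Figures taken from the text: a 60Hz camera/screen and ~100 ms observed delay.
period = frame_period_ms(60.0)
frames = delay_in_frames(100.0, 60.0)

print(f"One frame at 60Hz lasts about {period:.1f} ms")
print(f"A 100 ms delay spans about {frames:.0f} frame periods")
```

In other words, the delay you actually see is on the order of six frames, not one.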
These observations support the idea that it could be better to optimize the technology that can turn us into highly realistic avatars, rather than improving on the state of the art in video conferencing. Instead of trying to capture the photons reflecting off your poorly-lit human face, we could capture sensor data that tells us about your movement, gaze, and facial cues at high resolution, and use that data to animate an avatar. Once we have you as an avatar, of course, we can do any lighting we’d like. We can also put you on a beach or in a huge boardroom filled with monitors, but that’s a story for another day.