Camera in display
Machine learning and embedded cameras make possible a new class of more natural videoconferencing devices
The idea of embedding cameras in a display is not new. From the earliest days of videoconferencing it was recognized that the separation of the camera and the display meant the system could not convey gaze awareness accurately. Videoconferencing systems remain unable to recreate eye contact—a key element of effective communication.
A second challenge has emerged more recently. The desire to maximize screen size on small devices such as cell phones leaves little room outside the display to locate a camera.
Placing cameras behind the screen could solve these problems, but doing so tends to degrade the image. Diffraction from the screen’s pixel structure can blur the image, reduce contrast, cut the usable light level, and even obstruct parts of the scene entirely, in ways that depend on the display’s pixel design.
Furthermore, videoconferencing is oblivious to spatial factors in other ways as well, distorting the perceived size and position of participants compared with a conversation in real space.
In this project we investigate how machine learning can help overcome some of the image degradation problems associated with placing cameras behind the display, and can help frame remote conversations in a more natural spatial environment.
The Perspective Problem
Locating the camera above the display results in a vantage point that differs from a face-to-face conversation. Especially with large displays, this can create a sense of looking down on the person speaking.
Worse, the distance between the camera and the display means that participants will not experience a sense of eye contact. If I look directly into your eyes on the screen, you will see me apparently gazing below your face. Conversely, if I look directly into the camera to give you a sense that I am looking into your eyes, I can no longer actually see your eyes, and I may miss subtle non-verbal feedback cues.
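The size of the mismatch is easy to estimate. As a rough illustration with assumed numbers (a camera mounted about 10 cm above where the remote person’s eyes appear, and a viewer seated about 60 cm from the screen), the apparent gaze error is already around ten degrees:

```python
import math

# Illustrative geometry only: these distances are assumptions, not measurements.
camera_offset_cm = 10.0        # camera sits this far above the on-screen eyes
viewing_distance_cm = 60.0     # typical desktop viewing distance (assumed)

gaze_error_deg = math.degrees(math.atan(camera_offset_cm / viewing_distance_cm))
print(f"apparent gaze error: {gaze_error_deg:.1f} degrees")   # about 9.5 degrees
```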
Taken together, the result can look more like surveillance video than a conversation!
These incongruities cause videoconferencing to fall short of its potential to create a sense of presence and faithfully reproduce the richness of face-to-face conversation.
Relocating the camera to the point on the screen where the remote participant’s face appears would achieve a natural perspective and a sense of eye contact.
The Diffraction Difficulty
With transparent OLED displays (T-OLED), we can position a camera behind the screen, potentially solving the perspective problem. But because the screen is not fully transparent, looking through it degrades image quality by introducing diffraction and noise.
Here we can see the effect of photographing a simple image through a T-OLED screen:
Pixel structure affects diffraction. In this example, the camera views the scene through a long, thin rectangular opening in the screen:
Not surprisingly, this causes significant degradation, but only in the horizontal direction. We can visualize this effect by plotting the modulation-transfer function (MTF):
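For readers who want to reproduce the shape of this curve, here is a minimal sketch of how such an MTF can be simulated. It models the pixel opening as an idealized rectangular aperture with assumed dimensions, not the actual display geometry:

```python
import numpy as np

N = 512                                   # simulation grid size
aperture = np.zeros((N, N))
# Long, thin rectangular opening: 200 samples tall, 10 samples wide (assumed).
aperture[N//2 - 100:N//2 + 100, N//2 - 5:N//2 + 5] = 1.0

# For incoherent imaging, the point-spread function is |FFT(aperture)|^2,
# and the MTF is the normalized magnitude of the Fourier transform of the PSF.
psf = np.abs(np.fft.fftshift(np.fft.fft2(aperture))) ** 2
otf = np.fft.fft2(np.fft.ifftshift(psf))
mtf = np.abs(np.fft.fftshift(otf))
mtf /= mtf.max()

# Cross-sections through zero frequency show the asymmetry:
mtf_horizontal = mtf[N//2, :]   # falls off quickly -> strong horizontal blur
mtf_vertical = mtf[:, N//2]     # stays high much longer -> little vertical blur
```

Plotting the two cross-sections reproduces the behavior described above: the narrow dimension of the opening is the one that loses high spatial frequencies.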
Image Recovery with U-Net
To compensate for the image degradation inherent in photographing through a T-OLED screen, we used a U-Net neural-network structure that both improves the signal-to-noise ratio and de-blurs the image.
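The sketch below shows the general shape of such a network in PyTorch. It is a minimal, illustrative U-Net rather than the exact architecture or training setup used in this project: an encoder-decoder with skip connections that maps a degraded frame captured through the display to a restored frame. Training pairs would consist of a frame photographed through the T-OLED and a directly photographed reference, with a pixel-wise loss such as L1.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, the basic U-Net building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class SmallUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.out = nn.Conv2d(32, 3, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                                      # full resolution
        e2 = self.enc2(self.pool(e1))                          # 1/2 resolution
        b = self.bottleneck(self.pool(e2))                     # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))    # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # skip connection
        return self.out(d1)                                    # restored RGB frame

model = SmallUNet()
restored = model(torch.randn(1, 3, 256, 256))   # degraded frame in, clean frame out
```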
We were able to achieve a recovered image that is virtually indistinguishable from an image that was photographed directly.
The ability to position cameras in the display and still maintain good image quality provides an effective solution to the perennial problems of gaze awareness and perspective.
Framing the Conversation
Spatial factors also affect conversational dynamics, but are not taken into account by current videoconferencing systems.
Both the arrangement of participants relative to each other and the distance between them (proxemics) are meaningful aspects of non-verbal communication.
Such factors could be applied to the virtual environment of a remote conversation by adjusting the speaker’s position and size on the display.
Image Segmentation
We devised a convolutional neural network (CNN) structure to find the speaker within the image.
First, we performed semantic segmentation to recognize and locate human forms within the image.
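A minimal version of this step can be sketched with an off-the-shelf segmentation model. The code below uses torchvision’s pretrained DeepLabV3 purely for illustration; it is not necessarily the network we used, but it produces the same kind of per-pixel "person" mask:

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# Pretrained model using the Pascal VOC label set, where class 15 is "person".
model = deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def person_mask(pil_image):
    """Return a boolean H x W mask marking pixels classified as a person."""
    batch = preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        logits = model(batch)["out"][0]    # (num_classes, H, W)
    return (logits.argmax(dim=0) == 15).numpy()
```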
Next, we performed depth segmentation to find the nearest individual, whom we pick out as the current speaker. (This simple technique works well when there is a single main speaker, but more sophisticated techniques could be applied that take multiple factors into account in order to handle more complex multi-speaker scenarios.)
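A simple sketch of the selection step, assuming we already have the person mask from above and an aligned depth map (from a depth sensor or a monocular depth network, which is an assumption here), could look like this:

```python
import numpy as np
from scipy import ndimage

def pick_nearest_person(person_mask, depth_map):
    """Return a mask covering only the person closest to the camera."""
    labeled, num_people = ndimage.label(person_mask)   # split the mask into individuals
    if num_people == 0:
        return np.zeros_like(person_mask)
    # Median depth of each connected region; the smallest median is the nearest person.
    median_depths = [np.median(depth_map[labeled == i]) for i in range(1, num_people + 1)]
    return labeled == (int(np.argmin(median_depths)) + 1)
```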
Correcting Scale
Having identified the speaker in the remote view, we can scale the incoming video so that the remote participant appears in a lifelike size on the local display.
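The required scale factor follows from the display’s physical pixel pitch and a nominal body dimension. The numbers below are assumptions for illustration (a head height of roughly 0.23 m and a 0.6 m-wide, 3840-pixel display), not measurements from this project:

```python
NOMINAL_HEAD_HEIGHT_M = 0.23          # assumed average adult head height

def lifesize_scale(head_height_px, display_px_per_m):
    """Scale factor that renders the remote head at roughly its physical size."""
    target_px = NOMINAL_HEAD_HEIGHT_M * display_px_per_m
    return target_px / head_height_px

# Example: the detected head spans 180 px in the incoming frame, and a
# 0.6 m-wide display with 3840 horizontal pixels has 3840 / 0.6 = 6400 px/m.
scale = lifesize_scale(head_height_px=180, display_px_per_m=6400)   # about 8.2x
```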
One way to accomplish this would be to zoom the whole image and re-center it on the speaker. However, for this project we went a step further. We extracted and scaled the individual independently of the background, as seen in this video:
Isolating the people from the background opens up additional options. You can screen out a background that is distracting or that contains sensitive information. You can also use the background region to display other information, such as presentation slides or video that the speaker is talking about.
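A compositing step along these lines can be sketched with OpenCV. The frame, boolean speaker mask, and scale factor are assumed to come from the earlier steps, and the replacement background is any image of the desired output size; positioning is kept trivial here for brevity:

```python
import cv2
import numpy as np

def compose_frame(frame, speaker_mask, scale, background):
    """Scale the segmented speaker and composite them over a chosen background."""
    mask = speaker_mask.astype(np.uint8) * 255
    # Resize the speaker pixels and the mask by the same factor.
    scaled_frame = cv2.resize(frame, None, fx=scale, fy=scale)
    scaled_mask = cv2.resize(mask, None, fx=scale, fy=scale,
                             interpolation=cv2.INTER_NEAREST)
    out = background.copy()
    h, w = out.shape[:2]
    # Crop the scaled layers to the output size (top-left alignment for simplicity;
    # a real system would place the speaker deliberately, e.g. centered).
    scaled_frame = scaled_frame[:h, :w]
    scaled_mask = scaled_mask[:h, :w]
    fg = scaled_mask > 0
    out[:scaled_frame.shape[0], :scaled_frame.shape[1]][fg] = scaled_frame[fg]
    return out
```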
Conclusion
Human interaction in videoconferences can be made more natural by correcting gaze, scale, and position, using convolutional neural network segmentation together with cameras embedded in a partially transparent display. The diffraction and noise that result from placing the camera behind the screen can be effectively removed with a U-Net neural network. Segmentation of live video also makes it possible to combine the speaker with a choice of background content.
Neural networks combined with T-OLED displays create a new class of videoconferencing devices.