August 24, 2023
Traditionally, the way humans interface with computers is by a keyboard, mouse, trackpad or touchscreen. In recent years, voice-based interfaces have become more prevalent, especially through assistants like Alexa and Siri that control smart home devices. Looking toward the future, vision-based interfaces (VBI) are on the rise and will see a number of exciting applications. TV and other forms of video viewing are naturally suited to technologies like VBI, especially as video becomes more interactive.
Computer vision, the underlying technology that enables vision-based interfaces, is based on what a camera connected to a computer “sees.” Using algorithms that analyze those images, a VBI can direct a device to perform various functions and interact with other devices. The input can come from various visual cues, such as a user’s gaze or specific hand gestures.
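To make that concrete, the basic pattern behind most vision-based interfaces is a simple loop: capture a frame, detect a visual cue, dispatch a command. The Python sketch below illustrates that flow under stated assumptions; the `detect_cue` stub and the cue labels are hypothetical stand-ins for a real computer vision model.

```python
# A minimal sketch of the capture -> detect -> dispatch loop behind a
# vision-based interface. The cue detector is a hypothetical stub; a real
# system would run a trained gesture or gaze model on each camera frame.

def detect_cue(frame):
    """Stand-in for a CV model: map a camera frame to a cue label (or None)."""
    return frame.get("cue")  # hypothetical: frames are dicts in this sketch

def turn_on():
    print("Device: power on")

def raise_volume():
    print("Device: volume up")

# Dispatch table: each recognized visual cue maps to one device command.
COMMANDS = {
    "wave": turn_on,
    "point_up": raise_volume,
}

def run(frames):
    for frame in frames:
        action = COMMANDS.get(detect_cue(frame))
        if action:
            action()

# Simulated camera feed: most frames contain no actionable cue.
run([{"cue": None}, {"cue": "wave"}, {"cue": "point_up"}])
```

The dispatch-table structure keeps the detector and the device actions decoupled, so either side can be swapped out independently.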
Just as we use a mouse to move a pointer and click and drag items in a computer interface, we can use computer vision to control various applications. The examples below illustrate some scenarios where this technology can work.
Most of us are familiar with two of the most common uses of computer vision: facial recognition and iris recognition. Using a camera, a computer can recognize a human face and identify unique features with a high degree of accuracy. Similarly, using a camera, computers can detect and identify the unique features of the iris in a person’s eyes. In both cases, the technology is primarily used for biometric identification, because, like fingerprints, our faces and irises are unique.
Many of us use this technology every day: Apple’s Face ID is a standard feature on recent iPhones and can quickly unlock our phones with no other interface required. Other smartphone operating systems have their own facial recognition features. Some secure buildings also use facial or iris recognition for physical access. These are all applications of computer vision.
Unlocking your phone or a door is a straightforward function: the computer recognizes your face, compares it with the stored biometric data of people approved to access the device or facility, and grants access if there is a match. This is a relatively passive use of the technology. However, things become a lot more interesting and complex when we consider the potential of dynamic interaction with computer vision.
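Under the hood, that comparison step usually operates on numeric feature vectors (“embeddings”) rather than raw images. The sketch below shows one common matching scheme, cosine similarity against a threshold; the embeddings, names and the 0.8 threshold are illustrative assumptions, not any vendor’s actual implementation.

```python
import numpy as np

# Hedged sketch: face unlock as nearest-embedding matching. In a real system
# the embeddings would come from a trained face-recognition network; here
# they are random vectors purely for illustration.

rng = np.random.default_rng(0)

# Enrolled users: identity -> face embedding (illustrative random vectors).
ENROLLED = {
    "alice": rng.normal(size=128),
    "bob": rng.normal(size=128),
}

MATCH_THRESHOLD = 0.8  # assumption; real systems tune this carefully

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def authenticate(probe_embedding):
    """Return the matched identity, or None if no enrolled face is close enough."""
    best_name, best_score = None, -1.0
    for name, enrolled in ENROLLED.items():
        score = cosine_similarity(probe_embedding, enrolled)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= MATCH_THRESHOLD else None

# A probe near Alice's enrolled vector unlocks; a random stranger does not.
print(authenticate(ENROLLED["alice"] + rng.normal(scale=0.05, size=128)))  # alice
print(authenticate(rng.normal(size=128)))  # None
```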
Our eyes say a lot about us, and with computer vision, cameras can detect certain subtleties through them. For example, a camera can determine the direction you are looking, or, in some cases, the position of your eyelid. Determining the direction and focus of your gaze can be a powerful tool for user interaction.
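A minimal sketch of one way gaze direction can be estimated: locate the iris center between the two corners of the eye and see which side it sits on. The landmark coordinates below are made up for illustration; in practice they would come from a face-landmark model.

```python
# Hedged sketch: estimate horizontal gaze direction from eye landmarks.
# Assumes a face-landmark model has already located the iris center and the
# two eye corners in image coordinates; the values here are made up.

def gaze_direction(iris_x, left_corner_x, right_corner_x, margin=0.15):
    """Classify gaze as 'left', 'center' or 'right' from the iris position.

    The iris x-coordinate is normalized to [0, 1] across the eye opening;
    `margin` (an assumed tuning constant) sets the dead zone around center.
    Directions are in image coordinates, not the viewer's own left/right.
    """
    width = right_corner_x - left_corner_x
    ratio = (iris_x - left_corner_x) / width  # 0 = far left, 1 = far right
    if ratio < 0.5 - margin:
        return "left"
    if ratio > 0.5 + margin:
        return "right"
    return "center"

# Illustrative landmark x-coordinates (pixels) for one eye.
print(gaze_direction(iris_x=120, left_corner_x=100, right_corner_x=160))  # left
print(gaze_direction(iris_x=130, left_corner_x=100, right_corner_x=160))  # center
```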
Some systems are capable of detecting head motion and hand gestures as well. This could be used, for example, to control basic functions of a TV. Computer vision could detect you waving to turn the TV on, or perhaps pointing up to turn the volume up, pointing down to turn the volume down, or even placing your finger perpendicular to your lips to “shush” or mute the TV.
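A sketch of how such a gesture vocabulary could drive a TV appears below. It assumes an upstream gesture-recognition model emits labels like "wave" or "shush"; the TV class and the label names are hypothetical.

```python
# Hedged sketch: translating recognized gestures into TV commands.
# The gesture labels are assumed outputs of a separate gesture-recognition
# model; the TV class below is hypothetical.

class TV:
    def __init__(self):
        self.on = False
        self.volume = 10
        self.muted = False

    def toggle_power(self):
        self.on = not self.on

    def volume_up(self):
        self.volume = min(self.volume + 1, 100)

    def volume_down(self):
        self.volume = max(self.volume - 1, 0)

    def mute(self):
        self.muted = True

tv = TV()

# Each gesture from the article maps to one TV command.
GESTURE_COMMANDS = {
    "wave": tv.toggle_power,       # wave to turn the TV on or off
    "point_up": tv.volume_up,      # point up -> volume up
    "point_down": tv.volume_down,  # point down -> volume down
    "shush": tv.mute,              # finger to lips -> mute
}

def on_gesture(label):
    command = GESTURE_COMMANDS.get(label)
    if command:
        command()

for gesture in ["wave", "point_up", "point_up", "shush"]:
    on_gesture(gesture)

print(tv.on, tv.volume, tv.muted)  # True 12 True
```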
Gesture recognition has other applications. Some cars now feature a gesture recognition system that lets you open the trunk just by waving your foot under the rear bumper. This is often marketed as a convenience feature: someone with an armload of groceries can open the trunk with a simple gesture instead of fishing their keys out of a pocket.
Apple recently unveiled its newest product, the Vision Pro, an XR headset with a vision-based interface that tracks the user’s eyes. It also incorporates gesture recognition to control portions of the user’s experience without the need for other peripherals or controllers.
Its $3,500 price point at launch means the product is aimed squarely at enterprise and other power users at first. Still, the company seems intent on figuring out how to bring this product category to the masses, and most people wouldn’t bet against Apple on category-defining technology.
At the other end of the consumer device price range, the Microsoft Kinect (now discontinued) was marketed as an accessory to the Xbox gaming console. It combined an infrared light source, a camera sensor, an AI/ML (artificial intelligence/machine learning) SoC (system-on-chip) and algorithms to translate gestures into commands for a range of gaming control functions. While this particular product is no longer for sale, it was offered at a relatively low price point (under $400) and demonstrated that a mass-market product was possible.
Looking ahead, computer vision, as a relatively low-cost consumer technology, has the potential to be applied in a variety of areas, such as sign-language recognition and assistive devices for people with disabilities.
For those interested in further exploration of these topics, the recent white paper, Vision-Based Technology: Next-Gen Control, produced by Parks Associates and sponsored by Adeia, gives additional detail on some of these applications and their many benefits.
One of the most exciting applications for VBI is interactive video. In this use case, an interactive video can contain several opportunities for vision-based control.
One example is billboards that show user-targeted video ads depending on who is in front of them. This application is slightly more passive; it may detect the gender, age and other demographics of the viewer in order to serve relevant ads.
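A hedged sketch of that selection logic: an (assumed) demographics classifier produces a coarse, anonymous viewer profile, and the billboard picks a matching item from its ad inventory. The ad names, age bands and matching rule here are all illustrative.

```python
# Hedged sketch: picking a billboard ad from a coarse, anonymous viewer
# profile. The profile is assumed to come from a demographics classifier;
# the inventory and the matching rule are illustrative, not a real ad system.

ADS = [
    {"name": "sports_car", "target_ages": {"25-34", "35-44"}},
    {"name": "gaming_console", "target_ages": {"18-24", "25-34"}},
    {"name": "garden_tools", "target_ages": {"45-54", "55+"}},
]

def select_ad(viewer_age_band, default="brand_awareness"):
    """Return the first ad targeting the viewer's age band, else a default."""
    for ad in ADS:
        if viewer_age_band in ad["target_ages"]:
            return ad["name"]
    return default

print(select_ad("25-34"))  # sports_car
print(select_ad("13-17"))  # brand_awareness
```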
Another, more active use case for interactive video is Universal Studios’ new ride, “Mario Kart: Bowser’s Challenge”, which uses augmented reality (AR) and head and eye tracking for aiming. In this ride-based game, the video changes based on where you direct your gaze.
For a third example, think of watching a video on YouTube. Usually at the end, there are clickable areas of the screen (called Cards) that you can click to view a related video. In the not-too-distant future, this type of interaction could be controlled by your eyes: you could direct your gaze at one of the cards and blink to launch the next video.
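Blink detection is commonly built on the eye aspect ratio (EAR): the eye’s height relative to its width, computed from six landmarks around the eye, drops sharply when the eyelid closes. The sketch below implements that well-known heuristic; the threshold, frame count and sample landmark coordinates are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: blink detection with the eye aspect ratio (EAR), a common
# heuristic. Six landmarks outline the eye: p1/p4 are the corners, p2/p3 the
# upper lid, p5/p6 the lower lid. EAR drops toward zero when the eye closes.

EAR_THRESHOLD = 0.2  # assumed tuning constant
CLOSED_FRAMES = 2    # eye must stay closed this many frames to count

def eye_aspect_ratio(pts):
    """pts: array of six (x, y) landmarks in the order p1..p6."""
    p1, p2, p3, p4, p5, p6 = pts
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def count_blinks(frames_of_landmarks):
    blinks, closed = 0, 0
    for pts in frames_of_landmarks:
        if eye_aspect_ratio(pts) < EAR_THRESHOLD:
            closed += 1
        else:
            if closed >= CLOSED_FRAMES:
                blinks += 1  # eye reopened after being closed long enough
            closed = 0
    return blinks

# Synthetic landmarks for an open and a nearly shut eye (illustrative only).
open_eye = np.array([(0, 0), (3, 2), (6, 2), (9, 0), (6, -2), (3, -2)], float)
shut_eye = np.array([(0, 0), (3, 0.3), (6, 0.3), (9, 0), (6, -0.3), (3, -0.3)], float)

# open, open, closed, closed, open -> one blink
print(count_blinks([open_eye, open_eye, shut_eye, shut_eye, open_eye]))  # 1
```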
One other application in interactive video is the emerging technique of “branching”. Branching means that a film or TV show is constructed and filmed so that there are decision points for the viewer. At a certain point in the story, you are presented with a choice like, “Does the main character get into the car, or doesn’t she?” Depending on your choice, the story evolves in that direction. Netflix, for instance, has offered branching narratives for several years in some of its original content. This technique could be adapted so that branching is controlled by shifts in the viewer’s gaze.
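Structurally, a branching story is just a graph of video segments, and a gaze-driven version only changes where the choice input comes from. The sketch below models that idea; the segment names, story graph and gaze labels are illustrative assumptions.

```python
# Hedged sketch: a branching narrative as a graph of video segments, with
# the viewer's gaze region selecting the branch at each decision point.
# The segment names and story graph are illustrative assumptions.

STORY = {
    "opening": {"choices": {"left": "gets_in_car", "right": "walks_away"}},
    "gets_in_car": {"choices": {"left": "highway", "right": "back_roads"}},
    "walks_away": {"choices": None},  # terminal segment
    "highway": {"choices": None},
    "back_roads": {"choices": None},
}

def play_story(gaze_inputs, start="opening"):
    """Walk the story graph, consuming one gaze reading per decision point."""
    node, gaze_inputs = start, iter(gaze_inputs)
    path = [node]
    while STORY[node]["choices"]:
        gaze = next(gaze_inputs)  # e.g., from a gaze estimator like the one above
        node = STORY[node]["choices"][gaze]
        path.append(node)
    return path

# Viewer looks left at the first decision, right at the second.
print(play_story(["left", "right"]))  # ['opening', 'gets_in_car', 'back_roads']
```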
For educational and training purposes, interactive video could combine VBI with AR and virtual reality to let the user learn and practice real-world skills such as painting, welding and even surgery. In these cases, a camera would watch the user’s actions and the computer would generate corresponding video and render it in real time on the headset.
While VBI is not likely to entirely replace all other forms of touch-based interface any time soon, there are many situations where it can provide a helpful augmentation to the experience. Vision-based interfaces can offer valuable assistive capability for some applications while providing exciting immersivity for others.
Dr. Ning Xu currently serves as Fellow, Advanced R&D at Adeia Inc., pioneering innovations that enhance the way we live, work, and play. Before joining Adeia, Dr. Xu was the Chief Scientist of Video Algorithms at Kuaishou Technology, and before that, he held various positions at Amazon, Snap Research, Dolby Laboratories, and Samsung Research America. He earned his Ph.D. in Electrical Engineering from the University of Illinois at Urbana-Champaign (UIUC) in 2005, and his Master’s and Bachelor’s degrees from the University of Science and Technology of China (USTC). Dr. Xu has co-authored over 200 journal articles, conference papers, patents, and patent applications. His research interests encompass machine learning, computer vision, video technology, and other related areas. He is a Senior Member of IEEE.