Vision Models for Emerging Technologies and Their Impact on Computer Vision
June 19th 2020, 1PM-4:30PM Pacific Time
There will be a live Q&A after the talks, at 4PM (PDT), through CVPR's streaming platform.
To enhance the overall viewing experience (for cinema, TV, games, AR/VR) the media industry is continuously striving to improve image quality. Currently the emphasis is on High Dynamic Range (HDR) and Wide Colour Gamut (WCG) technologies, which yield images with greater contrast and more vivid colours. The uptake of these technologies, however, has been hampered by the significant challenge of understanding the science behind visual perception. This course provides university researchers and graduate students in computer science, computer engineering, vision science, as well as industry R&D engineers, an insight into the science and methods for HDR and WCG. It presents the underlying principles and latest practical methods in a detailed and accessible way, highlighting how the use of vision models is a key element of all state-of-the-art methods for these emerging technologies. It discusses their impact on computer vision research and applications, as well as open challenges and future directions of research.
The tutorial will consist of three one-hour talks and will closely follow the book "Vision models for HDR and WCG imaging", published by Elsevier in November 2019 in their Computer Vision and Pattern Recognition series.
We start with an overview of the biology of vision. In Session 1.1 we will describe the impact of the optics of the eye on the formation of the retinal image, the layered structure of the retina and its different cell types, and the neural interactions and transmission channels by which information is represented and conveyed. Whenever possible we stress two ideas: first, that many characteristics of the retina and its processes can be explained as optimal choices that maximize coding efficiency; this concept of efficient representation is key for our applications and is the underlying theme of the course. And second, that much is still unknown about how the retina works.
In Session 1.2 we will describe the layered structure, cell types and neural connections in the lateral geniculate nucleus and the visual cortex, with an emphasis on color representation. We will also introduce linear+nonlinear (L+NL) models, which are arguably the most popular form of model not just for cell activity in the visual system but for visual perception as well. An important take-away message is that, despite the enormous advances in the field, the most relevant questions about color vision and its cortical representation remain open: which neurons encode color, how the cortex transforms the cone signals, how color and form are perceptually bound, and how these neural signals correspond to color perception. Another important message is that the parameters of L+NL models change with the image stimulus, and the effectiveness of these models decays considerably when they are tested on natural images. This has grave implications for our purposes, since many essential methodologies in computer vision assume an L+NL form.
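To make the L+NL idea concrete, here is a minimal, purely illustrative sketch (not any specific model from the literature): a linear receptive-field stage, i.e. a weighted sum over the stimulus, followed by a pointwise Naka-Rushton-style saturating nonlinearity. The weights, exponent and semisaturation constant below are arbitrary choices for illustration.

```python
import numpy as np

def lnl_response(stimulus, weights, n=2.0, sigma=0.5):
    """Linear+nonlinear (L+NL) cascade: a linear receptive-field
    filter followed by a pointwise saturating nonlinearity."""
    linear = float(np.dot(weights, stimulus))  # L stage: weighted sum
    drive = max(linear, 0.0)                   # half-wave rectification
    return drive**n / (drive**n + sigma**n)    # NL stage: Naka-Rushton form

# Hypothetical center-surround receptive field over a 1-D stimulus
weights = np.array([-0.25, 0.5, 1.0, 0.5, -0.25])
stimulus = np.array([0.1, 0.4, 0.9, 0.4, 0.1])
r = lnl_response(stimulus, weights)  # bounded in [0, 1)
```

Note how the nonlinearity makes the response compressive: doubling the stimulus contrast increases the response, but by far less than a factor of two, which is one reason fixed-parameter L+NL fits struggle on natural images.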
In Session 1.3 we discuss adaptation and efficient representation. Adaptation is an essential feature of the neural systems of all species: a change in the input-output relation of the system that is driven by the stimuli and that is intimately linked with the concept of efficient representation. Through adaptation, the sensitivity of the visual system is constantly adjusted, taking into account multiple aspects of the input stimulus and matching the gain to the local image statistics through processes that are not fully understood and that help make human vision so hard to emulate with devices. Adaptation happens at all stages of the visual system, from the retina to the cortex, with its effects cascading downstream; it is a key strategy that allows the visual system to deal with the enormous dynamic range of the world around us even though the dynamic range of individual neurons is quite limited.
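As a toy illustration of this gain-matching idea (a deliberately crude sketch, not a retinal model; the box-filter pooling and kernel size are arbitrary assumptions), the following divides each pixel by a local average, so that two regions four orders of magnitude apart in luminance end up producing similar response levels:

```python
import numpy as np

def local_adaptation(image, kernel_size=9, eps=1e-6):
    """Toy gain control: divide each pixel by a local average so the
    response adapts to local luminance (a crude efficient-coding sketch)."""
    pad = kernel_size // 2
    padded = np.pad(image, pad, mode='edge')
    local_mean = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            # box filter as a stand-in for a retinal pooling kernel
            local_mean[i, j] = padded[i:i + kernel_size, j:j + kernel_size].mean()
    return image / (image + local_mean + eps)  # Naka-Rushton-style gain

# A scene spanning a huge dynamic range: a dim patch next to a bright one
scene = np.concatenate([np.full((8, 8), 0.01), np.full((8, 8), 100.0)], axis=1)
adapted = local_adaptation(scene)
```

Away from the boundary, both the dim and the bright regions are mapped to responses near 0.5: the four decades of input luminance are compressed into the narrow output range, which is the essence of how adaptation copes with limited neural dynamic range.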
In Session 2.1 we discuss brightness perception, the relationship between the intensity of the light (a physical magnitude) and how bright it appears to us (a psychological magnitude). It has been known for a long time that this relationship is not linear, that brightness isn't simply proportional to light intensity. But we'll see that determining the brightness perception function is a challenging and controversial problem: results depend on how the experiment is conducted, what types of image stimuli are used and what tasks the observers are asked to perform. Furthermore, brightness perception depends on the viewing conditions, including image background, surround, peak luminance and dynamic range of the display, and, to make things even harder, it also depends on the distribution of values of the image itself. This is a very important topic for imaging technologies, which require a good brightness perception model in order to encode image information efficiently and without introducing visible artifacts.
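As a worked illustration of this nonlinearity, consider a Stevens-style power law, one classical (and, per the discussion above, contested) candidate for the brightness function. The exponent 1/3 is a commonly cited value for extended fields, but it varies with stimulus, task and viewing conditions:

```python
def brightness(luminance, k=1.0, exponent=1 / 3):
    """Toy Stevens-style brightness function B = k * I^a.
    The exponent is illustrative; measured values depend heavily
    on the experimental conditions."""
    return k * luminance ** exponent

# Doubling the light does not double the apparent brightness:
ratio = brightness(2.0) / brightness(1.0)  # 2**(1/3), about 1.26
```

This compressive behavior is exactly what an efficient encoding exploits: allocating code values according to perceived brightness rather than physical luminance spends fewer bits on intensity differences the eye cannot see.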
Session 2.2 deals with color. There are models that, for simple stimuli in controlled environments, can predict very accurately the color appearance of objects, as well as the magnitude of their color differences. These models were developed and validated for standard dynamic range (SDR) images, and their extension to the HDR case is not straightforward. For the general case of natural images in arbitrary viewing conditions, there are many perceptual phenomena that come into play and no comprehensive vision model that is capable of handling them all in an effective way. As a result, the color appearance problem remains very much open, and this affects all aspects of color representation and processing.
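The most basic of these difference models is the CIE76 formula, simply a Euclidean distance in CIELAB coordinates (shown here only as a baseline; later formulas such as CIEDE2000, and any HDR extension, are considerably more involved):

```python
import numpy as np

def delta_e_76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB
    (L*, a*, b*). Valid only under the controlled SDR viewing
    conditions the space was designed for."""
    return float(np.linalg.norm(np.asarray(lab1) - np.asarray(lab2)))

# Two nearby colors in CIELAB; a difference around 2.3 is often
# cited as roughly one just-noticeable difference under controlled
# viewing (the exact threshold depends on the stimulus).
d = delta_e_76((50.0, 10.0, 10.0), (51.0, 12.0, 10.0))  # sqrt(5)
```

The known failure of such Euclidean metrics on complex natural images, and under HDR viewing, is part of why the color appearance problem described above remains open.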
In Session 3.1 we show how an image processing method for color and contrast enhancement, which performs local histogram equalization (LHE), is linked to neural activity models and the Retinex theory of color vision. The common thread behind all these subjects is efficient representation. The LHE method can be extended to reproduce assimilation, an important type of visual induction effect. The traditional view has been that assimilation must be a cortical process, but we show that it can start already in the retina. The LHE method is based on minimizing an energy functional through an iterative process. If the functional is regularized, the minimization can be achieved in a single step by convolving the input with a kernel, which has important implications for the performance of algorithms based on LHE.
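As an illustration of the one-step kernel idea, here is a toy sketch in the spirit of a regularized LHE step (not the actual method from the talk): each pixel is pushed away from its Gaussian-weighted local average, which increases local contrast in a single convolution pass. The kernel width and gain below are arbitrary illustrative parameters.

```python
import numpy as np

def lhe_like_enhancement(image, alpha=0.5, sigma=2.0):
    """One-step, kernel-based sketch in the spirit of regularized LHE:
    push each pixel away from its Gaussian-weighted local average,
    enhancing local contrast. Assumes image values in [0, 1]."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()  # normalized 1-D Gaussian kernel
    padded = np.pad(image, radius, mode='reflect')
    # separable blur: convolve rows, then columns
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, g, mode='valid'), 1, padded)
    blurred = np.apply_along_axis(
        lambda col: np.convolve(col, g, mode='valid'), 0, blurred)
    return np.clip(image + alpha * (image - blurred), 0.0, 1.0)
```

Because the whole operation is a fixed convolution plus a pointwise step, it runs in one pass instead of an iterative minimization, which is the performance advantage the regularized formulation brings.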
In Session 3.2 we show how the local histogram equalization approach presented earlier can be used to develop gamut mapping algorithms that are of low computational complexity, produce results that are free from artifacts and outperform state-of-the-art methods according to psychophysical tests. Another contribution of our research is to highlight the limitations of existing image quality metrics when applied to the gamut mapping problem, as none of them, including two state-of-the-art deep learning metrics for image perception, are able to predict the preferences of the observers.
Based on recent findings and models from vision science, we present in Session 3.3 effective tone mapping and inverse tone mapping algorithms for production, post-production and exhibition. These methods are automatic and real-time, and they have been both fine-tuned and validated by cinema professionals, with psychophysical tests demonstrating that the proposed algorithms outperform both the academic and industrial state-of-the-art. As in the case of gamut mapping, we also show that state-of-the-art deep learning metrics are not capable of predicting observers' preferences for tone mapping and inverse tone mapping results.
In Session 3.4 we discuss the impact on computer vision research of the limitations of current vision models. In particular, given that:
- imaging techniques based on vision models are the ones that perform best for HDR and WCG imaging and a number of other applications;
- the performance of these methods is still far below what cinema professionals can achieve;
- and vision models themselves are still lacking, as most key problems in visual perception remain open,
we propose that, rather than incremental improvement or revision, what vision models need is a change of paradigm.
About the speaker
Marcelo Bertalmío (Montevideo, 1972) is a full professor at Universitat Pompeu Fabra, Spain, in the Information and Communication Technologies Department. He received the B.Sc. and M.Sc. degrees in electrical engineering from the Universidad de la República, Uruguay, and the Ph.D. degree in electrical and computer engineering from the University of Minnesota in 2001.
His publications total more than 11,000 citations. He was awarded the 2012 SIAG/IS Prize of the Society for Industrial and Applied Mathematics (SIAM) for co-authoring the most relevant image processing work published in the period 2008-2012. He has received the Femlab Prize, the Siemens Best Paper Award, the Ramón y Cajal Fellowship, and the ICREA Academia Award, among other honors. He was Associate Editor for SIAM-SIIMS and elected secretary of SIAM's activity group on imaging.
He obtained an ERC Starting Grant for his project “Image processing for enhanced cinematography” (IP4EC) and two ERC Proof of Concept Grants to bring to market tone mapping and gamut mapping technologies. He is co-PI of two H2020 projects, HDR4EU and SAUCE, involving world-leading companies in the film industry. His new book "Vision models for HDR and WCG imaging" was published by Elsevier in November 2019. Previously, he wrote the book “Image Processing for Cinema”, published by CRC Press in 2014, and edited the book "Denoising of Photographic Images and Video", published by Springer in 2018.
His current research interests are in developing image processing algorithms for cinema that mimic neural and perceptual processes in the human visual system.
Here's the video of an April 2019 talk at UCLA's IPAM, with an overview of the research done at his lab.