Saliency map
In computer vision, a saliency map is an image that highlights the regions on which people's eyes focus first. The goal of a saliency map is to reflect the degree of importance of a pixel to the human visual system. For example, in a photograph of a fort beneath light clouds, a viewer looks first at the fort and the clouds, so those regions should be highlighted on the saliency map.
Applications
Saliency maps have applications in a variety of problems. Some general applications:
- Image and video compression: The human eye focuses only on a small region of interest in the frame, so it is not necessary to compress the entire frame with uniform quality. Guo and Zhang report that using a saliency map reduces the final size of the video at the same perceived visual quality (a toy sketch of the idea follows this list).[1]
- Image and video quality assessment: The main task for an image or video quality metric is a high correlation with user opinions. Differences in salient regions are given more importance and thus contribute more to the quality score.[2]
- Image retargeting: It aims at resizing an image by expanding or shrinking the noninformative regions. Therefore, retargeting algorithms rely on the availability of saliency maps that accurately estimate all the salient image details.[3]
- Object detection and recognition: Instead of applying a computationally complex algorithm to the whole image, we can apply it only to the most salient regions, which are the most likely to contain an object.[4]
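As a toy illustration of the compression idea above (not the method of the cited paper), one can round-trip a frame through JPEG at two quality levels and blend the results per pixel according to a saliency map, so that only salient regions keep high quality:

```python
# Toy sketch of saliency-weighted compression: blend a high-quality and a
# low-quality JPEG round-trip of the same frame, weighted per pixel by a
# saliency map in [0, 1]. Illustrative only; real saliency-aware codecs
# modulate quantization per block inside the encoder.
import numpy as np
import cv2

def saliency_weighted_jpeg(frame, saliency, q_high=90, q_low=20):
    """frame: BGR uint8 image; saliency: float map in [0, 1], same H x W."""
    def encode_decode(img, quality):
        # Encode to an in-memory JPEG at the given quality, then decode.
        ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
        return cv2.imdecode(buf, cv2.IMREAD_COLOR)

    hi = encode_decode(frame, q_high).astype(np.float32)
    lo = encode_decode(frame, q_low).astype(np.float32)
    w = saliency[..., None]  # broadcast the weight over the 3 color channels
    return (w * hi + (1.0 - w) * lo).astype(np.uint8)
```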
Algorithms
There are three forms of classic saliency estimation algorithms implemented in OpenCV (a minimal usage sketch follows the list):
- Static saliency: Relies on image features and statistics to localize the regions of interest of an image.
- Motion saliency: Relies on motion in a video, detected by optical flow. Objects that move are considered salient.
- Objectness: Objectness reflects how likely an image window covers an object. These algorithms generate a set of bounding boxes of where an object may lie in an image.
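A minimal sketch of how these three forms are exposed in OpenCV's saliency module (which ships with the opencv-contrib-python package; "input.jpg" is a placeholder path):

```python
import cv2

image = cv2.imread("input.jpg")

# Static saliency: spectral residual, computed from image statistics.
static = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, static_map = static.computeSaliency(image)      # float map in [0, 1]

# Motion saliency: operates on a stream of grayscale frames; a single
# frame is fed here only for illustration.
motion = cv2.saliency.MotionSaliencyBinWangApr2014_create()
motion.setImagesize(image.shape[1], image.shape[0])
motion.init()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ok, motion_map = motion.computeSaliency(gray)       # per-pixel motion mask

# Objectness (BING): yields candidate bounding boxes rather than a map;
# it additionally requires trained model files set via setTrainingPath().
objectness = cv2.saliency.ObjectnessBING_create()
```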
In addition to classic approaches, neural-network-based methods are also popular. Examples of neural networks for motion saliency estimation include (a schematic encoder-decoder sketch follows the list):
- TASED-Net: It consists of two building blocks. First, the encoder network extracts low-resolution spatiotemporal features, and then the following prediction network decodes the spatially encoded features while aggregating all the temporal information.
- STRA-Net: It emphasizes two essential issues. First, spatiotemporal features are integrated via appearance and optical-flow coupling; then multi-scale saliency is learned via an attention mechanism.
- STAViS: It combines spatiotemporal visual and auditory information. This approach employs a single network that learns to localize sound sources and to fuse the two saliencies to obtain a final saliency map.
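The following is a schematic sketch, not any of the published networks above, of the encoder-decoder pattern they share: a 3D-convolutional encoder extracts low-resolution spatiotemporal features, and a decoder aggregates the temporal axis and upsamples back to a single-frame saliency map. All layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VideoSaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 3D convolutions over (time, height, width) produce
        # low-resolution spatiotemporal features.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(16, 32, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )
        # Collapse the remaining temporal dimension by average pooling.
        self.temporal_pool = nn.AdaptiveAvgPool3d((1, None, None))
        # Decoder: upsample spatially to a per-pixel saliency map.
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # saliency values in [0, 1]
        )

    def forward(self, clip):           # clip: (batch, 3, frames, H, W)
        feats = self.encoder(clip)     # (batch, 32, frames/2, H/4, W/4)
        feats = self.temporal_pool(feats).squeeze(2)  # (batch, 32, H/4, W/4)
        return self.decoder(feats)     # (batch, 1, H, W)

clip = torch.randn(1, 3, 8, 64, 64)    # one 8-frame RGB clip
saliency = VideoSaliencyNet()(clip)    # -> torch.Size([1, 1, 64, 64])
```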
Datasets
A saliency dataset usually contains human eye movements recorded on a set of images or video sequences. Such datasets are valuable for creating new saliency algorithms and for benchmarking existing ones. The most important dataset parameters are spatial resolution, size, and the eye-tracking equipment used. Below is an excerpt from the datasets table of the MIT/Tübingen Saliency Benchmark.
| Dataset | Resolution | Size | Observers | Duration | Eye tracker |
|---|---|---|---|---|---|
| CAT2000 | 1920×1080 px | 4000 images | 24 | 5 sec | EyeLink 1000 (1000 Hz) |
| EyeTrackUAV2 | 1280×720 px | 43 videos | 30 | 33 sec | EyeLink 1000 Plus (1000 Hz, binocular) |
| CrowdFix | 1280×720 px | 434 videos | 26 | 1–3 sec | The Eyetribe Eyetracker (60 Hz) |
| SAVAM | 1920×1080 px | 43 videos | 50 | 20 sec | SMI iViewX Hi-Speed 1250 (500 Hz) |
To collect a saliency dataset, image or video sequences and eye-tracking equipment must be prepared, and observers must be invited. Observers must have normal or corrected-to-normal vision and must sit at the same distance from the screen. At the beginning of each recording session, the eye tracker is recalibrated: the observer fixates their gaze on the center of the screen. The session then begins, and saliency data are collected by showing the sequences and recording eye gaze.
An eye-tracking device contains a high-speed camera capable of recording eye movements at a rate of at least 250 frames per second. Images from the camera are processed by software running on a dedicated computer, which returns the gaze data.
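A common post-processing step when building such a dataset is to convert the recorded gaze samples into a ground-truth fixation density map by accumulating fixation points and blurring them with a Gaussian roughly the size of the fovea. A minimal sketch, assuming gaze samples arrive as (x, y) pixel coordinates (the blur width below is an illustrative assumption, not a value from any specific dataset):

```python
# Turn recorded gaze samples into a fixation density map: accumulate
# fixation points on a zero image, then apply a Gaussian blur whose
# width approximates foveal extent (sigma_px=35 is an assumption).
import numpy as np
import cv2

def fixation_density_map(gaze_points, height, width, sigma_px=35):
    """gaze_points: iterable of (x, y) pixel coordinates from the eye tracker."""
    fixation_map = np.zeros((height, width), dtype=np.float32)
    for x, y in gaze_points:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            fixation_map[int(y), int(x)] += 1.0
    # ksize=(0, 0) lets OpenCV derive the kernel size from sigma.
    density = cv2.GaussianBlur(fixation_map, (0, 0), sigmaX=sigma_px)
    return density / density.max() if density.max() > 0 else density

# Example: three observers fixating near the same region of a 1080p frame.
gaze = [(960, 540), (970, 530), (950, 555)]
saliency_gt = fixation_density_map(gaze, height=1080, width=1920)
```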
References
- ^ Guo, Chenlei; Zhang, Liming (Jan 2010). "A Novel Multiresolution Spatiotemporal Saliency Detection Model and Its Applications in Image and Video Compression". IEEE Transactions on Image Processing. 19 (1): 185–198. doi:10.1109/TIP.2009.2030969. ISSN 1057-7149.
- ^ Tong, Yubing; Konik, Hubert; Cheikh, Faouzi; Tremeau, Alain (2010-05-01). "Full Reference Image Quality Assessment Based on Saliency Map Analysis". Journal of Imaging Science and Technology. 54 (3): 30503-1–30503-14. doi:10.2352/J.ImagingSci.Technol.2010.54.3.030503.
- ^ Goferman, Stas; Zelnik-Manor, Lihi; Tal, Ayellet (Oct 2012). "Context-Aware Saliency Detection". IEEE Transactions on Pattern Analysis and Machine Intelligence. 34 (10): 1915–1926. doi:10.1109/TPAMI.2011.272. ISSN 1939-3539.
- ^ Jiang, Huaizu; Wang, Jingdong; Yuan, Zejian; Wu, Yang; Zheng, Nanning; Li, Shipeng (June 2013). "Salient Object Detection: A Discriminative Regional Feature Integration Approach". 2013 IEEE Conference on Computer Vision and Pattern Recognition. IEEE. doi:10.1109/cvpr.2013.271.