Ziqi Gao1,2 Wendi Yang1,2 Yujia Li3 Lei Xing4 S. Kevin Zhou1,2,3,5
1School of Biomedical Engineering, Division of Life Sciences and Medicine,
University of Science and Technology of China (USTC), Hefei Anhui, 230026, China
2Center for Medical Imaging, Robotics, Analytic Computing & Learning (MIRACLE),
Suzhou Institute for Advance Research, USTC, Suzhou Jiangsu, 215123, China
3Key Laboratory of Intelligent Information Processing of Chinese Academy of
Sciences (CAS), Institute of Computing Technology, China
4 Department of Radiation Oncology, Stanford University, Stanford, CA, USA
5Key Laboratory of Precision and Intelligent Chemistry, USTC, Hefei Anhui, 230026, CAS
{gaoziqi,yangwendi}@mail.ustc.edu.cn yujia.li@miracle.ict.ac.cn
lei@stanford.edu s.kevin.zhou@gmail.comCorresponding author.
Abstract
Non-semantic context information is crucial for visual recognition, as the human visual perception system first uses global statistics to process scenes rapidly before identifying specific objects. However, while semantic information is increasingly incorporated into computer vision tasks such as image reconstruction, non-semantic information, such as global spatial structures, is often overlooked. To bridge the gap, we propose a biologically informed non-semantic context descriptor, MS-Glance, along with the Glance Index Measure for comparing two images. A Global Glance vector is formulated by randomly retrieving pixels based on a perception-driven rule from an image to form a vector representing non-semantic global context, while a local Glance vector is a flattened local image window, mimicking a zoom-in observation. The Glance Index is defined as the inner product of two standardized sets of Glance vectors. We evaluate the effectiveness of incorporating Glance supervision in two reconstruction tasks: image fitting with implicit neural representation (INR) and undersampled MRI reconstruction. Extensive experimental results show that MS-Glance outperforms existing image restoration losses across both natural and medical images. The code is available at https://github.com/Z7Gao/MSGlance.
1 Introduction
Computer vision (CV) has increasingly incorporated global semantic information, inspired by the human visual perception process[25] where understanding is driven by high-level semantic cues rather than focusing on individual pixels.For example, the Vision Transformer (ViT)[11], a prominent neural network architecture, captures long-range dependencies within an image and generates powerful feature representations through vision-language pretraining[62]. In image reconstruction, advanced models have also successfully incorporated semantic information. For instance, in super-resolution, architectures such as U-Net utilize multiple layers of convolutional and downsampling operations to progressively extract high-level semantic information[63, 27, 82]. Besides, in undersampled MRI reconstruction, some researchers utilize semantic segmentation networks to assist in the reconstruction process[30, 14, 57].However, in human vision, semantic conception[68] is not the only way to portray the recognition; it can also be interpreted as a reconstruction procedure of the scene that doesn’t involve semantic understanding[50, 56, 65, 51], demonstrated in Fig.1. A pioneering study[56] shows that human vision gleans a large amount of meaningful information from one single glance. Experimental studies followed[65, 51], suggesting that recognition of the real world may start from encoding global configuration since no perception of individual objects or detailed features can be made in such a short time. A representative computational model, Spatial Envolope[50], models the structure of real-world scenes by a set of perceptual properties including naturalness, openness, roughness, ruggedness, and expansion. Further, Michelle Greene et al. [21] experimentally shows that statistical and structural cues extracted from very brief (e.g. 19 ms and masked) exposures allow for above-chance categorization of scenes, verifying the Spatial Envelope model and the existence of a reconstruction-based recognition.
Current image reconstruction algorithms primarily focus on pixel-wise similarity or high-level semantic information, often overlooking non-semantic, statistical, and structural information. Although Structural Similarity Index Measure (SSIM) [74] is widely used for evaluating image quality, it only captures local information from neighboring pixels without accounting for structural information from distant pixels.On the other hand, S3IM[76] improves SSIM in the application of novel view synthesis with Neural Radiance Field (NeRF)[47] by applying SSIM to random patches. However, the Gaussian-kernel-based computation of SSIM assumes weighting the central pixels while suppressing the edge pixels, limiting its ability to represent global context, as global context does not inherently focus on the center.
To bridge this gap, we propose Multi-Scale Glance (MS-Glance), a novel non-semantic descriptor of image context, inspired by the human recognition process that bypasses the semantic concept. MS-Glance includes local and global Glance vectors. A global Glance vector is formulated by randomly retrieving pixels from an image with an explicit rule and a local Glance vector is formed by flattening a local window of an image. Given two Glance vectors, their Glance Index is the inner product between two sets of standardized Glance vectors. Since MS-Glance is a descriptor of image context, it can be seamlessly integrated into existing image reconstruction models as a plug-and-play component, enhancing the quality of the reconstructed images.
We show the applicability of MS-Glance loss in two scenarios: image fitting with implicit neural representation (INR) and supervised undersampled MRI reconstruction with DRDN[87]. For image fitting, we training SIREN[66], a neural representation capable of modeling signals with fine details, with MS-Glance loss. Using MS-Glance leads to the best SIREN representation ability on datasets of common objects, human faces, and MRI brain scans. For undersampled MRI reconstruction, we propose a novel air prior that allows rule-based pixel selection for MS-Glance. Experiments are conducted on two public datasets, IXI and FastMRI, encompassing various MRI acquisition scenarios and different organs. Extensive experimental results demonstrate that incorporating Glance not only enhances the performance of existing models but also improves image reconstruction quality for both natural and medical images.
The contributions of this work are as follows. First, we propose a biologically inspired non-semantic context descriptor, MS-Glance, along with the Glance Index Measure and Glance loss specifically designed for image comparison. Second, we demonstrate its ability to improve learned image representation through INR image fitting. Third, we apply it to training undersampled MRI reconstruction networks, showcasing its utility in image restoration tasks. Additionally, we introduce a novel perception prior, MRI air prior, and incorporate it with the construction of MS-Glance vectors for MRI reconstruction. Finally, extensive experiments on a wide range of datasets show that the MS-Glance loss outperforms existing loss functions used in image restoration, such as L1+SSIM[70] and LPIPS[18].
2 Related work
2.1 Image non-semantic information
While semantic information has been extensively utilized in various CV tasks, research on non-semantic information of images, such as structural and layout features, remains limited. Cao et al.[1] introduced the concept of Non-Semantic Facial Parts (NSFP), which identifies the most discriminative patches for face recognition and retrieval. However, their method relies on predefined features like SIFT, limiting the performance. Murrugarra et al.[48] proposed non-semantic transfer from attributes that may belong to different domains, but focuses solely on texture, which is only a small subset of non-semantic information. In autonomous driving, Anas et al.[3] represented point clouds with non-semantic features for environment interpretation, localization, and mapping; yet their work is specific to point clouds rather than images. In NeRF, Xie et al.[76] applied SSIM on reorganized random image patches. Still, their work is also application-specific and relies on the original pixel-based SSIM computation with a local focus center.
2.2 Image reconstruction
Image reconstruction[80, 31, 70] in CV refers to recovering or restoring an image from incomplete, corrupted, or undersampled data. Here, we introduce one reconstruction method and one reconstruction task: Implicit Neural Representation (INR) and undersampled MRI reconstruction.
2.2.1 Implicit Neural Representation (INR)
INR is a neural network-based continuous image representation. It parameterizes a field in a coordinate-based manner. Various research on INR explored better image fitting methods, for example, Sinusoidal Representation Networks (SIREN)[66] utilizes periodic activation functions for INR to capture intricate natural signals and their derivatives. INR has also been used for image compression[69, 12, 19], segmentation[37], and super-resoultion[4, 49]. Other major application of implicit neural representation is a series of works[47, 72, 83, 15] on 3D reconstruction. Instead of using explicit representation(e.g., voxels), NeRf[47] encodes the representation of 3D object/scene into MLPs. In this work, we focus on improving INR’s image-fitting performance, potentially applicable to various related tasks.
2.2.2 Undersampled MRI reconstruction
MRI is a non-invasive in-vivo imaging modality that benefits radiological diagnosis. However, its long acquisition leads to patient discomfort and motion artifacts. Undersampling in the K-space domain is often employed to accelerate the acquisition, but this leads to a loss of image fidelity. Parallel Imaging (PI) [22, 58] and Compressed Sensing (CS)[46, 26, 40, 54] is limited to an acceleration rate of 2-3[46]. Supervised deep learning-based undersampled MRI reconstruction networks are developed to learn the mapping between undersampled images and high-quality MR images and achieve improved performance[13, 60, 10, 59, 73, 79, 36, 64, 9, 28, 67, 85, 61, 5]. Compared with diffusion and GAN-based unsupervised models[23, 38, 29, 6, 34, 2, 39, 52, 7, 53, 24, 55, 16, 44, 8], supervised models typically show strong performance in in-domain scenarios. Most works have adopted the commonly used L1, L2, adversarial loss, SSIM loss, or frequency domain loss[35, 87, 71]. A small amount of work compares the effect of loss functions used for training[18] and explores new loss function[71] that is more effective than existing loss for phase reconstruction. In our work, we focus on the reconstruction of image magnitude, the final part that is displayed in diagnostics, and compare ours with those losses that are proved superior in magnitude reconstruction: SSIM[71] and perceptual loss [18].
2.3 Loss functions for comparing images
In tasks related to image comparing, various network architectures have been extensively explored while loss functions remain relatively under-developed.There are primarily two categories of loss functions commonly used: (1) pixel-wise loss functions such as L1[33] and L2 loss[77].These types of loss functions operate at the level of individual pixels or elements in the output and target images.(2) global loss functions such as Gram loss[17], adversarial loss (GAN)[20], perceptual loss[84], and SSIM loss[74, 76].In contrast to pixel-wise loss functions, global loss functions consider the relationships and structures across the entire image.There has been research[86] that indicates the quality of the results improves significantly with better loss functions, empirically a combination of pixel-wise loss and global loss.However, there is currently no effective loss function that can capture information about non-semantic image context.
3 Method
In this section, we formulate a novel non-semantic image descriptor of image context, Glance Vector, and a novel Glance Similarity Index Measure, GlanceIM, that benefits the training image reconstruction network.
3.1 Glance Vector
Given an image , we mimic the human recognition process that bypasses the semantic concept with a set of global and local Glance Vectors.
3.1.1 Global Glance Vector
Firstly, we retrieve the global image context from . A set of pixels, where , is randomly selected from , denoted by :
where is the set of coordinates corresponding to the selected pixels. Secondly, a Glance vector is formulated by randomly retrieving pixels from and forming a vector of shape .
We extract Glance vectors in a window-based computation: reshape into a 2D matrix, , of shape and apply a 2D window of shape , resulting in submatrices, . Different from SSIM[74] and S3IM [76], the kernel is uniform, instead of circular-symmetric Gaussian weighting, since the stochastic global context is a group-based term and should not have a focus center. Moreover, we apply a unit stride to form a dense representation of the image context. The corresponding Glance vectors are obtained by flattening each submatrix into a one-dimensional vector . The dense set of global Glance Vectors is represented as:
provides a more compact and computationally efficient representation of the global context embedded in .
The construction of can leverage prior knowledge of human perception by translating it into a pixel selection rule that emphasizes perceptually important structure. For instance, humans perceive bright and high-contrast areas more effectively than dim and low-contrast ones[75]. This also extends to well-trained physicians in clinical settings, where imaging modalities are physically designed to capture critical regions with high contrast, as higher-intensity areas often correlate with diagnostically significant regions. Below, we present an example of integrating Glance with perceptual priors to enhance undersampled MRI reconstruction.
MRI Air Prior.The air region appears as a black area in the MRI image and the physicians don’t consider them for diagnosis. This is because MRI works by detecting the magnetic signals produced by hydrogen nuclei in water molecules. Since air typically does not contain water molecules, the MRI scanner can not detect any meaningful magnetic signal produced by hydrogen nuclei in the air. We can simulate this perception process by not sampling from air pixels.
As shown in Fig. 2, we can define a small intensity threshold , say 0.01, and sample the pixels whose intensity is higher. Formally, this process can be defined as
3.1.2 Local Glance Vector
Dedicated image reconstruction requires a closer look at the local image context. As such, we additionally extract a set of local Glance vectors from the original image . We apply the 2D uniform window of shape to and produce a set of local Glance Vector by flattening. The dense set of local Glance Vectors is represented as:
Multi-scale Glance (MS-Glance) Vectors are defined as the union of Global and Local Glance Vectors:
3.2 Glance Index
Given two Glance vectors and from , the similarity between them is defined as the dot product of their standardized versions. The standardization of each vector involves subtracting the mean and dividing by the standard deviation of its elements. Specifically, the normalized Glance Vectors are denoted as and , where
Here, and represent the mean and standard deviation of the elements in vector , respectively. The similarity between them is given by:
This Glance similarity measure reflects the correlation between the two vectors after accounting for their respective means and variances.
The normalized Glance Vectors and take values in , where each element follows a standard normal distribution . The inner product of the normalized Glance Vectors lies within the range:
The maximum value is achieved when the vectors are perfectly aligned, and the minimum value is achieved when they are perfectly anti-aligned.
From an algebraic perspective, The similarity of and is equal to a normalized covariance term:From an algebraic perspective, the similarity of and is equal to a Pearson correlation coefficient:
(1) | ||||
(2) | ||||
(3) |
We add a small constant to both the numerator and denominator[74] of the Glance Index to avoid numerical instability.
3.3 Glance Index Measure
Given two images and , their Glance Index Measure (GlanceIM) is defined as the average of the Glance Index over two sets of MS-Glance Vectors, and , which are extracted from two images in the same way:
where and are the sets of Glance Vectors randomly sampled in the same manner from images and , respectively. lies within the range , where a value of 1 indicates perfect similarity (the images are identical in terms of global structure), and -1 indicates complete dissimilarity (the images are maximally different).
GlanceIM can be used as a loss for supervising image restoration networks by changing its range into :
4 Applications
This section demonstrates how MS-Glance improves supervised image reconstruction. We first choose a simple regression task, fitting an image with INR then demonstrate the applicability of MS-Glance and the novel air prior in undersampled MRI reconstruction, spanning various acquisition scenarios and organs.
Glance Loss Implementation.We validate MS-Glance and decompose it MS-Glance into Local Glance and Global Glance, which represent the Glance Vector sets used for computing GlanceIM. For 3-channel RGB images, the Glance vectors are defined as flattened 1D vectors of shape to leverage the correlation information among channels. For 2-channel complex-valued MRIs, the Glance vectors are extracted from the image magnitude, which is the root-sum-of-square of the two channels that represent the real and imaginary parts. This operation allows us to directly incorporate MRI air prior when constructing . We set , , and .
Comparsion and Evaluation. For comparison with existing losses, we add several common losses for training image restoration networks, including a feature-based loss, Perceptual Loss (LPIPS)[84], a local-structure-emphasized loss, SSIM Loss[74], and Stochastic SSIM (S3IM) [76], a recent loss used in NeRF for multi-view synthesis. We use the official implementation of LPIPS111https://github.com/richzhang/PerceptualSimilarity and S3IM222https://github.com/Madaoer/S3IM-Neural-Fields and follow their default settings. For SSIM, we keep the default setting in Pytorch333https://github.com/Po-Hsun-Su/pytorch-ssim and set its local window size to 16, the same as our and , and stride to 1. This gives us a direct comparison of our Local Glance and SSIM - two metrics that both focus on local regions. The image reconstruction performance is evaluated with peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM).
All code is written in PyTorch and experiments are carried on an NVIDIA 3090 GPU with 24 GB memory. During training, gradient clipping is deployed for all.
4.1 Image Fitting with INR
Dataset | Coco - Objects | CelebA - Human Face | IXI - Structural MRI | |||
---|---|---|---|---|---|---|
PSNR(dB) | SSIM | PSNR(dB) | SSIM | PSNR(dB) | SSIM | |
SIREN[66] | 34.469 | 0.9396 | 39.007 | 0.9696 | 37.430 | 0.9743 |
+LPIPS[84] | 34.803 | 0.9417 | 39.346 | 0.9714 | 33.506 | 0.9288 |
+SSIM[74] | 34.939 | 0.9468 | 39.232 | 0.9711 | 31.370 | 0.9298 |
+S3IM[76] | 34.717 | 0.9430 | 39.293 | 0.9714 | 37.412 | 0.9604 |
+Local Glance | 35.004 | 0.9463 | 39.422 | 0.9723 | 39.088 | 0.9788 |
+Global Glance | 34.843 | 0.9439 | 39.329 | 0.9712 | 37.613 | 0.9601 |
+MS-Glance | 35.249 | 0.9493 | 39.571 | 0.9733 | 39.141 | 0.9788 |
In this section, we show how MS-Glance improves network reconstruction ability over existing losses based on an implicit neural network, SIREN[66].
Model Implementation. SIREN’s input is a Fourier-encoded 2D coordinate, and the output is the pixel intensity: 3 channels for RGB images and 1 channel for gray-scale images. Its architecture is a Multi-Layer Perceptron (MLP) network with 256 hidden features. It is originally trained with an L2 loss. We use the official implementation444https://github.com/vsitzmann/siren/blob/master/explore_siren.ipynb of SIREN and follow its default configuration for training: Adam optimizer combined with a learning rate of 0.0001.
Dataset. We use multiple categories of images from three public datasets: COCO[41], CelebA[45] and IXI555https://brain-development.org/ixi-dataset/. They contain common objects in context, human faces in the wild, and T1-w MRI brains, respectively. Images are center-cropped into the shape of and normalized to [0,1]. We randomly choose 100, 100, and 1000 images from the three datasets, fit each image three times, and evaluate the PSNR and SSIM of the reconstructed image.
Loss implementation. All compared losses are scaled with a coefficient of 0.01 to match the order of magnitudes of the L2 losses before combining with L2. For our MS-Glance and Global Glance, we don’t assume any prior when selecting .
Result. The qualitative result is shown in Fig. 4. LPIPS enhances natural image reconstruction but fails in the case of MRI. Both SSIM and Local Glance improve local structures, highlighted by the red boxes around the room and the woman’s double-fold eyelids. However, SSIM introduces color distortion in the MRI reconstruction. S3IM and Global Glance improve overall image reconstruction, with Global Glance showing fewer losses, particularly evident in the error map of the last row. MSGlance combines the strengths of Local and Global Glance, achieving better structural reconstruction. The quantitative performance is shown in Table 1, further demonstrating the effectiveness of MS-Glance over others in all three scenarios. Specifically, local and global Glance show improvement separately and compensate for each other in MS-Glance.
We further visualize the SIREN regression process supervised by MS-Glance and other methods. Using a classic RGB image of shape , Astronaut666https://scikit-image.org/docs/stable/api/skimage.data.html, we plot PSNR and SSIM at each step of the SIREN regression, as shown in Fig. 3. MS-Glance consistently outperforms existing loss functions throughout the entire training process. Additional qualitative results using Astronaut are provided in the supplementary materials.
4.2 Undersampled MRI reconstruction
In this section, we show the effectiveness of Glance in a real-world application, undersampled MRI reconstruction. We use a popular recurrent learning backbone, Dilated Residual Dense Network (DRDN)[87] for training and use the standard loss for training image-domain MRI reconstruction networks, L1[81], as the baseline.
Dataset IXI - Brain FastMRI - Knee Acceleration Rate 5x 7x 5x 7x PSNR (dB) SSIM PSNR (dB) SSIM PSNR (dB) SSIM PSNR (dB) SSIM L1 30.973 0.9512 30.651 0.9481 33.196 0.9068 31.117 0.8762 +LPIPS[84] 30.831 0.9496 30.262 0.9440 32.671 0.8974 30.704 0.8577 +SSIM[74] 30.704 0.8577 30.592 0.9485 33.191 0.9078 31.363 0.8835 +S3IM[76] 31.017 0.9513 30.681 0.9484 33.236 0.9071 31.344 0.8794 +Global Glance 31.122 0.9524 30.711 0.9485 33.253 0.9073 31.454 0.8804 +Local Glance 31.346 0.9537 30.813 0.9483 33.187 0.9038 31.352 0.8776 +MS-Glance 31.434 0.9535 30.865 0.9485 33.299 0.9072 31.476 0.8783
Model Implementation.DRDN’s input is a 2-channel undersampled MRI image and the output is a 2-channel reconstructed image. It stacks densely connected atrous layers and thus has a large receptive field while preserving image details. We use the official implementation777https://github.com/bbbbbbzhou/DuDoRNet and change its recurrent number into 3 due to the GPU memory limit. DRDN is trained with an Adam optimizer and a learning rate of 0.0001 for 40 epochs for IXI and 20 epochs for FastMRI, and the model checkpoints used for evaluation are the one that produces the best PSNR on the validation set.
Dataset and simulation.We use Proton Density weighted (PD-w) MRI from one simulation dataset, the IXI brain dataset, and one raw dataset, the FastMRI[81] knee dataset.For IXI, we take 576 Volumes and uniformly sample slice-wise and then split volume-wise, resulting in 2455 slices for training and 700 slices for testing (matrix size = 256256). The FastMRI dataset includes 1,172 volumes of single-coil complex-valued PD-w k-space. We use the official dataset split and drop the first and last five noisy slices from each volume, resulting in 25012 training slices and 5145 testing slices (matrix size = 320320). Image normalization is done by normalizing image magnitude into [0,1].
We simulate the Uniform 1D undersampling using the random mask generation function from the official implementation of [81]. The undersampled images are obtained by applying the Fourier Transform to the corresponding ground truth images, masking with a randomly generated uniform 1D mask, and applying the inverse Fourier Transform. We train DRDN under multiple acceleration rates, including and , and evaluate under the corresponding rates. The Auto-Calibration region ratio is set to . We show a 7x undersampled image (Zero Filling) in Fig. 5. It is reconstructed directly from undersampled k-space by inverse Fourier Transform without any learning. It contains strong visual aliasing artifacts and is blurred.
Loss implementation.All compared losses are scaled with a coefficient of 0.1 to match the order of magnitudes of the L2 losses before combined with L2. For our MS-Glance and Global Glance, we use air prior when selecting and set as 0.01.
Result. Fig. 5 presents the qualitative results. The red box highlights a portion of the brain cortex, indicated by the red arrow. Among all, only MS-Glance and LPIPS successfully reconstruct the structure from the blurry input. However, LPIPS tends to introduce hallucinations, such as the vertical line on the right of the ‘+LPIPS’ text which is not in the Ground Truth. The quantitative results are provided in Table 2 and show similar behaviors to INR image fitting. MS-Glance achieves the best PSNR among all.
5 Ablation studies
PSNR (dB) SSIM Time (s) 34.744 0.9430 13.37 (ours) 35.249 0.9493 20.71 35.119 0.9470 29.91 (whole) 35.029 0.9464 76.78
PSNR (dB) SSIM Time (s) 35.096 0.9474 9.94 35.247 0.9491 10.19 (ours) 35.249 0.9493 20.71 35.188 0.9485 54.77
Acc. rates 5x 7x PSNR (dB) SSIM PSNR (dB) SSIM Ours 31.434 0.9535 30.865 0.9485 w/o air prior 31.308 0.9519 30.693 0.9478
5.1 The selection of and
MS-Glance extracts the global context of an image by randomly retrieving a subset of pixels. is the shape of Glance vectors. We show the effect of using different values when fitting the INR on the CoCo dataset in Table 3 and Table 4 for and , respectively. Table 3 shows that using the whole set of pixels is slow and does not lead to performance gain and our selection leads to the best performance. Table 4 shows that our selection of achieves the best performance while achieves the best speed-performance trade-off.
5.2 Effectiveness of the rule-based pixel selection
The construction of can integrate perceptual prior, and we have shown an example of using MRI air prior. To validate the effectiveness of MRI air prior, we evaluate the performance of training undersampled MRI networks with and without it. The experiments are carried out on the FastMRI dataset under two acceleration rates. Table 5 presents the performance of MS-Glance with and without the MRI air prior. This validates its beneficial effect.
6 Discussion
6.1 MS-Glance and SSIM
SSIM[74] computes image similarity window-wise and emphasizes the center pixel with a Gaussian kernel. Given two windows and , their similarity is the product of three statistics: luminance , contrast , and structure :
where are the mean of and , are the variance of and is the covariance of . resembles the formulation of the Glance Index. However, the construction of Global Glance vectors gathers pixels across the image, rather than in local windows like SSIM. Additionally, when we combine the and of SSIM with our Glance Index, we witness a performance drop. We argue that the non-semantic image context we try to capture is highly correlated with image structure and thus is better measured by our proposed Glance Index unitarily. Moreover, MS-Glance does not assume locality or center for each window, since the image context is statistics-based and does not have a focus center. Ablation studies about the kernel selection can be found in the supplementary.
6.2 Future Work
MS-Glance incorporates multi-scale, non-semantic image context into supervised image reconstruction. It can be deployed in other low-level vision tasks such as super-resolution, deblurring, and compression. Also, the stochasticity property of Global Glance allows for integrating more perception priors. Future work includes combining MS-Glance with other explicit appearance priors[43, 42] and learned implicit priors such as pixel correlation[32, 78].
7 Conclusion
We have shown how to leverage the non-semantic image context, which can be captured by human vision, for supervising image reconstruction networks. We propose MS-Glance, a novel bio-inspired multi-scale descriptor of non-semantic image context and its loss form. We demonstrate its effectiveness in improving image reconstruction through two settings: INR fitting and undersampled MRI reconstruction. Finally, extensive experiments on both natural and medical datasets show that the MS-Glance loss outperforms existing loss functions used in image restoration.
8 Acknowledgement
This work is supported by Natural Science Foundation of China under Grant 62271465, Suzhou Basic Research Program under Grant SYG202338, and Open Fund Project of Guangdong Academy of Medical Sciences, China (No. YKY-KF202206).
References
- [1]Chong Cao and Haizhou Ai.Non-semantic facial parts for face verification.In 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pages 725–729. IEEE, 2015.
- [2]Chentao Cao etal.High-frequency space diffusion model for accelerated mri.IEEE Transactions on Medical Imaging, 43(5):1853–1865, 2024.
- [3]Anas Charroud, Ali Yahyaouy, Karim ElMoutaouakil, and Uche Onyekpe.Localisation and mapping of self-driving vehicles based on fuzzy k-means clustering: A non-semantic approach.In 2022 International Conference on Intelligent Systems and Computer Vision (ISCV), pages 1–8. IEEE, 2022.
- [4]Zeyuan Chen, Yinbo Chen, Jingwen Liu, Xingqian Xu, Vidit Goel, Zhangyang Wang, Humphrey Shi, and Xiaolong Wang.Videoinr: Learning video implicit neural representation for continuous space-time super-resolution.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2047–2057, 2022.
- [5]Jing Cheng, Haifeng Wang, Leslie Ying, and Dong Liang.Model learning: Primal dual networks for fast mr imaging.In MICCAI, 2019.
- [6]Hyungjin Chung, Byeongsu Sim, and JongChul Ye.Come-closer-diffuse-faster: Accelerating conditional diffusion models for inverse problems through stochastic contraction.In CVPR, 2022.
- [7]Hyungjin Chung and JongChul Ye.Score-based diffusion models for accelerated mri.Medical Image Analysis, page 102479, 2022.
- [8]ElizabethK Cole, Frank Ong, ShreyasS Vasanawala, and JohnM Pauly.Fast unsupervised mri reconstruction without fully-sampled ground truth data using generative adversarial networks.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3988–3997, 2021.
- [9]Pak LunKevin Ding, Zhiqiang Li, Yuxiang Zhou, and Baoxin Li.Deep residual dense u-net for resolution enhancement in accelerated mri acquisition.In Medical Imaging 2019: Image Processing, volume 10949, pages 110–117. SPIE, 2019.
- [10]Qiaoqiao Ding and Xiaoqun Zhang.Mri reconstruction by completing under-sampled k-space data with learnable fourier interpolation.In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 676–685. Springer, 2022.
- [11]Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby.An image is worth 16x16 words: Transformers for image recognition at scale.In International Conference on Learning Representations, 2021.
- [12]Emilien Dupont, Adam Goliński, Milad Alizadeh, YeeWhye Teh, and Arnaud Doucet.Coin: Compression with implicit neural representations.arXiv preprint arXiv:2103.03123, 2021.
- [13]Taejoon Eo etal.Kiki-net: cross-domain convolutional neural networks for reconstructing undersampled magnetic resonance images.Magnetic resonance in medicine, 80(5):2188–2201, 2018.
- [14]Zhiwen Fan, Liyan Sun, Xinghao Ding, Yue Huang, Congbo Cai, and John Paisley.A segmentation-aware deep fusion network for compressed sensing mri.In Proceedings of the European Conference on Computer Vision (ECCV), pages 55–70, 2018.
- [15]Chen Gao, Yipeng Wang, Changil Kim, Jia-Bin Huang, and Johannes Kopf.Planar reflection-aware neural radiance fields.arXiv preprint arXiv:2411.04984, 2024.
- [16]Ziqi Gao and S.Kevin Zhou.Mrpd: Undersampled mri reconstruction by prompting a large latent diffusion model, 2024.
- [17]LeonA Gatys, AlexanderS Ecker, and Matthias Bethge.Image style transfer using convolutional neural networks.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
- [18]Vahid Ghodrati, Jiaxin Shao, Mark Bydder, Ziwu Zhou, Wotao Yin, Kim-Lien Nguyen, Yingli Yang, and Peng Hu.Mr image reconstruction using deep learning: Evaluation of network structure and loss functions.Quantitative Imaging in Medicine and Surgery, 9:1516–1527, 09 2019.
- [19]Sharath Girish, Abhinav Shrivastava, and Kamal Gupta.Shacira: Scalable hash-grid compression for implicit neural representations.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17513–17524, October 2023.
- [20]Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio.Generative adversarial networks.Communications of the ACM, 63(11):139–144, 2020.
- [21]Michelle Greene and Aude Oliva.The briefest of glances the time course of natural scene understanding.Psychological science, 20:464–72, 05 2009.
- [22]MarkA Griswold, , PeterM Jakob, RobinM Heidemann, Mathias Nittka, Vladimir Jellus, Jianmin Wang, Berthold Kiefer, and Axel Haase.Generalized autocalibrating partially parallel acquisitions (grappa).Magnetic Resonance in Medicine, 47(6):1202–1210, 2002.
- [23]Yu Guan, Chuanming Yu, Zhuoxu Cui, Huilin Zhou, and Qiegen Liu.Correlated and multi-frequency diffusion modeling for highly under-sampled mri reconstruction.IEEE Transactions on Medical Imaging, 2024.
- [24]Alper Güngör etal.Adaptive diffusion priors for accelerated mri reconstruction.Medical Image Analysis, 88:102872, 2023.
- [25]RalphN Haber.Visual perception.Annual review of psychology, 1978.
- [26]JustinP. Haldar, Diego Hernando, and Zhi-Pei Liang.Compressed-sensing mri with random encoding.IEEE Transactions on Medical Imaging, 30(4):893–903, 2011.
- [27]Xiaodan Hu, MohamedA Naiel, Alexander Wong, Mark Lamm, and Paul Fieguth.Runet: A robust unet architecture for image super-resolution.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
- [28]Jiahao Huang etal.Swin transformer for fast mri.Neurocomputing, 493:281–304, 2022.
- [29]Jiahao Huang etal.Cdiffmr: Can we replace the gaussian noise with k-space undersampling for fast mri?In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 3–12. Springer, 2023.
- [30]Qiaoying Huang, Dong Yang, Jingru Yi, Leon Axel, and Dimitris Metaxas.Fr-net: Joint reconstruction and segmentation in compressed sensing cardiac mri.In Functional Imaging and Modeling of the Heart: 10th International Conference, FIMH 2019, Bordeaux, France, June 6–8, 2019, Proceedings 10, pages 352–360. Springer, 2019.
- [31]ChangMin Hyun, HwaPyung Kim, SungMin Lee, Sungchul Lee, and JinKeun Seo.Deep learning for undersampled mri reconstruction.Physics in Medicine & Biology, 63(13):135007, 2018.
- [32]MdTauhidul Islam and Lei Xing.Deciphering the feature representation of deep neural networks for high-performance ai.IEEE transactions on pattern analysis and machine intelligence, PP, 02 2024.
- [33]Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and AlexeiA Efros.Image-to-image translation with conditional adversarial networks.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
- [34]Ajil Jalal etal.Robust compressed sensing mri with deep generative priors.In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J.Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume34, pages 14938–14954. Curran Associates, Inc., 2021.
- [35]Liming Jiang, Bo Dai, Wayne Wu, and ChenChange Loy.Focal frequency loss for image reconstruction and synthesis.In Proceedings of the IEEE/CVF international conference on computer vision, pages 13919–13929, 2021.
- [36]KyongHwan Jin etal.Deep convolutional neural network for inverse problems in imaging.IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
- [37]MuhammadOsama Khan and Yi Fang.Implicit neural representations for medical imaging segmentation.In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 433–443. Springer, 2022.
- [38]Yilmaz Korkmaz, Tolga Cukur, and VishalM Patel.Self-supervised mri reconstruction with unrolled diffusion models.In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 491–501. Springer, 2023.
- [39]Brett Levac, Ajil Jalal, and JonathanI. Tamir.Accelerated motion correction for mri using score-based generative models.In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pages 1–5, 2023.
- [40]Dong Liang, Bo Liu, Jiunjie Wang, and Leslie Ying.Accelerating sense using compressed sensing.Magnetic Resonance in Medicine, 62(6):1574–1584, 2009.
- [41]Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C.Lawrence Zitnick, and Piotr Dollár.Microsoft coco: Common objects in context, 2015.
- [42]Peirong Liu, Oula Puonti, Xiaoling Hu, DanielC. Alexander, and JuanE. Iglesias.Brain-id: Learning contrast-agnostic anatomical representations for brain imaging.In European Conference on Computer Vision (ECCV), 2024.
- [43]Peirong Liu, Oula Puonti, Annabel Sorby-Adams, W.Taylor Kimberly, and JuanE. Iglesias.PEPSI: Pathology-enhanced pulse-sequence-invariant representations for brain MRI.In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2024.
- [44]Xianzhe Liu, Hongwei Du, Jinzhang Xu, and Bensheng Qiu.Dbgan: A dual-branch generative adversarial network for undersampled mri reconstruction.Magnetic Resonance Imaging, 89:77–91, 2022.
- [45]Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.Deep learning face attributes in the wild.In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- [46]Michael Lustig, DavidL Donoho, JuanM Santos, and JohnM Pauly.Compressed sensing mri.IEEE signal processing magazine, 25(2):72–82, 2008.
- [47]Ben Mildenhall, PratulP Srinivasan, Matthew Tancik, JonathanT Barron, Ravi Ramamoorthi, and Ren Ng.Nerf: Representing scenes as neural radiance fields for view synthesis.Communications of the ACM, 65(1):99–106, 2021.
- [48]Nils Murrugarra-Llerena and Adriana Kovashka.Asking friendly strangers: non-semantic attribute transfer.In Proceedings of the AAAI Conference on Artificial Intelligence, volume32, 2018.
- [49]QuanH Nguyen and WilliamJ Beksi.Single image super-resolution via a dual interactive implicit neural network.In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4936–4945, 2023.
- [50]Aude Oliva and Antonio Torralba.Modeling the shape of the scene: A holistic representation of the spatial envelope.International Journal of Computer Vision, 42:145–175, 2001.
- [51]Aude Oliva and Antonio Torralba.The role of context in object recognition.Trends in cognitive sciences, 11(12):520–527, 2007.
- [52]Batu Ozturkler etal.Regularization by denoising diffusion process for MRI reconstruction.In NeurIPS 2023 Workshop on Deep Learning and Inverse Problems, 2023.
- [53]Batu Ozturkler etal.Smrd: Sure-based robust mri reconstruction with diffusion models.In MICCAI. Springer, 2023.
- [54]VishalM Patel, Ray Maleh, AnnaC Gilbert, and Rama Chellappa.Gradient-based image recovery methods from incomplete fourier measurements.IEEE Transactions on Image Processing, 21(1):94–105, 2011.
- [55]Cheng Peng etal.Towards performant and reliable undersampled mr reconstruction via diffusion model sampling.MICCAI, 2022.
- [56]MaryC Potter.Meaning in visual search.Science, 187(4180):965–966, 1975.
- [57]Aniket Pramanik and Mathews Jacob.Joint calibrationless reconstruction and segmentation of parallel mri.In European Conference on Computer Vision, pages 437–453. Springer, 2022.
- [58]KlaasP Pruessmann etal.Sense: sensitivity encoding for fast mri.Magnetic Resonance in Medicine, 42(5):952–962, 1999.
- [59]Patrick Putzky etal.i-rim applied to the fastmri challenge.arXiv preprint arXiv:1910.08952, 2019.
- [60]Chen Qin etal.Convolutional recurrent neural networks for dynamic mr image reconstruction.IEEE transactions on medical imaging, 38(1):280–290, 2018.
- [61]TranMinh Quan, Thanh Nguyen-Duc, and Won-Ki Jeong.Compressed sensing mri reconstruction using a generative adversarial network with a cyclic loss.IEEE transactions on medical imaging, 37(6):1488–1497, 2018.
- [62]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.Learning transferable visual models from natural language supervision, 2021.
- [63]Olaf Ronneberger, Philipp Fischer, and Thomas Brox.U-net: Convolutional networks for biomedical image segmentation.In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
- [64]Jo Schlemper etal.A deep cascade of convolutional neural networks for mr image reconstruction.In IPMI, 2017.
- [65]PhilippeG. Schyns and Aude Oliva.From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition.Psychological Science, 5(4):195–200, 1994.
- [66]Vincent Sitzmann, JulienN.P. Martel, AlexanderW. Bergman, DavidB. Lindell, and Gordon Wetzstein.Implicit neural representations with periodic activation functions.In Proc. NeurIPS, 2020.
- [67]Anuroop Sriram etal.End-to-end variational networks for accelerated mri reconstruction.In Medical Image Computing and Computer Assisted Intervention 2020, pages 64–73. Springer, 2020.
- [68]Kent Stevens.The vision of david marr.Perception, 41:1061–72, 09 2012.
- [69]Yannick Strümpler, Janis Postels, Ren Yang, LucVan Gool, and Federico Tombari.Implicit neural representations for image compression.In European Conference on Computer Vision, pages 74–91. Springer, 2022.
- [70]Jingwen Su, Boyan Xu, and Hujun Yin.A survey of deep learning approaches to image restoration.Neurocomputing, 487:46–65, 2022.
- [71]MaartenL Terpstra, Matteo Maspero, Alessandro Sbrizzi, and CornelisAT vanden Berg.-loss: A symmetric loss function for magnetic resonance imaging reconstruction and image registration with deep learning.Medical Image Analysis, 80:102509, 2022.
- [72]Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang.Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction.arXiv preprint arXiv:2106.10689, 2021.
- [73]Shanshan Wang etal.Accelerating magnetic resonance imaging via deep learning.In 2016 IEEE 13th international symposium on biomedical imaging (ISBI), pages 514–517. IEEE, 2016.
- [74]Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli.Image quality assessment: from error visibility to structural similarity.IEEE Transactions on Image Processing, 13(4):600–612, 2004.
- [75]Andrew Watson and Albert Ahumada.A standard model for foveal detection of spatial contrast.Journal of vision, 5:717–40, 02 2005.
- [76]Zeke Xie, Xindi Yang, Yujie Yang, Qi Sun, Yixiang Jiang, Haoran Wang, Yunfeng Cai, and Mingming Sun.S3im: Stochastic structural similarity and its unreasonable effectiveness for neural fields.In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17978–17988, 2023.
- [77]Li Xu, JimmyS Ren, Ce Liu, and Jiaya Jia.Deep convolutional neural network for image deconvolution.Advances in neural information processing systems, 27, 2014.
- [78]Rui Yan, MdTauhidual Islam, and Lei Xing.Interpretable discovery of patterns in tabular data via spatially semantic topographic maps.Nature Biomedical Engineering, pages 1–12, 2024.
- [79]Guang Yang etal.Dagan: deep de-aliasing generative adversarial networks for fast compressed sensing mri reconstruction.IEEE transactions on medical imaging, 37(6):1310–1321, 2017.
- [80]Wenming Yang, Xuechen Zhang, Yapeng Tian, Wei Wang, Jing-Hao Xue, and Qingmin Liao.Deep learning for single image super-resolution: A brief review.IEEE Transactions on Multimedia, 21(12):3106–3121, 2019.
- [81]Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, MatthewJ. Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, Marc Parente, KrzysztofJ. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, Nafissa Yakubova, James Pinkerton, Duo Wang, Erich Owens, C.Lawrence Zitnick, MichaelP. Recht, DanielK. Sodickson, and YvonneW. Lui.fastMRI: An open dataset and benchmarks for accelerated MRI, 2018.
- [82]Kai Zhang, LucVan Gool, and Radu Timofte.Deep unfolding network for image super-resolution.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3217–3226, 2020.
- [83]Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun.Nerf++: Analyzing and improving neural radiance fields.arXiv preprint arXiv:2010.07492, 2020.
- [84]Richard Zhang, Phillip Isola, AlexeiA. Efros, Eli Shechtman, and Oliver Wang.The unreasonable effectiveness of deep features as a perceptual metric, 2018.
- [85]Zizhao Zhang, Adriana Romero, MatthewJ Muckley, etal.Reducing uncertainty in undersampled mri reconstruction with active acquisition.In CVPR, 2019.
- [86]Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz.Loss functions for image restoration with neural networks.IEEE Transactions on Computational Imaging, 3(1):47–57, 2017.
- [87]Bo Zhou and SKevin Zhou.Dudornet: learning a dual-domain recurrent network for fast mri reconstruction with deep t1 prior.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4273–4282, 2020.