
08:30-10:30 Session LNCS1-IP1

Zoom Link: Meeting ID: 850 7050 5056, Password: cgi2023

Image Analysis and Processing 1

Lei Li (Copenhagen University, Denmark)
Ruizhong Du (Hebei University, China)
Jing Cui (Hebei University, China)
Mingyue Li (Hebei University, China)
A Multi-label Privacy-preserving Image Retrieval Scheme based on Object Detection for Efficient and Secure Cloud Retrieval

ABSTRACT. With the development of self-media, the burden of client-side computation and storage of massive data has become increasingly heavy. Additionally, considering the presence of sensitive information in images, image owners commonly adopt the practice of encrypting images before storing them in the cloud. However, encrypted image retrieval faces a challenge of striking a balance between security and efficiency. To address this issue, a Multi-label Privacy-preserving Image Retrieval scheme based on Object Detection (MPIR-OD) is proposed. Firstly, image labels are extracted using object detection techniques. Then, frequent itemsets of labels are discovered through mining label association rules, and they are matched and classified with the previously extracted image labels to construct an index. Lastly, the Asymmetric Scalar-product Preserving Encryption (ASPE) is employed to encrypt image feature vectors, ensuring the privacy of the images, and enabling secure K-Nearest Neighbor (KNN) operations using the ASPE algorithm. Compared to existing schemes, the MPIR-OD scheme achieves a reduction in retrieval time of approximately 6 times and an improvement in retrieval accuracy of around 15.
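
The scalar-product-preserving idea behind ASPE can be illustrated with a toy sketch. This is the textbook construction under stated assumptions, not the MPIR-OD implementation: the fixed 3x3 key matrix `M`, the 2-D feature vectors, and the helper names are all hypothetical, and a real deployment would use a random secret matrix of much higher dimension. The server ranks encrypted points by their scalar product with the encrypted query, which orders them by true Euclidean distance without revealing the plaintext vectors.

```python
# Toy ASPE-style secure kNN (illustrative only; the key matrix is an assumption).

def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Hypothetical secret key: an invertible matrix and its inverse (M * M_INV = I).
M = [[1, 2, 0], [0, 1, 3], [0, 0, 1]]
M_INV = [[1, -2, 6], [0, 1, -3], [0, 0, 1]]

def encrypt_point(p):
    """Database side: extend p with -0.5*||p||^2, then multiply by M^T."""
    extended = list(p) + [-0.5 * sum(x * x for x in p)]
    return mat_vec(transpose(M), extended)

def encrypt_query(q, r):
    """Query side: extend q with 1, scale by a positive random r, multiply by M^-1."""
    extended = [r * x for x in q] + [r]
    return mat_vec(M_INV, extended)

def secure_knn(enc_points, enc_query, k):
    """Server side: a larger scalar product means a smaller true distance."""
    scores = [sum(a * b for a, b in zip(ep, enc_query)) for ep in enc_points]
    order = sorted(range(len(enc_points)), key=lambda i: scores[i], reverse=True)
    return order[:k]

points = [(0, 0), (4, 4), (1, 2), (5, 1)]
query = (1, 1)
enc_points = [encrypt_point(p) for p in points]
enc_query = encrypt_query(query, r=2.0)
nearest = secure_knn(enc_points, enc_query, k=2)
print(nearest)  # indices of the two points closest to the query
```

The ranking works because the encrypted scalar product equals r(p·q − 0.5‖p‖²), which decreases as the distance ‖p − q‖ grows.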

Feng Yu (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Zhuohan Xiao (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Zhaoxiang Chen (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Li Liu (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Minghua Jiang (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Xiaoxiao Liu (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Xinrong Hu (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
Tao Peng (School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)
AMCNet: Adaptive Matching Constraint for Unsupervised Point Cloud Registration
PRESENTER: Zhuohan Xiao

ABSTRACT. The registration of 3D point clouds is a crucial challenge in computer vision with numerous applications in robotics, medical imaging and other industries. However, due to the lack of accurate data annotation, the performance of unsupervised point cloud registration networks is often unsatisfactory. In this paper, we propose an unsupervised method based on generating corresponding points and utilizing structural constraints for rigid point cloud registration. The objective is to optimize the similarity matrix using the neighborhood score of matching point pairs, and the feature extractor is designed to capture better features by constraining the structural difference between the source neighborhoods and the predicted neighborhoods. The key components in our approach are a similarity optimization module and a structure variation checking module. In the similarity optimization module, we improve the similarity matrix by adaptively weighting the matching scores of neighbors. Through this method, the spatial information of matching point pairs can be fully utilized, resulting in high-quality corresponding estimations. We observe that the solution of the rigid transformation matrix is easily affected by incorrect matching point pairs, while the predicted point cloud is crucial for constructing accurate correspondences. Therefore, we developed a structure variation checking module to constrain the predicted point cloud and the source point cloud to have similar structural information. Based on the constraints, the extraction network is continuously optimized and adjusted to obtain even better features. The extensive experimental results show that our method achieves state-of-the-art performance when compared with other supervised and unsupervised methods on the ModelNet40 data set, and significantly outperforms previous methods on the real-world indoor 7Scenes data set.

Minghua Jiang (Wuhan Textile University, China)
Shuqing Liu (Wuhan Textile University, China)
Yankang Shi (Wuhan Textile University, China)
Chenghu Du (Wuhan Textile University, China)
Guangyu Tang (Wuhan Textile University, China)
Li Liu (Wuhan Textile University, China)
Tao Peng (Wuhan Textile University, China)
Xinrong Hu (Wuhan Textile University, China)
Feng Yu (Wuhan Textile University, China)
COCCI: Context-Driven Clothing Classification Network
PRESENTER: Shuqing Liu

ABSTRACT. Clothing classification aims to obtain labels for any given clothing item and serves as a fundamental task for clothing retrieval, clothing recommendation, and other related applications. Its potential commercial value has attracted widespread attention from researchers. In this task, there are two inherent challenges: suppressing complex backgrounds outside the clothing region and disentangling the feature entanglement of shape-similar clothing samples. These challenges arise from insufficient attention to key distinctions of clothing, which hinders the accuracy of clothing classification. In addition, the high computational resource requirement of some complex and large-scale models decreases the inference efficiency. To tackle these challenges, we propose a new context-driven clothing classification network (COCCI), which improves inference accuracy while reducing model complexity. First, we design a self-adaptive attention fusion (SAAF) module to enhance category-exclusive clothing features and prevent misclassification by suppressing ineffective features that have confused image contexts. Second, we propose a novel multi-scale feature aggregation (MSFA) module to establish spatial context correlations by using multi-scale clothing features. This helps disentangle feature entanglement among shape-similar clothing samples. Finally, we introduce knowledge distillation to extract reliable teacher knowledge from complex datasets, which helps student models learn clothing features with rich representation information, thereby improving generalization and classification accuracy while reducing model complexity. In comparison to state-of-the-art networks trained with one single model, COCCI shows a significant improvement of 5.47% in top-1 accuracy on the Clothing 1M dataset for images with complex backgrounds. Moreover, COCCI achieves an improvement of up to 6.4% in top-1 accuracy on the Deepfashion dataset. Experimental results demonstrate that our method achieves SOTA performance on the widely-used clothing classification benchmark.

Lei Li (Computer Science department of Copenhagen University, Denmark)
Hierarchical Edge Aware Learning for 3D Point Cloud

ABSTRACT. This paper proposes an innovative approach to Hierarchical Edge Aware 3D Point Cloud Learning (HEA-Net) that seeks to address the challenges of noise in point cloud data, and improve object recognition and segmentation by focusing on edge features. In this study, we present an innovative edge-aware learning methodology, specifically designed to enhance point cloud classification and segmentation. Drawing inspiration from the human visual system, the concept of edge-awareness has been incorporated into this methodology, contributing to improved object recognition while simultaneously reducing computational time. Our research has led to the development of an advanced 3D point cloud learning framework that effectively manages object classification and segmentation tasks. A unique fusion of local and global network learning paradigms has been employed, enriched by edge-focused local and global embeddings, thereby significantly augmenting the model's interpretative prowess. Further, we have applied a hierarchical transformer architecture to boost point cloud processing efficiency, thus providing nuanced insights into structural understanding. Our approach demonstrates significant promise in managing noisy point cloud data and highlights the potential of edge-aware strategies in 3D point cloud learning. The proposed approach is shown to outperform existing techniques in object classification and segmentation tasks, as demonstrated by experiments on ModelNet40 and ShapeNet datasets.

Saeed Hadadan (University of Maryland, College Park, United States)
Matthias Zwicker (University of Maryland, College Park, United States)
Neural Differential Radiance Field: Learning the Differential Space Using a Neural Network
PRESENTER: Saeed Hadadan

ABSTRACT. We introduce an adjoint-based inverse rendering method using a Neural Differential Radiance Field, i.e., a neural network representation of the solution of the differential rendering equation. Inspired by neural radiosity techniques, we minimize the norm of the residual of the differential rendering equation to directly optimize our network. The network is capable of outputting continuous, view-independent gradients of the radiance field w.r.t. scene parameters, taking into account differential global illumination effects while keeping memory and time complexity constant in path length. To solve inverse rendering problems, we simultaneously train networks to represent radiance and differential radiance, and optimize the unknown scene parameters.

Yukun Cao (Shanghai University of Electric Power, China)
Jialuo Yan (Shanghai University of Electric Power, China)
Yijia Tang (Nanjing University of Aeronautics and Astronautics, China)
Zhenyi He (Shanghai University of Electric Power, China)
Kangle Xu (Shanghai University of Electric Power, China)
Yu Cheng (Shanghai University of Electric Power, China)
Aware-Transformer: A Novel Pure Transformer-based Model for Remote Sensing Image Captioning

ABSTRACT. Remote sensing image captioning (RSIC) is the task of generating accurate and coherent descriptions of the visual content in remote sensing images. While recent progress has been made in developing CNN-Transformer based models for this task, given the significant scale differences in the visual objects within these images, many existing methods still have some deficiencies in effectively capturing the multiscale visual features of these images. Additionally, applying these visual features directly to a vanilla Transformer architecture may result in the loss of important visual information. To address these challenges, we propose a novel pure Transformer-based model that first utilizes a fine-tuned Swin-Transformer as the encoder to extract multiscale visual features from remote sensing images. Then it introduces an Aware-Transformer as the decoder, which enhances multiscale and multiobject visual information to help generate accurate and detailed captions. To assess the performance of our proposed method, we conducted ablation and comparison experiments on three publicly available RSIC datasets: Sydney-Captions, UCM-Captions, and NWPU-Captions. The results demonstrate that our method outperforms state-of-the-art RSIC models in captioning quality.

Haobing Tian (China Mobile (Suzhou) Software Technology Company Limited, China)
Jingyi Li (China Mobile (Suzhou) Software Technology Company Limited, China)
Qi Yan (China Mobile (Suzhou) Software Technology Company Limited, China)
Yang Zhong (China Mobile (Suzhou) Software Technology Company Limited, China)
Lang Zhang (China Mobile (Suzhou) Software Technology Company Limited, China)
Pengju Jiao (China Mobile (Suzhou) Software Technology Company Limited, China)
Blind image quality assessment method based on DeepSA-Net

ABSTRACT. Blind image quality assessment refers to the accurate prediction of the visual quality of any input image without a reference image. With the rapid growth of the number of images and increasing requirements for image quality, how to assess image quality has become an urgent problem. Complex images are difficult to consider professionally from a single perspective. A blind image quality assessment algorithm based on a deep semantic adaptation network (DeepSA-Net) is proposed. Based on the end-to-end deep learning model, the semantic pre-trained models and multi-resolution adaptive module are added, and the adaptive factor α is proposed to better capture global and local quality information and fuse multi-resolution features to improve the convergence ability and speed of the network. Finally, the quality assessment results of images are obtained by regression. The experiment used the Spearman Correlation Coefficient and Pearson Correlation Coefficient as assessment indicators. The results showed that DeepSA-Net outperformed most current methods in real distortion scene databases and had excellent assessment ability in synthetic distortion databases. In addition, ablation study and different distortion studies were designed to fully validate the effectiveness and feasibility of the DeepSA-Net algorithm.
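
The two assessment indicators named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the score lists are made-up values, the function names are hypothetical, and the rank helper assumes no tied scores. The Pearson coefficient measures linear agreement between predicted and subjective scores, while the Spearman coefficient is the Pearson coefficient computed on ranks, so it measures monotonic agreement.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank positions starting at 1 (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up predicted quality scores and subjective scores for five images.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
subjective = [1.0, 4.0, 9.0, 16.0, 25.0]

plcc = pearson(predicted, subjective)
srocc = spearman(predicted, subjective)
print(round(plcc, 4), round(srocc, 4))
```

In this example the relationship is monotonic but not linear, so the Spearman value is 1 while the Pearson value is slightly lower.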

Shuhuan Zhao (College of Electronic and Information Engineering, Hebei University, China)
Chunrong Wang (College of Electronic and Information Engineering, Hebei University, China)
Shuaiqi Liu (College of Electronic and Information Engineering, Hebei University, China)
Hongfang Cheng (College of Electronic and Information Engineering, Hebei University, China)
Deep Feature Learning for Image-based Kinship Verification

ABSTRACT. Facial image-based kinship verification is one of the challenging tasks in computer vision. This task has many potential applications, such as combating human trafficking, studying human genetics, generating family maps, organizing family photo albums, etc. The main obstacle in practice is that there is always a great difference between the images of parents and children. Therefore, we propose a deep feature learning method (DFLKV) which can extract more discriminative features for kinship verification. For a pair of facial images, we first design a network with multi-scale channel attention for feature extraction; then, we select four methods for feature fusion; finally, we infer kinship based on the fused features. We construct the final loss by jointly adopting the contrastive loss and the binary cross-entropy loss to compute the matching degree for paired samples. The experimental results on the datasets KinFaceW-I, KinFaceW-II, Cornell KinFace and TS KinFace validate the effectiveness of our approach.

Yue Yu (Beijing Institute of Technology, China)
Ding Li (Beijing Institute of Technology, China)
Yulin Yang (Beijing Institute of Technology, China)
Efficient Semantic-Guidance High-resolution Video Matting

ABSTRACT. Video matting has made significant progress in the trimap-based field. However, researchers are increasingly interested in auxiliary-free matting because it is more useful in real-world applications. Semantic features can play an important role in improving video matting results. However, the size and speed of current semantic-guidance methods suffer as a result of over-bloated network architectures. We propose a new efficient semantic-guidance high-resolution video matting network. This network maintains efficiency while improving the comprehension of semantic features. We still apply the convolutional network as the backbone while also employing the transformer in the encoder. The transformer is used as a submodule to provide semantic features to help the convolutional network while ensuring that the network is not overly bloated. Two cross-attention modules are used to implement semantic feature adjustment and guidance. In addition, a channel-wise attention mechanism is introduced in the decoder to improve the representation of semantic features. In comparison to the current state-of-the-art methods, the method proposed in this paper achieves better results while maintaining the speed and efficiency of prediction. We can complete real-time auxiliary-free matting for high-resolution video (4K or HD).

Lei Li (Computer Science department of Copenhagen University, Denmark)
Segment Any Building

ABSTRACT. The identification and segmentation of buildings in remote sensing imagery has consistently been an important point of academic research. This work highlights the effectiveness of using diverse datasets and advanced representation learning models for the purpose of building segmentation in remote sensing images. By fusing various datasets, we have broadened the scope of our learning resources and achieved exemplary performance across several datasets. Our innovative joint training process demonstrates the value of our methodology in various critical areas such as urban planning, disaster management, and environmental monitoring. Our approach, which involves combining dataset fusion techniques and prompts from pre-trained models, sets a new precedent for building segmentation tasks. The results of this study provide a foundation for future exploration and indicate promising potential for novel applications in the building segmentation field.

08:30-10:30 Session LNCS2-IP2

Zoom Link: Meeting ID: 854 6450 7535, Password: cgi2023

Image Analysis and Processing 2

Xiaochao Wang (School of Mathematical Sciences, Tiangong University, China)
Xiaochao Wang (Tiangong University, China)
Qianqian Du (Tiangong University, China)
Xiaodong Tan (Tiangong University, China)
Jianping Hu (Northeast Electric Power University, China)
Ling Du (Tiangong University, China)
Huayan Zhang (Tiangong University, China)
A Novel Zero-Watermarking Algorithm based on Texture Complexity Analysis
PRESENTER: Xiaochao Wang

ABSTRACT. To address the problem that most existing watermarking algorithms cannot effectively resist complex attacks, we propose a novel zero-watermarking algorithm based on texture complexity analysis. First, we calculate the standard deviation map of the host image by the spatially selective texture method and obtain the optimal target regions (OTRs) by clustering the binary standard deviation map. To improve the robustness of the proposed algorithm, we use singular value decomposition (SVD) to extract multiple feature sequences from the OTRs. Then, these robust feature sequences are binarized to generate multiple feature images. For the watermark image, we apply chaotic mapping to encrypt it and ensure the security of the watermark image. Finally, we perform an exclusive-or (XOR) operation on each of the extracted feature images with the encrypted watermark image to construct multiple zero watermarks, which are saved at the Copyright Certification Center to protect the copyright of the image. A large number of experimental results show that the newly proposed algorithm not only has good distinguishability, but also has high robustness to complex attacks. Compared with existing watermarking algorithms, our proposed algorithm has advantages in invisibility, robustness and security.
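
The final XOR step of a zero-watermarking scheme can be sketched on short bit sequences. This is only an illustration under stated assumptions: the abstract does not say which chaotic map is used, so the logistic map and its parameters `x0` and `r` are hypothetical choices here, and the 8-bit `feature` and `watermark` lists stand in for the binarized SVD feature image and the watermark image. The host image is never modified; verification recovers the watermark by XOR-ing the stored zero watermark with features re-extracted from the image.

```python
# Toy zero-watermark construction and verification (all values are made up).

def logistic_bits(n, x0=0.37, r=3.99):
    """Key stream from the logistic map x <- r*x*(1-x), thresholded at 0.5.
    The map and its parameters are assumptions, not the authors' choice."""
    bits, x = [], x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        bits.append(1 if x >= 0.5 else 0)
    return bits

def xor_bits(a, b):
    """Element-wise XOR of two equal-length bit lists."""
    return [p ^ q for p, q in zip(a, b)]

watermark = [1, 0, 1, 1, 0, 0, 1, 0]  # binary watermark image, flattened
feature = [0, 1, 1, 0, 1, 0, 0, 1]    # binarized feature image of the host

# Registration: encrypt the watermark, then XOR it with the feature image.
key = logistic_bits(len(watermark))
encrypted = xor_bits(watermark, key)
zero_watermark = xor_bits(feature, encrypted)  # stored at the certification center

# Verification: re-extract the features, undo the XOR, then decrypt.
recovered = xor_bits(xor_bits(zero_watermark, feature), key)
print(recovered == watermark)
```

If an attack flips some feature bits, the same bits flip in the recovered watermark, which is why the robustness of the feature extraction determines the robustness of the scheme.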

Qianlin Li (Shenzhen University, China)
Xiaoyan Zhang (Shenzhen University, China)
Video-Based Self-Supervised Human Depth Estimation
PRESENTER: Xiaoyan Zhang

ABSTRACT. In this paper, we propose a video-based method for self-supervised human depth estimation, aiming at the problems of joint point distortion in human depth and insufficient utilization of 3D information in video-based depth estimation. We use the relative ordinal relations between human joint point pairs to deal with the problem of joint point distortion. Meanwhile, a temporal correlation module is proposed to focus on the temporal correlation between past and present frames, taking into account the influence of temporal characteristics in the video sequence. A hierarchical structure is adopted to fuse adjacent features, thus fully mining the 3D information in the video. The experimental results show that this model significantly improves the human depth estimation performance, especially at the joints.

Yin Wang (Tiangong University, China)
Wenjing Cao (Tiangong University, China)
Nan Sheng (Tiangong University, China)
Huiying Shi (Tiangong University, China)
Congwei Guo (Tiangong University, China)
Yongzhen Ke (Tiangong University, China)
TSC-Net: Theme-Style-Color guided Artistic Image Aesthetics Assessment Network

ABSTRACT. Image aesthetic assessment is a hot issue in current research, but less research has been done in the art image aesthetic assessment field, mainly due to the lack of large-scale artwork datasets. The recently proposed BAID dataset fills this gap and allows us to delve into the aesthetic assessment methods of artworks, and this research will contribute to the study of artworks and can also be applied to real-life scenarios, such as art exams, to assist in judging. In this paper, we propose a new method, TSC-Net (Theme-Style-Color guided Artistic Image Aesthetics Assessment Network), which extracts image theme information, image style information, and color information and fuses general aesthetic information to assess art images. Experiments show that our proposed method outperforms existing methods using the BAID dataset.

Jie Sun (Zhejiang Gongshang University, China)
Yan Tian (Zhejiang Gongshang University, China)
Jialei Wang (Shining 3D Tech Co., Ltd., China)
Zhaocheng Xu (Massey University, New Zealand)
Hao Wang (Zhejiang Gongshang University, China)
Zhaoyi Jiang (Zhejiang Gongshang University, China)
Xun Wang (Zhejiang Gongshang University, China)
Weakly Supervised Method for Domain Adaptation in Instance Segmentation

ABSTRACT. Instance segmentation is an active research area in the signal processing field. The domain adaptation of a segmentation model can be improved by introducing supervision signals from a target dataset. However, manual annotation is tedious and time-consuming, and self-training contains too much pseudolabel noise. Inspired by weakly supervised methods, we propose a method to handle these domain adaptation challenges by limited verification signals. Labels of relevant samples are updated by label propagation. First, we construct semantic trees to explore the relation between samples by using a clustering method. Then, we verify and propagate reliable pseudolabels to their corresponding unreliable labels, which improves our instance segmentation model by employing the updated samples. Experiments on public datasets demonstrate that the proposed approach is competitive with state-of-the-art approaches.

Sheng Yu (Beijing University Of Technology, China)
Fei Ye (Jilin Jianzhu University, China)
Op-PSA: an instance segmentation model for occlusion of garbage

ABSTRACT. With the increasing emphasis on green development, garbage classification has become one of the important elements of green development. However, in scenarios where garbage stacking occurs, the task of segmenting highly overlapping objects is difficult because the bottom garbage is in an obscured state and its contours and obscured boundaries are usually difficult to distinguish. In this paper, we propose an Op-PSA model, which uses the HTC model as the baseline model and improves the modeling of the backbone network and the region of interest using an attention model and an occlusion perception model. The Op-PSA model constructs the image as two overlapping layers and uses the two-layer structure to explicitly model the occluding and occluded objects, so that the boundaries of the occluding and occluded objects are naturally decoupled, and their interactions are considered in the mask regression. It is experimentally verified that the model can effectively detect the masked garbage and improve the detection accuracy of the masked garbage.

Hanlin Liu (Ningbo University, China)
Huaying Hao (Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, China)
Yuhui Ma (Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, China)
Lijun Guo (Ningbo University, China)
Yitian Zhao (Cixi Institute of Biomedical Engineering, Ningbo Institute of Materials Technology and Engineering, China)
SPC-Net: structure-aware pixel-level contrastive learning network for OCTA A/V segmentation and differentiation
PRESENTER: Huaying Hao

ABSTRACT. Recent studies have indicated that morphological changes in retinal vessels are associated with many ophthalmic diseases, which have different impacts on arteries and veins (A/V). To this end, retinal vessel segmentation and further A/V classification are essential for quantitative analysis of related diseases. OCTA is a new non-invasive vascular imaging technique that provides visualization of microvasculatures with higher resolution than the traditional fundus imaging modality. Recently, the task of A/V classification has attracted a lot of attention in the field of OCTA imaging. However, there exist two main challenges in this task. On one hand, there is a lack of intensity information in OCTA images to differentiate between arteries and veins. On the other hand, signal fluctuations during OCTA imaging could also bring about vessel discontinuity. These challenges limit the performance of A/V classification in OCTA images. In this paper, we propose a novel Structure-aware Pixel-level Contrastive learning network (SPC-Net) for A/V classification. In the proposed SPC-Net, a latent alignment-based network is first utilized to produce a vessel segmentation map in the original OCTA images. The introduction of latent alignment could guide the model in learning more contextual information to obtain more continuous vessel segmentation results. Then a pixel-level contrast learning-based network is used to further differentiate between arteries and veins according to the topology of vessels. This network adopts a novel pixel-level contrast learning topology loss to accurately classify the vessel pixels into arteries and veins by taking full account of global semantic similarity. The experimental results demonstrate the superiority of our method compared with the existing state-of-the-art methods on one public OCTA dataset and one in-house OCTA dataset.

Afifa Khaled (Huazhong University of Science and Technology, China)
Taher Ghaleb (University of Ottawa, Canada)
MRI-GAN: Generative Adversarial Network for Brain Segmentation
PRESENTER: Afifa Khaled

ABSTRACT. Segmentation is an important step in medical imaging. In particular, machine learning, especially deep learning, has been widely used to efficiently improve and speed up the segmentation process in clinical practices of MRI brain images. Despite the acceptable segmentation results of multi-stage models, little attention was paid to the use of deep learning algorithms for brain image segmentation, which could be due to the lack of training data. Therefore, in this paper, we propose MRI-GAN, a Generative Adversarial Network (GAN) model that performs segmentation of MRI brain images. Our model enables the generation of more labeled brain images from existing labeled and unlabeled images. Our segmentation targets brain tissue images, including white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). We evaluate the performance of the MRI-GAN model using a commonly used evaluation metric, which is the Dice Coefficient (DC). Our experimental results reveal that our proposed model significantly improves segmentation results compared to the standard GAN model while taking shorter training time.
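
For readers unfamiliar with the metric, the Dice Coefficient for one tissue class is 2|A ∩ B| / (|A| + |B|), where A is the set of voxels predicted as that class and B is the set labeled as that class in the ground truth. The sketch below is a minimal illustration on made-up flattened label lists; the function name and the data are hypothetical and are not taken from the MRI-GAN evaluation.

```python
def dice_coefficient(pred, target, label):
    """Dice Coefficient for a single class over two equal-length label lists."""
    pred_mask = [p == label for p in pred]
    target_mask = [t == label for t in target]
    intersection = sum(p and t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    # By convention, return 1.0 when the class is absent from both lists.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Made-up predicted and ground-truth tissue labels for six voxels.
pred = ["WM", "WM", "GM", "CSF", "GM", "CSF"]
target = ["WM", "GM", "GM", "CSF", "GM", "WM"]

for tissue in ("WM", "GM", "CSF"):
    print(tissue, round(dice_coefficient(pred, target, tissue), 3))
```

A value of 1 means perfect overlap for that tissue class and 0 means no overlap; per-class scores are often averaged to give a single figure.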

Jiamin Sun (Ocean University of China, China)
Zhongjie Zhu (Zhejiang Wanli University, China)
Yongqiang Bai (Zhejiang Wanli University, China)
Yuer Wang (Zhejiang Wanli University, China)
Rong Zhang (Zhejiang Wanli University, China)
Fast Prediction of Ternary Tree Partition for Efficient VVC Intra Coding

ABSTRACT. In versatile video coding (VVC) intra coding, the partition pattern depends on the rate-distortion optimization process, which is time-consuming and has a great impact on the overall coding efficiency. Hence, in this paper, a fast decision mechanism is proposed for ternary tree partition based on the LightGBM model, aiming to improve the decision-making efficiency by skipping the calculation process of rate-distortion cost. Firstly, five features of each coding unit (CU) are selected based on their importance to the optimal partition pattern. Secondly, the selected five features are employed to train the LightGBM models and optimize the parameters. Finally, the trained models are embedded into the VTM 4.0 platform to predict whether to use or skip the ternary tree partition pattern for each CU. Theoretically, the proposed mechanism can effectively reduce the VVC intra coding complexity. Experiments are conducted and the results show that the proposed scheme can save 46.46% encoding time with only 0.56% BDBR increase and 0.03% BD-PSNR decrease compared with VTM 4.0, outperforming most of the existing major methods.

Kai Liu (Shanghai Jiao Tong University, China)
Qingyang Wu (Shanghai Jiao Tong University, China)
Mengkun Xie (Shanghai Jiao Tong University, China)
Large GAN is all you need

ABSTRACT. Sketch-to-portrait conversion is an emerging research area that aims to transform rough facial line sketches into highly detailed and realistic portrait images. This paper presents a comprehensive study on the impact of different loss functions and data augmentation techniques in achieving superior results using the U-Net256 network architecture. The study explores the effects of Mean Squared Error (MSE) loss, L1 loss, Generative Adversarial Network (GAN) loss, and the number of parameters on the quality of the generated portrait images.

Experimental results demonstrate that the choice of loss function significantly influences the perceptual quality and accuracy of the converted portraits. While both MSE and L1 loss contribute to capturing the overall structure, GAN loss excels in generating fine-grained details. Moreover, a trade-off is observed between the number of parameters and image quality, with higher parameter counts resulting in more intricate outputs but increased computational complexity.

In conclusion, this paper offers valuable insights into the sketch-to-portrait conversion task, shedding light on the effects of different loss functions and data augmentation techniques. The findings contribute to the advancement of sketch-to-portrait conversion systems, pushing the boundaries of realism and detail in generated portrait images.

We finally reached an FID value of 0.2184, ranking second on the CGI-PSG2023 leaderboard as of May 21st.

All code is open-source and can be found in

Song Liang (School of Information and Control Engineering, China University of Mining and Technology, China)
Ruihang Liu (Xuhai College, China University of Mining and Technology, China)
Jiansheng Qian (School of Information and Control Engineering, China University of Mining and Technology, China)
EAID: An Eye-tracking based Advertising Image Dataset with Personalized Affective Tags

ABSTRACT. Contrary to natural images with randomized content, advertisements contain abundant emotion-eliciting manufactured scenes and multi-modal visual elements with highly related semantics. However, little research has evaluated the interrelationships of advertising vision and affective perception. The absence of advertising data sets with affective labels and visual attention benchmarks is one of the most pressing issues that have to be addressed. Meanwhile, growing evidence indicates that eye movements can reveal the internal states of human minds. Inspired by these, we use a high-precision eye tracker to record the eye-movement data of 57 subjects when they observe 1000 advertising images. 7-score opinion ratings for the five advertising attributes (i.e., ad liking, emotional, aesthetic, functional, and brand liking) are then collected. We further make a preliminary analysis of the correlation among advertising attributes, subjects’ visual attention, eye movement characteristics, and personality traits, obtaining a series of enlightening conclusions. To the best of our knowledge, the proposed dataset is the largest advertising image dataset based on eye tracking and with multiple personalized affective tags. It provides a new exploration space and data foundation for the multimedia visual analysis and affective computing communities. The data are available at:

10:30-11:00 Coffee Break
11:00-12:30 Session LNCS3-IR

Zoom Link:        Meeting ID: 850 7050 5056, Password: cgi2023

Image Restoration and Enhancement

Daisuke Iwai (Osaka University, Japan)
Mingfu Jiang (Faculty of Applied Sciences, Macao Polytechnic University, China, Macao)
Chenzhi You (College of Aerospace Engineering, Nanjing University of Aeronautics and Astronautics, China)
Mingwei Wang (Department of Cardiovascular Medicine, Affiliated Hospital of Hangzhou Normal University, China)
Heye Zhang (School of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China)
Zhifan Gao (School of Biomedical Engineering, Sun Yat-sen University, Shenzhen, China)
Dawei Wu (College of Aerospace Engineering, Nanjing University of Aeronautics and Astronautics, China)
Tao Tan (Faculty of Applied Sciences, Macao Polytechnic University, China, Macao)
Controllable Deep Learning Denoising Model for Ultrasound Images Using Synthetic Noisy Image
PRESENTER: Mingfu Jiang

ABSTRACT. Medical ultrasound imaging has gained widespread prevalence in human muscle and internal organ diagnosis. However, defects in the circuitry during image acquisition, operating methods, defects in the image signal transmission process or other objective factors can lead to the occurrence of speckle noise and distortion in ultrasound images. These issues not only make it challenging for doctors to diagnose diseases but can also pose difficulties in image post-processing. While traditional denoising methods are time-consuming, they are also not effective in removing speckle noise while retaining image details, leading to potential misdiagnosis. Therefore, there is a significant need to accurately and quickly denoise medical ultrasound images to enhance image quality. In this paper, we propose a flexible and lightweight deep learning denoising method for ultrasound images. Initially, we utilize a considerable number of natural images to train the convolutional neural network for acquiring a pre-trained denoising model. Next, we employ the plane-wave imaging technique to generate simulated noisy ultrasound images for further transfer learning of the pre-trained model. As a result, we obtain a non-blind, lightweight, fast, and accurate denoiser. Experimental results demonstrate the superiority of our proposed method in terms of denoising speed, flexibility, and effectiveness compared to conventional convolutional neural network denoisers for ultrasound images.

Yuzhou Sun (Shanghai University, China)
Sen Wang (Shanghai University, China)
Hao Li (Shanghai University, China)
Zhifeng Xie (Shanghai University, China)
Mengtian Li (Shanghai University, China)
Youdong Ding (Shanghai University, China)
Degradation-aware Blind Face Restoration via High-quality VQ Codebook

ABSTRACT. Blind face restoration, as a kind of face restoration method dealing with complex degradation, has been a challenging research hotspot recently. However, due to the influence of a variety of degradation in low-quality images, artifacts commonly exist in the low fidelity results of existing methods, resulting in a lack of natural and realistic texture details. In this paper, we propose a degradation-aware blind face restoration method based on a high-quality vector quantization (VQ) codebook to improve the degradation-aware capability and texture quality. The overall framework consists of Degradation-aware Module (DAM), Texture Refinement Module (TRM) and Global Restoration Module (GRM). DAM adopts the channel attention mechanism to adjust the weight of feature components in different channels, so that it has the ability to perceive complex degradation from redundant information. In TRM, continuous vectors are quantized and replaced with high-quality discretized vectors in the VQ codebook to add texture details. GRM adopts the reverse diffusion process of the pre-trained diffusion model to restore the image globally. Experiments show that our method outperforms state-of-the-art methods on synthetic and real-world datasets.
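The quantization step described for TRM can be illustrated with a minimal sketch of nearest-neighbour codebook lookup. This is not the authors' implementation: the function name `quantize`, the toy two-dimensional codebook, and the use of squared Euclidean distance are assumptions made only for illustration.

```python
def quantize(features, codebook):
    """Replace each continuous feature vector with its nearest codebook entry.

    Assumption: "nearest" means smallest squared Euclidean distance.
    """
    indices = []
    quantized = []
    for f in features:
        # Squared Euclidean distance from this feature to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in codebook]
        k = dists.index(min(dists))
        indices.append(k)
        quantized.append(list(codebook[k]))
    return indices, quantized


# Hypothetical toy codebook with three 2-D entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7]]
indices, quantized = quantize(features, codebook)
```

In a learned VQ model the codebook entries would come from training on high-quality faces; here they are fixed by hand so the lookup itself is easy to follow.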

Naohiko Ishikawa (University of Yamanashi, Japan)
Zhenyang Zhu (University of Yamanashi, Japan)
Jong-nam Kim (Pukyong National University, South Korea)
Wan-Young Chung (Pukyong National University, South Korea)
Kentaro Go (University of Yamanashi, Japan)
Xiaoyang Mao (University of Yamanashi, Japan)
Seamless Image Editing for Perceptual Size Restoration Based on Seam Carving
PRESENTER: Naohiko Ishikawa

ABSTRACT. A thing of interest (ToI) in a photograph may be perceived as smaller than it is in the real scene due to the discrepancy between the imaging principles of the camera and human perception. When existing image resizing approaches are used to enlarge the ToI in the input image, the resulting image may have problems such as loss of distance sense, composition collapse, and failure to preserve salient object shapes. In this study, we propose a ToI resizing method based on seam carving. The proposed method adopts an energy function that takes image composition preservation into consideration. Furthermore, to prevent salient objects from being edited, a state-of-the-art deep learning model for salient object detection (SOD) is adopted in the proposed method. To confirm the performance of the proposed method, a subjective evaluation experiment was conducted. The experimental results show the effectiveness of the proposed method in terms of preserving the perceptual size and perceptual distance of the ToI.
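For readers unfamiliar with seam carving, the following is a minimal sketch of the classic dynamic-programming search for a minimum-energy vertical seam. It does not reproduce the composition-aware energy function or the SOD model from the abstract; the function name `min_vertical_seam` and the toy energy map are assumptions. In a setting like the one described, pixels to be protected could be given a very large energy so that seams avoid them.

```python
def min_vertical_seam(energy):
    """Return, for each row, the column of the minimum-energy vertical seam."""
    rows, cols = len(energy), len(energy[0])
    # cost[r][c] = cheapest total energy of any seam ending at pixel (r, c).
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
            row.append(energy[r][c] + min(prev[lo:hi + 1]))
        cost.append(row)
    # Backtrack from the cheapest pixel in the last row.
    c = cost[-1].index(min(cost[-1]))
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 1, cols - 1)
        window = cost[r - 1][lo:hi + 1]
        c = lo + window.index(min(window))
        seam.append(c)
    seam.reverse()
    return seam


# Hypothetical 3x3 energy map.
energy = [
    [3, 1, 4],
    [2, 9, 1],
    [5, 1, 6],
]
seam = min_vertical_seam(energy)  # [1, 2, 1], total energy 3
```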

Chao Li (School of Computer Science, Inner Mongolia University, China)
Bo Yang (School of Computer Science, Inner Mongolia University, China)
Underwater image enhancement based on the fusion of PUIENet and NAFNet

ABSTRACT. Due to light absorption and scattering in the ocean, underwater images suffer from blur and color bias, and the colors tend to be biased towards blue or green. To enhance underwater images, many underwater image enhancement (UIE) methods have been developed. Probabilistic Network for UIE (PUIENet) is a neural network model that produces good results in processing underwater images. However, it cannot handle underwater images with motion blur, which is caused by camera or object motion. Nonlinear Activation Free Network (NAFNet) is a network model designed to remove image blur by simplifying everything. Inspired by NAFNet, we simplified the convolution, activation function, and channel attention module of PUIENet, resulting in Probabilistic and Nonlinear Activation Hybrid for UIE (PNAH_UIE), which reduced training time by approximately 19% and also reduced loss. In this paper, we propose a deep learning-based method for underwater image enhancement, called Probabilistic and Nonlinear Activation Hybrid Network for UIE (PNAHNet_UIE), which integrates the two most advanced network structures, PNAH_UIE and NAFNet, to improve overall image clarity and remove motion blur. The URPC2022 dataset was used in the experiments, which comes from the "CHINA UNDERWATER ROBOT PROFESSIONAL CONTEST." PNAH_UIE was used to enhance the URPC2022 dataset, and the processed images were checked for motion blur. If the variance of an image was below a certain threshold, the NAFNet network was used to process the image, thus reducing computational pressure.
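The abstract does not say which variance is thresholded. A common blur check is the variance of the image Laplacian, so the sketch below assumes that measure. The function names and the threshold value are hypothetical, and the call to NAFNet itself is left out.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian over the interior of a grayscale image."""
    h, w = len(img), len(img[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def needs_deblurring(img, threshold):
    """Low Laplacian variance suggests few sharp edges, i.e. a possibly blurred image."""
    return laplacian_variance(img) < threshold


# Toy inputs: a high-contrast checkerboard and a completely flat image.
sharp = [[255 if (x + y) % 2 == 0 else 0 for x in range(5)] for y in range(5)]
flat = [[128] * 5 for _ in range(5)]
THRESHOLD = 100.0  # hypothetical value; a real threshold would be tuned on data
```

Gating the second network this way means only images flagged as blurred pay the cost of the extra deblurring pass.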

Wanchang Jiang (Northeast Electric Power University, China)
Dongdong Xue (Northeast Electric Power University, China)
Infrared image enhancement for photovoltaic panels based on improved homomorphic filtering and CLAHE
PRESENTER: Dongdong Xue

ABSTRACT. Photovoltaic panels are extremely vulnerable to thermal imaging camera performance and other external factors such as extreme weather during the imaging process. This will result in low contrast and low illumination of the acquired infrared image, which is not conducive to the subsequent detection of photovoltaic panels. Aiming at this problem, an infrared image enhancement algorithm for photovoltaic panels based on improved homomorphic filtering and CLAHE (Contrast Limited Adaptive Histogram Equalization) is proposed. Firstly, in order to improve the overall brightness and contrast of the infrared image, a homomorphic filtering algorithm based on the improved transfer function is designed. The algorithm constructs a transfer function with a similar structure to the homomorphic filtering profile. Then, the CLAHE algorithm fused with gamma correction is used to further process the image, which overcomes the defects of weak details and uneven brightness of the image after homomorphic filtering enhancement, and improves the clarity and anti-interference of the image. The experimental results show that the comprehensive evaluation index value of the infrared image enhanced by the proposed algorithm is 50% higher than that of the original image. Compared with other algorithms, it has better visual effect, which is helpful to reduce the background interference. In addition, when the enhanced dataset of this algorithm is used for detection, the mAP is up to 97.8%, and the F1-score is 6% higher than that of the original dataset. It indicates that the proposed algorithm can effectively improve the detection accuracy of photovoltaic panels in infrared images.
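To make the contrast-limiting idea behind CLAHE concrete, here is a minimal single-tile sketch: histogram bins are clipped at a limit, the excess is spread uniformly, and the resulting cumulative distribution is used as the intensity mapping. Full CLAHE also interpolates between neighbouring tiles, which is omitted here. The improved homomorphic filter is not shown, and how the authors fuse gamma correction with CLAHE is not stated, so `gamma_correct` is only a generic power-law helper. All names and parameter values are assumptions.

```python
def clipped_equalize(tile, clip_limit, levels=256):
    """Equalize one tile after clipping its histogram at clip_limit."""
    hist = [0] * levels
    for row in tile:
        for p in row:
            hist[p] += 1
    # Clip tall bins and redistribute the excess uniformly over all bins.
    excess = sum(max(h - clip_limit, 0) for h in hist)
    hist = [min(h, clip_limit) + excess / levels for h in hist]
    # Build the cumulative distribution and use it as the intensity mapping.
    total = sum(hist)
    cdf, running = [], 0.0
    for h in hist:
        running += h
        cdf.append(running / total)
    return [[round((levels - 1) * cdf[p]) for p in row] for row in tile]


def gamma_correct(value, gamma, levels=256):
    """Generic power-law (gamma) mapping of a single intensity value."""
    return round((levels - 1) * (value / (levels - 1)) ** gamma)


# Hypothetical low-contrast tile with one bright pixel.
tile = [[10, 10, 10, 12], [10, 10, 11, 200]]
out = clipped_equalize(tile, clip_limit=2)
```

Because the mapping is a cumulative distribution, it never reverses the brightness order of pixels; the clip limit only bounds how steeply contrast can be stretched.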

Xiaonan He (Technological University of Shannon, Ireland)
Yukun Xia (Jiangxi University of Finance and Economics, China)
Yuansong Qiao (Technological University of Shannon, Ireland)
Brian Lee (Technological University of Shannon, Ireland)
Yuhang Ye (Technological University of the Shannon, Ireland)
An Efficient and Lightweight Structure for Spatial-Temporal Feature Extraction in Video Super Resolution

ABSTRACT. Video Super Resolution (VSR) models based on deep convolutional neural networks (CNNs) use multiple Low-Resolution (LR) frames as input and have a strong ability to recover High-Resolution (HR) frames and maintain video temporal information. However, to realize these advantages, VSR must consider both spatial and temporal information to improve the perceived quality of the output video, leading to expensive operations such as cross-frame convolution. Therefore, how to balance output video quality and computational cost is an issue worth studying. To address this problem, we propose an efficient and lightweight multi-scale 3D video super resolution scheme that arranges 3D convolution feature extraction blocks in a U-Net structure to achieve multi-scale feature extraction in both spatial and temporal dimensions. Quantitative and qualitative evaluation results on public video datasets show that, compared to simple cascaded spatial-temporal feature extraction structures, a U-Net structure achieves comparable texture details and temporal consistency with a significant reduction in computation cost and latency.

Jinyao Shen (School of Cyber Science and Engineering, Wuhan University, China)
Huanmei Guan (School of Computer Science, Wuhan University, China)
Shuohan Tao (Selwyn College, Cambridge, UK)
Fu Zhou (School of Computer Science, Wuhan University, China)
Fei Luo (School of Computer Science, Wuhan University, China)
Specular Highlight Detection and Removal Based on Dynamic Association Learning
PRESENTER: Jinyao Shen

ABSTRACT. Specular highlight widely exists in daily life. Its strong brightness influences the recognition of text and graphic patterns in images, especially for documents and cards. In this paper, we propose a coarse-to-fine dynamic association learning method for specular highlight detection and removal. Specifically, based on the dichromatic reflection model, we first use a sub-network to separate the specular highlight layer and locate the regions of the highlight. Instead of directly subtracting the estimated specular highlight component from the raw image to get the highlight removal result, we design an associated learning module (ALM) together with a second-stage sub-network to restore the color distortion of the specular highlight layer removal. Our ALM respectively extracts features from the specular highlight part and non-specular highlight part to improve the color restoration. We conducted extensive evaluation experiments and ablation study on the synthetic dataset and the real-world dataset. Our method achieved 36.09 PSNR and 97% SSIM on SHIQ dataset, along with 28.90 PSNR and 94% SSIM on SD1 dataset, which outperformed the SOTA methods.
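The PSNR figures quoted above are in decibels. As a reminder of how the metric is defined, the sketch below computes PSNR from the mean squared error between two equally sized grayscale images. The function name and the 8-bit peak value of 255 are assumptions; the authors' evaluation code may differ, for example in how colour channels are handled.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    diffs = [(a - b) ** 2
             for ref_row, est_row in zip(reference, estimate)
             for a, b in zip(ref_row, est_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 images: all black versus all white gives the worst case of 0 dB.
black = [[0, 0], [0, 0]]
white = [[255, 255], [255, 255]]
```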

Yongkang Ma (Wuhan Textile University, China)
Li Li (Wuhan Textile University, China)
Hao Chen (Wuhan Textile University, China)
Xian Li (Wuhan Textile University, China)
Junchao Chen (Wuhan Textile University, China)
Ping Zhu (Wuhan Textile University, China)
Tao Peng (Wuhan Textile University, China)
Xiong Pan (Wuhan Textile University, China)
Highlight removal from a single image based on a prior knowledge guided unsupervised CycleGAN
PRESENTER: Yongkang Ma

ABSTRACT. Highlights widely exist on many objects, for example in optical images of high-gloss leather, glass, plastic, metal parts, and other mirror-reflective objects. It is difficult to directly apply optical measurement techniques such as object detection, intrinsic image decomposition, and tracking, which are suited to objects with diffuse reflection characteristics. Although deep CNNs can perform supervised learning of material rendering parameters to automatically remove highlights using a large number of paired specular–diffuse reflection datasets, they struggle with unpaired datasets. In this paper, we propose a specular-to-diffuse-reflection image conversion network based on an improved CycleGAN to automatically remove image highlights. It does not require paired training data, and the experimental results verify the effectiveness of our method. This framework makes two main contributions. On the one hand, we propose a confidence map based on independent average values as the initial value to solve the slow convergence problem of the network, which is due to the lack of a strict mathematical definition for distinguishing specular reflection components from diffuse reflection components. On the other hand, we design a logarithm-based transformation method generator that makes the specular reflection and diffuse reflection components comparable. It can solve the anisotropy problem in the optimization process, which is caused by the fact that the peak specular reflection on the surface of a specular object is much larger than the value of the off-peak diffuse reflection. We also compare our method with the latest methods. The SSIM and PSNR values of our proposed algorithm are significantly improved, and the comparative experimental results show that the proposed algorithm significantly improves the image conversion quality.

11:00-12:30 Session LNCS4-IAP: Image Attention and Perception

Zoom Link:        Meeting ID: 854 6450 7535, Password: cgi2023

Zhenbo Li (China Agricultural University, China)
Shukai Zheng (Guangzhou University, China)
Miao Liu (Guangzhou University, China)
Ligang Zheng (Guangzhou University, China)
Wenbin Chen (Guangzhou University, China)
Facial expression recognition with global multiscale and local attention network
PRESENTER: Shukai Zheng

ABSTRACT. Due to problems such as occlusion and pose variation, facial expression recognition (FER) in the wild is a challenging classification task. This paper proposes a global multiscale and local attention network (GL-VGG) based on the VGG structure, which consists of four modules: a VGG base module, a dropblock module, a global multiscale module, and a local attention module. The base module pre-extracts features; the dropblock module prevents overfitting in the convolutional layers; the global multiscale module learns features with different receptive fields in the global perception domain, which reduces the susceptibility of deeper convolutions to occlusion and pose variation; and the local attention module guides the network to focus on rich local features, which reduces the interference of occlusion on FER in the wild. Experiments on two public wild FER datasets show that our GL-VGG approach outperforms the baseline and other state-of-the-art methods, with 88.33% on RAF-DB and 74.17% on FER2013.

Jia Chen (, China)
Xiyang Li (, China)
Yangjun Ou (, China)
Xinrong Hu (, China)
Tao Peng (, China)
MARANet: Multi-scale Adaptive Region Attention Network for Few-Shot Learning

ABSTRACT. Few-shot learning, which aims to classify unknown categories with fewer label samples, has become a research hotspot in computer vision because of its wide application. Objects will present different regional locations in nature, and the existing few-shot learning only focuses on the overall location information, while ignoring the impact of local key information on classification tasks. To solve this problem, (1) we propose a new multi-scale adaptive region attention network (MARANet), which makes use of the semantic similarity between images to make the model pay more attention to the areas that are beneficial to the classification task. (2) MARANet mainly includes two modules---the multi-scale feature generation module uses low-level features (LR) of different scales to solve the problem of different target scales in nature; the adaptive region metric module selects the LR of key regions by assigning masks to each classification task. We have conducted experiments on four common data sets (i.e. miniImageNet, CUB-200, Stanford Dog, and Stanford Cars). The experimental results show that the new category classification task of MARANet is 1.1%-4.9% higher than the existing methods.

Yan Gui (Changsha University of Science and Technology, China)
Yan Xie (Changsha University of Science and Technology, China)
Lidan Kuang (Changsha University of Science and Technology, China)
Zhihua Chen (East China University of Science and Technology, China)
Jin Zhang (Changsha University of Science and Technology, China)
Enhancing Image Rescaling Using High Frequency Guidance and Attentions in Downscaling and Upscaling Network

ABSTRACT. Recent image rescaling methods adopt invertible bijective transformations to model downscaling and upscaling simultaneously, where the high-frequency information learned in the downscaling process is used to recover the high-resolution image by inversely passing the model. However, less attention has been paid to exploiting the high-frequency information when upscaling. In this paper, an efficient end-to-end learning model for image rescaling, based on a newly designed neural network, is developed. The network consists of a downscaling generation sub-network (DSNet) and a super-resolution sub-network (SRNet), and learns to recover high-frequency signals. Concretely, we introduce dense attention blocks to the DSNet to produce a visually pleasing low-resolution (LR) image and model the distribution of the high-frequency information using a latent variable following a specified distribution. For the SRNet, we adapt an enhanced deep residual network by using residual attention blocks and adding a long skip connection, which transforms the predicted LR image and the random samples of the latent variable back during upscaling. Finally, we define a joint loss and adopt a multi-stage training strategy to optimize the whole network. Experimental results demonstrate the superior performance of our model over existing methods in terms of both quantitative metrics and visual quality.

Maciej Szymkowski (Łukasiewicz - Poznań Institute of Technology, Poland)
Maciej Niemir (Łukasiewicz - Poznań Institute of Technology, Poland)
Convolutional Neural Networks and Vision Transformers in product GS1 GPC brick code recognition

ABSTRACT. Online stores and auctions are commonly used nowadays, and we buy much more on the Internet than in traditional stores. As a result, when searching for products, each of them needs to have a precise category assigned (so that only records of interest to the consumer are found). This is sometimes hard, and users make simple mistakes by assigning wrong categories to the products they sell. In this paper, we propose an approach to the analysis of product images and the assignment of their real categories. The proposed algorithm is based on Convolutional Neural Networks (CNNs). Vision Transformers were also tested and compared with CNNs. Product categories were represented by GS1 GPC brick codes. The maximum accuracy reached around 80%. Based on discussions with e-commerce experts, such precision was judged acceptable, as the differences between real and assigned categories were effectively small (a change in the class, not the segment or family).

Chenxin Qu (Beijing Jiaotong University, China)
Kexin Li (Beijing Jiaotong University, China)
Xiaoping Che (Beijing Jiaotong University, China)
Enyao Chang (Beijing Jiaotong University, China)
Zhongwei Zhang (Beijing Jiaotong University, China)
Multi-source Information Perception and Prediction for Panoramic Videos

ABSTRACT. With the popularization and development of virtual reality technology, panoramic video has gradually become one of the mainstream forms of VR technology in various fields. However, research on the information perception of panoramic video in different media is insufficient, and shortcomings still exist in building information perception and prediction models owing to small samples. This work focuses on users' perception of multi-source information in panoramic videos with different media. We conducted an experiment (N=40) to analyze the differences in users' perception level when viewing panoramic videos using different media (i.e., VR and traditional media). We also studied the correlation between user characteristics and information reception effectiveness. Finally, we use a few-shot learning prediction model to predict the perception effect of multi-source information. The results show that users' perception of multi-source information in VR is better than in traditional media, except for sound information. In addition, multi-source information perception is positively correlated with observational ability, memory, concentration, spatial perception, and frequent computer game playing. The few-shot learning prediction achieves an accuracy of 90.0875% and can accurately predict the user's information perception effect based on their characteristics.

Yiming Li (College of Information and Electrical Engineering, China Agricultural University, China)
Fei Li (College of Information and Electrical Engineering, China Agricultural University, China)
Zhenbo Li (College of Information and Electrical Engineering, China Agricultural University, China)
Multi-Scale Attention Conditional GAN for Underwater Image Enhancement

ABSTRACT. Underwater image enhancement (UIE) has achieved impressive achievements in various marine tasks, such as aquaculture and biological monitoring. However, complex underwater scenarios impede current UIE method application development. Some UIE methods utilize CNN-based models to improve the quality of degradation images, but these methods fail to capture multi-scale high-level features, leading to sub-optimal results. To address these issues, we propose a multi-scale attention conditional generative adversarial network, dubbed Mac-GAN, to recover the degraded underwater images by utilizing an encoder-decoder structure. Concretely, a novel multi-scale conditional GAN architecture is utilized to aggregate the multi-scale features and reconstruct the high-quality underwater images with high perceptual information. Meanwhile, a novel attention module (AMU) is designed to integrate associated features among the channels for the UIE tasks, effectively suppressing non-significant features to improve the extraction effect of multi-scale features. Extensive experiments demonstrate that our proposed model achieves remarkable results in terms of qualitative and quantitative metrics, such as 0.7dB improvement in PSNR metrics and 0.8dB improvement in UIQM metrics. Moreover, Mac-GAN can generate a pleasing visual result without obvious over-enhancement and over-saturation over the comparison of UIE methods. A detailed set of ablation experiments analyzes the core components’ contribution to the proposed approach.

Chenhao Yao (South China University of Technology, China)
Guiqing Li (South China University of Technology, China)
Juncheng Zeng (, China)
Yongwei Nie (South China University of Technology, China)
Chuhua Xian (South China University of Technology, China)
MANet: Multi-level Attention Network for 3D Human Shape and Pose Estimation
PRESENTER: Chenhao Yao

ABSTRACT. Although there has been some progress in 3D human pose and shape estimation, accurately predicting complex human poses is still challenging. To tackle this issue and improve the accuracy of the human mesh reconstruction, we propose an end-to-end framework called Multi-level Attention Network (MANet) that improves reconstruction results. MANet consists of three modules: Intra Part Attention Network (IntraPA-Net), Inter Part Attention Network (InterPA-Net), and Hierarchical Pose Regressor (HPR), which model attention at various levels. IntraPA-Net utilizes pixel attention and aggregates pixel-level features for each body part, InterPA-Net establishes attention between different body parts, and HPR implicitly captures the attention of different joints in a hierarchical structure. Experimental results demonstrate that MANet achieves high accuracy in reconstructing the human mesh and aligning well with images that contain flexible human motion.

Wei Zhao (Qingdao City College, China)
Zhaoyang Xie (Anhui University, China)
Lina Huang (Qingdao City College, China)
LIELFormer: Low-light Image Enhancement with a Lightweight Transformer

ABSTRACT. Images captured under low-light conditions often suffer from (partially) poor visibility. One of the challenges of low-light enhancement, in addition to inadequate lighting, is noise and color distortion due to the limited quality of the cameras. Previous researchers have typically used paired data (low-light and high-definition images) for training to solve single-image enhancement problems. However, those approaches have two disadvantages. First, collecting data in pairs is difficult and wastes time and computational resources. Second, such models tend to generalize poorly and perform poorly across multiple datasets. This paper proposes a simple but accurate single-image enhancement network to solve this problem. Our network consists of a light estimation module and a color correction module. The light estimation module is based on the Retinex principle and uses a CNN to enhance illumination. The color correction module uses a global prediction module (transformer block) to obtain the actual color distribution. This module extracts the image's original colors to make it more realistic. Our network has a simple structure and does not require any paired or unpaired datasets. It allows a single-image enhancement task to be performed using only iterations on the image itself. Our approach outperforms current state-of-the-art methods in qualitative and quantitative experiments. We will release our code after publication.

12:30-13:30 Lunch Break
13:30-15:30 Session LNCS5-Reconstruction: Reconstruction

Zoom Link:        Meeting ID: 850 7050 5056, Password: cgi2023

Zizhao Wu (Hangzhou Dianzi University, China)
Ali Fakih (Université de Haute Alsace, France)
Nicola Wilser (Université de Haute Alsace, France)
Yvan Maillot (Université de Haute Alsace, France)
Frederic Cordier (Université de Haute Alsace, France)
Single-view 3D reconstruction of curves

ABSTRACT. This paper describes a method to generate a 3D curve from a planar polygonal curve. One application of such a method is the modeling of trajectories of moving objects in 3D using sketches. Given a planar polygonal curve C_2D, our algorithm computes a 3D curve C_3D such that its orthogonal projection matches the input curve. The algorithm aims at minimizing the variation of the curvature along the reconstructed curve. The driving idea is to fit a set of ellipses to the input curve; these ellipses enable us to determine the osculating circles and thus the tangent at every point of the curve to reconstruct in 3D. The reconstruction of the 3D curve using these tangents is then straightforward. The method is demonstrated with several examples.
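The final step, recovering a 3D curve once a tangent is known along the input curve, can be sketched as a simple integration of depth along the polyline. The ellipse fitting that produces the tangents is not reproduced here. The sketch assumes each tangent has already been reduced to a depth slope (change in z per unit of 2D arc length) for each segment; the names `lift_curve`, `project`, and `slopes` are hypothetical.

```python
import math


def lift_curve(points_2d, slopes, z0=0.0):
    """Lift a 2D polyline to 3D by integrating a depth slope along each segment.

    slopes[i] is the assumed change in z per unit of 2D length on segment i.
    """
    z = z0
    curve = [(points_2d[0][0], points_2d[0][1], z)]
    for (x0, y0), (x1, y1), s in zip(points_2d, points_2d[1:], slopes):
        z += s * math.hypot(x1 - x0, y1 - y0)
        curve.append((x1, y1, z))
    return curve


def project(curve_3d):
    """Orthogonal projection onto the xy-plane: drop the z coordinate."""
    return [(x, y) for x, y, _ in curve_3d]


# Hypothetical input polyline and per-segment depth slopes.
c_2d = [(0, 0), (3, 4), (3, 10)]
c_3d = lift_curve(c_2d, slopes=[1.0, -0.5])
```

Because x and y are copied unchanged, the orthogonal projection of the lifted curve matches the input polyline by construction, which is the constraint stated in the abstract.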

Le Ma (Institute of Automation, Chinese Academy of Sciences, China)
Zhihao Ma (Institute of Automation, Chinese Academy of Sciences, China)
Weiliang Meng (MAIS, CAS Institute of Automation, China)
Shibiao Xu (Beijing University of Post and Telecommunication, China)
Xiaopeng Zhang (Institute of Automation, Chinese Academy of Sciences, China)
Audio-Driven Lips and Expression on 3D Human Face

ABSTRACT. Extensive research has been conducted on audio-driven 3D facial animation, with many attempts to achieve human-like performance. However, creating truly realistic and expressive 3D facial animation remains a challenging task, and existing methods often fall short in capturing the nuances of anthropomorphic expressions. We propose Audio-Driven Lips and Expression (ADLE), which is designed to generate highly expressive and lifelike conversations between individuals, complete with important social signals like laughter and excitement, based solely on audio cues. At the core of our approach is an audio-expression-consistency strategy, which disentangles person-specific lips from dependent expressions. This allows our method to robustly learn lip movements and generic expression parameters on a 3D human face from an audio sequence. As a result, ADLE is a multimodal fusion approach that can automatically generate accurate lip movements accompanied by vivid facial expressions on a 3D face in real time. Our experiments demonstrate that ADLE outperforms other state-of-the-art works in this field, making it a promising approach for a wide range of applications.

Xiaoyu Chai (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China)
Jun Chen (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China)
Dongshu Xu (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China)
Hongdou Yao (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China)
Zheng Wang (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China)
Chia-Wen Lin (Department of Electrical Engineering, National Tsinghua University, Hsinchu, 30013, Taiwan)
Multi-Image 3D Face Reconstruction via An Adaptive Aggregation Network
PRESENTER: Xiaoyu Chai

ABSTRACT. Image-based 3D face reconstruction suffers from inherent drawbacks of incomplete visible regions and interference from occlusion or lighting. One solution is to utilize multiple face images to collect sufficient knowledge. Nevertheless, most existing methods do not make full use of the information among different images, since they roughly fuse the results of individually reconstructed faces for multi-image 3D face modeling and thus may ignore the intrinsic relations among the images. To tackle this problem, we propose a framework named Adaptive Aggregation Network (ADANet) to investigate the subtle correlations among multiple images for 3D face reconstruction. Specifically, we devise an Aggregation Module that adaptively establishes both in-face and cross-face relationships by exploiting the local and long-range dependencies among visible facial regions of multiple images, and thus can effectively extract complementary aggregated face features in the multi-image scenario. Furthermore, we incorporate contour-aware information to promote the boundary consistency of the 3D face model. The combination of these designs forms a more accurate and robust multi-image 3D face reconstruction scheme. Extensive experiments have demonstrated the superiority of the proposed network over other state-of-the-art models.

Guiqing Li (South China University of Technology, China)
Chenhao Yao (South China University of Technology, China)
Huiqian Zhang (South China University of Technology, China)
Juncheng Zeng (South China University of Technology, China)
Yongwei Nie (South China University of Technology, China)
Chuhua Xian (South China University of Technology, China)
METRO-X: Combining Vertex and Parameter Regressions for Recovering 3D Human Meshes with Full Motions
PRESENTER: Chenhao Yao

ABSTRACT. It is well known that regressing the parametric representation of a human body from a single image suffers from low accuracy due to sparse information use and error accumulation. Although directly regressing vertices achieves higher accuracy by avoiding these issues, it may produce vertex outliers and can only handle body meshes with a very limited number of vertices. We present METRO-X, a novel method for reconstructing full-body human meshes with body pose, facial expression and hand gesture from a single image. It combines the advantages of the two approaches, achieving higher accuracy than parameter regression while supporting denser vertices and generating smoother shapes than vertex regression. It first detects and extracts the hands, head and whole body from a given image, then regresses the vertices of the three parts separately using METRO, and finally fits SMPL-X to the reconstructed meshes to obtain the complete parametric representation of the human body, facial expression and hand gesture. Experimental results show that METRO-X outperforms the ExPose method, with a significant 23% improvement in body accuracy and a 35% improvement in gesture accuracy. These results demonstrate the potential of our approach in enabling various applications.

Ritesh Sharma (University of California Merced, United States)
Eric Bier (Palo Alto Research Center, United States)
Lester Nelson (Palo Alto Research Center, United States)
Mahabir Bhandari (Oak Ridge National Laboratory, United States)
Niraj Kunwar (Oak Ridge National Laboratory, United States)
Automatic Digitization and Orientation of Scanned Mesh Data for Floor Plan and 3D Model Generation
PRESENTER: Ritesh Sharma

ABSTRACT. This paper proposes a novel approach for automatically generating accurate floor plans and 3D models of building interiors using scanned mesh data. Unlike previous methods, which begin with a high resolution point cloud from a laser range-finder, our approach begins with triangle mesh data, such as from a Microsoft HoloLens headset. The approach includes generating two types of floor plans, a "pen-and-ink" style that preserves details and a drafting-style that reduces clutter, and processing the 3D model for use in applications by aligning it with coordinate axes, annotating important objects, dividing it into stories, and removing the ceiling. The performance of each step is analyzed on commercial and residential buildings, and experiments are conducted to evaluate the appearance of results when different amounts of transparency and numbers of mesh slices are used. Our approach has applications in navigation, interior design, furniture placement, facilities management, building construction, and heating, ventilation, and air conditioning (HVAC) design. In general, our approach appears to be promising for automatic digitization and orientation of scanned mesh data for floor plan and 3D model generation.

Xiaoyu Chai (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China, China)
Jun Chen (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China, China)
Dongshu Xu (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China, China)
Hongdou Yao (National Engineering Research Center for Multimedia Software, School of Computer, Wuhan University, Wuhan, China, China)
An Adaptive-Guidance GAN for Accurate Face Reenactment
PRESENTER: Xiaoyu Chai

ABSTRACT. Face reenactment has been widely used in face editing, augmentation and animation. However, it is still challenging to generate a photo-realistic target face that has the same pose and expression as the reference face while retaining the identity of the source face. To achieve this goal, we propose an Adaptive-Guidance Generative Adversarial Network (AD-GAN) for accurate face reenactment. Previous methods control GANs by directly employing either a simple set of vectors or sparse representations (e.g., facial landmarks or boundaries); they ignore the correspondence between reference and source faces, which leads to inaccurate reenactment or artifacts on target faces. In contrast, we devise a Correlation Module (CM) that adaptively establishes dense correspondence between a 3D face model, used as the condition, and the latent features from the source, to formulate an indicator map for explicit control of target faces. In addition, the Texture Module (TM) and Guiding Blocks (GB) in the generator restore the facial appearance distorted by expression or pose changes and progressively guide the generation process. Extensive experiments demonstrate the superiority of our AD-GAN in generating photo-realistic and accurately controllable images.

Wiem Grina (ENIM, Tunisia)
Ali Douik (ENISO, Tunisia)
Reconstructing Neutral Face Expressions with Disentangled Variational Autoencoder

ABSTRACT. In this study, we address the challenge of unsupervised learning of disentangled representations in datasets with independent factors of variation. We propose a new approach inspired by Factor-VAE and $\beta$-VAE that integrates the Ranger optimizer with dropout layers. It encourages the distribution of representations to be factorial, ensuring independence between dimensions and leading to faster convergence. Our method outperforms Factor-VAE by finding a better balance between disentanglement and reconstruction quality, and by optimizing the model parameters more effectively through adaptation of the learning rate, which improves convergence and generalization during learning.

Ciliang Sun (Ningbo University, China)
Yuqi Li (Ningbo University, China)
Jiabao Li (Ningbo University, China)
Chong Wang (Ningbo University, China)
Xinmiao Dai (Ningbo University, China)
CaSE-NeRF: Camera Settings Editing of Neural Radiance Fields
PRESENTER: Ciliang Sun

ABSTRACT. Neural Radiance Fields (NeRF) have shown excellent quality in three-dimensional (3D) reconstruction by synthesizing novel views from multi-view images. However, previous NeRF-based methods do not allow users to edit camera settings in the scene. While existing works have proposed methods to modify the radiance field, these modifications are limited to camera settings within the training set. Hence, we present Camera Settings Editing of Neural Radiance Fields (CaSE-NeRF) to recover a radiance field from a set of views with different camera settings. In our approach, we allow users to perform controlled camera settings editing on the scene and synthesize the novel view images of the edited scene without re-training the network. The key to our method lies in modeling each camera parameter separately and rendering various 3D defocus effects based on thin lens imaging principles. By following the image processing of real cameras, we implicitly model it and learn gains that are continuous in the latent space and independent of the image. The control of color temperature and exposure is plug-and-play, and can be easily integrated into NeRF-based frameworks. As a result, our method allows for manual and free post-capture control of the viewpoint and camera settings of 3D scenes. Through our extensive experiments on two real-scene datasets, we have demonstrated the success of our approach in reconstructing a normal NeRF with consistent 3D geometry and appearance. Our related code and data are available at

Yongwei Miao (Zhejiang Sci-Tech University / Hangzhou Normal University, China)
Haipeng Wang (Zhejiang Sci-Tech University, China)
Ran Fan (Hangzhou Normal University, China)
Fuchang Liu (Hangzhou Normal University, China)
A Submodular-based Autonomous Exploration for Multi-Room Indoor Scenes Reconstruction
PRESENTER: Haipeng Wang

ABSTRACT. Autonomously exploring and densely recovering an unknown indoor scene is a nontrivial task in 3D scene reconstruction. It is especially difficult for scenes composed of compact, complicated, interconnected rooms with no priors. To address this issue, we propose a novel approach to autonomously scan and reconstruct multi-room scenes without any prior knowledge. Specifically, the proposed method introduces submodular-based planning to efficiently guide the active scanning by “Next-Best-View” until marginal gains diminish. The submodular-based planning gives an approximately optimal solution to the “Next-Best-View” problem, which is NP-hard when no prior knowledge is available. Experiments show that our method can significantly improve scanning efficiency for multi-room scenes while maintaining reconstruction errors.
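
For illustration only, planning "until marginal gains diminish" can be sketched as greedy maximization of a coverage function, a standard example of a monotone submodular objective. The abstract does not give the authors' objective or data structures; the view identifiers, the cell sets, and the `gain_threshold` parameter below are hypothetical.

```python
def greedy_next_best_views(candidates, gain_threshold=1):
    """Greedily pick views by marginal coverage gain.

    candidates: dict mapping a (hypothetical) view id to the set of
    surface cells assumed to be visible from that view.
    Stops when the best remaining view adds fewer than gain_threshold cells.
    """
    covered = set()
    plan = []
    remaining = dict(candidates)
    while remaining:
        # Marginal gain of a view = number of cells it adds to the covered set.
        best_view = max(remaining, key=lambda v: len(remaining[v] - covered))
        gain = len(remaining[best_view] - covered)
        if gain < gain_threshold:
            break  # marginal gains have diminished
        plan.append(best_view)
        covered |= remaining.pop(best_view)
    return plan, covered


views = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}
plan, covered = greedy_next_best_views(views)
print(plan, sorted(covered))
```

In this toy input, view "c" is never scanned because everything it sees is already covered after "a" and "b".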

Jin Chen (School of Computer Science, Wuhan University, Hubei Province, China, China)
Jun Chen (School of Computer Science, Wuhan University, Hubei Province, China, China)
Xiaofen Wang (School of Computer Science, Wuhan University, Hubei Province, China, China)
Dongshu Xu (School of Computer Science, Wuhan University, Hubei Province, China, China)
Chao Liang (School of Computer Science, Wuhan University, Hubei Province, China, China)
Zhen Han (School of Computer Science, Wuhan University, Hubei Province, China, China)
Learning Degradation for Real-World Face Super-Resolution

ABSTRACT. Acquiring degraded faces with corresponding high-resolution (HR) faces is critical for real-world face super-resolution (SR) applications. To generate realistic low-resolution (LR) faces with degradation similar to that in real-world scenarios, most approaches learn a deterministic mapping from HR faces to LR faces. However, these deterministic models fail to model the various degradation of real-world LR faces, which limits the performance of the following face SR models. In this work, we learn a degradation model based on conditional generative adversarial networks (cGANs). Specifically, we propose a simple and effective weight-aware content loss that adaptively assigns different content losses to LR faces generated from the same HR face under different noise vector inputs. It significantly improves the diversity of the generated LR faces while having similar degradation to real-world LR faces. Compared with previous degradation models, the proposed degradation model can generate HR-LR pairs, which can better cover various degradation cases of real-world LR faces and further improve the performance of face SR models in real-world applications. Experiments on four datasets demonstrate that the proposed degradation model can help the face SR model achieve better performance in both quantitative and qualitative results.

13:30-15:30 Session LNCS6-Rendering: Rendering and Animation

Zoom Link:        Meeting ID: 854 6450 7535, Password: cgi2023

Dawar Khan (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, China)
Hairong Gu (Chang'an University, China)
Jiale Wang (Chang'an University, China)
Yanhui Hu (Chang'an University, China)
Jixiang Wang (Chang'an University, China)
Lishun Sun (Chang'an University, China)
Mostak Ahamed (Chang'an University, Bangladesh)
Visualization of Irregular Tree Canopy Centerline Data from a Depth Camera Based on An Optimized Spatial Straight-Line Fitting

ABSTRACT. We propose a novel method for visualizing the canopy centerline of a single tree based on a depth camera. Initially, the depth camera captures the image of the target tree to obtain 3D point cloud data. To improve the accuracy of the model reconstruction and enhance the smoothness of the model, the point cloud data is filtered and denoised. Next, we employ the Poisson surface reconstruction method to reconstruct the 3D space surface of the point cloud data, which can accurately restore the real scene. Additionally, we fit the 3D point cloud to a circle using Random Sample Consensus (RANSAC) and Least Square Circles (LSC) in MATLAB software and propose a new spatial straight-line fitting method to visualize the canopy centerline. This method has the advantage of no error in the Z coordinates of spatial scatter points, and the fitted straight line is perpendicular to the xoy plane. Furthermore, compared to the traditional spatial straight-line fitting method, the new method yields a smaller root mean square error (RMSE). This method can be effectively applied in practical applications such as tree canopy pruning, providing precise position information for the positioning of tools during the pruning process. It can ultimately reduce pruning time and improve the accuracy of the process.
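
As a rough illustration of a fitted line that is perpendicular to the xoy plane, the sketch below assumes the input is a list of circle centers, one per height slice, and fits the vertical line x = x0, y = y0 that minimizes the squared horizontal distances. This is a generic least-squares construction, not the authors' method, whose details the abstract does not give; the function name and input layout are assumptions.

```python
import math


def fit_vertical_line(centers):
    """Fit a line perpendicular to the xoy plane to 3D points.

    centers: list of (x, y, z) tuples, e.g. hypothetical per-slice circle centers.
    Returns (x0, y0, rmse), where the line is {(x0, y0, z) for all z} and
    rmse is the root mean square horizontal distance of the points to it.
    """
    n = len(centers)
    # The least-squares vertical line passes through the mean of x and y.
    x0 = sum(c[0] for c in centers) / n
    y0 = sum(c[1] for c in centers) / n
    # Z plays no role in the residual because the line spans every z value.
    squared = [(c[0] - x0) ** 2 + (c[1] - y0) ** 2 for c in centers]
    rmse = math.sqrt(sum(squared) / n)
    return x0, y0, rmse


x0, y0, rmse = fit_vertical_line([(1.0, 2.0, 0.0), (3.0, 2.0, 1.0), (2.0, 2.0, 2.0)])
print(x0, y0, rmse)
```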

Dan Mei (College of Computer and Information Science, Southwest University, China)
Xiaogang Wang (College of Computer and Information Science, Southwest University, China)
Finernet: A coarse-to-fine approach to learning high-quality implicit surface reconstruction

ABSTRACT. Recent studies have shown that implicit neural representation can be effectively applied to geometric surface reconstruction. Existing methods have achieved impressive results. However, they often struggle to recover geometric details, or require normal vectors as supervisory information for surface points, which is often unavailable in actual scanned data. In this paper, we propose a coarse-to-fine approach that enhances the geometric details of the reconstructed results without relying on normal vectors as supervision and is able to fill holes caused by missing scanned data. In the coarse stage, a local spatial normal consistency term is presented to estimate a stable but coarse implicit neural representation. In the fine stage, a local fitting penalty is proposed to locally modify the reconstruction results obtained in the previous stage to better fit the original input data and recover more geometric details. Experimental results on three widely used datasets (ShapeNet, SRB and ABC) indicate that our method is very competitive with current state-of-the-art methods, especially in restoring geometric details.

Anning Huang (Tongji University, China)
Zhicheng Liu (Tongji University, China)
Qian Zhang (Tongji University, China)
Feng Tian (Duke Kunshan University, China)
Jinyuan Jia (Tongji University, China)
Fine-grained Web3D Culling-Transmitting-Rendering Pipeline
PRESENTER: Anning Huang

ABSTRACT. Web3D has gradually become the mainstream online 3D technology supporting the Metaverse. However, massively multiplayer online Web3D still faces challenges such as slow culling of the potentially visible set (PVS) at servers, network congestion, and sluggish online rendering in web browsers. To address these challenges, we propose a novel Web3D pipeline that coordinates PVS culling, network transmission, and Web3D rendering in a fine-grained way. The pipeline integrates three key steps: establishment of a granularity-aware voxelization scene graph, fine-grained PVS culling and transmission scheduling, and incremental and instanced rendering. Our experiments on a massive 3D plant demonstrate that the proposed pipeline outperforms existing Web3D approaches in terms of transmission and rendering.

Xu Lu (Huazhong University of Science and Technology, China)
Shuo Xiong (Huazhong University of Science and Technology, China)
Tao Wu (Huazhong University of Science and Technology, China)
Ke Zhang (Huazhong University of Science and Technology, China)
Yachang Wang (Tencent computer system Co.Ltd., China)
Qilong Kou (Tencent computer system Co.Ltd., China)
Yue Zhang (Huazhong University of Science and Technology, China)
The Chemical Engine Algorithm and Realization based on Unreal Engine-4

ABSTRACT. The Chemical Engine is a new concept introduced by Nintendo as a counterpart to the traditional physics engine in game development. However, Nintendo has not released any details of the Chemical Engine, and it has blurred the distinction between "chemical" and "physical". This paper therefore clarifies the concepts of physics engine and chemical engine in game development and, based on these definitions, proposes two chemical engine algorithms. The first, the "elemental energy" algorithm, is based on Nintendo's philosophy and optimized for future scalability; it can be widely used in general game scenarios. The second, the "factorization and properties" algorithm, is more in line with the academic definition of chemistry; it can realistically render chemical reactions, but it is more difficult to realize and too costly to use in game development. Therefore, this paper provides a specific implementation in Unreal Engine 4 based on the "elemental energy" algorithm. Analysis of the implementation and experiments shows that the cost and method of the "elemental energy" algorithm are reasonable. The scheme is therefore more practical in this scenario, and it could be widely used in commercial game development.

Geonu Noh (Gwangju Institute of Science and Technology, South Korea)
Hajin Choi (Gwangju Institute of Science and Technology, South Korea)
Bochang Moon (Gwangju Institute of Science and Technology, South Korea)
Enhanced Direct Lighting Using Visibility-Aware Light Sampling

ABSTRACT. Next event estimation has been widely applied to Monte Carlo rendering methods such as path tracing since estimating direct and indirect lighting separately often enables finding light paths from the eye to the lights effectively. Its success heavily relies on light sampling for direct lighting when a scene contains multiple light sources since each light can contribute differently to the reflected radiance on a surface point. We present a light sampling technique that can guide such a light selection to improve direct lighting. We estimate a spatially-varying function that approximates the contribution of each light on surface points within a discretized local area (i.e., a voxel in an adaptive octree) while considering the visibility between lights and surface points. We then construct a probability distribution function for sampling lights per voxel, which is proportional to our estimated function. We demonstrate that our light sampling technique can significantly improve rendering quality thanks to improved direct lighting with our light sampling.
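
A minimal sketch of sampling a light with probability proportional to an estimated contribution, assuming the per-voxel estimates are already available as a list of non-negative numbers. How the estimates are computed, the visibility term, and the adaptive octree are not shown, and the function names are hypothetical rather than taken from the paper.

```python
import bisect


def build_light_table(contributions):
    """Build a discrete pdf and cdf over lights for one voxel.

    contributions: assumed non-negative estimates, one per light.
    Falls back to uniform selection when every estimate is zero.
    """
    total = sum(contributions)
    n = len(contributions)
    if total <= 0:
        pdf = [1.0 / n] * n
    else:
        pdf = [c / total for c in contributions]
    cdf = []
    running = 0.0
    for p in pdf:
        running += p
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating-point drift
    return pdf, cdf


def sample_light(pdf, cdf, u):
    """Pick a light index from a uniform random number u in [0, 1).

    Returns (index, probability); a Monte Carlo estimator would divide the
    sampled light's contribution by this probability.
    """
    index = min(bisect.bisect_right(cdf, u), len(cdf) - 1)
    return index, pdf[index]


pdf, cdf = build_light_table([1.0, 3.0, 0.0])
print(sample_light(pdf, cdf, 0.1), sample_light(pdf, cdf, 0.9))
```

A light whose estimate is zero is never selected here, so in practice a small floor on each estimate may be needed to keep the estimator unbiased when the estimate is wrong.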

Dongmei Ma (School of Informatics, Xiamen University, China)
Juan Cao (School of Mathematical Sciences, Xiamen University, China)
Zhonggui Chen (School of Informatics, Xiamen University, China)
Point Cloud Rendering via Multi-plane NeRF

ABSTRACT. We propose a new neural point cloud rendering method by combining point cloud multi-plane projection and NeRF. Existing point-based rendering methods often rely on the high-quality geometry of point clouds. Meanwhile, NeRF and its extensions usually query the RGB and volume density of each point on the ray through neural networks, thus leading to a low inference efficiency. In this paper, we assign a feature vector to each point and project them to multiple random depth planes. The multi-plane feature maps are fed into a 3D convolutional neural network to predict the RGB and volume density map of each feature plane, then we synthesize a novel view through volume rendering. On the one hand, projecting point features to multiple planes reduces the impact of geometry noise, and on the other hand, directly using multiple planes for rendering avoids sampling points on rays, thereby improving the rendering efficiency. The introduction of volume rendering enables our approach to synthesize high-quality images even when point clouds are relatively sparse. Experimental results on the DTU dataset and ScanNet dataset show that our approach achieves state-of-the-art results.
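
The final volume rendering step can be illustrated for a single pixel with the standard front-to-back compositing rule used in NeRF-style rendering. The sketch assumes the per-plane RGB and density values have already been predicted and that the planes are evenly spaced along the ray; the abstract mentions random depth planes, so the constant `spacing` is a simplifying assumption, and the function name is hypothetical.

```python
import math


def composite_planes(colors, densities, spacing):
    """Composite per-plane colors front to back for one pixel.

    colors: list of (r, g, b) tuples ordered from nearest to farthest plane.
    densities: list of non-negative volume densities, one per plane.
    spacing: assumed constant distance between neighbouring planes.
    Returns the composited color and the remaining transmittance.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for color, sigma in zip(colors, densities):
        alpha = 1.0 - math.exp(-sigma * spacing)  # opacity of this plane
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha  # light that continues past this plane
    return tuple(pixel), transmittance


print(composite_planes([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 2.0], 1.0))
```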

Shuzhan Yang (none, China)
Han Su (none, China)
Fast Geometric Sampling for Phong-like Reflection
PRESENTER: Shuzhan Yang

ABSTRACT. Importance sampling is a critical technique for reducing the variance of Monte Carlo samples. However, the classical importance sampling based on the Bidirectional Reflectance Distribution Function (BRDF) is often complex and challenging to implement. In this work, we present a simple yet efficient sampling method inspired by Phong's reflectance model. Our method generates samples of rays using geometric vector operations, replacing the need for BRDF. We explain our implementation of this method on WebGL and demonstrate how we obtain per-pixel random numbers in GLSL. We also conduct experiments to compare our method's speed and patterns to the Phong distribution. The results show that our sampling process can simulate reflections similar to Phong, but is about three times faster than traditional Phong or other BRDF importance sampling methods. Our sampling method is applicable to both real-time and offline rendering, making it a useful tool for computer graphics applications.
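
For context, the classical Phong-lobe importance sampling that serves as the comparison baseline can be sketched as follows. This is the textbook formulation, not the geometric method proposed in the paper, which the abstract does not detail. The direction is expressed in a local frame whose z axis is assumed to be the mirror reflection direction.

```python
import math


def sample_phong_lobe(u1, u2, exponent):
    """Sample a direction with density proportional to cos(theta)**exponent.

    u1, u2: uniform random numbers in [0, 1].
    Returns a unit vector (x, y, z) in a local frame whose z axis is the
    mirror reflection direction (an assumption of this sketch).
    """
    cos_theta = u1 ** (1.0 / (exponent + 1.0))
    sin_theta = math.sqrt(max(0.0, 1.0 - cos_theta * cos_theta))
    phi = 2.0 * math.pi * u2
    return (sin_theta * math.cos(phi), sin_theta * math.sin(phi), cos_theta)


def phong_lobe_pdf(cos_theta, exponent):
    """Solid-angle density of the sampled direction."""
    return (exponent + 1.0) / (2.0 * math.pi) * cos_theta ** exponent


direction = sample_phong_lobe(0.5, 0.25, 10.0)
print(direction, phong_lobe_pdf(direction[2], 10.0))
```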

Haitang Zhang (Shenzhen Technology University, China)
Junchao Ma (Shenzhen Technology University, China)
Zixia Qiu (Shenzhen Technology University, China)
Junmei Yao (Shenzhen University, China)
Mustafa A. Al Sibahee (Shenzhen University, Iraq)
Zaid Ameen Abduljabbar (University of Basrah, Iraq)
Vincent Omollo Nyangaresi (Tom Mboya University, Kenya)
Multi-GPU Parallel Pipeline Rendering with Splitting Frame
PRESENTER: Haitang Zhang

ABSTRACT. Ray tracing is a rendering technique that simulates real-world lighting effects in computers, and it can provide an excellent visual experience. Using ray tracing in real-time rendering requires extremely large graphics computing resources, and the computing power of a single graphics processing unit (GPU) is often insufficient in complex scenes. In this paper, we propose a multi-GPU parallel pipeline rendering approach that makes full use of the computing power of multiple GPUs to accelerate real-time ray tracing effectively. This approach enables heterogeneous GPUs to render the same frame cooperatively through a dynamic splitting frame load balancing scheme, and ensures that each GPU is assigned a suitably sized portion of the frame based on its rendering ability. A fine-grained parallel pipeline method divides the rendering process into more detailed steps that enable multiple frames to be rendered in parallel, which improves the utilization of each step and speeds up the output of frames. In experiments on various dynamic scenes, the number of frames per second (FPS) of a multi-GPU system composed of two GPUs using the parallel pipeline rendering approach is 2.2 times higher than that of the single-GPU system, rising to 3.3 times with three GPUs.
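
One simple way to size each GPU's portion of the frame from its rendering ability is to split the frame's rows in proportion to measured throughput. The sketch below is an assumed illustration of that idea, not the paper's scheme: it takes the rows and render times of the previous frame as hypothetical inputs and requires every GPU to have rendered at least one row in a positive amount of time.

```python
def rebalance_rows(frame_height, rows, render_times):
    """Reassign frame rows to GPUs in proportion to measured throughput.

    frame_height: total number of pixel rows in the frame.
    rows: rows each GPU rendered in the previous frame (assumed > 0).
    render_times: time each GPU took for those rows (assumed > 0).
    Returns a list of row counts that sums to frame_height.
    """
    throughput = [r / t for r, t in zip(rows, render_times)]
    total = sum(throughput)
    shares = [frame_height * tp / total for tp in throughput]
    new_rows = [int(s) for s in shares]
    # Hand out rows lost to rounding, largest fractional remainder first.
    leftover = frame_height - sum(new_rows)
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - new_rows[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        new_rows[i] += 1
    return new_rows


# A GPU that rendered its half twice as fast receives twice as many rows.
print(rebalance_rows(1080, [540, 540], [10.0, 20.0]))
```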

Dawar Khan (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, China)
Sheng Gui (LSEC, NCMIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, China)
Zhanglin Cheng (Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, China)
Molecular Surface Mesh Smoothing with Subdivision

ABSTRACT. Several popular techniques exist for smoothing with subdivision. However, these techniques have several limitations, including mesh deformation, disregard for mesh quality, and increasing mesh complexity. Molecular meshes, which have a distinct surface with abrupt changes from concave to convex and vice versa, are even more challenging for these techniques. In this paper, we formulate a smoothing algorithm for molecular surface meshes. This algorithm integrates the advantages of three well-known algorithms, Catmull-Clark, Loop and Centroidal Voronoi tessellation (CVT), with an error control module. CVT is used for pre-processing, and the remaining two for smoothing. We find new vertices as in Catmull-Clark and connect them as in Loop. Unlike Catmull-Clark, which generates a quad mesh, we establish new connections that make the mesh triangular. We control the geometric loss by translating the vertices back toward the input mesh. We compared the results with previous methods and tested the algorithm with different numerical analysis and modeling applications. The results show significant improvement and are consistently robust for downstream applications.

Jianping Su (University of Electronic Science and Technology of China, China)
Ning Xie (University of Electronic Science and Technology of China, China)
Xin Lou (University of Electronic Science and Technology of China, China)
Photorealistic Aquatic Plants Rendering with Cellular Structure

ABSTRACT. This paper presents a realistic real-time rendering method for aquatic plants that considers their unique optical characteristics. While many rendering models have been proposed for real-time rendering of plant leaves, the rendering of aquatic plants is often inaccurate due to reliance on the botanical parameters and optical statistical characteristics of terrestrial plants. To address this issue, we combine existing rendering methods for terrestrial plants with the optical properties of aquatic plants. Through a qualitative analysis of the differences in optical properties between the cell structures of aquatic and terrestrial plants, we propose a rendering method for aquatic plants. The experimental results show that our method is more effective in expressing the appearance of aquatic plants than general-purpose physically-based shading models and advanced plant leaf rendering models. We also demonstrate our method in virtual reality, providing a solution for the construction of virtual reality underwater environments. Our contributions include a realistic real-time rendering model for aquatic plants, a real-time underwater roaming platform, and experimental evidence demonstrating our method's effectiveness in expressing the appearance of aquatic plants while balancing model efficiency and accuracy.

15:30-16:00 Coffee Break
16:00-18:00 Session LNCS7-Colors: Colors, Painting and Layout

Zoom Link:        Meeting ID: 850 7050 5056, Password: cgi2023

Ping Li (The Hong Kong Polytechnic University, Hong Kong)
Bing Yu (Shanghai University, China)
Wangyidai Lv (Shanghai University, China)
Dongjin Huang (Shanghai University, China)
Youdong Ding (Shanghai University, China)
Staged Transformer Network with Color Harmonization for Image Outpainting
PRESENTER: Wangyidai Lv

ABSTRACT. Image outpainting aims at generating new, realistic-looking content beyond the original boundaries of a given image patch. Existing image outpainting methods tend to generate images with erroneous structures and unnatural colors when extrapolating the sub-image on all sides. To solve this problem, we propose a Transformer-based staged image outpainting network. Specifically, we restructure the encoder-decoder architecture by adding hierarchical cross attention to the connection in each layer. We propose a staged expanding module that splits the extrapolation into vertical and horizontal steps so that the generated images can have consistent contextual information and similar texture. A color harmonization module that adjusts both local and global color information is also presented to make color transitions more natural. Our experiments show that the proposed method outperforms advanced methods on multiple datasets.

Keyue Fan (Tianjin University, China)
Shiguang Liu (Tianjin University, China)
SemiRefiner: Learning to refine Semi-Realistic paintings

ABSTRACT. Previous image optimization methods cannot automatically refine semi-realistic paintings. To improve on the efficiency of manual refinement, we propose an automatic refinement method for semi-realistic figure paintings guided by the line art. To enable the framework to adjust the draft color in the refinement process like a real painter, we design a color correction module, which automatically fixes inappropriate colors in the draft. To reduce artifacts and generate high-quality results, we use the line art to guide the refinement. We further devise a line art optimization module in the framework to ensure the generation of high-quality results by improving the quality of the line art. The experimental results and user surveys demonstrate the effectiveness of the proposed method.

Zhanyi Huang (Wuhan Textile University, China)
Wenqing Zhao (Wuhan Textile University, China)
Tangsheng Guo (Wuhan Textile University, China)
Jin Huang (Wuhan Textile University, China)
Ping Li (Hong Kong Polytechnic University, Hong Kong)
Bin Sheng (Shanghai Jiao Tong University, China)
MagicMirror: A 3-D Real-time Virtual Try-On System through Cloth Simulation
PRESENTER: Zhanyi Huang

ABSTRACT. With the growth of online shopping, clothing e-commerce holds large untapped potential, which has driven the application of emerging technologies in this field. However, online shoppers cannot intuitively feel the material of the fabric or see the dynamic effect of trying on clothes. Methodologies based on cloth simulation and human-computer interaction can be used to address this challenge. In this paper, we propose a virtual try-on system based on cloth simulation to improve the realism of the cloth. It applies physical laws to the garment to strengthen the realism of virtual try-on, and integrates a markerless motion capture technique, realized with a common RGB-D camera, to synchronize the movement of models and people. We also adopt a GPU acceleration solution to ensure real-time simulation. We implemented the system in Unity3D, using the Taichi programming language to control and simulate the garment. We verify the significance of GPU acceleration and conduct several experiments to demonstrate the real-time performance of the simulation-based virtual try-on system. We compare the simulation time on CPU and GPU and validate that the accuracy of motion capture satisfies the virtual try-on task. Finally, we conduct a user study to find out whether the average consumer is satisfied with our proposed virtual try-on system.

Xingquan Cai (School of Information Science and Technology, North China University of Technology, China)
Qingtao Lu (School of Information Science and Technology, North China University of Technology, China)
Jiali Yao (School of Information Science and Technology, North China University of Technology, China)
Yao Liu (School of Information Science and Technology, North China University of Technology, China)
Yan Hu (School of Information Science and Technology, North China University of Technology, China)
An Ancient Murals Inpainting Method Based on Bidirectional Feature Adaptation and Adversarial Generative Networks

ABSTRACT. To address the issue of varying degrees of damage in ancient Chinese murals due to their age and human-induced destruction, we propose a mural image restoration method based on bidirectional feature adaptation and adversarial generative networks. The proposed method first preprocesses the mural images by resizing them and extracting masked feature maps and their corresponding reverse masked feature maps. Subsequently, an improved U-Net generator model is constructed, which captures bidirectional semantic information from the masked feature maps, enhancing the restoration of irregular regions in the mural images. Additionally, a spatial attention mechanism is introduced to adaptively enhance the features of known regions in the mural images. Furthermore, a discriminator model is constructed to discriminate between the restored mural images and real images, outputting a binary classification matrix. Finally, the network model is constrained by adversarial loss, pixel reconstruction loss, style loss, and perceptual loss to generate mural images with rich textures. Experimental results demonstrate that the proposed method effectively restores mural images with different levels of damage and produces mural images with finer texture information compared to conventional mural restoration methods. This method contributes to the preservation and inheritance of traditional Chinese culture by providing an effective means for mural image restoration.

Xingquan Cai (School of Information Science and Technology, North China University of Technology, China)
Sichen Jia (School of Information Science and Technology, North China University of Technology, China)
Jiali Yao (School of Information Science and Technology, North China University of Technology, China)
Yijie Wu (School of Information Science and Technology, North China University of Technology, China)
Haiyan Sun (School of Information Science and Technology, North China University of Technology, China)
An Image Extraction Method for Traditional Dress Pattern Line Drawings Based on Improved CycleGAN

ABSTRACT. To address the problem of missing details in the general dress pattern line extraction method, we propose a traditional dress pattern line extraction method based on the improved CycleGAN. First, we input the traditional dress pattern image and extract the outline edge image by using a bi-directional cascade network. Afterwards, we construct an improved CycleGAN network model, input the traditional dress pattern image and its outline edge image into the generator model for line drawing extraction, use the discriminator model to discriminate between the generated image and the real image, and output the binary classification matrix. Finally, we construct the adversarial loss, cycle consistency loss and contour consistency loss functions to constrain the network model and output a detail-rich line drawing image. Experiments show that the proposed method extracts traditional dress pattern line images with well-preserved details, and the generated traditional dress pattern line images have more realistic and natural lines compared with other dress pattern line extraction methods. The method can accurately extract traditional costume pattern line images and contribute to the preservation and transmission of Chinese traditional costume culture.
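The cycle consistency loss mentioned in this abstract is, in the standard CycleGAN formulation, an L1 penalty on images mapped to the other domain and back. The sketch below shows that standard form only; the toy "generators" are hypothetical stand-ins for trained networks, and the contour consistency term of the paper is not reproduced here.

```python
def l1_distance(a, b):
    """Mean absolute difference between two flat images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, g, f):
    """g maps domain X -> Y, f maps domain Y -> X.

    Images are flat lists of floats in this sketch.
    """
    forward = l1_distance(f(g(x)), x)   # x -> G(x) -> F(G(x)) should recover x
    backward = l1_distance(g(f(y)), y)  # y -> F(y) -> G(F(y)) should recover y
    return forward + backward


# Hypothetical toy mappings (assumptions, not trained models).
invert = lambda img: [1.0 - v for v in img]  # its own inverse
identity = lambda img: list(img)
halve = lambda img: [0.5 * v for v in img]

x = [0.25, 0.5, 0.75]
y = [0.75, 0.5, 0.25]
perfect = cycle_consistency_loss(x, y, invert, invert)
imperfect = cycle_consistency_loss([0.5, 1.0], [1.0, 0.0], identity, halve)
```

When the two mappings are exact inverses the loss vanishes; any information lost on the round trip shows up as a positive penalty.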

Alexis Benamira (University of Central Florida, United States)
Sachin Shah (University of Central Florida, United States)
Sumanta Pattanaik (University of Central Florida, United States)
Parametrization of Measured BRDF for Flexible Material Editing
PRESENTER: Alexis Benamira

ABSTRACT. Finding a low dimensional parametric representation of measured BRDF remains challenging. Currently available solutions are either not usable for editing, or rely on limited analytical solutions, or require expensive test subject based investigations. In this work, we strive to establish a parametrization space that affords the data-driven representation variance of measured BRDF models while still offering the artistic control of parametric analytical BRDFs. We present a machine learning approach that generates a parameter space relying on a compressed disentangled representation of the measured BRDF data. After training our network, we analyze the parametrization space and interpret the learned generative factors utilizing our visual perception. It should be noted that visual analysis is called upon downstream of the system for identification purposes contrary to most other existing methods where it is used upfront to elaborate the parametrization. Furthermore, we do not need a test subject investigation. A novel feature of our parametrization is the post-processing capability to incorporate new parameters along with the learned ones, thus expanding the richness of producible appearances. Furthermore, our solution allows more flexible and controllable material editing possibilities than current machine learning solutions. Finally, we provide a rendering interface, for interactive material editing and interpolation based on the presented new parametrization system.

Ruhan He (China)
Xuelian Yang (China)
Jin Huang (China)
cGAN-based Garment Line Draft Colorization Using A Garment-Line Dataset

ABSTRACT. The garment line draft is the basis of clothing design. Automatic or semi-automatic colorization of garment line drafts will improve the efficiency of fashion designers and reduce the drawing cost. In this paper, we present a garment line draft colorization method based on cGAN, which can support user interaction by adding scribbles to guide the colorization process. Due to the inadequacy of the garment line drafts, we construct a paired garment-line image dataset for training our colorization model. While existing methods for line art colorization are able to generate plausible colorized results, they tend to suffer from the color bleeding issue. We introduce a region segmentation fusion mechanism to aid colorization frameworks in avoiding color bleeding. Finally, we use a joint bilateral filter to smooth the output results and generate clearer and more vivid coloring images. The experimental results show that each module in the method contributes to the final result. In addition, the comparison with the classical methods shows that our method avoids large areas of leakage in the background and produces cleaner garment details.

Xiaying Liu (Hangzhou Dianzi University, China)
Ping Yang (Hangzhou Dianzi University, China)
Alexandru C. Telea (Department of Information and Computing Science, Utrecht University, Netherlands)
Jiří Kosinka (University of Groningen, Netherlands)
Zizhao Wu (Hangzhou Dianzi University, China)
PCCNet: A Few-Shot Patch-wise Contrastive Colorization Network

ABSTRACT. Few-shot colorization aims to learn a model to colorize grayscale images with little training data. Yet, existing models often fail to keep color consistency due to ignored patch correlations of the images. In this paper, we propose PCCNet, a novel Patch-wise Contrastive Colorization Network to learn color synthesis by measuring the similarities and variations of image patches in two different aspects: inter-image and intra-image. Specifically, for inter-image, we investigate a patch-wise contrastive learning mechanism with positive and negative samples constraint to distinguish color features between patches across images. For intra-image, we explore a new intra-image correlation loss function to measure the similarity distribution which reveals structural relations between patches within an image. Furthermore, we augment our network with a color memory module to remember the correct color for specific kinds of structures and textures. Experiments show that our method allows the correct color to spread naturally over objects and also achieves higher scores in quantitative comparisons with related methods.
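Contrastive learning with positive and negative samples, as described in this abstract, is often implemented with an InfoNCE-style loss. The sketch below shows that generic loss on toy patch feature vectors; it is not claimed to be the PCCNet loss, and the cosine similarity, the temperature value, and the example vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(query, positive, negatives, temperature=0.07):
    """-log( exp(s_pos / t) / sum_i exp(s_i / t) ) over positive + negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Hypothetical 2-D patch features.
query = [1.0, 0.0]
well_separated = info_nce(query, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
confused = info_nce(query, [0.0, 1.0], [[1.0, 0.0]])
```

The loss is small when the query patch is closer to its positive than to every negative, and grows when a negative is the nearest neighbour.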

Jiaze He (Wuhan Textile University, China)
Wenqing Zhao (Wuhan Textile University, China)
Ziruo Li (Wuhan University of Technology, China)
Jin Huang (Wuhan Textile University, China)
Ping Li (Hong Kong Polytechnic University, Hong Kong)
Lei Zhu (The Hong Kong University of Science and Technology, Hong Kong)
Bin Sheng (Shanghai Jiao Tong University, China)
Subrota Kumar Mondal (Macau University of Science and Technology, Taipa, Macau, Macao)
Reference-Based Line Drawing Colorization through Diffusion Model

ABSTRACT. Line drawing colorization is an indispensable stage in the image painting process; however, traditional manual coloring requires a lot of time and energy from professional artists. With the development of deep learning techniques, attempts have been made to colorize line drawings by means of user prompts, text, etc., but these methods still require some manual involvement. In this paper, we propose a reference-based colorization method for cartoon line drawings, which uses a more stable diffusion model to automatically colorize line drawings and introduces a skeleton map as an additional guide to reduce the bleeding problem encountered during colorization and improve the quality of the generated images. In addition, to further learn the color of the reference image and improve the quality of the colorized image, we also design a two-stage training strategy, which first trains a pre-trained model matching the cartoon features on a large dataset and then obtains the final model by fine-tuning on a small dataset. To ensure the generality of the model, in addition to the 17,769 benchmark datasets shared on the Kaggle website, we used the cartoon dataset provided by the competition in the fine-tuning phase and produced a garment dataset with cartoon features, which we hope will contribute to the field of garment design. Finally, we illustrate the effectiveness of the model in reference-based automatic coloring through a large number of qualitative and quantitative experiments.

Yan Wan (Donghua University, China)
Yue Wang (Donghua University, China)
Li Yao (Donghua University, China)
Research of Virtual Try-on Technology Based on Two-dimensional Image

ABSTRACT. Virtual try-on based on two-dimensional images uses given clothes to replace the clothes in a human image and generate try-on images. To solve the problems of blurred human images, missing body parts, and clothes that cannot be correctly warped according to the posture of the human image after fitting, this paper improves the Flow-Style-VTON network and proposes the virtual try-on method A-VITON. First, residual blocks and the CBAM attention mechanism are added to the UNet network of the try-on module to improve the network's ability to extract features of the target object, so that the generated try-on images are more realistic. Secondly, this paper also proposes a layered virtual try-on method to provide consumers with more diverse try-on services. Finally, in order to reduce the interference of complex backgrounds on the try-on results, this paper proposes a virtual try-on method with background for the first time, which can generate high quality try-on images while preserving the background of the original images. The try-on results on the VITON dataset show that the proposed method has great advantages in generating high quality try-on images.

16:00-18:00 Session LNCS8-Synthesis: Synthesis and Generation

Zoom Link:        Meeting ID: 854 6450 7535, Password: cgi2023

Issei Fujishiro (Keio University, Japan)
Marco Mameli (Università Politecnica delle Marche, Italy)
Emanuele Balloni (Università Politecnica delle Marche, Italy)
Adriano Mancini (Università Politecnica delle Marche, Italy)
Emanuele Frontoni (University of Macerata, Italy)
Primo Zingaretti (Università Politecnica delle Marche, Italy)
Investigation on the Encoder-Decoder application for Mesh generation
PRESENTER: Emanuele Balloni

ABSTRACT. In computer graphics, 3D modeling is a fundamental concept. It is the process of creating three-dimensional objects or scenes using specialized software that allows users to create, manipulate and modify geometric shapes to build complex models. This operation requires a huge amount of time to perform and specialised knowledge. Typically, it takes three to five hours of modelling to obtain a basic mesh from the blueprint. Several approaches have tried to automate this operation to reduce modelling time. The most interesting of these approaches are based on Deep Learning, and one of the most interesting is Pixel2Mesh. However, training this network requires at least 150 epochs to obtain usable results. Starting from these premises, this work investigates the possibility of training a modified version of the Pixel2Mesh in fewer epochs to obtain comparable or better results. A modification was applied to the convolutional block to achieve this, replacing the classification-based approach with an image reconstruction-based approach. This modification uses a configuration based on constructing an encoder-decoder architecture using state-of-the-art networks such as VGG, DenseNet, ResNet, and Inception. Using this approach, the convolutional block learns how to reconstruct the image correctly from the source image by learning the position of the object of interest within the image. With this approach, it was possible to train the complete network in 50 epochs, achieving results that outperform the state-of-the-art. The tests performed on the networks show an increase of 0.5 percentage points over the state-of-the-art average.

Sijia Yang (National University of Defense Technology, China)
Yun Zhou (National University of Defense Technology, China)
Arbitrary Style Transfer with Style Enhancement and Structure Retention

ABSTRACT. Arbitrary style transfer transfers the style of any reference image to another image through a trained neural network while retaining its content as much as possible. However, the early style transfer approaches perform poorly, while some later methods generate results that are over-adapted to the style image and struggle to preserve the image structure. To solve the above problems, we propose a new style transfer method based on a neural network structure with two modules: the style enhancement module (SEM) and the structure retention module (SRM). SEM aligns the statistics of the style image and the stylized image in the feature space. SRM uses fast Fourier transform and Gaussian high-pass filtering to align the high-frequency information of the content image and the transferred image simultaneously in the frequency domain and the spatial domain. This new approach works well in both style transfer and content retention. Both experimental results and the questionnaire survey show that our method can generate satisfactory stylized images without missing content information.
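To make the frequency-domain step concrete, the following is a minimal sketch of Gaussian high-pass filtering on a 1D signal. It is a stand-in under stated assumptions: the paper works on 2D images with a fast Fourier transform, whereas this sketch uses a naive O(n^2) discrete Fourier transform from the standard library, and the function names and the `sigma` value are hypothetical.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (an FFT computes the same result faster)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(signal[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def gaussian_high_pass(signal, sigma=1.0):
    """Attenuate low frequencies with gain 1 - exp(-f^2 / (2 sigma^2))."""
    n = len(signal)
    spectrum = dft(signal)
    filtered = []
    for k, value in enumerate(spectrum):
        freq = min(k, n - k)  # distance from the DC bin, with wrap-around
        gain = 1.0 - math.exp(-(freq ** 2) / (2.0 * sigma ** 2))
        filtered.append(value * gain)
    return [v.real for v in dft(filtered, inverse=True)]


flat = gaussian_high_pass([3.0] * 8)         # constant signal: only a DC component
edges = gaussian_high_pass([1.0, -1.0] * 4)  # fastest possible oscillation
```

A constant signal is removed almost entirely, while the rapidly alternating signal passes through nearly unchanged; this is the sense in which high-pass filtering isolates edges and fine structure.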

Bo Han (Zhejiang University, China)
Yitong Fu (Zhejiang university, China)
Yixuan Shen (National University of Singapore, Singapore)
Zero3D: Semantic-Driven 3D Shape Generation For Zero-shot Learning

ABSTRACT. Semantic-driven 3D shape generation aims to generate 3D shapes conditioned on textual input. However, previous approaches have faced challenges with the single-category generation, low-frequency details, and the requirement for large quantities of paired data. To address these issues, we propose a multi-category diffusion model. Specifically, our approach includes the following components: 1) To mitigate the problem of limited large-scale paired data, we establish a connection between text, 2D images, and 3D shapes through the use of the pre-trained CLIP model, enabling zero-shot learning. 2) To obtain the multi-category 3D shape feature, we employ a conditional flow model to generate a multi-category shape vector conditioned on the CLIP embedding. 3) To generate multi-category 3D shapes, we utilize a hidden-layer diffusion model conditioned on the multi-category shape vector, resulting in significant reductions in training time and memory consumption. We evaluate the generated results of our framework and demonstrate that our method outperforms existing methods.

Andreea Pocol (University of Waterloo, Canada)
Lesley Istead (University of Carleton, Canada)
Sherman Siu (University of Waterloo, Canada)
Sara Kodeiri (University of Waterloo, Canada)
Sabrina Mohktari (University of Waterloo, Canada)
Seeing Is No Longer Believing: A Survey on the State of Deepfakes, AI-Generated Humans, and Other Nonveridical Media
PRESENTER: Andreea Pocol

ABSTRACT. Did you see that crazy photo of Chris Hemsworth wearing a gorgeous, blue ballgown? What about the leaked photo of Bernie Sanders dancing with Sarah Palin? If these don't sound familiar, it's because these events never happened, but with text-to-image generators and deepfake AI technologies, it is effortless for anyone to produce such images. Over the last decade, there has been an explosive rise in research papers, as well as tool development and usage, dedicated to deepfakes, text-to-image generation, and image synthesis. These tools provide users with great creative power, but with that power comes "great responsibility;" it is just as easy to produce nefarious and misleading content as it is to produce comedic or artistic content. Therefore, given the recent advances in the field, it is important to assess the impact they may have. In this paper, we conduct meta-research on deepfakes to visualize the evolution of these tools and paper publications. We also identify key authors, research institutions, and papers based on bibliometric data. Finally, we conduct a survey that tests the ability of participants to distinguish photos of real people from fake, AI-generated images of people. Based on our meta-research, survey, and background study, we conclude that humans are falling behind in the race to keep up with AI, and we must be conscious of the societal impact.

Yuantian Huang (University of Tsukuba, Japan)
Satoshi Iizuka (University of Tsukuba, Japan)
Kazuhiro Fukui (University of Tsukuba, Japan)
Diffusion-based Semantic Image Synthesis from Sparse Layouts
PRESENTER: Yuantian Huang

ABSTRACT. We present an efficient framework for generating landscape images from sparse semantic layouts via diffusion models. Previous approaches use dense semantic label maps to generate photorealistic images, where the quality of the results highly depends on the shape of each semantic region. In practice, however, it is not trivial to create detailed and accurate semantic layouts in order to obtain plausible results from these methods. To address this issue, we propose a novel type of input that is more sparse and intuitive for use in real-world settings. Our learning-based framework incorporates a carefully designed random masking process to simulate real user input during model training. We leverage the Semantic Diffusion Model (SDM) as a generator to transform sparse label maps into full landscape images where missing semantic information is complemented based on the learned image structure. Furthermore, through a model distillation process, we achieve comparable inference speed to GAN-based models while preserving the generation quality. After training with the well-designed random masking process, the proposed framework is able to generate high-quality landscape images with sparse and intuitive inputs, which is useful for practical applications. Experiments show that our proposed method outperforms existing approaches both quantitatively and qualitatively.

Jia Chen (School of Computer Science and Artificial Intelligence,Wuhan Textile University, China)
Yanfang Wen (School of Computer Science and Artificial Intelligence,Wuhan Textile University, China)
Jin Huang (School of Computer Science and Artificial Intelligence,Wuhan Textile University, China)
Xinrong Hu (School of Computer Science and Artificial Intelligence,Wuhan Textile University, China)
Tao Peng (School of Computer Science and Artificial Intelligence,Wuhan Textile University, China)
FoldGEN: Multimodal Transformer for Garment Sketch-to-photo Generation
PRESENTER: Yanfang Wen

ABSTRACT. Garment sketch-to-photo generation is one of the most important steps in the process of garment design. Most existing methods use only a single type of conditional information and have difficulty handling combinations of multiple conditions, while also failing to generate garment folds based on sketch strokes and suffering from low fidelity. Therefore, in this paper, we propose a two-stage multi-modal framework for the generation of garment images, FoldGEN, to generate garment images with folds using sketches and descriptive text as conditional information. In the first stage, we combine feature matching of discriminators and semantic perception of Convolutional Neural Network in vector quantization, which can reconstruct the details and folds of the garment images. In the second stage, a multi-conditional constrained Transformer is used to establish the association between different modality data, which allows the generated images to contain not only text description information but also folds corresponding to the strokes of the sketch. Experiments show that our method can generate garment images with different folds from sketches with high fidelity, while achieving the best FID and IS on both unimodal and multimodal tasks.

Ruien Shen (DALAB Shanghai Jiao Tong University, China)
Chi Weng Ma (DALAB Shanghai Jiao Tong University, China)
Deli Dong (Shanghai Jiao Tong University, China)
Shuangjiu Xiao (School of software, Shanghai Jiao Tong University, China)
Light Accumulation Map for Natural Foliage Scene Generation

ABSTRACT. Foliage scene generation is an important problem in virtual reality applications. Realistic virtual floras require simulation of real plant symbiotic principles. Among the factors that affect the spatial distribution of plants, lighting is the most important one. The change of seasons, geographic locations, and shading from higher plants will greatly affect the sunlight conditions for different plants in floras, which cannot be easily described with parameters. In order to generate a natural foliage scene that accurately reflects the sunlight condition while maintaining efficiency, we propose a novel method named Light Accumulation Map (LAM) which stores sunlight receiving and occlusion information of each tree model. By calculating sun lighting accumulation during one year at different latitudes, we simulate the sunlight occlusion effect of the tree model and store the occlusion result as LAM. Then, a LAM-based foliage generation algorithm is presented to simulate accurate foliage distribution with different latitudes and seasons. The evaluation shows that our method exhibits strong adaptability in creating a lifelike distribution of foliage, particularly in undergrowth areas, across various regions and throughout different seasons of the year.

Frederick W. B. Li (University of Durham, UK)
DrawGAN: Multi-view Generative Model Inspired By The Artist's Drawing Method

ABSTRACT. We present a novel approach for modeling artists' drawing processes using an unconditional generative adversarial network (GAN) architecture with a multi-view generator and multi-discriminator. The proposed method can synthesize different types of picture drawing, including line drawing, shading, and color drawing, with high quality and robustness. Also, the proposed method outperforms the existing state-of-the-art unconditional GANs. The novelty of our approach lies in the design of the architecture that closely resembles the typical sequence of an artist's drawing process, which can significantly enhance the quality of the generated images. Our experimental results demonstrate the potential of using a multi-view generative model to provide more feature knowledge for modulating image generation processes. The proposed method holds promise for advancing the field of AI in the visual arts, and can open new avenues for research and creative practices.

Diego Thomas (Kyushu University, Japan)
Takumi Kitamura (Kyushu University, Japan)
Hiroshi Kawasaki (Kyushu University, Japan)
Naoya Iwamoto (Huawei, Japan)
A Two-step Approach for Interactive Animatable Avatars
PRESENTER: Takumi Kitamura

ABSTRACT. We propose a new two-step human body animation technique based on displacement mapping that can learn a detailed deformation space, runs at interactive rates (more than 30 fps) and can be directly integrated into standard animation environments. To achieve real-time animation we employ a template-based approach and model pose-dependent deformations with 2D displacement images. We propose our own template model to facilitate and automate training data preparation. The key to achieving detailed animation with few artifacts is to learn pose-dependent displacements directly in the pose space, without having to predict skinning weights. In order to generalize to totally new motions we employ a two-step approach where the first step contains knowledge about general human motion while the second step contains information about user-specific motion. Our experimental results show that our proposed method can animate an avatar up to 300 times faster than baselines while keeping a similar or even better level of detail.

Akinori Ishitobi (Keio University, Japan)
Masanori Nakayama (Keio University, Japan)
Issei Fujishiro (Keio University, Japan)
Visual simulation of crack generation and bending in deteriorated films coated on metal objects: Combination of static fracture and position-based deformation
PRESENTER: Akinori Ishitobi

ABSTRACT. Weathering, an expression of degradation caused by rain and wind, is essential for photorealistic computer graphics. One of the most typical targets of weathering is metal, which is omnipresent in reality. However, to reproduce scenes realistically, rust-proof paint applied to metal surfaces cannot be ignored. In our study, we propose a weathering method for coated films on metal objects. Our method models a coated film as a 3D triangular polygon mesh and deforms it by combining two kinds of simulations: static simulation for determining fractures based on the balance of the internal forces and the position-based bend simulation for moving vertices according to geometric constraints. Our method can digitally reproduce the deterioration of coated films using complex 3D deformation, which is difficult to express by material manipulation only.
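Position-based simulation moves vertices directly so that geometric constraints are satisfied. As a hedged illustration of that idea, the sketch below projects the simplest such constraint, a distance constraint between two vertices, in the usual position-based dynamics form. It is not the bending constraint or the fracture criterion used in the paper, and the function name and the inverse-mass parameters `w1` and `w2` are assumptions.

```python
import math


def project_distance_constraint(p1, p2, rest_length, w1=1.0, w2=1.0):
    """Move p1 and p2 along their connecting line so |p2 - p1| == rest_length.

    w1 and w2 are inverse masses; a vertex with inverse mass 0 stays fixed.
    """
    delta = [b - a for a, b in zip(p1, p2)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist == 0.0 or w1 + w2 == 0.0:
        return list(p1), list(p2)  # direction undefined or both vertices pinned
    c = dist - rest_length          # constraint violation
    n = [d / dist for d in delta]   # unit direction from p1 to p2
    s1 = w1 / (w1 + w2) * c
    s2 = w2 / (w1 + w2) * c
    new_p1 = [a + s1 * ni for a, ni in zip(p1, n)]
    new_p2 = [b - s2 * ni for b, ni in zip(p2, n)]
    return new_p1, new_p2


# An edge stretched to length 2 with rest length 1.
a, b = project_distance_constraint([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], 1.0)
# Same edge with the first vertex pinned (inverse mass 0).
pinned, moved = project_distance_constraint(
    [0.0, 0.0, 0.0], [2.0, 0.0, 0.0], 1.0, w1=0.0)
```

With equal inverse masses the correction is split evenly between the two vertices; pinning one vertex shifts the whole correction to the other.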