
08:15-09:00 Session Keynote3: A/Prof. Daisuke Iwai, Osaka University, Graduate School of Engineering Science

Zoom Link:        Meeting ID: 849 0583 5505, Password: cgi2023

Daisuke Iwai, Associate Professor, Osaka University, Graduate School of Engineering Science

Talk Title: TBD

Abstract: TBD

Bio: Daisuke Iwai is an Associate Professor at the Graduate School of Engineering Science, Osaka University, Japan. After receiving his PhD degree from Osaka University in 2007, he started his career at Osaka University. He was also a visiting scientist at Bauhaus-University Weimar, Germany, from 2007 to 2008, and a visiting Associate Professor at ETH, Switzerland, in 2011. His research interests include augmented reality, projection mapping, and human-computer interaction. He currently serves as an Associate Editor of IEEE Transactions on Visualization and Computer Graphics (TVCG), and previously served as Program Chair of the IEEE International Symposium on Mixed and Augmented Reality (ISMAR) (2021, 2022) and the IEEE Conference on Virtual Reality and 3D User Interfaces (VR) (2022). His publications received Best Paper Awards at IEEE VR (2015), the IEEE Symposium on 3D User Interfaces (3DUI) (2015), and IEEE ISMAR (2021). He is a recipient of the JSPS Prize (2023).

09:30-10:30 Session LNCS15-Detection3: Detection and Recognition

Zoom Link:        Meeting ID: 849 0583 5505, Password: cgi2023

Mohammed Bennamoun (The University of Western Australia, Australia)
Junqi Li (Wuhan Textile University, China)
Tao Peng (Wuhan Textile University; Engineering Research Center of Hubei Province of Clothing Information, China)
Junjie Huang (Wuhan Textile University, China)
Junping Liu (Wuhan Textile University; Hubei Provincial Engineering Research Center for Intelligent Textile and Fashion, China)
Xinrong Hu (Wuhan Textile University; Hubei Provincial Engineering Research Center for Intelligent Textile and Fashion, China)
Zili Zhang (Wuhan Textile University, China)
Yu Mao (School of Life Science, Hubei University, China)
Driver action recognition based on Dynamic Adaptive Transformer

ABSTRACT. In industrial-grade applications, the efficiency of algorithms and models takes precedence, ensuring a certain level of performance while aligning with the specific requirements of the application and the capabilities of the underlying equipment. In recent years, the Vision Transformer has been introduced as a powerful approach to significantly improve recognition accuracy in various tasks. However, it faces challenges concerning portability, as well as high computational and input requirements. To tackle these issues, a dynamic adaptive transformer (DAT) has been proposed. This innovative method involves dynamic parameter pruning, enabling the trained Vision Transformer to adapt effectively to different tasks. Experimental results demonstrate that the dynamic adaptive transformer (DAT) is capable of reducing the model's parameters and GMACs with minimal accuracy loss.

Charbel El Achkar (Antonine University, Lebanon)
Raphaël Couturier (FEMTO-ST Institute, CNRS, Université Bourgogne Franche-Comté (UBFC), France)
Abdallah Makhoul (FEMTO-ST Institute, CNRS, Université Bourgogne Franche-Comté (UBFC), France)
Talar Atéchian (Antonine University, Lebanon)
Leveraging Computer Vision Networks for Guitar Tablature Transcription

ABSTRACT. Generating music-related notations offers assistance for musicians in the path of replicating the music using a specific instrument. In this paper, we evaluate the state-of-the-art guitar tablature transcription network named TabCNN against state-of-the-art computer vision networks. The evaluation is performed using the same dataset as well as the same evaluation metrics of TabCNN. Furthermore, we propose a new CNN-based network named TabInception to transcribe guitar-related notations, also called guitar tablatures. The network relies on a custom inception block converged by dense layers. The TabInception network outperforms the TabCNN in terms of multi-pitch precision (MP), tablature precision (TP), and tablature F-measure (TF). Moreover, the Swin Transformer achieves the best score in terms of multi-pitch recall (MR) and tablature recall (TR), while the Vision Transformer achieves the best score in terms of multi-pitch F-measure (MF). These results were acquired while training all the networks with 8 or 16 epochs. Motivated by the previous insights, we train the networks with more epochs and propose another network named Inception Transformer (InT) to surpass all the estimation metrics of TabCNN using a single network. The InT network relies on an inception block converged by a Transformer Encoder. The TabInception and the InT network outperformed all estimation metrics of TabCNN except the tablature disambiguation rate (TDR) when trained using a bigger epoch size. The TabInception achieved a value of 95.33% for MP, 78.34% for MR, 86% for MF, 86.39% for TP, 73.9% for TR, 79.65% for TF, and 90.6% for TDR. The InT network achieved a value of 94.81% for MP, 91.4% for MR, 93.07% for MF, 85.51% for TP, 80.41% for TR, 82.8% for TF, and 90.1% for TDR.
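
As a quick consistency check on the figures above, the multi-pitch F-measure can be recomputed from the reported precision and recall. This is only a sketch: it assumes MF is the usual harmonic mean of MP and MR, which the abstract does not state explicitly, and the function name `f_measure` is illustrative rather than taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# MP and MR values as reported in the abstract (percent).
tab_inception_mf = f_measure(95.33, 78.34)
int_mf = f_measure(94.81, 91.4)

print(round(tab_inception_mf, 2), round(int_mf, 2))
```

Under that assumption, the recomputed values agree with the reported MF of 86% for TabInception and 93.07% for InT.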

Shiqi Zou (Shanghai University, China)
Jingqiao Zhang (Shanghai University, China)
Serial spatial and temporal transformer for point cloud sequences recognition

ABSTRACT. Point cloud sequences are unordered and irregular, which makes extracting spatial and temporal features from them challenging. This paper presents a novel network named Serial Spatial and Temporal Transformer (SerialSTTR) for point cloud sequence recognition. Specifically, point-based self-attention is used to gather global information on each point at the spatial level, and frame-based self-attention is used to reconstruct the sequences with motion features at the temporal level. In addition, an orderly local module is proposed to supplement the local feature learning ability that the spatial transformer lacks, and relative position encoding is adopted to supply the order information for the temporal transformer. Extensive experiments demonstrate that SerialSTTR achieves state-of-the-art performance on 3D human action recognition with the challenging MSR-Action3D dataset. To show its generalizability, experiments on gesture recognition with the SHREC'17 dataset are also performed and yield competitive results.

Bowen Deng (Institute of automation, Chinese Academy of Sciences\Guangxi University, China)
Shuangliang Zhao (Guangxi University, China)
Dongchang Liu (Institute of automation, Chinese Academy of Sciences, China)
TadML: A fast temporal action detection with Mechanics-MLP

ABSTRACT. Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. Its goal is to detect both the action category and the start and end frames of each action instance in long, untrimmed video. While most current TAD models utilize both RGB and Optical-Flow streams, the manual conversion of original RGB frames into Optical-Flow frames requires additional computation and time, presenting an obstacle to achieving real-time processing. At present, many models adopt two-stage methods that focus on proposal generation in the first step, leading to a significant slowdown in inference speed. To overcome this challenge, we propose a one-stage anchor-free temporal localization method that utilizes only the RGB stream. We establish a novel Newtonian Mechanics multi-layer perceptron (Mechanics-MLP) architecture that achieves accuracy comparable to all existing state-of-the-art (SOTA) models while surpassing their inference speed by a large margin. Our approach achieves an impressive inference speed of 4.44 videos per second on THUMOS14. Since our method does not require the conversion of RGB frames into Optical-Flow frames, it has more potential in practical detection and has faster inference speed. Our study also demonstrates that MLP has great potential in downstream tasks such as TAD. The source code is available at /TadML

09:30-10:30 Session WC5-MIAV: CGI 2023 - Image Analysis and Visualization in Advanced Medical Imaging Technology

Zoom Link:        Meeting ID: 885 3670 3032, Password: cgi2023

Euijoon Ahn (James Cook University, Australia)
Lei Bi (University of Sydney, Australia)
Younhyun Jung (Gachon University, South Korea)
Mingjian Li (The University of Sydney, Australia)
Younhyun Jung (Gachon University, South Korea)
Michael Fulham (Royal Prince Alfred Hospital; The University of Sydney, Australia)
Jinman Kim (The University of Sydney, Australia)
Importance-aware 3D volume visualization for medical content-based image retrieval: A preliminary study

ABSTRACT. A Medical Content-Based Image Retrieval (CBIR) system is designed to retrieve images from large imaging repositories that are visually similar to the user's query image. CBIR is widely used for evidence-based diagnosis, teaching, and research. Although retrieval accuracy has been largely improved, there is limited development toward visualizing important image features that indicate the similarity of the retrieved images. Despite the prevalence of 3D volumetric data in medical imaging nowadays, such as computed tomography (CT), current CBIR systems still rely on 2D cross-sectional views for the visualization of the retrieved images. Such 2D visualization requires users to browse through the image stacks to confirm the similarity of the retrieved images, and often involves mental reconstruction of the 3D information, including the size, shape, and spatial relations of multiple structures. This process is time-consuming and reliant on the user’s experience. In this study, we propose an importance-aware 3D volume visualization. We automatically optimize the rendering parameters to maximize the visibility of the important structures that are detected and prioritized in the retrieval process. We then integrate our visualization into a CBIR system, thereby complementing 2D cross-sectional views for relevance feedback and further analysis. Using multi-modal positron emission tomography and computed tomography (PET-CT) images from a non-small cell lung cancer dataset, we present preliminary results demonstrating that 3D visualization can provide additional information.

Hail An (School of Computing, Gachon University, South Korea)
Jinman Kim (School of Computer Science, The University of Sydney, Australia)
Ping Li (Department of Computing and the School of Design, The Hong Kong Polytechnic University, Hong Kong)
Younhyun Jung (School of Computing, Gachon University, South Korea)
A Transfer Function Optimization Using Visual Saliency For Region of Interest-based Direct Volume Rendering

ABSTRACT. Direct volume rendering (DVR) helps data interpretation by enabling users to interactively focus attention on specific regions in a volume that are of most interest to them. The ideal visualization of these regions of interest (ROIs), however, remains a major challenge. The visual attention given to ROIs depends on the appropriate assignment of optical parameters (opacity and/or color) to ROIs as well as other regions via a transfer function (TF), and this is typically a repetitive trial-and-error process that starts from a blank TF. Various automated TF optimization approaches have been proposed to reduce the extensive user involvement. They fine-tune initial TF parameters iteratively toward satisfying a pre-defined objective metric. In this work, we propose a new TF optimization approach in which we introduce a visual saliency-based objective metric, motivated by the conceptual property of visual saliency as a biologically inspired measure that aids the identification of regions considered important by the human visual system. Our approach is capable of optimizing opacity and color parameters according to the user-defined target visual saliency of ROIs and producing DVR images that direct visual attention to the ROIs. In addition, we provide intuitive ROI selection via an image-based user interaction that operates directly on an initial DVR space rather than a complex TF parameter space. We demonstrate our approach on a variety of volumetric datasets and highlight its advantages in comparison to current state-of-the-art TF optimization approaches that use a visibility-based objective metric for opacity parameter optimization.

Suhyeon Kim (School of Computing, Gachon University, Seongnam-si, South Korea)
Hail An (School of Computing, Gachon University, Seongnam-si, South Korea)
Myungji Song (School of Computing, Gachon University, Seongnam-si, South Korea)
Sungmin Lee (School of Computing, Gachon University, Seongnam-si, South Korea)
Hoijoon Jung (School of Computer Science, University of Sydney, Sydney, Australia)
Seontae Kim (Department of Otolaryngology-Head and Neck Surgery, College of Medicine, Gil Medical Center, South Korea)
Younhyun Jung (School of Computing, Gachon University, Seongnam-si, South Korea)
Automated Marker-less Patient-to-Preoperative Medical Image Registration approach using RGB-D Images and Facial Landmarks for Potential Use in Computed-Aided Surgical Navigation of the Paranasal Sinus
PRESENTER: Suhyeon Kim

ABSTRACT. Paranasal sinus surgery is an established treatment option for chronic rhinosinusitis. Because this surgery is performed inside the nasal cavity, where critical anatomical structures, such as optic nerves and pituitary glands, exist nearby, surgeons usually rely on computer-aided surgical navigation (CSN) to provide a wide field of view in the surgical site and to allow for precise control of surgical instruments. In CSN, it is essential to register the surgical site of the actual patient with the corresponding view from the preoperative computed tomography (CT) images. Traditional registration approaches are performed manually by the user or automatically by attaching fiducial markers on both the patient's surgical site and preoperative CT images before every surgery. In this work, we propose an automated approach to patient-to-preoperative CT image registration without fiducial markers. The proposed approach detects and extracts facial anatomical landmarks in 2D RGB images through the use of deep learning models. These landmarks are located in a 3D facial mesh reconstructed from depth images by using unprojection and ray-marching algorithms. The facial landmark pairs acquired from the patient site and the preoperative CT images are then registered with singular value decomposition and iterative closest point algorithms. We demonstrate the registration capability of our approach using Microsoft HoloLens 2, a mixed reality head-mounted display, because it facilitates the acquisition of RGB-depth images and the prototype development of in-situ visualization to illustrate how the CT images are properly registered on the target surgical site. We compared our automated marker-less registration approach to the manual counterpart using a facial phantom with three participants. The results show that our approach produces relatively good registration accuracy, with a marginal target registration error of 4.4 mm when compared to the manual counterpart.
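
For readers unfamiliar with the accuracy measure quoted above, the sketch below shows one common way a target registration error (TRE) is computed: the mean Euclidean distance between target points after registration and their reference positions. The abstract does not specify the exact protocol used in the study, so the function name and the sample coordinates here are hypothetical and serve only as an illustration.

```python
import math


def target_registration_error(registered, reference):
    """Mean Euclidean distance between paired 3D points (same unit as input)."""
    if len(registered) != len(reference) or not registered:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) for p, q in zip(registered, reference))
    return total / len(registered)


# Hypothetical target points in millimetres; these are NOT data from the study.
registered = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
reference = [(3.0, 4.0, 0.0), (10.0, 0.0, 5.0)]

tre_mm = target_registration_error(registered, reference)
print(f"TRE = {tre_mm:.1f} mm")
```

Both sample pairs are 5 mm apart, so the mean error in this toy case is 5 mm.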

Ge Jin (University of Sydney, Australia)
Younhyun Jung (Gachon University, South Korea)
Jinman Kim (University of Sydney, Australia)
Challenges and Constraints in Deformation-Based Medical Mesh Representation

ABSTRACT. Mesh representations of medical imaging isosurfaces are essential for medical analysis. These representations are typically obtained by applying mesh extraction methods to segmented 3D volumes. However, the meshes extracted by such methods often suffer from undesired staircase artefacts. In this paper, we evaluate existing mesh deformation methods that deform a template mesh to desired shapes. We evaluate two variants of such methods on three datasets of varying topological complexity. Our objective is to demonstrate that, although mesh deformation methods have their limitations, they avoid the generation of staircase artefacts.

Aiwu Shi (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
Bei Sheng (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
Jin Huang (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
Gan Luo (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
Chao Han (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
He Huang (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
Shuran Ma (School of Computer Science and Artificial Intelligence, Wuhan Textile University, China)
LS-Net: COVID-19 Lesion Segmentation from CT image via Diffusion Probabilistic Model

ABSTRACT. Coronavirus Disease 2019 (COVID-19) ravaged the world in early 2020, causing great harm to human health. However, segmenting the infected areas from computed tomography (CT) images poses several challenges, including blurry boundaries between lesions and normal lung tissue and uncertainty in the scale, location, and texture of lesions. To solve these problems, a COVID-19 lesion segmentation network (LS-Net) based on a probabilistic diffusion model is proposed to segment lesion areas from CT images. The Feature Fusion Decoder module is introduced to aggregate high-level features and generate guidance for the subsequent steps so that small lesions are not omitted. In addition, an attention mechanism is used to focus on position information at lesion edges. As a result, the LS-Net framework can improve the precision of lesion segmentation on CT image slices. Experiments on datasets such as the COVID-19 CT Segmentation dataset show that LS-Net outperforms most current segmentation models.

10:30-11:00 Coffee Break

11:00-12:30 Session CAVW2

Zoom Link:        Meeting ID: 849 0583 5505, Password: cgi2023

Libo Sun (Southeast University, China)
Kenta Akita (Kyushu University, Japan)
Yuki Morimoto (Kyushu University, Japan)
Reiji Tsuruno (Kyushu University, Japan)
Hand-Drawn Anime Line Drawing Colorization of Faces with Texture Details
PRESENTER: Kenta Akita

ABSTRACT. Automatic or semi-automatic colorization can reduce the burden of illustrators in color illustration production, which is a research area with significant market demand because the colorization of line drawings of character illustrations is time-consuming. Texture details in eyes and hair influence the impression of character illustrations. Generally, these details are not expressed in line drawings, but in colorization in color illustration production. Many existing automatic or semi-automatic colorization methods do not target hand-drawn line drawings and it is difficult to paint texture details on such drawings. In this paper, we propose the semi-automatic colorization of character line drawings around faces with texture details. Our method uses a reference image as a color hint and transfers the textures of the reference image to a line drawing. To achieve this, our method uses segmentation masks to match parts of the line drawing with the same parts of the reference image. We create two types of segmentation datasets to train a segmentation network that creates segmentation masks. We transfer texture details to a hand-drawn line drawing by mapping each part of the reference image to the corresponding part of the line drawing using segmentation masks. We show that our method is more effective for hand-drawn line drawings than existing methods using qualitative and quantitative evaluations.

Chenyu Zang (Shandong University, China)
Wei Gai (Shandong University, China)
Haodong Li (Shandong University, China)
Chenzhi Xing (Shandong University, China)
Wenfei Wang (Shandong University, China)
Dongli Li (Second Experimental Kindergarten of the Jinan Huaiyin District, China)
Gaorong Lv (Shandong University, China)
Chenglei Yang (Shandong University, China)
Supporting Foot Interaction of Simple Reaction Time Training System
PRESENTER: Chenzhi Xing

ABSTRACT. Reaction time, the ability to detect, process and respond to stimuli, is one of the fundamental factors in human-computer interaction, and is a key cognitive skill in clinical and healthy populations. Good reaction time allows us to respond to stimuli and situations with agility and efficiency. How to train and improve a person's reaction time has become an important research question. In this paper, we present a new training genre that combines user-centered generation of personalized training objects with precise tracking of foot interaction. Virtual objects are created with respect to the user’s features and historical training effectiveness, and are in motion. A foot tracking algorithm based on a three-Gaussian model is designed to support interaction by stepping on moving virtual objects. We present the design and implementation of the system, as well as user studies. Findings illustrate that reaction time performance is significantly improved following seven days of training based on foot interaction.

Dong-Min Kim (Korea University, South Korea)
Jeonghyeon Ahn (Media Laboratory, Korea University, South Korea)
Jongmin Lee (Media Laboratory, Korea University, South Korea)
Myungho Kim (Iaan Co., Ltd, South Korea)
JungHyun Han (Korea University, South Korea)
Real-time Reconstruction of Pipes using RGB-D Cameras

ABSTRACT. This paper presents a novel method that automatically reconstructs pipes in real time using an RGB-D camera. The input image is decomposed into superpixels, and a pipe element, which is a 3D circle lying on the pipe surface, is generated for each superpixel. Over frames, the pipe elements are grouped into sequences and are finally modeled in parametric equations. The method is tested in daily settings, where the pipes may be curved, and their radii may vary along the axes. The test results show that the method is able to reconstruct pipes efficiently, precisely and robustly.

Qiang Li (Shenyang Aerospace University, China)
Peng Wang (Shenyang Aerospace University, China)
Zexue Liu (Shenyang Aerospace University, China)
Yuxin Zhang (Shenyang Aerospace University, China)
How Generous Interface Affect User Experience and Behavior: Evaluating the Information Display Interface for Museum Cultural Heritage
PRESENTER: Yuxin Zhang

ABSTRACT. The web interfaces of museum websites continue to play an important role in how users search for information and navigate through digital cultural heritage collections. Based on information behavior theory, a generous interface was deemed appropriate for displaying large amounts of digital cultural heritage. However, in this field of study, there is still a lack of example verification. This study compared the traditional interface and the generous interface of the Liaoning Provincial Museum to investigate the role of a generous interface for cultural heritage collections. Forty second-year design graduate students were randomly assigned to one of two interface groups: traditional or generous. To compare the differences between the two interfaces for the participants, four variables were measured. In addition, two parameters, holding time and average retrieval time, are recorded and used to evaluate and compare the impact of the two interfaces on user behavior. The generous interface, according to the findings, is more effective than the traditional interface in increasing students' engagement with cultural heritage and is perceived to be more aesthetically pleasing. In addition, the generous interface has been proven to be more suitable for casual leisure users to explore cultural heritage information. This research provides the basis and reference for developing the web interface of cultural heritage collections.

Qiang Li (Shenyang Aerospace University, China)
Peng Wang (Shenyang Aerospace University, China)
Zexue Liu (Shenyang Aerospace University, China)
Haoyan Zhang (Shenyang Aerospace University, China)
Yujie Song (Shenyang Aerospace University, China)
Yuxin Zhang (Shenyang Aerospace University, China)
Using scaffolding theory in serious games to enhance traditional Chinese murals culture learning

ABSTRACT. This study explores a game model that uses scaffolding to learn about cultural heritage based on adventure games. A case study using traditional Chinese murals tested the effectiveness of serious games in improving learning performance and knowledge acquisition. This study observed and evaluated the learning outcomes of 64 students by using serious game learning compared to traditional video learning in an experimental setting. Changes in their knowledge acquisition, intrinsic motivation, cognitive load (extrinsic load vs. germane load), and engagement were collected through a series of tests and scales. Experimental results show that digital adventure games have better learning performance and knowledge retention effects, higher intrinsic motivation, germane load and engagement than traditional video learning. The reasons affecting academic performance were analyzed from the data, and it was found that intrinsic motivation and germane cognitive load had a positive effect on game performance, and external cognitive load had a negative effect on game performance. This study provides experience on the design of serious games in cultural heritage learning.

Haoxiang Wang (Beijing Jiaotong University, China)
Xiaoping Che (Beijing Jiaotong University, China)
Enyao Chang (Beijing Jiaotong University, China)
Chenxin Qu (Beijing Jiaotong University, China)
Yao Luo (Beijing Jiaotong University, China)
Zhenlin Wei (Beijing Jiaotong University, China)
How to Set Safety Boundary in Virtual Reality: A Dynamic Approach based on User Motion Prediction
PRESENTER: Enyao Chang

ABSTRACT. Virtual Reality (VR) interaction safety is a prerequisite for all user activities in the virtual environment. While seeking a deep sense of immersion with little concern about surrounding obstacles, users may have limited ability to perceive the real-world space, resulting in possible collisions with real-world objects. Existing work and rendering techniques such as Chaperone can provide safety boundaries to users, but they confine users to a small static space and lack immediacy. This paper proposes the SCARF framework, which provides dynamic user motion detection and prediction in Virtual Reality. We study the relationship between user characteristics, human motion, and categories of VR tasks, and provide an approach that uses biomechanical analysis to define the interaction space in VR dynamically. We report on a user study with 58 volunteers and establish a three-dimensional kinematic dataset from a VR game. A motion segmentation algorithm is proposed to extract motion features: slashing amplitude, slashing speed, and range of slashing space. We adopt the rule-based machine learning model RIPPER and explore human motion relations in the form of rules. Furthermore, few-shot learning models are introduced to the field of human motion analysis in VR. The experiments validate that our few-shot learning model is effective and can improve the performance of motion prediction. Finally, we implement SCARF in a VR environment for dynamic safety boundary adjustment.

11:00-12:30 Session TVCJ2-IP1

Zoom Link:        Meeting ID: 885 3670 3032, Password: cgi2023

Weiliang Meng (LIAMA - NLPR, CAS Institute of Automation, China)
Zhihao Ma (Institute of Automation, Chinese Academy of Sciences, China)
Wei Li (Institute of Automation, Chinese Academy of Sciences, China)
Muyang Zhang (Institute of Automation, Chinese Academy of Sciences, China)
Weiliang Meng (MAIS, CAS Institute of Automation, China)
Shibiao Xu (Institute of Automation, Chinese Academy of Sciences, China)
Xiaopeng Zhang (Institute of Automation, Chinese Academy of Sciences, China)
HTCViT: An effective network for image classification and segmentation based on natural disaster datasets

ABSTRACT. Classifying and segmenting natural disaster images is crucial for predicting and responding to disasters. However, current convolutional networks perform poorly in processing natural disaster images, and there are few networks designed specifically for this task. To address the varying scales of the region of interest (ROI) in these images, we propose the Hierarchical TSAM-CB-ViT (HTCViT) network, which builds on the ViT network's attention mechanism to better process natural disaster images. Considering that ViT excels at extracting global context but struggles with local features, our method combines the strengths of ViT and convolution, and can capture overall contextual information within each patch using the Triple-Strip Attention Mechanism (TSAM) structure. Experiments validate that our HTCViT improves classification by 3-4% and segmentation by 1-2% on natural disaster datasets compared to the vanilla ViT network.

Kun Hu (Tsinghua University, China)
Zhaoyangfan Huang (Tsinghua University, China)
Xiaochao Wang (School of Mathematical Sciences, Tiangong University, China)
Xingjun Wang (Tsinghua University, China)
StegaEdge: Learning Edge-Guided Steganography

ABSTRACT. Steganography is critical in traceability, authentication, and secret delivery for multimedia. In this paper, we propose a novel image steganography framework, named StegaEdge, that learns an edge-guided network to simultaneously address three challenges: capacity, multi-task, and invisibility. First, we use an up-sampling strategy to expand the embedding space and thus increase the capacity of the embedded message. Second, our algorithm improves the way messages are embedded so that it can handle different messages embedded in the same image and achieve complete split-task recovery. Different information can be embedded in one cover image without affecting each other. Third, we propose an edge-guided strategy to solve the problem of poor invisibility in smooth regions. The human eye is significantly less perceptive of intensity changes in edges than in smooth areas. Unlike traditional steganography methods, our edge-guided steganography can appropriately embed part of the information into non-edge regions when the amount of embedded information is too large. Experimental results on the COCO, Div2K, and Mirflickr datasets show that the newly proposed StegaEdge algorithm achieves satisfactory results in terms of capacity, multi-task, imperceptibility, and security compared to state-of-the-art algorithms.

Yuankang Chen (University of Electronic Science and Technology of China, China)
Yifan Lu (University of Electronic Science and Technology of China, China)
Xiao Hua Zhang (Hiroshima Institute of Technology, Japan)
Ning Xie (University of Electronic Science and Technology of China, China)
Interactive Neural Cascade Denoising for 1-SPP Monte Carlo Images
PRESENTER: Yuankang Chen

ABSTRACT. Monte Carlo (MC) path tracing is known for its high fidelity and heavy computational cost. With the development of neural networks, the kernel-based post-processing method has succeeded in denoising noisy images under low sampling rates, but the complex network structure impedes its deployment in interactive applications. In this paper, we propose a lightweight cascaded network which progressively denoises 1-spp Monte Carlo images through both pixel and kernel prediction methods. A primary denoised image is generated by the pixel prediction network at the first stage, which is then fed to the kernel prediction network to obtain multi-resolution kernels. In addition, to take full advantage of the auxiliary buffers, we introduce a bilateral method during image reconstruction. Experimental results show that our approach achieves state-of-the-art denoising quality for 1-spp images at interactive frame rates.

Zhongmin Jiang (University of Shanghai for Science and Technology, China)
Wanyan Zhang (University of Shanghai for Science and Technology, China)
Wenju Wang (University of Shanghai for Science and Technology, China)
Fusiform Multiscale Pixel Self-attention Network for Hyperspectral Images Reconstruction from a Single RGB Image
PRESENTER: Wanyan Zhang

ABSTRACT. Current research on deep learning algorithms is directed to reconstruct hyperspectral images from a single RGB image. However, this does not consider the feature information between regions, so the feature capture of context is insufficient. This causes the quality of reconstructed hyperspectral images to be low. We propose correcting this with a fusiform multiscale pixel self-attention (FMPSA) network. The proposed FMPSA consists of a fusiform multiscale feature extraction (FMFE) module cascaded with several multiscale adaptive residual attention blocks (MARABs). FMFE extracts multiscale detail features by interleaving dual components to avoid degrading spectral reconstruction quality due to local and edge spatial information loss. Each MARAB consists of paired FMFE-Left and FMFE-Right components, an optimal non-local model, a pixel self-attention module, a LayerNorm layer, a multilayer perceptron with Gelu nonlinearity, and long-short dual residual connection, which can be regarded as a residual structure based on a pixel self-attention mechanism. MARAB can adaptively track regions containing feature-rich information for more accurate hyperspectral reconstruction with a hierarchical focus on the salient pixels. The proposed FMPSA was applied to the NTIRE 2020 hyperspectral dataset. Experimental results show that the proposed method outperforms current methods in terms of MRAE and RMSE.
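
For readers unfamiliar with the two metrics named at the end of this abstract, the following is a minimal sketch of how MRAE (mean relative absolute error) and RMSE (root mean square error) are commonly computed. It is not the authors' evaluation code; the flat-list input format and the small `eps` guard against division by zero are assumptions made for illustration.

```python
import math

def mrae(pred, target, eps=1e-8):
    """Mean relative absolute error over two equal-length sequences.

    eps is an illustrative guard against zero-valued targets (assumption).
    """
    return sum(abs(p - t) / (t + eps) for p, t in zip(pred, target)) / len(target)

def rmse(pred, target):
    """Root mean square error over two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))
```

Lower is better for both metrics; MRAE normalizes each error by the ground-truth value, so it weights errors in dark spectral bands more heavily than RMSE does.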

Taishi Ito (University of Tsukuba, Japan)
Yuki Endo (University of Tsukuba, Japan)
Yoshihiro Kanamori (University of Tsukuba, Japan)
Age-Dependent Face Diversification via Latent Space Analysis

ABSTRACT. Facial age transformation methods can change facial appearance according to the target age. However, most existing methods do not consider that people get older with different attribute changes (e.g., wrinkles, hair volume, and face shape) depending on their circumstances and environment. Diversifying such age-dependent attributes while preserving a person’s identity is crucial to broaden the applications of age transformation. In addition, the accuracy of age transformation to childhood is limited due to dataset bias. To solve these problems, we propose an age transformation method based on latent space analysis of StyleGAN. Our method obtains diverse age-transformed images by randomly manipulating age-dependent attributes in a latent space. To do so, we analyze the latent space and perturb channels affecting age-dependent attributes. We then optimize the perturbed latent code to refine the age and identity of the output image. We also present an unsupervised approach for improving age transformation to childhood. Our approach is based on the assumption that existing methods cannot sufficiently move a latent code toward a desired direction. We extrapolate an estimated latent path and iteratively update the latent code along the extrapolated path until the output image reaches the target age. Quantitative and qualitative comparisons with existing methods show that our method improves output diversity and preserves the target age and identity. We also show that our method can more accurately perform age transformation to childhood.
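
The extrapolation step described above can be illustrated with a toy sketch. It assumes a latent code is a plain list of floats, that an existing method has already moved the source code `w_src` to an estimate `w_est`, and that a hypothetical `estimate_age` callable returns the apparent age of the image generated from a code. None of these names come from the paper, and the identity-refinement step is omitted.

```python
def extrapolate_to_age(w_src, w_est, estimate_age, target_age,
                       step=0.25, max_iters=100):
    """Walk further along the direction w_src -> w_est until the
    estimated age is at or below target_age (transformation to childhood).

    estimate_age, step, and max_iters are hypothetical illustration choices.
    """
    direction = [b - a for a, b in zip(w_src, w_est)]
    w = list(w_est)
    for _ in range(max_iters):
        if estimate_age(w) <= target_age:
            break
        # Extrapolate: keep moving along the same estimated path.
        w = [wi + step * di for wi, di in zip(w, direction)]
    return w
```

In a real pipeline `estimate_age` would wrap a generator and an age estimator; here any callable on a list of floats works, which keeps the sketch self-contained.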

Sheng Wang (School of Software, Nanchang University, China)
Qi Wang (School of Mathematics and Computer Science, Nanchang University, China)
Weidong Min (School of Mathematics and Computer Science, Nanchang University; Institute of Metaverse, Nanchang University, China)
Qing Han (School of Mathematics and Computer Science, Nanchang University, China)
Di Gai (School of Mathematics and Computer Science, Nanchang University, China)
Haowen Luo (Medical Big-Data Center, the Second Affiliated Hospital of Nanchang University, China)
Trade-off Background Joint Learning for Unsupervised Vehicle Re-identification

ABSTRACT. Existing vehicle re-identification (Re-ID) methods either extract valuable background information to enhance the robustness of the vehicle model or segment background interference information to learn vehicle fine-grained information. However, these methods do not consider the background information as a trade-off attribute to unite valuable background and background interference. This work proposes the trade-off background joint learning method for unsupervised vehicle Re-ID, which consists of two branches, to exploit the ambivalence of background information. In the global branch, a background focus of the pyramid global branch module is proposed to optimize the sample feature space. The designed pyramid background-aware attention extracts background-related features from the global image and constructs a two-fold confidence metric based on background-related and identity-related confidence scores to obtain robust clustering results during the clustering. In the local branch, a background filtering of the local branch module is proposed to alleviate the background interference. First, the background of each local region is segmented and weakened. Then, a background adaptive local label smoothing is designed to reduce noise in every local region. Comprehensive experiments on VeRi-776 and VeRi-Wild are conducted to validate the performance of the proposed balanced background information method. Experimental results show that the proposed method outperforms the state-of-the-art.

12:30-13:30 Lunch Break
13:30-15:30 Session TVCJ7-IP2

Zoom Link:        Meeting ID: 885 3670 3032, Password: cgi2023

Ying Song (Zhejiang Sci-Tech University, China)
Zhi Wei Zhang (Beijing Forestry University, China)
Han Wang (Beijing Forestry University, China)
A Convolutional Neural Network based Blind Robust Image Watermarking Approach Exploiting the Frequency Domain

ABSTRACT. Image watermarking embeds information in the image that is visually imperceptible and can be recovered even if the image is modified or attacked during distribution, thus protecting the image copyright. Current image watermarking methods make the learned model resistant to attacks by simulating specific attacks but lack robustness to unspecified attacks. In this paper, we propose to hide the information in the frequency domain. To control the distribution and intensity of watermarking information, we introduce a channel weighting module (CWM) based on modified Gaussian distribution. In the spatial domain, we design a spatial weighting module (SWM) to improve the watermarking visual quality. Moreover, a channel attention enhancement module (CAEM) designed in the frequency domain senses the distribution of watermarking information and enhances the frequency domain channel signals to improve the watermarking robustness. Abundant experimental results show that our method guarantees high image visual quality and high watermarking capacity. The generated watermarking images can robustly resist unspecified attacks such as noise, crop, blur, color transform, JPEG compression, and screen-shooting.

Lixiang Lin (ZheJiang University, China)
Jianke Zhu (ZheJiang University, China)
Topology-Preserved Human Reconstruction with Details
PRESENTER: Lixiang Lin

ABSTRACT. Due to the high diversity and complexity of body shapes, it is challenging to directly estimate human geometry from a single image with various clothing styles. Most model-based approaches are limited to predicting the shape and pose of a minimally clothed body with an over-smoothed surface. Model-free methods capture finely detailed geometry but lack a fixed mesh topology. To address these issues, we propose a novel topology-preserved human reconstruction approach by bridging the gap between model-based and model-free human reconstruction. We present an end-to-end neural network that simultaneously predicts the pixel-aligned implicit surface and an explicit mesh model built by a graph convolutional neural network. Experiments on DeepHuman and our collected dataset show that our approach is effective. The code will be made publicly available.

Wang Xuechun (Beijing Normal University, China)
Chao Wentao (Beijing Normal University, China)
Wang Liang (Beijing University of Technology, China)
Duan Fuqing (Beijing Normal University, China)
Light Field Depth Estimation Using Occlusion-aware Consistency Analysis
PRESENTER: Wang Xuechun

ABSTRACT. Occlusion modeling is critical for light field depth estimation, since occlusion destroys the photo-consistency assumption which most depth estimation methods hold. Previous works always detect the occlusion points on the basis of Canny detector, which can leave some occlusion points out. Occlusion handling, especially for multi-occluder occlusion, is still challenging. In this paper, we propose a novel occlusion-aware depth estimation method which can better solve the occlusion problem. We design two novel consistency costs based on the photo-consistency for depth estimation. According to the consistency costs, we analyze the influence of the occlusion and propose an occlusion detection technique based on depth consistency, which can detect the occlusion points more accurately. For the occlusion point, we adopt a new data cost to select the un-occluded views, which are used to determine the depth. Experimental results demonstrate the proposed method is superior to the other compared algorithms, especially in multi-occluder occlusions.

Shiyin Du (Zhejiang Sci-Tech University, China)
Ying Song (Zhejiang Sci-Tech University, China)
Multi-Exemplar Guided Image Weathering via Texture Synthesis

ABSTRACT. We propose a novel method for generating gradually varying weathering effects from a single image. Time-variant weathering effects tend to appear simultaneously on one object. Compared to previous methods, our method is able to obtain gradually changing weathering effects through simple interactions while keeping texture variations and shading details. We first classify the weathering regions into several stages based on a weathering degree map extracted from the image. For each weathering stage, we automatically extract the corresponding weathering sample, from which a texture image is subsequently synthesized. Then we generate weathering effects by fusing the different textures according to the weathering degree of the image pixels. Finally, in order to maintain the intrinsic shape details of the object during the fusing process, we utilize a new shading-preserving method that takes the weathering degrees into account. Experiments show that our method is able to produce visually realistic and time-variant weathering effects interactively.
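
The idea of fusing per-stage textures by weathering degree can be sketched for a single pixel as follows. This is only an illustration under simple assumptions (a weathering degree normalized to [0, 1], evenly spaced stages, and linear blending between adjacent stages); the abstract does not specify the actual fusion rule, and the function name is hypothetical.

```python
def fuse_textures(stage_values, degree):
    """Blend one pixel's per-stage texture values by its weathering degree.

    stage_values: values of the synthesized textures at this pixel,
                  ordered from least to most weathered (assumption).
    degree:       weathering degree, clamped to [0, 1].
    """
    n = len(stage_values)
    if n == 1:
        return stage_values[0]
    pos = min(max(degree, 0.0), 1.0) * (n - 1)  # position along the stages
    lo = min(int(pos), n - 2)                   # lower neighbouring stage
    t = pos - lo                                # blend weight toward upper stage
    return (1 - t) * stage_values[lo] + t * stage_values[lo + 1]
```

Applying the function to every pixel and colour channel gives a smooth transition between stages rather than hard region boundaries.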

Jiayang Li (College of Intelligence and Computing, Tianjin University, China)
Chongke Bi (College of Intelligence and Computing, Tianjin University, China)
Visual Analysis of Air Pollution Spatio-temporal Patterns

ABSTRACT. Advances in air monitoring methods have made it possible to reanalyze large-scale air pollution phenomena. Mining potential air pollution information from large-scale air pollution data is an important issue in the current environmental field. Although direct data visualization provides an intuitive presentation, it is less applicable to long time spans and high temporal resolution. To better meet the analysis needs of domain experts, we design a visual analysis framework, PTSVis, with friendly multi-view interactions and novel visual view designs, which can explore the spatio-temporal dynamics of multiple pollution data. To extract possible pollutant transport patterns at a macro level, we propose a two-stage cluster analysis method that extracts transport patterns from large-scale pollutant transport trajectories, which will substantially help domain experts make relevant decisions. To mine the potential information, an index is constructed from long time series data at the grid points of a specific transport trajectory or transport pattern, which helps experts complete sketch matching at a custom time resolution and identify important potential time-varying features in air pollution data. Finally, we verify the validity of the framework through spatial and temporal case analyses of pollutant data.

Feiyu Xue (College of Information Engineering, Northwest A&F University, China)
Min Zhou (College of Information Engineering, Northwest A&F University, China)
Yahui Shao (College of Information Engineering, Northwest A&F University, China)
Chengjie Zhang (College of Information Engineering, Northwest A&F University, China)
Yongping Wei (Northwest A&F University, China)
Meili Wang (College of Information Engineering, Northwest A&F University, China)
RT-SwinIR: An Improved Digital Wallchart Image Super-Resolution with Attention-based Learned Text Loss

ABSTRACT. In recent years, image super-resolution (SR) has made remarkable progress in areas such as natural images or text images. However, in the field of digital wallchart image super-resolution, existing methods have failed to preserve the finer details of text regions while restoring graphics. To address this challenge, we present a new model called Real Text-SwinIR (RT-SwinIR), which employs a novel plug-and-play Attention-based Learned Text Loss (LTL) technique to enhance the architecture's ability to render clear text structure while preserving the clarity of graphics. To evaluate the effectiveness of our method, we have collected a dataset of digital wallcharts and subjected them to a two-order degradation process that simulates real-world damage, including creases and stains on wallcharts, as well as noise and blurriness caused by compression during computer network transmission. On the proposed dataset, RT-SwinIR achieves the best 0.58 on Learned Text Loss and 0.11 on LPIPS, reduced by an average of 45.3% and 36.6%, respectively. Experiments have shown that our method outperforms prior works in digital wallchart image super-resolution, indicating its superior visual perceptual performance.

Yinghua Liu (Shenzhen University, China)
Chengze Li (Caritas Institute of Higher Education, Hong Kong)
Xueting Liu (Shenzhen University, China)
Huisi Wu (Shenzhen University, China)
Zhenkun Wen (Shenzhen University, China)
AddCR: A Data-driven Cartoon Remastering
PRESENTER: Yinghua Liu

ABSTRACT. Old cartoon classics have the lasting power to strike the resonance and fantasies of audiences today. However, cartoon animations from earlier years suffered from noise, low resolution, and dull lackluster color due to the improper storage environment of the film materials and limitations in the manufacturing process. In this work, we propose a deep learning-based cartoon remastering application that investigates and integrates noise removal, super-resolution, and color enhancement to improve the presentation of old cartoon animations on displays. We employ multi-task learning methods in the denoising part and color enhancement part individually to guide the model to focus on the structure lines so that the generated image retains the sharpness and color of the structure lines. We evaluate existing super-resolution methods for cartoon inputs and find the best one that can guarantee the sharpness of the structure lines and maintain the texture of images. Moreover, we propose a reference-free color enhancement method that leverages a pre-trained classifier for old and new cartoons to guide color mapping.

Wei Zhang (Dalian University, China)
Wanshu Fan (Dalian University, China)
Xin Yang (Dalian University of Technology, China)
Qiang Zhang (Dalian University, China)
Dongsheng Zhou (Dalian University, China)
Lightweight Single-Image Super-Resolution via Multi-scale Feature Fusion CNN and Multiple Attention Block

ABSTRACT. In recent years, single-image super-resolution (SISR) has made tremendous progress with the development of deep learning. However, the majority of deep learning-based SISR methods focus on building more complex networks, which inevitably leads to high computational and memory costs. Thus, these methods may fail to be applied in real-world scenarios. To solve this problem, this paper proposes a lightweight convolutional network combined with a Transformer for SISR, named MMSR. Specifically, an efficient convolutional neural network (CNN) based on multi-scale feature fusion, called MFF-CNN, is designed for local feature extraction. In addition, we propose a simple and efficient multiple attention block (MAB) to further utilize the context information in features. MAB incorporates channel attention and a Transformer to help the network capture similar features across long-range dependencies, making full use of global information to further refine texture details. Finally, this paper provides comprehensive results for different settings of the entire network. Experimental results on commonly used datasets demonstrate that the proposed method achieves better performance at the ×2, ×3 and ×4 scales than other state-of-the-art lightweight methods.

13:30-15:30 Session VCIBA

Zoom Link:        Meeting ID: 849 0583 5505, Password: cgi2023

Fuchang Liu (Hangzhou Normal University, China)
Fuchang Liu (Hangzhou Normal University, China)
Shen Zhang (Hangzhou Normal University, China)
Hao Wang (Hangzhou Normal University, China)
Caiping Yan (Hangzhou Normal University, China)
Yongwei Miao (Hangzhou Normal University, China)
Local Imperceptible Adversarial Attacks against Human Pose Estimation Networks
PRESENTER: Fuchang Liu

ABSTRACT. Deep neural networks are vulnerable to attacks from adversarial inputs. Attacks on human pose estimation remain largely unexplored, especially for body joint detection. It is not straightforward to transfer classification-based attack methods to body joint regression tasks. Another issue is that attack effectiveness and imperceptibility conflict with each other. To solve these issues, we propose local imperceptible attacks on human pose estimation networks. Specifically, we reformulate imperceptible attacks on body joint regression as a constrained maximum allowable attack and approximate the solution by iterative gradient-based strength refinement and greedy pixel selection. Our method crafts effective perceptual adversarial attacks, which take both human perception and attack effectiveness into consideration. We conduct a series of imperceptible attacks against state-of-the-art human pose estimation methods, including HigherHRNet, DEKR, and ViTPose. Experimental results demonstrate that the proposed method achieves excellent imperceptibility while maintaining attack effectiveness by significantly reducing the number of perturbed pixels: perturbing only about 4% of pixels is sufficient to attack human pose estimation.
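
To give a rough feel for what restricting an attack to a small fraction of pixels means, the sketch below picks the pixels with the largest gradient magnitude in one shot. This is a deliberate simplification and not the authors' procedure, which combines iterative strength refinement with greedy selection; the function name, the flat-list input, and the default fraction (taken from the roughly 4% figure quoted above) are illustrative assumptions.

```python
def select_pixels(grad_magnitudes, fraction=0.04):
    """Return the indices of the pixels with the largest gradient magnitude.

    grad_magnitudes: flat list with one non-negative value per pixel.
    fraction:        share of pixels allowed to be perturbed (assumption).
    """
    k = max(1, int(len(grad_magnitudes) * fraction))
    ranked = sorted(range(len(grad_magnitudes)),
                    key=lambda i: grad_magnitudes[i], reverse=True)
    return sorted(ranked[:k])
```

A perturbation would then be applied only at the returned indices, leaving all other pixels untouched.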

Feilong Chen (Huawei Cloud, China)
Zhiyu Wang (Huawei Cloud, China)
Jiali Xu (ShanDong Energy Group Co., Ltd., China)
Liquan Hu (ShanDong Energy Group Co., Ltd., China)
Huaixuan Cao (Yunding Technology Co., Ltd., China)
Jianlong Chang (Huawei Cloud, China)
Make "V" and "Q" Inseparable: Deliberately Dual-Channel Adversarial Learning for Robust Visual Question Answering

ABSTRACT. In visual question answering (VQA), vision-language bias often causes poor model performance because today’s VQA models tend to capture superficial correlations in the training set and fail to sufficiently learn the multi-modal knowledge from both vision and language. Several recent works try to alleviate this problem by weakening the language prior. In this paper, we propose a novel Deliberately Dual-Channel Adversarial Learning to Make "V" and "Q" Inseparable, named DCAL, which aims to weaken the priors from both vision and language. Specifically, DCAL introduces in-batch random negative sampling to force the model to be wrong when given the wrong questions or images. DCAL maximizes the probability of the original question-image pairs producing ground-truth answers and minimizes the probability of random negative samples producing ground-truth answers. To solve the problem of false negatives, DCAL exploits a deliberate strategy to utilize the sampled question-image pairs. Experiments demonstrate that our proposed Deliberately Adversarial Learning framework 1) is general to various VQA backbones and fusion strategies, and 2) improves the performance of existing robust VQA models on the sensitive VQA-CP dataset while performing robustly on the balanced VQA v2 dataset.
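
The in-batch negative sampling and the two-sided objective described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function names, the log-likelihood form of the loss, and the `eps` guard are assumptions, and the strategy for handling false negatives is not modeled.

```python
import math
import random

def in_batch_negatives(batch_size, rng=random):
    """For each sample i, pick the index j != i of another sample in the
    same batch whose image (or question) serves as a random negative."""
    if batch_size < 2:
        raise ValueError("need at least two samples to draw negatives")
    return [rng.choice([j for j in range(batch_size) if j != i])
            for i in range(batch_size)]

def dual_channel_loss(p_pos, p_neg_image, p_neg_question, eps=1e-12):
    """Reward a high ground-truth probability for the original pair and a
    low one when the image or the question has been swapped out.

    Each argument is the model's probability of the ground-truth answer.
    """
    return -(math.log(p_pos + eps)
             + math.log(1.0 - p_neg_image + eps)
             + math.log(1.0 - p_neg_question + eps))
```

Under this form the loss falls as the model becomes confident on true pairs and unconfident on mismatched ones, which is the behaviour the abstract describes for both the vision and the language channel.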

Dashun Zheng (Faculty of Applied Sciences, Macao Polytechnic University, Macao)
Rongsheng Wang (Faculty of Applied Sciences, Macao Polytechnic University, Macao)
Yaofei Duan (Faculty of Applied Sciences, Macao Polytechnic University, Macao)
Patrick Cheong-Iao Pang (Faculty of Applied Sciences, Macao Polytechnic University, Macao)
Tao Tan (Faculty of Applied Sciences, Macao Polytechnic University, Macao)
Focus-RCNet: A lightweight recyclable waste classification algorithm based on Focus and knowledge distillation
PRESENTER: Dashun Zheng

ABSTRACT. Waste pollution is one of the most important environmental problems in the modern world. With the continuous improvement of living standards and the increasing richness of the consumption structure, the amount of domestic waste generated has increased dramatically, and there is an urgent need for better waste treatment. The rapid development of artificial intelligence provides an effective solution for automated waste classification. However, the large computational cost and high complexity of convolutional neural networks (CNNs) make them unsuitable for real-time embedded applications. In this paper, we propose a lightweight network architecture, Focus-RCNet, designed with reference to the sandglass structure of MobileNetV2, which uses depthwise separable convolution to extract features from images. The Focus module is introduced into the field of recyclable waste image classification to reduce the dimensionality of features while retaining relevant information. To make the model focus more on waste image features while keeping the number of parameters and the computational cost small, we introduce the SimAM attention mechanism. Additionally, knowledge distillation is used to further compress the number of parameters in the model. Trained and tested on the TrashNet dataset, the Focus-RCNet model not only achieves an accuracy of 92% but is also well suited to mobile deployment.
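
Why separable convolutions make a network lightweight can be seen from a short parameter count. The sketch below assumes the abstract refers to the standard depthwise separable convolution used in the MobileNet family (a per-channel k×k filter followed by a 1×1 pointwise convolution) and ignores bias terms; the layer sizes in the usage note are arbitrary examples, not values from the paper.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise
    convolution (bias ignored)."""
    return k * k * c_in + c_in * c_out
```

For a 3×3 layer mapping 32 channels to 64, the standard convolution needs 18432 weights and the separable version 2336, roughly an eightfold reduction.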

Chenxu Zhao (University of Electronic Science and Technology of China, China)
Zhenjiang Du (University of Electronic Science and Technology of China, China)
Ning Xie (University of Electronic Science and Technology of China, China)
Guan Wang (Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, China)
Xiaohua Zhang (Hiroshima Institute of Technology, Japan)
Deformation-aware Shape Retrieval via Post-deformation Embedding Prediction

ABSTRACT. Recently, a new task called Deformation-aware retrieval was proposed, which aims to find a shape from a database that matches the given query best after deformation. However, previous methods either lack consideration of the deformation process during retrieval or involve time-consuming calculations. In addition, none of these methods exploit shapes’ local features, which may lead to suboptimal solutions. We propose a novel neural Deformation-Retrieval Network (DRNET) to map the 3D shape to a retrieval space and estimate the difference between a deformed source model and the target by predicting the embedding of the deformed source in this space. To implement our retrieval system based on shapes’ local features, we introduce cross-attention into the network to learn the correspondence between the local features of a pair of shapes, which improves the performance of retrieval. Experimental results on ShapeNet show that our network outperforms the state-of-the-art methods.

Dingchang Wu (School of Artificial Intelligence and Computer Science, Jiangnan University, China)
Yinghui Wang (School of Artificial Intelligence and Computer Science, Jiangnan University, China)
Haomiao Ma (School of Computer Science, Shaanxi Normal University, China)
Lingyu Ai (School of Internet of Things Engineering, Jiangnan University, China)
Jinlong Yang (School of Artificial Intelligence and Computer Science, Jiangnan University, China)
Shaojie Zhang (School of Artificial Intelligence and Computer Science, Jiangnan University, China)
Wei Li (School of Artificial Intelligence and Computer Science, Jiangnan University, China)
An Adaptive Feature Extraction Method for Capsule Endoscopy Images
PRESENTER: Dingchang Wu

ABSTRACT. The traditional feature detector Oriented FAST and Rotated BRIEF (ORB) detects image features with a fixed threshold, and its descriptors do not effectively distinguish the features of capsule endoscopy images. Therefore, a new feature detector with a new way of setting thresholds, called Adaptive Threshold Fast and FREAK in Capsule Endoscopy Image (AFFCEI), is proposed. First, the method constructs an image pyramid and calculates pixel thresholds based on the contrast of the gray values of all pixels in the local neighborhood, achieving adaptive feature extraction in each layer of the pyramid. The features are then described by the FREAK descriptor, which enhances the discrimination of the feature description of stomach images. Finally, refined matches are obtained based on Hamming distance with the Grid Motion Statistics (GMS) algorithm, and the RANSAC algorithm is used to reject mismatches. Compared with the ASIFT method, which performed best in the past, the average running time of this method is 4/5 of that of ASIFT, and the average matching score is improved by 5% when tracking features in a moving capsule endoscope.
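
Binary descriptors such as FREAK are compared by Hamming distance, which the matching stage above relies on. The following minimal sketch shows brute-force nearest-neighbour matching under that distance. Descriptors are represented as Python integers for brevity, and the function names and the `max_dist` cutoff are illustrative assumptions; the GMS and RANSAC filtering steps are not shown.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors (as ints)."""
    return bin(a ^ b).count("1")

def match_descriptors(desc_a, desc_b, max_dist):
    """For each descriptor in desc_a, find its nearest neighbour in desc_b
    (desc_b must be non-empty) and keep the pair if it is close enough.

    Returns a list of (index_a, index_b, distance) tuples.
    """
    matches = []
    for i, da in enumerate(desc_a):
        j, d = min(((j, hamming(da, db)) for j, db in enumerate(desc_b)),
                   key=lambda pair: pair[1])
        if d <= max_dist:
            matches.append((i, j, d))
    return matches
```

In practice the candidate matches produced this way would still contain outliers, which is why the abstract follows the distance-based step with GMS and RANSAC.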

Xian Li (Wuhan Textile University, China)
Li Li (Wuhan Textile University, China)
Hao Chen (Wuhan Textile University, China)
Shengyi Guo (Wuhan Textile University, China)
Ping Zhu (Wuhan Textile University, China)
Tao Peng (Wuhan Textile University, China)
Xiong Pan (Wuhan Textile University, China)
Highlight removal using an improved unsupervised CycleGAN network in combination with differentiable renderer

ABSTRACT. Highlights in images of specular objects can significantly reduce the accuracy of vision tasks, making highlight removal an important research topic in computer vision. It is challenging to obtain paired image datasets with and without highlights for supervised learning of highlight removal. In this study, a paired synthetic image dataset with known viewing angle information was first produced by using a differentiable neural renderer combined with an illumination-controllable specular reflection-diffuse reflection method. An improved unsupervised CycleGAN network was then proposed to overcome the shortcomings of the traditional image style transfer network. In the proposed network, images are decomposed into foreground and background by aligning them under geometric consistency constraints using only a small batch of background images, and image style transfer is conducted only on the foreground of the input images. Experimental results show that our proposed method achieves significantly higher SSIM and PSNR values than other highlight removal methods.
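
Of the two metrics named in this abstract, PSNR is simple enough to state in a few lines. The sketch below uses the usual definition based on mean squared error. It assumes images are given as flat, equal-length sequences of pixel values with a peak value of 255, and it is not taken from the authors' evaluation code.

```python
import math

def psnr(img_a, img_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate that the highlight-free output is closer to the reference image.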

Chi Weng Ma (Shanghai Jiao Tong University, China)
Ruien Shen (Shanghai Jiao Tong University, China)
Deli Dong (Shanghai Jiao Tong University, China)
Shuangjiu Xiao (Shanghai Jiao Tong University, China)
Botanical Tree Reconstruction from a Single Image via 3D GAN-based Skeletonization

ABSTRACT. 3D botanical tree reconstruction from single images plays a vital role in computer graphics. However, accurately capturing the intricate branching patterns and detailed morphology of trees remains a challenging task. In this paper, we propose a novel approach for single-image tree reconstruction using a conditional generative adversarial network to infer the 3D skeleton of a tree in the form of a 2D skeleton depth map from the image. Based on the 2D skeleton depth map, its corresponding branching structure (3D skeleton) that inherits the tree shape in the input image and leaves can be generated using procedural modeling technique. Our proposed approach for generating lifelike 3D tree models from a single image with no user input showcases our proficiency in achieving efficient and reliable reconstruction. The outcomes showcase our capability to faithfully recreate complex tree architectures while capturing their visual authenticity.

Chuanyu Pan (University of California Berkeley, United States)
Guowei Yang (Tsinghua University, China)
Generating Animatable 3D Cartoon Faces from Single Portraits
PRESENTER: Chuanyu Pan

ABSTRACT. With the booming of virtual reality (VR) technology, there's a growing need for customized 3D avatars. However, traditional methods for 3D avatar modeling are either time-consuming or fail to retain similarity to the person to be modeled. We present a novel framework to generate animatable 3D cartoon faces from a single portrait image. We first transfer an input real-world portrait to a stylized cartoon image with a StyleGAN. Then we propose a two-stage reconstruction method to recover the 3D cartoon face with detailed texture. Our two-stage strategy first makes a coarse estimation based on template models, and then refines the model by non-rigid deformation under landmark supervision. Finally, we propose a semantic preserving face rigging method based on manually created templates and deformation transfer. Compared with prior arts, qualitative and quantitative results show that our method achieves better accuracy, aesthetics, and similarity criteria. Furthermore, we demonstrate the capability of real-time facial animation of our 3D model.

15:30-16:00 Coffee Break
16:00-18:00 Session TVCJ8&CAVW3

Zoom Link:        Meeting ID: 885 3670 3032, Password: cgi2023

Shu Liu (Central South University, China)
Xiyu Bao (Shandong University, China)
Yulong Bian (Shandong University, China)
Meng Qi (Shandong Normal University, China)
Yu Wang (Shandong University, China)
Ran Liu (Shandong Normal University, China)
Wei Gai (Shandong University, China)
Juan Liu (Shandong University, China)
Hongqiu Luan (Shandong University, China)
Chenglei Yang (Shandong University, China)
A Toolkit for Automatically Generating and Modifying VR Hierarchy Tile Menus

ABSTRACT. Current VR/AR system development studios lack a toolkit to automatically generate hierarchical tile menu layouts and menu prototypes for VR/AR devices without the need for user programming. This paper proposes a toolkit that automatically generates a hierarchy tile menu layout via a modified circular treemap algorithm and allows users to interactively arrange and resize tiles to form various layouts with their preferences or needs via a circle packer method and then automatically generates a VR/AR menu prototype based on the outputted layout. Moreover, reprogramming is also not required each time when the hierarchy is modified, or the menu layout is redesigned. The user test shows that the proposed toolkit simplifies the creation of hierarchy tile menu layouts, improves the creation efficiency of users, and allows users to flexibly create hierarchical tile menu prototypes based on their design idea.

Dian Zhou (Tianjin University, China)
Shiguang Liu (Tianjin University, China)
Qing Xu (Tianjin University, China)
Music Conditioned 2D Hand Gesture Dance Generation with HSG

ABSTRACT. In recent years, the short video industry has been booming. However, there are still many difficulties in generating actions for virtual characters. We observed that on short video social platforms, ‘gesture dance’ is a very popular short video form. However, its development is limited by the professional skill that choreography requires. To solve these problems, we propose an intelligent choreography framework, which can generate new gesture sequences for unseen audio based on paired data in the database. Our framework adopts a multimodal method and obtains excellent results. In addition, we collected and produced the first and largest pair-labeled hand gesture dance dataset. Various experiments show that our method not only generates smooth and rich action sequences but also captures some of the semantic information contained in the audio.

Libo Sun (Southeast University, China)
Rui Tian (Southeast University, China)
Wenhu Qin (Southeast University, China)
Physical based Motion Reconstruction from Videos using Musculoskeletal Model

ABSTRACT. The field of data-driven character animation has predominantly relied on joint-torque-driven research methods that utilize motion capture data as the primary data source. However, other driven approaches leveraging video as a data source have largely been overlooked. To address this shortcoming, we propose a novel method that combines human pose estimation and physical simulation of character animation based on prior knowledge of biomechanics and reinforcement learning. Our approach allows characters to directly learn from the actor's skills captured in video and subsequently reconstruct the movements with high fidelity in a physically simulated environment. First, we model the character based on the human musculoskeletal system and build a complete dynamics model of the proposed system using the Lagrange equations of motion. Next, we employ the pose estimation method to process the input video and generate human reference motion from the estimated video. Finally, we design a hierarchical control framework comprising a trajectory tracking layer and a muscle control layer that work together to coordinate the output of the necessary muscle force by modulating the degree of muscle activation. The trajectory tracking layer aims to minimize the difference between the reference motion pose and the actual output pose, while the muscle control layer aims to minimize the difference between the target torque and the actual output muscle force. The two layers interact by passing parameters through a proportional differential controller until the desired learning objective is achieved. A series of complex experimental results demonstrate that our proposed method can learn to produce comparable high-quality movements from videos of varying degrees of complexity and remains stable in the presence of muscle contracture weakness perturbations.
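The "proportional differential controller" mentioned in the abstract is commonly called a proportional-derivative (PD) controller. As a rough illustration only, the sketch below shows a generic PD tracking rule driving a single joint toward a reference pose. It is not the authors' implementation: the function names, the gains `kp` and `kd`, the unit inertia, and the single-joint setup are all assumptions made for the example.

```python
def pd_torque(q_ref, q, qd_ref, qd, kp=300.0, kd=30.0):
    """Generic PD rule: torque from pose error and velocity error.

    kp and kd are hypothetical gains chosen only for this example.
    """
    return kp * (q_ref - q) + kd * (qd_ref - qd)


def simulate(q_ref=1.0, steps=2000, dt=0.001, inertia=1.0):
    """Drive a single joint (hypothetical unit inertia) toward q_ref."""
    q, qd = 0.0, 0.0
    for _ in range(steps):
        tau = pd_torque(q_ref, q, 0.0, qd)
        qdd = tau / inertia  # joint acceleration from the applied torque
        qd += qdd * dt       # semi-implicit Euler: update velocity first
        q += qd * dt         # then position, using the new velocity
    return q
```

In a hierarchical setup such as the one described, a torque of this kind would serve as the target that a lower-level muscle controller tries to reproduce through muscle activations.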

Xinru Wu (Zhejiang Sci-Tech University, China)
A novel jigsaw game with eye-tracking: a multi-model interaction based on psycholinguistics for ADHD therapeutic

ABSTRACT. Attention-deficit hyperactivity disorder (ADHD) causes impulsive or hyperactive behavior. People with ADHD might have emotional outbursts or trouble focusing. The desired therapeutic effect is difficult to achieve with simple psychotherapy, and diagnoses of ADHD in adults and children have kept increasing. However, e-health games may mitigate the limitations of traditional laboratory methods by providing a more ecologically relevant experience. In this paper, inspired by psycholinguistics, multi-modal interaction, and former studies of the Pupil-CR method, we have developed a narrative interactive game therapy with an eye-tracking device to alleviate manifestations of dyslexia and attention deficit disorder. Through the situated discourses designed into the game's interactive narration, ADHD patients are able to concentrate and reduce symptoms of anxiety and excessive impulsivity. To evaluate the efficacy of the designed game in improving attention performance in patients with ADHD, we conducted a parallel-group controlled trial in which volunteers (N=48) were randomly assigned 1:1:1 to either an eye-tracking jigsaw game or one of 2 control groups. The principal measure of interest was the average difference in the Attention Comparison Score (ACS) of the Test of Variables of Attention (TOVA) between the pre-intervention and post-intervention periods. Based on the significant differences observed between the experimental and control groups, we concluded that the designed eye-tracking jigsaw game could serve as a viable therapeutic intervention for children with ADHD.

Yu Liu (National University of Defense Technology, China)
Enquan Huang (Central South University, China)
Ziyu Zhou (Central South University, China)
Kexuan Wang (Central South University, China)
Shu Liu (Central South University, China)
3D Facial Attractiveness Prediction Based on Deep Feature Fusion

ABSTRACT. Facial attractiveness prediction is an important research topic in the computer vision community. It not only contributes to the development of interdisciplinary research in psychology and sociology, but also provides fundamental technical support for applications like aesthetic medicine and social media. With the advances in 3D data acquisition and feature representation, this paper aims to investigate facial attractiveness from deep learning and three-dimensional perspectives. The 3D faces are first processed to unwrap the texture images and refine the raw meshes. The feature extraction networks for texture, point cloud, and mesh are then carefully designed, considering the characteristics of the different types of data. A more discriminative face representation is derived by feature fusion for the final attractiveness prediction. During network training, a cyclical learning rate with an improved range test is introduced to alleviate the difficulty of hyperparameter setting. Extensive experiments are conducted on a 3D FAP benchmark, where the results demonstrate the significance of deep feature fusion and the enhanced learning rate in cooperatively improving performance. Specifically, the fusion of texture image and point cloud achieves the best overall prediction, with PC, MAE and RMSE of 0.7908, 0.4153 and 0.5231, respectively.
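For readers unfamiliar with the three reported scores, the following is a minimal sketch of how they are typically computed, assuming PC denotes the Pearson correlation coefficient between predicted and ground-truth ratings (the abstract does not expand the abbreviation). The function name `regression_metrics` is hypothetical and the code is not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (Pearson correlation, MAE, RMSE) for two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    # Pearson correlation: covariance normalised by both standard deviations.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pc = cov / math.sqrt(var_t * var_p)
    # Mean absolute error and root mean squared error of the predictions.
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return pc, mae, rmse
```

A higher PC and lower MAE and RMSE indicate better agreement with the human ratings.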

Yankong Zhang (Hefei University of Technology, China)
Yuetong Luo (Hefei University of Technology, China)
Yuhua Liu (Hangzhou Dianzi University, China)
Bo Zhou (Hefei University of Technology, China)
Xiaoping Liu (Hefei University of Technology, China)
CCET: Towards Customized Explanation of Clustering
PRESENTER: Yankong Zhang

ABSTRACT. Classical clustering algorithms use all features to partition a dataset, making it difficult for users to understand the clustering results. Some scholars have proposed interpretable clustering algorithms that use a few understandable features to explain clustering results. However, the existing algorithms can only generate one interpretation and fail to satisfy the diverse needs of different users. To address this challenge, this paper presents the Clustering Customized Explanation Tree (CCET), a visual analytics system that helps users modify existing explanations to obtain customized ones. First, a variety of views are designed to visualize the explanations and help users judge whether the existing explanations meet their requirements. Then, an explanation modification strategy based on splitting cluster centroids is proposed, making it easy for users to revise explanations according to their requirements. We demonstrate CCET using a case study and a user study. The results show that the system can deepen users' understanding of clustering results and make it easy for them to conduct further decision analysis.

Inwoo Ha (KAIST, SAIT (Samsung Advanced Institute of Technology), South Korea)
Hyun Sung Chang (SAIT (Samsung Advanced Institute of Technology), South Korea)
Minjung Son (SAIT (Samsung Advanced Institute of Technology), South Korea)
Sungeui Yoon (KAIST, South Korea)
Learning to Disentangle Latent Physical Factors of Deformable Faces

ABSTRACT. We proposed a monocular image disentanglement framework based on a compositional model. Our model disentangles the input image into albedo, depth, deformation, pose, and illumination. Instead of using any handcrafted priors, we guided our deep neural network to understand the physical meaning of each element by imitating real-world operations to reconstruct images in a self-supervised manner. Our model, trained on multi-frame images of each subject, gains a better understanding of the objects without any supervision or strong model assumption. We derived a deformation-free canonical space in which multi-frame images are aligned, so that information from all frames can be interpreted in the same space. Our experiments showed that our approach accurately disentangles the physical elements of deformable faces with wide variations from images in the wild.

Yi Dou (School of Computer and Information Science, Southwest University, Chongqing, China, China)
Xingling Liu (School of Mathematics and Statistics, Southwest University, Chongqing, China, China)
Min Zhou (Information Construction Office, Southwest University, Chongqing, China, China)
Jianjun Wang (School of Mathematics and Statistics, Southwest University, Chongqing, China, China)
Robust Principal Component Analysis via Weighted Nuclear Norm with Modified Second-Order Total Variation Regularization

ABSTRACT. The traditional robust principal component analysis (RPCA) model aims to decompose the original matrix into low-rank and sparse components and uses the nuclear norm to describe the low-rank prior information of the natural image. In addition to low-rankness, many recent studies have found that local smoothness is also a crucial prior in low-level vision. In this paper, we propose a new RPCA model based on the weighted nuclear norm and modified second-order total variation regularization (WMSTV-RPCA for short), which exploits both the global low-rankness and the local smoothness of the matrix. Extensive experimental results show, both qualitatively and quantitatively, that the proposed WMSTV-RPCA removes noise and models dynamic scenes more effectively than the competing methods.
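For context, the traditional RPCA decomposition referred to in the abstract is usually written as the convex program below, and a weighted nuclear norm is commonly defined as a weighted sum of singular values. These are the standard textbook forms, given here only as background; the symbols M, L, S, λ, w_i, and σ_i are notation assumed for this note, and the paper's exact objective, including its modified second-order total variation term, is not reproduced.

```latex
% Classical RPCA: split M into a low-rank part L and a sparse part S
\min_{L,\,S}\; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad M = L + S

% Weighted nuclear norm: singular values \sigma_i(L) with weights w_i \ge 0
\|L\|_{w,*} = \sum_{i} w_i \, \sigma_i(L)
```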

16:00-18:00 Session VI&JCST

Zoom Link:        Meeting ID: 849 0583 5505, Password: cgi2023

Chongke Bi (College of Intelligence and Computing, Tianjin University, China)
Younhyun Jung (Gachon University, South Korea)
Liang Yuan (Keio University, Japan)
Issei Fujishiro (Keio University, Japan)
Multiview SVBRDF Capture from Unified Shape and Illumination

ABSTRACT. This paper proposes a stable method for reconstructing spatially varying appearances (SVBRDFs) from multiview images captured under casual lighting conditions. Unlike flat surface capture methods, ours can be applied to surfaces with complex silhouettes. The proposed method takes multiview images as inputs and outputs a unified SVBRDF estimation. We generated a large-scale dataset containing the multiview images, SVBRDFs, and lighting appearance of vast synthetic objects to train a two-stream hierarchical U-Net for SVBRDF estimation that is integrated into a differentiable rendering network for surface appearance reconstruction. In comparison with state-of-the-art approaches, our method produces SVBRDFs with lower biases for more casually captured images.

Yujie Wang (Shandong University, China)
Xuelin Chen (Tencent AI Lab, China)
Baoquan Chen (Peking University, China)
SinGRAV: Learning a Generative Radiance Volume from a Single Natural Scene

ABSTRACT. We present SinGRAV – an attempt to learn a Generative RAdiance Volume from multi-view observations of a Single natural scene, in stark contrast to existing category-level 3D generative models that learn from images of many object-centric scenes. Inspired by SinGAN, we also learn the internal distribution of the input scene, which necessitates our key designs w.r.t. the scene representation and network architecture. Unlike popular MLP-based architectures, we specifically employ convolutional generators and discriminators, which inherently possess spatial locality bias, to operate over voxelized volumes for learning the internal distribution over a plethora of overlapping regions. On the other hand, localizing the adversarial generators and discriminators over confined areas with limited receptive fields easily leads to highly implausible geometric structures in the spatial domain. Our remedy is to use spatial inductive bias and joint discrimination on geometric clues in the form of 2D depth maps. We show that this is an effective and efficient way to improve the spatial arrangement. Experimental results demonstrate the ability of SinGRAV to generate plausible and diverse variations from a single scene, the merits of SinGRAV over state-of-the-art generative neural scene models, and the versatility of SinGRAV through its use in a variety of applications. Code and data will be released to facilitate further research.

Younhyun Jung (Gachon University, South Korea)
Jim Kong (School of Computer Science, the University of Sydney, Australia, Australia)
Jinman Kim (School of Computer Science, the University of Sydney, Australia, Australia)
A Transfer Function Design Using A Knowledge Database based on Deep Image and Primitive Intensity Profile Features Retrieval
PRESENTER: Younhyun Jung

ABSTRACT. Direct volume rendering (DVR) is a technique that visually emphasizes structures of interest (SOIs) within an image volume while simultaneously depicting adjacent regional information, e.g., the spatial location of a structure relative to its neighbors. In DVR, the transfer function (TF) plays a key role by enabling accurate interactive identification of SOIs and ensuring their appropriate visibility. TF generation typically involves non-intuitive trial-and-error optimization of rendering parameters, which is time-consuming and inefficient. Attempts at mitigating this manual process have led to approaches that make use of a knowledge database consisting of TFs pre-designed by domain experts. In these approaches, a user navigates the knowledge database to find the most suitable pre-designed TF for their input volume to visualize the SOIs. Although these approaches potentially reduce the workload of generating TFs, they still require manual navigation of the knowledge database, as well as likely fine-tuning of the selected TF to suit the input. In this work, we propose a TF design approach in which we introduce a new content-based retrieval (CBR) method to automatically navigate the knowledge database. Instead of pre-designed TFs, our knowledge database contains image volumes with SOI labels. Given an input image volume, our CBR approach retrieves relevant image volumes (with SOI labels) from the knowledge database; the retrieved labels are then used to generate and optimize TFs of the input. This approach does not need any manual TF navigation or fine-tuning. For our CBR approach, we introduce a novel volumetric image feature that includes both a local primitive intensity profile along the SOIs and regional spatial semantics available from the images co-planar to the profile. For the regional spatial semantics, we adopt a convolutional neural network to obtain high-level image feature representations. For the intensity profile, we extend the dynamic time warping technique to address subtle alignment differences between similar profiles (SOIs). Finally, we propose a two-stage CBR scheme to enable the use of these two different feature representations in a complementary manner, thereby improving SOI retrieval performance. We demonstrate the capabilities of our approach through a comparison with a conventional CBR approach in visualization, in which an intensity profile matching algorithm is used, and through potential use cases in medical image volume visualization.
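The abstract builds on dynamic time warping (DTW) for comparing intensity profiles. As background, the sketch below implements the classical DTW distance between two 1D profiles; it does not include the authors' extension, and the function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classical DTW distance between two 1D sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW may match one sample to several, two profiles that differ only by a locally stretched segment, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, are treated as identical, which is the kind of alignment tolerance the abstract refers to.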

Zinuo Li (University of Macau, Macao)
Xuhang Chen (University of Macau, Macao)
Shuna Guo (University of Macau, Macao)
Chi-Man Pun (University of Macau, Macao)
Shuqiang Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China)
WavEnhancer: Unifying Wavelet and Transformer for Image Enhancement

ABSTRACT. Image enhancement is a commonly employed technique in digital image processing that aims to improve the aesthetic appeal and visual quality of an image. However, traditional enhancement approaches based on pixel-level or global-level modifications have limited effectiveness. With the increasing popularity of learning-based techniques, recent works have focused on utilizing various networks for image enhancement. Nevertheless, these methods often do not optimize in the frequency domain of the image. To address this gap, this study introduces a transformer-based model for enhancing images in the wavelet domain. The proposed model refines different frequency bands of an image and prioritizes both local details and high-level features. As a result, the proposed method generates superior enhancement results. Comprehensive benchmark evaluations indicate that our method outperforms the state-of-the-art techniques.

Farhan Rasheed (Linköping University, Sweden, Sweden)
Talha Bin Masood (Linköping University, Sweden, Sweden)
Tejas Murthy (Department of Civil Engineering, Indian Institute of Science, Bangalore, India)
Vijay Natarajan (Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India)
Ingrid Hotz (Linköping University, Sweden, Sweden)
Multi-scale visual analysis of cycle characteristics in spatially-embedded graphs
PRESENTER: Farhan Rasheed

ABSTRACT. We present a visual analysis environment based on a multi-scale partitioning of a 2D domain generated by cycles in weighted planar embedded graphs. The work has been inspired by an application in granular materials research, where the question of scale plays a fundamental role in the analysis of material properties. We propose an efficient algorithm to extract the hierarchical cycle structure using persistent homology. The core of the algorithm is a filtration on a dual graph exploiting Alexander duality. The resulting partitioning is the basis for the derivation of statistical properties that can be explored in a visual environment. We demonstrate the proposed pipeline on a few synthetic data sets and one real-world data set.

Yang Wen (Shenzhen University, China)
Yilin Wu (Shenzhen University, China)
Lei Bi (Shanghai Jiao Tong University, China)
Wuzhen Shi (Shenzhen University, China)
Wenming Cao (, China)
Xun Xu (Shanghai General Hospital, China)
Dagan Feng (The University of Sydney, Australia)
A Transformer assisted cascade learning network for choroidal vessel segmentation

ABSTRACT. The choroid, a highly vascular part of the eye, plays a crucial role in the diagnosis of various eye diseases. Despite this, limited research has focused on the inner structure of the choroid, particularly the choroidal vessels, due to challenges in obtaining accurate labels. Direct vessel segmentation struggles with noisy datasets, while the synergistic segmentation approach compromises vessel segmentation performance for the choroid layer segmentation tasks. Common cascaded structures grapple with error propagation during training. To address these challenges, this paper proposes a robust segmentation method for the inner vessel structures of the choroid. Specifically, we propose a Transformer assisted cascade learning network (TACLNet) for choroidal vessel segmentation, which comprises a two-stage training strategy: a pre-training for choroid layer segmentation and a joint training for choroid and vessel co-segmentation. We also enhance the skip connection structures by introducing a multi-scale subtraction connection (MSC) module, simultaneously capturing differential and detailed information. Additionally, we implemented an auxiliary transformer branch (ATB) to integrate global features into the segmentation process. Experimental results show our method obtains state-of-the-art performance in segmenting choroid vessels. Our proposed TACLNet contributes to advancing choroidal vessel segmentation and promises significant implications for ophthalmic research and clinical applications.

Wei Zhang (State Key Lab of CAD&CG, Zhejiang University, China)
Jian-Wei Zhang (State Key Lab of CAD&CG, Zhejiang University, China)
Kam Kwai Wong (Hong Kong University of Science and Technology, China)
Yifang Wang (Northwestern University, China)
Yingchaojie Feng (State Key Lab of CAD&CG, Zhejiang University, China)
Luwei Wang (State Key Lab of CAD&CG, Zhejiang University, China)
Wei Chen (State Key Lab of CAD&CG, Zhejiang University, China)
Computational Approaches for Traditional Chinese Painting: From the “Six Principles of Painting” Perspective

ABSTRACT. Traditional Chinese Painting (TCP) is an invaluable cultural heritage resource and a unique visual art style. In recent years, increasing interest has been placed on digitalizing TCPs to preserve and revive the culture. The resulting digital copies have enabled the advancement of computational methods for structured and systematic understanding of TCPs. To explore this topic, we conducted an in-depth analysis of 92 pieces of literature. We examined the current use of computer technologies on TCPs from three perspectives, based on numerous conversations with specialists. First, in light of the "Six Principles of Painting" theory, we categorized the articles according to their research focus on artistic elements. Second, we created a four-stage framework to illustrate the purposes of TCP applications. Third, we summarized the popular computational techniques applied to TCPs. The framework also provides insights into potential applications and future prospects, with professional opinion. The list of surveyed publications and related information is available online at