
08:30-09:30 Session Opening

Stream Link:

09:30-10:30 Session Keynote 1: Prof. Mohammed Bennamoun, The University of Western Australia, Department of Computer Science and Software Engineering

Stream Link:

Prof. Mohammed Bennamoun, Winthrop Professor, The University of Western Australia, Department of Computer Science and Software Engineering

Title: 3D Vision for Intelligent Robots

Abstract: In structured settings like industrial environments, robotic technology has exhibited remarkable efficiency. However, its deployment in dynamic and less predictable environments, such as domestic settings, remains a challenge. Robots often surpass human abilities in areas like agility, power, and precision. Yet they still encounter difficulties in tasks like object and person identification, linguistic interpretation, manual dexterity, and social interaction and understanding. The quest for computer vision systems mirroring human visual abilities has been arduous. Two primary obstacles have been: (i) the absence of 3D sensors that can parallel the human eye's capability to concurrently record visual attributes (e.g., colour and texture) and the dynamic surface shapes of objects, and (ii) the lack of real-time data processing algorithms. However, with the recent emergence of cost-effective 3D sensors, there is a surge in the creation of functional 3D systems. These span from 3D biometric systems, e.g., for face recognition, to assistive home robotic systems that help the elderly with mild cognitive impairment. The objective of the talk is to describe a few 3D computer vision projects and tools used towards the development of a platform for assistive robotics in messy living environments. Various systems, their applications, and their motivations will be described, including 3D object recognition, 3D face/ear biometrics, grasping of unknown objects, and systems to estimate the 3D pose of a person.

Bio: Mohammed Bennamoun is a Winthrop Professor in the Department of Computer Science and Software Engineering at the University of Western Australia (UWA) and a researcher in computer vision, machine/deep learning, robotics, and signal/speech processing. He has published 4 books (available on Amazon), 1 edited book, 1 encyclopedia article, 14 book chapters, 200+ journal papers, 270+ conference publications, and 16 invited and keynote papers. His h-index is 72 and his number of citations is 25,200+ (Google Scholar). He has been awarded 70+ competitive research grants from the Australian Research Council and numerous other government, UWA, and industry research grants. He has successfully supervised 30+ PhD students to completion. He won the Best Supervisor of the Year Award at Queensland University of Technology (1998), received awards for research supervision at UWA (2008 and 2016), and received the Vice-Chancellor's Award for mentorship (2016). He has delivered conference tutorials at major conferences, including IEEE CVPR 2016, Interspeech 2014, IEEE ICASSP, and ECCV, and was invited to give a tutorial at an International Summer School on Deep Learning (DeepLearn 2017).



10:30-11:00 Coffee Break
11:00-12:30 Session CAVW1

Zoom Link:       Meeting ID: 822 8242 5843, Password: cgi2023

Lijie Yang (Huaqiao University, China)
Xunxiang Li (College of Fine Arts and Design, Wenzhou University, China)
Modeling and presentation of 3D digital ink landscape painting

ABSTRACT. Referring to the "modeling of everything" concept of RGG's extended L-system, the modular "three-dimensional digital brush" (or "digital brush") can be used to dynamically, directly, and quickly draw rocks and trees in virtual three-dimensional space. In addition, the "oblique projection view" technique is used to realize the "scattered perspective" of traditional Chinese landscape painting and the "three distances" method from the compositional theory of landscape painting, which solves the problem of free-form creation of digital ink landscape paintings (including animation) in virtual three-dimensional space. The creation of three-dimensional digital ink landscape paintings can not only achieve the effect of traditional landscape painting but also freely control the layout of rocks and trees in virtual three-dimensional space, which greatly improves the convenience and appeal of digital ink landscape painting creation. In virtual three-dimensional space, it successfully presents a dynamic three-dimensional ink space that traditional two-dimensional ink painting cannot achieve.

Yuan Xiong (Beihang University, China)
Tong Chen (Beihang University, China)
Tianjing Li (Beihang University, China)
Zhong Zhou (Beihang University, China)
DreamWalk: Dynamic Remapping and Multiperspectivity for Large-Scale Redirected Walking

ABSTRACT. Redirected walking provides an immersive user experience in virtual reality applications. In redirected walking, the size of the physical play area is limited, which makes it challenging to design virtual paths in a larger virtual space. Mainstream redirected walking approaches rigidly manipulate gains to guide the user to follow predetermined walking rules. However, these methods may cause simulator sickness, boundary collisions, and walking resets. Static mapping approaches warp the virtual path through expensive vertex replacement during model pre-processing. They are restricted to narrow spaces consisting of non-looping pathways, partition walls, and planar surfaces. These methods fail to provide a smooth walking experience for large-scale open scenes. To tackle these problems, we propose a novel approach that dynamically redirects the user to walk in a non-linear virtual space. More specifically, we propose a Bezier-curve-based mapping algorithm to warp the virtual space dynamically and apply multiperspective fusion for visualization augmentation. We conduct comparative experiments on our self-collected dataset that show its superiority over state-of-the-art redirected walking approaches in terms of user experience.
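The core of the remapping idea above, taking progress along a constrained physical path and placing the user on a curved virtual path, can be illustrated with a plain cubic Bezier evaluation. This is a minimal sketch, not the paper's algorithm; the control points and corridor length are hypothetical:

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t (De Casteljau's algorithm)."""
    lerp = lambda a, b, s: tuple(ai + s * (bi - ai) for ai, bi in zip(a, b))
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    d, e = lerp(a, b, t), lerp(b, c, t)
    return lerp(d, e, t)

# Hypothetical layout: a straight 5 m physical corridor mapped onto a
# curved virtual path defined by four control points (in metres).
CONTROLS = [(0.0, 0.0), (2.0, 3.0), (4.0, -3.0), (6.0, 0.0)]

def physical_to_virtual(distance_walked, corridor_length=5.0):
    """Map distance walked along the corridor to a point on the virtual curve."""
    t = max(0.0, min(1.0, distance_walked / corridor_length))
    return cubic_bezier(*CONTROLS, t)
```

With this toy mapping, a user halfway down the corridor lands at the curve's midpoint; the paper's dynamic warping and multiperspective fusion go well beyond this.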

Chen Yonghua (Shanghai Jiao Tong University, China)
Can Manipulating Control-Display Ratio Dynamically Really Work in Changing Pseudo-Haptic Weight?

ABSTRACT. Because of the limited hardware available in virtual reality, such as VR controllers or simple motion capture devices, users lack rich haptic feedback, like the weight of virtual objects. Pseudo-haptic feedback, such as the control-display (C/D) ratio manipulation method, is considered a standard way to simulate weight perception. In past studies, this method was mostly used in a static environment, and the C/D ratio was usually fixed at the outset. Can this method still work if the C/D ratio changes in dynamic usage scenarios? In a series of experiments, we tried to answer this question. We show that dynamically changing the C/D ratio can simulate weight changes and, combined with a hand redirection method, improve the sense of embodiment.
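The basic C/D-ratio manipulation the abstract refers to fits in a few lines: the virtual hand's displacement is the real displacement scaled by the ratio, and a ratio that varies during the motion produces the dynamic weight change studied here. The schedule below is a hypothetical illustration, not the paper's actual design:

```python
def redirect_hand(real_delta, cd_ratio):
    """Virtual hand displacement = real displacement scaled by the C/D ratio.
    A ratio below 1 makes the virtual hand lag, which is perceived as weight."""
    return tuple(cd_ratio * d for d in real_delta)

def dynamic_cd_ratio(lift_height, base=1.0, gain=0.5, floor=0.3):
    """Hypothetical schedule: lower the C/D ratio as the object is lifted,
    so the perceived weight grows during the motion."""
    return max(floor, base - gain * lift_height)
```

At each frame one would compute `dynamic_cd_ratio` from the current lift height and feed it to `redirect_hand`; the paper's contribution is establishing whether such mid-motion ratio changes still produce a convincing weight percept.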

Hao Li (Peng Cheng Laboratory, China)
Ju Dai (Peng Cheng Laboratory, China)
Rui Zeng (Beihang University, China)
Junxuan Bai (Capital University of Physical Education and Sports, China)
Zhangmeng Chen (Beihang University, China)
Junjun Pan (Beihang University, China)
Foot-constrained Spatial-Temporal Transformer for Keyframe-based Complex Motion Synthesis

ABSTRACT. Keyframe-based motion synthesis plays a significant role in games and movies. Existing methods for complex motion synthesis often require secondary post-processing to eliminate foot sliding and yield satisfactory motions. In this paper, we analyze the cause of the sliding issue, which we attribute to the mismatch between the root trajectory and motion postures. To address the problem, we propose a novel end-to-end spatial-temporal transformer network conditioned on foot contact information for high-quality keyframe-based motion synthesis. Specifically, our model mainly comprises a spatial-temporal transformer encoder and two decoders that learn motion sequence features and predict motion postures and foot contact states. A novel constrained embedding, which consists of keyframes and foot contact constraints, is incorporated into the model to facilitate network learning from diversified control knowledge. To generate a root trajectory that matches the motion postures, we design a differentiable root trajectory reconstruction algorithm that constructs the root trajectory from the decoder outputs. Qualitative and quantitative experiments on the public LaFAN1, Dance, and Martial Arts datasets demonstrate the superiority of our method in generating high-quality complex motions compared with existing state-of-the-art methods.

Mengying Gao (Dalian Minzu University, China)
Pengjie Wang (Dalian Minzu University, China)
Dehai Shang (Dalian Minzu University, China)
Personalized facial makeup transfer based on outline correspondence
PRESENTER: Dehai Shang

ABSTRACT. Many existing makeup transfer techniques focus on light makeup styles and limit the task of makeup transfer to color manipulation issues such as eye shadow and lip gloss. However, makeup in real life is diverse and personalized: beyond basic foundation and eye makeup, it includes cosmetic contact lenses, patterns painted on the face, jewelry decoration, and other personalized elements. Inspired by the painting practice of drawing the outline first and then coloring, we propose a makeup transfer network for personalized makeup that realizes facial makeup transfer by learning outline correspondence. In this work, we propose an outline feature extraction module and a weakly supervised outline loss that promote outline correspondence. Our network can not only transfer daily light makeup but also handle complex facial painting patterns. Quantitative and qualitative experimental results show that the proposed method achieves visually more accurate makeup transfer results and compares favorably with state-of-the-art methods on both benchmark datasets.

Jingwen Ren (School of Mathematical Sciences, Zhejiang University, China)
Hongwei Lin (School of Mathematical Sciences, Zhejiang University, China)
Nonlinear cloth simulation with isogeometric analysis
PRESENTER: Jingwen Ren

ABSTRACT. Physically based cloth simulation with nonlinear behaviors is studied in this paper by employing isogeometric analysis (IGA) for surface deformation in 3D space. State-of-the-art simulation techniques, which primarily rely on triangular meshes to calculate physical points on the cloth directly, require a large number of degrees of freedom. We propose an effective method for cloth deformation that employs high-order continuous B-spline surfaces controlled by control points. This method has the merits of fewer degrees of freedom and superior smoothness. The deformation gradient on a high-order IGA element is then represented by the gradient of the B-spline function. An iterative method for solving the nonlinear optimization derived from the implicit integration, and a direct implicit-explicit method, are developed on the basis of elastic force calculation to improve efficiency. The knots of the representation are effectively utilized in collision detection and response to reduce the computational burden. Experiments on nonlinear cloth simulation demonstrate the superiority of the proposed method in performance and efficiency, achieving accurate, efficient, and stable deformation.

11:00-12:30 Session TVCJ1-Texture

Zoom Link:        Meeting ID: 816 5475 7227, Password: cgi2023

Jinxing Liang (Wuhan Textile University, China)
Xinrong Hu (Wuhan Textile University & State Key Laboratory of New Textile Materials and Advanced Processing Technologies, China)
Cheng Zheng (Wuhan Textile University, China)
Junjie Huang (Wuhan Textile University, China)
Ruiqi Luo (Wuhan Textile University, China)
Junping Liu (Wuhan Textile University, China)
Tao Peng (Wuhan Textile University, China)
Cloth Texture Preserving Image-Based 3D Virtual Try-On

ABSTRACT. 3D virtual try-on based on a single image can provide an excellent shopping experience for Internet users and has enormous business potential. Existing methods reconstruct a clothed 3D human body from virtual try-on images by extracting depth information from the input images. However, the generated results are unstable: high-frequency detail features in the larger spatial context are often lost during downsampling for depth prediction, and the generator gradient vanishes when predicting occluded areas in high-resolution images. To address this problem, we propose a multi-resolution parallel approach that obtains low-frequency information while retaining as many high-frequency depth features as possible during depth prediction; at the same time, we use a multi-scale generator and discriminator to more accurately infer the feature images of occluded regions and generate a fine-grained dressed 3D human body. Our method not only provides better details in the final 3D mannequin generated for virtual fitting, but also significantly improves the user's try-on experience compared with previous studies, as evidenced by higher quantitative and qualitative evaluations.

Xu Wang (University of Tsukuba, Japan)
Makoto Fujisawa (University of Tsukuba, Japan)
Masahiko Mikawa (University of Tsukuba, Japan)
XProtoSphere: an eXtended multi-sized sphere packing algorithm driven by particle size distribution

ABSTRACT. The sphere packing problem, which involves filling an arbitrarily shaped geometry with the maximum number of non-overlapping spheres, is a critical research challenge. ProtoSphere is a prototype-oriented algorithm designed for solving sphere packing problems. Due to its easily parallelizable design, it exhibits high versatility and has wide-ranging applications. However, controllable regulation of the particle size distribution (PSD) produced by ProtoSphere is often neglected, which limits the algorithm's applications. This paper proposes a novel PSD-driven technique that extends the ProtoSphere algorithm to achieve multi-sized sphere packing with distribution-specific characteristics, as dictated by a pre-defined cumulative distribution function. The proposed approach improves the controllability and flexibility of the packing process and enables users to generate packing configurations that meet their specific requirements. In addition, by combining a relaxation method with the ProtoSphere algorithm, we can further improve the packing density and keep the average overlap below 1%. Our method generates multi-sized particles that can be used to simulate the behavior of various granular materials, including sand-like and clay-like soils.
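Drawing particle sizes from a pre-defined cumulative distribution function, as the abstract describes, is a case of inverse transform sampling. The sketch below assumes the PSD is supplied as a piecewise-linear CDF over radii; the actual extension of ProtoSphere is more involved:

```python
import bisect
import random

def sample_radii(cdf_points, n, rng=None):
    """Draw n particle radii from a PSD given as a piecewise-linear CDF:
    a sorted list of (radius, cumulative_fraction) pairs ending at 1.0."""
    rng = rng or random.Random(42)
    radii, fractions = zip(*cdf_points)
    out = []
    for _ in range(n):
        u = rng.random()
        i = bisect.bisect_left(fractions, u)
        if i == 0:
            out.append(radii[0])
            continue
        # Invert the CDF by linear interpolation between bracketing points.
        f0, f1 = fractions[i - 1], fractions[i]
        r0, r1 = radii[i - 1], radii[i]
        out.append(r0 + (u - f0) / (f1 - f0) * (r1 - r0))
    return out
```

For example, `sample_radii([(0.1, 0.0), (0.2, 0.5), (0.4, 1.0)], 500)` yields radii in [0.1, 0.4] with half the mass below 0.2, matching the prescribed distribution in expectation.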

Rongyan Xu (School of Computer Science and Technology, Zhejiang Sci-Tech University, China)
Yao Jin (School of Computer Science and Technology, Zhejiang Sci-Tech University, China)
Huaxiong Zhang (School of Computer Science and Technology, Zhejiang Sci-Tech University, China)
Yun Zhang (College of Media Engineering, Communication University of Zhejiang, China)
Yu-Kun Lai (School of Computer Science & Informatics, Cardiff University, UK)
Zhe Zhu (Mathworks, United States)
Fang-Lue Zhang (School of Engineering and Computer Science, Victoria University of Wellington, New Zealand)
A variational approach for feature-aware B-spline curve design on surface meshes

ABSTRACT. Robust curve design on surface meshes with flexible controls is useful in a wide range of applications but remains challenging. Most existing methods follow one of two strategies: discretize the curve into a polyline that is then optimized, or directly design smooth splines on meshes. While the former approach usually needs a sufficiently dense sampling of curve points, which is computationally costly, the latter relaxes the sampling requirement but suffers from a lack of user control. To tackle these problems, we propose a variational method for designing feature-aware B-spline curves on surface meshes. Building on recent advances in shell space construction, we relax the B-spline curve inside a simplified shell mesh and evaluate its distance to the surface using the equipped bijective mapping. To effectively minimize the distance between the curve and the surface, with additional controls in the form of both internal and external constraints, we apply the interior point method, adaptively insert knots of the spline to increase its freedom, and adjust the weighting during the iterations. When the curve is close enough to the surface, it can be efficiently sampled at any resolution and robustly projected onto the surface. Experiments show that our method is more robust, more flexible, and generates smoother results than existing methods.

Shumeet Baluja (Google, United States)
The Infinite Doodler: Expanding Textures Within Tightly Constrained Manifolds

ABSTRACT. Hand-drawn doodles present a difficult set of textures to model and synthesize. Unlike the typical natural images most often used in texture synthesis studies, the doodles examined here are characterized by sharp, irregular, and imperfectly scribbled patterns, frequent imprecise strokes, haphazardly connected edges, and randomly or spatially shifting themes. The almost binary nature of the doodles examined makes it difficult to hide common mistakes such as discontinuities. Further, there is no color or shading to mask flaws and repetition; any process that relies on region copying, even stochastic copying, is readily discernible. To tackle the problem of synthesizing these textures, we model the underlying generation process of the doodle, taking into account potential unseen, but related, expansion contexts. We demonstrate how to generate infinitely long textures, such that the texture can be extended far beyond a single image's source material. This is accomplished by a novel learning mechanism that is taught to condition the generation process on its own generated context -- what was generated in previous steps -- not just upon the original.

Xi Zhao (Xi'an Jiaotong University, China)
Haodong Li (Xi'an Jiaotong University, China)
Haoran Wang (Xi'an Jiaotong University, China)
Learning Shape Abstraction by Cropping Positive Primitives with Negative Ones

ABSTRACT. High-quality 3D model abstraction is needed in many graphics and 3D vision tasks to improve rendering efficiency, increase transmission speed, or reduce space occupation. Traditional simplification algorithms for 3D models rely heavily on the mesh topology and ignore the object's overall structure during optimization. Learning-based methods have therefore been proposed to form end-to-end regression systems for abstraction. However, existing learning-based methods have difficulty representing shapes with hollow or concave structures. We propose a self-supervised learning-based abstraction method for 3D meshes to solve this problem. Our system predicts positive and negative primitives, where positive primitives match the inside of the shape and negative primitives represent its hollow areas. More specifically, the Boolean difference between the positive primitives and the object is fed to a network using an Iterative Error Feedback (IEF) mechanism to predict the negative primitives, which crop the positive primitives to create hollow or concave structures. In addition, we design a new separation loss to prevent a negative primitive from overlapping the object too much. We evaluate the proposed method on the ShapeNetCore dataset using Chamfer Distance (CD) and Intersection over Union (IoU). The results show that our positive-negative abstraction scheme outperforms the baselines.

Pengwei Zhou (School of Informatics, Xiamen University, China)
Xiao Dong (BNU-HKBU United International College, China)
Juan Cao (School of Mathematical Sciences, Xiamen University, China)
Zhonggui Chen (School of Informatics, Xiamen University, China)
MeT: Mesh Transformer with an Edge
PRESENTER: Pengwei Zhou

ABSTRACT. Transformers have been widely applied to various vision tasks processing different kinds of data, such as images, videos, and point clouds. However, the use of Transformers in 3D mesh analysis remains largely unexplored. To address this gap, we propose a Mesh Transformer (MeT) that applies local self-attention on edges. MeT is based on a transformer layer that uses vector attention for edges, a kind of attention operator that supports adaptive modulation of both feature vectors and individual feature channels. Based on this transformer block, we build a lightweight Mesh Transformer network consisting of an encoder and a decoder. MeT provides general backbones for downstream 3D mesh analysis tasks. To evaluate its effectiveness, we conduct experiments on two classic mesh analysis tasks: shape classification and shape segmentation. MeT achieves state-of-the-art performance on multiple datasets for both tasks. We also conduct ablation studies to show the effectiveness of the key designs in our network.

12:30-13:30 Lunch Break
13:30-15:30 Session TVCJ3-Rendering

Zoom Link:        Meeting ID: 816 5475 7227, Password: cgi2023

Iordanis Evangelou (Athens University of Economics and Business, Greece)
Georgios Papaioannou (Athens University of Economics and Business, Greece)
Konstantinos Vardis (Athens University of Economics and Business, Greece)
A Neural Builder for Spatial Subdivision Hierarchies

ABSTRACT. Spatial data structures, such as k-d trees and bounding volume hierarchies, are extensively used in computer graphics to accelerate spatial queries in ray tracing, nearest neighbour searches, and other tasks. Typically, the splitting strategy employed during the construction of such structures is based on the greedy evaluation of a predefined objective function, resulting in a less than optimal subdivision scheme. In this work, for the first time, we propose the use of unsupervised deep learning to infer the structure of a fixed-depth k-d tree from a constant, subsampled set of the input primitives, based on the recursive evaluation of the cost function at hand. This results in a high-quality upper spatial hierarchy, inferred in constant time and without paying the intractable price of a fully recursive tree optimisation. The resulting fixed-depth tree can then be further expanded, in parallel, into either a full k-d tree or transformed into a bounding volume hierarchy with any known conventional tree builder. The approach is generic enough to accommodate different cost functions, such as the popular surface area and volume heuristics. We experimentally validate that the resulting hierarchies have competitive traversal performance with respect to established tree builders, while maintaining minimal overhead in construction times.
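As a concrete example of the kind of objective such builders greedily evaluate, here is a minimal surface area heuristic (SAH) estimate for one candidate split. This is the generic textbook formulation, not the paper's learned builder; the cost constants are arbitrary:

```python
def box_surface_area(lo, hi):
    """Surface area of an axis-aligned box given min/max corners."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def sah_cost(lo, hi, axis, split, centroids, c_trav=1.0, c_isect=1.0):
    """Greedy SAH cost of splitting node [lo, hi] at `split` along `axis`:
    traversal cost plus each child's primitive count weighted by the child's
    surface area relative to the parent (the hit probability under the SAH)."""
    n_left = sum(1 for c in centroids if c[axis] <= split)
    n_right = len(centroids) - n_left
    left_hi = list(hi); left_hi[axis] = split
    right_lo = list(lo); right_lo[axis] = split
    sa = box_surface_area(lo, hi)
    sa_left = box_surface_area(lo, left_hi)
    sa_right = box_surface_area(right_lo, hi)
    return c_trav + c_isect * (sa_left / sa * n_left + sa_right / sa * n_right)
```

A conventional builder evaluates this cost over many candidate splits per node and keeps the minimum; the paper's point is that a network can infer good upper-level splits without that exhaustive greedy sweep.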

Ziyi Chen (Wuhan Textile University, China)
Feng Yu (Wuhan Textile University, China)
Minghua Jiang (Wuhan Textile University, China)
Hua Wang (Wuhan Textile University, China)
Ailing Hua (Wuhan Textile University, China)
Tao Peng (Wuhan Textile University, China)
Xinrong Hu (Wuhan Textile University, China)
Ping Zhu (Wuhan Textile University, China)
AFSF-3DVTON: 3D Virtual Try-On Network based on Appearance Flow and Shape Field

ABSTRACT. Virtual try-on technology can satisfy the demands of online shopping and help consumers experience clothes online through image generation. Compared with image-based try-on, 3D virtual try-on methods can simulate try-on from multiple perspectives and have attracted many researchers. Current 3D virtual try-on methods are mainly based on the Thin-Plate Spline (TPS) method and depth-based 3D reconstruction, which increases the cost of implementing 3D virtual try-on, and the results lack clothing details such as folds and patterns. To solve these problems, we propose a novel 3D virtual try-on network based on appearance flow and a shape field, called AFSF-3DVTON. The network consists of three modules. First, the Appearance Flow Warping module (AFW) generates the desired warped clothes according to the appearance flow of the original clothes. Second, the Flat Try-on Module (FTM) performs geometric matching between the warped clothes and the reference person image and synthesizes 2D try-on results. Third, to bring image details into the 3D try-on synthesis, a Shape Field-based Reconstruction module (SFR) extracts shape features of the 2D try-on results to improve the quality of the 3D reconstruction. We evaluate the proposed method on the VITON and MPV3D datasets against several state-of-the-art virtual try-on algorithms. Qualitative analyses verify the superiority of the proposed method, and the evaluation indexes, including Abs., Sq., and RMSE, demonstrate that the proposed network outperforms the alternatives.

Yuxuan Hou (State Key Lab of CAD & CG, Zhejiang University, China)
Zhong Ren (State Key Lab of CAD & CG, Zhejiang University, China)
Qiming Hou (State Key Lab of CAD & CG, Zhejiang University, China)
Yubo Tao (State Key Lab of CAD & CG, Zhejiang University, China)
Yankai Jiang (State Key Lab of CAD & CG, Zhejiang University, China)
Wei Chen (State Key Lab of CAD & CG, Zhejiang University, China)
InstantTrace: Fast Parallel Neuron Tracing On GPUs

ABSTRACT. Neuron tracing, also known as neuron reconstruction, is an essential step in investigating the morphology of neuronal circuits and mechanisms of the brain. Since the ultra-high throughput of optical microscopy (OM) imaging leads to images of multiple gigabytes or even terabytes, it takes tens of hours for state-of-the-art methods to generate a neuron reconstruction from a whole mouse brain OM image. We present a novel framework, InstantTrace, that leverages parallel neuron tracing on GPUs, reaching a notable speed boost of more than 20× over state-of-the-art methods with comparable reconstruction quality on the BigNeuron dataset. Our framework achieves this performance advance in two ways. First, it takes advantage of the sparsity and tree structure of the neuron image, which serial tracing methods cannot fully exploit. Second, all stages of the neuron tracing pipeline, including the initial reconstruction stage, which had not been parallelized in the past, are executed on the GPU using carefully designed parallel algorithms. Furthermore, to investigate the applicability and robustness of the InstantTrace framework, we conducted a test on a whole mouse brain OM image: a preliminary neuron reconstruction of the whole brain was finished within 1 hour on a single GPU, an order of magnitude faster than existing methods. Our framework has the potential to significantly improve the efficiency of the neuron tracing process, allowing neuron image experts to obtain a preliminary reconstruction result instantly before engaging in manual verification and refinement.

Yiming Qin (Shanghai Jiao Tong University, China)
Rynson W.H. Lau (City University of Hong Kong, Hong Kong)
GuideRender: Large-scale Scene Navigation Based on Multi-modal View Frustum Movement Prediction

ABSTRACT. Distributed parallel rendering provides a valuable way to navigate large-scale scenes. However, previous work has typically focused on outputting ultra-high-resolution images. In this paper, we aim to improve the interactivity of navigation and propose a large-scale scene navigation method, GuideRender, based on multi-modal view frustum movement prediction. Given previous frames, user inputs, and object information, GuideRender first extracts frame, user input, and object features spatially and temporally using a multi-modal extractor. To obtain effective fused features for prediction, we introduce an attentional guidance fusion module that fuses these features from different domains with attention. Finally, we predict the movement of the view frustum from the attentional fused features and obtain its future state, so that data can be loaded in advance to reduce latency. In addition, to facilitate GuideRender, we design an object hierarchy hybrid tree for scene management based on the object distribution and hierarchy, and an adaptive virtual sub-frustum decomposition method for task decomposition based on the relationship between the rendering cost and rendering node capacity. Experimental results show that GuideRender outperforms baselines in navigating large-scale scenes. A user study also shows that our method satisfies navigation requirements in large-scale scenes.

Jaroslav Kravec (Czech Technical University in Prague, Czechia)
Martin Kacerik (Czech Technical University in Prague, Czechia)
Jiri Bittner (Czech Technical University in Prague, Czechia)
PVLI: Potentially Visible Layered Image for real-time Ray Tracing
PRESENTER: Martin Kacerik

ABSTRACT. Novel view synthesis is frequently employed in video streaming, temporal upsampling, and virtual reality. We propose a new representation, the potentially visible layered image (PVLI), that combines a potentially visible set of the scene geometry with layered color images. PVLI encodes depth implicitly and enables cheap run-time reconstruction. Furthermore, PVLI can also be used to reconstruct pixel and layer connectivities, which is crucial for filtering and post-processing of the rendered images. We use PVLIs to achieve local and server-based real-time ray tracing. In the first case, PVLIs serve as a basis for temporal and spatial upsampling of ray-traced illumination. In the second case, PVLIs are compressed, streamed over the network, and then used by a thin client to perform temporal and spatial upsampling and to hide latency. To shade the view, we use path tracing, accounting for effects such as soft shadows, global illumination, and physically based refraction. Our method supports dynamic lighting and, to a limited extent, view-dependent surface interactions.

Yuanzhen Li (School of Computer Science, Wuhan University, China)
Fei Luo (School of Computer Science, Wuhan University, China)
Chunxia Xiao (School of Computer Science, Wuhan University, China)
Monocular Human Depth Estimation with 3D Motion Flow and Surface Normals
PRESENTER: Yuanzhen Li

ABSTRACT. Existing monocular human depth estimation methods need ground truth depth as a supervision signal, but extracting diverse and accurate ground truth depth is challenging. We propose a new monocular human depth estimation method that uses video sequences as the training dataset. We jointly learn depth and 3D motion flow networks and establish photometric and 3D geometric consistency constraints to optimize the two networks. At the same time, we use pre-computed surface normals as pseudo labels, instead of depth information, to supervise the depth network. The depth estimation model may produce texture copy artifacts at inference time when the clothes exhibit patterns or text marks (non-dominant colors). Thus, we first use the k-means clustering algorithm to detect non-dominant color areas. Then, depending on the size of the non-dominant color mask, we apply either a linear color transformation or an image inpainting algorithm to shift the non-dominant colors toward the dominant color. Extensive experiments on public datasets and Internet images show that our approach achieves competitive results and generalizes well.
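The non-dominant color detection step described above can be sketched with plain k-means on RGB pixels: the cluster with the most members is taken as the dominant color and everything else is masked. This is a generic illustration with hypothetical parameters, not the authors' exact pipeline:

```python
import random

def kmeans_colors(pixels, k=2, iters=20, init=None, rng=None):
    """Cluster RGB pixels with plain k-means; returns (centers, labels)."""
    rng = rng or random.Random(0)
    centers = list(init) if init is not None else rng.sample(pixels, k)
    labels = [0] * len(pixels)
    for _ in range(iters):
        for i, p in enumerate(pixels):
            labels[i] = min(range(k), key=lambda j: sum(
                (p[c] - centers[j][c]) ** 2 for c in range(3)))
        for j in range(k):
            members = [pixels[i] for i, lab in enumerate(labels) if lab == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(m[c] for m in members) / len(members)
                                   for c in range(3))
    return centers, labels

def non_dominant_mask(labels, k=2):
    """Mark pixels whose cluster is not the largest (dominant-color) one."""
    counts = [labels.count(j) for j in range(k)]
    dominant = counts.index(max(counts))
    return [lab != dominant for lab in labels]
```

In the paper's pipeline the resulting mask would then be fed to the color transformation or inpainting step, chosen by mask size.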

Hyeonjang An (Gwangju Institute of Science and Technology, South Korea)
Wonjun Lee (Gwangju Institute of Science and Technology, South Korea)
Bochang Moon (Gwangju Institute of Science and Technology, South Korea)
Adaptively Weighted Discrete Laplacian for Inverse Rendering
PRESENTER: Hyeonjang An

ABSTRACT. Reconstructing a triangular mesh from images with a differentiable rendering framework often exploits discrete Laplacians on the mesh, e.g., the cotangent Laplacian, so that the stochastic gradient descent-based optimization in the framework is stabilized by a regularization term formed with the Laplacians. However, the stability stemming from such a regularizer often comes at the cost of over-smoothing the resulting mesh, especially when the Laplacian of the mesh is not properly approximated, e.g., a too-noisy or overly smoothed Laplacian. This paper presents a new discrete Laplacian built upon a kernel-weighted Laplacian. We control the kernel weights using a local bandwidth parameter so that the geometry optimization in a differentiable rendering framework is improved by avoiding blurring of high-frequency surface details. We demonstrate that our locally adaptive discrete Laplacian improves both the quality of reconstructed meshes and the convergence speed of the geometry optimization when plugged into recent differentiable rendering frameworks.
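A kernel-weighted graph Laplacian of the kind the abstract builds on can be sketched as follows. This is a generic construction with a single global bandwidth h (the paper's contribution is a locally adaptive bandwidth), and the vertex and edge data below are made up for illustration:

```python
import numpy as np

def kernel_laplacian(verts, edges, h):
    """Symmetric kernel-weighted graph Laplacian L = D - W, with
    Gaussian edge weights w_ij = exp(-||v_i - v_j||^2 / h^2)."""
    n = len(verts)
    W = np.zeros((n, n))
    for i, j in edges:
        w = np.exp(-np.sum((verts[i] - verts[j]) ** 2) / h ** 2)
        W[i, j] = W[j, i] = w
    D = np.diag(W.sum(axis=1))  # degree matrix
    return D - W

verts = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0.5]])
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
L = kernel_laplacian(verts, edges, h=1.0)
# rows of L sum to zero: constant functions lie in the null space
print(np.allclose(L.sum(axis=1), 0))  # → True
```

Applying L to vertex positions yields the per-vertex smoothness residual typically used in the regularization term; a smaller h concentrates the weights on short edges and so preserves more high-frequency detail.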

Shuo Li (Peng Cheng Laboratory, China)
Ju Dai (Peng Cheng Laboratory, China)
Zhangmeng Chen (Beihang University, China)
Junjun Pan (Beihang University, China)
A Lightweight Pose Estimation Network with Multi-Scale Receptive Field

ABSTRACT. Existing lightweight networks perform worse than large-scale models in human pose estimation because of shallow model depths and limited receptive fields. Current approaches utilize large convolution kernels or attention mechanisms to encourage long-range receptive field learning at the expense of model redundancy. In this paper, we propose a novel Multi-scale Field Lightweight High-resolution Network (MFite-HRNet) for human pose estimation. Specifically, our model mainly consists of two lightweight blocks, a Multi-scale Receptive Field Block (MRB) and a Large Receptive Field Block (LRB), which learn informative multi-scale and long-range spatial context information. The MRB utilizes grouped depthwise dilated convolutions with varied dilation rates to extract multi-scale spatial relationships from different feature maps. The LRB leverages large depthwise convolution kernels to model long-range spatial knowledge at the low-level features. We apply MFite-HRNet to both single-person and multi-person pose estimation tasks. Experiments on the COCO, MPII, and CrowdPose datasets demonstrate that our network outperforms current state-of-the-art lightweight networks on both single-person and multi-person pose estimation. The source code will be publicly available at
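Why varied dilation rates yield multi-scale receptive fields follows from the standard formula for dilated convolutions; a tiny illustrative helper (not from the paper):

```python
def effective_kernel(k, d):
    """Spatial extent covered by a k x k convolution kernel with
    dilation rate d: k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)

# a 3x3 depthwise kernel at dilation rates 1..4 covers 3, 5, 7, 9 pixels,
# so grouping branches with different rates mixes several scales at once
print([effective_kernel(3, d) for d in (1, 2, 3, 4)])  # → [3, 5, 7, 9]
```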

13:30-15:30 Session TVCJ4-MR&AR&Reco

Zoom Link:       Meeting ID: 822 8242 5843, Password: cgi2023

Hai-Ning Liang (Xi'an Jiaotong-Liverpool University, China)
Liqing Gao (The College of Intelligence and Computing, Tianjin University, China)
Lianyu Hu (The College of Intelligence and Computing, Tianjin University, China)
Fan Lyu (The College of Intelligence and Computing, Tianjin University, China)
Lei Zhu (The Hong Kong University of Science and Technology (Guangzhou), China)
Liang Wan (The College of Intelligence and Computing, Tianjin University, China)
Chi-Man Pun (University of Macau, China)
Wei Feng (The College of Intelligence and Computing, Tianjin University, China)
Difference-Guided Multi-Scale Spatial-Temporal Representation for Sign Language Recognition

ABSTRACT. Sign language recognition (SLR) is a challenging task that requires a thorough understanding of spatial-temporal visual features for translating sign language into comprehensible written or spoken language. However, existing SLR methods ignore the importance of key spatial-temporal representations due to their sparsity and inconsistency in space and time. To tackle this issue, we propose a Difference-Guided Multi-Scale Spatial-Temporal representation (DMST) learning model for SLR. In DMST, we devise two modules: 1) Key Spatial-Temporal Representation, which extracts and enhances key spatial-temporal information via a spatial-temporal difference strategy; and 2) Multi-Scale Sequence Alignment, which perceives and fuses multi-scale spatial-temporal features and achieves sequence mapping. The DMST model achieves state-of-the-art performance on four public sign language datasets, which demonstrates the superiority of the DMST model and the significance of key spatial-temporal representations for SLR.
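A spatial-temporal difference strategy, in its simplest generic form, amounts to frame differencing; the sketch below only illustrates the idea of locating where appearance changes over time, and is not the DMST module itself:

```python
import numpy as np

def temporal_difference(frames):
    """Absolute frame-to-frame difference over a (T, H, W) clip; large
    values mark where appearance changes between consecutive frames,
    a cheap cue for key spatial-temporal regions such as moving hands."""
    return np.abs(np.diff(frames.astype(float), axis=0))

clip = np.zeros((3, 4, 4))
clip[1, 2, 2] = 1.0  # something "moves" in frame 1 only
diff = temporal_difference(clip)
print(diff.shape, diff[0].max(), diff[1].max())  # → (2, 4, 4) 1.0 1.0
```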

Yu Wang (Shandong University, Jinan, Shandong, China)
Hongqiu Luan (Shandong University, Jinan, Shandong, China)
Lutong Wang (Shandong University, Jinan, Shandong, China)
Shengzun Song (The National Police University for Criminal Justice, Baoding, China)
Yulong Bian (Shandong University, Jinan, Shandong, China)
Xiyu Bao (Shandong University, Jinan, Shandong, China)
Ran Liu (Shandong Normal University, Jinan, China)
Wei Gai (Shandong University, Jinan, Shandong, China)
Gaorong Lv (Shandong University, Jinan, Shandong, China)
Chenglei Yang (Shandong University, Jinan, Shandong, China)
MGP: A Monitoring-Based VR Interactive Mode to Support Guided Practice
PRESENTER: Hongqiu Luan

ABSTRACT. Currently, VR training systems are primarily designed around a non-directive mode that supports independent practice without a real teacher. Students must complete all system operations themselves, resulting in a heavy interaction burden and confusion. In this paper, a VR interactive mode called M-IMGP is designed to support guided practice when teacher resources are available. The mode mainly realizes the separation of teacher-student interaction tasks, environment, and equipment, helping students focus on their own interactions and tasks. It also provides multi-channel information about the students so that teachers can better understand students' training status and provide precise guidance. A platform and two applications were developed based on this mode for a user study. The results show that the proposed mode can improve learning efficiency and enhance the immersive experience.

Shuo Gao (Northeastern University, China)
Zhenhua Tan (Northeastern University, China)
Jingyu Ning (Northeastern University, China)
Bingqian Hou (Northeastern University, China)
Li Li (Northeastern University, China)
ResGait: Gait Feature Refinement based on Residual Structure for Gait Recognition

ABSTRACT. Gait recognition is a biometric recognition technology whose goal is to identify a subject by their walking posture at a distance. However, redundant information in gait sequences degrades recognition performance, and most existing gait recognition models are overly complex and over-parameterized, which leads to low training efficiency. Consequently, reducing model complexity and effectively eliminating redundant information in gait has become a challenging problem in the field of gait recognition. In this paper, we present a residual-structure-based gait recognition model, ResGait for short, to learn the most discriminative changes in gait patterns. To eliminate redundant information, soft thresholding is inserted into the deep architecture as a nonlinear transformation layer, improving gait feature learning from noisy gait feature maps. Moreover, each sample owns a unique set of thresholds, making the proposed model suitable for gait sequences with different kinds of redundant information. Furthermore, residual links are introduced to reduce learning difficulty and alleviate the computational cost of model training. We train the network under various scenarios and walking conditions, and validate the effectiveness of the method through extensive experiments with various types of redundant information in gait. In comparison to previous state-of-the-art works, experimental results on the common datasets CASIA-B and OUMVLP-Pose show that ResGait achieves higher recognition accuracy under various walking conditions and scenarios.
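Soft thresholding, the nonlinear transformation referred to above, is a standard shrinkage operator; a minimal NumPy sketch (the threshold value here is arbitrary, whereas in ResGait the thresholds are learned per sample):

```python
import numpy as np

def soft_threshold(x, tau):
    """Shrink values toward zero: sign(x) * max(|x| - tau, 0).
    Small (noise-like) activations are zeroed; large ones are kept,
    reduced in magnitude by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

feat = np.array([-2.0, -0.3, 0.1, 0.8, 3.0])
res = soft_threshold(feat, 0.5)
print(res)  # entries below 0.5 in magnitude are zeroed, the rest shrink by 0.5
```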

David Sibrina (Durham University, UK)
Sarath Bethapudi (Univ. Hospital of N. Durham; Durham University, UK)
George Alex Koulieris (Durham University, UK)
OrthopedVR: Clinical Assessment and Pre-operative Planning of Paediatric Patients with Lower Limb Rotational Abnormalities in Virtual Reality
PRESENTER: David Sibrina

ABSTRACT. Rotational abnormalities in the lower limbs causing patellar mal-tracking negatively affect patients' lives, particularly young patients (10-17 years old). Recent studies suggest that rotational abnormalities can increase degenerative effects on the joints of the lower limbs. Repetitive patellar dislocation from the trochlear groove may require a knee arthroplasty. Rotational abnormalities are diagnosed using 2D CT imaging and X-rays, and these data are then used by surgeons to make decisions during an operation. However, 3D representation of data is considered preferable for the examination of 3D structures such as bones, with added benefits for medical judgement, pre-operative planning, and clinical training. Virtual reality can transform standard clinical imaging examination methods (CT/MRI) into immersive examinations and pre-operative planning in 3D, and its benefits for surgical planning have been demonstrated in several recent studies. We present a VR system (OrthopedVR) which allows orthopaedic surgeons to examine a patient's specific lower-limb anatomy in an immersive three-dimensional environment and to simulate the effect of potential surgical interventions, such as corrective osteotomies, on a stand-alone VR headset. In OrthopedVR, surgeons can perform corrective incisions and re-align segments into desired rotational angles. In a system evaluation performed by four experienced surgeons (mean 20 years of experience in the field) on five paediatric patient cases, we found that OrthopedVR provides a better understanding of lower-limb alignment and rotational profiles than isolated 2D CT scans. In addition, using the VR software in conjunction with radiology reports improved pre-operative planning, surgical precision, and post-operative outcomes for patients.
The increased understanding of patients' anatomy in VR was also reflected in a statistically significant difference in task completion time for each case compared to a desktop DICOM viewer. The results indicate that our system can become a stepping stone toward simulating corrective surgeries of the lower limbs on personalised patient data, and they suggest future improvements that will help bring VR surgical planning into clinical orthopaedic practice.

Yujie Cui (Shenzhen University, China)
Xiaoyan Zhang (Shenzhen University, China)
Yongkai Huo (Shenzhen University, China)
Deformable Patch Embedding Based Shift Module Enhanced Transformer for Panoramic Action Recognition
PRESENTER: Xiaoyan Zhang

ABSTRACT. 360° video action recognition is one of the most promising fields given the popularity of omnidirectional cameras. To obtain a more precise action understanding in panoramic scenes, in this paper we propose a Deformable patch embedding based temporal Shift module enhanced Vision Transformer model (DS-ViT), which aims to simultaneously eliminate the distortion effects caused by Equi-Rectangular Projection (ERP) and construct temporal relationships among video sequences. Panoramic action recognition is a practical but challenging domain due to the lack of panoramic feature extraction methods. With deformable patch embedding, our scheme can adaptively learn position offsets between pixels, which effectively captures distorted features. The temporal shift module facilitates temporal information exchange by shifting part of the channels with zero parameters. Thanks to the powerful encoder, DS-ViT can efficiently learn distorted features from ERP inputs. Simulation results on the recent EgoK360 dataset show that our proposed solution outperforms the state-of-the-art two-stream solution by 9.29% in action accuracy and 8.18% in activity accuracy.
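The temporal shift idea, popularized by the TSM line of work, exchanges information across frames with zero learnable parameters; a generic sketch (the fold_div value is illustrative, not the paper's setting):

```python
import numpy as np

def temporal_shift(x, fold_div=4):
    """Parameter-free temporal shift over a (T, C) feature sequence:
    1/fold_div of the channels are shifted one step backward in time,
    another 1/fold_div one step forward, zero-padded at the ends;
    the remaining channels pass through unchanged."""
    t, c = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift toward the past
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift toward the future
    out[:, 2 * fold:] = x[:, 2 * fold:]              # untouched channels
    return out

x = np.arange(12, dtype=float).reshape(3, 4)  # T=3 frames, C=4 channels
y = temporal_shift(x)
```

After the shift, each frame's features mix values from its neighbours, so a subsequent per-frame layer sees temporal context for free.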

Nan Xiang (Xi'an Jiaotong-Liverpool University, China)
Hai-Ning Liang (Xi'an Jiaotong-Liverpool University, China)
Lingyun Yu (Xi'an Jiaotong-Liverpool University, China)
Xiaosong Yang (Bournemouth University, UK)
Jian J Zhang (Bournemouth University, UK)
Development of A Mixed Reality Framework for Microsurgery Simulation

ABSTRACT. Microsurgery is a general term for surgery performed with a surgical microscope and specialized precision instruments. Training in microsurgery requires considerable time and training resources. With the rapid development of computer technologies, virtual surgery simulation has gained extensive attention over the past decades. In this work, we take advantage of mixed reality (MR), which creates an interactive environment where physical and digital objects coexist, and present an MR framework for microsurgery simulation. It enables users to practice anastomosis skills with real microsurgical instruments rather than the additional haptic feedback devices typically used in virtual reality (VR) based systems. A vision-based tracking system is proposed to simultaneously track microsurgical instruments and artificial blood vessels, and a learning-based anatomical modeling approach is introduced to facilitate the development of simulations in different microsurgical specialities by rapidly creating virtual assets. Moreover, we build a prototype system for microvascular hepatic artery reconstruction to demonstrate the feasibility and applicability of our framework.

Lijie Yang (Huaqiao University, China)
Zhan Wu (Huaqiao University, China)
Tianchen Xu (SKLCS-ISCASU & University of CAS, Beijing, China)
Enhua Wu (SKLCS-ISCASU & University of Macau, China)
Easy Recognition of Artistic Chinese Calligraphic Characters

ABSTRACT. Chinese calligraphy is one of the finest expressions of Chinese traditional art. But people without domain knowledge of calligraphy can hardly read, appreciate, or learn this art form, because it contains many brush strokes with unique shapes and complicated structural topological relationships. In this paper, we explore text sequence recognition for calligraphy, a challenging task because traditional text recognition algorithms rarely obtain satisfactory results on the varied styles of calligraphy. Based on a trainable neural network, this paper proposes an easy recognition method that combines feature sequence extraction based on the DenseNet (Dense Convolutional Network) model, sequence modeling, and transcription into a consolidated architecture. Compared with previous text recognition algorithms, it has two distinctive properties: it acquires the artistic features of the shapes and structures of calligraphic characters in different styles, and it handles sequences of arbitrary length without character segmentation. Our experimental results show that, in contrast with several common recognition methods, our method demonstrates greater robustness on Chinese calligraphic characters in diverse styles, with a recognition accuracy of 84.70%.

Jialin Wang (Xi'an Jiaotong-Liverpool University, China)
Nan Xiang (Xi'an Jiaotong-Liverpool University, China)
Navjot Kukreja (The University of Liverpool, UK)
Lingyun Yu (Xi'an Jiaotong-Liverpool University, China)
Hai-Ning Liang (Xi'an Jiaotong-Liverpool University, China)
LVDIF: A Framework for Real-Time Interaction with Large Volume Data
PRESENTER: Jialin Wang

ABSTRACT. Interest in real-time volume graphics has grown rapidly in the last few years, driven by increasing demands from both academia and industry. GPU-based volume rendering has been used in a wide variety of fields, including scientific visualization, visual effects, and video games. Similarly, real-time volume editing has been used to build terrain and create visual effects during game development, and has even become an integral part of gameplay in various video games (e.g., Minecraft). Nowadays, as the size of volume data increases, processing large volume data in real time is inevitable in many modern application scenarios. However, manipulation and editing of large volume data pose various challenges, such as dramatically increased memory usage and computational burden. In this work, we propose a framework for interactive manipulation and editing of large volume data that addresses these challenges. A robust and efficient method for generating large signed distance function (SDF) volumes is presented and incorporated into the framework, together with a complete implementation featuring specialized GPU optimizations that demonstrates its usefulness and effectiveness. The framework can serve as easy-to-use middleware or as a plugin that integrates into game engines for the development of various types of applications (e.g., video games). It can also support research that deals with large volume data from a user-centered perspective (e.g., for human-computer interaction researchers).
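A signed distance function volume, the representation at the core of the framework, can be illustrated with an analytic sphere sampled on a regular grid; this is a toy sketch, unrelated to the paper's large-volume generation method:

```python
import numpy as np

def sphere_sdf_volume(res, center, radius):
    """Sample the signed distance of a sphere on a res^3 grid over
    [-1, 1]^3: negative inside, zero on the surface, positive outside."""
    ax = np.linspace(-1.0, 1.0, res)
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    pts = np.stack([x, y, z], axis=-1)
    return np.linalg.norm(pts - center, axis=-1) - radius

vol = sphere_sdf_volume(33, np.zeros(3), 0.5)
# grid centre lies inside the sphere, the corner lies outside
print(vol[16, 16, 16] < 0, vol[0, 0, 0] > 0)  # → True True
```

Editing operations on such volumes (union, subtraction, smoothing) reduce to per-voxel min/max arithmetic, which is what makes them attractive for real-time GPU manipulation.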

15:30-16:00Coffee Break
16:00-17:30 Session TVCJ5-Medical

Zoom Link:        Meeting ID: 816 5475 7227, Password: cgi2023

Liang Wan (College of Intelligence and Computing, Tianjin University, China)
Dongdong Meng (Peking University, China)
Sheng Li (Peking University, China)
Bin Sheng (Shanghai Jiao Tong University, China)
Hao Wu (Peking University Cancer Hospital, China)
Suqing Tian (Peking University Third Hospital, China)
Wenjun Ma (Peking University, China)
Guoping Wang (Peking University, China)
Xueqing Yan (Peking University, China)
3D Reconstruction-Oriented Fully Automatic Multi-modal Tumor Segmentation By Dual Attention-guided VNet
PRESENTER: Dongdong Meng

ABSTRACT. Existing automatic contouring methods for primary nasopharyngeal carcinoma (NPC) and metastatic lymph nodes (MLNs) may suffer from low segmentation accuracy and cannot handle multi-modal images correctly. Furthermore, high inter-patient physiological variations and ineffective multi-modal information fusion pose further difficulties. To address these issues, we present a 3D reconstruction-oriented fully automatic multi-modal segmentation method to delineate primary NPC tumors and MLNs via a dual attention-guided VNet. Specifically, we leverage a physiologically-sensitive feature enhancement (PFE) module that emphasizes long-range spatial context information in tumor regions of interest and thereby copes with the variability resulting from inter-patient characteristics, which can help extract the 3D spatial feature and facilitate the high-quality reconstruction of 3D geometry of tumors. Next, we develop a multi-modal feature aggregation (MFA) module to describe multi-scale modality-aware features, exploring the effective information aggregation of multi-modal images. To the best of our knowledge, this is the first fully automatic, highly accurate segmentation framework of the primary NPC tumors and MLNs on combined CT-MR datasets. Experimental results on clinical medical datasets validate the effectiveness of our segmentation method, and our method outperforms the state-of-the-art methods.

Shiqun Lin (Department of Ophthalmology, Peking Union Medical College Hospital, China)
Anum Masood (Department of Circulation and Medical Imaging, Faculty of Medicine and Health Sc, Norway)
Tingyao Li (Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China)
Gengyou Huang (Shanghai Jiao Tong University, China)
Rongping Dai (Department of Ophthalmology, Peking Union Medical College Hospital, China)
Deep Learning Enabled Automatic Screening of SLE Diseases Using OCT Images
PRESENTER: Anum Masood

ABSTRACT. Optical Coherence Tomography (OCT) is a non-invasive imaging technique that enables the visualization of tissue microstructure in vivo. Recent studies have suggested that OCT can be used for detecting and monitoring retinal changes over time in patients with Systemic Lupus Erythematosus (SLE), an auto-immune disease that causes damage to various organs, including the eye itself. This research work discusses the potential of using OCT as a screening tool for SLE. OCT provides a highly detailed view of the retina, allowing for the detection of subtle changes that may indicate early-stage SLE-related eye disease. The use of OCT as a screening tool may help identify LR (Lupus Erythematosus Retinopathy) and SLE patients and facilitate earlier interventions, ultimately improving patient outcomes. In addition, we have used deep learning for automated screening based on OCT images from SLE patients. We present a novel deep-learning model combining a pre-trained CNN, a multi-scale module, a pooling module, and an FC classifier. Our prediction model for SLE disease has outperformed the state-of-the-art method on the in-house dataset from Peking Union Medical College Hospital. Our model achieved a higher AUC, indicating a high correlation between the ground truth and predicted output. However, further studies are needed to determine the sensitivity and specificity of OCT in detecting SLE and to establish appropriate screening protocols for this patient population.

Shan Huang (Shanghai Jiao Tong University, China)
Xiaohong Liu (Shanghai Jiao Tong University, China)
Huating Li (Department of Endocrinology and Metabolism, Shanghai Jiao Tong UniversityAffiliated Sixth People’s Hospital, China)
Tan Tao (Faculty of Applied Sciences, Macao Polytechnic University, Macao)
Menghan Hu (East China Normal University, China)
Xiaoer Wei (Institute of Diagnostic and Interventional Radiology, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, China)
Tingli Chen (Department of Ophthalmology, Huadong Sanatorium, China)
TransMRSR: Transformer-based Generative Prior for Brain MRI Super-Resolution

ABSTRACT. Magnetic resonance images (MRI) are often acquired with low through-plane resolution to save acquisition time and cost. The poor resolution in one orientation is insufficient to meet the requirements of high resolution for early diagnosis of brain disease and morphometric study. Single image super-resolution (SISR) methods face a large-scale restoration problem when reconstructing thick-slice MRI into high-resolution (HR) isotropic data. In this paper, we propose a novel transformer-based network for brain MRI SR, named TransMRSR, which captures long-range dependencies. TransMRSR consists of three modules: shallow local feature extraction, deep non-local feature capture, and HR image reconstruction. Specifically, we design a U-Net architecture with diverse priors encapsulated in a pre-trained StyleSwin. A channel-wise operation between the outputs of the encoder and decoder combines the low-resolution image information with the rich prior knowledge of large-scale HR MRI. Extensive experiments show that our method achieves superior performance to other SISR methods on both public and private datasets.

Abdulrhman H. Al-Jebrni (Shanghai Jiao Tong University, China)
Saba Ghazanfar Ali (Shanghai Jiao Tong University, China)
Huating Li (Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, China)
Xiao Lin (Shanghai Normal University, China)
Ping Li (The Hong Kong Polytechnic University, Hong Kong)
Younhyun Jung (Gachon University, South Korea)
Jinman Kim (The University of Sydney, Australia)
David Dagan Feng (The University of Sydney, Australia)
Lixin Jiang (Renji Hospital, School of Medicine, Shanghai Jiao Tong University, China)
Jing Du (Renji Hospital, School of Medicine, Shanghai Jiao Tong University, China)
SThy-Net: A Feature Fusion Enhanced Dense-Branched Modules Network for Small Thyroid Nodule Classification from Ultrasound Images

ABSTRACT. Deep learning studies of thyroid nodule classification from ultrasound (US) images have focused mainly on nodules with diameters > 1 cm. However, small thyroid nodules measuring ≤ 1 cm, especially nodules with high-risk stratification, are prevalent in the population but have received little attention; papillary thyroid microcarcinoma (PTMC) is their common malignant type. Additionally, small nodules with high-risk stratification are difficult for physicians to diagnose from US images due to their atypical features. In this work, we propose a small thyroid nodule classification network (SThy-Net) to classify benign and PTMC small thyroid nodules with high-risk stratification from US images. We design two main components, a dense-branched module and a Gaussian-enhanced feature fusion module, to help recognize small thyroid nodules. To our knowledge, this work is the first to address the challenging task of classifying small thyroid nodules using US images. Our SThy-Net achieves an accuracy as high as 87.4%, outperforming five state-of-the-art thyroid nodule diagnosis studies, several state-of-the-art deep learning models, and three radiologists. In terms of visual explainability, our network shows an intuitive feature extraction behaviour consistent with radiologists' US image analysis. The results suggest that our network has the potential to be an affordable tool for radiologists to diagnose small nodules with high-risk stratification in clinical practice.

Rongsheng Wang (Faculty of Applied Sciences, Macao Polytechnic University, China)
Yaofei Duan (Faculty of Applied Sciences, Macao Polytechnic University, China)
Yukun Li (Faculty of Applied Sciences, Macao Polytechnic University, China)
Dashun Zheng (Faculty of Applied Sciences, Macao Polytechnic University, China)
Xiaohong Liu (Faculty of John Hopcroft Center, Shanghai Jiao Tong University, China)
Chan Tong Lam (Faculty of Applied Sciences, Macao Polytechnic University, China)
Tao Tan (Faculty of Applied Sciences, Macao Polytechnic University, China)
PCTMF-Net: Heart Sound Classification with Parallel CNNs-Transformer and Second-Order Spectral Analysis
PRESENTER: Rongsheng Wang

ABSTRACT. Heart disease is a common condition and has become one of the leading causes of death worldwide. The phonocardiogram (PCG) is a safe, painless, and non-invasive test that captures bioacoustic information reflecting the function of the heart by recording the acoustic signal of the patient's heart. Nowadays, based on biosignal processing and artificial intelligence technologies, automated heart sound classification is playing an increasingly important role in clinical applications. In this paper, we propose a new parallel CNNs-transformer network with multi-scale feature context aggregation (PCTMF-Net). It combines the advantages of CNNs and transformers to achieve efficient heart sound classification. In PCTMF-Net, heart sound signal features are first extracted using second-order spectral analysis; a Transformer-based MHTE-4 is designed to encode and aggregate the contextual information, and two CNN feature extractors are designed in parallel with the MHTE-4 to capture hierarchical features. Finally, the feature vectors obtained from the CNNs and the MHTE-4 are fused and fed into a fully connected layer to predict the heart sound classification results. In addition, we validate our approach on two publicly available, mutually exclusive heart sound datasets and conduct extensive experiments and comparisons with existing algorithms under different metrics. The experimental results show that our proposed method achieves 99.36% accuracy on the Yaseen dataset and 93% accuracy on the PhysioNet dataset, surpassing current algorithms in terms of accuracy, recall, and F1-score. The aim of this study is to apply these new techniques and methods to improve the diagnostic accuracy and validity of heart disease assessment for clinical use.
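Second-order spectral analysis of a signal corresponds to its power spectrum, the Fourier transform of the second-order statistic (the autocorrelation). A generic periodogram sketch on a synthetic tone, not the paper's actual feature extractor:

```python
import numpy as np

def power_spectrum(signal, fs):
    """Periodogram: squared magnitude of the real FFT, normalized by
    length. Returns frequencies and power for the positive half-spectrum."""
    n = len(signal)
    spec = np.abs(np.fft.rfft(signal)) ** 2 / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs, spec

fs = 1000                      # 1 kHz sampling rate
t = np.arange(fs) / fs         # one second of samples
sig = np.sin(2 * np.pi * 50 * t)  # a 50 Hz tone standing in for a heart sound
freqs, spec = power_spectrum(sig, fs)
print(freqs[spec.argmax()])  # → 50.0
```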

Wennan Liu (Medical College, Tianjin University, China)
Tianling Liu (College of Intelligence and Computing, Tianjin University, China)
Tong Han (Tianjin Huanhu Hospital, China)
Liang Wan (College of Intelligence and Computing, Tianjin University, China)
Multi-Modal Deep-Fusion Network for Meningioma Presurgical Grading with Integrative Imaging and Clinical Data

ABSTRACT. Predicting meningioma grade before surgery is crucial for therapy planning and prognosis prediction. Prior works mainly investigate traditional classification techniques with hand-crafted features or rely on image data only, and thus have limited accuracy. In this study, we propose a novel multi-modal classification method, the Multi-Modal Deep-Fusion Network (MMDF), integrating high-dimensional 3D MRI imaging information and low-dimensional tabular clinical data to classify low-grade and high-grade meningiomas. Specifically, MMDF adopts two modality-specific branches to extract image and clinical features respectively, and leverages an image-clinical integration module (ICIM) in the shared branch to fuse cross-modal features, while specifically considering the impact of the low-dimensional clinical data. Besides, we propose a multi-modal image feature aggregation module (MIA) to integrate three image modalities in the image-specific branch, which compensates for the feature distribution gaps among the contrast-enhanced T1, contrast-enhanced T2-FLAIR, and ADC modalities. Comprehensive experiments show that our approach significantly outperforms the SOTA methods using imaging data only as well as those combining image and tabular data, with an AUC of 0.958, sensitivity of 0.877, specificity of 0.926, and accuracy of 0.921. Our approach holds high potential to aid radiologists in presurgical evaluation for clinical decision making.

16:00-17:30 Session TVCJ6-GAN

Zoom Link:       Meeting ID: 822 8242 5843, Password: cgi2023

Jin Huang (Wuhan Textile University, China)
Muhammad Mamunur Rashid (School of Computer Science and Engineering, South China University of Technology, China)
Shihao Wu (Capskin AG, Switzerland)
Yongwei Nie (School of Computer Science and Engineering, South China University of Technology, China)
Guiqing Li (School of Computer Science and Engineering, South China University of Technology, China)
High-Fidelity Facial Expression Transfer using Part-based Local-Global Conditional GANs

ABSTRACT. We propose a GAN-based facial expression transfer method. It can transfer the facial expression of a reference subject to the source subject while preserving the source identity attributes such as shape, appearance, and illumination. Our method consists of two GAN-based modules: Parts Generation Networks (PGNs) and a Parts Fusion Network (PFN). Instead of training on the entire image globally, our key idea is to train different PGNs for different local facial parts independently and then fuse the generated parts together using the PFN. To encode the facial expression faithfully, we use a pre-trained parametric 3D head model (called photometric FLAME) to reconstruct realistic head models from both the source and reference images. We also extract 3D facial feature points of the reference image to handle extreme poses and occlusions. Based on the extracted contextual information, we use the PGNs to generate the different parts of the head independently. Finally, the PFN fuses all the parts together to form the final image. Experiments show that the proposed model outperforms state-of-the-art approaches in faithfully transferring facial expressions, especially when the reference image has a different head pose from the source image. Ablation studies demonstrate the power of using PGNs.

Xinrong Hu (Wuhan Textile University, China)
Qing Chang (Wuhan Textile University, China)
Junjie Huang (Wuhan Textile University, China)
Ruiqi Luo (Wuhan Textile University, China)
Bangchao Wang (Wuhan Textile University, China)
Chang Hu (Wuhan Textile University, China)
HSSAN: Hair Synthesis with Style-Guided Spatially Adaptive Normalization on Generative Adversarial Network

ABSTRACT. Hair synthesis plays a crucial role in generating facial images, but the complex textures and varied shapes of hair make it difficult for generative adversarial networks to produce realistic hair in photographs. This paper proposes a novel normalization technique, HSSAN (Hair Style-Guided Spatially Adaptive Normalization), which generates hair in four phases: semanticization, stylization, structuring, and modulation. The hair synthesis generator uses several HSSAN residual blocks in its network framework, while the input modules comprise only an appearance module and a background module. Furthermore, a regularized loss function is introduced to regulate the style vector. The network generates images with realistic hair textures. We performed experiments on the FFHQ dataset and observed that our method generates hair images surpassing existing GAN-based methods in both visual realism and Fréchet Inception Distance (FID).
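The "spatially adaptive" ingredient resembles SPADE-style normalization: features are instance-normalized, then scaled and shifted per pixel by values derived from a guiding map. The sketch below is a minimal illustration of that general mechanism, not the paper's HSSAN; in particular, the source of gamma and beta is an assumption (in a real network both come from small learned convolutions):

```python
import numpy as np

def spatially_adaptive_norm(x, style_map, eps=1e-5):
    """Minimal SPADE-style normalization sketch.

    x:         feature map, shape (C, H, W)
    style_map: per-pixel modulation source, shape (2, H, W); here we
               pretend channel 0 yields gamma and channel 1 yields beta.
    """
    # Instance-normalize: zero mean, unit variance per channel.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    # Spatially varying scale and shift, broadcast over channels.
    gamma, beta = style_map[0], style_map[1]
    return normalized * (1.0 + gamma) + beta
```

With a zero style map this reduces to plain instance normalization, which is a handy sanity check.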

Sifei Li (Institute of Automation, Chinese Academy of Sciences, China)
Fuzhang Wu (Institute of Software, Chinese Academy of Sciences, China)
Yuqing Fan (Institute of Software, Chinese Academy of Sciences, China)
Xue Song (Zhengzhou University, China)
Weiming Dong (Institute of Automation, Chinese Academy of Sciences, China)
PLDGAN: Portrait Line Drawing Generation with Prior Knowledge and Conditioning Target

ABSTRACT. Line drawing, a form of minimalist art, is highly abstract and expressive, with practical use in conveying 3D shapes and indicating object occlusion. Generating line drawings from photos is a challenging task, requiring the compression of rich texture information into sparse geometric elements, such as lines, curves, and circles, without compromising semantic information. Furthermore, a portrait line drawing should include a full human silhouette and the important semantic lines of a scene while avoiding messy lines. To address these challenges, we propose a novel conditional GAN-based portrait line drawing generation method (PLDGAN), which leverages prior knowledge of pose and semantic segmentation information. We also design a conditioning target and adjust the content loss to the original target loss. To train our PLDGAN, we collect a new dataset containing pairwise portrait images and professional portrait line drawings. Our experiments show that our method achieves state-of-the-art performance and can generate high-quality portrait line drawings.

Miaomiao Chen (Dalian Minzu University, China)
Pei Wang (Dalian Minzu University, China)
Pengjie Wang (Dalian Minzu University, China)
Dehai Shang (Dalian Minzu University, China)
Cycle-attention-derain: unsupervised rain removal with CycleGAN
PRESENTER: Dehai Shang

ABSTRACT. Single image deraining is a fundamental task in computer vision, which can greatly improve the performance of subsequent high-level tasks under rainy conditions. Existing data-driven rain removal methods rely heavily on paired training data, which is expensive to collect. In this paper, we propose a new unsupervised method, called Cycle-Derain, which removes rain from single images in the absence of paired data. Our method is based on the CycleGAN framework with two major novelties. First, since rain removal is highly correlated with analyzing the texture features of an input image, we propose a novel attention fusion module (AFM) with complementary channel attention and spatial attention, which can effectively learn more discriminative features for rain removal. Second, to further improve the generalization ability of our model, we propose a global-local attention discriminator architecture with an attention mechanism to guide the network training, so that the rain removal results are realistic both globally and locally. Our proposed model is able to remove rain streaks of varying degrees without paired training images. Extensive experiments on synthetic and real datasets demonstrate that the proposed method outperforms most state-of-the-art unsupervised rain removal methods in terms of both PSNR and SSIM on the Rain800 dataset, and achieves results close to those of popular supervised learning methods.
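Complementary channel and spatial attention of the kind the AFM combines can be sketched as follows. The learned layers (small MLPs and convolutions) are replaced here by simple pooled statistics, so this is an assumed illustration of the general mechanism rather than the paper's module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x):
    """Channel weights from global average pooling (learned MLP omitted)."""
    pooled = x.mean(axis=(1, 2))            # one descriptor per channel: (C,)
    weights = sigmoid(pooled)               # gate each channel into (0, 1)
    return x * weights[:, None, None]

def spatial_attention(x):
    """Per-location weights from channel-wise avg/max maps (learned conv omitted)."""
    avg = x.mean(axis=0)                    # (H, W)
    mx = x.max(axis=0)                      # (H, W)
    weights = sigmoid(avg + mx)             # gate each pixel into (0, 1)
    return x * weights[None, :, :]

def attention_fusion(x):
    """Apply the two complementary attentions in sequence on a (C, H, W) map."""
    return spatial_attention(channel_attention(x))
```

Because each gate lies in (0, 1), the fused output re-weights features without ever amplifying them, which makes the module easy to verify.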

Yu Liu (Department of Systems Engineering, National University of Defense Technology, China)
Zhe Guo (Northwestern Polytechnical University, China)
Haojie Guo (School of Electronics and Information, Northwestern Polytechnical University, China)
Huaxin Xiao (Department of Systems Engineering, National University of Defense Technology, China)
Zoom-GAN: Learn to Colorize Multi-scale Targets

ABSTRACT. In recent years, research on image colorization based on deep learning has made great progress. Most existing methods achieve impressive colorizing performance over the entire region of a given image. However, we notice that the colorizing results of existing methods suffer from color disorder in small target regions and at boundaries. For colorizing multi-scale targets, in this paper we propose a feature scaling network called Zoom-GAN to improve colorizing consistency for small objects and boundaries. Specifically, Zoom-GAN introduces a zoom instance normalization layer that injects scale information into the color features. Meanwhile, a multi-scale structure is adopted in both the generator and the discriminator to improve colorizing performance for various targets. Experiments on three public datasets (Oxford102, Bird100, and Hero) show that our Zoom-GAN achieves state-of-the-art performance on three subjective and objective evaluation metrics.
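A normalization layer that injects scale information might look like the sketch below: standard instance normalization whose affine output is modulated by the target's relative size. The modulation rule and the `scale` argument are assumptions for illustration, not the paper's layer:

```python
import numpy as np

def zoom_instance_norm(x, scale, eps=1e-5):
    """Hypothetical 'zoom' instance normalization sketch.

    x:     feature map, shape (C, H, W)
    scale: target's relative size in (0, 1], e.g. bounding-box area
           divided by image area (an assumed input, for illustration).
    """
    # Standard instance normalization per channel.
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    normalized = (x - mean) / (std + eps)
    # Inject the scale cue: smaller targets get stronger modulation.
    gamma = 1.0 + (1.0 - scale)   # illustrative rule, not from the paper
    return gamma * normalized
```

With `scale = 1.0` (a full-image target) the layer reduces to plain instance normalization, which serves as a sanity check.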