A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.
About me
Home page
KeypointNet
News
This is a page not in the main menu
Side projects
Services
Published in IEEE Transactions on Image Processing, 2018
We propose a deep learning approach for directly estimating relative atmospheric visibility from outdoor photos without relying on weather images or data that require expensive sensing or custom capture. Our data-driven approach capitalizes on a large collection of Internet images to learn rich scene and visibility varieties. The relative CNN-RNN coarse-to-fine model, where CNN stands for convolutional neural network and RNN stands for recurrent neural network, exploits the joint power of relative support vector machine, which has a good ranking representation, and the data-driven deep learning features derived from our novel CNN-RNN model.
Recommended citation: You, Y., Lu, C., Wang, W., & Tang, C. K. (2018). Relative CNN-RNN: Learning relative atmospheric visibility from images. IEEE Transactions on Image Processing, 28(1), 45-55.
Published in Preprint, 2020
Visual semantic correspondence is an important topic in computer vision and could help machines understand objects in our daily life. However, most previous methods train directly on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D space. In this paper, we propose a new method for predicting semantic correspondences by lifting the problem to the 3D domain and then projecting the corresponding 3D models, with their semantic labels, back to the 2D domain. Our method leverages the advantages of 3D vision and can explicitly reason about object self-occlusion and visibility.
Recommended citation: You, Y., Li, C., Lou, Y., Cheng, Z., Ma, L., Lu, C., & Wang, W. (2020). Semantic Correspondence via 2D-3D-2D Cycle. arXiv preprint arXiv:2004.09061.
Published in ECCV, 2020
We observe that people have a consensus on semantic correspondences between two areas from different objects, but are less certain about the exact semantic meaning of each area. Therefore, we argue that by providing human labeled correspondences between different objects from the same category instead of explicit semantic labels, one can recover rich semantic information of an object. In this paper, we introduce a new dataset named CorresPondenceNet. Based on this dataset, we are able to learn dense semantic embeddings with a novel geodesic consistency loss.
Recommended citation: Lou, Y., You, Y., Li, C., Cheng, Z., Li, L., Ma, L., ... & Lu, C. (2020, August). Human Correspondence Consensus for 3D Object Semantic Understanding. In European Conference on Computer Vision (pp. 496-512). Springer, Cham.
Published in AIIDE, 2020
In this paper, we study a special class of popular Asian card games called Dou Di Zhu, in which two adversarial groups of agents must consider numerous card combinations at each time step, leading to a huge number of actions. We propose a novel method to handle combinatorial actions, which we call combinational Q-learning (CQL). We employ a two-stage network to reduce the action space and also leverage order-invariant max-pooling operations to extract relationships between primitive actions.
Recommended citation: You, Y., Li, L., Guo, B., Wang, W., & Lu, C. (2019). Combinational Q-Learning for Dou Di Zhu. arXiv preprint arXiv:1901.08925.
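The order-invariant max-pooling mentioned in the abstract above can be illustrated with a minimal sketch. The function name `max_pool_features` and the toy feature values are hypothetical and not taken from the paper; the sketch only shows that an element-wise maximum over a set of per-action feature vectors does not depend on the order in which they are listed.

```python
from itertools import permutations


def max_pool_features(features):
    """Element-wise maximum over a non-empty list of equal-length feature vectors."""
    if not features:
        raise ValueError("need at least one feature vector")
    return [max(column) for column in zip(*features)]


# Toy features for three primitive actions (values are made up for illustration).
card_features = [
    [0.2, 1.5, -0.3],
    [0.9, 0.1, 0.4],
    [-1.0, 0.7, 2.0],
]

pooled = max_pool_features(card_features)

# Every ordering of the same set yields the same pooled vector.
all_equal = all(
    max_pool_features(list(order)) == pooled
    for order in permutations(card_features)
)
```

Because the pooled vector is the same for every ordering, a network that consumes it cannot be affected by how the cards in a combination happen to be enumerated.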
Published in CVPR, 2020
We present KeypointNet: the first large-scale and diverse 3D keypoint dataset that contains 83,231 keypoints and 8,329 3D models from 16 object categories, by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset.
Recommended citation: You, Y., Lou, Y., Li, C., Cheng, Z., Li, L., Ma, L., ... & Wang, W. (2020). KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13647-13656).
Published in AAAI, 2020
In this paper, we propose a new point-set learning framework named Pointwise Rotation-Invariant Network (PRIN), focusing on achieving rotation-invariance in point clouds. We construct spherical signals by Density-Aware Adaptive Sampling (DAAS) from sparse points and employ Spherical Voxel Convolution (SVC) to extract rotation-invariant features for each point. Our network can be applied to applications ranging from object classification, part segmentation, to 3D feature matching and label alignment.
Recommended citation: You, Y., Lou, Y., Liu, Q., Tai, Y. W., Ma, L., Lu, C., & Wang, W. (2020). Pointwise Rotation-Invariant Network with Adaptive Sampling and 3D Spherical Voxel Convolution. In AAAI (pp. 12717-12724).
Published in CVPR, 2021
In this paper, we propose an unsupervised aligned keypoint detector, Skeleton Merger, which utilizes skeletons to reconstruct objects. It is based on an Autoencoder architecture. The encoder proposes keypoints and predicts activation strengths of edges between keypoints. The decoder performs uniform sampling on the skeleton and refines it into small point clouds with pointwise offsets. Then the activation strengths are applied and the sub-clouds are merged. Composite Chamfer Distance (CCD) is proposed as a distance between the input point cloud and the reconstruction composed of sub-clouds masked by activation strengths.
Recommended citation: Shi, R., Xue, Z., You, Y., & Lu, C. (2021). Skeleton Merger: an Unsupervised Aligned Keypoint Detector. arXiv preprint arXiv:2103.10814.
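For readers unfamiliar with Chamfer-style losses, here is a minimal sketch of the standard symmetric Chamfer distance between two point clouds, using squared Euclidean distances. It is not the Composite Chamfer Distance proposed in the paper, which additionally handles sub-clouds masked by activation strengths; the function name `chamfer_distance` and the toy clouds are assumptions made for illustration.

```python
def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points.

    For every point, take the squared distance to its nearest neighbour in the
    other cloud, average over the cloud, and sum the two directions.
    """
    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(source, target):
        return sum(
            min(squared_distance(p, q) for q in target) for p in source
        ) / len(source)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


# Toy clouds (made-up coordinates): they share one point and differ in the other.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

distance = chamfer_distance(cloud_a, cloud_b)
```

This brute-force version costs O(|A| * |B|) distance evaluations, which is fine for a demonstration but would normally be replaced by a batched or tree-based nearest-neighbour search.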
Published in TPAMI, 2021
Pixel-level 2D object semantic understanding is an important topic in computer vision and could help machines deeply understand objects (e.g., functionality and affordance) in our daily life. However, most previous methods train directly on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D space. In this paper, we propose a new method that predicts the semantics corresponding to an image in the 3D domain and then projects them back onto 2D images to achieve pixel-level understanding. In order to obtain reliable 3D semantic labels that are absent in current image datasets, we build a large-scale keypoint knowledge engine called KeypointNet, which contains 103,450 keypoints and 8,234 3D models from 16 object categories. Our method leverages the advantages of 3D vision and can explicitly reason about object self-occlusion and visibility. We show that our method gives comparable and even superior results on standard semantic benchmarks.
Recommended citation: You, Y., Li, C., Lou, Y., Cheng, Z., Li, L., Ma, L., ... & Lu, C. (2021). Understanding Pixel-level 2D Image Semantics with 3D Keypoint Knowledge Engine. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Published in TPAMI, 2021
Point cloud analysis without pose priors is very challenging in real applications, as the orientations of point clouds are often unknown. In this paper, we propose a brand new point-set learning framework, PRIN (Point-wise Rotation Invariant Network), focusing on rotation-invariant feature extraction in point cloud analysis. We construct spherical signals by Density Aware Adaptive Sampling to deal with distorted point distributions in spherical space. Spherical Voxel Convolution and Point Re-sampling are proposed to extract rotation-invariant features for each point. In addition, we extend PRIN to a sparse version called SPRIN, which directly operates on sparse point clouds. Both PRIN and SPRIN can be applied to tasks ranging from object classification and part segmentation to 3D feature matching and label alignment. Results show that, on the dataset with randomly rotated point clouds, SPRIN demonstrates better performance than state-of-the-art methods without any data augmentation. We also provide thorough theoretical proof and analysis for the point-wise rotation invariance achieved by our methods.
Recommended citation: You, Y., Lou, Y., Shi, R., Liu, Q., Tai, Y. W., Ma, L., ... & Lu, C. (2021). PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features. arXiv preprint arXiv:2102.12093.
Published in CVPR, 2022
In this work, we disentangle the direct offset into Local Canonical Coordinates (LCC), box scales and box orientations. Only LCC and box scales are regressed, while box orientations are generated by a canonical voting scheme. Finally, an LCC-aware back-projection checking algorithm iteratively cuts out bounding boxes from the generated vote maps, eliminating false positives. Our model achieves state-of-the-art performance on challenging large-scale datasets of real point cloud scans, ScanNet and SceneNN, with 11.4 and 5.3 mAP improvement respectively.
Recommended citation: You, Y., Ye, Z., Lou, Y., Li, C., Li, Y. L., Ma, L., ... & Lu, C. (2020). Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes. arXiv preprint arXiv:2011.12001.
Published in CVPR, 2022
In this paper, we tackle the problem of category-level 9D pose estimation in the wild, given a single RGB-D frame. Drawing inspiration from traditional point pair features (PPFs), we design a novel Category-level PPF (CPPF) voting method to achieve accurate, robust and generalizable 9D pose estimation in the wild. To obtain robust pose estimation, we sample numerous point pairs on an object, and for each pair our model predicts the necessary SE(3)-invariant voting statistics on object centers, orientations and scales. A novel coarse-to-fine voting algorithm is proposed to eliminate noisy point pair samples and generate final predictions from the population. To get rid of false positives in the orientation voting process, an auxiliary binary disambiguating classification task is introduced for each sampled point pair. In order to detect objects in the wild, we carefully design our sim-to-real pipeline by training on synthetic point clouds only, unless objects have ambiguous poses in geometry.
Recommended citation:
Published in CVPR, 2022
In this work, we reckon keypoints under an information compression scheme to represent the whole object. Based on this, we propose UKPGAN, an unsupervised 3D keypoint detector where keypoints are detected so that they can reconstruct the original object shape. Two modules, GAN-based keypoint sparsity control and salient information distillation, are proposed to locate those important keypoints. Extensive experiments show that our keypoints preserve the semantic information of objects and align well with human-annotated part and keypoint labels.
Recommended citation: You, Y., Liu, W., Li, Y. L., Wang, W., & Lu, C. (2020). UKPGAN: Unsupervised KeyPoint GANeration. arXiv preprint arXiv:2011.11974.
Published in AAAI, 2023
In this paper, we propose CRIN, the Centrifugal Rotation-Invariant Network. CRIN directly takes the coordinates of points as input and transforms local points into rotation-invariant representations via centrifugal reference frames. Aided by centrifugal reference frames, each point corresponds to a discrete rotation so that the information of rotations can be implicitly stored in point features. Unfortunately, discrete points are far from describing the whole rotation space. We further introduce a continuous distribution for 3D rotations based on points. Furthermore, we propose an attention-based down-sampling strategy to sample points invariant to rotations. Finally, a relation module is adopted to reinforce the long-range dependencies between sampled points and to predict the anchor point for unsupervised rotation estimation. Extensive experiments show that our method achieves rotation invariance and accurately estimates the object rotation. Ablation studies validate the effectiveness of the network design.
Recommended citation: Lou, Y., Ye, Z., You, Y., Jiang, N., Lu, J., Wang, W., ... & Lu, C. (2023). CRIN: Rotation-Invariant Point Cloud Analysis and Rotation Estimation via Centrifugal Reference Frame. arXiv preprint arXiv:2303.03101.
Published in Arxiv, 2023
Object pose estimation constitutes a critical area within the domain of 3D vision. While contemporary state-of-the-art methods that leverage real-world pose annotations have demonstrated commendable performance, the procurement of such real-world training data incurs substantial costs. This paper focuses on a specific setting wherein only 3D CAD models are utilized as a priori knowledge, devoid of any background or clutter information. We introduce a novel method, CPPF++, designed for sim-to-real pose estimation. This method builds upon the foundational point-pair voting scheme of CPPF, reconceptualizing it through a probabilistic lens. To address the challenge of voting collision, we model voting uncertainty by estimating the probabilistic distribution of each point pair within the canonical space. This approach is further augmented by iterative noise filtering, employed to eradicate votes associated with backgrounds or clutters. Additionally, we enhance the context provided by each voting unit by introducing $N$-point tuples. In conjunction with this methodological contribution, we present a new category-level pose estimation dataset, DiversePose 300. This dataset is specifically crafted to facilitate a more rigorous evaluation of current state-of-the-art methods, encompassing a broader and more challenging array of real-world scenarios. Empirical results substantiate the efficacy of our proposed method, revealing a significant reduction in the disparity between simulation and real-world performance.
Recommended citation: You, Y., He, W., Liu, J., Xiong, H., Wang, W., & Lu, C. (2022). CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation. arXiv preprint arXiv:2211.13398.
Published in Arxiv, 2023
Deformable object manipulation stands as one of the most captivating yet formidable challenges in robotics. While previous techniques have predominantly relied on learning latent dynamics through demonstrations, typically represented as either particles or images, there exists a pertinent limitation: acquiring suitable demonstrations, especially for long-horizon tasks, can be elusive. Moreover, basing learning entirely on demonstrations can hamper the model’s ability to generalize beyond the demonstrated tasks. In this work, we introduce a demonstration-free hierarchical planning approach capable of tackling intricate long-horizon tasks without necessitating any training. We employ large language models (LLMs) to articulate a high-level, stage-by-stage plan corresponding to a specified task. For every individual stage, the LLM provides both the tool’s name and the Python code to craft intermediate subgoal point clouds. With the tool and subgoal for a particular stage at our disposal, we present a granular closed-loop model predictive control strategy. This leverages Differentiable Physics with Point-to-Point correspondence (DiffPhysics-P2P) loss in the earth mover distance (EMD) space, applied iteratively. Experimental findings affirm that our technique surpasses multiple benchmarks in dough manipulation, spanning both short and long horizons. Remarkably, our model demonstrates robust generalization capabilities to novel and previously unencountered complex tasks without any preliminary demonstrations. We further substantiate our approach with experimental trials on real-world robotic platforms.
Recommended citation: You, Y., Shen, B., Deng, C., Geng, H., Wang, H., & Guibas, L. (2023). Make a Donut: Language-Guided Hierarchical EMD-Space Planning for Zero-shot Deformable Object Manipulation. arXiv preprint arXiv:2311.02787.
Published in Arxiv, 2023
The goal of motion understanding is to establish a reliable mapping between motion and action semantics, which is a challenging many-to-many problem. An abstract action semantic (e.g., walk forwards) could be conveyed by perceptually diverse motions (walk with arms up or swinging), while a motion could carry different semantics w.r.t. its context and intention. This makes an elegant mapping between them difficult. Previous attempts adopted direct-mapping paradigms with limited reliability. Also, current automatic metrics fail to provide reliable assessments of the consistency between motions and action semantics. We identify the source of these problems as the significant gap between the two modalities. To alleviate this gap, we propose Kinematic Phrases (KP) that take the objective kinematic facts of human motion with proper abstraction, interpretability, and generality characteristics. Based on KP as a mediator, we can unify a motion knowledge base and build a motion understanding system. Meanwhile, KP can be automatically converted from motions and to text descriptions with no subjective bias, inspiring Kinematic Prompt Generation (KPG) as a novel automatic motion generation benchmark. In extensive experiments, our approach shows superiority over other methods. Our code and data will be made publicly available.
Recommended citation: Liu, X., Li, Y. L., Zeng, A., Zhou, Z., You, Y., & Lu, C. (2023). Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases. arXiv preprint arXiv:2310.04189.
Published in Arxiv, 2023
Humans excel at transferring manipulation skills across diverse object shapes, poses, and appearances due to their understanding of semantic correspondences between different instances. To endow robots with a similar high-level understanding, we develop a Distilled Feature Field (DFF) for 3D scenes, leveraging large 2D vision models to distill semantic features from multiview images. While current research demonstrates advanced performance in reconstructing DFFs from dense views, the development of learning a DFF from sparse views is relatively nascent, despite its prevalence in numerous manipulation tasks with fixed cameras. In this work, we introduce SparseDFF, a novel method for acquiring view-consistent 3D DFFs from sparse RGBD observations, enabling one-shot learning of dexterous manipulations that are transferable to novel scenes. Specifically, we map the image features to the 3D point cloud, allowing for propagation across the 3D space to establish a dense feature field. At the core of SparseDFF is a lightweight feature refinement network, optimized with a contrastive loss between pairwise views after back-projecting the image features onto the 3D point cloud. Additionally, we implement a point-pruning mechanism to augment feature continuity within each local neighborhood. By establishing coherent feature fields on both source and target scenes, we devise an energy function that facilitates the minimization of feature discrepancies w.r.t. the end-effector parameters between the demonstration and the target manipulation. We evaluate our approach using a dexterous hand, mastering real-world manipulations on both rigid and deformable objects, and showcase robust generalization in the face of object and scene-context variations.
Recommended citation: Wang, Q., Zhang, H., Deng, C., You, Y., Dong, H., Zhu, Y., & Guibas, L. (2023). SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation. arXiv preprint arXiv:2310.16838.
This repo implements adversarial autoencoders (https://arxiv.org/pdf/1511.05644.pdf) and reproduces the results on MNIST. One difference is that I used the Wasserstein distance instead of the naive GAN loss.
This is a Tensorflow implementation of the paper Collaborative Learning for Deep Neural Networks. I got a 6.09% error rate after 300 epochs, which is slightly different from the paper. Maybe the split point is different from the paper: in my implementation, splitting is done right after the Batch Normalization and ReLU of the transition layers, while it is not clear whether they split before, after, or inside the transition layers. Besides, in my implementation, gradients pass through the soft label targets (notation “q” in the paper).
This is a Pytorch implementation of the paper Learning Latent Subspaces in Variational Autoencoders. It reproduced the experiment on the Swiss Roll toy data.
This is a tower-defense game written in Unity, a long time ago. It may not be compatible with recent Unity versions.
This is a Tensorflow implementation of the famous Cycle-GAN described in the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. I modified it by adding the Wasserstein distance to make the training more stable.
This is an unofficial Python implementation of the deblurring algorithm described in Two-Phase Kernel Estimation for Robust Motion Deblurring, ECCV 2010.
This is a Matlab implementation of Kaiming He’s famous paper Single Image Haze Removal Using Dark Channel Prior.
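As a rough illustration of the quantity the dark channel prior is built on, the sketch below computes the dark channel of a tiny image: the minimum over the colour channels, followed by a minimum over a local square patch. It is written in plain Python rather than Matlab, covers only this one step and not the full haze-removal pipeline, and the function name `dark_channel`, the `radius` parameter and the toy pixel values are assumptions made for illustration.

```python
def dark_channel(image, radius=1):
    """Dark channel of an H x W image given as nested lists of (r, g, b) tuples.

    Each output value is the smallest channel intensity found inside the
    (2 * radius + 1)-wide square patch centred on that pixel, clipped at the
    image borders.
    """
    height, width = len(image), len(image[0])
    # Per-pixel minimum across the colour channels.
    min_rgb = [[min(pixel) for pixel in row] for row in image]
    dark = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            row.append(min(min_rgb[j][i] for j in ys for i in xs))
        dark.append(row)
    return dark


# Toy 3 x 3 image with made-up intensities in [0, 1].
image = [
    [(0.9, 0.8, 0.7), (0.6, 0.9, 0.8), (0.9, 0.9, 0.9)],
    [(0.8, 0.8, 0.8), (0.7, 0.7, 0.9), (0.9, 0.5, 0.9)],
    [(0.9, 0.9, 0.9), (0.8, 0.9, 0.9), (0.2, 0.9, 0.9)],
]

dark = dark_channel(image, radius=1)
```

With `radius=0` the patch is a single pixel, so the result reduces to the per-pixel channel minimum.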
This is a Tensorflow implementation that trains an RL agent on Doom with the A3C training strategy.
This is an iOS port of Direct Sparse Odometry (https://github.com/JakobEngel/dso). You can press “toggle” to switch between depth images, RGB images and the point cloud.
This repo implements a simple 3D geometric processing library, including some famous algorithms like geodesic distance computation and Point Pair Features (PPF) computation with CUDA.
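To make the Point Pair Feature concrete, here is a minimal CPU sketch of the classic four-dimensional PPF of two oriented points: the distance between the points and three angles involving the two normals and the connecting vector. It is plain Python, not the CUDA code in the repo, and the names `point_pair_feature`, `angle` and `rotate_z90` are hypothetical helpers introduced for this example.

```python
import math


def angle(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norms)))


def point_pair_feature(p1, n1, p2, n2):
    """Return (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)) with d = p2 - p1."""
    d = [b - a for a, b in zip(p1, p2)]
    length = math.sqrt(sum(c * c for c in d))
    if length == 0.0:
        raise ValueError("the two points must be distinct")
    return (length, angle(n1, d), angle(n2, d), angle(n1, n2))


def rotate_z90(v):
    """Rotate a 3D vector by 90 degrees about the z axis."""
    x, y, z = v
    return (-y, x, z)


p1, n1 = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
p2, n2 = (1.0, 0.0, 1.0), (1.0, 0.0, 0.0)

feature = point_pair_feature(p1, n1, p2, n2)
rotated = point_pair_feature(
    rotate_z90(p1), rotate_z90(n1), rotate_z90(p2), rotate_z90(n2)
)
```

The feature depends only on relative geometry, so rotating both oriented points together leaves it unchanged, which is what makes PPFs useful for pose-invariant matching.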
This is a C++ implementation of the famous image segmentation algorithm described in GrabCut: interactive foreground extraction using iterated graph cuts.
This repo is a neat and precise implementation of KdNet (Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models) in Tensorflow with Tensorpack.
This is a Mahjong game with a GUI. I also implemented Monte Carlo tree search for the AI opponents.
This is a tiny Mask R-CNN implemented in Tensorflow in only 700+ lines, thanks to “tf.map_fn”.
This is a C++ implementation of the Material Point Method, with the hybrid Particle-to-Grid and Grid-to-Particle process.
This is a Tensorflow implementation of the paper Multi-stage Reinforcement Learning For Object Detection.
This is a C++ implementation of the paper Surface Simplification Using Quadric Error Metrics. It could reduce the 3D model size by merging vertices and faces.
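The core quantity in quadric-based simplification can be sketched in a few lines: each plane contributes a 4 x 4 quadric, the quadrics of the planes around a vertex are summed, and the error of a candidate position is its sum of squared distances to those planes. The sketch below is plain Python rather than the repo's C++, and the function names `plane_quadric`, `add_quadrics` and `quadric_error` are assumptions made for illustration; planes are given as (a, b, c, d) with a unit normal (a, b, c).

```python
def plane_quadric(a, b, c, d):
    """4 x 4 quadric p p^T for the plane a*x + b*y + c*z + d = 0 (unit normal)."""
    p = (a, b, c, d)
    return [[pi * pj for pj in p] for pi in p]


def add_quadrics(q1, q2):
    """Element-wise sum of two 4 x 4 quadrics."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(q1, q2)]


def quadric_error(quadric, vertex):
    """Evaluate v^T Q v for the homogeneous vertex v = (x, y, z, 1)."""
    v = (*vertex, 1.0)
    return sum(v[i] * quadric[i][j] * v[j] for i in range(4) for j in range(4))


# Two planes meeting along the y axis: z = 0 and x = 0.
quadric = add_quadrics(
    plane_quadric(0.0, 0.0, 1.0, 0.0),
    plane_quadric(1.0, 0.0, 0.0, 0.0),
)

# Squared distance 3^2 to the plane z = 0 plus 2^2 to the plane x = 0.
error_off_edge = quadric_error(quadric, (2.0, 5.0, 3.0))
# A point on the shared edge lies on both planes.
error_on_edge = quadric_error(quadric, (0.0, 7.0, 0.0))
```

In a full simplifier, the summed quadrics of an edge's two endpoints are used to rank candidate edge contractions by the error of the merged vertex; that bookkeeping is omitted here.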
This is a Pytorch implementation of the paper Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution.
This repo (partially) implements patch match stereo algorithm described in PatchMatch Stereo - Stereo Matching with Slanted Support Windows.
An Open-Source Deep Learning Framework for 3D Perception (including classification, segmentation, keypoint detection, registration, shape matching, etc.). The documentation (beta) can be built in docs folder. This framework is inspired by mmdetection3d, but with a better configuration system.
This is an unofficial Python implementation of PMVS, described in Accurate, dense, and robust multi-view stereopsis, PAMI 2010. The implementation is slow (no multi-threading or CUDA) and is intended only for study and illustration purposes.
Recently, various methods have applied transformers to point clouds: PCT: Point Cloud Transformer (Meng-Hao Guo et al.), Point Transformer (Nico Engel et al.), Point Transformer (Hengshuang Zhao et al.). This repo is a Pytorch implementation of these methods and aims to compare them under a fair setting. Currently, all three methods are implemented, though their hyperparameters are still being tuned.
This is an unofficial PyTorch implementation of QENet, based on the paper Quaternion Equivariant Capsule Networks for 3D Point Clouds. However, it fails to converge for some unknown reason. I am still investigating the issue…
This is a naive tiny C++ renderer that is supposed to work on all platforms, based on the famous PBRT book.
This is a Pytorch implementation of the paper A Neural Representation of Sketch Drawings. I use WGAN to mimic the data distribution of sketch drawings, where the key difference from the original sketch-rnn is in the reparameterization of GMM.
This project implements real-time indoor object segmentation and 3D reconstruction. We use a fine-tuned MaskRCNN to perform instance segmentation for 51 different objects and build the 3D model by Truncated Signed Distance Function (TSDF) volume reconstruction with the semantics predicted by MaskRCNN. For now, there are two steps to execute the pipeline. First, download datasets from the RGB-D SLAM datasets and use mask_process.py to generate mask images for the specific datasets. Second, change the configuration in kernel.cpp to execute TSDF.
This is a C++ implementation of the paper Segmentation as selective search for object recognition. It reproduced the image segmentation result on Lena.
This is part of my graduation project. It can track multiple deformable images at the same time at above 30 FPS. The idea is based on the paper Template-based Monocular 3D Shape Recovery using Laplacian Meshes but with some performance improvements.
This is a Tensorflow (v1) implementation of the Trust Region Proximal Optimization method. It is built purely on Tensorflow (v1) graphs and encapsulated as a separate optimizer. You only need to pass the policy function and the cost function to the optimizer and create the cache variables.
This is a Pytorch implementation of Google World Models.