Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Page Not Found

Page not found. Your pixels are in another canvas.

Jupyter notebook markdown generator

Posts

Future Blog Post

less than 1 minute read

Published: January 01, 2199

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

less than 1 minute read

Published: August 14, 2015

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Headings are cool

You can have many headings

Aren’t headings cool?

Blog Post number 3

less than 1 minute read

Published: August 14, 2014

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Headings are cool

You can have many headings

Aren’t headings cool?

Blog Post number 2

less than 1 minute read

Published: August 14, 2013

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Headings are cool

You can have many headings

Aren’t headings cool?

Blog Post number 1

less than 1 minute read

Published: August 14, 2012

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Headings are cool

You can have many headings

Aren’t headings cool?

portfolio

Portfolio item number 1

Short description of portfolio item number 1

Portfolio item number 2

Short description of portfolio item number 2

projects

Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes

Cppf++

CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation

CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild

Img2cad

Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization

Pace

PACE: Pose Annotations in Cluttered Environments

UKPGAN: A General Self-Supervised Keypoint Detector

Unipose9d

UniPose9D: Universal Category Agnostic Object Pose Estimation

publications

Relative CNN-RNN: Learning relative atmospheric visibility from images

Published in IEEE Transactions on Image Processing, 2018

We propose a deep learning approach for directly estimating relative atmospheric visibility from outdoor photos without relying on weather images or data that require expensive sensing or custom capture. Our data-driven approach capitalizes on a large collection of Internet images to learn rich scene and visibility varieties. The relative CNN-RNN coarse-to-fine model, where CNN stands for convolutional neural network and RNN stands for recurrent neural network, exploits the joint power of relative support vector machine, which has a good ranking representation, and the data-driven deep learning features derived from our novel CNN-RNN model.

Recommended citation: You, Y., Lu, C., Wang, W., & Tang, C. K. (2018). Relative CNN-RNN: Learning relative atmospheric visibility from images. IEEE Transactions on Image Processing, 28(1), 45-55.

Semantic Correspondence via 2D-3D-2D Cycle

Published in Preprint, 2020

Visual semantic correspondence is an important topic in computer vision and could help machines understand objects in our daily life. However, most previous methods directly train on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D spaces. In this paper, we propose a new method for predicting semantic correspondences by lifting the problem to the 3D domain and then projecting the corresponding 3D models, with their semantic labels, back to the 2D domain. Our method leverages the advantages of 3D vision and can explicitly reason about object self-occlusion and visibility.

Recommended citation: You, Y., Li, C., Lou, Y., Cheng, Z., Ma, L., Lu, C., & Wang, W. (2020). Semantic Correspondence via 2D-3D-2D Cycle. arXiv preprint arXiv:2004.09061.

Human Correspondence Consensus for 3D Object Semantic Understanding

Published in ECCV, 2020

We observe that people have a consensus on semantic correspondences between two areas from different objects, but are less certain about the exact semantic meaning of each area. Therefore, we argue that by providing human labeled correspondences between different objects from the same category instead of explicit semantic labels, one can recover rich semantic information of an object. In this paper, we introduce a new dataset named CorresPondenceNet. Based on this dataset, we are able to learn dense semantic embeddings with a novel geodesic consistency loss.

Recommended citation: Lou, Y., You, Y., Li, C., Cheng, Z., Li, L., Ma, L., ... & Lu, C. (2020, August). Human Correspondence Consensus for 3D Object Semantic Understanding. In European Conference on Computer Vision (pp. 496-512). Springer, Cham.

Combinational Q-Learning for Dou Di Zhu

Published in AIIDE, 2020

In this paper, we study a special class of popular Asian card games called Dou Di Zhu, in which two adversarial groups of agents must consider numerous card combinations at each time step, leading to a huge number of actions. We propose a novel method to handle combinatorial actions, which we call combinational Q-learning (CQL). We employ a two-stage network to reduce the action space and also leverage order-invariant max-pooling operations to extract relationships between primitive actions.

Recommended citation: You, Y., Li, L., Guo, B., Wang, W., & Lu, C. (2019). Combinational Q-Learning for Dou Di Zhu. arXiv preprint arXiv:1901.08925.

KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations

Published in CVPR, 2020

We present KeypointNet: the first large-scale and diverse 3D keypoint dataset that contains 83,231 keypoints and 8,329 3D models from 16 object categories, by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset.

Recommended citation: You, Y., Lou, Y., Li, C., Cheng, Z., Li, L., Ma, L., ... & Wang, W. (2020). KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13647-13656).

Pointwise Rotation-Invariant Network with Adaptive Sampling and 3D Spherical Voxel Convolution

Published in AAAI, 2020

In this paper, we propose a new point-set learning framework named Pointwise Rotation-Invariant Network (PRIN), focusing on achieving rotation-invariance in point clouds. We construct spherical signals by Density-Aware Adaptive Sampling (DAAS) from sparse points and employ Spherical Voxel Convolution (SVC) to extract rotation-invariant features for each point. Our network can be applied to applications ranging from object classification, part segmentation, to 3D feature matching and label alignment.

Recommended citation: You, Y., Lou, Y., Liu, Q., Tai, Y. W., Ma, L., Lu, C., & Wang, W. (2020). Pointwise Rotation-Invariant Network with Adaptive Sampling and 3D Spherical Voxel Convolution. In AAAI (pp. 12717-12724).

Skeleton Merger, an Unsupervised Aligned Keypoint Detector

Published in CVPR, 2021

In this paper, we propose an unsupervised aligned keypoint detector, Skeleton Merger, which utilizes skeletons to reconstruct objects. It is based on an Autoencoder architecture. The encoder proposes keypoints and predicts activation strengths of edges between keypoints. The decoder performs uniform sampling on the skeleton and refines it into small point clouds with pointwise offsets. Then the activation strengths are applied and the sub-clouds are merged. Composite Chamfer Distance (CCD) is proposed as a distance between the input point cloud and the reconstruction composed of sub-clouds masked by activation strengths.
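The abstract above builds on the Chamfer distance between point clouds. As a rough illustration only, here is a minimal sketch of the plain symmetric Chamfer distance, not the Composite Chamfer Distance proposed in the paper. It assumes one common convention (unsquared Euclidean distances, averaged per set and summed over both directions); the function name `chamfer_distance` is hypothetical.

```python
import math


def chamfer_distance(a, b):
    """Plain symmetric Chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbour distance from a to b,
    plus the same quantity from b to a (brute force, O(len(a) * len(b))).
    """
    def one_way(src, dst):
        # Average distance from each source point to its nearest target point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # same cloud moved by 1 along z

print(chamfer_distance(cloud, cloud))    # identical clouds give 0.0
print(chamfer_distance(cloud, shifted))  # each direction contributes 1.0
```

Other conventions (squared distances, a max instead of a sum) are also in use, so the numbers are only comparable within one convention.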

Recommended citation: Shi, R., Xue, Z., You, Y., & Lu, C. (2021). Skeleton Merger: an Unsupervised Aligned Keypoint Detector. arXiv preprint arXiv:2103.10814.

Understanding Pixel-level 2D Image Semantics with 3D Keypoint Knowledge Engine

Published in TPAMI, 2021

Pixel-level 2D object semantic understanding is an important topic in computer vision and could help machines deeply understand objects (e.g. functionality and affordance) in our daily life. However, most previous methods directly train on correspondences in 2D images, which is end-to-end but loses plenty of information in 3D spaces. In this paper, we propose a new method that predicts the semantics corresponding to an image in the 3D domain and then projects them back onto 2D images to achieve pixel-level understanding. In order to obtain reliable 3D semantic labels that are absent in current image datasets, we build a large-scale keypoint knowledge engine called KeypointNet, which contains 103,450 keypoints and 8,234 3D models from 16 object categories. Our method leverages the advantages of 3D vision and can explicitly reason about object self-occlusion and visibility. We show that our method gives comparable and even superior results on standard semantic benchmarks.

Recommended citation: You, Y., Li, C., Lou, Y., Cheng, Z., Li, L., Ma, L., ... & Lu, C. (2021). Understanding Pixel-level 2D Image Semantics with 3D Keypoint Knowledge Engine. IEEE Transactions on Pattern Analysis and Machine Intelligence.

PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features

Published in TPAMI, 2021

Point cloud analysis without pose priors is very challenging in real applications, as the orientations of point clouds are often unknown. In this paper, we propose a brand new point-set learning framework PRIN, namely, Point-wise Rotation Invariant Network, focusing on rotation invariant feature extraction in point cloud analysis. We construct spherical signals by Density Aware Adaptive Sampling to deal with distorted point distributions in spherical space. Spherical Voxel Convolution and Point Re-sampling are proposed to extract rotation invariant features for each point. In addition, we extend PRIN to a sparse version called SPRIN, which directly operates on sparse point clouds. Both PRIN and SPRIN can be applied to tasks ranging from object classification and part segmentation to 3D feature matching and label alignment. Results show that, on the dataset with randomly rotated point clouds, SPRIN demonstrates better performance than state-of-the-art methods without any data augmentation. We also provide thorough theoretical proof and analysis for point-wise rotation invariance achieved by our methods.

Recommended citation: You, Y., Lou, Y., Shi, R., Liu, Q., Tai, Y. W., Ma, L., ... & Lu, C. (2021). PRIN/SPRIN: On Extracting Point-wise Rotation Invariant Features. arXiv preprint arXiv:2102.12093.

Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes

Published in CVPR, 2022

In this work, we disentangle the direct offset into Local Canonical Coordinates (LCC), box scales and box orientations. Only LCC and box scales are regressed while box orientations are generated by a canonical voting scheme. Finally, an LCC-aware back-projection checking algorithm iteratively cuts out bounding boxes from the generated vote maps, with the elimination of false positives. Our model achieves state-of-the-art performance on challenging large-scale datasets of real point cloud scans, ScanNet and SceneNN, with 11.4 and 5.3 mAP improvement respectively.

Recommended citation: You, Y., Ye, Z., Lou, Y., Li, C., Li, Y. L., Ma, L., ... & Lu, C. (2020). Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes. arXiv preprint arXiv:2011.12001.

CPPF: Towards Robust Category-Level 9D Pose Estimation in the Wild

Published in CVPR, 2022

This paper addresses category-level 9D pose estimation in the wild using a single RGB-D frame. Inspired by traditional point pair features (PPFs), we introduce a novel Category-level PPF (CPPF) voting method for accurate, robust, and generalizable 9D pose estimation. Our approach samples numerous point pairs on an object, predicting SE(3)-invariant voting statistics for object centers, orientations, and scales. We propose a coarse-to-fine voting algorithm to filter out noisy samples and refine predictions. An auxiliary binary classification task helps eliminate false positives in orientation voting. To ensure robustness, our sim-to-real pipeline trains on synthetic point clouds, except for geometrically ambiguous objects.
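The traditional point pair feature mentioned above can be sketched as follows. This is a minimal illustration of the classical four-dimensional feature (pair distance plus three angles) for two oriented points; it does not show CPPF's learned voting. The function name `point_pair_feature` and the helpers are hypothetical, and the sketch assumes distinct points and non-zero normals.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def _angle(a, b):
    # Angle in [0, pi] between two non-zero vectors; clamp guards rounding.
    cosine = _dot(a, b) / (_norm(a) * _norm(b))
    return math.acos(max(-1.0, min(1.0, cosine)))


def point_pair_feature(p1, n1, p2, n2):
    """Classical PPF: (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = _sub(p2, p1)
    return (_norm(d), _angle(n1, d), _angle(n2, d), _angle(n1, n2))


# Two points on a flat surface, one unit apart, both normals pointing up.
feature = point_pair_feature((0, 0, 0), (0, 0, 1), (1, 0, 0), (0, 0, 1))
print(feature)
```

Because the feature only uses distances and angles, it does not change when the same rigid rotation and translation are applied to both oriented points.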

Recommended citation:

UKPGAN: Unsupervised KeyPoint GANeration

Published in CVPR, 2022

In this work, we reckon keypoints under an information compression scheme to represent the whole object. Based on this, we propose UKPGAN, an unsupervised 3D keypoint detector where keypoints are detected so that they could reconstruct the original object shape. Two modules, GAN-based keypoint sparsity control and salient information distillation, are proposed to locate those important keypoints. Extensive experiments show that our keypoints preserve the semantic information of objects and align well with human annotated part and keypoint labels.

Recommended citation: You, Y., Liu, W., Li, Y. L., Wang, W., & Lu, C. (2020). UKPGAN: Unsupervised KeyPoint GANeration. arXiv preprint arXiv:2011.11974.

CRIN: Rotation-Invariant Point Cloud Analysis and Rotation Estimation via Centrifugal Reference Frame

Published in AAAI, 2023

In this paper, we propose CRIN, namely the Centrifugal Rotation-Invariant Network. CRIN directly takes the coordinates of points as input and transforms local points into rotation-invariant representations via centrifugal reference frames. Aided by centrifugal reference frames, each point corresponds to a discrete rotation so that the information of rotations can be implicitly stored in point features. Unfortunately, discrete points are far from describing the whole rotation space. We further introduce a continuous distribution for 3D rotations based on points. Furthermore, we propose an attention-based down-sampling strategy to sample points invariant to rotations. Finally, a relation module is adopted to reinforce the long-range dependencies between sampled points and to predict the anchor point for unsupervised rotation estimation. Extensive experiments show that our method achieves rotation invariance and accurately estimates the object rotation. Ablation studies validate the effectiveness of the network design.

Recommended citation: Lou, Y., Ye, Z., You, Y., Jiang, N., Lu, J., Wang, W., ... & Lu, C. (2023). CRIN: Rotation-Invariant Point Cloud Analysis and Rotation Estimation via Centrifugal Reference Frame. arXiv preprint arXiv:2303.03101.

SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation

Published in ICLR, 2024

Humans excel at transferring manipulation skills across diverse objects due to their understanding of semantic correspondences. To give robots similar abilities, we develop a method for acquiring view-consistent 3D Distilled Feature Fields (DFF) from sparse RGBD observations. Our approach, SparseDFF, maps image features to 3D point clouds, creating a dense feature field for one-shot learning of dexterous manipulations transferable to new scenes. The core of SparseDFF is a lightweight feature refinement network, optimized with contrastive loss between pairwise views. We also use point-pruning to enhance feature continuity. Evaluations show our method enables robust manipulations of rigid and deformable objects, demonstrating strong generalization to varying objects and scene contexts.

Recommended citation: Wang, Q., Zhang, H., Deng, C., You, Y., Dong, H., Zhu, Y., & Guibas, L. (2023). SparseDFF: Sparse-View Feature Distillation for One-Shot Dexterous Manipulation. ICLR 2024.

Primitive-based 3D Human-Object Interaction Modelling and Programming

Published in AAAI, 2024

Embedding Human and Articulated Object Interaction (HAOI) in 3D is crucial for understanding human activities. Unlike previous works using parametric and CAD models, we propose a novel approach using 3D geometric primitives to encode both humans and objects. In our paradigm, humans and objects are compositions of primitives, enabling mutual information learning between limited 3D human data and various object categories. We choose superquadrics as our primitive representation for their simplicity and rich information. We introduce a new 3D HAOI benchmark with primitives and their images and propose a task for machines to recover 3D HAOI from images. Additionally, we provide a baseline for single-view 3D reconstruction on HAOI, paving the way for future 3D HAOI research.

Recommended citation: Liu, S., Li, Y. L., Fang, Z., Liu, X., You, Y., & Lu, C. (2023). Primitive-based 3D Human-Object Interaction Modelling and Programming. arXiv preprint arXiv:2312.10714.

CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation

Published in TPAMI, 2024

Object pose estimation is crucial in 3D vision, but real-world data collection is costly. This paper presents CPPF++, a new sim-to-real pose estimation method using only 3D CAD models. CPPF++ enhances the point-pair voting scheme with probabilistic modeling to address voting collision and iterative noise filtering. We introduce N-point tuples for richer voting context and a new dataset, DiversePose 300, to test current methods in diverse scenarios. Our results show CPPF++ significantly reduces the gap between simulation and real-world performance.

Recommended citation: You, Y., He, W., Liu, J., Xiong, H., Wang, W., & Lu, C. (2022). CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation. TPAMI 2024.

RPMArt: Towards Robust Perception and Manipulation for Articulated Objects

Published in IROS, 2024

Articulated objects are common in daily life, requiring robots to have robust perception and manipulation skills. Current methods struggle with noise in point clouds and bridging the gap between simulation and reality. We propose RPMArt, a framework for robust perception and manipulation of articulated objects, learning to estimate articulation parameters from noisy point clouds. Our main contribution, RoArtNet, predicts joint parameters and affordable points using local feature learning and point tuple voting. An articulation-aware classification scheme enhances sim-to-real transfer. RPMArt achieves state-of-the-art performance in both noise-added simulations and real-world environments.

Recommended citation: Wang, J., Liu, W., Yu, Q., You, Y., Liu, L., Wang, W., & Lu, C. (2024). RPMArt: Towards Robust Perception and Manipulation for Articulated Objects. arXiv preprint arXiv:2403.16023.

Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases

Published in ECCV, 2024

Motion understanding aims to map motion to action semantics, but the variability in both makes this challenging. Abstract actions like ‘walk forwards’ can be conveyed by diverse motions, while a single motion can have different meanings depending on context. Previous direct-mapping methods are unreliable, and current metrics fail to consistently assess motion-semantics alignment. To bridge this gap, we propose Kinematic Phrases (KP), which abstractly and objectively represent human motion with interpretability and generality. Using KP, we unify a motion knowledge base and build a motion understanding system. KP also enables Kinematic Prompt Generation (KPG), a novel benchmark for automatic motion generation. Experiments show our approach outperforms others, and we plan to release our code and data publicly.

Recommended citation: Liu, X., Li, Y. L., Zeng, A., Zhou, Z., You, Y., & Lu, C. (2023). Bridging the Gap between Human Motion and Action Semantics via Kinematic Phrases. arXiv preprint arXiv:2310.04189.

PACE: Pose Annotations in Cluttered Environments

Published in ECCV, 2024

Pose estimation is vital for tracking and manipulating objects in images or videos. Existing datasets lack a focus on cluttered scenes with occlusions, hindering real-world application development. To address this, we introduce PACE (Pose Annotations in Cluttered Environments), a large-scale benchmark for evaluating pose estimation methods in cluttered scenarios. PACE includes 54,945 frames with 257,673 annotations across 300 videos, featuring 576 objects from 44 categories. An innovative annotation system with a 3-camera setup was developed for efficient real-world data annotation.

ProvNeRF: Modeling per Point Provenance in NeRFs as a Stochastic Process

Published in NeurIPS, 2024

Neural radiance fields (NeRFs) excel at 3D scene reconstruction but struggle with sparse, unconstrained camera views. ProvNeRF addresses this by modeling per-point provenance as a stochastic process, enabling improved uncertainty estimation, criteria-based view optimization, and enhanced novel view synthesis. Our method, compatible with any pre-trained NeRF, extends IMLE for stochastic processes, leveraging training camera poses to enrich the 3D point information.

Recommended citation: Nakayama, G.K., Uy, M.A., You, Y., Li, K., & Guibas, L. (2024). ProvNeRF: Modeling per Point Provenance in NeRFs as a Stochastic Process. NeurIPS 2024.

Make a Donut: Language-Guided Hierarchical EMD-Space Planning for Zero-shot Deformable Object Manipulation

Published in RA-L, IROS, 2025

Deformable object manipulation is a challenging area in robotics, often relying on demonstrations to learn task dynamics. However, obtaining suitable demonstrations for long-horizon tasks is difficult and can limit model generalization. We propose a demonstration-free hierarchical planning approach for complex long-horizon tasks without training. Using large language models (LLMs), we create a high-level, stage-by-stage plan for a task, specifying tools and generating Python code for intermediate subgoal point clouds. With these, we employ a closed-loop model predictive control strategy using Differentiable Physics with Point-to-Point correspondence (DiffPhysics-P2P) loss in the earth mover distance (EMD) space. Our method outperforms benchmarks in dough manipulation tasks and generalizes well to novel tasks without demonstrations, validated through real-world robotic experiments.

Recommended citation: You, Y., Shen, B., Deng, C., Geng, H., Wang, H., & Guibas, L. (2023). Make a Donut: Language-Guided Hierarchical EMD-Space Planning for Zero-shot Deformable Object Manipulation. arXiv preprint arXiv:2311.02787.

Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Published in ICLR, 2025

This work evaluates and improves the 3D awareness of Vision Transformer (ViT)-based models, showing that enhancing 3D equivariance in their semantic embeddings leads to better performance in tasks like pose estimation and tracking. The authors propose a simple finetuning strategy based on 3D correspondences, demonstrating substantial improvements with minimal finetuning on a single object.

Recommended citation: You, Y., Li, Y., Deng, C., Wang, Y., & Guibas, L. (2024). Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning. arXiv preprint arXiv:2411.19458.

AllTracker: Efficient Dense Point Tracking at High Resolution

Published in ICCV, 2025

We introduce AllTracker: a model that estimates long-range point tracks by computing flow fields between a query frame and every other frame of a video. Unlike existing methods, our approach delivers high-resolution, dense correspondence fields that can track hundreds of frames at once. The model uses an efficient architecture with iterative inference on low-resolution grids, combining 2D convolutions for spatial propagation and pixel-aligned attention for temporal propagation. With only 16 million parameters, it achieves state-of-the-art point tracking accuracy at high resolution (768x1024 pixels) and can be trained on diverse datasets for optimal performance.

Recommended citation: Adam W. Harley, Yang You, Xinglong Sun, Yang Zheng, Nikhil Raghuraman, Yunqi Gu, Sheldon Liang, Wen-Hsuan Chu, Achal Dave, Pavel Tokmakov, Suya You, Rares Ambrus, Katerina Fragkiadaki, Leonidas J. Guibas. (2025). AllTracker: Efficient Dense Point Tracking at High Resolution. ICCV 2025.

ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly

Published in CoRL, 2025

ARCH proposes a hierarchical modular approach for long-horizon, high-precision robotic assembly in contact-rich settings. It employs a hierarchical planning framework, including a low-level primitive library of parameterized skills and a high-level policy learned via IL. ARCH generalizes well to unseen objects and outperforms baseline methods in terms of success rate and data efficiency.

Recommended citation: Sun, J., Curtis, A., You, Y., Xu, Y., Koehle, M., Chen, Q., Huang, S., Guibas, L., Chitta, S., Schwager, M., & Li, H. (2025). ARCH: Hierarchical Hybrid Learning for Long-Horizon Contact-Rich Robotic Assembly. CoRL 2025.

Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization

Published in SIGGRAPH Asia, 2025

Img2CAD introduces a novel approach for reconstructing 3D CAD models from single-view images. Leveraging large vision-language models (VLMs) like GPT-4V for semantic guidance, and TrAssembler, a transformer-based network, for continuous attribute prediction, our method achieves accurate and editable CAD outputs from common image inputs. We also provide a newly curated dataset, CAD-ified from ShapeNet, covering diverse everyday objects.

Recommended citation: You, Y., Uy, M.A., Han, J., Thomas, R., Zhang, H., Du, Y., Chen, H., Engelmann, F., You, S., & Guibas, L. (2025). Img2CAD: Reverse Engineering 3D CAD Models from Images through VLM-Assisted Conditional Factorization. SIGGRAPH Asia 2025.

Robot Learning from Any Images

Published in CoRL, 2025

RoLA is a framework that transforms any in-the-wild image into an interactive, physics-enabled robotic environment. It operates directly on a single image without requiring additional hardware or digital assets. RoLA democratizes robotic data generation by producing massive visuomotor robotic demonstrations within minutes from a wide range of image sources.

Recommended citation: Zhao, S., Mao, J., Chow, W., Shangguan, Z., Shi, T., Xue, R., Zheng, Y., Weng, Y., You, Y., Seita, D., Guibas, L., Zakharov, S., Guizilini, V., & Wang, Y. (2025). Robot Learning from Any Images. CoRL 2025.

Rodrigues Network for Learning Robot Actions

Published in ICLR, 2026

This work introduces the Neural Rodrigues Operator, a learnable generalization of the classical Rodrigues’ rotation formula, to embed kinematic inductive bias directly into neural networks. Built upon this operator, the proposed Rodrigues Network (RodriNet) effectively models articulated actions and significantly improves performance across tasks like forward kinematics prediction, imitation learning for robot manipulation, and 3D hand pose estimation from images.
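For reference, the classical Rodrigues' rotation formula that the abstract above generalizes can be sketched in a few lines. This shows only the fixed, non-learnable formula, v_rot = v cos θ + (k × v) sin θ + k (k · v)(1 − cos θ), and nothing of the Neural Rodrigues Operator itself. The function name `rodrigues_rotate` is hypothetical.

```python
import math


def rodrigues_rotate(v, axis, theta):
    """Rotate 3D vector v by angle theta (radians) about a non-zero axis."""
    length = math.sqrt(sum(a * a for a in axis))
    k = tuple(a / length for a in axis)  # unit rotation axis
    k_dot_v = sum(a * b for a, b in zip(k, v))
    k_cross_v = (
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    )
    c, s = math.cos(theta), math.sin(theta)
    return tuple(
        v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1.0 - c)
        for i in range(3)
    )


# A quarter turn about the z-axis maps the x-axis onto the y-axis
# (up to floating-point rounding).
rotated = rodrigues_rotate((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2)
print(rotated)
```

In a kinematic chain, each revolute joint applies a rotation of this form about its own axis, which is the structure the paper embeds into the network.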

Recommended citation: Zhang, J., Geng, H., You, Y., Deng, C., Abbeel, P., Malik, J., & Guibas, L. (2025). Rodrigues Network for Learning Robot Actions. arXiv preprint arXiv:2506.02618.

DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration

Published in ICRA, 2026

DOT-Sim is a differentiable optical tactile simulator that models soft sensors as elastic materials via the Material Point Method (MPM), enabling rapid real-to-sim calibration within minutes and supporting large, non-linear deformations. It simulates optical responses by learning a residual image relative to the real-world idle state, and demonstrates strong zero-shot sim-to-real performance on a DenseTact sensor across object classification, embedded tumor detection, and precise trajectory following.

Recommended citation: You, Y., Do, W.K., Swann, A., Antonova, R., Kennedy, M., & Guibas, L. (2026). DOT-Sim: Differentiable Optical Tactile Simulation with Precise Real-to-Sim Physical Calibration. ICRA 2026.

UniPose9D: Universal Category-Agnostic Object Pose Estimation

Published in arXiv, 2026

UniPose9D is a category agnostic foundation model for 9D object pose estimation. Given an object mask and either a color and depth observation or a color image with predicted depth, it estimates rotation, translation, and metric size without category labels, CAD models, shape priors, or reference views. A single model performs competitively across standard benchmarks and transfers to unseen objects and everyday scenes.

Recommended citation: You, Y., Du, Y., Harrison, C., & Guibas, L. (2026). UniPose9D: Universal Category-Agnostic Object Pose Estimation. arXiv preprint arXiv:2607.09985.

sideprojects

Pytorch implementation of adversarial autoencoders

This repo implements adversarial autoencoders (https://arxiv.org/pdf/1511.05644.pdf) and reproduces the results on MNIST. One difference is that I used Wasserstein distance instead of the naive GAN loss.

Tensorflow implementation of Collaborative Learning for Deep Neural Networks

This is a Tensorflow implementation of the paper Collaborative Learning for Deep Neural Networks. I got a 6.09% error rate after 300 epochs, which is slightly different from the paper. Maybe the split point is different from the paper: in my implementation splitting is done right after Batch Normalization and ReLU of transition layers, while it is not clear whether they split before, after, or in the transition layers. Besides, in my implementation, gradients pass through soft label targets (notation “q” in the paper).

Pytorch implementation of Learning Latent Subspaces in Variational Autoencoders

This is a Pytorch implementation of the paper Learning Latent Subspaces in Variational Autoencoders. It reproduced the experiment on the Swiss Roll toy data.

Tower-Defense Game in Unity

This is a tower-defense game written in Unity, a long time ago. It may not be compatible with recent Unity versions.

Cycle-W-GAN

This is a Tensorflow implementation of the famous Cycle-GAN described in the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. I modified it by adding Wasserstein distance to make the training more stable.

Implementation of Two-Phase Kernel Estimation for Robust Motion Deblurring

This is an unofficial Python implementation of the deblurring algorithm described in Two-Phase Kernel Estimation for Robust Motion Deblurring, ECCV2010.

Matlab implementation of Single Image Haze Removal Using Dark Channel Prior

This is a Matlab implementation of Kaiming He’s famous paper Single Image Haze Removal Using Dark Channel Prior.
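As a rough illustration of the central quantity in that paper, the dark channel of an image is the minimum over a local patch of the per-pixel minimum across colour channels. The sketch below is in Python rather than Matlab and is not taken from the repo. The function name `dark_channel`, the nested-list image layout and the border handling (patches clipped at the image edge) are assumptions made for the example.

```python
def dark_channel(image, radius=1):
    """Dark channel of an image given as rows of (r, g, b) tuples.

    For each pixel, take the smallest channel value found anywhere in the
    (2 * radius + 1)-wide square patch around it. Patches are clipped at
    the image border (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    # Step 1: minimum across colour channels at every pixel.
    channel_min = [[min(pixel) for pixel in row] for row in image]
    # Step 2: minimum filter over the local patch (brute force).
    dark = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            dark[y][x] = min(channel_min[j][i] for j in rows for i in cols)
    return dark


tiny = [
    [(0.9, 0.8, 0.7), (0.5, 0.6, 0.4)],
    [(0.3, 0.9, 0.9), (1.0, 1.0, 1.0)],
]
print(dark_channel(tiny, radius=0))  # per-pixel channel minimum only
print(dark_channel(tiny, radius=1))  # every patch covers the whole 2x2 image
```

The haze-removal pipeline uses this map to estimate atmospheric light and transmission; those later steps are not shown here.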

A3C for Doom

This is a Tensorflow implementation that trains an RL agent on Doom with the A3C training strategy.

Direct Sparse Odometry for iOS platform

This is an iOS port of Direct Sparse Odometry (https://github.com/JakobEngel/dso). You can press “toggle” to switch between depth images/RGB images/point cloud.

A geometric processing library in C++ & Python

This repo implements a simple 3D geometric processing library, including some famous algorithms like geodesic distance computation and Point Pair Features (PPF) computation with CUDA.

C++ implementation of GrabCut algorithm

This is a C++ implementation of the famous image segmentation algorithm described in GrabCut: interactive foreground extraction using iterated graph cuts.

Tensorflow implementation of KdNet

This repo is a neat and precise implementation of KdNet (Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models) in Tensorflow with Tensorpack.

Python Mahjong game with monte-carlo AI

This is a Mahjong game with a GUI. I also implemented Monte Carlo tree search for the AI opponents.

Tiny Mask-RCNN in Tensorflow

This is a tiny Mask R-CNN implemented in Tensorflow in only 700+ lines, thanks to “tf.map_fn”.

A minimal C++ example of Material Point Method

This is a C++ implementation of the Material Point Method, with the hybrid Particle-to-Grid and Grid-to-Particle process.

Tensorflow implementation of Multi-stage Reinforcement Learning For Object Detection

This is a Tensorflow implementation of the paper Multi-stage Reinforcement Learning For Object Detection.

A C++ 3D Object Simplifier

This is a C++ implementation of the paper Surface Simplification Using Quadric Error Metrics. It can reduce the size of a 3D model by merging vertices and faces.
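The core quantity in quadric-based simplification is the error of a vertex position with respect to a set of planes, expressed as v^T Q v with Q the sum of per-plane quadrics. The sketch below illustrates that computation in plain Python rather than C++. The helper names are hypothetical and it omits edge collapse and optimal-position solving, so it is not the repository's implementation.

```python
def plane_quadric(a, b, c, d):
    """4x4 quadric p p^T for the plane ax + by + cz + d = 0.

    The normal (a, b, c) is assumed to be unit length, so that the
    error below is a squared Euclidean distance.
    """
    p = (a, b, c, d)
    return [[p[i] * p[j] for j in range(4)] for i in range(4)]


def add_quadrics(q1, q2):
    """Element-wise sum of two 4x4 quadrics."""
    return [[q1[i][j] + q2[i][j] for j in range(4)] for i in range(4)]


def vertex_error(q, v):
    """Evaluate v^T Q v for v = (x, y, z) in homogeneous coordinates."""
    h = (v[0], v[1], v[2], 1.0)
    return sum(h[i] * q[i][j] * h[j] for i in range(4) for j in range(4))


# Quadric of a vertex lying on the planes z = 0 and x = 0.
q = add_quadrics(plane_quadric(0.0, 0.0, 1.0, 0.0),
                 plane_quadric(1.0, 0.0, 0.0, 0.0))
# Moving the vertex to (3, 0, 4) costs 3^2 + 4^2.
print(vertex_error(q, (3.0, 0.0, 4.0)))  # 25.0
```

When two vertices are merged, their quadrics are summed, and the merge with the smallest resulting error is performed first.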

Pytorch implementation of OctConv

This is a Pytorch implementation of the paper Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution.

Python implementation of PatchMatch Stereo

This repo (partially) implements the PatchMatch Stereo algorithm described in PatchMatch Stereo - Stereo Matching with Slanted Support Windows.

Perception3D: An Open-Source Deep Learning Framework for 3D Perception

An open-source deep learning framework for 3D perception (including classification, segmentation, keypoint detection, registration, shape matching, etc.). The documentation (beta) can be built in the docs folder. This framework is inspired by mmdetection3d, but with a better configuration system.

A numpy implementation of PMVS

This is an unofficial Python implementation of PMVS, described in Accurate, dense, and robust multi-view stereopsis, PAMI2010. The implementation is slow (no multi-threading or CUDA) and is intended only for study and illustration.

Pytorch Implementation of Various Point Transformers

Recently, various methods have applied transformers to point clouds: PCT: Point Cloud Transformer (Meng-Hao Guo et al.), Point Transformer (Nico Engel et al.), Point Transformer (Hengshuang Zhao et al.). This repo is a Pytorch implementation of these methods and aims to compare them under a fair setting. Currently, all three methods are implemented, and their hyperparameters are still being tuned.

PyTorch implementation of QENet

This is an unofficial PyTorch implementation of QENet, based on the paper Quaternion Equivariant Capsule Networks for 3D Point Clouds. However, it fails to converge for some unknown reason. I am still investigating the issue…

A tiny renderer based on PBRT

This is a naive tiny C++ renderer that is supposed to work on all platforms, based on the famous PBRT book.

Pytorch implementation of Sketch-WGAN

This is a Pytorch implementation of the paper A Neural Representation of Sketch Drawings. I use WGAN to mimic the data distribution of sketch drawings, where the key difference from the original sketch-rnn is in the reparameterization of GMM.

Semantic 3D Reconstruction from a Collection of Images

This project implements real-time indoor object segmentation and 3D reconstruction. We use a fine-tuned MaskRCNN to perform instance segmentation for 51 different objects, and build the 3D model by Truncated Signed Distance Function (TSDF) volume reconstruction with the semantics predicted by MaskRCNN. For now, the pipeline runs in two steps. First, download a dataset from the RGB-D SLAM datasets and use mask_process.py to generate mask images for it. Second, change the configuration in kernel.cpp to run TSDF.

C++ implementation of Selective Search Windows

This is a C++ implementation of the paper Segmentation as selective search for object recognition. It reproduces the image segmentation result on Lena.

RealTimeTrack: A fast algorithm for tracking deformable planar surfaces in real time

This is part of my graduation project. It can track multiple deformable images at the same time at more than 30 FPS. The idea is based on the paper Template-based Monocular 3D Shape Recovery using Laplacian Meshes, with some performance improvements.

Tensorflow implementation of TRPO

This is a Tensorflow (v1) implementation of the Trust Region Policy Optimization (TRPO) method. It is built purely on Tensorflow (v1) graphs and encapsulated as a separate optimizer. You only need to pass the policy function and the cost function to the optimizer and create the cache variables.

Pytorch implementation of World Models

This is a Pytorch implementation of Google World Models.

talks

Talk 1 on Relevant Topic in Your Field

Published: March 01, 2012

This is a description of your talk, which is a markdown file that can be markdown-ified like any other post. Yay markdown!

Tutorial 1 on Relevant Topic in Your Field

Published: March 01, 2013

More information here

This is a description of your tutorial; note the different field in type. This is a markdown file that can be markdown-ified like any other post. Yay markdown!

Talk 2 on Relevant Topic in Your Field

Published: February 01, 2014

More information here

This is a description of your talk, which is a markdown file that can be markdown-ified like any other post. Yay markdown!

Conference Proceeding talk 3 on Relevant Topic in Your Field

Published: March 01, 2014

This is a description of your conference proceedings talk; note the different field in type. You can put anything in this field.

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Heading 1

Heading 2

Heading 3

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.

Yang You (尤洋)
