Todd Hollon | University of Michigan

Visual Intelligence

Our visual world is more complex than human language. Language is discrete, sequential, and constructed. Our visual world is continuous, geometric, and discovered. A major open problem in machine intelligence is how best to model visual perception and reasoning. We believe that the major advances in large language models, while impressive, do not illuminate this problem. Our lab focuses broadly on visual representation learning and on how image data structures can inform visual learning. For example, we exploit inherent hierarchical structures in biomedical microscopy (HiDisc) or neuroimaging (HLIP) to better learn complete and grounded visual features. We also aim to unify visual self-supervision and language supervision (CLIPred, SimCLIP), which are generally treated as independent learning environments. Visual reasoning enables AI agents to reason about images, performing actions on those images such as cropping, clipping, and resizing (CodeV).

Learning complete and explainable visual representations from itemized text supervision

Yiwei Lyu, Chenhui Zhao, Soumyanil Banerjee, Shixuan Liu, Akshay T. Rao, Akhil Kondepudi, Honglak Lee, and Todd C. Hollon

COMPUTER VISION AND PATTERN RECOGNITION · 2026

Standard contrastive language-image pre-training can neglect objects in visual scenes. ItemizedCLIP forces models to learn and attend to all described items, resulting in better visual representations.

PDF Code

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, and Bryan Wang

COMPUTER VISION AND PATTERN RECOGNITION · 2026

Recent visual agents can score well while using image tools unfaithfully, for example by cropping irrelevant regions or ignoring tool outputs. CodeV represents tools as executable Python code and trains with Tool-Aware Policy Optimization (TAPO), using process-level rewards on visual tool inputs and outputs to improve both accuracy and faithful tool use on search and broader multimodal benchmarks.

PDF Code Oral Paper (Top 1%)
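
To make the CodeV idea of "tools as executable Python code" concrete, here is a minimal sketch of a crop tool together with a simple check on whether a crop covers the region that matters. This is not the CodeV or TAPO implementation: the nested-list image format, the `crop_is_relevant` helper, the `evidence_box` input, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Return the sub-image inside `box`, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    return [row[left:right] for row in image[top:bottom]]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def crop_is_relevant(crop_box: Box, evidence_box: Box, threshold: float = 0.5) -> bool:
    """Hypothetical process-level check: does the crop overlap the evidence region?"""
    return box_iou(crop_box, evidence_box) >= threshold


# A 4x6 toy image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
patch = crop(image, (1, 1, 4, 3))  # rows 1-2, columns 1-3
```

A check of this kind scores how a tool was used (its input box) rather than only the final answer, which is the general spirit of a process-level reward; the actual reward design in the paper may differ.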

An Empirical Study on Unifying JEPA and Language Supervision for Visual Representation Learning

Shixuan Liu^*, Daniel A. Li^*, Yiwei Lyu, Akhil Kondepudi, Honglak Lee, and Todd C. Hollon

NEURIPS UNIREPS WORKSHOP · 2025

CLIPred is a framework that jointly optimizes the I-JEPA self-supervision and CLIP language supervision objectives for visual representation learning, outperforming either alone and achieving better zero-shot transfer than DINOv2+CLIP at lower training cost.

OpenReview PDF

Step-Calibrated Diffusion for Biomedical Optical Image Restoration

Yiwei Lyu, Sung Jik Cha, Cheng Jiang, Asadur Chowdury, Xinhai Hou, Edward Harake, Akhil Kondepudi, Christian Freudiger, Honglak Lee, and Todd C. Hollon

AAAI · 2025

This paper introduces Restorative Step-Calibrated Diffusion (RSCD) for biomedical optical image restoration, improving reconstruction fidelity by adapting denoising dynamics to the characteristics of microscopy data.

arXiv Github Poster

An empirical study of CLIP fine-tuning with similarity clusters

Shixuan Liu, Yiwei Lyu, Honglak Lee, and Todd C. Hollon

NEURIPS FITML WORKSHOP · 2024

SimCLIP is a generalized framework for CLIP fine-tuning that constructs minibatches containing clusters of similar image-text pairs to produce harder in-batch negatives, improving downstream performance over standard CLIP fine-tuning without hand-crafted hard negative captions.

OpenReview PDF Github
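
The SimCLIP summary above describes building minibatches from clusters of similar image-text pairs. One way such batching could look is sketched below, assuming each pair already has a cluster label (for example from clustering precomputed embeddings). The function name `clustered_batches` and the `group_size` parameter are hypothetical and not taken from the paper or its code.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def clustered_batches(
    cluster_ids: Sequence[int], batch_size: int, group_size: int, seed: int = 0
) -> List[List[int]]:
    """Build batches in which examples arrive in same-cluster groups.

    Each batch holds batch_size // group_size groups, and every group contains
    `group_size` example indices drawn from one cluster, so each example shares
    its batch with similar (harder) in-batch negatives.
    """
    if batch_size % group_size != 0:
        raise ValueError("batch_size must be a multiple of group_size")
    rng = random.Random(seed)

    # Collect example indices per cluster.
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)

    # Split each cluster into full groups; leftover examples are dropped.
    groups: List[List[int]] = []
    for members in by_cluster.values():
        rng.shuffle(members)
        for start in range(0, len(members) - group_size + 1, group_size):
            groups.append(members[start:start + group_size])

    # Shuffle groups, then pack a fixed number of groups into each batch.
    rng.shuffle(groups)
    per_batch = batch_size // group_size
    return [
        [i for group in groups[start:start + per_batch] for i in group]
        for start in range(0, len(groups) - per_batch + 1, per_batch)
    ]


cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
batches = clustered_batches(cluster_ids, batch_size=4, group_size=2)
```

In this toy run, every batch of four contains two same-cluster pairs. Dropping incomplete groups keeps batch composition uniform at the cost of skipping a few examples per epoch, which is a design choice of this sketch only.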

Super-resolution of biomedical volumes with 2D supervision

Cheng Jiang, Alexander Gedeon, Yiwei Lyu, Eric Landgraf, Yufeng Zhang, Xinhai Hou, Akhil Kondepudi, Asadur Chowdury, Honglak Lee, and Todd C. Hollon

CVPR WORKSHOP · 2024

This work proposes Masked Slice Diffusion for Super-Resolution (MSDSR), a strategy for volumetric biomedical super-resolution trained with only 2D supervision, enabling high-quality 3D reconstruction when fully paired 3D labels are scarce.

Website arXiv Github Poster

A self-supervised framework for learning whole slide representations

Xinhai Hou^*, Cheng Jiang^*, Akhil Kondepudi, Yiwei Lyu, Asadur Zaman Chowdury, Honglak Lee, and Todd C. Hollon

ARXIV · 2024

This study introduces Slide Pre-trained Transformers (SPT), a self-supervised framework for whole-slide representation learning that captures multiscale histologic structure to support downstream pathology tasks with limited manual annotation.

arXiv

Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy

Cheng Jiang^*, Xinhai Hou^*, Akhil Kondepudi, Asadur Chowdury, Christian W. Freudiger, Daniel A. Orringer, Honglak Lee, and Todd C. Hollon

COMPUTER VISION AND PATTERN RECOGNITION · 2023

HiDisc is a self-supervised learning method that leverages the inherent patient-slide-patch hierarchy of biomedical microscopy to learn stronger visual representations without explicit negative mining.

Website arXiv Github Highlight Paper (Top 5%)
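
The patient-slide-patch hierarchy mentioned in the HiDisc summary can be illustrated by listing, for one anchor image, which other images count as positives at each level. This is a sketch of the general idea only, not the HiDisc loss or data loader. The `View` record and the `hierarchical_positives` helper are hypothetical names, and the example assumes each slide belongs to exactly one patient.

```python
from typing import Dict, List, NamedTuple


class View(NamedTuple):
    """One augmented view, tagged with where it came from."""
    patient: str
    slide: str
    patch: str


def hierarchical_positives(views: List[View], anchor: int) -> Dict[str, List[int]]:
    """Return indices of positives for `anchor` at patch, slide, and patient level.

    The sets are nested: patch-level positives are also slide-level positives,
    and slide-level positives are also patient-level positives.
    """
    a = views[anchor]
    positives: Dict[str, List[int]] = {"patch": [], "slide": [], "patient": []}
    for i, v in enumerate(views):
        if i == anchor:
            continue
        if v.patient == a.patient:
            positives["patient"].append(i)
            if v.slide == a.slide:
                positives["slide"].append(i)
                if v.patch == a.patch:
                    positives["patch"].append(i)
    return positives


views = [
    View("p1", "s1", "a"),  # 0: anchor
    View("p1", "s1", "a"),  # 1: second view of the same patch
    View("p1", "s1", "b"),  # 2: different patch, same slide
    View("p1", "s2", "c"),  # 3: different slide, same patient
    View("p2", "s3", "d"),  # 4: different patient
]
result = hierarchical_positives(views, anchor=0)
```

Because positives come from the data hierarchy itself, images from other patients (index 4 here) act as negatives without any explicit negative mining, which matches the description above.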

OpenSRH: optimizing brain tumor surgery using intraoperative stimulated Raman histology

Cheng Jiang^*, Asadur Chowdury^*, Xinhai Hou^*, Akhil Kondepudi, Christian W. Freudiger, Kyle Conway, Sandra Camelo-Piragua, Daniel A. Orringer, Honglak Lee, and Todd C. Hollon

NEURIPS DATASETS & BENCHMARKS · 2022

OpenSRH is the first public dataset of clinical stimulated Raman histology images from brain tumor patients, released alongside benchmarks to accelerate machine learning research for intraoperative brain tumor diagnosis.

Website arXiv Github Talk Poster OpenReview

Denoising stimulated Raman histology using weak supervision to improve label-free optical microscopy of human brain tumors

Esteban Urias, Christian Freudiger, Daniel Orringer, Honglak Lee, and Todd Hollon

MLHC · 2020

This paper develops a weakly supervised denoising approach for stimulated Raman histology, improving image quality in label-free optical microscopy of human brain tumor specimens.

PDF
