Research

Exploring the Frontiers of AI

Photo: Yunnan Garden, Nanyang Technological University © GoAhead / CC BY-SA 4.0


Juanxi Tian

My current research interests include, but are not limited to, the following topics:

  • Native Multimodal Foundation Model: Rather than retrofitting language models with vision or audio adapters, native multimodal foundation models are designed from the ground up to jointly perceive and reason across text, images, video, and audio within a single architecture. This paradigm treats every modality as a first-class citizen during pre-training, enabling richer cross-modal representations, more coherent multimodal reasoning, and emergent capabilities that modular pipelines struggle to achieve. Core challenges include designing unified tokenization schemes, balancing modality-specific and shared representations, crafting scalable pre-training objectives that capture inter-modal dependencies, and establishing principled scaling laws that govern how performance evolves as data, compute, and the number of modalities grow.
  • Efficient & Scalable AI: As foundation models continue to scale in both parameter count and training data, many conventional techniques face diminishing returns or become prohibitively expensive. Advancing efficient and scalable methodologies—spanning adaptive optimization strategies, speculative and parallel decoding, low-precision training, memory-efficient attention mechanisms, and domain-specific acceleration techniques for generative models—is essential to sustaining progress. Crucially, these approaches must not only reduce computational cost at today's scale but also preserve their effectiveness as model size, sequence length, and deployment complexity continue to grow.
  • Unified Models for Generation and Understanding: A growing line of work aims to bridge visual generation and visual understanding within a single model, rather than treating them as separate tasks served by disparate architectures. Unification opens the door to shared representations that benefit both directions—generative objectives can regularize discriminative features, while perceptual understanding can guide more semantically faithful synthesis. Key research threads include joint tokenization strategies, hybrid training paradigms that balance reconstruction and recognition losses, and data composition strategies that expose the model to both generative and discriminative signals at scale. Although the field has not yet converged on a dominant recipe, such unified frameworks represent a compelling path toward more general and versatile visual intelligence.
  • World Model: World models seek to learn an internal, predictive representation of the environment that can simulate future states, reason about physical dynamics, and support planning without exhaustive trial-and-error interaction. By capturing spatial, temporal, and causal structure from large-scale video and sensor data, world models hold the potential to serve as a general-purpose "mental simulator" for embodied agents, autonomous driving, robotics, and scientific discovery. Central challenges include learning physically grounded dynamics from passive observation, generalizing across diverse environments and embodiments, achieving long-horizon temporal consistency, and efficiently integrating world-model predictions into downstream decision-making and control.
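Among the acceleration techniques named above, speculative decoding is concrete enough to sketch: a cheap draft model proposes several tokens autoregressively, and the expensive target model verifies the whole proposal in one pass, accepting the longest agreeing prefix. The toy deterministic "models" and the greedy acceptance rule below are illustrative assumptions, not any particular system's implementation.

```python
# Minimal sketch of greedy speculative decoding over a toy vocabulary.
# Both "models" are deterministic stand-ins: draft_model is the cheap
# proposer, target_model the authoritative (but expensive) verifier.
VOCAB = [0, 1, 2, 3]

def draft_model(context):
    # Cheap, less accurate next-token rule (illustrative assumption).
    return (sum(context) + 1) % len(VOCAB)

def target_model(context):
    # Authoritative next-token rule; diverges from the draft only on
    # contexts whose length is a multiple of 5 (illustrative assumption).
    if len(context) % 5 == 0:
        return (sum(context) + 2) % len(VOCAB)
    return (sum(context) + 1) % len(VOCAB)

def speculative_decode(context, num_draft=4):
    """One speculation round: draft proposes, target verifies."""
    # 1) Draft model proposes num_draft tokens autoregressively (cheap).
    proposal, ctx = [], list(context)
    for _ in range(num_draft):
        t = draft_model(tuple(ctx))
        proposal.append(t)
        ctx.append(t)
    # 2) Target model checks each proposed position; in a real system
    #    these checks happen in a single parallel forward pass.
    verified, ctx = [], list(context)
    for t in proposal:
        expected = target_model(tuple(ctx))
        if t == expected:           # greedy acceptance: draft matches target
            verified.append(t)
            ctx.append(t)
        else:                       # first mismatch: keep target's token, stop
            verified.append(expected)
            break
    return verified
```

When draft and target agree often, most rounds accept several tokens per expensive verification pass, which is the source of the speedup.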

The following outlines how two complementary research pillars — Efficient & Scalable AI and Generative AI — converge toward a Native Unified Multimodal Foundation Model.


Efficient & Scalable AI

"Efficiency isn't about doing less — it's about understanding more."

  • Adaptive Optimization: Developing optimizer-architecture co-design principles and scalable training strategies that remain effective across diverse model scales.
  • Training-Inference Acceleration: Unified acceleration across training and inference — speculative decoding, memory-efficient attention, and consistent optimization strategies that bridge the training-serving gap for real-time deployment.
  • Scalable Architecture Design: Designing model architectures with inherent scalability — exploring structural inductive biases, modular composition, and architecture-optimizer synergies that generalize as capacity grows.

Generative AI

"What I cannot create, I do not understand." — Richard Feynman

  • Unified Generation & Understanding: Bridging visual synthesis and perception within a shared framework via joint tokenization and hybrid training paradigms.
  • Multimodal Synthesis: High-fidelity generation across images, video, 3D/4D, and beyond — advancing quality, consistency, and compositional control.
  • World Modeling: Learning predictive, physically grounded representations of environments to enable simulation, planning, and embodied reasoning.
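The world-modeling loop can be reduced to a skeleton: encode an observation into a latent state, roll the latent forward under candidate actions, and score the imagined futures to choose a plan — with no real-environment interaction during planning. Everything below (the fixed random "learned" parameters, the linear-tanh dynamics, the random-shooting planner, all shapes and names) is an illustrative assumption, not a trained model.

```python
import math
import random

random.seed(0)  # reproducible toy parameters

# Tiny linear-algebra helpers over plain lists.
def rand_mat(rows, cols, scale=0.5):
    return [[random.gauss(0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Illustrative fixed parameters; in a real world model these are learned.
ENC = rand_mat(4, 8)        # 8-dim observation -> 4-dim latent
DYN_S = rand_mat(4, 4)      # latent transition
DYN_A = rand_mat(4, 2)      # contribution of a 2-dim action
REWARD_W = [random.gauss(0, 1) for _ in range(4)]  # latent -> reward

def encode(obs):
    # Compress a raw observation into the latent state.
    return [math.tanh(x) for x in matvec(ENC, obs)]

def step_latent(z, a):
    # Predictive dynamics: next latent from current latent and action.
    return [math.tanh(s + t) for s, t in zip(matvec(DYN_S, z), matvec(DYN_A, a))]

def imagine_return(z, actions):
    # Roll the latent forward and sum predicted rewards (no env calls).
    total = 0.0
    for a in actions:
        z = step_latent(z, a)
        total += sum(w * x for w, x in zip(REWARD_W, z))
    return total

def plan(obs, horizon=5, num_candidates=64):
    # Random-shooting planner: sample action sequences, imagine each
    # trajectory inside the model, return the first action of the best one.
    z0 = encode(obs)
    best_seq, best_ret = None, -math.inf
    for _ in range(num_candidates):
        seq = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(horizon)]
        ret = imagine_return(z0, seq)
        if ret > best_ret:
            best_ret, best_seq = ret, seq
    return best_seq[0]
```

The point of the sketch is the division of labor: the dynamics model answers "what happens next," while planning becomes search over imagined rollouts rather than trial-and-error in the real environment.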

Native Unified Multimodal Foundation Model

A single architecture that natively perceives and generates across all modalities — text, image, video, audio, and 3D — grounded in physical-world understanding, trained efficiently at scale, and capable of seamless reasoning across generation and comprehension.