Publications
*: Equal contribution, ✉: Corresponding author

Latest
- Task-Robust Pre-Training for Worst-Case Downstream Adaptation (NeurIPS'23). Jianghui Wang*, Yang Chen*, Xingyu Xie, Cong Fang✉, and Zhouchen Lin✉. In Advances in Neural Information Processing Systems, 2023
Pre-training has achieved remarkable success when transferred to downstream tasks. In machine learning, we care not only about a model's good performance but also about its behavior under reasonable shifts of conditions. The same philosophy holds when pre-training a foundation model. However, a foundation model may not behave uniformly well across a series of related downstream tasks. This happens, for example, in mask-recovery regression when the recovery abilities or training instances diverge: pattern features are extracted dominantly during pre-training, while semantic features are also required by a downstream task. This paper considers pre-training a model that guarantees uniformly good performance over the downstream tasks; we call this goal downstream-task robustness. Our method first separates the upstream task into several representative ones and applies a simple minimax loss for pre-training. We then design an efficient algorithm to solve the minimax loss and prove its convergence in the convex setting. In experiments on both large-scale natural language processing and computer vision datasets, we show that our method improves the metrics on worst-case downstream tasks. Additionally, we provide some theoretical explanations for why our loss is beneficial: specifically, we show that in some cases fewer samples are inherently required for the most challenging downstream task.
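The minimax idea in the abstract can be illustrated with a toy sketch (not the paper's implementation): instead of minimizing the average task loss, each step descends on whichever task currently has the worst loss, which is a subgradient step on the max of the losses. The quadratic task losses, starting point, and learning rate below are invented for illustration.

```python
import numpy as np

# Each "downstream task" is a toy quadratic with its own optimum (invented).
targets = [np.array([2.0, 0.0]), np.array([0.0, 2.0]), np.array([-2.0, 0.0])]

def task_loss(w, t):
    return 0.5 * np.sum((w - t) ** 2)

w = np.array([1.0, 1.0])  # arbitrary starting parameters
lr = 0.1
for _ in range(200):
    losses = [task_loss(w, t) for t in targets]
    worst = int(np.argmax(losses))    # pick the currently worst task
    w -= lr * (w - targets[worst])    # subgradient step on max_i loss_i

# Worst-case loss after training: much smaller than at the starting point.
worst_loss = max(task_loss(w, t) for t in targets)
```

With these three symmetric targets the minimax optimum sits near the origin, so the worst-case loss drops from 5.0 at the start to roughly 2, whereas plain averaging could leave one task far behind.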
@inproceedings{wang2023taskrobust,
  author = {Wang, Jianghui and Chen, Yang and Xie, Xingyu and Fang, Cong and Lin, Zhouchen},
  booktitle = {Advances in Neural Information Processing Systems},
  editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
  pages = {9458--9478},
  publisher = {Curran Associates, Inc.},
  title = {Task-Robust Pre-Training for Worst-Case Downstream Adaptation},
  url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/1e4322fddd833f83c855660ac65e428d-Paper-Conference.pdf},
  volume = {36},
  year = {2023}
}

- Nonparametric Teaching of Attention Learners (ICLR'26). Chen Zhang*✉, Jianghui Wang*✉, Bingyang Cheng, Zhongtao Chen, Wendong Xu, Cong Wang, Marco Canini, Francesco Orabona, Yik-Chung Wu, and Ngai Wong. In International Conference on Learning Representations (ICLR), 2026
Attention learners, neural networks built on the attention mechanism, e.g., transformers, excel at learning the implicit relationships that relate sequences to their corresponding properties, e.g., mapping a given sequence of tokens to the probability of the next token. However, the learning process tends to be costly. To address this, we present a novel paradigm named Attention Neural Teaching (AtteNT) that reinterprets the learning process through a nonparametric teaching perspective. Specifically, the latter provides a theoretical framework for teaching mappings that are implicitly defined (i.e., nonparametric) via example selection. Such an implicit mapping is embodied through a dense set of sequence-property pairs, with the AtteNT teacher selecting a subset to accelerate convergence in attention learner training. By analytically investigating the role of attention on parameter-based gradient descent during training, and recasting the evolution of attention learners, shaped by parameter updates, through functional gradient descent in nonparametric teaching, we show for the first time that teaching attention learners is consistent with teaching importance-adaptive nonparametric learners. These new findings readily commit AtteNT to enhancing learning efficiency of attention learners. Specifically, we observe training time reductions of 13.01% for LLMs and 20.58% for ViTs, spanning both fine-tuning and training-from-scratch regimes. Crucially, these gains are achieved without compromising accuracy; in fact, performance is consistently preserved and often enhanced across a diverse set of downstream tasks.
@inproceedings{zhang2026nonparametric,
  title = {Nonparametric Teaching for Attention Learners},
  author = {Zhang, Chen and Wang, Jianghui and Cheng, Bingyang and Chen, Zhongtao and Xu, Wendong and Wang, Cong and Canini, Marco and Orabona, Francesco and Wu, Yik-Chung and Wong, Ngai},
  booktitle = {ICLR},
  year = {2026}
}

- MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning (arXiv). Jianghui Wang*, Yuxuan Wang*, Dongyan Zhao, and Zilong Zheng✉. arXiv, 2023
We introduce MoviePuzzle, a novel challenge that targets visual narrative reasoning and holistic movie understanding. Despite the notable progress witnessed in the realm of video understanding, most prior works fail to present tasks and models that address holistic video understanding and the innate visual narrative structures of long-form videos. To tackle this quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models by reshuffling the shot, frame, and clip layers of movie segments in the presence of video-dialogue information. We start by establishing a carefully refined dataset based on MovieNet, dissecting movies into hierarchical layers and randomly permuting their orders. Besides benchmarking MoviePuzzle against prior art in movie understanding, we devise a Hierarchical Contrastive Movie Clustering (HCMC) model that considers the underlying structure and visual semantic orders for movie reordering. Specifically, through a pairwise and contrastive learning approach, we train models to predict the correct order of each layer. This equips them with the knack for deciphering the visual narrative structure of movies and handling the disorder lurking in video data. Experiments show that our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark, underscoring its efficacy.
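The pairwise order-prediction step described above can be sketched with a toy example (this is not the released HCMC code): a tiny logistic model is trained to score whether shot i precedes shot j, with supervision from the true ordering, and a shuffled segment is then reordered by sorting with the learned scores. The features, sizes, and learning rate are invented for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_shots, dim = 6, 2
# Toy shot features: coordinate 0 carries a temporal cue, coordinate 1 is a
# fixed nuisance value (stand-in for appearance features).
feats = np.stack([np.arange(n_shots, dtype=float), np.ones(n_shots)], axis=1)

w = np.zeros(2 * dim)
lr = 0.05
for _ in range(200):
    for i in range(n_shots):
        for j in range(n_shots):
            if i == j:
                continue
            x = np.concatenate([feats[i], feats[j]])
            y = 1.0 if i < j else 0.0           # does shot i precede shot j?
            w += lr * (y - sigmoid(w @ x)) * x  # logistic-regression step

def score(s):
    # How strongly the model believes other shots precede shot s.
    return sum(sigmoid(w @ np.concatenate([feats[t], feats[s]]))
               for t in range(n_shots) if t != s)

shuffled = [3, 0, 5, 1, 4, 2]
recovered = sorted(shuffled, key=score)  # later shots get higher scores
```

Sorting by "how many shots are predicted to come before me" recovers the original order; the real model additionally applies this idea contrastively at the frame, shot, and clip levels.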
@article{wang2023moviepuzzle,
  title = {MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning},
  author = {Wang, Jianghui and Wang, Yuxuan and Zhao, Dongyan and Zheng, Zilong},
  journal = {arXiv preprint arXiv:2306.02252},
  year = {2023}
}
- Shuō Wén Jiě Zì: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training (ACL'23). Yuxuan Wang, Jianghui Wang, Dongyan Zhao, and Zilong Zheng. In Findings of ACL, 2023
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBERT Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks, and yields significant gains in the few-shot setting of ancient Chinese understanding.
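The Shuowen retrieval step can be illustrated with a minimal sketch (the embeddings and function names below are invented for illustration, not CDBERT's actual representations): given a context embedding, pick the dictionary sense whose embedding is most similar under cosine similarity.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve_sense(context_vec, sense_vecs):
    """Return the index of the sense most similar to the context."""
    sims = [cosine(context_vec, s) for s in sense_vecs]
    return int(np.argmax(sims))

# Two toy sense embeddings for one polysemous character, and a context
# embedding that leans toward sense 0 (all vectors invented).
senses = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
context = np.array([0.9, 0.1])
best = retrieve_sense(context, senses)  # selects sense 0
```

In the actual model the context and entry representations come from the PLM, and the retrieved definition feeds the dictionary pre-training tasks such as Masked Entry Modeling.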
@inproceedings{wang2023shuo,
  title = {Shu\={o} W\'{e}n Ji\v{e} Z\`{i}: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training},
  author = {Wang, Yuxuan and Wang, Jianghui and Zhao, Dongyan and Zheng, Zilong},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2023},
  year = {2023}
}