# From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
Multimodal large language models are strong at general visual understanding, but they often fail on fine-grained perception tasks that require identifying tiny objects or subtle visual relationships. We attribute this limitation to Visual Attenuation: sparse visual signals are progressively suppressed by dominant textual tokens during network propagation, so deep layers not only attend less to visual evidence, but also lose spatial focus and drift toward diffuse, text-dominated attention patterns. VIF (Variational Information Flow manipulation) addresses this problem with a conditional variational formulation that aligns an answer-aware posterior q(z|I,Q,A) with a question-conditioned prior p(z|I,Q), decodes the latent representation into a sparse spatial Gaussian mixture, and restores deep-layer visual information flow by injecting the learned visual bias into selected layers. Experiments on general VQA, fine-grained perception, and visual grounding show that VIF improves fine-grained reasoning while preserving general multimodal capability.
- Visual Attenuation analysis: VIF is built around the finding that deep layers in MLLMs both reduce visual attention strength and lose spatial focus.
- Variational visual saliency modeling: VIF learns an answer-aware posterior and a question-only prior to infer response-relevant visual cues.
- Spatial GMM decoding: latent slots are decoded into a sparse visual importance distribution for fine-grained attention restoration.
- Deep-layer information-flow restoration: the learned visual bias is injected into selected high layers.
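The variational pieces above can be sketched numerically. The following is a minimal NumPy sketch, not the repository's implementation: the function names, the isotropic 2-D components, and the 24x24 patch grid are illustrative assumptions. It shows the diagonal-Gaussian KL used to align the answer-aware posterior with the question-conditioned prior, and how K latent slots could be decoded into a sparse spatial Gaussian-mixture importance map.

```python
import numpy as np

def diag_gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over latent dimensions: the posterior/prior alignment term."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def spatial_gmm_map(means, log_sigmas, weights, grid=24):
    """Decode K latent slots into a sparse importance map over a
    grid x grid patch layout: a mixture of isotropic 2-D Gaussians
    (one per slot), normalized to sum to 1."""
    ys, xs = np.meshgrid(np.linspace(0, 1, grid),
                         np.linspace(0, 1, grid), indexing="ij")
    coords = np.stack([ys, xs], axis=-1)          # (grid, grid, 2)
    sigmas = np.exp(log_sigmas)                   # (K,) component scales
    w = np.exp(weights) / np.exp(weights).sum()   # softmax mixture weights
    m = np.zeros((grid, grid))
    for k in range(len(w)):
        d2 = np.sum((coords - means[k]) ** 2, axis=-1)
        m += w[k] * np.exp(-0.5 * d2 / sigmas[k] ** 2)
    return m / m.sum()
```

In this sketch, small per-component sigmas give the sparse, peaked maps that the method uses as a visual bias; the KL term keeps the question-only prior close to the answer-aware posterior so that at inference time the prior alone can propose response-relevant regions.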
```shell
conda create -n llava-vif python=3.10 -y
conda activate llava-vif
pip install -r requirements.txt
```

Run:

```shell
bash finetune_7b.sh
```

Key VIF knobs:
- Enable: `--use_latent_importance True`
- Learning range: `--latent_learning_start` / `--latent_learning_end`
- Injection range: `--latent_apply_start` / `--latent_apply_end`
- Loss weights: `--latent_kl_weight` / `--latent_sparsity_weight`
- Distribution config: `--latent_num_components`
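To show how these knobs fit together, here is a hypothetical excerpt of the kind of training command `finetune_7b.sh` might assemble. The entry point name and all values below are illustrative assumptions, not the script's actual defaults.

```shell
# Hypothetical training invocation (entry point and values are assumptions):
deepspeed llavavif/train/train_mem.py \
    ... existing LLaVA training arguments ... \
    --use_latent_importance True \
    --latent_learning_start 2 --latent_learning_end 8 \
    --latent_apply_start 20 --latent_apply_end 31 \
    --latent_kl_weight 0.1 --latent_sparsity_weight 0.01 \
    --latent_num_components 4
```

Note how the learning range (shallow layers, where visual signal is still strong) is disjoint from the injection range (deep layers, where the bias restores attenuated visual flow).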
```shell
python -m llavavif.eval.run_llava \
    --model-path checkpoints/<SAVE_PATH> \
    --image-file /path/to/image.jpg \
    --query "Describe the most important visual details and answer: ..."
```

Key implementation files:

- `llavavif/model/latent_importance.py`: variational prior/posterior, GMM decoder, sparsity terms, and bias construction
- `llavavif/model/attention_intervention.py`: selective deep-layer attention injection
If you find this work useful in your research, please consider citing:
```bibtex
@misc{zhu2026vif,
  title         = {From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception},
  author        = {Jilong Zhu and Yang Feng},
  year          = {2026},
  eprint        = {2604.12508},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2604.12508}
}
```

This project is built on top of the open-source implementations of LLaVA, Open-LLaVA-NeXT, and lmms-eval. Thanks to the authors and the community.
