INDOTABVQA: A New Benchmark for Multilingual Table VQA
INDOTABVQA pushes the boundaries of cross-lingual Table Visual Question Answering (VQA). With support for four languages including Bahasa Indonesia, the benchmark reveals striking performance gaps in current Vision-Language Models.
In the field of machine learning, the latest benchmark to watch is INDOTABVQA. This ambitious dataset evaluates cross-lingual Table Visual Question Answering (VQA) on real-world document images. Developed with a focus on Bahasa Indonesia, it includes 1,593 document images featuring various table styles.
Why INDOTABVQA Matters
INDOTABVQA isn't just another dataset. It offers a diverse linguistic challenge with question-answer sets spanning four languages: Bahasa Indonesia, English, Hindi, and Arabic. This diversity allows Vision-Language Models (VLMs) to be assessed in both monolingual and cross-lingual contexts. The real kicker is its potential to highlight performance discrepancies in VLMs, especially in languages that don't get much spotlight.
The Performance Gaps
Leading VLMs, including the open-source Qwen2.5-VL, Gemma-3, and LLaMA-3.2 as well as the proprietary GPT-4o, were put to the test. The findings weren't exactly flattering: these models exhibited substantial performance gaps, particularly on complex table structures and low-resource languages. Strip away the marketing and you get a clear picture: we're not there yet.
Fine-Tuning: A Step Forward
Fine-tuning showed promise. A compact 3-billion-parameter model and a LoRA-finetuned 7-billion-parameter model improved accuracy by 11.6% and 17.8%, respectively, showing that even modest models benefit substantially from task-specific adaptation. Notably, adding explicit table region coordinates as input boosted performance by a further 4-7%. This highlights the value of spatial priors in table-based reasoning.
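The article doesn't specify how those table region coordinates are fed to the model. One common approach is to serialize the table's bounding box into the text prompt alongside the question; the sketch below illustrates that idea. The `build_prompt` helper and the `<box>` tag format are illustrative assumptions, not the benchmark's actual encoding.

```python
from typing import Optional, Tuple


def build_prompt(question: str,
                 table_bbox: Optional[Tuple[int, int, int, int]] = None) -> str:
    """Build a table-VQA text prompt, optionally injecting the table's
    pixel bounding box (x0, y0, x1, y1) as an explicit spatial prior.

    NOTE: the <box> tag format is a hypothetical encoding for
    illustration; real VLMs each use their own coordinate scheme.
    """
    parts = []
    if table_bbox is not None:
        x0, y0, x1, y1 = table_bbox
        # Tell the model where the table sits in the image.
        parts.append(f"Table region: <box>{x0},{y0},{x1},{y1}</box>")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


# Same question, with and without the spatial prior.
print(build_prompt("Berapa total pendapatan pada 2023?"))
print(build_prompt("Berapa total pendapatan pada 2023?", (48, 120, 980, 640)))
```

The intuition behind the reported 4-7% gain is that the model no longer has to locate the table itself before reasoning over it.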
Implications for Underrepresented Regions
INDOTABVQA isn't just a technical feat. It's a significant step for underrepresented regions and languages in AI research. Language-diverse and domain-specific datasets like these can propel advancements in document understanding. But are we doing enough to support low-resource languages in AI?
INDOTABVQA is more than a benchmark. It's a call to action for developing models that truly understand diverse languages and structures. As VLMs evolve, they'll need to rise to such challenges, not just in popular languages but everywhere.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
GPT: Generative Pre-trained Transformer.
LLaMA: Meta's family of open-weight large language models.