When an Indian enterprise signs a contract with a global AI vendor, the assumption baked into the pitch is that the product works. The demo ran well. The benchmark numbers looked strong. The sales team showed a case study from a comparable company in a comparable industry.

What the pitch rarely addresses is what happens when the system meets the actual conditions of Indian deployment: users who write in Hindi, Tamil, or Marathi; data that cannot leave Indian shores without navigating a shifting regulatory framework; queries that mix scripts, languages, and informal registers in ways that no English-dominant training set was designed to handle.

This piece examines three specific dimensions of that gap: how global LLMs perform on Indian languages compared to English, what the tokenization economics mean for cost at scale, and what India’s data protection framework requires that most AI vendor contracts do not address.

1. The Language Performance Gap Is Real and Documented

Every major global AI vendor describes their model as multilingual. The technical reports for GPT-4, Claude, Llama, and Gemini all demonstrate multilingual capability against benchmark datasets. What those reports do not emphasize is that multilingual capability is not evenly distributed across languages, and Indian languages sit in a structurally disadvantaged position relative to English.

The reason is simple: these models are trained predominantly on data from the internet, and the internet is not evenly multilingual. Hindi, despite being the third most widely spoken language in the world by number of native speakers, ranks only 25th in all-time Wikipedia page views, a signal of how underrepresented it is in the written digital corpus that models train on. Hindi Wikipedia has approximately 168,000 articles compared to over 7.1 million in English. The gap between a language’s spoken presence in the world and its written presence online directly determines how much training data a model sees in that language, and how well the model performs in it.

IndicGenBench, a benchmark developed by researchers at Google and published at the Association for Computational Linguistics Annual Meeting in 2024, evaluated models including GPT-3.5, GPT-4, PaLM-2, and Llama across 29 Indian languages on tasks including cross-lingual summarization, machine translation, and question answering. The study found a significant performance gap in all Indian languages compared to English across every model tested.

MILU, a multi-task Indic language understanding benchmark developed by IBM researchers and published at the 2025 NAACL conference, evaluated LLMs across 11 Indian languages and 41 subjects. The benchmark was designed to assess cultural knowledge, not just linguistic fluency, across domains including arts and humanities, social sciences, and STEM. PARIKSHA, a Microsoft Research evaluation framework, assessed 30 models across 10 Indic languages covering prompts in finance, health, and culture through 90,000 human evaluations and 30,000 LLM-based evaluations, building leaderboards to track model performance across languages and evaluation settings.

The three-tier language performance structure in global LLMs: how research categorizes Indian language performance relative to English.

English baseline: models trained primarily on English-dominant internet data.
High-resource Indian languages (Hindi, Bengali, Tamil, Telugu): a meaningful gap versus English, though smaller than for lower-resource languages.
Medium- and low-resource languages (Odia, Assamese, Punjabi, Manipuri, and others): a significant gap; most benchmarks cover only 10–12 of India’s 22 scheduled languages.

Source: Analysis of Indic Language Capabilities in LLMs, arXiv 2501.13912, January 2025.
Performance tiers reflect training data availability, not model capability limits. Languages with less internet presence receive structurally less training signal.

The issue is compounded by what researchers call code-switching, the practice of mixing languages and scripts within a single piece of text. This is not an edge case in Indian communication; it is a dominant pattern. Hinglish, the mixture of Hindi and English that characterizes a large proportion of urban Indian digital communication, presents a specific challenge for models whose tokenizers were not designed with it in mind. Research evaluating LLMs on Hinglish and other mixed-script inputs found that informal, mixed-script communication at scale remains a challenge for multilingual models, and that existing benchmarks minimally cover it.

The situation is not static. OpenAI released IndQA in late 2025, a benchmark covering 12 Indian languages and 10 domains built in collaboration with 261 domain experts from across India, designed specifically to test cultural reasoning rather than just linguistic accuracy. The benchmark itself signals that global labs recognize the gap exists. It does not, by itself, close it. Benchmarks describe a problem. Closing it requires different training data, different tokenizers, and sustained investment in languages that are not commercially dominant in the global AI market.

What this means for enterprise deployment

If your AI application involves any Indian language input from real users, the model’s English-language benchmark score is not a reliable predictor of performance. Evaluation on your actual user inputs, in the languages your users write in, is the only number that matters for your deployment decision.

This is particularly consequential for customer-facing applications in BFSI, healthcare, and government services, where language accuracy is a requirement, not a preference.

2. Tokenization Creates a Structural Cost Disadvantage

There is a second, less discussed dimension to the language gap: cost. Global AI models price their APIs on a per-token basis. A token is roughly a subword unit, and how text gets broken into tokens depends on the tokenizer the model uses. Tokenizers are trained to efficiently encode the text they see most during training. Because that text is predominantly English, English is encoded most efficiently. Other languages, including Indian languages, require more tokens to represent the same semantic content. More tokens means higher cost per query, higher latency, and faster consumption of the model’s context window.
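One way to see the raw-material side of this is to count UTF-8 bytes. Byte-level BPE tokenizers start from bytes, and Devanagari characters take three bytes each where ASCII takes one, so a script that the tokenizer has few learned merges for starts at a structural disadvantage before any training even begins. A minimal sketch; the sample strings are illustrative, and bytes per character is only a rough proxy for tokenizer efficiency:

```python
# Rough proxy for tokenizer pressure: UTF-8 bytes per character.
# Byte-level BPE tokenizers begin from raw bytes; scripts that need
# more bytes per character require more learned merges to reach the
# tokens-per-word efficiency that English gets by default.

def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character of `text`."""
    return len(text.encode("utf-8")) / len(text)

english = "hello"    # ASCII: 1 byte per character
hindi = "नमस्ते"      # Devanagari: 3 bytes per character

print(f"English: {bytes_per_char(english):.1f} bytes/char")  # 1.0
print(f"Hindi:   {bytes_per_char(hindi):.1f} bytes/char")    # 3.0
```

A tokenizer trained mostly on English text compounds this byte-level gap, which is how the ~2x token multipliers discussed below arise.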

A tokenization benchmarking study published in January 2025 comparing GPT-4o, Claude 3.5 Sonnet, Gemini 2.0 Flash, Sarvam-1, and DeepSeek V3 on Indian languages including Tamil, Hindi, Marathi, Bengali, and Telugu found that GPT-4o and Gemini required approximately twice the number of tokens for Indian language content compared to equivalent English content. Other models performed significantly worse. Sarvam-1, the Indian-origin model, used approximately 80% fewer tokens than Llama 3.3 on Tamil specifically.

Token multiplier for Indian language content vs. English: approximate token count relative to equivalent English content (1x = same as English).

English baseline: 1x
GPT-4o: ~2x
Gemini: ~2x
DeepSeek: 2–4x
Other open-source models: 4x+
Sarvam-1: ~1x

Source: Benchmarking LLM Tokenization, Indic Languages Under the Lens (Medium/manick2411, 2025). Figures are approximate.
Token multipliers vary by model and language. A 2x multiplier means an Indian-language deployment costs approximately twice as much per query as the equivalent English deployment on the same model.

A separate study evaluating tokenizer performance across all 22 official Indian languages found that the SUTRA tokenizer outperformed all others tested, including Indic-specific models, across 14 of those languages. GPT-4o showed improvement over its predecessor GPT-4 in processing Indian languages, but Project Indus, an India-focused model, performed well only on languages using the Devanagari script and struggled with others.

The practical implication is direct. An enterprise building a customer service application that serves Hindi-speaking users on a global AI model pays approximately twice the per-query cost that an identical English-language application would incur. At production scale, across millions of queries, this is not an optimization problem — it is a structural cost disadvantage built into the architecture of models that were not designed with Indian languages as a primary use case.

The language you deploy in determines what you pay. An Indian-language deployment on a global model carries an invisible cost premium that no vendor proposal will surface unprompted.
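The cost arithmetic above is easy to sketch. All figures below are illustrative assumptions, not quoted vendor rates or observed workloads:

```python
# Back-of-envelope cost model for a token multiplier.
# Every figure here is an illustrative assumption, not vendor pricing.

def monthly_cost(queries: int, avg_english_tokens: int,
                 multiplier: float, price_per_million_tokens: float) -> float:
    """Estimated monthly spend, in the same currency as the price."""
    tokens = queries * avg_english_tokens * multiplier
    return tokens / 1_000_000 * price_per_million_tokens

QUERIES = 5_000_000   # queries per month (assumed)
AVG_TOKENS = 800      # average tokens per query in English (assumed)
PRICE = 2.50          # $ per million tokens (assumed)

english = monthly_cost(QUERIES, AVG_TOKENS, 1.0, PRICE)
hindi = monthly_cost(QUERIES, AVG_TOKENS, 2.0, PRICE)  # ~2x multiplier
print(f"English workload: ${english:,.0f}/month")   # $10,000
print(f"Hindi workload:   ${hindi:,.0f}/month")     # $20,000
```

The multiplier scales the bill linearly, which is why a 2x token multiplier is a structural premium rather than a rounding error at production volumes.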

NervNow Analysis

There is a more subtle consequence beyond cost. More tokens for the same content means the model’s context window fills faster. A context window that holds 50 documents of English content will hold approximately 25 of the same documents in Hindi on a model with a 2x token multiplier. For applications that rely on long-context reasoning — document analysis, contract review, knowledge-intensive question answering — this has direct implications for system design that most pre-sales engagements do not address.
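The context-window effect described above follows from the same multiplier. A sketch with assumed figures; the window size and document length are illustrative:

```python
# How a token multiplier shrinks effective context capacity.
# Window size and document length are illustrative assumptions.

def docs_that_fit(context_window: int, doc_tokens_english: int,
                  multiplier: float) -> int:
    """Whole documents that fit in the window at a given multiplier."""
    return context_window // int(doc_tokens_english * multiplier)

WINDOW = 128_000      # context window in tokens (assumed)
DOC_TOKENS = 2_500    # average document length in English tokens (assumed)

print(docs_that_fit(WINDOW, DOC_TOKENS, 1.0))  # 51 documents in English
print(docs_that_fit(WINDOW, DOC_TOKENS, 2.0))  # 25 at a 2x multiplier
```

For retrieval-augmented or long-document applications, this halving has to be absorbed somewhere: fewer retrieved passages, more aggressive chunking, or a larger-window (and costlier) model tier.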

3. India’s Data Regulatory Landscape Is More Complex Than Most AI Contracts Reflect

The third dimension of the gap is regulatory. India’s Digital Personal Data Protection Act, enacted in August 2023 and referred to as the DPDP Act, establishes India’s first comprehensive data protection framework. Its implementing rules were published by the Ministry of Electronics and Information Technology in November 2025. The Act and its rules have specific implications for how AI systems process Indian user data that most standard global AI vendor contracts were not written to address.

What the DPDP Act actually does on data transfers

The DPDP Act is frequently mischaracterized as a data localization law. As enacted, it does not impose blanket data localization. Section 16 permits personal data to be transferred outside India by default, except to countries specifically designated by the central government as restricted destinations. This is a blacklist model rather than a whitelist model, representing a significant softening from the 2018 draft of the legislation, which proposed strict localization requirements.

The picture becomes more complicated for entities classified as Significant Data Fiduciaries. The DPDP Rules empower the government to specify categories of personal data that Significant Data Fiduciaries must process only within India, regardless of whether the destination country is on the restricted list. Which entities qualify as Significant Data Fiduciaries and which data categories will be subject to this requirement have not yet been fully defined, leaving enterprises operating at scale facing regulatory uncertainty that is difficult to build architecture around.

India’s DPDP Act: data transfer framework for AI deployments.

Default rule: cross-border transfer is permitted by default, except to countries on a government-designated blacklist (Section 16, DPDP Act 2023).
SDF exception: Significant Data Fiduciaries may face India-only processing requirements for certain data categories; those categories have not yet been defined.
Sectoral rules: RBI, SEBI, and IRDAI requirements apply on top; the DPDP Act does not override sectoral mandates.

Source: IAPP analysis of the DPDPA; Vidhi Centre for Legal Policy; DPDP Rules 2025 (MeitY, November 2025).
Three overlapping frameworks govern data in Indian AI deployments. Most global AI vendor contracts are written against none of them specifically.

Sectoral rules that predate and sit alongside the DPDP Act

The DPDP Act does not operate in isolation. Sector-specific regulators in India have their own data requirements that are independent of the DPDP framework and are not overridden by it. The Reserve Bank of India’s 2018 directive on payment data storage requires payment data to be stored only in India. SEBI and IRDAI have their own data governance requirements for securities and insurance data respectively. For enterprises in BFSI, this means that an AI system processing financial data is subject to both the DPDP framework and sector-specific rules, and the more restrictive requirement governs.

This creates a compliance architecture challenge that most global AI vendor contracts are not designed to resolve. A standard cloud AI service agreement describes where data is processed in terms of cloud regions and data centers. It does not typically address which categories of data it is processing on a given customer’s behalf, or whether those categories are subject to Indian sectoral regulations that require domestic processing regardless of where the vendor’s infrastructure is located.

What the DPDP Act requires around AI training

A separate and underappreciated implication of the DPDP framework concerns AI model training. The Act’s consent-centric regime requires explicit, purpose-specific consent before personal data can be processed. For AI providers that scrape or process Indian user data and use it as training signal, the Act’s obligations are triggered when the processing relates to offering goods or services to individuals in India, regardless of where the processing physically occurs.

The Future of Privacy Forum’s analysis of the DPDP Act noted that the Act does not recognize contractual necessity or legitimate interests as alternative legal bases for processing personal data, which are grounds that other major data protection frameworks like GDPR permit. This consent-centric approach creates specific challenges for AI development that relies on broad data collection, and its implications for enterprise AI deployments where user data flows into model fine-tuning pipelines have not yet been tested by enforcement.

What this means for enterprise deployment

Before signing any global AI vendor contract for an Indian deployment, legal review should confirm three things independently: where the vendor’s infrastructure processes data on your behalf, whether your industry’s sectoral regulator imposes requirements stricter than the DPDP baseline, and what the contract states about your data being used for model training or improvement purposes.

These are baseline conditions of Indian enterprise AI deployment that the typical vendor contract review process frequently does not surface.

The Domestic Response and What It Signals

The gaps described above are not unaddressed. They are being worked on, primarily by Indian-origin AI labs and research institutions, and the response is instructive precisely because it exists.

AI4Bharat, a research lab at IIT Madras, has developed open-source datasets, tools, and models specifically for Indian languages, including the IndicTrans2 machine translation system covering all 22 scheduled Indian languages. Sarvam AI has built a model optimized for Indian language tokenization that, as the tokenization benchmarking data shows, performs substantially better than global models on cost efficiency for Indian language content. BharatGen, a government-funded initiative, is developing multimodal LLMs tailored specifically to India’s linguistic and contextual needs.

The fact that these efforts exist and are attracting investment and government attention is itself evidence that the gap is real and material. If global models performed equivalently on Indian languages, there would be no market for India-specific models. The domestic AI ecosystem is building for a gap that global vendor pitches rarely acknowledge.

None of this means that global AI products have no place in Indian enterprise deployments. Many do, and for many use cases, particularly those that are primarily English-language or code-adjacent, the performance gap is negligible. The argument is not that Indian enterprises should use only Indian models. The argument is that the performance, cost, and regulatory characteristics of a global AI deployment on Indian workloads are substantially different from what the standard pitch describes, and that closing that gap requires deliberate architectural and procurement decisions that most enterprises are not currently making.

The vendor claims and performance patterns described in this article reflect published research and regulatory documentation. No enterprise deployment case studies are referenced. Tokenization figures cited are approximate and sourced from the referenced benchmarking studies. This article does not constitute legal advice. Enterprises should seek qualified legal counsel on DPDP Act compliance specific to their operations.