Context: In July 2025, the government-backed BharatGen released PARAM-1, a bilingual Large Language Model (LLM) built from scratch to reflect India’s linguistic and cultural realities, focusing on Hindi and English.
Relevance of the Topic: Prelims: Key Features of PARAM-1.
Foundational AI
- Foundational AI: Large-scale AI models trained on very large datasets, on top of which numerous specific applications, including generative AI, can be built.
- Large Language Models (LLMs) are a type of Foundational AI model trained on vast datasets, typically with a billion or more parameters. E.g., AI-powered tools like ChatGPT, Gemini, Perplexity, DeepSeek, Grok.
- Small Language Models (SLMs) are compact AI systems with far fewer parameters than LLMs (ranging from a few million to a few billion). They are cheaper to run and maintain, and ideal for specific use cases.
In its mission to build open-source Large Language Models (LLMs) for Indian researchers and developers, BharatGen, the government-backed AI initiative, has released an LLM called PARAM-1.
About PARAM-1
- PARAM-1 is a 2.9-billion parameter bilingual foundational AI model developed by the BharatGen team.
- It reflects India’s linguistic and cultural realities, with 25% of its training data in Hindi and the rest in carefully curated English.
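To give a sense of what a 2.9-billion-parameter model means in practice, here is a back-of-the-envelope sketch of its weight storage. The 2-bytes-per-parameter figure is an assumption (fp16/bf16 precision, common for models of this size); PARAM-1's actual precision is not stated here.

```python
# Rough weight-memory estimate for a 2.9B-parameter model.
# Assumption: 2 bytes per parameter (fp16/bf16 weights).
params = 2.9e9
bytes_per_param = 2

memory_gb = params * bytes_per_param / 1e9
print(f"~{memory_gb:.1f} GB of weights")  # ~5.8 GB of weights
```

At this scale the model sits between typical SLMs and the tens-of-billions-parameter LLMs, which is part of why it is practical for researchers to run locally.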
Key Features:
- Bilingual focus: Trained in Hindi and English, incorporating government documents, literary works, educational and community content.
- Script-aware Tokeniser:
- Tokenisation is the first step in how a language model processes text: the tokeniser breaks sentences into smaller units, or tokens, which the model can interpret.
- Standard tokenisers (built for English) perform poorly on Indian scripts, splitting words into too many fragments.
- PARAM-1 addresses this with a script-aware tokeniser that recognises Hindi and other Indic scripts, creating fewer and more meaningful tokens. This improves both accuracy and efficiency.
- Three-phase training: The model is trained in three phases focusing on language fluency, factual consistency, and long-context understanding. This allows it to gradually develop fluency, retain factual information, and improve performance on tasks that require reading and reasoning over longer texts.
- India-centric evaluation: Tested on Indian benchmarks like MILU (competitive exam questions) and SANSKRITI (cultural knowledge), besides global ones like MMLU and ARC.
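The fragmentation problem that the script-aware tokeniser addresses can be seen with a minimal sketch. This is a toy illustration, not PARAM-1's actual tokeniser: a naive byte-level tokeniser, common in English-centric models, turns each Devanagari character into three tokens, because Devanagari characters occupy three bytes each in UTF-8.

```python
# Toy byte-level tokeniser (illustration only, not PARAM-1's tokeniser):
# one token per UTF-8 byte, as in many English-centric models.

def byte_tokens(text: str) -> list[int]:
    """Return one token per UTF-8 byte of the input."""
    return list(text.encode("utf-8"))

english = "India"   # 5 characters, all ASCII (1 byte each)
hindi = "भारत"      # 4 Devanagari characters (3 bytes each)

print(len(byte_tokens(english)))  # 5 tokens
print(len(byte_tokens(hindi)))    # 12 tokens for just 4 characters
```

Because the same amount of Hindi text consumes roughly three times as many tokens under such a scheme, a script-aware tokeniser that emits fewer, more meaningful tokens directly improves both accuracy and efficiency on Indic scripts.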
Limitations:
- It currently supports only Hindi and English, excluding India’s wider linguistic diversity. This raises concerns over the model’s inclusivity, especially in a country where linguistic identity often intersects with regional politics and access to services.
