TL;DR: GitHub has released the GitHub Multilingual Repositories Dataset, a CC0-1.0 metadata collection covering more than 40 million public repositories with language classifications for READMEs, issues, and pull requests. Portuguese leads non-English READMEs (more than 3 million repositories), while Korean is the most common non-English language in issue text, giving AI developers a developer-specific multilingual signal distinct from general web corpora.
GitHub published the GitHub Multilingual Repositories Dataset on June 15, 2026, fulfilling a 2025 commitment under Microsoft’s European Digital Commitments to make multilingual developer data accessible to open-source AI builders (GitHub Blog). Unlike raw content dumps, this is a repository-level metadata dataset designed to help researchers discover where non-English collaboration actually happens on the platform.
The dataset spans over 80 million classification rows across more than 40 million public repositories (GitHub Blog). For each repository, it provides language classifications for three distinct text sources: the README, the most-commented issue, and the most-commented pull request. Each text is truncated to its first 150 characters, and texts under 20 characters are excluded.
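The truncation and minimum-length rule can be sketched in a few lines. This is an illustration only: the function name `prepare_text` is hypothetical, and it assumes the length check is applied to the raw text before truncation, which the article does not specify.

```python
from typing import Optional

MAX_CHARS = 150  # truncation length stated in the article
MIN_CHARS = 20   # texts shorter than this are excluded


def prepare_text(text: str) -> Optional[str]:
    """Return the text a classifier would see, or None if it is too short.

    Assumption: the minimum-length check runs on the raw text, before
    truncation. Whitespace handling is not described in the article and
    is left untouched here.
    """
    if len(text) < MIN_CHARS:
        return None
    return text[:MAX_CHARS]
```

With a 150-character window, any downstream label reflects only the opening of a README or discussion, which is worth keeping in mind for documents that open in one language and continue in another.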
Three classifiers, no forced consensus
GitHub deliberately did not collapse the three classifiers into a single label. Instead, each text source receives independent classifications from fastText, gcld3, and lingua-py, each with a confidence score. Only classifications above 0.5 confidence are included (GitHub Blog).
This design choice reflects a practical reality: different classifiers have different coverage and calibration, especially for lower-resource languages. By exposing all three, GitHub lets downstream users choose their own precision-recall tradeoff.
| Use case | Suggested approach |
|---|---|
| High-precision Greek subset | Require all three classifiers to agree above a confidence threshold |
| Broad recall for Romance languages | Accept a single classifier’s prediction |
| Cross-classifier validation | Compare agreement rates as a quality signal |
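The three approaches in the table can be sketched as small helper functions. The data shape below (a mapping from classifier name to a language code and confidence) and all function names are assumptions made for illustration; the dataset's actual column layout may differ. Because classifications under 0.5 confidence are omitted, a classifier can be missing for a given text, and the strict helper treats that as disagreement.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical shape: classifier name -> (language code, confidence).
Predictions = Dict[str, Tuple[str, float]]

CLASSIFIERS = ("fasttext", "gcld3", "lingua")


def unanimous_label(preds: Predictions, min_conf: float = 0.8) -> Optional[str]:
    """High precision: all three classifiers present, same language, all >= min_conf."""
    if any(name not in preds for name in CLASSIFIERS):
        return None
    langs = {preds[name][0] for name in CLASSIFIERS}
    if len(langs) != 1:
        return None
    if any(preds[name][1] < min_conf for name in CLASSIFIERS):
        return None
    return langs.pop()


def any_label(preds: Predictions, target: str, min_conf: float = 0.5) -> bool:
    """Broad recall: at least one classifier predicts the target language."""
    return any(lang == target and conf >= min_conf for lang, conf in preds.values())


def agreement_rate(rows: Iterable[Predictions]) -> float:
    """Share of rows on which all three classifiers name the same language."""
    rows = list(rows)
    if not rows:
        return 0.0
    agreed = sum(1 for preds in rows if unanimous_label(preds, min_conf=0.0) is not None)
    return agreed / len(rows)


greek = {"fasttext": ("el", 0.97), "gcld3": ("el", 0.91), "lingua": ("el", 0.88)}
mixed = {"fasttext": ("pt", 0.90), "gcld3": ("es", 0.60)}
print(unanimous_label(greek), any_label(mixed, "pt"), agreement_rate([greek, mixed]))
```

Tracking the agreement rate per language is one way to spot languages where the classifiers diverge and a stricter threshold may be warranted.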
Language distribution surprises
The dataset surfaces striking divergences in where languages appear across repository artifacts:
- Portuguese tops the non-English README list with more than 3 million repositories
- Korean is the most common non-English language in issue text, but only the fifth-most common in READMEs
This split matters. READMEs represent outward-facing documentation; issues and pull requests capture active collaboration. A language dominating issues but not READMEs suggests a community that builds in that language but documents in English — a pattern general web corpora would miss entirely.
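A rough way to reproduce this kind of comparison is to rank non-English languages separately for each text source. The sketch below assumes each row has already been reduced to a single language label and uses hypothetical field names (`source`, `lang`); the toy rows are invented and do not reflect the dataset's real counts.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def rank_non_english(rows: Iterable[dict], top_n: int = 5) -> Dict[str, List[str]]:
    """Return {source: [language, ...]} ordered by row count, English excluded."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        if row["lang"] != "en":
            counts[row["source"]][row["lang"]] += 1
    return {
        source: [lang for lang, _ in counter.most_common(top_n)]
        for source, counter in counts.items()
    }


# Toy rows for illustration only.
toy_rows = (
    [{"source": "readme", "lang": "pt"}] * 3
    + [{"source": "readme", "lang": "ko"}]
    + [{"source": "readme", "lang": "en"}] * 4
    + [{"source": "issue", "lang": "ko"}] * 2
    + [{"source": "issue", "lang": "pt"}]
)
print(rank_non_english(toy_rows))
```

Comparing a language's position in the `readme` list against its position in the `issue` list gives a quick view of the documentation-versus-discussion split described above.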
Repository metadata included
Beyond language labels, each row carries repository metadata useful for filtering and stratification (GitHub Blog):
- Creation timestamp
- Disk usage
- Stars and forks
- Primary programming language
- SPDX license identifier
- Issue and pull request counts
- Snapshot date
This lets researchers slice by license permissiveness, project maturity, or programming language ecosystem — not just natural language.
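As a minimal sketch of such slicing, the function below filters rows by license, star count, and primary programming language. The field names (`spdx_license`, `stars`, `primary_language`) are assumptions about the schema, and the `PERMISSIVE` set is an illustrative selection of SPDX identifiers rather than an authoritative definition of "permissive".

```python
from typing import Iterable, List, Optional, Set

# Illustrative set of SPDX identifiers often treated as permissive.
# Adjust it to your own licensing policy.
PERMISSIVE: Set[str] = {
    "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC", "CC0-1.0",
}


def select_repos(
    rows: Iterable[dict],
    licenses: Set[str] = PERMISSIVE,
    min_stars: int = 0,
    programming_language: Optional[str] = None,
) -> List[dict]:
    """Keep rows whose license, stars, and primary language match the filters."""
    selected = []
    for row in rows:
        if row.get("spdx_license") not in licenses:
            continue  # also drops rows with no detected license
        if (row.get("stars") or 0) < min_stars:
            continue
        if programming_language and row.get("primary_language") != programming_language:
            continue
        selected.append(row)
    return selected


sample = [
    {"repo": "a/one", "spdx_license": "MIT", "stars": 120, "primary_language": "Python"},
    {"repo": "b/two", "spdx_license": "GPL-3.0-only", "stars": 900, "primary_language": "Python"},
    {"repo": "c/three", "spdx_license": "Apache-2.0", "stars": 3, "primary_language": "Go"},
    {"repo": "d/four", "spdx_license": None, "stars": 50, "primary_language": "Python"},
]
print([r["repo"] for r in select_repos(sample, min_stars=10, programming_language="Python")])
```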
Why developer-specific multilingual data matters
Most multilingual training corpora (Common Crawl, mC4, Wikipedia dumps) reflect general web text: news, blogs, encyclopedic entries. Developer collaboration has its own vocabulary, code-switching patterns, and discourse structures. A Korean developer discussing a React bug in an issue thread writes differently from a Korean news article, and that difference affects how a model behaves when it encounters technical Korean in production.
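The dataset also supports community-level analysis, such as which language communities are active in which programming ecosystems and whether certain licenses correlate with multilingual contribution patterns. Questions such as how issue response times vary by language would need additional data joined in, since response times are not among the metadata fields listed above.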
Practical entry points for builders
For data engineers curating training mixes: Filter the dataset for repositories where all three classifiers agree on a target language above 0.8 confidence, then pull the corresponding repository contents via the GitHub API for a high-precision domain-specific corpus.
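Once a set of repositories has been selected, the content-fetching step might start from the GitHub REST API's README endpoint (`GET /repos/{owner}/{repo}/readme`). The sketch below only builds the request and does not send it; the helper name `readme_request` is hypothetical, and authentication, pagination, and rate-limit handling are left out.

```python
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def readme_request(full_name: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a request for a repository's README.

    full_name is the "owner/repo" form. The raw media type asks the API
    for the file contents instead of a base64-encoded JSON payload.
    """
    owner, sep, repo = full_name.partition("/")
    if not sep or not owner or not repo or "/" in repo:
        raise ValueError(f"expected 'owner/repo', got {full_name!r}")
    req = urllib.request.Request(f"{API_ROOT}/repos/{owner}/{repo}/readme")
    req.add_header("Accept", "application/vnd.github.raw+json")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    return req


req = readme_request("octocat/Hello-World")
print(req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the call. For a corpus of any real size, an authenticated client with backoff is likely needed, and each repository's license should be checked before its contents are used.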
For evaluation benchmark creators: Use the issue/PR language labels to construct multilingual code review or technical QA benchmarks grounded in real developer interactions, not translated synthetic data.
For product teams building localized tooling: Identify which repositories have non-English issues but English READMEs — these are communities that would benefit from localized documentation but haven’t produced it yet.
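The English-README, non-English-issue filter can be sketched as a small grouping step. The row fields (`repo`, `source`, `lang`) are hypothetical, and the sketch assumes each text source has already been resolved to a single language label.

```python
from typing import Dict, Iterable, List


def localization_candidates(rows: Iterable[dict]) -> List[str]:
    """Repos whose README is English but whose top issue is in another language."""
    by_repo: Dict[str, Dict[str, str]] = {}
    for row in rows:
        by_repo.setdefault(row["repo"], {})[row["source"]] = row["lang"]
    return sorted(
        repo
        for repo, langs in by_repo.items()
        if langs.get("readme") == "en" and langs.get("issue") not in (None, "en")
    )


rows = [
    {"repo": "a/one", "source": "readme", "lang": "en"},
    {"repo": "a/one", "source": "issue", "lang": "ko"},
    {"repo": "b/two", "source": "readme", "lang": "pt"},
    {"repo": "b/two", "source": "issue", "lang": "pt"},
    {"repo": "c/three", "source": "readme", "lang": "en"},  # no issue label
]
print(localization_candidates(rows))
```

Repositories with no issue label are skipped rather than treated as candidates, since a missing label may only mean the text was too short or fell under the confidence cutoff.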
Licensing and access
The dataset is available on GitHub under CC0-1.0 (public domain dedication), removing legal friction for commercial and academic use alike (GitHub Blog). No attribution is required and there are no share-alike constraints: it is a metadata map pointing toward the repositories where the world’s developers actually write to each other in languages other than English.
Bottom line: This isn’t a model release or a benchmark — it’s a discovery layer. The real value emerges when teams combine these metadata signals with repository contents to build the multilingual developer tools the next generation of AI-assisted coding will require.
