GitHub Launches Internal Copilot-Powered Data Analytics Agent for Plain-Language Queries

GitHub has published the full build details of Qubot, its internal Copilot-powered data analytics agent. According to GitHub’s official build announcement, Qubot lets any employee query the company’s data warehouse models in plain language, with no dedicated analyst support required for exploratory questions.

The tool is accessible via Slack, VS Code, and Copilot CLI, and is built to integrate with workflows GitHub employees (called Hubbers) already use daily per the same official build documentation.

For example, a Hubber can ask a plain-language question about recent user sign-up rates in the dedicated Slack channel and receive a verified answer without scheduling time with a data analyst. Per GitHub’s official build announcement, this cuts typical exploratory query turnaround from 2 business days (the standard wait for dedicated analyst support at GitHub) to under 5 minutes.

Three-Layer Architecture Built for Existing GitHub Workflows

Qubot’s design is split into three core components: a user interface layer, a federated context layer, and a query engine layer. Per GitHub’s official build announcement, all three are built to integrate with tools Hubbers already use daily.

Qubot’s primary interface is a dedicated Slack channel, where users can ask questions in-thread and get answers directly, with full results also saved as markdown reports in pull requests for later reference or dashboard integration per the same official build documentation.

For users who prefer coding workflows, Qubot is available as a one-command install plugin for VS Code and the Copilot CLI, where it runs alongside any other custom agents, skills, or tools the user has configured per GitHub’s official build announcement.

Notably, Qubot is not a replacement for existing reporting dashboards or BI tools; it is built specifically for ad-hoc exploratory questions that would otherwise require scheduling time with a dedicated data analyst per the same official build documentation.

Federated Context Layer Eliminates Analyst Bottlenecks

The biggest barrier to self-serve analytics at GitHub’s scale was inconsistent, inaccessible documentation for the company’s petabyte-scale data warehouse, per GitHub’s official build announcement. The warehouse stores data in three standard medallion curation tiers: raw event data (bronze), conformed facts and dimensions (silver), and curated business-specific datasets (gold).

Qubot’s context layer solves this by pulling tailored knowledge for each tier directly from the teams that own the data. Product teams contribute schema information and metadata for bronze data; the central data and analytics team maintains query examples, usage guidance, and mandatory filter requirements for silver data; and dataset owners contribute business rules and formal metric definitions for gold data. The pattern is enabled by Copilot’s improved federated context handling and token efficiency.
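The ownership split described above can be pictured as a simple lookup from tier to contributor. The following is a minimal sketch, not GitHub's implementation: the `TierContext` class, the `CONTEXT_LAYER` mapping, and the `context_for` helper are hypothetical names introduced for illustration, and only the tier names, owners, and context types come from the announcement.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TierContext:
    """Hypothetical record of who documents a medallion tier and what they supply."""
    tier: str
    owner: str
    content: Tuple[str, ...]


# Tier, owner, and content values mirror the article; the structure is an assumption.
CONTEXT_LAYER = {
    "bronze": TierContext(
        "bronze", "product teams", ("schema information", "metadata")
    ),
    "silver": TierContext(
        "silver",
        "central data and analytics team",
        ("query examples", "usage guidance", "mandatory filter requirements"),
    ),
    "gold": TierContext(
        "gold", "dataset owners", ("business rules", "formal metric definitions")
    ),
}


def context_for(tier: str) -> TierContext:
    """Look up the context entry for a tier name, ignoring case."""
    try:
        return CONTEXT_LAYER[tier.lower()]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
```

Keying the layer by tier keeps each team responsible only for the data it owns, which is the point of the federated model.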

The layer is also automatically enriched via GitHub’s existing ETL pipelines, which add derived metadata to context entries without manual work per GitHub’s official build announcement.

To streamline cross-team contributions, GitHub built a dedicated context agent that ingests, normalizes, and organizes submitted context from standardized templates or linked repositories, eliminating the need for teams to learn new tooling to contribute per the same official build documentation.

For example, a product team contributing schema metadata for a new raw event data stream can submit its context via a linked internal repository, with no new tooling training required, per GitHub’s official build announcement.

All context is stored in markdown across multiple internal repositories, leveraging GitHub’s existing documentation workflows instead of requiring integrations with third-party tools per the same official build documentation.

Query Routing, Validation, and Regression Guardrails

Qubot connects to GitHub’s two primary analytics query engines, Kusto (Azure Data Explorer) and Trino (an open-source distributed SQL engine for big data), via two MCP servers: a custom Trino MCP server built in-house and a locally deployed version of the Fabric RTI MCP Server for Kusto, per GitHub’s official build announcement. Kusto is optimized for fast exploratory queries over recent log and event data, with most ad-hoc queries completing in under 10 seconds, while Trino handles complex multi-table joins and deep historical analysis across years of data, per the same official build documentation.

Rather than requiring users to select an engine manually, Qubot defaults to Kusto and automatically switches to Trino when a question requires its advanced capabilities, which removes a technical barrier for non-analyst users, per GitHub’s official build announcement. The context layer is loaded at runtime via the GitHub MCP Server, so the agent always has access to the latest documentation and metadata when processing a query, per the same official build documentation.
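The announcement does not say how Qubot decides that a question needs Trino, and in practice the agent likely makes that call from context rather than fixed rules. As a rough illustration of the "default to Kusto, escalate to Trino" idea only, the sketch below uses two hypothetical signals (table count and lookback window) with made-up thresholds; `QueryPlan`, `choose_engine`, and both constants are assumptions, not part of Qubot.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    """Hypothetical summary of what a question needs from the warehouse."""
    tables: int         # number of tables the query would touch
    lookback_days: int  # how far back in time the question reaches


# Illustrative thresholds only; the article gives no numeric cutoffs.
MAX_KUSTO_TABLES = 1
MAX_KUSTO_LOOKBACK_DAYS = 90


def choose_engine(plan: QueryPlan) -> str:
    """Default to Kusto; switch to Trino for multi-table joins or deep history."""
    needs_joins = plan.tables > MAX_KUSTO_TABLES
    needs_history = plan.lookback_days > MAX_KUSTO_LOOKBACK_DAYS
    return "trino" if (needs_joins or needs_history) else "kusto"
```

A question about last week's sign-ups on a single table would stay on Kusto here, while a three-table join or a multi-year trend would route to Trino.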

To avoid the common pitfall of AI analytics tools returning incorrect or misleading results, GitHub built a mandatory offline evaluation framework that runs every change to Qubot’s context layer or agent configuration before it is deployed per GitHub’s official build announcement.

The framework uses a curated test set of more than 200 prompts paired with verified correct responses, reference SQL queries, and domain- and difficulty-tagged metadata covering core GitHub business domains including user growth, repository activity, and Copilot adoption per the same official build documentation.
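One way to picture a test-set entry is as a record carrying the fields the announcement lists. The sketch below is an assumption about shape, not GitHub's schema: `EvalCase`, `coverage_by_domain`, and the difficulty labels are hypothetical, and the answer and SQL values are placeholders rather than real test data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical shape of one evaluation test case."""
    prompt: str
    expected_answer: str  # verified correct response
    reference_sql: str    # reference query that produces the answer
    domain: str           # e.g. "user growth", "repository activity"
    difficulty: str       # tag values here are assumed, not from the source


def coverage_by_domain(cases: Iterable[EvalCase]) -> Counter:
    """Count test cases per business domain to spot thin coverage."""
    return Counter(case.domain for case in cases)


# Placeholder entries; answers and SQL are deliberately left unfilled.
SAMPLE_CASES = [
    EvalCase("How many users signed up last week?", "<verified answer>",
             "<reference SQL>", "user growth", "easy"),
    EvalCase("How did sign-ups trend year over year?", "<verified answer>",
             "<reference SQL>", "user growth", "hard"),
    EvalCase("How many seats are using Copilot?", "<verified answer>",
             "<reference SQL>", "Copilot adoption", "easy"),
]
```

Tagging each case by domain and difficulty lets a team see whether a regression is concentrated in one business area or one class of question.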

An automated orchestration script uses the GitHub CLI (gh agent-task create) to launch each test case as an agent task, running 3 parallel trials per test case by default, polling for completion, and saving detailed JSON results, per GitHub’s official build announcement. A separate aggregation script then computes per-test-case metrics, including completion rate, accuracy, and average, minimum, and maximum duration, so the team can compare configurations and catch regressions before they impact users, per the same official build documentation.
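The aggregation step can be sketched in a few lines. This is an illustrative version under stated assumptions, not GitHub's script: the trial record keys (`case_id`, `completed`, `correct`, `duration_s`) are hypothetical, and counting an incomplete trial as incorrect is a choice made here, since the article does not define how accuracy is calculated.

```python
from collections import defaultdict
from statistics import mean


def aggregate(trials):
    """Summarize trial records per test case.

    Each trial is a dict with hypothetical keys:
    case_id (str), completed (bool), correct (bool), duration_s (float).
    """
    by_case = defaultdict(list)
    for trial in trials:
        by_case[trial["case_id"]].append(trial)

    summary = {}
    for case_id, runs in by_case.items():
        done = [r for r in runs if r["completed"]]
        durations = [r["duration_s"] for r in done]
        summary[case_id] = {
            "completion_rate": len(done) / len(runs),
            # Assumption: trials that never completed count as incorrect.
            "accuracy": sum(1 for r in done if r["correct"]) / len(runs),
            "avg_duration_s": mean(durations) if durations else None,
            "min_duration_s": min(durations) if durations else None,
            "max_duration_s": max(durations) if durations else None,
        }
    return summary
```

With three trials per case, a single flaky run shows up as a completion rate or accuracy of two thirds instead of silently passing or failing the whole case.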

All context updates are submitted via pull request, with the evaluation framework running automatically on every PR to approve or reject changes before they are merged, eliminating manual review overhead for the central data team. The workflow is aligned with GitHub’s maintainer-focused PR limit policies, which aim to reduce noise and streamline contributions.
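The announcement does not describe the rule that turns evaluation results into an approve or reject decision. One plausible form is a regression check against the current baseline, sketched below; the `gate` function, its inputs (per-test-case accuracy scores keyed by a case identifier), and the `max_drop` tolerance are all hypothetical.

```python
def gate(baseline, candidate, max_drop=0.02):
    """Compare per-case accuracy for a PR against the baseline.

    baseline, candidate: dicts mapping a case id to accuracy in [0, 1].
    Returns (approved, regressions), where regressions lists every case
    whose accuracy fell by more than max_drop. A case missing from the
    candidate results is treated as accuracy 0.0 (an assumption).
    """
    regressions = sorted(
        case_id
        for case_id, base_acc in baseline.items()
        if candidate.get(case_id, 0.0) < base_acc - max_drop
    )
    return (not regressions, regressions)
```

Returning the list of regressed cases alongside the verdict gives the PR author something concrete to fix instead of a bare failure.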

Replicable Enterprise AI Pattern

Qubot’s design reflects a proven, replicable pattern for enterprise AI deployment: instead of replacing existing workflows, it layers intelligent automation on top of tools teams already use, eliminating the need for new software licenses or training for end users per GitHub’s official build announcement.

The federated context contribution model also means the agent gets more accurate over time as more teams add documentation, with no central team required to maintain all context manually, per the same official build documentation. GitHub onboarded 12 product and engineering teams to the context contribution workflow in the first 6 months of Qubot’s internal rollout, with context coverage expanding by 40% quarter-over-quarter as of Q1 2024, per GitHub’s official build announcement.

Bottom line: GitHub’s Qubot offers a replicable model for enterprise self-serve analytics. It integrates Copilot with tools employees already use (Slack, VS Code, Copilot CLI), organizes a federated context layer around standard data warehouse tiers (bronze, silver, gold), and gates every change behind a mandatory offline evaluation that runs automated agent task tests via the GitHub CLI to catch regressions before they impact users.

For teams looking to build a similar tool, the highest-priority steps are: 1) audit existing employee workflows to identify low-friction access points (e.g., Slack channels, existing IDEs) to avoid adoption friction, 2) align the federated context contribution model with existing data ownership structures to eliminate central maintenance bottlenecks, and 3) tie offline evaluation gates to existing PR workflows to ensure result accuracy without adding new process overhead.

At GitHub, this approach eliminates the need for dedicated analyst support for exploratory queries, cutting typical turnaround from 2 business days to under 5 minutes with no additional licensing costs for end users, per GitHub’s official build announcement.

Last updated: Jun 20, 2026.
Aira

Founding Editor and Publisher of ZBrandCo, covering artificial intelligence, open-source software, and the developer tools people actually use. Signal over hype: every story starts from a primary source and explains why it matters. ZBrandCo runs no paid reviews and no affiliate links. Tips and corrections: editorial@zbrandco.com.