Vitess’s VSchema is the secret sauce that lets MySQL sharding feel like a single database, and VTGate is the unglamorous workhorse that makes it all happen.

Let’s see Vitess in action. Imagine we have a users table sharded by user_id.

CREATE TABLE users (
    user_id BIGINT AUTO_INCREMENT,
    name VARCHAR(100),
    email VARCHAR(100),
    PRIMARY KEY (user_id)
);

Here’s a snippet of a VSchema that tells Vitess how to shard this table:

{
  "sharded": true,
  "vindexes": {
    "user_id_vdx": {
      "type": "numeric"
    }
  },
  "tables": {
    "users": {
      "column_vindexes": [
        {
          "column": "user_id",
          "name": "user_id_vdx"
        }
      ]
    }
  }
}

When an application connects to VTGate and issues SELECT * FROM users WHERE user_id = 12345;, VTGate consults the VSchema. It sees that users is sharded by user_id using a numeric vindex, and it computes which shard holds the row for user_id = 12345. If that turns out to be, say, shard -80 (Vitess names shards after the range of keyspace IDs they own), VTGate forwards the query to a VTTablet serving that shard, which runs it against its MySQL instance. If the query cannot be pinned to one shard, for example because it filters on a column with no vindex or JOINs across sharded tables, VTGate may need to query several shards and combine the results itself (a scatter-gather operation).

The problem Vitess solves is the operational complexity of managing many MySQL instances for a horizontally scaled database. Sharding, by definition, splits your data across multiple databases. Without a system like Vitess, you’d need custom application logic or complex middleware to figure out which database to query for a given piece of data. This logic is prone to errors, hard to maintain, and doesn’t scale well. Vitess abstracts away the sharding, presenting your distributed MySQL cluster as a single, unified database to your applications.

At its core, Vitess’s query path has two main components: VTGate and VTTablet. VTGate is the gateway. It receives incoming SQL queries from your applications, parses them, consults the VSchema to determine the routing and execution plan, and then forwards the queries to the appropriate VTTablets. A VTTablet is a proxy that runs alongside a single MySQL instance, and each of those instances holds one subset of the sharded data (a "shard"). The VTTablet executes queries against its MySQL instance and returns the results to VTGate.

The VSchema is a JSON configuration that defines how your tables are sharded and how Vitess should interact with them. It specifies:

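  • sharded: A boolean indicating whether the keyspace (the logical database the tables live in) is sharded.
  • vindexes: A map of vindexes ("virtual indexes"). Each one is a function that maps a column value to a keyspace ID, which in turn determines the shard. Vitess supports various vindex types, such as numeric (uses the integer value itself as the keyspace ID), hash (hashes a numeric value so rows spread evenly across shards), unicode_loose_md5 (hashes text after normalizing it, so values that differ only in case map to the same shard), and custom vindexes.
  • tables: A map defining per-table sharding rules. For each table, column_vindexes lists which column uses which vindex (from the vindexes map); the first entry is the primary vindex that decides the shard. You can also define sequences for generating unique IDs across shards.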
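To make the relationship between the three keys concrete, here is a minimal Python sketch that loads a VSchema like the one above and checks that every column_vindexes entry refers to a vindex that is actually defined. The helper names (undefined_vindex_refs, primary_vindex) are made up for this illustration and are not part of any Vitess tooling.

```python
import json

# A VSchema shaped like the example in this article.
VSCHEMA = json.loads("""
{
  "sharded": true,
  "vindexes": {"user_id_vdx": {"type": "numeric"}},
  "tables": {
    "users": {
      "column_vindexes": [{"column": "user_id", "name": "user_id_vdx"}]
    }
  }
}
""")


def undefined_vindex_refs(vschema):
    """Return (table, column, vindex) triples whose vindex is not defined."""
    defined = set(vschema.get("vindexes", {}))
    problems = []
    for table, spec in vschema.get("tables", {}).items():
        for entry in spec.get("column_vindexes", []):
            if entry["name"] not in defined:
                problems.append((table, entry["column"], entry["name"]))
    return problems


def primary_vindex(vschema, table):
    """Return (column, vindex type) for the first column_vindexes entry."""
    first = vschema["tables"][table]["column_vindexes"][0]
    return first["column"], vschema["vindexes"][first["name"]]["type"]


print(undefined_vindex_refs(VSCHEMA))
print(primary_vindex(VSCHEMA, "users"))
```

A check like this is only a convenience before handing the file to Vitess, which performs its own validation when the VSchema is applied.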

The magic of VTGate is its ability to translate application queries into shard-specific operations. When VTGate receives a query, it first looks up the tables involved in the VSchema to learn their sharding keys. If a table is sharded, VTGate uses the defined vindexes to determine the target shard(s) for the query. For simple queries targeting a single row (e.g., SELECT * FROM users WHERE user_id = 12345), VTGate can route the query directly to a VTTablet serving that specific shard. For more complex queries, like those involving joins across sharded tables or aggregations, VTGate may send queries to multiple shards and then consolidate the results. This distributed query execution is a key feature that allows Vitess to handle complex workloads.
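The routing decision can be sketched in a few lines of Python. This is a toy model, not VTGate's planner: the query is represented as a dict of equality predicates instead of parsed SQL, and plan_route, SHARDING_COLUMN, and toy_shard_for are hypothetical names. The two shard names assume a keyspace split into -80 and 80-.

```python
ALL_SHARDS = ["-80", "80-"]            # assumed two-shard keyspace
SHARDING_COLUMN = {"users": "user_id"}  # table -> primary vindex column


def toy_shard_for(value):
    """Toy stand-in for a vindex lookup: lower half vs upper half of uint64."""
    return "-80" if value < 2**63 else "80-"


def plan_route(table, where_equals, shard_for=toy_shard_for):
    """Return the list of shards a query has to visit.

    where_equals maps column names to the constants they are compared with.
    """
    column = SHARDING_COLUMN[table]
    if column in where_equals:
        # The sharding key is known, so exactly one shard can hold the row.
        return [shard_for(where_equals[column])]
    # No sharding key in the predicate: every shard might hold matches.
    return list(ALL_SHARDS)


print(plan_route("users", {"user_id": 12345}))        # single-shard route
print(plan_route("users", {"email": "a@example.com"}))  # scatter to all shards
```

The second call shows why queries that omit the sharding key are expensive: the work grows with the number of shards.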

The vindexes in the VSchema are not MySQL indexes. They are Vitess-specific constructs that map a column’s value to a keyspace ID, and each shard owns a contiguous range of keyspace IDs. The numeric vindex, for instance, uses the integer value itself as the keyspace ID, so the resulting layout is effectively range-based. If you have 10 shards with roughly equal key ranges, the unsigned 64-bit user_id space is divided into 10 ranges, with each range corresponding to a shard. One consequence is that small, sequential IDs all land in the first range, which is why the hash vindex is the more common choice for auto-generated keys. VTGate and VTTablet communicate with each other over gRPC.
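As an illustration of the numeric vindex, the sketch below turns a value into an 8-byte big-endian keyspace ID and finds the shard whose key range contains it. Shard names such as -80 and 80- are hex boundaries with an open start or end. The function names are chosen for this example only.

```python
import struct


def numeric_keyspace_id(value):
    """Numeric vindex: the keyspace ID is the value as 8 big-endian bytes."""
    return struct.pack(">Q", value)


def parse_key_range(shard):
    """Split a shard name like '40-80' into (start, end) byte strings."""
    start_hex, end_hex = shard.split("-")
    return bytes.fromhex(start_hex), bytes.fromhex(end_hex)


def shard_contains(shard, keyspace_id):
    """True if start <= keyspace_id < end, comparing bytes lexicographically."""
    start, end = parse_key_range(shard)
    # An empty start means "from the minimum"; an empty end means "to the maximum".
    return (start == b"" or keyspace_id >= start) and (end == b"" or keyspace_id < end)


def find_shard(shards, keyspace_id):
    """Return the first shard whose key range contains the keyspace ID."""
    return next(s for s in shards if shard_contains(s, keyspace_id))


ksid = numeric_keyspace_id(12345)
print(ksid.hex(), find_shard(["-80", "80-"], ksid))
```

With only two shards, every user_id below 2^63 maps to -80, which shows concretely how a numeric vindex concentrates small sequential IDs on one shard.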

When you define a sequence in your VSchema, Vitess backs it with a special sequence table that lives in an unsharded keyspace. This is crucial for sharded primary keys: even if rows are inserted into two different shards at roughly the same time, their IDs are unique across the entire cluster. Vitess achieves this without a round trip to MySQL for every ID by reserving IDs in blocks. The VTTablet serving the sequence table advances the stored counter by a whole block at once and then hands out the IDs in that block from memory.
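The block-reservation idea can be modeled in a few lines. This is a simplified sketch, not Vitess code: SequenceTable stands in for the stored counter and block size, and SequenceServer for whatever process hands out IDs. Having two servers share one table is a contrived setup, used here only to show that reserved blocks never overlap.

```python
class SequenceTable:
    """Stand-in for the stored state: the next unreserved ID and the block size."""

    def __init__(self, next_id=1, cache=100):
        self.next_id = next_id
        self.cache = cache

    def reserve_block(self):
        """Advance the counter by one block and return the half-open range."""
        start = self.next_id
        self.next_id += self.cache
        return start, self.next_id


class SequenceServer:
    """Hands out IDs from memory, touching the table once per block."""

    def __init__(self, table):
        self.table = table
        self._next = 0
        self._end = 0  # empty range, so the first call reserves a block

    def next_value(self):
        if self._next >= self._end:
            self._next, self._end = self.table.reserve_block()
        value = self._next
        self._next += 1
        return value


table = SequenceTable(next_id=1, cache=3)
a, b = SequenceServer(table), SequenceServer(table)
print([a.next_value(), b.next_value(), a.next_value(), b.next_value()])
```

The trade-off is that IDs are unique but not gap-free: if a server stops before using up its block, the remaining IDs in that block are simply skipped.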

The most surprising thing about Vitess’s sharding mechanism is how it handles transactions. While sharded databases often struggle with distributed transactions, Vitess offers a two-phase commit (2PC) protocol for transactions that span multiple shards. It is opt-in: by default, a multi-shard transaction is committed shard by shard on a best-effort basis, so a failure partway through can leave some shards committed and others not. Enabling 2PC ensures atomicity across shards, but it comes with a significant performance overhead. For this reason, Vitess strongly encourages designing applications to minimize cross-shard transactions, often by choosing sharding keys that keep related data on the same shard or by accepting eventual consistency for certain operations.
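For readers unfamiliar with two-phase commit, here is the textbook shape of the protocol in Python. It is a generic illustration under simplifying assumptions (no coordinator crashes, no timeouts, no durable transaction log), not a description of how Vitess implements 2PC internally. The Shard class and its fail_prepare flag are invented for the example.

```python
class Shard:
    """Toy participant that can be told to fail during the prepare phase."""

    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.state = "idle"

    def prepare(self):
        if self.fail_prepare:
            self.state = "aborted"
            return False
        self.state = "prepared"  # promises it can commit later
        return True

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(shards):
    """Commit on all shards or on none. Returns True on success."""
    prepared = []
    # Phase 1: every participant must vote yes.
    for shard in shards:
        if shard.prepare():
            prepared.append(shard)
        else:
            # Any no vote aborts the transaction everywhere.
            for done in prepared:
                done.rollback()
            return False
    # Phase 2: all voted yes, so the commit is applied on every shard.
    for shard in shards:
        shard.commit()
    return True


ok = two_phase_commit([Shard("-80"), Shard("80-")])
failed = two_phase_commit([Shard("-80"), Shard("80-", fail_prepare=True)])
print(ok, failed)
```

Even in this toy version, the cost is visible: every participant is contacted twice, which is the extra latency that makes single-shard transactions preferable.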

The next step after mastering VSchema and VTGate is understanding how Vitess handles schema changes across a sharded environment, typically using vtctlclient and ApplySchema.
