.. _schemaregistry_intro:

Schema Management
=================

|sr-long| provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving your `Avro® `__, `JSON Schema `__, and `Protobuf `__ :ref:`schemas `. It stores a versioned history of all schemas based on a specified :ref:`subject name strategy `, provides multiple :ref:`compatibility settings `, and allows schemas to evolve according to the configured compatibility settings, with expanded support for these schema types. It provides serializers that plug into |ak-tm| clients and handle schema storage and retrieval for |ak| messages that are sent in any of the supported formats.

|sr| lives outside of and separately from your |ak| brokers. Your producers and consumers still talk to |ak| to publish and read data (messages) to topics. Concurrently, they can also talk to |sr| to send and retrieve schemas that describe the data models for the messages.

.. figure:: ../images/schema-registry-and-kafka.png
   :align: center

   |sr-long| for storing and retrieving schemas

|sr| is a distributed storage layer for schemas that uses |ak| as its underlying storage mechanism. Some key design decisions:

* Assigns a globally unique ID to each registered schema. Allocated IDs are guaranteed to be monotonically increasing, but not necessarily consecutive.
* |ak| provides the durable backend, and functions as a write-ahead changelog for the state of |sr| and the schemas it contains.
* |sr| is designed to be distributed, with a single-primary architecture, and |zk| or |ak| coordinates primary election (based on the configuration).

.. seealso:: To see a working example of |sr|, check out the :ref:`Confluent Platform demo `. The demo shows you how to deploy a |ak| streaming ETL, including |sr|, using |ksqldb| for stream processing.

.. _sr-subjects-topics-primer:

Schemas, Subjects, and Topics
-----------------------------

First, a quick review of terms and how they fit in the context of |sr|: what is a |ak| `topic` versus a `schema` versus a `subject`.

.. include:: includes/terms-schemas-topics.rst

The :ref:`schema_registry_tutorial` shows an example of a :ref:`schema definition `.

Starting with |cp| 5.2.0, you can use |crep-full| to :ref:`migrate schemas ` from one |sr| to another, and automatically rename subjects on the target registry.

|ak| Serializers and Deserializers Background
---------------------------------------------

When sending data over the network or storing it in a file, you need a way to encode the data into bytes. Data serialization has a long history, but has evolved quite a bit over the last few years. People started with programming-language-specific serialization, such as Java serialization, which makes consuming the data in other languages inconvenient. They then moved to language-agnostic formats such as pure JSON, but without a strictly defined schema format.

Not having a strictly defined format has two significant drawbacks:

1. **Data consumers may not understand data producers:** The lack of structure makes consuming data in these formats more challenging because fields can be arbitrarily added or removed, and data can even be corrupted. This drawback becomes more severe the more applications or teams across an organization begin consuming a data feed: if an upstream team can make arbitrary changes to the data format at their discretion, then it becomes very difficult to ensure that all downstream consumers will (continue to) be able to interpret the data. What's missing is a "contract" (cf. schemas below) for data between the producers and the consumers, similar to the contract of an API.
2. **Overhead and verbosity:** Such formats are verbose because field names and type information have to be explicitly represented in the serialized format, despite the fact that they are identical across all messages.

A few cross-language serialization libraries have emerged that require the data structure to be formally defined by schemas. These libraries include `Avro `__, `Thrift `__, `Protocol Buffers `__, and `JSON Schema `__. The advantage of having a schema is that it clearly specifies the structure, type, and meaning (through documentation) of the data. With a schema, data can also be encoded more efficiently. Avro was the default supported format for |cp|.

For example, an Avro schema defines the data structure in a JSON format. The following Avro schema specifies a user record with two fields: ``name`` and ``favorite_number`` of type ``string`` and ``int``, respectively.

.. sourcecode:: json

   {"namespace": "example.avro",
    "type": "record",
    "name": "user",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": "int"}
    ]
   }

You can then use this Avro schema, for example, to serialize a Java object (POJO) into bytes, and deserialize these bytes back into the Java object. Avro requires a schema not only during data serialization, but also during data deserialization. Because the schema is provided at decoding time, metadata such as the field names doesn't have to be explicitly encoded in the data. This makes the binary encoding of Avro data very compact.
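To make the round trip concrete, here is a minimal sketch using the plain Apache Avro Java API with the ``user`` schema above. The class name and the sample field values are illustrative, not part of |cp|; note that the same schema is needed on both the encode and the decode path.

.. sourcecode:: java

   import java.io.ByteArrayOutputStream;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericDatumReader;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.BinaryDecoder;
   import org.apache.avro.io.BinaryEncoder;
   import org.apache.avro.io.DecoderFactory;
   import org.apache.avro.io.EncoderFactory;

   public class AvroRoundTrip {
     public static void main(String[] args) throws Exception {
       // Parse the user schema shown above.
       Schema schema = new Schema.Parser().parse(
           "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"user\","
           + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
           + " {\"name\": \"favorite_number\", \"type\": \"int\"}]}");

       // Build a record that conforms to the schema (sample values).
       GenericRecord user = new GenericData.Record(schema);
       user.put("name", "Alice");
       user.put("favorite_number", 42);

       // Serialize: only the field values are written, not the field names.
       ByteArrayOutputStream out = new ByteArrayOutputStream();
       BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
       new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
       encoder.flush();
       byte[] bytes = out.toByteArray();

       // Deserialize: the schema is required again to decode the bytes.
       BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
       GenericRecord decoded =
           new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
       System.out.println(decoded); // {"name": "Alice", "favorite_number": 42}
     }
   }

The encoded record here is just a length-prefixed UTF-8 string followed by a variable-length zig-zag integer, which is what makes the binary encoding compact.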
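In practice you rarely call the Avro encoder directly: the serializers mentioned above plug the same machinery into standard |ak| producers and consumers, and register and retrieve schemas in |sr| automatically. The following is a sketch of a producer configured this way; the broker and |sr| addresses and the ``users`` topic are placeholders, and the snippet assumes the Confluent ``kafka-avro-serializer`` artifact is on the classpath.

.. sourcecode:: java

   import java.util.Properties;

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.kafka.clients.producer.KafkaProducer;
   import org.apache.kafka.clients.producer.ProducerConfig;
   import org.apache.kafka.clients.producer.ProducerRecord;

   public class AvroProducerSketch {
     public static void main(String[] args) {
       Properties props = new Properties();
       props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
       props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
           "org.apache.kafka.common.serialization.StringSerializer");
       // The Avro serializer registers the record's schema with Schema Registry
       // (if not already registered) and embeds the returned schema ID in each
       // message, instead of shipping the full schema with every record.
       props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
           "io.confluent.kafka.serializers.KafkaAvroSerializer");
       props.put("schema.registry.url", "http://localhost:8081"); // placeholder

       Schema schema = new Schema.Parser().parse(
           "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"user\","
           + " \"fields\": [{\"name\": \"name\", \"type\": \"string\"},"
           + " {\"name\": \"favorite_number\", \"type\": \"int\"}]}");
       GenericRecord user = new GenericData.Record(schema);
       user.put("name", "Alice");
       user.put("favorite_number", 42);

       try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
         producer.send(new ProducerRecord<>("users", "Alice", user)); // topic is illustrative
       }
     }
   }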
.. _avro-json-proto-exten:

Avro, JSON, and Protobuf Supported Formats and Extensibility
------------------------------------------------------------

Avro was the original choice for the default supported schema format in |cp|, with |ak| serializers and deserializers provided for the Avro format.

.. include:: includes/serdes-formatter-intro.rst

To learn more, see :ref:`serializer_and_formatter`.

Schema ID Allocation
--------------------

Schema ID allocation always happens in the primary node, and schema IDs are always monotonically increasing.

If you are using |ak| primary election, the schema ID is always based on the last ID that was written to the |ak| store. During a primary re-election, batch allocation happens only after the new primary has caught up with all the records in the store ``<kafkastore.topic>``.

If you are using |zk| primary election, the ``/<schema.registry.zk.namespace>/schema_id_counter`` path stores the upper bound on the current ID batch, and new batch allocation is triggered by both primary election and exhaustion of the current batch. This batch allocation helps guard against potential zombie-primary scenarios (for example, if the previous primary had a GC pause that lasted longer than the |zk| timeout, triggering primary re-election).

.. _schemaregistry_design:

|ak| Backend
------------

.. include:: includes/backend.rst

.. _schemaregistry_single_master:

Single Primary Architecture
---------------------------

|sr| is designed to work as a distributed service using a single primary architecture. In this configuration, at most one |sr| instance is the primary at any given moment (ignoring pathological 'zombie primaries'). Only the primary is capable of publishing writes to the underlying |ak| log, but all nodes are capable of directly serving read requests. Secondary nodes serve registration requests indirectly, by forwarding them to the current primary and returning the response supplied by the primary.

Starting with |cp| 4.0, primary election is accomplished with the |ak| group protocol. (|zk|-based primary election is now deprecated.)

.. note:: Make sure not to mix election modes among the nodes in the same cluster. Doing so will lead to multiple primaries and issues with your operations.

---------------------------------
|ak| Coordinator Primary Election
---------------------------------

.. figure:: ../images/schema-registry-design-kafka.png
   :align: center

   |ak| based Schema Registry

|ak| based primary election is chosen when ``kafkastore.connection.url`` is not configured and the |ak| bootstrap brokers ``kafkastore.bootstrap.servers`` are specified. The |ak| group protocol chooses one among the primary-eligible nodes (``master.eligibility=true``) as the primary. |ak| based primary election should be used in all cases. (|zk| based leader election is deprecated. See :ref:`schemaregistry_zk_migration`.)

|sr| is also designed for multi-datacenter configurations. See :ref:`schemaregistry_mirroring` for more details.

---------------------
|zk| Primary Election
---------------------

.. important:: |zk| leader election is deprecated. Use |ak| leader election instead. See :ref:`schemaregistry_zk_migration` for full details.

.. _sr-high-availability-single-primary:

------------------------------------------
High Availability for Single Primary Setup
------------------------------------------

Many services in |cp| are effectively stateless (they store state in |ak| and load it on demand at start-up) and can redirect requests automatically. You can treat these services as you would any other stateless application and get high availability features effectively for free by deploying multiple instances. Each instance loads all of the |sr| state, so any node can serve a READ, and all nodes know how to forward requests to the primary for WRITEs.

A common pattern is to put the instances behind a single virtual IP or round-robin DNS, so that you can use a single URL in the ``schema.registry.url`` configuration while using the entire cluster of |sr| instances. This also makes it easy to handle changes to the set of servers without having to reconfigure and restart all of your applications. The same strategy applies to REST Proxy and |kconnect-long|. With just a few nodes, |sr| can fail over easily using a multi-node deployment and the single primary election protocol.
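As a sketch of the client side of this setup (host names here are illustrative): ``schema.registry.url`` accepts either the single load-balanced URL described above or a comma-separated list of instance URLs, so clients can fail over between |sr| instances even without a virtual IP in front of them.

.. sourcecode:: java

   import java.util.Properties;

   public class SchemaRegistryHaConfig {
     public static Properties serializerProps() {
       Properties props = new Properties();
       // Option 1: a single stable name in front of the whole cluster
       // (virtual IP or round-robin DNS), as described above:
       // props.put("schema.registry.url", "http://schema-registry.example.com:8081");

       // Option 2: list the instances directly; the serializers accept a
       // comma-separated list of URLs and fail over between them.
       props.put("schema.registry.url",
           "http://sr-1.example.com:8081,http://sr-2.example.com:8081");
       return props;
     }
   }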
.. _schemas_migrate_overview:

Migrate Schemas (|ccloud| and self-managed)
-------------------------------------------

Starting with |cp| 5.2.0, you can use :ref:`connect_replicator` to migrate schemas from a self-managed cluster to a target cluster that is either self-managed or in `Confluent Cloud `__.

- For a concept overview and quick start tutorial on migrating schemas from self-managed clusters to |ccloud|, see :ref:`migrate_self_managed_schemas_to_cloud`.
- For a demo of migrating schemas from one self-managed cluster to another, see :ref:`schemaregistry_migrate` and :ref:`quickstart-demos-replicator-schema-translation`.

License
-------

|sr| is a :ref:`component` of |cp|, available for a 30-day trial period without a license key, and thereafter under an :ref:`cp_enterprise_subs_license` as part of |cp|.

A |cp| enterprise license is required for the :ref:`confluentsecurityplugins_schema_registry_security_plugin`. To learn more, see :ref:`schema-registry-license-config` and :ref:`confluentsecurityplugins_schema_registry_security_quickstart`.

|sr| is also available under the `Confluent Community License `__.

Suggested Reading
-----------------

- :ref:`schema_registry_tutorial` (|ccloud| and on-premises)
- `Schema Registry and Confluent Cloud `__
- :devx-examples:`Run an automated Confluent Cloud quickstart with Avro, Protobuf, and JSON formats|cp-quickstart/README.md`
- :ref:`Use Control Center to manage schemas for on-premises deployments `
- :ref:`schema-registry-quickstart`
- :ref:`schema-registry-prod`
- :ref:`schema_validation`
- :ref:`schemaregistry_config`
- For developers: :ref:`serializer_and_formatter` and :ref:`schemaregistry_dev_guide`

----------
Blog Posts
----------

- `Schemas, Contracts, and Compatibility `_
- `17 Ways to Mess Up Self-Managed Schema Registry `_
- `Yes, Virginia, You Really Do Need a Schema Registry `_
- `How I Learned to Stop Worrying and Love the Schema `_

.. toctree::
   :maxdepth: 2
   :hidden:

   Installing and Configuring
   schema_registry_tutorial
   schema-validation
   Monitoring
   Schema Formats, Serializers, and Deserializers
   Single and Multi-Datacenter Setup
   avro
   Schemas in Control Center <../control-center/topics/schema>
   Schemas on Confluent Cloud
   installation/migrate
   Deleting Schemas
   Security
   Developer Guide
   Integrate Schemas from Connectors
   Changelog