<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Datapipelines on kmcd.dev</title><link>https://kmcd.dev/tags/datapipelines/</link><description>Recent content in Datapipelines on kmcd.dev</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>All Rights Reserved</copyright><lastBuildDate>Thu, 16 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://kmcd.dev/tags/datapipelines/index.xml" rel="self" type="application/rss+xml"/><item><title>Unknown Fields in Protobuf</title><link>https://kmcd.dev/posts/protobuf-unknown-fields/</link><pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate><guid>https://kmcd.dev/posts/protobuf-unknown-fields/</guid><description> 
                &lt;p> &lt;img hspace="5" src="https://kmcd.dev/posts/protobuf-unknown-fields/cover.svg" /> &lt;/p>
                
                How Protobuf unknown fields enable seamless schema evolution and robust middleware.
                </description><content:encoded><![CDATA[<div class="disclaimer">
    This article was originally published in March 2024. It was republished in April 2026 after some significant editing and modernization.
</div>

<p><a href="https://protobuf.dev/programming-guides/proto3/" rel="external">Protobuf</a> includes a feature known as <a href="https://protobuf.dev/programming-guides/proto3/#unknowns" rel="external"><strong>unknown fields</strong></a>. They act as a safety net when systems encounter data they weren&rsquo;t explicitly built to handle. Here is a breakdown of what they are and why they matter.</p>
<h2 id="what-are-protobuf-unknown-fields">What are Protobuf Unknown Fields?</h2>
<p>Your <code>.proto</code> file defines the expected structure, fields, and data types. But what happens when you parse a message and it contains fields that aren&rsquo;t in your current <code>.proto</code> definition?</p>
<p>These extra pieces of data are called <strong>unknown fields</strong>.</p>
<p>At a lower level, unknown fields are <strong>records in the serialized message (a field number, a wire type, and the encoded value) whose field numbers are not defined in the current schema</strong>.</p>
<p>This mechanism is what enables <strong>forward compatibility</strong>: an older version of your software can safely read, process, and forward data produced by a newer version of the schema without crashing or losing the new data.</p>
<blockquote>
<p><strong>Key idea:</strong> Unknown fields enable forward compatibility by default.</p>
</blockquote>
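<p>To make the "field number and wire type" idea concrete, here is a minimal sketch of how a tag splits into those two parts. The names <code>decodeTag</code>, <code>WIRE_TYPE_NAMES</code>, and <code>knownFieldNumbers</code> are made up for illustration and do not come from any Protobuf library.</p>

```typescript
// Wire type names as listed in the Protobuf encoding documentation.
const WIRE_TYPE_NAMES: { [wireType: number]: string } = {
  0: "VARINT",
  1: "I64",
  2: "LEN",
  3: "SGROUP",
  4: "EGROUP",
  5: "I32",
};

// Splits an already-decoded tag value into its two parts.
// On the wire, a tag is the varint encoding of (fieldNumber << 3) | wireType.
function decodeTag(tag: number): { fieldNumber: number; wireType: number; wireTypeName: string } {
  const wireType = tag & 0x7;
  return {
    fieldNumber: tag >>> 3,
    wireType,
    wireTypeName: WIRE_TYPE_NAMES[wireType] ?? "UNKNOWN",
  };
}

// Hypothetical older schema that only declares fields 1 and 2.
const knownFieldNumbers = [1, 2];

// 0x1a is the one-byte tag for field 3 with wire type LEN.
const tag = decodeTag(0x1a);
const isUnknown = !knownFieldNumbers.includes(tag.fieldNumber);
console.log(tag.fieldNumber, tag.wireTypeName, isUnknown); // 3 LEN true
```

<p>Because the wire type tells a parser how many bytes the value occupies, it can step over a field it has never heard of without knowing what the field means.</p>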
<hr>
<h2 id="preserving-unknown-data">Preserving Unknown Data</h2>
<p>A key aspect of unknown fields is how they behave during message manipulation.</p>
<p>If you receive a message with unknown fields, parse it, and re-serialize it to forward to another system, Protobuf defaults to <strong>retaining the unknown fields and writing them back out alongside the known ones</strong>. This ensures the receiving system gets the complete payload.</p>
<p>If this didn&rsquo;t happen, you could accidentally clear field values set by another part of the system.</p>
<p>This forwarding capability also applies when <strong>persisting messages</strong>, as long as they remain in <strong>binary Protobuf format</strong>. If you store and later reload the binary payload, the unknown fields are preserved.</p>
<blockquote>
<p><strong>Key idea:</strong> Binary Protobuf preserves unknown fields end-to-end.</p>
</blockquote>
<blockquote>
<p><strong>Historical Note:</strong> Protobuf v3 initially tried to simplify the specification by removing several proto2 features, but real-world usage forced the maintainers to walk the biggest ones back. Early versions of proto3 dropped unknown fields entirely, but this was reversed in v3.5. Similarly, proto3 initially removed the <code>optional</code> keyword, but brought it back in v3.15 after developers struggled to distinguish between a field being unset and a field just having a zero value. Conflating the two is <a href="https://en.wikipedia.org/wiki/Null_Island" rel="external">a classic programming mistake</a>.</p>
</blockquote>
<hr>
<h2 id="json-comparison-what-actually-breaks">JSON Comparison (What Actually Breaks)</h2>
<p>Consider a scenario where a new field, <code>email</code>, is added to a user object. The backend is updated, but the frontend is not.</p>
<p>The issue in JSON systems is not JSON itself, but rather <strong>typed deserialization</strong>.</p>
<div class="highlight"><pre tabindex="0" style="color:#d8dee9;background-color:#2e3440;"><code class="language-json" data-lang="json"><span style="display:flex;"><span><span style="color:#eceff4">{</span>
</span></span><span style="display:flex;"><span>  <span style="color:#81a1c1">&#34;user&#34;</span><span style="color:#eceff4">:</span> <span style="color:#eceff4">{</span>
</span></span><span style="display:flex;"><span>    <span style="color:#81a1c1">&#34;id&#34;</span><span style="color:#eceff4">:</span> <span style="color:#a3be8c">&#34;0edc0903-9e31-47be-adad-1dfc434ca2d3&#34;</span><span style="color:#eceff4">,</span>
</span></span><span style="display:flex;"><span>    <span style="color:#81a1c1">&#34;name&#34;</span><span style="color:#eceff4">:</span> <span style="color:#a3be8c">&#34;Bob&#34;</span><span style="color:#eceff4">,</span>
</span></span><span style="display:flex;"><span>    <span style="color:#81a1c1">&#34;email&#34;</span><span style="color:#eceff4">:</span> <span style="color:#a3be8c">&#34;bob@example.com&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#eceff4">}</span>
</span></span><span style="display:flex;"><span><span style="color:#eceff4">}</span>
</span></span></code></pre></div><p>If the frontend maps this into a typed structure:</p>
<div class="highlight"><pre tabindex="0" style="color:#d8dee9;background-color:#2e3440;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span><span style="color:#81a1c1;font-weight:bold">class</span> User <span style="color:#eceff4">{</span>
</span></span><span style="display:flex;"><span>  id: <span style="color:#81a1c1">string</span><span style="color:#eceff4">;</span>
</span></span><span style="display:flex;"><span>  name: <span style="color:#81a1c1">string</span><span style="color:#eceff4">;</span>
</span></span><span style="display:flex;"><span><span style="color:#eceff4">}</span>
</span></span></code></pre></div><p>The unknown field (<code>email</code>) is dropped during deserialization. When the object is sent back:</p>
<div class="highlight"><pre tabindex="0" style="color:#d8dee9;background-color:#2e3440;"><code class="language-json" data-lang="json"><span style="display:flex;"><span><span style="color:#eceff4">{</span>
</span></span><span style="display:flex;"><span>  <span style="color:#81a1c1">&#34;user&#34;</span><span style="color:#eceff4">:</span> <span style="color:#eceff4">{</span>
</span></span><span style="display:flex;"><span>    <span style="color:#81a1c1">&#34;id&#34;</span><span style="color:#eceff4">:</span> <span style="color:#a3be8c">&#34;0edc0903-9e31-47be-adad-1dfc434ca2d3&#34;</span><span style="color:#eceff4">,</span>
</span></span><span style="display:flex;"><span>    <span style="color:#81a1c1">&#34;name&#34;</span><span style="color:#eceff4">:</span> <span style="color:#a3be8c">&#34;Bob&#34;</span>
</span></span><span style="display:flex;"><span>  <span style="color:#eceff4">}</span>
</span></span><span style="display:flex;"><span><span style="color:#eceff4">}</span>
</span></span></code></pre></div><p>The <code>email</code> field is lost.</p>
<blockquote>
<p><strong>Key idea:</strong> Typed JSON pipelines often drop unknown fields during reserialization.</p>
</blockquote>
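<p>As a rough illustration of the lossy round trip, the sketch below uses a hand-written mapper. The <code>parseUser</code> function is hypothetical, and an <code>interface</code> stands in for the class above to keep the example short. Any mapper that copies only declared properties behaves the same way.</p>

```typescript
interface User {
  id: string;
  name: string;
}

// A typical hand-written mapper: it copies only the properties the type declares.
function parseUser(json: string): User {
  const raw = JSON.parse(json);
  return { id: String(raw.user.id), name: String(raw.user.name) };
}

// Payload from the updated backend, including the new email field.
const incoming = JSON.stringify({
  user: {
    id: "0edc0903-9e31-47be-adad-1dfc434ca2d3",
    name: "Bob",
    email: "bob@example.com",
  },
});

const user = parseUser(incoming);
const outgoing = JSON.stringify({ user });

console.log(outgoing.includes("email")); // false: the newer field did not survive
```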
<hr>
<h2 id="protobuf-behavior">Protobuf Behavior</h2>
<p>With Protobuf, the same scenario behaves differently.</p>
<p>Even if the frontend does not know about the <code>email</code> field, it is preserved internally:</p>
<div class="highlight"><pre tabindex="0" style="color:#d8dee9;background-color:#2e3440;"><code class="language-typescript" data-lang="typescript"><span style="display:flex;"><span>Symbol<span style="color:#eceff4">(</span><span style="color:#81a1c1;font-weight:bold">@bufbuild</span><span style="color:#81a1c1">/</span>protobuf<span style="color:#81a1c1">/</span><span style="color:#81a1c1">unknown</span><span style="color:#81a1c1">-</span>fields<span style="color:#eceff4">)</span><span style="color:#81a1c1">:</span> <span style="color:#eceff4">[</span>
</span></span><span style="display:flex;"><span>  <span style="color:#eceff4">{</span><span style="color:#b48ead">0</span><span style="color:#81a1c1">:</span> <span style="color:#eceff4">{</span>no:<span style="color:#81a1c1">3</span><span style="color:#eceff4">,</span> wire_type:<span style="color:#81a1c1">2</span><span style="color:#eceff4">,</span> data: <span style="color:#81a1c1">Uint8Array</span><span style="color:#eceff4">(</span><span style="color:#b48ead">14</span><span style="color:#eceff4">)}}</span>
</span></span><span style="display:flex;"><span><span style="color:#eceff4">]</span>
</span></span></code></pre></div><p><em>(Note: This specific <code>Symbol</code> representation is how the <code>@bufbuild/protobuf</code> implementation manages it under the hood. Other JS/TS generators might expose this data slightly differently, but the underlying concept remains the same.)</em></p>
<ul>
<li><code>no: 3</code> → field number (email)</li>
<li><code>wire_type: 2</code> → length-delimited (used for strings)</li>
<li><code>data</code> → raw encoded value</li>
</ul>
<p>When the message is re-encoded:</p>
<div class="highlight"><pre tabindex="0" style="color:#d8dee9;background-color:#2e3440;"><code class="language-text" data-lang="text"><span style="display:flex;"><span>1:LEN {&#34;0edc0903-9e31-47be-adad-1dfc434ca2d3&#34;}
</span></span><span style="display:flex;"><span>2:LEN {&#34;Bob&#34;}
</span></span><span style="display:flex;"><span>3:LEN {&#34;bob@example.com&#34;}
</span></span></code></pre></div><p>The unknown field survives the round trip.</p>
<blockquote>
<p><strong>Key idea:</strong> Unknown fields are preserved even when not understood.</p>
</blockquote>
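<p>The round trip can be sketched without any library. The code below is a simplified, hand-rolled codec, not how <code>@bufbuild/protobuf</code> is implemented. It assumes an older schema with two string fields (<code>id</code> = 1, <code>name</code> = 2) and handles only length-delimited fields. Every name in it (<code>OldUser</code>, <code>decodeOldUser</code>, <code>encodeOldUser</code>) is invented for the example.</p>

```typescript
// Hypothetical in-memory shape for the older schema: two known string fields
// plus the raw bytes of every record the schema does not describe.
interface OldUser {
  id: string;
  name: string;
  unknown: Uint8Array[]; // each entry is a complete record: tag + length + value
}

// Reads a base-128 varint at `pos`; returns the value and the next position.
function readVarint(buf: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  let shift = 0;
  while (true) {
    if (pos >= buf.length) throw new Error("truncated varint");
    const byte = buf[pos++];
    value += (byte & 0x7f) * 2 ** shift;
    shift += 7;
    if ((byte & 0x80) === 0) return { value, next: pos };
  }
}

function encodeVarint(value: number): number[] {
  const out: number[] = [];
  while (value > 0x7f) {
    out.push((value & 0x7f) | 0x80);
    value = Math.floor(value / 128);
  }
  out.push(value);
  return out;
}

// Encodes one length-delimited string field (wire type 2).
function encodeString(fieldNumber: number, text: string): number[] {
  const bytes = new TextEncoder().encode(text);
  return encodeVarint(fieldNumber * 8 + 2).concat(encodeVarint(bytes.length), Array.from(bytes));
}

function decodeOldUser(buf: Uint8Array): OldUser {
  const user: OldUser = { id: "", name: "", unknown: [] };
  const text = new TextDecoder();
  let pos = 0;
  while (pos < buf.length) {
    const start = pos;
    const tag = readVarint(buf, pos);
    const fieldNumber = Math.floor(tag.value / 8);
    if (tag.value % 8 !== 2) throw new Error("this sketch only handles LEN fields");
    const len = readVarint(buf, tag.next);
    const end = len.next + len.value;
    if (end > buf.length) throw new Error("truncated field");
    const payload = buf.subarray(len.next, end);
    if (fieldNumber === 1) user.id = text.decode(payload);
    else if (fieldNumber === 2) user.name = text.decode(payload);
    else user.unknown.push(buf.slice(start, end)); // keep the record verbatim
    pos = end;
  }
  return user;
}

function encodeOldUser(user: OldUser): Uint8Array {
  let out = encodeString(1, user.id).concat(encodeString(2, user.name));
  for (const record of user.unknown) out = out.concat(Array.from(record));
  return new Uint8Array(out);
}

// A newer producer writes id (1), name (2), and email (3).
const fromNewProducer = new Uint8Array(
  encodeString(1, "0edc0903-9e31-47be-adad-1dfc434ca2d3").concat(
    encodeString(2, "Bob"),
    encodeString(3, "bob@example.com"),
  ),
);

const parsed = decodeOldUser(fromNewProducer);
parsed.name = "Robert"; // the older service edits a field it knows about
const forwarded = encodeOldUser(parsed);
```

<p>The older code never interprets field 3. It copies the bytes on the way in and writes them back on the way out, so a consumer on the newer schema still finds the email in <code>forwarded</code>.</p>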
<hr>
<h2 id="the-middleware-advantage">The Middleware Advantage</h2>
<p>Unknown fields shine in internal, middleware-heavy architectures.</p>
<p>Example:</p>
<ul>
<li>API Gateway reads <code>id</code> for routing</li>
<li>Logging service reads <code>trace_id</code></li>
<li>Downstream service understands full schema including new fields</li>
</ul>
<p>Intermediate services can safely:</p>
<ol>
<li>Unmarshal using an older schema</li>
<li>Read known fields</li>
<li>Forward the message unchanged</li>
</ol>
<p>No coordination is required when new fields are added upstream.</p>
<blockquote>
<p><strong>Key idea:</strong> Internal middleware can stay stable while schemas evolve.</p>
</blockquote>
<hr>
<h2 id="observability-a-signal-for-upgrades">Observability: A Signal for Upgrades</h2>
<p>Beyond forwarding data safely, unknown fields also provide a useful observability signal.</p>
<p>When an API gateway or a downstream service detects unknown fields in incoming payloads, it serves as a clear telemetry signal: a client or upstream service is sending extra information because it is using a newer schema.</p>
<p>Instead of crashing or silently dropping the data, the service can log the presence of these unknown fields. You can use this data to trigger alerts, track the rollout progress of new features across your architecture, and pinpoint exactly which legacy services are lagging behind and due for an upgrade.</p>
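<p>One way to turn this into a metric is to count unknown field numbers per message type. The sketch below assumes the decoder exposes unknown fields as a list of records with a field number. The <code>DecodedMessage</code> shape, the <code>recordUnknownFields</code> helper, and the in-memory counter are all hypothetical stand-ins for whatever your library and metrics client provide.</p>

```typescript
// Hypothetical decoded shapes, loosely modeled on the (number, wire type, data)
// triple shown earlier; real libraries expose this differently.
interface UnknownField {
  no: number;
  wireType: number;
  data: Uint8Array;
}

interface DecodedMessage {
  typeName: string;
  unknownFields: UnknownField[];
}

// In-process counter keyed by "typeName#fieldNumber"; a real service would
// report through its own metrics client instead.
const unknownFieldCounts: { [key: string]: number } = {};

function recordUnknownFields(message: DecodedMessage): string[] {
  const seen: string[] = [];
  for (const field of message.unknownFields) {
    const key = `${message.typeName}#${field.no}`;
    if (!seen.includes(key)) seen.push(key); // count each field once per message
  }
  for (const key of seen) {
    unknownFieldCounts[key] = (unknownFieldCounts[key] ?? 0) + 1;
  }
  return seen;
}

const sample: DecodedMessage = {
  typeName: "example.User",
  unknownFields: [{ no: 3, wireType: 2, data: new Uint8Array(0) }],
};
recordUnknownFields(sample);
recordUnknownFields(sample);
console.log(unknownFieldCounts); // "example.User#3" has been seen twice
```

<p>A counter that keeps climbing for one field number on one service is a reasonable hint that the service is running behind the schema its callers use.</p>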
<hr>
<h2 id="a-note-on-json-serialization-and-object-re-use">A Note on JSON Serialization and Object Re-use</h2>
<p>There are a couple of important exceptions where unknown fields get lost.</p>
<p>First, unknown field preservation applies <strong>only to binary Protobuf serialization</strong>. If you convert from binary to JSON (e.g., using <code>protojson</code> in Go or <code>toJson</code> in TypeScript), unknown fields are <strong>dropped</strong> during the encoding process. Conversely, when unmarshaling JSON back into Protobuf, many libraries are strict by default. For instance, Go&rsquo;s <code>protojson.Unmarshal</code> returns an error if it encounters unknown fields in the JSON payload unless you opt out by setting <code>DiscardUnknown: true</code> on <code>protojson.UnmarshalOptions</code>. The Protobuf JSON mapping keys fields by name, and an unknown field has only a number and a wire type, so there is no name to emit it under.</p>
<p>Second, preserving these fields during binary serialization requires that you re-use the exact same object for re-serialization. If you read a message, pull out the known fields, and map them into a freshly created object to send downstream, the unknown fields tied to the original object will be left behind.</p>
<blockquote>
<p><strong>Key idea:</strong> Binary preserves and JSON drops. Always re-use the original object if you want to keep unknown fields intact.</p>
</blockquote>
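<p>The object re-use pitfall is easy to reproduce. This sketch assumes a decoded message that carries its unknown fields in an <code>unknownFields</code> array. That shape and both helper functions are hypothetical, but the failure mode is the same in any library that attaches unknown data to the parsed object.</p>

```typescript
interface UnknownField {
  no: number;
  wireType: number;
  data: Uint8Array;
}

interface ParsedUser {
  id: string;
  name: string;
  unknownFields: UnknownField[];
}

// Stand-in for what a decoder might hand back after parsing a newer payload.
const parsed: ParsedUser = {
  id: "0edc0903-9e31-47be-adad-1dfc434ca2d3",
  name: " Bob ",
  unknownFields: [{ no: 3, wireType: 2, data: new TextEncoder().encode("bob@example.com") }],
};

// Lossy: copying known fields into a fresh object leaves the unknown fields behind.
function copyKnownFields(user: ParsedUser): ParsedUser {
  return { id: user.id, name: user.name.trim(), unknownFields: [] };
}

// Preserving: mutate the original so the unknown fields travel with it.
function updateInPlace(user: ParsedUser): ParsedUser {
  user.name = user.name.trim();
  return user;
}

const lossy = copyKnownFields(parsed);
const preserved = updateInPlace(parsed);
console.log(lossy.unknownFields.length, preserved.unknownFields.length); // 0 1
```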
<hr>
<h2 id="databases-and-security">Databases and Security</h2>
<p>The theoretical elegance of unknown fields often collides with the messy reality of databases and security perimeters. In practice, relying on unknown fields breaks down entirely in a few critical scenarios.</p>
<p>First, consider database persistence. If clients are trying to store extra data, and a backend service parses a Protobuf message to map it to standard relational database columns, those unknown fields are absolutely gone. There is no magic column for data your database schema does not know about.</p>
<p>The only way to achieve true end-to-end preservation is to store the entire serialized Protobuf message directly in the database as a BLOB. Some teams do this, but blindly storing data you haven&rsquo;t validated and don&rsquo;t even recognize is highly dangerous.</p>
<p>Allowing unknown fields to propagate unchecked from external sources is a significant security risk. While they are a powerful tool inside clearly defined, trusted internal pipelines, accepting them from the open web opens your system up to data smuggling. It allows malicious actors to sneak unvalidated payloads into unknown fields to bypass validation layers that only inspect known schema structures. If your systems blindly unmarshal, store, and forward this data, older services act as unwitting mules for malicious input.</p>
<p>Because of these risks, a common security posture is to filter aggressively at the edge. API gateways and ingress proxies should explicitly discard unknown fields before the data ever reaches internal microservices.</p>
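<p>For illustration only, here is roughly what discarding unknown fields at the edge amounts to at the byte level. This is a minimal sketch: <code>stripUnknownFields</code> is a made-up helper and it inspects only top-level fields, so unknown data nested inside a known sub-message would pass through untouched. It also rejects group wire types. Many Protobuf libraries ship their own discard option, and that should be preferred over hand-rolled parsing in production.</p>

```typescript
// Reads a base-128 varint at `pos`; returns the value and the next position.
function readVarint(buf: Uint8Array, pos: number): { value: number; next: number } {
  let value = 0;
  let shift = 0;
  while (true) {
    if (pos >= buf.length) throw new Error("truncated varint");
    const byte = buf[pos++];
    value += (byte & 0x7f) * 2 ** shift;
    shift += 7;
    if ((byte & 0x80) === 0) return { value, next: pos };
  }
}

// Returns the position just past a value of the given wire type.
function skipValue(buf: Uint8Array, pos: number, wireType: number): number {
  switch (wireType) {
    case 0: // VARINT
      return readVarint(buf, pos).next;
    case 1: // I64
      return pos + 8;
    case 2: { // LEN
      const len = readVarint(buf, pos);
      return len.next + len.value;
    }
    case 5: // I32
      return pos + 4;
    default:
      throw new Error(`unsupported wire type ${wireType}`);
  }
}

// Keeps only the top-level records whose field numbers the edge schema declares.
function stripUnknownFields(buf: Uint8Array, knownFields: number[]): Uint8Array {
  const kept: number[] = [];
  let pos = 0;
  while (pos < buf.length) {
    const start = pos;
    const tag = readVarint(buf, pos);
    const fieldNumber = Math.floor(tag.value / 8);
    pos = skipValue(buf, tag.next, tag.value % 8);
    if (pos > buf.length) throw new Error("truncated field");
    if (knownFields.includes(fieldNumber)) {
      for (let i = start; i < pos; i++) kept.push(buf[i]);
    }
  }
  return new Uint8Array(kept);
}

// Fields 1 ("a") and 2 ("b") are known; field 3 ("c") is not.
const payload = new Uint8Array([0x0a, 0x01, 0x61, 0x12, 0x01, 0x62, 0x1a, 0x01, 0x63]);
const sanitized = stripUnknownFields(payload, [1, 2]);
console.log(sanitized.length); // 6: the three bytes of field 3 were removed
```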
<hr>
<h2 id="conclusion">Conclusion</h2>
<p>Unknown fields provide a powerful mechanism for <strong>forward compatibility</strong> in distributed systems. They allow internal systems to evolve independently, act as a clear signal for required upgrades, reduce coordination overhead, and simplify middleware design.</p>
<p>However, they are not a substitute for validation, schema discipline, or proper security boundaries. Use them intentionally in trusted internal pipelines, but never trust them at the edge.</p>
]]></content:encoded></item></channel></rss>