“Who controls the past controls the future. Who controls the present controls the past.”
-Orwell sometimes by way of RATM
Who controls the data controls the present. Consider the era we live in. Information zips around us at unprecedented scale and velocity because of the Internet. Artificial intelligence has uncapped our ability to programmatically interpret and even generate information feeds using software. Given that nation-states, corporations, and we ourselves are information machines, it’s no wonder then that data has become the most potent resource of our time, from unlocking scientific discovery to toppling political regimes.
At Computable Labs, we’re working hard to make data a shared resource that is as openly accessible as possible. We believe it’s the biggest lever we have for creating social equity while delivering business value in the modern world. To that end, we’re building open source technology to make it not just possible, but also profitable to trade and exchange data. With the right trustless Internet infrastructure in place, we believe an entire industry will be created around an open global marketplace of data and algorithms.
Decentralized networks will be the heart (and perhaps also the soul) of this brave new world. Blockchain technologies offer new Internet primitives for economic ownership and secure transactions of digital goods online. As a result, concepts like fractional ownership of crowdsourced online datasets and machine learning models become realizable, and data markets can be created around them. Still, there is much work to be done to extend existing blockchain technology to fully implement these ideas, so that’s what we’ve set out to do.
In this post, we want to share an initial sneak peek at our development plans. We put a lot of consideration into establishing the right order of operations, resulting in roughly three phases of development. Our guiding light in building this roadmap ended up being pretty simple. Each phase outputs the technological building blocks we need to successfully complete the next one.
Phase 1: Data Market Protocol
We will develop a permissionless protocol that lets anyone easily establish online data markets in any domain.
The first phase of our work will involve the development of an open protocol for creating and operating data markets and exchanges. We have already completed a preliminary design and will release the first of several formal technical publications soon. Anyone will be able to use this protocol to tokenize data and invite collaborators to build a crowdsourced data market together. We expect some projects will focus on verticals, e.g. genomics, while others will focus on functions, e.g. object detection, perhaps for self-driving cars.
While we will experiment by initializing some data markets ourselves, we intend to fully support other projects and companies that want to use our protocol for their own data market applications. Our primary goal in this first phase is simply to drive adoption amongst developers, whether as individuals or as part of organizations. We want to help pollinate the emerging ecosystem of data markets. We’re also looking for as much market feedback as possible, and there’s no better way than to have the market tell us what we can do to help them.
This first protocol is only the tip of the iceberg for what we ultimately want to accomplish. However, it is crucial because it delivers a couple of technical primitives that our future work will rely on. Namely, we will be able to turn data into an asset well-suited for marketplace transactions and exchange, and we will formalize an incentive structure for crowdsourced construction of data markets.
Creating data markets for machine learning computations requires not only supplying data, but also curating and structuring it. These latter two functions are vital and often overlooked in many data market proposals. Curation ensures data is relevant and consistent. Structuring ensures it is actually machine learnable. Just as important is providing the necessary incentives for people and organizations to come together to collaboratively create these projects. Like any marketplace, data markets will face cold start challenges (perhaps more familiarly framed as the chicken-and-egg problem). Why would any data market contributor elect to contribute first? Historically, all sorts of interesting things happen when contractual ownership rights are offered on Day 0 in exchange for services. Banks loan money to homebuyers, wildcatters drill for oil, and employees work for startups. In the decentralized world, ownership is reflected by tokens, and (smart) contracts are enforced by software. During Phase 1, we will extend these concepts to work for ownership and transactions of data and algorithms, which can be thought of as data as well.
More concretely, starting a data market with our protocol generates a custom non-fungible token associated with that market and lets its creators parameterize token issuance. We’ll save the juicy details for some upcoming posts and publications, but suffice it to say that we have architected the protocol to maximize design flexibility. Understanding of token economics is rapidly evolving, so we believe it’s important to give data market creators leeway in how they want to incentivize the participants of their projects.
By the end of Phase 1, it will be possible to create crowdsourced data APIs that are owned by their contributors on the permissionless Internet. The economics of these data resources will be equitable. Those who contribute more to a data market receive more ownership in that market and therefore a larger financial payout as downstream users buy access to the data market.
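To make the payout idea concrete, here is a minimal sketch of a pro rata split. It is an illustration only, not the protocol's actual mechanism: the function name `distribute_payout`, the `holdings` mapping, and the token amounts are all hypothetical, and it assumes ownership is expressed as a simple token balance per contributor.

```python
from fractions import Fraction


def distribute_payout(holdings, payment):
    """Split an access payment among contributors in proportion to holdings.

    `holdings` maps a contributor name to a (hypothetical) token balance.
    Fractions are used so the shares always sum exactly to `payment`.
    """
    total = sum(holdings.values())
    if total <= 0:
        raise ValueError("market has no outstanding tokens")
    return {
        owner: Fraction(tokens, total) * payment
        for owner, tokens in holdings.items()
    }


# Hypothetical market: three contributors with different ownership stakes.
holdings = {"alice": 600, "bob": 300, "carol": 100}

# A downstream user pays 50 units to access the data market.
payouts = distribute_payout(holdings, 50)

for owner, amount in payouts.items():
    print(f"{owner}: {float(amount):.2f}")
```

A real market would likely let its creators tune how tokens are issued in the first place; this sketch only covers the final step of dividing revenue among existing holders.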
One practical challenge with data markets is data replication. The ability to copy and paste data limits its value as a tradable asset. It also holds some enterprises back from embracing more commercial data trade activities. While many businesses see the appeal in monetizing or exchanging data, they often remain leery of exposing their data to any potential loss of control. Our first-generation data market applications enabled by Phase 1 and Phase 2, which we describe next, will only allow for one-way privacy. That is, either the buyer or the seller will have to expose their asset (data or model) and trust the other side of a transaction. This will work just fine for certain data markets, but it will limit the overall opportunity until our privacy technologies of Phase 3 come online to enable fully trustless transactions.
Phase 2: AI Smart Contracts
We will make blockchain networks compatible with machine learning computations.
One modality of data markets involves selling computational access to data rather than outright transfer of it (as is the case during Phase 1). For example, a data market could be constructed to sell training services. Buyers would submit models to a network that would use its data to train and return improved models. Alternatively, a model market could offer inference as a service. That is, buyers can rent access to state-of-the-art machine learning models without ever taking possession of those models. In effect, Phase 2 data markets keep sellers’ assets private and secure, whereas Phase 1 data markets protect buyers’ assets from exposure.
There’s a lot of value in enabling Phase 2 designs. We’ve seen considerable market interest, for perhaps obvious reasons, from a number of businesses we’ve talked to that want to monetize their data without compromising it. Here’s the rub. Phase 2 data markets require blockchain coordination of machine learning computations, but blockchains and machine learning are at present quite incompatible. For example, it is non-trivial for heterogeneous nodes to deterministically arrive at the same machine learning outputs. This is obviously debilitating for decentralized networks that rely on consensus protocols. There are several other limitations, which we will discuss in future posts, but this is why the crux of Phase 2 will be making machine learning and blockchains more technologically compatible. Successfully doing so will significantly expand our market of users by accommodating sellers who want to protect their data from exposure.
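The determinism problem is easy to reproduce on a single machine. Floating-point addition is not associative, so two nodes that sum the same numbers in a different order (which happens naturally across different hardware and parallel schedules) can disagree in the last bits. The sketch below shows the mismatch and one commonly used mitigation, integer fixed-point arithmetic. The `SCALE` constant and `to_fixed` helper are assumptions chosen for illustration, not a description of the approach Phase 2 will take.

```python
# Two "nodes" add the same three values, but in a different order.
values = [0.1, 0.2, 0.3]

node_a = (values[0] + values[1]) + values[2]
node_b = values[0] + (values[1] + values[2])

# The results differ in the last bits, so the nodes would not reach
# bit-for-bit agreement on this output.
floats_agree = node_a == node_b
print("float results agree:", floats_agree)

# One possible mitigation: represent values as scaled integers.
# Integer addition is associative, so every summation order agrees.
SCALE = 10**6  # hypothetical precision: six decimal places


def to_fixed(x):
    """Convert a float to a scaled integer (illustrative helper)."""
    return round(x * SCALE)


fixed = [to_fixed(v) for v in values]
fixed_a = (fixed[0] + fixed[1]) + fixed[2]
fixed_b = fixed[0] + (fixed[1] + fixed[2])

fixed_agree = fixed_a == fixed_b
print("fixed-point results agree:", fixed_agree)
```

Fixed-point arithmetic trades away range and precision for reproducibility, which is one reason making full machine learning workloads deterministic across heterogeneous nodes is non-trivial.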
One thing that Phase 2 does not solve is the trust problem. It simply shifts trust requirements to buyers instead, and again only one-way privacy is guaranteed. Achieving bidirectional privacy becomes the focus for Phase 3, and the technology developed in Phase 2 plays a crucial role for that.
Phase 3: Private Data Market Protocol
We will build a trustless third-party network for private data transactions and computations.
The untapped opportunity of data markets will be fully unleashed when the right privacy technologies come to fruition. We’ve talked to a number of businesses across several industries, and the message was clear. There is a strong appetite for finding new ways to monetize data and models, but businesses want to do so without giving those assets away. For example, is it possible to sell data to train a customer’s machine learning model without exposing the raw data (or model)? Conversely, how can we monetize a model and sell inference capabilities without revealing model weights and parameters? For now, companies acquiesce to exposing data through APIs or high-trust private data deals since there are simply no other alternatives. In the future, data commerce will truly be unlocked when both buyers and sellers can feel confident about exchanging data-related services without having to expose their precious assets.
For this reason, we view a trustless blockchain network with privacy capabilities as a key foundation for an emerging industry built around data marketplaces. Decentralization will likely play the lead role in this story for a couple of reasons. First, a single company brokering data transactions is not trustless. Quite the contrary: both buyers and sellers would have to trust this centralized third party to neither maliciously nor mistakenly leak data. Second, decentralized networks reflect a technology paradigm that is particularly suitable for both robust and secure transactions. They eliminate single points of failure, and their distributed nature lends itself well to powerful privacy technologies such as secure multi-party computation. Our approach to privacy-preserving computations will combine a number of different methods, but distributed systems security techniques will certainly play a core role.
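As a rough illustration of why a distributed network suits secure multi-party computation, here is a minimal sketch of additive secret sharing, one of the simplest building blocks in that family. It is a textbook technique shown under simplifying assumptions (honest nodes, addition only) and is not a description of the protocol planned for Phase 3. The names `share` and `reconstruct`, the modulus, and the example values are all chosen for illustration.

```python
import secrets

# All arithmetic is done modulo a large prime (2**61 - 1 is a Mersenne prime).
PRIME = 2**61 - 1


def share(secret, n_parties):
    """Split `secret` into n random-looking shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recover the secret by adding all shares together mod PRIME."""
    return sum(shares) % PRIME


# Two data owners each split a private value across three network nodes.
a_shares = share(42, 3)
b_shares = share(100, 3)

# Each node adds the two shares it holds. No single node sees 42 or 100.
sum_shares = [(a + b) % PRIME for a, b in zip(a_shares, b_shares)]

# Only the combined result is revealed when the nodes pool their outputs.
total = reconstruct(sum_shares)
print(total)
```

Any subset of fewer than all three nodes learns nothing about the inputs, which is the sense in which spreading a computation across independent parties removes the need to trust any single one of them.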
In order for a trustless blockchain network to guarantee data privacy, both data transactions and computations must execute privately within that network. This in turn requires both tokenized data and blockchain-compatible machine learning, which is precisely why we’re working hard on Phase 1 and Phase 2 as we rapidly approach Phase 3. We already have some exciting plans for making privacy-preserving computations and transactions a reality on the permissionless web, and we’re excited to talk more about that in the coming months.
-The Computable Labs Team