Three Data Point Thursday

🐰 Airbyte's wrong turn; Metadata lake; A modern data stack in 5 mins; ThDPTh #41 🐰

Sven Balnojan
Oct 14, 2021

Hi,

This week I got caught off guard by Airbyte, a hot new data integration start-up I’ve been watching closely.

The company is betting that open source is the future of data integration, and yet it decided to close the source of its core.

Mmmmh.

I’m Sven, I collect “Data Points” to help understand & shape the future, one powered by data, not electricity anymore.

Sven’s Thoughts

If you only have 30 seconds to spare, here is what I would consider actionable insights for investors, data leaders, and data company founders.

  • Airbyte took a wrong “license” turn.

  • If you’re in the same situation, don’t fall for the tyranny of the OR; embrace the genius of the AND.

  • Think about buyer-side pricing and figure out how to drive community engagement to the max, without speed limits like closed cores.

  • Take a look at the idea of the metadata lake.

  • Then consider whether you should instead focus on how to connect the individual data pieces, the underlying “graph”.

  • There might be a business opportunity here both in providing & connecting metadata lakes if one wants such a thing.

  • Getting started & providing modularity are still key challenges in every single area of the data space. Again, I feel there is more than one business opportunity here.

Airbyte's Wrong Turn

🔄 What: Airbyte, a young and very quickly growing data integration startup, just changed the license on its core from MIT to ELv2 (the Elastic License 2.0), thereby moving away from open source to a protective license that does not allow others to monetize the core by offering it as a hosted service.

It’s a move that’s been a somewhat common reaction from open-source-based companies fearing “commoditization”. I’ve already written a lengthy piece about why I think these companies fall for “the tyranny of the OR”.

🐰 My perspective: Based on my research so far, this seems to be a move in the opposite direction of where the company has to go if it wants to win in this market, especially in the early stages. Worse, they might not notice it because they have good momentum going and will be on the rise for quite some time.

Airbyte rightly realizes that nailing the “connector” problem (or data snowflake problem, as I like to call it) is key to the data integration space. But the fear of big companies providing a hosted Airbyte solution drives them in the wrong direction.

If you as a company want to have as much “connector contribution” as possible, you’ll have to create strong incentives for the developers of the connectors. That means, as Tobi Lütke of Shopify puts it, leaving all the money on the table and giving it to the developers. That in turn means getting as many consumers onto your “platform” as possible, and that means getting widespread adoption!

Widespread adoption is what you get when big companies come and host your product, because you then get a lot of distribution for free. That means you will not make money on part of that usage, but that’s just a question of how you spin your business. It’s a huge uplift and a big incentive for your creators to create more connectors, which in turn will spin your flywheel faster.

What’s even worse is that keeping the connectors + specification open source while closing the core means the company just opened up an easy vector of attack: a competitor can replace the core with something truly open source, reuse the existing connectors, and drive adoption by giving more money to the developers.

Finally, the pricing strategy Airbyte mentioned in that announcement differs from what I previously took as a “buyer-side” strategy. Turns out they seem to go for a “company-size” pricing model. In my open-source pricing article, I also explore a bit why this actually hinders the profits of the company and the growth of the tool, and again opens a path to commoditization.

It’s one way to go, and other companies work their closed core quite well. Indeed, lots of companies don’t have any open source at all. But if you’re at that crossroads, you should think really deeply about why you actually fear “commoditization” (and read my post on it!), and whether it might actually be a great opportunity, or whether you can have it all: no commoditization while still having almost everything in an open core!

airbyte.io

Rise of the Metadata Lake

šŸŽ What: Metadata is data about data. Today, there are many different forms of metadata like performance metadata or user metadata. Prukalpa makes the case that we’re now at the time of metadata creation, driven by both the explosion of kinds of metadata, that justifies the creation of a metadata lake.

Whereas the data lake uses centralization to give easy access to all data, the metadata lake uses centralization to make valuable data pieces easy to identify. Key to that, of course, is a graph structure connecting pieces in the metadata lake as well as the data lake.

🐰 My perspective: I’m not sure Prukalpa is right. Very likely some organizations are at this point. But I also see that centralization in itself is not the final solution to any data problem. Barr Moses points out in her “data discovery 2.0” article that we need to acknowledge the distributed nature of data, even in the metadata layer.

I also feel like the graph structure connecting pieces is actually the crucial ingredient here; otherwise, there is no “value identification” throughout the metadata lake. But if the graph is the key ingredient, you don’t really need to centralize anything; you just need to take care of the graph.
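To make the “graph without centralization” idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the asset ids, the three per-system metadata stores, and the `MetadataGraph` class are all hypothetical and not taken from Prukalpa’s or Barr Moses’ articles. The graph holds nothing but edges between asset ids, while the metadata itself stays with the system that owns it.

```python
from collections import defaultdict, deque

# Hypothetical per-system metadata stores; each one stays with its owning system.
WAREHOUSE_METADATA = {"warehouse.orders": {"owner": "sales-eng", "rows": 120_000}}
BI_METADATA = {"bi.revenue_dashboard": {"owner": "analytics", "views_per_week": 340}}
API_METADATA = {
    "api.orders_endpoint": {"owner": "backend", "field_docs": {"st": "order status code"}}
}

SOURCES = {"warehouse": WAREHOUSE_METADATA, "bi": BI_METADATA, "api": API_METADATA}


class MetadataGraph:
    """Stores only edges between asset ids; the metadata itself is never copied."""

    def __init__(self):
        self._downstream = defaultdict(set)

    def connect(self, upstream, downstream):
        """Record that `downstream` depends on `upstream`."""
        self._downstream[upstream].add(downstream)

    def downstream_of(self, asset):
        """Breadth-first walk over everything that depends on `asset`."""
        seen, queue, order = {asset}, deque([asset]), []
        while queue:
            current = queue.popleft()
            for nxt in sorted(self._downstream[current]):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


def lookup(asset):
    """Resolve metadata at the owning system, using the id prefix as the router."""
    system = asset.split(".", 1)[0]
    return SOURCES[system][asset]


graph = MetadataGraph()
graph.connect("api.orders_endpoint", "warehouse.orders")
graph.connect("warehouse.orders", "bi.revenue_dashboard")

dependents = graph.downstream_of("api.orders_endpoint")
print(dependents)
print(lookup("api.orders_endpoint")["field_docs"]["st"])
```

The only shared piece in this sketch is the edge list. Answering “what depends on this API attribute?” is a graph walk, and answering “what does this attribute mean?” is a lookup at the owning system.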

It’s interesting that we struggle so much with concepts that ultimately come down to a simple question: “Eh, what exactly is this weird attribute you’re providing me with over your API?”

I feel like we’re still not at the point of tackling the elephant in the room.

humansofdata.atlan.com

Setting up a data stack in 5 mins.

What: Tuan Nguyen wrote this short piece about setting up a “modern data stack” within 5 minutes using Terraform on GCP. I’m not so much interested in the hows of setting this up on GCP, but rather in the meta point about setting up a complete data stack within 5 minutes.

🐰 My perspective: We are currently at a weird point in time where startups have a huge advantage over incumbent companies, because incumbents usually have a locked-in data stack, whereas startups can launch one within minutes. Even more interesting, they can launch a better data stack in minutes! And still, I consider this a key challenge of the data space.

I feel it’s not just about launching a data stack, but about modularization and being able to exchange parts. There still seems to be a lot of coupling inside a “modern data stack” which really shouldn’t be there.
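As a rough illustration of what “being able to exchange parts” could look like, here is a minimal Python sketch. Everything in it is hypothetical (the `Extractor` and `Loader` interfaces, the toy classes, and `run_pipeline`), and it does not describe how any real tool is built. The pipeline only knows two small interfaces, so any part that satisfies them can be swapped in.

```python
from typing import Iterable, Protocol

Record = dict


class Extractor(Protocol):
    def extract(self) -> Iterable[Record]: ...


class Loader(Protocol):
    def load(self, records: Iterable[Record]) -> int: ...


class InMemoryExtractor:
    """Hypothetical source that yields records from a list."""

    def __init__(self, rows):
        self._rows = list(rows)

    def extract(self):
        return iter(self._rows)


class ListLoader:
    """Hypothetical destination that keeps every record in a list."""

    def __init__(self):
        self.stored = []

    def load(self, records):
        before = len(self.stored)
        self.stored.extend(records)
        return len(self.stored) - before


class CountingLoader:
    """Alternative destination that only counts records."""

    def __init__(self):
        self.count = 0

    def load(self, records):
        loaded = sum(1 for _ in records)
        self.count += loaded
        return loaded


def run_pipeline(extractor: Extractor, loader: Loader) -> int:
    """Depends only on the two interfaces, never on a concrete tool."""
    return loader.load(extractor.extract())


rows = [{"id": 1}, {"id": 2}, {"id": 3}]
list_loader = ListLoader()
counting_loader = CountingLoader()

# Same pipeline, two different loaders: exchanging a part needs no other change.
loaded_a = run_pipeline(InMemoryExtractor(rows), list_loader)
loaded_b = run_pipeline(InMemoryExtractor(rows), counting_loader)
print(loaded_a, loaded_b)
```

The coupling I complain about is what you get when `run_pipeline` starts to reach into the internals of one specific extractor or loader.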

So yes, there are still a lot of business opportunities right there! Getting up quickly, wrapping stuff & modularizing things: all of these seem to be nowhere near where they should be.

I only know of a few companies even going in that direction. GoodData and their “headless BI” concept come to mind, as does what Tabular plans to do (in like 5 years).

towardsdatascience.com

🎄 Thanks => Feedback!

Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.

Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.

If you want to support this, please share it on Twitter, LinkedIn, or Facebook.

And of course, leave feedback if you have a strong opinion about the newsletter! So?

It is terrible | It’s pretty bad | average newsletter… | good content… | I love it!

P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!

