🐰 #25 TechStyle’s modern data platform, ETL needs OS, is Airflow good enough?; ThDPTh #25 🐰
How TechStyle created their modern data platform, why ETL needs open source, and whether Airflow is good enough as a data orchestrator.
Notice how completely normal the data mesh concept appears?
Data will power every piece of our existence in the near future. I collect “Data Points” to help understand & shape this future.
If you want to support this, please share it on Twitter, LinkedIn, or Facebook.
(1) TechStyle’s modern data platform
The data world is in turmoil, so I love every piece of experience I can get my hands on. I really enjoyed this article by Prukalpa Sankar, featuring a modern data stack with Snowflake, Atlan, and Tableau.
I’ll just share two quotes and recommend that you read the whole article. It’s really well written.
“‘Things are moving so fast now…’ […] Instead, TechStyle opted for an ELT style of data engineering, where they load the data as-is from the source. Once the raw data is loaded, TechStyle uses a hybrid approach to model whatever needs to be modeled and happily leave the rest untouched.”
“‘We’re onboarding analysts, but they’re not as effective because they don’t understand the data.’”
So after modernizing the data warehouse, they noticed that they needed more: education, data cataloging, and so on. It’s a great example of today’s journey for data organizations.
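To make the ELT idea from the first quote a bit more tangible, here is a minimal sketch. It is not TechStyle’s actual setup: an in-memory SQLite database stands in for the warehouse, and the table names, the sample rows, and the helper functions `load_raw` and `model_revenue` are all made up for illustration. The point is only the order of operations: load everything as-is first, then model just the part you need.

```python
import sqlite3

# Hypothetical source rows; every value is kept as text, exactly as delivered.
RAW_ORDERS = [
    ("1", "alice", "20.00", "paid"),
    ("2", "bob", "5.00", "refunded"),
    ("3", "alice", "42.50", "paid"),
]


def load_raw(conn, rows):
    """Extract + Load: copy the source rows into the warehouse untouched."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def model_revenue(conn):
    """Transform: model only what is needed, inside the warehouse, after loading."""
    conn.execute(
        """
        CREATE TABLE revenue_by_customer AS
        SELECT customer, SUM(CAST(amount AS REAL)) AS revenue
        FROM raw_orders
        WHERE status = 'paid'
        GROUP BY customer
        """
    )
    return conn.execute(
        "SELECT customer, revenue FROM revenue_by_customer ORDER BY customer"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load_raw(conn, RAW_ORDERS)
result = model_revenue(conn)
print(result)
```

The raw table stays around unchanged, so anything that was not modeled today (the refunded order, for instance) can still be modeled later without going back to the source.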
How TechStyle Used Agile Sprints to Roll Out a Modern Data Platform | by Prukalpa | Jun, 2021 | Towards Data Science
TechStyle’s approach to data warehousing and data analytics, metadata creation, democratizing tribal knowledge, championing data management, and more
towardsdatascience.com
(2) Why ETL needs Open Source
I’ve been saying again and again that I think the data space will be dominated by open-source solutions because of the “snowflake problem”: every data setup inside a company is completely unique.
So it’s great to finally get an article on that topic from the guys behind Airbyte that backs this up with their experience. They focus very much on the ETL use case, whereas I think the conclusion applies to the whole data space. But I really like how they put their experience and 200 company interviews into this form and show the exact road ETL has to go down in the future.
I also like the way they think about their CDK (Connector Development Kit), because it is an essential part of the incentive structure for their open-source project. It’s great to see that, even though they are at the very beginning, they have a good vision of where they need to go.
I do think, though, that in the future they’ll need to spend more time on the high-level structure of the data space & their open-source side (because, after all, I think ETL is a system that is set up to be obsolete in 5–10 years). But I’m sure they will get there.
Why ETL Needs Open Source to Address the Long Tail of Integrations - DATAVERSITY
In our interviews, we found that many users’ ETL solutions didn’t support the connector they wanted, or supported it but not in the way they needed.
(3) Is Airflow Good Enough?
Anna Geller wrote a good piece on Airflow and data orchestrators in general. Here’s a little summary of her points:
The strength of Airflow lies undoubtedly in the community, the support & extensibility that comes with that. However, as Anna writes, Airflow also brings a bunch of weaknesses.
There’s no native versioning of flows, it’s very unintuitive for new users, it suffers from configuration overload, and it’s hard to use locally. All of these basically make it hard to develop fast. That’s also where some of the new tools shine. Prefect focuses on taking lots of this out of your hands. Dagster has a great testing concept and is much easier to handle when it comes to developing new flows.
To my knowledge, the problems of setting up Airflow in production are mostly mirrored in both Prefect and Dagster, so I’m not sure one can consider that a weakness of Airflow rather than of the whole category of tools.
However, there are managed solutions available that take away quite a bit of the hassle. If you’re looking for a data orchestrator, take a look at Anna’s article.
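To illustrate why local development and testing matter so much here, a small sketch in plain Python. It deliberately uses no Airflow, Prefect, or Dagster API: the task functions, the `FLOW` dictionary, and `run_flow` are all hypothetical, and the “orchestrator” is just a topological sort from the standard library (Python 3.9+). The idea is that if task logic lives in ordinary functions, you can run and test a flow on your laptop before any scheduler is involved.

```python
from graphlib import TopologicalSorter


# Task logic as ordinary functions, so each one can be tested in isolation.
def extract():
    return [3, 1, 2]


def transform(values):
    return sorted(v * 10 for v in values)


def load(values):
    return {"rows_written": len(values)}


# Hypothetical flow definition: task name -> (callable, upstream task names).
FLOW = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_flow(flow):
    """Run every task after its upstream tasks, passing their results along."""
    graph = {name: set(deps) for name, (_, deps) in flow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = flow[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


results = run_flow(FLOW)
print(results["load"])
```

A real orchestrator adds scheduling, retries, and monitoring on top of this, which is exactly the part that is hard to run locally. How easily a tool lets you keep the inner functions testable like this is a reasonable thing to check when comparing them.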
Is Apache Airflow good enough for current data engineering needs? | by Anna Geller | Towards Data Science
The pros and cons of Apache Airflow as a workflow management platform for ETL & Data Science and deriving from that the use cases for which Airflow may be a good or a bad choice
towardsdatascience.com
🎄 Slides for Talk on Data Mesh & Thanks
Thanks for reading this far! I’d also love it if you shared this newsletter with people whom you think might be interested in it.
I finally got around to holding a talk on data meshes, focused on being as concise as possible while still giving a bit of my bigger view on things. You can find the slides here:
Mars missions & Data meshes - a crash course to data meshes
Data meshes are the latest data architecture trend. Really a paradigm shift. But what actually happens is just the natural evolution of technological decentral…
P.S.: I share things that matter, not the most recent ones. I share books, research papers, and tools. I try to provide a simple way of understanding all these things. I tend to be opinionated. You can always hit the unsubscribe button!
Data; Business Intelligence; Machine Learning, Artificial Intelligence; Everything about what powers our future.