
A lot is happening geopolitically at the moment. No surprise there. Because of this, the EU now sees the need to reduce its dependence on US products, companies, and services. This also holds for the way we host and organize everything around data. The culprit in these discussions is the US CLOUD Act, which enables the US government to compel any US cloud provider to disclose data that it keeps for us. Those cloud providers will say that they do not want to comply with such requests, but legally, as US companies, they will have to concede at some point.
And then there is the long-known fact that each of the hyperscalers has an incentive to lock you into its ecosystem: migrating to another cloud provider becomes harder and harder over time as you invest more and more in its specific platforms and tooling. If your US cloud provider decides to raise prices, what leverage do you have? If you have designed your cloud infrastructure to be portable, you can negotiate a lower price or switch to a cheaper cloud provider.
As stated by Gaia-X, the EU non-profit association in charge of creating a secure, federated data infrastructure in the EU, digital sovereignty can only be achieved if data and infrastructure remain under EU jurisdiction. From here, you have two options: hosting your platform on-premise, or bringing it under an EU cloud provider. The most secure and controlled environment for your data is a system that you host entirely yourselves. However, in practice, it is nowadays hard and probably unwise to try to host the bare metal, i.e. the servers, yourselves. We are past that point, I think. Using EU-based cloud providers enables you to scale at will and does away with any hardware configuration you would otherwise need to do yourself.
The ship for on-premise solutions has sailed, unless you really need to keep everything locked behind closed doors, as is the case for highly confidential information. Then there basically is no alternative.
So let's go back to the situation that holds for most of us: we prefer to use hardware, or rather services, at will, and we would like to scale up and down freely with immediate impact on the financial side of things, lowering costs as we scale down or as we make our architectures more efficient.
Which cloud provider would you like to use? Do you like using the most advanced services currently available and want to experiment with them? Then, currently, you cannot do without the Big Three: Amazon, Microsoft and Google. As explained above, however, that also brings the US CLOUD Act back into the picture. If the use of those cloud services poses no operational risk for you, then you are well served by those cloud providers. Their candy stores full of services are at your disposal.
However, if the operational risk is real, you had better think twice. Not only is your data at stake because of the US CLOUD Act, but in extreme cases those cloud providers might even be forced to pull the plug on the environments you have deployed, following a request by the US government.
As part of your exit strategy, you may have set up procedures to store data outside of those US cloud providers, so that you can always fall back on data that is safely set aside. But that does not mean your solution will be up and running in production on an alternative cloud any time soon. You can cater for that too, by running multi-cloud solutions. The question then is: what are the costs involved in maintaining multiple solutions on multiple cloud platforms, not only regarding the services you use, but also regarding the investment in the knowledge of the people who need to maintain those solutions? Different cloud providers use different ecosystems that are not compatible with each other.
Deployment with infrastructure as code (IaC), which creates the solution in the cloud ecosystem of your choice, differs per cloud provider. Think of CloudFormation, ARM templates, Bicep, and Terraform. Terraform, you say? Is it not the lingua franca of cloud deployments? Well, the tool is generic, but your IaC is not: the resource definitions differ per cloud provider. Unless you stick to the Esperanto of the cloud: OpenStack. This is an open source cloud platform with standardized APIs, so IaC written against it can be reused on every provider that exposes those APIs. Side note: the Big Three do not support this. Why would that be?
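To make the point about provider-specific definitions concrete, here is a minimal Python sketch. The resource type names are, to our knowledge, the ones the respective Terraform providers use for object storage; the mapping, the render_bucket helper and the "lake" name are hypothetical and only illustrate that one and the same need leads to a different definition per provider.

```python
# One generic need ("an object storage bucket") maps to a different
# Terraform resource type on every provider. The dictionary is an
# illustrative selection, not a complete or authoritative mapping.
OBJECT_STORAGE_RESOURCE = {
    "aws": "aws_s3_bucket",
    "azure": "azurerm_storage_container",
    "gcp": "google_storage_bucket",
    "openstack": "openstack_objectstorage_container_v1",
}


def render_bucket(provider: str, name: str) -> str:
    """Return the skeleton of a Terraform resource block for a bucket.

    Hypothetical helper: the arguments inside the block differ per
    provider as well, so they are left as a placeholder comment.
    """
    resource_type = OBJECT_STORAGE_RESOURCE[provider]
    return (
        f'resource "{resource_type}" "{name}" {{\n'
        "  # provider-specific arguments go here\n"
        "}"
    )


for provider in OBJECT_STORAGE_RESOURCE:
    print(render_bucket(provider, "lake"))
```

Only the OpenStack entry would stay the same when you move between providers that expose OpenStack interfaces; the other three have to be rewritten on every migration.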
So, to summarize:
Looking at the continent we live on, Europe, we can choose from quite a number of providers. The EU cloud providers have their own set of challenges, though. Their set of services is far more limited than that of the Big Three. There are two options here: deploy any services that you require yourselves on those cloud platforms (self-hosting), or redesign your solution to use the more generic services they offer. Be aware, though, that not all of them offer OpenStack interfaces.
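If we constrain ourselves to the context of data platforms, what do you really need? Private networking, data storage, compute or data processing resources, an analytical data lake or database, and analytical tooling.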
This will cover more than 90% of the use cases we have seen at clients for data platforms.
Three out of these five are no issue at all: all EU cloud providers of substantial size that we know of offer private networking, data storage, and compute or data processing resources.
Currently, analytical data lakes and databases, and analytical tooling, are the real differentiators. Only a select few offer what we would call an analytical database (e.g. T Cloud Public offers Data Warehouse Service, and Scaleway offers Data Warehouse for ClickHouse Service).
For analytical tooling the situation is even worse: none of the EU cloud providers we know of offers services for this functionality. Sure, some offer Jupyter notebook support or something similar, but we would not call that broadly usable BI tooling.
Are we at a loss here? No, because we have not yet talked about how we approached these matters in the past: by installing a (commercial) product on your platform, on your chosen EU cloud provider, to get access to those seldom-seen services.
Look at the Databricks suite, for example. Depending on your use case, Cloudera could be a viable alternative. Such a product includes all the functionality required to ingest, transform and store data, to analyze data, and to create dashboards to be used by mere mortals.
More engineering-oriented companies might take the DIY approach: stick with the base services of the cloud provider, and use different tools for the analytical use case on top of that.
And do not forget: analytical use cases that are limited to 30 million records of data are probably quite at home in a traditional setup with, for example, PostgreSQL, a database that is even available in serverless variants to lower your maintenance burden. You might need to make use of advanced functionality such as materialized views to squeeze the last bit of performance out of the system, but so be it.
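As a minimal sketch of the materialized view idea, assume a hypothetical orders table with order_date and amount columns. The two statements are standard PostgreSQL; the helper functions are our own and merely build the SQL text, which you would then execute with the database driver of your choice.

```python
def create_materialized_view_sql(view_name: str, select_sql: str) -> str:
    """Build a statement that stores the result of select_sql on disk.

    The names are assumed to be trusted constants; nothing is escaped here.
    """
    return f"CREATE MATERIALIZED VIEW {view_name} AS\n{select_sql};"


def refresh_materialized_view_sql(view_name: str) -> str:
    """Build a statement that recomputes the stored result, e.g. nightly."""
    return f"REFRESH MATERIALIZED VIEW {view_name};"


# Hypothetical aggregation that dashboards would otherwise recompute
# on every page load.
DAILY_REVENUE_QUERY = (
    "SELECT order_date, SUM(amount) AS revenue\n"
    "FROM orders\n"
    "GROUP BY order_date"
)

print(create_materialized_view_sql("daily_revenue", DAILY_REVENUE_QUERY))
print(refresh_materialized_view_sql("daily_revenue"))
```

Dashboards then read from daily_revenue instead of scanning orders, at the price of data that is only as fresh as the last refresh.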
Larger use cases might call for the data warehouse services we mentioned above, or they might need to rely on open source software such as Trino to query volumes of data that cannot practically be queried with good old PostgreSQL.
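The sizing rule of thumb from the last two paragraphs can be sketched as follows. The 30 million threshold is the figure mentioned above, not a hard limit; the function name and the returned descriptions are our own.

```python
# Rule-of-thumb threshold taken from the text; real limits depend on
# row width, query patterns and hardware.
POSTGRES_RECORD_LIMIT = 30_000_000


def suggest_analytical_store(record_count: int) -> str:
    """Suggest a storage option for an analytical use case by record count."""
    if record_count < 0:
        raise ValueError("record_count must not be negative")
    if record_count <= POSTGRES_RECORD_LIMIT:
        return "PostgreSQL, possibly with materialized views"
    return "managed data warehouse service, or a query engine such as Trino"


print(suggest_analytical_store(5_000_000))
print(suggest_analytical_store(2_000_000_000))
```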
For analytics use cases, OSS tooling such as Marimo or Jupyter notebooks might be all you need. Most likely, you need to self-host these too in your cloud provider of choice.
Did we forget the data processing part of the solution? Well, we have not addressed it so far, but we consider it a no-brainer these days. There are plenty of tools around that will help you create efficient and powerful data workflows: dbt, Sling, dlt, Apache Flink, Apache Hop, Dagster, Airflow, etc.
And all of these tools can be wrapped easily in Docker containers making them executable on any cloud platform, and therefore portable (there’s your exit strategy!).
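What makes such a container portable in practice is that nothing provider-specific is baked into the image. A minimal sketch of a job entry point, assuming the platform injects its settings as environment variables; the variable names SOURCE_URI, TARGET_URI and BATCH_SIZE and the describe_run function are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class JobConfig:
    source_uri: str
    target_uri: str
    batch_size: int


def load_config(env: Optional[Mapping[str, str]] = None) -> JobConfig:
    """Read all provider-specific settings from the environment.

    Passing a plain dict instead of os.environ keeps the function testable.
    """
    env = os.environ if env is None else env
    return JobConfig(
        source_uri=env["SOURCE_URI"],
        target_uri=env["TARGET_URI"],
        batch_size=int(env.get("BATCH_SIZE", "1000")),
    )


def describe_run(config: JobConfig) -> str:
    """Placeholder for the actual workflow; it only reports what it would do."""
    return (
        f"copy {config.source_uri} -> {config.target_uri} "
        f"in batches of {config.batch_size}"
    )


# The same image runs on any cloud; only the injected values change.
example_env = {"SOURCE_URI": "s3://raw/orders", "TARGET_URI": "s3://curated/orders"}
print(describe_run(load_config(example_env)))
```

Inside the container you would call load_config() without arguments, so that the values come from whatever the Kubernetes manifest or serverless container service sets.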
For containers you will need a platform to run them. The de facto cloud OS that provides that platform is Kubernetes (k8s). How you run it is something you still need to decide on: do you run k8s clusters yourself? Do you use the managed k8s services that some(!) of the cloud providers offer? Or do you opt for serverless container services, so that you can run containers without the maintenance burden of administering a Kubernetes cluster? Again, that is not something a lot of cloud providers offer.
Let us be clear: there is a lot of open source tooling out there for data platforms running on your cloud of choice, but you should not switch over without careful consideration: different tools from different developers will likely need some work to integrate with each other. And what if something breaks? A reliable external support organization that you can trust your business with is pretty much priceless.
Our advice is to look closely at open source tooling, especially tools backed by companies that offer support. Also, if those companies provide SaaS services built on the same tooling, you know they have a vested interest themselves.
In general, you have to take into account that open source software requires more in-depth knowledge of the tooling and results in higher overall maintenance costs for integrating components and troubleshooting.
So do not abandon commercial software providers just yet. They may very well offer outstanding products with an important benefit: support. And if they base the core of their products on open source software, that is very much preferable to closed source software. Otherwise you have escaped the vendor lock-in of cloud provider services only to step into a new pitfall. In theory, open source software as deployed by commercial suppliers offers you the best of both worlds: well-supported systems and easy portability to other environments.
A lot of text, but if we had to distill all of this into a flow chart, it would look something like this:
Solutions Architect and EU-Cloud topic lead at Xomnia
