Building the AI Factory of 2021 and Beyond: A Journey

Jun 15, 2021

Let’s start at the beginning. Every architecture and organizational decision should be founded on a business challenge, so here’s ours: we bridge the gap between industry and AI. That means working with a variety of customers across all verticals, diving into their data and implementing custom machine learning models for their challenges. Your specific needs might be different, but chances are:

  • You have or are starting to build up significant amounts of potentially valuable data
  • You need to enable a mid-sized to large organisation to deliver AI products on this data

All in Cloud

The first step we took was all about speed. We went all in on AWS SageMaker, implemented AWS Glue as a managed Spark solution and glued everything together with AWS CodePipeline. In addition, we used Metaflow by Netflix to describe machine learning workflows in code. This was at the beginning of 2020, and the issues we encountered then (namely the lack of a pipeline solution from AWS and of easily extendable custom models on top of SageMaker) have largely been addressed. In fact, I would go as far as to say that in 2021, if you are already committed to one of the big three cloud providers, going all in on their respective managed cloud AI service might just be the right thing for you. The feature speed of the big cloud providers is impressive and can pay off quickly.

Going all in on cloud might be the way to go for you, since the investment the big cloud providers are making is significant and can pay off quickly.

But then reality hits. We started working with health providers. Government. Automotive. Deutsche Bahn. The truth is that many of these industries are far from going all in on cloud and might never do so. In the German market in particular, there are real concerns about data regulation and ownership. For this reason, many of our customers choose to run their workloads on premise or in hybrid cloud environments. Even if you are not worried about data, your concerns might be economic or strategic: how much of a commitment are you willing to make to a single provider’s ecosystem while maintaining some degree of independence? So, let’s add two more requirements to our list:

  • You want to maintain a degree of independence from the big cloud providers, due to data, financial or strategic concerns
  • You are running on premise or hybrid infrastructure for the foreseeable future

The Middle Ground: Open Source and Kubernetes

At this point, everyone who’s left should have more or less the same question on their minds: how can I benefit from the feature speed of the machine learning ecosystem, and sometimes even from cloud capabilities, without compromising on my requirements? Given the complexity of the workloads, it really isn’t feasible to start from zero. Luckily, there is a middle ground.

In late 2020 we started adopting Kubernetes and Kubeflow in a big way. Together they solve the major issues outlined above. Kubernetes, as the de facto standard for running containerized applications, allows us to move our workloads flexibly. As a common denominator between cloud, hybrid and on premise environments, it gives us flexible access to compute power for both CPU and GPU based workloads. So if we manage to make machine learning at home on Kubernetes, we are in a great position. Luckily, that is exactly the mission of Kubeflow. Kubeflow is a curated collection of open source services, combined into one integrated, cohesive experience on top of Kubernetes.

Kubeflow is supported and contributed to by both Microsoft and Google; in fact, you can run Kubeflow natively on Google Cloud. At this time, there is support for running Jupyter Notebooks, building Python based, reproducible pipelines for training, running hyperparameter searches and deploying your models as serverless functions, in pretty much any machine learning framework you may already be using today. The project is being developed very actively: just recently, support for the open source feature store darling Feast was announced. The typical issues with open source apply as well: sometimes things don’t integrate perfectly, and not every solution is polished and final. But for each of the larger building blocks (training, pipelines, deployment), there is freedom of choice.

Finally, Kubeflow has been designed from the ground up to run on all clouds as well as on premise, with specific manifests that make deployment on each of them similar and manageable.

Cloud Native, At Home Anywhere

Today we are successfully running multiple machine learning workloads in production across various environments. Kubeflow is the common ground, the shared contract that our data scientists can rely on. No matter where we go, our tooling, our pipelines, our deployment and our training will be the same. Yet, depending on where we are, we can leverage cloud capacity. S3 can stand in for in-cluster MinIO. Managed Spark jobs can run on Glue 2.0 instead of the in-cluster Spark operator. Kubernetes itself can run on the native managed Kubernetes services, for example AKS on Azure.
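To make the idea of swappable backends behind a shared contract more concrete, here is a minimal Python sketch. It is an illustration under assumptions, not our actual setup: the environment names, the `Backends` class, the `BACKENDS` table, the `resolve_backends` helper and the in-cluster MinIO address are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backends:
    """Concrete services behind the shared contract; pipeline code stays the same."""
    object_store: str
    spark: str
    kubernetes: str


# Hypothetical in-cluster MinIO address, used wherever we do not swap in S3.
IN_CLUSTER_MINIO = "http://minio.kubeflow.svc:9000"

# Hypothetical mapping from environment name to the services used there.
BACKENDS = {
    "on-premise": Backends(IN_CLUSTER_MINIO, "spark-operator", "self-managed"),
    "aws": Backends("s3", "glue-2.0", "eks"),
    "azure": Backends(IN_CLUSTER_MINIO, "spark-operator", "aks"),
}


def resolve_backends(environment: str) -> Backends:
    """Look up the services for an environment, failing loudly on unknown names."""
    try:
        return BACKENDS[environment]
    except KeyError:
        known = ", ".join(sorted(BACKENDS))
        raise ValueError(
            f"unknown environment {environment!r}, expected one of: {known}"
        ) from None
```

A pipeline would call something like `resolve_backends("aws")` once at startup and hand the result to its storage and Spark steps, so moving between environments becomes a configuration change rather than a code change.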

Of course, the journey is far from over. Just a few weeks ago we started adding support for Great Expectations, a wonderful open source tool that lets you write down your assumptions about data as tests. I suspect the next step will be the adoption of one of the major feature stores.
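For readers who have not seen assumptions written as tests against data, the following plain-Python sketch shows the general idea. It deliberately does not use the Great Expectations API; the helper names (`expect_not_null`, `expect_between`, `validate`) and the sample rows are made up for illustration.

```python
def expect_not_null(rows, column):
    """Return the indices of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def expect_between(rows, column, low, high):
    """Return the indices of rows whose non-null value lies outside [low, high]."""
    return [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]


def validate(rows):
    """Run every expectation and report the failing row indices for each."""
    return {
        "age is not null": expect_not_null(rows, "age"),
        "age between 0 and 120": expect_between(rows, "age", 0, 120),
    }


# Made-up sample data: one clean row, one missing value, one implausible value.
rows = [{"age": 34}, {"age": None}, {"age": 150}]
report = validate(rows)
print(report)
```

An empty list for an expectation means the data met that assumption. In a pipeline, a non-empty list would typically stop the run before training starts on bad data.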

Ferdinand von den Eichen
