AI and Machine Learning have traditionally caused some headaches for DevOps engineers. From testing to deployment, a machine learning system is often very different from traditional software. Fortunately, you are not the only one deploying AI in the cloud! The open-source community offers plenty of useful tools to make the process simpler.
One of the largest components of a CI/CD (continuous integration and continuous delivery) pipeline is automated testing of the codebase. The major difficulty with testing an AI system is that there is no way to programmatically express its specifications; otherwise we wouldn’t need AI to solve the problem!
This means you should rely more on small, specific unit tests than on large, blurry integration tests. For instance, you can test small components of an AI system, say a classifier. To assess whether your classifier works, we recommend testing it on small, randomly generated datasets where you perfectly control the assumptions. Make the test easy to pass, because you want it to fail only when something is wrong, never because of “bad luck”. Otherwise engineers may blame “bad luck” in situations where something is actually wrong, which would defeat the whole purpose of testing.
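Here is a minimal sketch of such a test, using pytest conventions with scikit-learn and NumPy; `build_classifier` is a hypothetical factory standing in for the component from your own codebase. The seed is fixed and the synthetic problem is trivially easy, so the test can only fail when something is genuinely broken.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def build_classifier():
    # Stand-in for the component you actually want to test.
    return LogisticRegression()


def test_classifier_separates_trivial_clusters():
    # Fixed seed: the test should fail only if something is wrong, never by "bad luck".
    rng = np.random.default_rng(seed=0)
    n = 200
    # Two well-separated Gaussian blobs: an easy problem any working classifier solves.
    X = np.vstack([rng.normal(-5.0, 1.0, size=(n, 2)),
                   rng.normal(+5.0, 1.0, size=(n, 2))])
    y = np.array([0] * n + [1] * n)

    clf = build_classifier()
    clf.fit(X, y)

    # Generous threshold: we only check that the component is not broken.
    assert clf.score(X, y) > 0.95
```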
We found that testing on real data is not appropriate for CI/CD. For one, it takes longer to load than random data, and a good test-driven environment requires fast pipelines. Also, compared to synthetic data, you are never quite sure what exactly you are testing when validating on real data.
Be careful when relying on test coverage: the percentage can be misleading for AI systems. Machine learning code tends to be very sequential, without the many branches and edge cases of traditional software. You might have 100% test coverage and still plenty of bugs in your code. The real edge cases of machine learning lie in the data instead. Does your classifier still work with highly unbalanced classes? Or with extremely un-normalized features? Test coverage tools will fail to account for these.
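A hedged sketch of such “data edge-case” tests, which coverage tools will never report since they exercise exactly the same code path as the happy case, only with deliberately unbalanced classes and wildly scaled features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score


def _blobs(rng, n_pos, n_neg, scale=1.0):
    # Two well-separated Gaussian blobs, optionally rescaled far from unit variance.
    X = np.vstack([rng.normal(-3.0, 1.0, size=(n_neg, 2)),
                   rng.normal(+3.0, 1.0, size=(n_pos, 2))]) * scale
    y = np.array([0] * n_neg + [1] * n_pos)
    return X, y


def test_handles_unbalanced_classes():
    rng = np.random.default_rng(seed=1)
    X, y = _blobs(rng, n_pos=10, n_neg=990)  # only 1% positives
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    assert balanced_accuracy_score(y, clf.predict(X)) > 0.9


def test_handles_unnormalized_features():
    rng = np.random.default_rng(seed=2)
    X, y = _blobs(rng, n_pos=100, n_neg=100, scale=1e6)  # features far from unit scale
    clf = LogisticRegression(max_iter=5000).fit(X, y)
    assert clf.score(X, y) > 0.9
```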
The best way to test your classifier is to use data generated with the exact same model, where all the weights and parameters of the true generator are kept hidden. You can then apply standard cross-validation techniques to ensure that the validation error is low enough. Sometimes you can compare the error to a predefined threshold. When finding a good threshold is too cumbersome, you can simply compare the error of your classifier against a simpler baseline model. If your classifier performs worse than the baseline, something is definitely wrong.
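A minimal sketch of the “known generator” approach, assuming a logistic data-generating model: the synthetic labels come from hidden “true” weights, and the classifier only has to beat a trivial baseline under cross-validation.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def test_beats_baseline_on_synthetic_logistic_data():
    rng = np.random.default_rng(seed=3)
    n, d = 1000, 5
    true_w = rng.normal(size=d)          # hidden parameters of the true generator
    X = rng.normal(size=(n, d))
    p = 1.0 / (1.0 + np.exp(-X @ true_w))
    y = (rng.uniform(size=n) < p).astype(int)

    model_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    baseline_score = cross_val_score(
        DummyClassifier(strategy="most_frequent"), X, y, cv=5
    ).mean()

    # If the classifier cannot beat a constant prediction, something is definitely wrong.
    assert model_score > baseline_score + 0.05
```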
Once you’ve defined your unit tests that way, your pipeline can look like traditional CI/CD. Building your Dockerfiles or the CI jobs may still be harder than usual. We recommend looking at how large open-source projects such as TensorFlow CI Build do it.
The second major difference from traditional DevOps is that AI systems are not stateless. Stateless services are a key assumption behind solutions like Kubernetes for deploying, scaling and maintaining highly available applications. Modern DevOps practices involve a complete separation between the logic (the code) and the data (typically kept in managed databases).
Yet in an AI system the decision logic involves both code and data, such as the weights and parameters of a classifier. Recommender systems push this to an extreme: a model has to learn and store vector embeddings for each user and each item, adding up to millions or even billions of floating-point blobs.
The best technology and physical device for storing and maintaining this data depends heavily on its size and on how frequently it is read and updated. There is no “one stop shop” technology for machine learning data management. Under several GBs and without frequent updates, you can simply store your state in static storage such as AWS S3 and load it into the memory of your services as soon as they start. Beyond this, you will need distributed big-data systems such as Redshift, BigQuery, Hadoop/HDFS, or in-memory processing with Spark.
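A hedged sketch of the “small state” option with boto3: the serialized weights live in S3 and are loaded into memory once, when the service starts. The bucket name, key and pickle format here are hypothetical, not a prescribed layout.

```python
import io
import pickle

import boto3


def load_model_state(bucket="my-ml-artifacts", key="classifier/v42/weights.pkl"):
    """Fetch the serialized model state from S3 and deserialize it in memory."""
    s3 = boto3.client("s3")
    buffer = io.BytesIO()
    s3.download_fileobj(bucket, key, buffer)
    buffer.seek(0)
    return pickle.load(buffer)


# Typically called once at service start-up, before the web server accepts
# requests, so every request handler reads the same in-memory state.
MODEL_STATE = load_model_state()
```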
For both small and big data, it is important to have clear versioning of your models: their architecture, but also their weights, parameters and the dataset they were trained on. Unlike traditional databases, which evolve slowly and incrementally, AI model data are often updated with radical changes. The code for the model has to perfectly match the data that is loaded. Deploying the parameters of your new five-layer model while your four-layer model is still in production would certainly lead to catastrophic failures.
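One possible way to keep code and weights in lock-step is sketched below: the expected architecture version is stored next to the weights, and the service refuses to load a mismatch. The `MODEL_VERSION` string and the metadata layout are assumptions for illustration, not a standard.

```python
import json
import pathlib

# Bumped by hand (or by CI) whenever the architecture changes, e.g. 4 -> 5 layers.
MODEL_VERSION = "classifier-5layers-v2"


def load_weights(artifact_dir):
    artifact_dir = pathlib.Path(artifact_dir)
    meta = json.loads((artifact_dir / "metadata.json").read_text())
    if meta["model_version"] != MODEL_VERSION:
        raise RuntimeError(
            f"Weights were trained for {meta['model_version']!r} "
            f"but this code expects {MODEL_VERSION!r}; refusing to load."
        )
    return (artifact_dir / "weights.bin").read_bytes()
```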
When dealing with complex systems of multiple AI services, managing your data pipelines by hand cannot be done without eventually making mistakes. At this point, you want to use an ETL orchestrator such as Airflow to formally specify your pipeline as code: all the tasks, their dependencies, the order in which they need to run, etc. Such an orchestrator does not solve the technical and physical issues of transferring data (you still need to decide where the data is stored and streamed), but it will help connect your data warehouse with your code and with Kubernetes or your CI/CD pipelines.
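A minimal Airflow sketch of what “pipeline as code” looks like: tasks and their ordering are declared explicitly. The extract/train/deploy callables are hypothetical placeholders for your own logic, and the schedule is just an example.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_features():
    ...  # read from the data warehouse


def train_model():
    ...  # fit and serialize the model


def deploy_model():
    ...  # push the artifact where the serving layer expects it


with DAG(
    dag_id="train_and_deploy_classifier",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    # Explicit dependencies: extraction runs before training, training before deployment.
    extract >> train >> deploy
```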