Bitsletter #9: The MLOps Journey, Cookie Management, and AI Research Highlights
🧠 ML Tip: Brush up your MLOps skill
Even the most performant model is meaningless if it stays in a notebook ... 💡 Machine learning engineers and data scientists should spend more time on DevOps, especially MLOps (DevOps applied to machine learning). Building a machine learning model is only half the battle. To succeed, the model must be deployed and run efficiently and reliably.
Once you consider all the steps necessary to put machine learning into production, you realize that modeling is just one of them:
🟢 Continuity Machine learning models need continuous processes to remain healthy 👉🏽 Continuous integration: extend test and validation to data and models. 👉🏽 Continuous delivery: automatic deployment of models after training. 👉🏽 Continuous training: automatic retraining of models to maintain performance. 👉🏽 Continuous Monitoring: monitor key business metrics of the deployed model.
🟢 Versioning Reproducibility is a critical component of any machine learning project. To achieve reproducibility, versioning is mandatory, but it goes beyond code: scripts, models, and datasets should all be tracked in a versioning system.
🟢 Experiments tracking Machine learning is inherently a trial-error discipline. This is one of the most crucial factors regarding the final performance of your solution. The more you try, the more likely you will find the best solution. But experiment tracking is not straightforward, it's related to versioning, and you need a robust architecture to take the most out of it.
🟢 Testing There are no reliable systems built without a solid testing process. Machine learning introduces complex dependencies between code and data: you need to test both to guarantee the health of your solution. Data quality checks, unit tests, integration tests, data drift checks, ...
MLOps puts in place processes to create, deploy and monitor models in production. It gives life to our machine learning process and should be highly considered. 📚 MLOps education is essential to be a complete machine learning engineer.
🌐 Web Tip: Reverse proxy to share cookies among services
Cookies are an effective way to manage authentication and authorization. You can quickly implement persisting sessions by storing the id of a logged-in user inside a cookie. Then the browser takes care to send it over every request, minimizing the work client-side.
Moreover, they can be fairly secure if you take setup CORS properly and avoid a few pitfalls: it includes using HTTP-only cookies, secure and signed cookies:
👉🏽 HTTP-only**prevent client-side scripts to access the cookie
👉🏽 Secure**send the cookie only over HTTPS
👉🏽 Signed**sign the cookie so you can verify that it's not altered by a tier
Many web apps have a front-end, and an API served by 2 different services. It would be convenient to share the session cookie between services to unify the authentication and authorization flow. Fortunately, a typical pattern is to have a reverse proxy (Nginx, Caddy, ...) to handle asset compression, HTTPS, rate-liming, ...
If you have both your web app and API behind the reverse proxy (and don't want to use subdomains), you can serve them over different subpaths of your domain:
By doing so, you will share the cookie between your different services with a straightforward setup. You could even have both API and the web app hosted on the same machine.
👩🔬 Research Paper: Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt
Data is the fuel of machine learning models and can be hard to gather. Our models can be complex and require a lot of computations, which can be costly in terms of money and environmental impact. Making the best of the data available the fastest possible is crucial for artificial intelligence.
Data selection is a technique that selects the data points on which to train at different training steps. Intuitively, as your model learn, some data points remain harder to learn than others, so it would make sense to spend more time on these tricky samples. Data selection won't select the training samples uniformly for a given training time budget.
This paper introduces a special selection loss: Reducible Holdout Loss Selection (RHO-LOSS). It tries to select points that most reduce the model's generalization loss. Previous online batch selection methods, such as loss or gradient norm selection, try to select points that would minimize the training set loss after training on them. Instead, the technique in this paper tries to select points that minimize the loss on a holdout set.
However, training on all the candidate data points and evaluating the holdout loss each time would be costly. So the authors introduced approximations to find the data points that would better reduce the holdout loss without training on them. They do so by training an auxiliary model (smaller than the main one) exclusively on the holdout set: it saves computation time and allows us to get a good approximation.
The main algorithm is as follows:
- Sample a big batch of data B1
- Use the auxiliary model to sample a smaller batch B2 from B1 with the most promising data points.
- Train the primary model on B2
RHO-LOSS demonstrated compelling benefits:
- Trains in far fewer steps
- Improves accuracy
- Speeds up training on a wide range of datasets, hyperparameters, and architectures
We've all been there ... You start writing a simple Bash script, and it grows out of control ... You try to modularize bash code with functions, but it's a pain ... Then you move to other languages to manage your scripts.
👉🏽 Dead simple shell commands invocations
👉🏽 Use async/await to handle asynchronous commands
👉🏽 Wrap groups of commands in an async context
Meta “No Language Left Behind” open source AI now translates 200 languages
Meta released No Language Left Behind, a model capable of translating between 200 languages. It even includes low-resource languages such as Asturian, Luganda, Urdu, and more ...