Bitsletter #1: Hashing for ML Splits, Web App Performance, & DALL-E 2 Insights
🧠 ML Tip: Train/test split with a hash function
When you work on a machine learning solution, you often need to split the data into a train and test set.
The simplest ways to split the data is to use something like
train_test_split from sklearn or shuffling and computing the split index.
It’s convenient but has a couple of drawbacks:
- You need to save the seed for reproducibility
- If you gather more data and redo the split, you are not sure that previous all the previous test data will still be in the new test set
- You can have data leakage: let’s say you have a dataset where the data from a same day is highly correlated. You would like to avoid splitting in the middle of a day, since the test data will be too correlated to the train data for this specific day
I was reading some material about machine learning design patterns and discovered a nice trick to overcome these problems:
Use a hash function, on the proper column, to determine the split of each data sample.
A hash function has the nice property to map inputs to a uniform bytes distribution: if you hash each row based on a defined identifier column, convert the hash to an integer and take the modulo 10, you will have a uniform distribution over all the digits:
h = hash(row[identifier]) % 10 will output 0,1,…,9 uniformly and deterministically without a seed (for a well distributed identifier)
is_test = h >= 8 will be true in 20% of the cases.
It solves the issues discussed before:
- You don’t need a seed
- If you gather more data and redo the split, the new test set will contain all the examples from the previous test set
- Using the same example as before, If you use the day column as the identifier, you are sure that every data from the same day will fall either in the train or test split entirely
🌐 Web Tip: Bring your web app close to the users
Edge computing for the web is rising. 🚀
The web is an incredible tool to reach massive amount of people worldwide. However, performance is key to the user experience (and thus for user retention):
0.1 seconds loading time can result in +8% conversions according to Google data.
Furthermore, lots of internet user come from countries where 4G/5G is not widely available: for them it’s even more important to deploy fast-loading web applications. If you deploy your application on servers at a single location, it’s simple but you get the worst performances (some users in the world will be far from the servers). The next step is to use multi region deployment, all cloud have multiple main regions like Europe, Asia, America, .... It’s better, but still coarse-grained, they usually have a few data centers per region.
🔥 If you use edge deployment with tools like CloudFlare Workers, you can deploy as close as possible to the users: CloudFlare global network is present in 270 cities in 100+ countries, resulting in 50 ms from 95% of the world’s internet connected population.
👩🔬 Research Paper: Diffusion Models Beat GANs on Image Synthesis
You likely heard about DALL-E 2, the new OpenAI model which generates stunning images from textual descriptions. At first, we can think that it's based on Generative Adversarial Network, like many other impressive image synthesis models. However, they use another algorithm called Diffusion Probabilistic Model.
The main idea is to incrementally add noise to images, until you get an image with random pixels. Then you train a model to denoise the image step by step.
Given a step, and a noisy image, the model denoises only the image a tiny bit to reach the next step, and so on, until you reach the original image. Doing so, your model learns to transform the prior noise distribution to the original image distribution.
Finally, to generate an image, you draw a random image based on your noise prior, and you apply your denoising model many times until you recover an image from the target distribution. In the paper Diffusion Models Beat GANs on Image Synthesis, researchers compare these diffusion models to GAN and show they can achieve better image synthesis, with a more stable training process.
Definitely worth reading if you're interested in generative models.
🛠️ Tool: PyScript, Python interpreted in the browser
AWS Release A Carbon Footprint Tool
AWS released a tool to help you track and forecast the carbon footprint of your cloud utilization. Cloud computing is a great technology but it's easy to forget that it uses “real” resources, and thus has a carbon impact. It's a step in the right direction for businesses willing to reach carbon neutrality.
PyTorch v1.11 Is Out
Last month, PyTorch released a new version: v1.11. Here are 3 main features to watch:
- TorchData beta: a library to build better data loading pipelines.
- Funtorch beta: a JAX-like library for PyTorch: compute the gradient of Python functions.
- Improved distributed training.
Check it out.