Bitsletter #1: Hashing for ML Splits, Web App Performance, & DALL-E 2 Insights
ML Tip: Train/test split with a hash function
When you work on a machine learning solution, you often need to split the data into a train and a test set. The simplest way to split the data is to use something like train_test_split from sklearn, or to shuffle the data and compute the split index.
It's convenient but has a few drawbacks:
- You need to save the seed for reproducibility.
- If you gather more data and redo the split, you cannot be sure that all the previous test data will still be in the new test set.
- You can have data leakage: let's say you have a dataset where the data from the same day is highly correlated. You would like to avoid splitting in the middle of a day, since the test data would be too correlated with the train data for that day.
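To make the second drawback concrete, here is a minimal sketch of a seeded shuffle split applied before and after new data arrives. The function name shuffle_split, the integer row ids, and the dataset sizes are all hypothetical, chosen only for illustration.

```python
import random

def shuffle_split(ids, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then cut at the split index."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return set(shuffled[:cut]), set(shuffled[cut:])

old_ids = list(range(100))
new_ids = list(range(150))  # the same 100 rows plus 50 newly gathered ones

_, old_test = shuffle_split(old_ids)
new_train, new_test = shuffle_split(new_ids)

# Rows that were held out before but would now be trained on.
moved = old_test & new_train
print(f"{len(moved)} of {len(old_test)} old test rows are now in the training set")
```

Even with the same seed, the permutation depends on the number of rows, so rows that used to be held out can end up in the training set.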
I was reading some material about machine learning design patterns and discovered a nice trick to overcome these problems:
Use a hash function, on the proper column, to determine the split of each data sample.
A good hash function maps its inputs to a roughly uniform distribution of output bytes: if you hash each row based on a chosen identifier column, convert the hash to an integer, and take it modulo 10, you get a roughly uniform distribution over the ten digits:
h = hash(row[identifier]) % 10 will output 0, 1, …, 9 uniformly and deterministically without a seed (for a well-distributed identifier, and provided the hash function itself is stable across runs).
is_test = h >= 8 will be true in about 20% of the cases.
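The two lines above can be sketched in Python as follows. This is only one possible implementation, under a few assumptions: it uses hashlib.sha256 rather than the built-in hash(), because Python salts hash() for strings per process, so its output changes between runs. The helper names bucket and is_test, the "day" column, and the toy rows are hypothetical.

```python
import hashlib

def bucket(identifier, num_buckets=10):
    """Map an identifier to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(str(identifier).encode("utf-8")).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets

def is_test(identifier, num_buckets=10, test_buckets=2):
    """Send the top `test_buckets` buckets to the test set (about 20% here)."""
    return bucket(identifier, num_buckets) >= num_buckets - test_buckets

# Toy data: three rows per day, split on the "day" column.
may_rows = [{"day": f"2022-05-{d:02d}", "value": v}
            for d in range(1, 32) for v in range(3)]
june_rows = [{"day": f"2022-06-{d:02d}", "value": v}
             for d in range(1, 31) for v in range(3)]

old_test = [r for r in may_rows if is_test(r["day"])]

# Later, more data arrives and the split is recomputed from scratch.
all_rows = may_rows + june_rows
new_train = [r for r in all_rows if not is_test(r["day"])]
new_test = [r for r in all_rows if is_test(r["day"])]

print(f"train rows: {len(new_train)}, test rows: {len(new_test)}")
```

Because the decision depends only on the identifier, a day never straddles both splits, and rows that were in the test set stay there when the data grows. With only a few dozen distinct identifiers, the test fraction is only approximately 20%.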
It solves the issues discussed before:
- You don't need a seed.
- If you gather more data and redo the split, the new test set will contain all the examples from the previous test set.
- Using the same example as before, if you use the day column as the identifier, you are sure that all the data from the same day will fall entirely in either the train or the test split.
Web Tip: Bring your web app close to the users
Edge computing for the web is rising.
The web is an incredible tool to reach a massive number of people worldwide. However, performance is key to the user experience (and thus to user retention):
A 0.1-second improvement in loading time can result in +8% conversions, according to Google data.
Furthermore, lots of internet users come from countries where 4G/5G is not widely available: for them it's even more important to deploy fast-loading web applications. Deploying your application on servers at a single location is simple, but it gives the worst performance (some users in the world will be far from the servers). The next step is multi-region deployment: the major cloud providers have multiple main regions such as Europe, Asia, and America. It's better, but still coarse-grained, since they usually have only a few data centers per region.
If you use edge deployment with tools like Cloudflare Workers, you can deploy as close as possible to the users: Cloudflare's global network is present in 270 cities in 100+ countries, which puts it within 50 ms of 95% of the world's internet-connected population.
Research Paper: Diffusion Models Beat GANs on Image Synthesis
You have likely heard about DALL-E 2, the new OpenAI model which generates stunning images from textual descriptions. At first, you might think it's based on a Generative Adversarial Network (GAN), like many other impressive image synthesis models. However, it uses another kind of algorithm called a diffusion probabilistic model.
The main idea is to incrementally add noise to images until you get an image made of random pixels. Then you train a model to denoise the image step by step.
Given a step and a noisy image, the model denoises the image only a tiny bit to reach the next step, and so on, until you reach the original image. Doing so, your model learns to transform the prior noise distribution into the original image distribution.
Finally, to generate an image, you draw a random image from your noise prior and apply your denoising model many times until you recover an image from the target distribution. In the paper Diffusion Models Beat GANs on Image Synthesis, researchers compare these diffusion models to GANs and show they can achieve better image synthesis, with a more stable training process.
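As a rough illustration of the noising half of this idea, here is a toy sketch of the forward process on a four-pixel "image". It assumes a linear noise schedule with commonly used values (1e-4 to 0.02 over 1000 steps), which are not taken from this newsletter, and all function names are hypothetical. The denoising half needs a trained neural network and is not shown.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def add_noise_step(pixels, beta, rng):
    """One forward step: shrink the signal slightly and mix in Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    noise = math.sqrt(beta)
    return [keep * p + noise * rng.gauss(0.0, 1.0) for p in pixels]

def signal_scale(betas):
    """Factor by which the original pixel values are scaled after all steps."""
    return math.sqrt(math.prod(1.0 - b for b in betas))

rng = random.Random(0)
image = [1.0, -1.0, 0.5, -0.5]  # toy "image": four pixels scaled to [-1, 1]
betas = linear_betas(1000)

noisy = image
for beta in betas:
    noisy = add_noise_step(noisy, beta, rng)

print(f"signal left after {len(betas)} steps: {signal_scale(betas):.4f}")
print("noised pixels:", [round(p, 2) for p in noisy])
```

Each step removes only a little of the signal, but after enough steps almost nothing of the original image is left, which is why generation can start from pure noise.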
Definitely worth reading if you're interested in generative models.
Tool: PyScript, Python interpreted in the browser
At PyCon US 2022, Anaconda announced PyScript: Python interpreted in the browser. It builds on a port of CPython to WebAssembly to interpret Python code directly in browsers (without transpiling it to another language like JavaScript). It aims for good performance, since WebAssembly is designed to run at near-native speed in the browser. It's great news for Python developers, who now have web superpowers. Imagine processing data, building web applications, … in Python directly.
News
AWS Releases a Carbon Footprint Tool
AWS released a tool to help you track and forecast the carbon footprint of your cloud usage. Cloud computing is a great technology, but it's easy to forget that it uses "real" resources, and thus has a carbon impact. It's a step in the right direction for businesses aiming to reach carbon neutrality.
PyTorch v1.11 Is Out
Last month, PyTorch released a new version: v1.11. Here are 3 main features to watch:
- TorchData beta: a library to build better data loading pipelines.
- functorch beta: a JAX-like library for PyTorch that provides composable function transforms, such as computing the gradient of Python functions.
- Improved distributed training.
Check it out.