About Me:

My name is Raphael Pisoni. I am a Machine Learning Researcher but my path was not a straight line.

Scaled RBF Attention: Trading Dot Products for Euclidean Distance

Raphael Pisoni

2026-03-29 09:30

RBF Attention Loss Plot

If you crack open the architecture of almost any modern Transformer, you will find Scaled Dot-Product Attention (SDPA) sitting at its core. We rarely second-guess it. It is heavily optimized by hardware accelerators, it scales beautifully, and empirically, it runs the world. But if you look closely at the underlying math, treating a dot product as a proxy for "similarity" carries some subtle structural baggage: it is highly sensitive to vector magnitude.

In this post, we'll explore an experimental alternative: Scaled Radial Basis Function (RBF) Attention. By swapping dot products for Euclidean distance, we naturally penalize "loud" keys and aim to stabilize training. I'll walk through the algebraic trick that makes this viable on existing hardware, share a custom Triton kernel for memory efficiency, explain why we need to introduce "Register Tokens" to make it work, and review the empirical results of training a small causal language model from scratch.
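The teaser does not spell out the algebraic trick, so the following is only a minimal sketch under a stated assumption: that it is the standard expansion ‖q − k‖² = ‖q‖² + ‖k‖² − 2 q·k, which expresses squared Euclidean distance through the same dot product that SDPA already computes. The function names and the `1/sqrt(d)` scale are hypothetical choices for illustration, not taken from the post.

```python
import math


def _softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_product_weights(query, keys):
    """Scaled dot-product attention weights for one query."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, k)) for k in keys]
    return _softmax(scores)


def rbf_weights(query, keys):
    """Attention weights from negative squared Euclidean distance.

    Assumption: scores are scaled by 1/sqrt(d), mirroring SDPA; the
    post may use a different factor.
    """
    scale = 1.0 / math.sqrt(len(query))
    q_sq = sum(x * x for x in query)
    scores = []
    for k in keys:
        k_sq = sum(x * x for x in k)
        qk = sum(a * b for a, b in zip(query, k))
        # ||q - k||^2 = ||q||^2 + ||k||^2 - 2 q.k
        sq_dist = q_sq + k_sq - 2.0 * qk
        scores.append(-scale * sq_dist)
    return _softmax(scores)


query = [1.0, 0.0]
keys = [[1.0, 0.0], [10.0, 0.0]]  # second key is "loud": same direction, 10x norm
dot_w = dot_product_weights(query, keys)
rbf_w = rbf_weights(query, keys)
```

In this toy case the dot product puts almost all weight on the loud key, while the distance-based score prefers the key that actually coincides with the query.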

Geometric Alignment via Teacher-Free Self-Distillation

Raphael Pisoni

2026-01-21 17:53

The "Infinite Gap" and Why Softmax Keeps Me Up at Night

To understand any solution, we first have to really understand the problem. I've spent the better part of my research career staring at loss curves, watching them dip, plateau, and occasionally spike catastrophically. We often treat the loss function as a black box, a simple signal telling the network "good dog" or "bad dog." But if you look closer, specifically at the geometry of the final layer, you realize that our standard tools are fundamentally broken.

Decoupling Features and Classes with Self-Organizing Class Embeddings

Raphael Pisoni

2023-10-17 19:30

Classification with neural networks is weird! There, I said it!

We usually have a single output per class, as if for some reason each class were its own feature. The numbers these outputs produce are then interpreted as a log-probability distribution over all the available classes. Everybody knows it doesn't make sense, yet we treat it as a mathematical assumption. Also, needing a separate output for every single class becomes insanely wasteful once you train for more than a few thousand classes. If you have a small model, your output layer might well be bigger than the rest of the network.

You can at least get around the huge number of outputs with an embedding-based method, by beheading a pretrained network or by doing some contrastive training, but the similarities you get out of them are hard to interpret and you have no measure of certainty.

So what can we do about it, you ask? I collected a few ideas...

Sharpened Cosine Similarity: Part 2

Raphael Pisoni

2022-02-18 06:00

A lot has happened since my last post on the Sharpened Cosine Similarity layer. In this post I will try to give you a quick overview of the most important developments around this feature extractor, which is shaping up to be more and more interesting.

Sharpened Cosine Similarity Feature Maps

Sharpened Cosine Distance as an Alternative for Convolutions

Raphael Pisoni

2022-01-06 13:00

A few days ago Brandon Rohrer retweeted his own Twitter thread from 2020, in which he argues that convolutions are actually pretty bad at extracting features. In it he proposes a method to improve feature extraction that seemed compelling to me.
The formula for this Sharpened Cosine Distance is the following:

$$ scd(s, k) = sign(s \cdot k)\Biggl(\frac{s \cdot k}{(\Vert{s}\Vert+q)(\Vert{k}\Vert+q)}\Biggr)^p $$

I decided to try this idea out and created a neural network layer based on this formula, and as it turns out, it actually works!
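To make the formula concrete, here is a minimal sketch for a single signal patch `s` and kernel `k`, both given as flat lists. It is not the layer from the post. One assumption is labeled in the code: the fraction's absolute value is taken before raising it to `p`, so the result stays real for non-integer `p` and keeps the sign of the dot product; for even integer `p` this matches the formula as written. The default values for `p` and `q` are arbitrary illustrations.

```python
import math


def sharpened_cosine_distance(s, k, p=2.0, q=1e-3):
    """Sign-preserving, sharpened cosine similarity between s and k.

    p: sharpening exponent (hypothetical default).
    q: small constant added to each norm so near-zero vectors do not
       blow up the ratio (hypothetical default).
    """
    dot = sum(a * b for a, b in zip(s, k))
    norm_s = math.sqrt(sum(a * a for a in s))
    norm_k = math.sqrt(sum(b * b for b in k))
    ratio = dot / ((norm_s + q) * (norm_k + q))
    # Assumption: use |ratio| ** p so fractional exponents stay real-valued.
    if dot == 0:
        return 0.0
    return math.copysign(abs(ratio) ** p, dot)


aligned = sharpened_cosine_distance([1.0, 0.0], [1.0, 0.0], p=2.0, q=0.0)
opposed = sharpened_cosine_distance([1.0, 0.0], [-1.0, 0.0], p=2.0, q=0.0)
orthogonal = sharpened_cosine_distance([1.0, 0.0], [0.0, 1.0], p=2.0, q=0.0)
diagonal = sharpened_cosine_distance([1.0, 1.0], [1.0, 0.0], p=2.0, q=0.0)
```

With `p = 2`, a 45° match drops from a cosine of about 0.71 to about 0.5, which is the "sharpening": partial matches are pushed toward zero while perfect matches stay at ±1.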

Bringing CLIP to the Italian language with Jax and Hugging Face

Raphael Pisoni

2021-11-05 21:02

CLIP is a model published by OpenAI that is able to learn visual concepts from natural language supervision. It does this by embedding images and their corresponding captions into a joint space and contrastively minimizing their distance. OpenAI only published weights for CLIP trained on English data. That's why, during the JAX/Flax community event organized by Hugging Face and Google, we, the clip-italian team, wanted to try to train a CLIP version that understands 🤌Italian🤌.
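As a rough illustration of what "contrastively minimizing their distance" can look like, here is a minimal pure-Python sketch of a symmetric cross-entropy loss over cosine similarities, the general shape of a CLIP-style objective. It is not the clip-italian training code; the function names are hypothetical, and the fixed `temperature` of 0.07 is an assumption for illustration.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    Assumption: temperature is a fixed constant here.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity of image i and caption j
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


images = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is closest to its own caption and large when the pairs are swapped.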

Imax: Making Image Augmentations fast with JAX

Raphael Pisoni

2021-02-22 20:00

sample transformations

Image augmentations make all the difference when working with neural networks. Everybody should know that by now. No matter what you're trying to train, if it involves images you should be using heavy and fancy augmentations! The only downside of these heavy augmentations is that they might slow down your training significantly if they are not implemented in a fast and efficient way. With Imax the goal was to solve that while getting better at JAX.
And you can try the results today!

JUDO-Net (Extended Edition)

Raphael Pisoni

2020-12-30 20:00

Since my paper on "Joint Unsupervised Depth-Estimation and Obstacle-Detection" did not get accepted to NeurIPS 2019, I now had another unpublished paper lying around. Back then more and more people around me started to get interested in neural networks and some (including my mom😂) were also interested in my work. I, however, kept struggling to explain to them what exactly it was I was actually doing.
So I had an idea: What if I could start explaining the concepts of neural networks at a relatively low level and explain my way up from there?

BBoxr: A simple Tool to collect Bounding Boxes

Raphael Pisoni

2020-11-30 20:01

I made a thing! It's called BBoxr and it's a web app to collect images and bounding-box information without installing anything. It works for desktops and phones. You can play with it here: www.rpi-bboxr.web.app

Hackathon: Diagnosing COVID-19 with X-Rays and Transfer Learning

Raphael Pisoni

2020-03-31 20:00

x-rays

In the early days of COVID-19, before there were cheap PCR and Antigen-Tests, one way to diagnose it was through a CT-Scan. The obvious drawbacks of this were of course that CT-Scans are pretty expensive, they are not equally available in all parts of the world and that they deliver quite a high dose of radiation.

In March 2020 Esther Schaiter and I heard about a 24-hour online hackathon with the goal of surfacing ideas to combat problems related to COVID-19. Since Esther was a final-year medical student, she was very close to the issue. Therefore we decided to tackle the idea of diagnosing COVID with X-rays.

Joint Unsupervised Depth-Estimation and Obstacle-Detection from Monocular Images

Raphael Pisoni

2019-12-08 20:00

Inspired by J-MOD² by Mancini et al. and my previous paper on depth estimation I wanted to go back and re-join depth estimation and obstacle detection in a single network. In the process I not only beat the SOTA for unsupervised monocular depth estimation but also introduced some really cool loss functions and a way to train a model to segment obstacles without ground truth data. Fun stuff, I promise!

Internship Report 2018 (3/3)

Raphael Pisoni

2018-11-03 18:16

3D Reconstruction from Stereo Images

The last internship project I’ll tell you about didn’t produce a satisfactory result in the limited time available, but at the same time it was the one that I definitely found most interesting for reasons I will explain below. While 3D reconstruction is a really interesting topic in general, it is essential for companies like Microtec in order to cut logs into the required planks in the most efficient way possible.

Internship Report 2018 (2/3)

Raphael Pisoni

2018-11-02 18:16

Plank Segmentation Model

The next project I’m going to tell you about is a model to segment wooden planks inside machines or in an industrial environment. This information can be used to improve many control problems in sawmill machines that currently have to be addressed with traditional sensors. For this model I would have liked to implement an oriented version of YOLO v3, but due to time constraints I essentially ended up using the same approach as for the log segmentation.

DeeplabV3 Encoder

Internship Report 2018 (1/3)

Raphael Pisoni

2018-11-01 18:16

Microtec logo

This summer I did an internship at Microtec, one of the leading providers of wood-scanning solutions for sawmills, and an emerging power in the domain of food scanning. In the next few posts I’m going to tell you about some of the projects I did there. I was hired with the goal of applying my experience in machine learning, and especially deep learning, to solve some of the tasks that the company is currently facing.

My First Paper

Raphael Pisoni

2018-09-16 18:16

In December of 2017, after completing the statistical evaluations for a paper published by Kristina Eichbichler from the Medical University of Vienna, I was looking for a new free-time project. As luck would have it, at the same time I got in contact with Professor Tammam Tillo, who happened to be the new Computer Vision specialist at my university. He suggested that I work on depth estimation.

Sample image of monocular vision from the AirSim Environment

About Me:

Raphael Pisoni

2016-01-20 13:53

My name is Raphael Pisoni. I am a Machine Learning Researcher but my path was not a straight line.