Quote Originally Posted by Radar View Post
Might I ask, what kind of experiment that was? Just out of curiosity.

Still I guess you would not want NULL affecting aggregate functions. I would assume you needed to use more sophisticated tools to combine and analyse the data than standard statistical functions.
It's the dataset of observed exoplanets. Some are detected via radial velocity measurements (which give the period, but only a lower bound on the mass, m sin i, and no inclination information), some via transits (which give the radius and period, and sometimes extras like eccentricity and inclination), and some via both, which pins down the true mass because the transit provides an independent measurement of the inclination.
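
To illustrate that last point with a minimal sketch (the function name and numbers are mine, not from any catalogue): RV alone gives m sin i, so once a transit tells you the inclination you just divide out the sine.

```python
import math

def true_mass(m_sin_i, inclination_deg):
    """Recover the true planet mass from the RV minimum mass
    (m * sin i) once a transit has pinned down the inclination.
    Purely illustrative; units are whatever m_sin_i is in."""
    return m_sin_i / math.sin(math.radians(inclination_deg))

# A transiting planet has inclination near 90 deg, so the
# correction is small; at lower inclinations it grows quickly.
print(true_mass(1.0, 90.0))  # ≈ 1.0
print(true_mass(1.0, 30.0))  # ≈ 2.0
```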

For a transit, the planet has to pass between us and the disk of the star, and be big enough to produce a detectable dip, so you have a bias towards short periods and/or large radii. For radial velocity measurements, the planet can be further out, but the size of the signal scales with mass (and the star's type factors in too). And so on.
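
The geometric part of that transit bias is easy to see with a toy Monte Carlo (my numbers, not real survey statistics): for a circular orbit with isotropic orientation, an orbit transits when cos(i) < R_star / a, so close-in planets transit far more often.

```python
import random

def transit_fraction(a_over_rstar, n=100_000, seed=0):
    """Fraction of randomly oriented circular orbits that transit.
    For isotropic orientations cos(i) is uniform on [0, 1], and the
    geometric transit condition is cos(i) < R_star / a."""
    rng = random.Random(seed)
    hits = sum(rng.random() < 1.0 / a_over_rstar for _ in range(n))
    return hits / n

# A close-in planet (a = 10 stellar radii) transits much more
# often than one at a = 200 stellar radii (roughly Earth-like).
print(transit_fraction(10))   # ~0.10
print(transit_fraction(200))  # ~0.005
```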

In the end, we trained a generative model to reproduce the distribution, and found that when we trained with missing data (using e.g. a masking scheme to hide those values), the network actually exploited the fact that some values were masked to improve its estimates of the other values we tested it on. The missingness itself was informative: for instance, there was more information about the orbital period in the pattern of NULLs than in the radius, star type, etc.
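
Here's a self-contained toy version of that effect, with entirely made-up numbers: because short-period planets are far more likely to transit, whether the radius field is NULL ends up correlated with the period, so the mask alone predicts the period.

```python
import math
import random
import statistics

# Toy illustration of informative missingness: whether a field is
# NULL depends on the underlying physics, so the mask itself carries
# information about other fields. All numbers here are invented.
rng = random.Random(42)
rows = []
for _ in range(5000):
    period = rng.expovariate(1 / 100)  # days, mean 100
    # Short-period planets are much more likely to transit, so
    # their radius gets measured; long-period ones stay NULL.
    radius_known = rng.random() < math.exp(-period / 50)
    rows.append((period, radius_known))

with_radius = [p for p, known in rows if known]
without_radius = [p for p, known in rows if not known]

# Knowing only that radius is NULL already shifts your period
# estimate a lot: the two groups have very different means.
print(statistics.mean(with_radius))
print(statistics.mean(without_radius))
```

A model trained on such data will happily use the NULL pattern as a feature, which is exactly the behaviour described above.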

We never actually figured out a clean way around this, so we just proceeded with the 600 or so exoplanets that had the complete set of fields we ended up retaining. But that meant dropping fields like eccentricity, which would have been cool to keep.