Now let's focus a bit more on the trivial model
in it, I compare just the angular part to 0 (its mean)
and as you can see on the left side, the distribution for tops is way more complicated (logarithmic color coding!)
so since comparing to zero amounts to approximating the radius in the #eta#-#phi# plane, tops are clearly classifiable this way
compare this to the distribution in #p_t#
basically no preference
the preference even switches depending on the displayed particle
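A minimal sketch of this trivial model, assuming the constituents come as an array of shape (n_jets, n_constituents, 3) with features (#p_t#, #eta#, #phi#), centred so that the mean of the angular part is roughly zero; names and shapes are assumptions, not the actual implementation:

```python
# Trivial model sketch: "reconstruct" the angular part as 0 (its mean);
# the error is just the squared radius in the eta-phi plane per jet.
import numpy as np
from sklearn.metrics import roc_auc_score

def trivial_score(jets):
    eta, phi = jets[..., 1], jets[..., 2]
    return np.sum(eta**2 + phi**2, axis=-1)   # larger = "more complicated"

# labels: 1 = top, 0 = QCD
# auc = roc_auc_score(labels, trivial_score(jets))
```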
This is related to another problem
if you trained a working autoencoder not on QCD data but on top data, it would still consider tops more complicated
You can see this best in AUC maps
These show the AUC as a function of the particle id and the current feature
blue color = QCD data is simpler
red color = top data is simpler
white color = no preference
a perfectly working network would be dark blue if trained on QCD and dark red if trained on top
you can subtract those maps
here more different=more red
basically no difference in angular data
you have the same problem of adding d-distributions as you have in the scaling case
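A small sketch of how such an AUC map and its difference could be computed, assuming a per-particle, per-feature reconstruction error array (the names, shapes and colour conventions here are assumptions):

```python
# AUC map sketch: err has shape (n_jets, n_particles, n_features),
# labels are 1 = top, 0 = QCD; the blue/red convention from the text
# corresponds to which class is taken as the positive label.
import numpy as np
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

def auc_map(err, labels):
    n_particles, n_features = err.shape[1], err.shape[2]
    aucs = np.zeros((n_particles, n_features))
    for p in range(n_particles):
        for f in range(n_features):
            aucs[p, f] = roc_auc_score(labels, err[:, p, f])
    return aucs

# map_qcd = auc_map(err_from_qcd_trained_ae, labels)
# map_top = auc_map(err_from_top_trained_ae, labels)
# diff = map_qcd - map_top                          # subtract the maps
# plt.imshow(map_qcd, cmap="bwr", vmin=0, vmax=1); plt.colorbar()
```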
so you could ask yourself if adding something to the angular data actually helps
comparing the angular-only data to the full data, you see that adding #p_t# in fact hurts the AUC (even if only a bit)
this effectively means my current network does not use #p_t# at all
But again, this does not mean that there is no information in #p_t#
in fact, you see in these AUC maps that the #p_t# part is actually red where it should be red and blue where it should be blue
so how about using only #p_t#?
you obviously lose quality
also, training an autoencoder to reach a high AUC in #p_t# is not trivial either
multiplicative scaling does not really work
the best network reaches an AUC of about #0.78#, which is about the same as QCDorWhat gets for minimally mass-decorrelated networks
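A rough sketch of what a #p_t#-only autoencoder could look like; the dense architecture, constituent count and the log scaling are assumptions (the text only says that multiplicative scaling does not really work), not the network that actually reached #0.78#:

```python
# Hedged p_t-only autoencoder sketch (all choices here are assumptions).
import torch
import torch.nn as nn

n_const = 20  # assumed number of leading constituents per jet

autoencoder = nn.Sequential(
    nn.Linear(n_const, 16), nn.ReLU(),
    nn.Linear(16, 4), nn.ReLU(),      # bottleneck
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, n_const),
)

def preprocess(pt):                   # pt: tensor of shape (n_jets, n_const)
    return torch.log1p(pt)            # assumed alternative to multiplicative scaling

# x = preprocess(pt)
# loss = nn.functional.mse_loss(autoencoder(x), x)   # per-jet error = anomaly score
```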
Benefits
Problems
you basically split your training into one network with a good AUC, and one that (hopefully) learns non-trivial stuff
So maybe you could do the same with some different preprocessing (one that does not just give you trivial information)
Easiest Transformation: no Transformation (4 vectors)
so the inputs are (sketched below):
Energy
#p_1#
#p_2#
#p_3#
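For reference, a small sketch of how these untransformed inputs relate to the usual #p_t#, #eta#, #phi# variables, assuming massless constituents:

```python
# Build the raw 4-vector inputs (E, p_1, p_2, p_3) from (p_t, eta, phi).
import numpy as np

def to_four_vectors(pt, eta, phi):
    p1 = pt * np.cos(phi)
    p2 = pt * np.sin(phi)
    p3 = pt * np.sinh(eta)
    E  = pt * np.cosh(eta)            # massless: E = |p|
    return np.stack([E, p1, p2, p3], axis=-1)
```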
trained on QCD, but prefers top!
Why is that so?
maybe just a bad network
compare metrics (defining the distance used in the TopK step)
4-vectors basically require the network to learn the meaning of #phi# and #eta# itself
so without that, there is no concept of locality, and therefore no useful graph
add a Dense network in front of the TopK
better, but still not good
still run TopK on the preprocessed data (both options are sketched below)
good, but numerical problems
requires going down to 4 particles and less training data
same good reconstruction in #p_1# and #p_2#
makes sense, since #Eq(p_t**2,p_1**2+p_2**2)#
but apparently Energy and #p_3# prefer tops
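For completeness, a minimal sketch of the two graph-building variants from above (a dense embedding in front of TopK, and TopK on the preprocessed coordinates); tensor names, shapes and k are assumptions:

```python
# One jet: node features x of shape (N, 4) = (E, p_1, p_2, p_3) and preprocessed
# coordinates coords of shape (N, 2) = (eta, phi).
import torch
import torch.nn as nn

def knn_edges(positions, k):
    # TopK step: distance matrix in whatever space defines "closeness",
    # then keep the k nearest neighbours of every node
    dist = torch.cdist(positions, positions)               # (N, N)
    idx = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-distance
    src = torch.arange(positions.size(0)).repeat_interleave(k)
    return torch.stack([src, idx.reshape(-1)])             # edge_index, (2, N*k)

# variant 1: learn the distance space with a dense network in front of the TopK
embedding = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
# edge_index = knn_edges(embedding(x), k=3)

# variant 2: run TopK on the preprocessed (eta, phi) coordinates,
# while the autoencoder itself still sees the raw 4-vectors as node features
# edge_index = knn_edges(coords, k=3)
```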