Simon Klüttermann ddb98946ce slight corrections to readme 2022-01-29 18:58:08 +01:00
__pycache__ initial push 2022-01-29 13:04:08 +01:00
imgs initial push 2022-01-29 13:04:08 +01:00
old initial push 2022-01-29 13:04:08 +01:00
runs initial push 2022-01-29 13:04:08 +01:00
README slight corrections to readme 2022-01-29 18:58:08 +01:00
before.png initial push 2022-01-29 13:04:08 +01:00
choosenext.py initial push 2022-01-29 13:04:08 +01:00
data.py initial push 2022-01-29 13:04:08 +01:00
loss.py initial push 2022-01-29 13:04:08 +01:00
main.py initial push 2022-01-29 13:04:08 +01:00
merged.npz initial push 2022-01-29 13:04:08 +01:00
mu.py initial push 2022-01-29 13:04:08 +01:00
multimodel.py initial push 2022-01-29 13:04:08 +01:00
n2ulayer.py initial push 2022-01-29 13:04:08 +01:00
onemodel.py initial push 2022-01-29 13:04:08 +01:00
recombine.py initial push 2022-01-29 13:04:08 +01:00
requirements.txt initial push 2022-01-29 13:04:08 +01:00
suggestion.png initial push 2022-01-29 13:04:08 +01:00
updated.png initial push 2022-01-29 13:04:08 +01:00

README

Instead of trying to find a single model that is perfect at finding anomalies, ensembles combine multiple (possibly weak) models into one.
To do this, we need an algorithm to combine the predictions of the different models. One way (that I commonly use) is to just average them in some way, for example score=sqrt(score_1**2+score_2**2). Sadly, this only works well if you have a huge number of mostly uncorrelated models.
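The combination formula above can be sketched in a few lines. This is only an illustration in plain Python: the function name combine_scores and the example scores are made up, and it assumes that all models score the same samples on a comparable scale.

```python
import math

def combine_scores(*model_scores):
    """Combine per-sample scores as sqrt(score_1**2 + score_2**2 + ...)."""
    if not model_scores:
        raise ValueError("need at least one model")
    n = len(model_scores[0])
    if any(len(scores) != n for scores in model_scores):
        raise ValueError("all models must score the same samples")
    # zip(*model_scores) yields, for every sample, the tuple of all model scores.
    return [math.sqrt(sum(s * s for s in sample)) for sample in zip(*model_scores)]

# Made-up scores of two models for three samples.
score_1 = [3.0, 0.0, 1.0]
score_2 = [4.0, 2.0, 1.0]
combined = combine_scores(score_1, score_2)
print(combined)
```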
If you have only a few models, or correlated ones, you can introduce bias this way. Assume we have three models: an isolation forest (iforest), an SVM and a kNN algorithm. Assume further that the iforest has a low correlation to the other models (it finds different things anomalous than the SVM and the kNN do), but the SVM and the kNN find basically the same anomalies. If we just average the models, the SVM and the kNN together have a much bigger influence on the result than the iforest, and there is no good reason why this should be the case.
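A tiny numeric illustration of this bias, with made-up scores rather than real model outputs: the iforest is uncorrelated with the other two models, while the svm and the kNN agree on every sample. The helper pearson is written out here so that the example needs no extra packages.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long score lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up scores for four samples (1 = anomalous, 0 = normal).
iforest = [1.0, 0.0, 1.0, 0.0]
svm = [1.0, 1.0, 0.0, 0.0]
knn = [1.0, 1.0, 0.0, 0.0]  # agrees with the svm on every sample

# Plain average of the three models.
average = [(i + s + k) / 3 for i, s, k in zip(iforest, svm, knn)]

corr_iforest = pearson(average, iforest)
corr_svm = pearson(average, svm)
print(corr_iforest, corr_svm)
```

In this toy case the average follows the svm (and therefore the kNN) about twice as closely as it follows the iforest, although the iforest contributes just as much independent information.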
To solve this, you can add models depending on the correlations between them. But instead of relying on the correlations that exist between the models themselves, this repository uses a special kind of neural network to find uncorrelated parts of the model predictions.
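For comparison, the simpler correlation-based idea mentioned first could look roughly like the sketch below. It is not what this repository does (there is no neural network here), and the weighting rule, one over the summed absolute correlations, is just one assumed choice among many. The function name decorrelation_weights and the scores are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long score lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def decorrelation_weights(models):
    """Assumed rule: weight = 1 / sum of |correlation| to all models (itself included)."""
    raw = [1.0 / sum(abs(pearson(m, other)) for other in models) for m in models]
    total = sum(raw)
    return [w / total for w in raw]

# Made-up scores: svm and knn are identical, iforest is uncorrelated to both.
iforest = [1.0, 0.0, 1.0, 0.0]
svm = [1.0, 1.0, 0.0, 0.0]
knn = [1.0, 1.0, 0.0, 0.0]
models = [iforest, svm, knn]

weights = decorrelation_weights(models)
# Weighted average per sample.
weighted = [sum(w * m[i] for w, m in zip(weights, models)) for i in range(len(iforest))]
print(weights, weighted)
```

With these scores the iforest gets as much weight as the svm and the kNN together, so the weighted result is equally correlated with both groups.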

n2ulayer.py and mu.py define this special kind of neural network. loss.py defines the correlation we want to minimize, for use in TensorFlow.
onemodel.py generates a quick (and quite random) anomaly detection model for use on the data defined in data.py (just a 2D Gaussian). 20 models are generated, and their predictions (sorted from most normal (green) to most anomalous (red)) are drawn in the numbered images in the imgs folder.
If you use all 20 models and simply average them, this results in imgs/recombine.png. Notice how the green points are much more centered and much less arbitrary. This is what we want. (This image is created by recombine.py.)
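Why averaging helps can be seen in a small deterministic toy. Everything in it is an assumption made for illustration: the models are not those of onemodel.py but simple distance-to-a-point detectors, each centered on a different point of the unit circle (so each one is biased in its own direction), and the data is a regular grid instead of a 2D Gaussian sample.

```python
import math

def make_model(center):
    """Toy detector: the anomaly score is the distance to an off-center point."""
    cx, cy = center
    return lambda point: math.hypot(point[0] - cx, point[1] - cy)

# 20 hypothetical models, each biased towards a different direction.
n_models = 20
models = [
    make_model((math.cos(2 * math.pi * k / n_models),
                math.sin(2 * math.pi * k / n_models)))
    for k in range(n_models)
]

# Grid of test points covering [-2, 2] x [-2, 2] in steps of 0.5.
points = [(i * 0.5, j * 0.5) for i in range(-4, 5) for j in range(-4, 5)]

def average_score(point):
    """Plain average of all model scores for one point."""
    return sum(model(point) for model in models) / len(models)

# The "greenest" (most normal) point for one model and for the average.
most_normal_single = min(points, key=models[0])
most_normal_average = min(points, key=average_score)
print(most_normal_single, most_normal_average)
```

The single model calls its own off-center point the most normal one, while the average of all 20 puts the most normal point at the center of the data.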
choosenext.py creates and uses the TensorFlow model to find a list of predictions that are least correlated to a given list of predictions.
main.py uses this to combine a random model (before.png) with a combination of 4 models (suggestion.png) into updated.png. Notice how the area is covered much better in updated.png than in before.png. This might not be as good as imgs/recombine.png, but we also only used 2 instead of 20 models.
Your task would be to extend this method to be able to combine arbitrarily many models (use remainder in main.py, find a better combination function than combine(a,b) in main.py and introduce an exit condition) and test if this method results in more stable/powerful ensembles.
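One possible shape for such a loop is sketched below, as a starting point only. It swaps the neural network of choosenext.py for a much simpler greedy rule (pick the single remaining model with the lowest absolute Pearson correlation), so it is not the method of this repository. The names remainder and combine are borrowed from the text, but their real signatures in main.py are not known here; greedy_ensemble, max_abs_corr and the scores are made up.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equally long score lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def combine(a, b):
    """Placeholder combination: sqrt(a**2 + b**2) per sample."""
    return [math.sqrt(x * x + y * y) for x, y in zip(a, b)]

def greedy_ensemble(start, remainder, max_abs_corr=0.5):
    """Keep merging the least correlated remaining model until none is new enough."""
    current = list(start)
    candidates = list(enumerate(remainder))  # keep the original indices
    used = []
    while candidates:
        # Candidate whose scores are least correlated with the current ensemble.
        best = min(candidates, key=lambda c: abs(pearson(current, c[1])))
        if abs(pearson(current, best[1])) > max_abs_corr:
            break  # exit condition: every remaining model is too similar
        current = combine(current, best[1])
        used.append(best[0])
        candidates.remove(best)
    return current, used

# Made-up scores: start from the svm, choose between a kNN copy and the iforest.
svm = [1.0, 1.0, 0.0, 0.0]
knn = [1.0, 1.0, 0.0, 0.0]
iforest = [1.0, 0.0, 1.0, 0.0]
ensemble, used = greedy_ensemble(svm, [knn, iforest])
print(ensemble, used)
```

Here the loop merges the iforest first and then stops, because the kNN is too strongly correlated with the ensemble to add anything new. The threshold of 0.5 is arbitrary and would need to be tested.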
If you have any questions, please feel free to write an email to Simon.Kluettermann@cs.tu-dortmund.de