From 31c71877c9ade5be4d0625bcba8be4878a0085a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Simon=20Kl=C3=BCttermann?=
Date: Sat, 29 Jan 2022 19:06:46 +0100
Subject: [PATCH] minor corrections

---
 README | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README b/README
index 12e6462..d2b2626 100644
--- a/README
+++ b/README
@@ -1,9 +1,9 @@
 Ensembles are a way to combine multiple models to create a more powerful model.In anomaly detection you can use a concept called feature bagging to create multiple predictions from the same algoritm. For this each run of the algorithm only works on some features. Generally this is used to increase the robustness of the anomaly detection method (if 2 features seem really important, anomaly detection methods might neglect the other features. If you have runs without these important features, this forces the algorithm to still consider less important features), but I would like to explore a sligthly different question:
 If you are given multiple predictions, you will see events that are anomalous to some predictions, but normal to other ones. And when each model has different inputs, you might find that models considering a feature are anomalous, while models that dont consider the current event normal. In this case you could say that the input feature is the reason this event is anomalous.
-Youre task would be to develop this into a method to analyze the reason for a given anomaly. Normally I would now include some example code, but since my trivial example needs thausands of models to output something useful, I only show 2 example images.
-In both I train an ensemble of anomaly methods to differentiate mnist data (letters). The model should consider a "7" as normal, while finding every other letter as anomaly. The images shown are my favorite from ~20 I have looked at.
-The first image (example1.pdf pdf because vector graphics) shows a slightly weird 7 (a 7 with another line at the top) on the left and the "anomaly reason" on the right side. You see the part of the 7 which we would initially consider normal in black (low anomaly reason), but not the additional line as this is not a usual part of the "7".
-The second image (example2.pdf) shows a "2" and thus an anomaly. See this "2" here again as a "7" with another line. Again you see the basic structure of the "7" represented in the image, but this time the second line is really anomalous(We can not expect there to be a 7 with a line below, but we could imagine in the test set being another 7 with a line above), and so it is found by the algorithm and as you see in the heatmap, this is represented: This image is not a "7" since it contains another line.
-The biggest drawback of this algorithm is that it requires many different anomaly predictions (I used here ~2000, this is also only possible because I use an anomaly algorihm I thought of, which is really fast). This is partially the case since the mnist images used have many (784) features, and we can assume that this effect will be less strong with fewer features. You can probably still improve the speed (number of models) quite a lot. A better querry strategy for the feature bagging, a better combination function for the resulting anomaly scores or even some more active idea (train this model to test the current hypothesis) should help quite a lot.
-On the other hand, this algorithm could also be used for fewer features (where it will be much faster), but then you could also consider relations between the features (given two inputs, which are always between 0 and 1, but always the same: They are anomalous not for any value, but always when they are not the same)
+Youre task would be to develop this into a method to analyze the reason for a found anomaly. Normally I would now include some example code, but since my trivial example needs thausands of models (see below) to output something useful, I only show 2 example images.
+In both I train an ensemble of anomaly methods to differentiate mnist data (letters). The model should consider a "7" as normal, while finding every other letter as anomalous. The images shown are my favorites from ~20 I have looked at.
+The first image (example1.pdf pdf because vector graphics) shows a slightly weird 7 (a 7 with another line at the top) on the left and the "anomaly reason" on the right side. You see the part of the 7 which we would initially consider normal in black (low anomaly reason), but not the additional line as this is not a usual part of the "7". So we can clearly see which parts of the image make this "7" normal.
+The second image (example2.pdf) shows a "2" and thus an anomaly. See this "2" here again as a "7" with another line. Again you see the basic structure of the "7" represented in the image in black, but this time the second line is really anomalous (We can not expect there to be a 7 with a line below, but we could imagine in the training set being another 7 with a line above), and so it is found by the algorithm and as you see in the heatmap, this is represented: This image is not a "7" since it contains another line at the bottom.
+The biggest drawback of this algorithm is that it requires many different anomaly predictions (I used here ~2000, I use a very fast algorithm I invented, but this still takes a couple of hours of computation time). This is partially the case since the mnist images used have many (784) features, and we can assume that this effect will be less strong with fewer features. But you can also probably still improve the speed (number of models) quite a lot. A better querry strategy for the feature bagging, a better combination function for the resulting anomaly scores or even some more active idea (train this model to test the current hypothesis) should help quite a lot.
+On the other hand, this algorithm could also be even more useful for fewer features (where it will be much faster), and then you could also consider relations between the features (given two inputs, which are always between 0 and 1, but always the same: They are anomalous not for any value, but always when they are not the same. To find this relationship you really need to consider the relation between features)
 If you have any questions, feel free to write an email to Simon.Kluettermann@cs.tu-dortmund.de
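
Not part of the patch: a minimal sketch of the idea the README describes, under loud assumptions. The README does not say which anomaly algorithm is used or how the scores are combined, so the detector here (mean squared z-score on a random feature subset) is only a stand-in, and the names fit_ensemble, anomaly_reason, n_models and n_sub are hypothetical. The "reason" for a feature is taken as the mean score of the models that see it minus the mean score of the models that do not.

```python
import numpy as np

def fit_ensemble(X_train, n_models=200, n_sub=3, seed=0):
    """Draw one random feature subset per model (feature bagging)."""
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-9  # avoid division by zero
    subsets = [rng.choice(n_features, size=n_sub, replace=False)
               for _ in range(n_models)]
    return mean, std, subsets

def anomaly_reason(x, mean, std, subsets):
    """Per-feature reason: score with the feature minus score without it."""
    z2 = ((x - mean) / std) ** 2  # stand-in detector, NOT the author's algorithm
    n_features = x.shape[0]
    uses = np.zeros((len(subsets), n_features), dtype=bool)
    scores = np.empty(len(subsets))
    for i, idx in enumerate(subsets):
        uses[i, idx] = True
        scores[i] = z2[idx].mean()  # anomaly score of model i on its features
    reason = np.zeros(n_features)
    for j in range(n_features):
        with_j = scores[uses[:, j]]
        without_j = scores[~uses[:, j]]
        if with_j.size and without_j.size:
            reason[j] = with_j.mean() - without_j.mean()
    return reason

# Toy data: 10 features, normal points near 0, one event that is odd in feature 3.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 10))
x = np.zeros(10)
x[3] = 8.0
mean, std, subsets = fit_ensemble(X_train)
reason = anomaly_reason(x, mean, std, subsets)
print(int(np.argmax(reason)))
```

On this toy data the largest reason should land on the feature that was made anomalous. With 784 features, as in the mnist case, far more models would be needed before every feature is covered often enough, which matches the drawback the README describes.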