.. _stats:

Interpretation of results
=========================

The :ref:`graph` section describes ways to arrange graphs, such as splitting a graph into sub-graphs for each level of a variable or ignoring less important variables. But if you have many variables and results, graphs become hard to comprehend. It may be best to use statistical tools to reduce the search space and conduct a more restricted experimental design (see :ref:`variables`).

Statistics
----------

The ``--statistics`` argument generates some statistical information.

Below is the result of ``--statistics`` with the ``01-iperf-advanced.npf`` example, which uses the iPerf tool to measure the performance of local TCP connections. Compared to the front page example, this version has more variables, such as the number of CPUs allocated to the iPerf software.

The statistics are generated in two parts. First, for each observed metric (THROUGHPUT, L1 misses, etc.), some quantification is proposed.

.. As with many observations this would quickly sum up to a lot of text, only observations with

Then, a correlation matrix is built for all metrics at once.

.. code-block:: text

    Building dataset...
    Statistics for THROUGHPUT
    Learning dataset built with 6144 samples and 6 features...
    No tree graph when maxdepth is > 8. Use --statistics-maxdepth 8 to fix it to 8.
    Feature importance:
      CONGESTION : 0.0017
      NODELAY : 0.0025
      PARALLEL : 0.2349
      CPU : 0.2898
      WINDOW : 0.4711

    Max:
      PARALLEL = 8, CPU = 8, WINDOW = 512, CONGESTION = reno, NODELAY = Nagle enabled, TIME = 2 : 272730423296.00
    Min:
      PARALLEL = 1, CPU = 5, WINDOW = 4, CONGESTION = bbr, NODELAY = Nagle enabled, TIME = 2 : 17406361.60

    Means per variables:
      PARALLEL:
        1 : 23406949034.67
        2 : 41691482521.60
        3 : 55738238730.24
        4 : 64936604357.97
        5 : 69935443312.64
        6 : 73196809803.09
        7 : 73895472223.57
        8 : 71723332840.11

      CPU:
        1 : 10506062342.83
        2 : 29201748241.07
        3 : 43929965499.73
        4 : 53619540923.73
        5 : 70731434530.13
        6 : 84852326263.47
        7 : 89814343543.47
        8 : 91868911479.47

      WINDOW:
        1 : 104438169.60
        2 : 104145715.20
        4 : 104596548.27
        8 : 104852411.73
        16 : 104377548.80
        32 : 106264166.40
        64 : 99640165444.27
        128 : 93488939008.00
        256 : 94197328991.57
        512 : 94784307855.36
        1024 : 94311106478.08
        2048 : 94559996477.44
        4096 : 94357551404.37
        8192 : 94045998503.25
        16384 : 94278390906.88
        32768 : 94756206018.56

      CONGESTION:
        bbr : 60031966392.32
        cubic : 58966340638.72
        reno : 58948317777.92

      NODELAY:
        Nagle disabled : 57695090414.93
        Nagle enabled : 60935992791.04

    Correlation matrix:
                PARALLEL    CPU  WINDOW  CONGESTION  NODELAY  THROUGHPUT
    PARALLEL        1.00  -0.00   -0.00       -0.00    -0.00        0.23
    CPU                    1.00    0.00        0.00     0.00        0.41
    WINDOW                         1.00        0.00     0.00        0.25
    CONGESTION                                 1.00    -0.00       -0.01
    NODELAY                                             1.00        0.02
    THROUGHPUT                                                      1.00
    Graph of correlation matrix saved to doc/covariance-THROUGHPUT-correlation.png

    P-value of ANOVA (low p-value indicates a probable interaction):
                PARALLEL    CPU  WINDOW  CONGESTION  NODELAY  THROUGHPUT
    PARALLEL               0.00    0.00        0.24     0.36        0.00
    CPU                            0.00        0.88     0.16        0.00
    WINDOW                                     0.74     0.29        0.00
    CONGESTION                                          0.37        0.51
    NODELAY                                                         0.02
    Graph of a ANOVA matrix saved to doc/covariance-THROUGHPUT-anova.png
    Generating graphs...
    Pandas dataframe written to doc/covariance.csv
    Graph of test written to /etinfo/users2/tbarbette/workspace/npf/doc/covariance-THROUGHPUT.png

Feature importance
^^^^^^^^^^^^^^^^^^

The feature importance is built using the entropy of a regression tree. It shows the relative importance of each variable. Here ``WINDOW`` is more important than ``PARALLEL``, but arguably they are both important and both contribute to the ``THROUGHPUT`` metric.

The regression tree is saved to a PDF file for visualization. In the example above, it is not generated because the tree is too deep. Use ``--statistics-maxdepth 5`` to limit the tree depth.

.. image:: https://github.com/tbarbette/npf/raw/main/doc/covariance-THROUGHPUT-clf.png
   :width: 400
   :alt: Regression tree

The tree can be read as the most significant decisions to reach the best (or worst) performance.

Max/min and means per variable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The next lines show the variable values that produced the best and the worst results. Then, for each variable, the mean of the result (in this case the throughput) is given for each value of that variable.

Interactions with ANOVA
^^^^^^^^^^^^^^^^^^^^^^^

Finally, the last available statistic is the p-value of a two-way ANOVA test for each pair of variables.

.. image:: https://github.com/tbarbette/npf/raw/main/doc/covariance-THROUGHPUT-anova.png
   :width: 400
   :alt: ANOVA p-value matrix

It shows the possible interaction between each pair of variables. If the p-value is smaller than 0.05, there is a probable interaction. A value higher than 0.05 only means there is no clear linear interaction between the variables.

Correlation matrix for all parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All the other statistics are per-metric; a correlation matrix is then built for all metrics at once. It shows the Pearson correlation between each factor and each observation. The correlation matrix is printed as text and also generated as a picture.

.. image:: https://github.com/tbarbette/npf/raw/main/doc/covariance-THROUGHPUT-correlation.png
   :width: 400
   :alt: Correlation matrix

Correlation matrices are symmetrical. In this case it shows that the parameters have no correlation between themselves, but the interesting part is the correlation between factors and results. Here again we find a notion of the importance of each factor for the throughput.
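
To illustrate why the factor-to-factor cells are zero while the factor-to-result cells are not, here is a minimal sketch that computes Pearson coefficients by hand on a small full-factorial design. The ``pearson`` helper, the variable values and the throughput formula are all hypothetical and exist only for this example; this is not how NPF itself computes the matrix.

.. code-block:: python

    import itertools
    import math


    def pearson(xs, ys):
        """Pearson correlation coefficient of two equally long sequences."""
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / math.sqrt(var_x * var_y)


    # Hypothetical full-factorial design: every CPU value is run with
    # every PARALLEL value, as an exhaustive variable expansion would do.
    runs = list(itertools.product([1, 2, 3, 4], [1, 2, 4, 8]))
    cpu = [c for c, _ in runs]
    parallel = [p for _, p in runs]

    # Made-up result in which CPU weighs more than PARALLEL.
    throughput = [10 * c + 2 * p for c, p in runs]

    factor_vs_factor = pearson(cpu, parallel)
    cpu_vs_result = pearson(cpu, throughput)
    parallel_vs_result = pearson(parallel, throughput)

    print(f"CPU vs PARALLEL:        {factor_vs_factor:.2f}")
    print(f"CPU vs THROUGHPUT:      {cpu_vs_result:.2f}")
    print(f"PARALLEL vs THROUGHPUT: {parallel_vs_result:.2f}")

Because every combination of factor values appears exactly once, the two factors are uncorrelated by construction, which is why only the result column of the matrix carries information. Note that Pearson correlation only captures linear relationships: a variable with a threshold effect, such as ``WINDOW`` in the output above, can have a modest coefficient despite a high feature importance.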