Some current and next-generation security solutions employ machine learning and related technologies. Because these applications are security-critical, using machine learning correctly is essential. One area of particular interest is the use of appropriate data for training and evaluation. In this work, we investigate different characteristics of datasets for security applications and propose a number of qualitative and quantitative metrics that can be evaluated with limited domain knowledge. We illustrate the need for such metrics by analyzing a number of datasets for anomaly and intrusion detection in automotive systems, covering both internal vehicle networks and vehicle-to-vehicle (V2V) communication. We demonstrate how the proposed metrics can be used to identify the strengths and weaknesses of these datasets.
This research was supported by the Vinnova FFI project "CyReV: Cyber Resilience for Vehicles" under the grants 2018-05013 and 2019-03071.