Context: Since 1998, the ACM/IEEE International Conference on Model Driven Engineering Languages and Systems (MODELS) has studied all aspects of modeling in software engineering, from languages and methods to tools and applications. To enable empirical studies, the MODELS community needed examples of models, especially models used in real software development projects. Such models may be used for a range of purposes, mostly related to domain analysis and software design (at various levels of abstraction). However, finding such models was very difficult. The most widely used ones originated in academic books or student projects, which addressed “artificial” applications, i.e., were not based on real-case scenarios. To address this issue, the authors of this reflection paper, members of the modeling and the mining software repositories fields, came together with the aim of creating a dataset with an abundance of modeling projects by mining GitHub. To scope our effort, we targeted models represented in the UML notation, because it is the lingua franca for software modeling in practice. As a result, almost 100k models from 22k projects were made publicly available as the Lindholmen dataset. Objective: In this paper, we analyze the impact of our research and compare it to what we envisioned in 2016. We draw practical lessons from this effort, reflect on the perils and pitfalls of the dataset, and point out promising avenues of research. Method: We base our reflection on a systematic analysis of recent research literature, especially those papers citing our dataset and its associated publications. Results: What we envisioned in the original research when making the dataset available has largely not come true; however, fellow researchers have found alternative uses for the dataset.
Conclusions: By understanding the possibilities and shortcomings of the current dataset, we aim to i) offer the research community future research avenues for how the data can be used; and ii) raise awareness of its limitations, not only to point out threats to the validity of research, but also to encourage fellow researchers to find ideas to overcome them. Our reflections can also be helpful to researchers who want to perform similar mining efforts.
The work of G. Robles has been supported in part by the Spanish Ministry of Science and Innovation (PID2022-139551NB-I0).