According to Andrews (2022), synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be used to validate mathematical models and to train machine learning models.
Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated (Nowruzi et al., 2019).
Synthetic data is generated to meet specific needs or conditions that may not be found in the original, real data. It is often generated to represent the authentic data and allows a baseline to be set. Another benefit of synthetic data is that it protects the privacy and confidentiality of authentic data (Barse, Kvarnstrom, and Jonsson, 2003). Furthermore, a generated synthetic data set allows verification and validation at an early stage of development.
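The idea of using synthetic data as a baseline for validating a model can be illustrated with a minimal sketch. It assumes a deliberately simple case that is not taken from the AIthena project: a straight line with known slope and intercept plus Gaussian noise. The function names `make_synthetic_line` and `fit_line`, and all parameter values, are hypothetical and chosen only for illustration.

```python
import random


def make_synthetic_line(n, slope, intercept, noise_std, seed=0):
    """Generate n noisy samples of y = slope * x + intercept on x in [0, 10]."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    xs = [10.0 * i / (n - 1) for i in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Ground truth is known by construction (assumed values for this example).
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, -1.0
xs, ys = make_synthetic_line(200, TRUE_SLOPE, TRUE_INTERCEPT, noise_std=0.5, seed=42)

est_slope, est_intercept = fit_line(xs, ys)
print(f"slope error:     {abs(est_slope - TRUE_SLOPE):.4f}")
print(f"intercept error: {abs(est_intercept - TRUE_INTERCEPT):.4f}")
```

Because the generating parameters are known exactly, any deviation in the estimates can be attributed to the fitting procedure and the noise level, which is what makes such data useful as an early-stage baseline.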
According to the MLOps framework developed in the AIthena project (see Fig. 1), synthetic data can be used at different stages of the MLOps lifecycle, highlighted by the dark blue boxes.

From the perspective of Connected, Cooperative and Automated Mobility (CCAM), synthetic data could play a key role in the verification and validation of different mobility solutions, accelerating their development and deployment.
The benefits and drawbacks of synthetic data used in verification and validation of CCAM solutions are summarized in Table 1.

With the aim of accelerating the development of CCAM solutions, Siemens Industry Software Netherlands B.V. is making public a small synthetic data set that contains ground truth, camera, radar, lidar and depth map information, all recorded for a simple test scenario under different weather and illumination conditions.

Author: Alexandru Forrai, Siemens Industry Software Netherlands B.V.
For more details, please check the following publications:
The synthetic data set, which contains the ground truth and the physics-based sensor outputs, namely camera (.jpg), radar (.pcd), lidar (.pcd) and an ideal depth map, at: Synthetic data set generated using Simcenter Prescan – Vulnerable Road Users in urban driving scenario
The user manual describing the synthetic data set structure and data format at: SYNTHETIC DATA DESCRIPTION – Data generated using Simcenter Prescan
A white paper about synthetic data benefits/drawbacks and possible usage at: Supporting automated driving systems development with synthetic data
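Since the radar and lidar outputs are provided as .pcd files, a minimal sketch of reading a point cloud may help as a starting point. It assumes the ASCII variant of the PCD format with one value per field; the actual encoding and field layout of this data set are described in the user manual and may differ. The function name `parse_ascii_pcd` and the sample content are hypothetical.

```python
def parse_ascii_pcd(text):
    """Parse ASCII PCD text into (header dict, list of point dicts).

    Assumption: DATA is 'ascii' and every field has COUNT 1.
    """
    header = {}
    lines = text.splitlines()
    data_start = None
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(" ")
        header[key.upper()] = value.split()
        if key.upper() == "DATA":
            data_start = i + 1  # point records begin on the next line
            break
    if data_start is None:
        raise ValueError("PCD header has no DATA entry")
    if header["DATA"] != ["ascii"]:
        raise ValueError("only ASCII PCD is handled in this sketch")

    fields = header["FIELDS"]
    points = []
    for raw in lines[data_start:]:
        values = raw.split()
        if values:
            points.append(dict(zip(fields, (float(v) for v in values))))
    return header, points


# Made-up two-point cloud, used only to exercise the parser.
SAMPLE_PCD = """# .PCD v0.7 - Point Cloud Data file format
VERSION 0.7
FIELDS x y z
SIZE 4 4 4
TYPE F F F
COUNT 1 1 1
WIDTH 2
HEIGHT 1
VIEWPOINT 0 0 0 1 0 0 0
POINTS 2
DATA ascii
1.0 2.0 3.0
4.0 5.0 6.0
"""

header, points = parse_ascii_pcd(SAMPLE_PCD)
print(header["FIELDS"], len(points))
```

For binary PCD files, or for larger point clouds, a dedicated point cloud library would be the more practical choice; the sketch above only shows how the header and the point records relate.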
References:
- Andrews, G., 2022. What is synthetic data? Accessed on 08 09, 2022. Available at: https://blogs.nvidia.com/blog/2021/06/08/what-is-synthetic-data
- Nowruzi, F., Kapoor, P., Kolhatkar, D., Hassanat, F., Laganiere, R., and Rebut, J., 2019. How much real data do we actually need: Analyzing object detection performance using synthetic and real data. In: arXiv preprint. arXiv:1907.07061
- Barse, E., Kvarnstrom, H., and Jonsson, E., 2003. Synthesizing test data for fraud detection systems. In: IEEE Proceedings of the 19th Annual Computer Security Applications Conference. doi:10.1109/CSAC.2003.1254343
