Forest ecosystem models are widely used as scientific tools and for forest management decision support, but they are usually evaluated individually against field data sets, while model intercomparison and joint evaluation studies are rare. We tested five forest models according to a harmonized protocol against data from nine forest compartments in the Snežnik region of Slovenia. The suite of models included stand- and landscape-scale as well as empirical and process-based models used across Europe. The test dataset originated from inventory data covering 50 years (tree measurements in 1963, 1983 and 2013) and included annual harvesting records at tree level. Uncertainties in data and forest conditions were considered by defining 12 scenarios varying initial regeneration, browsing pressure and harvest modalities. We evaluated the models' ability to initialize forest conditions accurately, whether management interventions could be implemented based on harvest records, and how well basal area and diameter structure could be predicted. Simulation results for basal area development showed good to satisfactory performance for all models, with SAMSARA2, SIBYLA and PICUS showing the best agreement. Comparison of simulated and observed diameter distributions showed good performance for ForClim, PICUS, SAMSARA2 and SIBYLA. Model output variability was between 6% and 24%, indicating the importance of considering uncertainties that can be attributed to specific sources. There was no clear hierarchy between more empirical and more process-based models regarding the accuracy of stand development projections. The cohort-based landscape model LandClim showed the lowest stand-level accuracy and scenario sensitivity, but results nevertheless qualified it for complementary application at landscape scale. Among individual-based models, spatially explicit models seemed to be more suitable for heterogeneous mixed mountain forests.
The findings demonstrated the usefulness of inventory datasets for model testing and intercomparison.