Benchmark definition standard
How can we define each benchmark and the metric on which it succeeds?
For example, a detection-efficiency test might find that 80% of events pass some Q2 cut, and we want the test to fail if the efficiency drops below 95%. Could we just have a JSON file like the following?
{ "name": "My Q2 cut",
"description":"Some Q2 cut that we expect high eff.",
"quantity":"efficiency",
"benchmark":"0.95",
"value":"0.80"
}
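Just to make the semantics concrete, here is a minimal Python sketch of how such a file could be evaluated (assuming it is saved as q2_cut.json and that for efficiency-like quantities larger is better):

import json

# Load a single test definition and compare the measured value
# against its threshold (here "benchmark" plays the role of the threshold).
with open("q2_cut.json") as f:
    test = json.load(f)

value = float(test["value"])
threshold = float(test["benchmark"])
passed = value >= threshold

print(f'{test["name"]}: value={value:.2f}, threshold={threshold:.2f}, '
      f'{"PASS" if passed else "FAIL"}')
# -> My Q2 cut: value=0.80, threshold=0.95, FAIL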
Should we think of this as a "benchmark" or a "test"?
I guess a "benchmark" could be comprised of one or more of these "tests"
{ benchmark : "DVCS in central",
test_results: [
{ "name": "My Q2 cut",
"description":"Some Q2 cut that we expect high eff.",
"quantity":"efficiency",
"goal_threshold":"0.95",
"value":"0.80",
"weight": "1.0"
},
{ "name": "Coplanarity analysis",
...
},
...
],
performance_limit "4.5"
performance_goal : "4",
performance: "4.1",
successful_goals: "5",
total_goals: "6"
}
where performance_limit is computed from the weights:

P_{limit} = \sum_{i \in \text{tests}} w_i

and the actual performance includes only the tests that pass:

P = \sum_{i \in \text{passed tests}} w_i
This assumes all tests are pass/fail; it could probably be relaxed to a per-test score in [0, 1].
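As a sketch of that scoring (assuming each test record carries the weight and goal_threshold fields from the example above, and that "passing" means value >= goal_threshold):

# Hypothetical scoring helper: pass/fail for now, but test_score() could be
# changed to return any value in [0, 1] for the relaxed scheme.
def test_score(test):
    value = float(test["value"])
    goal = float(test["goal_threshold"])
    return 1.0 if value >= goal else 0.0

def benchmark_performance(tests):
    # P_limit = sum of w_i over all tests
    p_limit = sum(float(t["weight"]) for t in tests)
    # P = sum of w_i over passing tests (or w_i * s_i in the relaxed scheme)
    p = sum(float(t["weight"]) * test_score(t) for t in tests)
    return p, p_limit

tests = [
    {"name": "My Q2 cut", "goal_threshold": "0.95", "value": "0.80", "weight": "1.0"},
]
p, p_limit = benchmark_performance(tests)
print(f"performance {p} out of a limit of {p_limit}")
# -> performance 0.0 out of a limit of 1.0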
Thoughts? @sly2j @cpeng @jihee.kim @Polakovic