Toward grounded spatio-temporal reasoning