Spatial-Temporal Hierarchical Model For Joint Learning And Inference Of Human Action And Pose