Hierarchical Latent Networks For Image And Language Correlation