Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals