Diffusion models have emerged as a powerful class of generative models with widespread adoption across many domains. They have also proven surprisingly effective as conditional policy representations in robot learning, spurring a variety of frameworks that use diffusion models to predict trajectories, action sequences, or videos. Despite this success, existing methods do not adequately address learning from multimodal goal specifications, a frequent requirement in Learning from Play (LfP) with sparse language labels. To address this gap, we present the Multimodal Diffusion Transformer (MDT), a novel diffusion policy framework. MDT combines multimodal transformers, pretrained foundation models, and latent token alignment to master long-horizon manipulation conditioned on multimodal goal specifications. Evaluated on the challenging CALVIN benchmark, MDT not only sets a new state of the art for end-to-end policies but does so with less than ten percent of the training time of prior approaches. Our experiments and ablations further validate the effectiveness of MDT and the design choices behind it.