Multimodal Diffusion Transformer for Learning from Play