Dialog robots, such as Amazon's Echo, have become increasingly popular in people's daily lives. With built-in speech recognition modules, users can interact with dialog robots easily via voice commands. However, before each round of human-robot interaction, users have to activate the robot from standby mode with special wake words, such as "Hey Siri", which is not user-friendly and differs from human-human conversation, where no such preamble is required. Meanwhile, the latest generation of dialog robots has been equipped with advanced sensors, such as cameras, enabling multimodal activation. In light of this, we first define a new research task of multimodal activation, which aims to wake the robot without wake words by leveraging the multimodal signals captured by the robot. To accomplish this task, we present a Multimodal Activation Scheme (MAS), consisting of two key components: audio-visual consistency detection and semantic talking intention inference. The first component measures the consistency between the audio and visual modalities to determine whether the heard speech comes from the user detected in front of the camera. To this end, two heterogeneous CNN-based networks are introduced to encode the fine-grained facial landmark features and the MFCC audio features, respectively. The second component infers the semantic talking intention of the recorded speech: the transcript of the speech is first recognized, and matrix factorization is then utilized to uncover the latent human-robot talking topics. We ultimately devise different fusion strategies to unify these two components. To evaluate MAS, we construct a dataset containing 12,741 short videos recorded by 194 invited volunteers. Extensive experiments demonstrate the effectiveness of our scheme.
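To make the audio-visual consistency component more concrete, the sketch below shows one possible realization: two small CNN branches that separately encode a facial-landmark sequence and an MFCC sequence, then score how well the two streams match. This is a minimal illustration assuming a PyTorch environment; the layer sizes, feature dimensions, and the cosine-similarity scoring are hypothetical choices, not the exact architecture described above.

```python
# Illustrative sketch only: two heterogeneous CNN branches that encode
# facial-landmark and MFCC features, then score their consistency.
# Dimensions and layers are hypothetical, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BranchEncoder(nn.Module):
    """1-D CNN over a (batch, channels, time) feature sequence."""

    def __init__(self, in_channels: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-length embedding
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (batch, embed_dim)


class ConsistencyDetector(nn.Module):
    """Scores whether the visual (landmark) and audio (MFCC) streams match."""

    def __init__(self, landmark_dim: int = 136, mfcc_dim: int = 40):
        super().__init__()
        self.visual_enc = BranchEncoder(landmark_dim)  # e.g. 68 landmarks x (x, y)
        self.audio_enc = BranchEncoder(mfcc_dim)       # e.g. 40 MFCC coefficients

    def forward(self, landmarks: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        v = F.normalize(self.visual_enc(landmarks), dim=-1)
        a = F.normalize(self.audio_enc(mfcc), dim=-1)
        # Cosine similarity in [-1, 1], mapped to a consistency probability.
        return torch.sigmoid((v * a).sum(dim=-1))


if __name__ == "__main__":
    model = ConsistencyDetector()
    landmarks = torch.randn(2, 136, 75)  # 2 clips, 75 visual frames each
    mfcc = torch.randn(2, 40, 300)       # 2 clips, 300 audio frames each
    print(model(landmarks, mfcc))        # one consistency score per clip
```

In the full scheme, a score of this kind would be combined with the output of the semantic talking intention component through the fusion strategies mentioned above to decide whether to activate the robot.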