There is considerable debate in the literature as to whether metalinguistic, echoing or metarepresentational phenomena require semantic or pragmatic explanations or, as is perhaps the most widely held view, a mixture of the two. Recently some attention has been paid to whether grammatical models, i.e., models that define syntactic-semantic mappings (see e.g. Potts 2007; Ginzburg and Cooper 2014; Maier 2014), can make a more substantial contribution to answering this question. In this chapter, we argue that they can, but not under standard assumptions as to what kind of mechanism “syntax” is and how grammatical processes are to be differentiated from pragmatic ones. Like Ginzburg and Cooper (2014), we take natural languages (NLs) to be primarily means of social engagement, and on this basis we believe that various mechanisms that have been employed in the analysis of conversation can be extended to account for metarepresentational phenomena, which, as stressed in the Bakhtinian literature, demonstrate how dialogic interaction can be embedded within a single clause. However, we take such phenomena as a case study to show that a model adequate to account for the whole range of metalinguistic data, as well as for their interaction with other dialogue phenomena, has to depart from some standard assumptions in grammatical theorising: (a) we have to abandon the view of syntax as a separate representational level for strings of words, and (b) we need to incorporate in the grammar formalism various aspects of psycholinguistic accounts of NL-processing, such as the intrinsic incrementality-predictivity of parsing/production, and a realistic modelling of the context as information states that record or invoke utterance events and their modal and spatiotemporal coordinates.