Many natural processes are encoded by sequences of elements: DNA is a sequence of four nucleobases, proteins are sequences of amino acids, and languages all around the world work the same way, with words being sequences of letters and sentences being sequences of words. But random sequences do not make a language: one has to impose a set of rules, which defines a “grammar”. The relation between a sentence and its grammar can be seen as a tree-like network in which the nodes are the grammatical rules and the leaves are the words constituting the sentence. With this description, is it possible to distinguish between grammars that produce random-word sentences and information-rich ones?
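The tree picture can be made concrete with a minimal sketch. The grammar below is a hypothetical toy invented for illustration (its symbols `S`, `NP`, `VP` and its vocabulary are assumptions, not taken from the paper): each internal node of the tree is a rule being applied, and the leaves, read from left to right, form the sentence.

```python
import random

# Hypothetical toy grammar: each non-terminal symbol maps to its possible expansions.
# Any symbol that is not a key of this dictionary is a word (a leaf of the tree).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N": [["child"], ["sentence"], ["grammar"]],
    "V": [["learns"], ["reads"], ["sleeps"]],
}

def derive(symbol, rng):
    """Build a derivation tree: a word is a bare string, a rule is (symbol, children)."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: a word at a leaf
    expansion = rng.choice(GRAMMAR[symbol])  # pick one rule for this node
    return (symbol, [derive(child, rng) for child in expansion])

def leaves(tree):
    """Read the words off the leaves, from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]

rng = random.Random(0)  # fixed seed so the run is reproducible
tree = derive("S", rng)
print(tree)
print(" ".join(leaves(tree)))
```

Because this toy grammar has no recursive rule, every derivation terminates after a few steps; a realistic grammar would allow symbols to reappear inside their own expansions.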
In a recent work published in Physical Review Letters, Eric DeGiuli, a postdoctoral fellow at the Philippe Meyer Institute in the ENS Physics Department, has developed a statistical physics model that describes grammars from a physicist’s point of view. His “Random Language Model” applies to a specific subset of all possible grammars, the “context-free grammars” (CFG), a category that contains all human languages.
In this context, DeGiuli identified a specific quantity in the CFG ensemble that plays the role of temperature in statistical physics and controls the sparseness of all the possible word trees. Lowering the temperature makes the tree interiors sparser. He showed that when the temperature drops below a critical value, the entropy of the system changes abruptly: the system goes through a phase transition from grammars allowing random sentences to grammars allowing only sentences that carry information. At that point, words cease to be mere labels and instead become the ingredients of sentences with complex structures and meanings.
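The link between sparseness and entropy can be illustrated with a toy calculation. This is not the Random Language Model itself: the sketch assumes that rule weights are drawn from a lognormal distribution whose spread `sigma` is a hypothetical stand-in for the sparseness control, and `rule_entropy` is a name introduced here for illustration. When all weights are equal, every rule is equally likely and the Shannon entropy is maximal; as the spread grows, a few rules dominate and the entropy falls.

```python
import math
import random

def rule_entropy(n_rules, sigma, rng):
    """Shannon entropy (in nats) of n_rules normalized lognormal weights.

    sigma is an assumed spread parameter: 0 gives uniform weights,
    larger values concentrate the probability on a few rules.
    """
    weights = [math.exp(rng.gauss(0.0, sigma)) for _ in range(n_rules)]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

rng = random.Random(1)  # fixed seed so the run is reproducible
for sigma in (0.0, 1.0, 3.0, 6.0):
    # Entropy starts at log(n_rules) for sigma = 0 and shrinks as sigma grows.
    print(sigma, round(rule_entropy(1000, sigma, rng), 3))
```

In this toy the entropy decreases smoothly with `sigma`; the abrupt change described above is a collective property of the full grammar ensemble and is not reproduced by a single set of independent weights.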
This description in terms of a phase transition could be useful for understanding how a language is learned. Children start in the high-temperature phase, where all languages are possible. As they are exposed to numerous trees (sentences) built with an unknown grammar, the temperature decreases, triggering a phase transition into the information-rich languages and allowing them to discover the underlying grammar. The use of inductive and probabilistic inference in DeGiuli’s theory is consistent with what is observed in children’s language acquisition. He hopes that this abstract process can ultimately be connected to observations at the neurological level.
Article reference: DeGiuli, Random Language Model, Phys. Rev. Lett. 122, 128301 (2019)
Author affiliation:
Institut de Physique Théorique Philippe Meyer, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, Paris, France