An LLM in the reasoning of a cognitive architecture is like an engine in the flight of an airplane

ailev · 01 August 2023 14:56:32

John Sowa and I are once again talking a little about LLMs, only the thread has changed; now it is “Deriving or learning functions by a neural net”, https://groups.google.com/g/ontolog-forum/c/_qHJlpqRq9Q/m/1BxH4vreAQAJ

I will pull a couple of my letters out of that thread, because they contain many interesting links and I do not want to hunt for them again now that everything has ended up collected in one place. I did not translate them into Russian and did not fix the mistakes (I wrote at three in the morning, off the top of my head and without checking, so it is “how much watch”). The main thesis I am refuting is “LLMs are tensors, so they can neither create something new, such as discovering a physical law the way Kepler discovered that planets move along an ellipse, nor reason at all”. I will indent the quotations (I should have done that in the original letters, but never mind). The main idea there, which goes unheard from letter to letter (but I stubbornly keep repeating it):
– systems thinking makes you think not about a vanilla LLM but about the suprasystem that the LLM is part of. That is some cognitive architecture, some algorithm (most often it is part of a genetic algorithm). The claim “an LLM cannot reason because tensors do not reason” looks roughly like “an engine cannot fly because it is made of iron”. Clearly, an LLM is just one part of some other algorithm (a cognitive architecture), which usually also has memory and some calls to the LLM in a loop, feeding it different prompts. And it is this algorithm, not the vanilla LLM, that should be evaluated for reasoning.
– nevertheless, a vanilla LLM does have reasoning abilities, albeit limited ones. There are plenty of papers on this.
– and it has abilities for creativity too (generating something new that did not exist before, which is a primitive understanding of creativity). Creativity comes from the noise/chaos/randomness that appears for mutations in genetic algorithms, and in an LLM it gets in, for example, through prompts; an LLM can, for example, select a good mutation for testing and throw out an obviously bad one (smart mutations).
– and if we talk specifically about Kepler, he is a favorite example for symbolic discovery. Kepler did exactly something like this, calling his wet LLM in a loop to check various guesses and writing intermediate results down on paper as external memory.
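The loop described in the last bullet (propose a guess, check it against data, write the result down, repeat) can be sketched in a few lines. This is only an illustration under stated assumptions: `discovery_loop`, `stub_propose` and `toy_check` are hypothetical names, the stub stands in for a real LLM call, and the error values are made-up toy numbers, not Kepler's data.

```python
from typing import Callable, List, Optional, Tuple

# (hypothesis, error) pairs: the "paper notebook" used as external memory
Memory = List[Tuple[str, float]]


def discovery_loop(
    propose: Callable[[Memory], Optional[str]],  # stands in for one LLM call
    check: Callable[[str], float],               # fit error of a hypothesis
    tolerance: float = 0.01,
    max_calls: int = 40,
) -> Tuple[Optional[str], Memory]:
    """Call the proposer in a loop, recording every attempt in memory."""
    memory: Memory = []
    for _ in range(max_calls):
        hypothesis = propose(memory)      # the proposer sees all earlier attempts
        if hypothesis is None:            # nothing left to try
            break
        error = check(hypothesis)
        memory.append((hypothesis, error))
        if error <= tolerance:
            return hypothesis, memory
    return None, memory


# Toy stand-ins (assumptions for illustration only).
SHAPES = ["circle", "ovoid", "ellipse"]
TOY_ERRORS = {"circle": 0.30, "ovoid": 0.08, "ellipse": 0.001}


def stub_propose(memory: Memory) -> Optional[str]:
    """Return the first shape that has not been tried yet."""
    tried = {hypothesis for hypothesis, _ in memory}
    for shape in SHAPES:
        if shape not in tried:
            return shape
    return None


def toy_check(hypothesis: str) -> float:
    return TOY_ERRORS[hypothesis]


result, notebook = discovery_loop(stub_propose, toy_check)
print(result, len(notebook))  # ellipse 3
```

The point of the sketch is that reasoning is a property of the whole loop with its memory: swapping the stub for a real model changes the quality of the guesses, not the architecture.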

Below is some literature in support of these theses:

« In 1601, Johannes Kepler got access to the world’s best data tables on planetary orbits, and after 4 years and about 40 failed attempts to fit the Mars data to various ovoid shapes, he launched a scientific revolution by discovering that Mars’ orbit was an ellipse (1). This was an example of symbolic regression: discovering a symbolic expression that accurately matches a given dataset. More specifically, we are given a table of numbers, whose rows are of the form {x1,…, xn, y} where y = f(x1, …, xn), and our task is to discover the correct symbolic expression for the unknown mystery function f, optionally including the complication of noise.»

This is the first paragraph of the paper

AI Feynman: A physics-inspired method for symbolic regression
Silviu-Marian Udrescu and Max Tegmark
https://www.science.org/doi/10.1126/sciadv.aay2631
A core challenge for both physics and artificial intelligence (AI) is symbolic regression: finding a symbolic expression that matches data from an unknown function. Although this problem is likely to be NP-hard in principle, functions of practical interest often exhibit symmetries, separability, compositionality, and other simplifying properties. In this spirit, we develop a recursive multidimensional symbolic regression algorithm that combines neural network fitting with a suite of physics-inspired techniques. We apply it to 100 equations from the Feynman Lectures on Physics, and it discovers all of them [including Kepler’s ellipse equation], while previous publicly available software cracks only 71; for a more difficult physics-based test set, we improve the state-of-the-art success rate from 15 to 90%.

There is a second, improved version of the algorithm from that paper. An open-source implementation is available: GitHub - SJ001/AI-Feynman

AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity
Silviu-Marian Udrescu, Andrew Tan, Jiahai Feng, Orisvaldo Neto, Tailin Wu, Max Tegmark
arXiv:2006.10782

These algorithms use a neural network as one of their parts.
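To make the task from the quote concrete (a table of numbers in, a symbolic expression out), here is a minimal sketch of symbolic regression by brute force over a handful of candidate formulas. It is not the AI Feynman algorithm: the candidate list, the function names and the error measure are assumptions chosen for illustration, and the planetary values are rounded textbook numbers.

```python
import math

# (semi-major axis a in AU, orbital period T in years), rounded textbook values
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}

# A tiny, hand-picked hypothesis space (an assumption of this sketch).
CANDIDATES = {
    "T = a": lambda a: a,
    "T = a**2": lambda a: a ** 2,
    "T = a**1.5": lambda a: a ** 1.5,
    "T = sqrt(a)": lambda a: math.sqrt(a),
    "T = exp(a - 1)": lambda a: math.exp(a - 1),
}


def mean_relative_error(formula) -> float:
    """Average |prediction - observation| / observation over all planets."""
    errors = [abs(formula(a) - t) / t for a, t in PLANETS.values()]
    return sum(errors) / len(errors)


def best_candidate() -> str:
    """Return the name of the candidate formula with the smallest error."""
    return min(CANDIDATES, key=lambda name: mean_relative_error(CANDIDATES[name]))


print(best_candidate())  # T = a**1.5
```

The selected formula is Kepler's third law in units of AU and years. Real systems search a vastly larger expression space, which is where neural network fitting, genetic algorithms or logical constraints come in.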

There are multiple works on symbolic regression (sometimes it is called symbolic discovery). E.g.

From Kepler to Newton: Explainable AI for Science
Zelong Li, Jianchao Ji, Yongfeng Zhang

The Observation--Hypothesis--Prediction--Experimentation loop paradigm for scientific research has been practiced by researchers for years towards scientific discoveries. However, with data explosion in both mega-scale and milli-scale scientific...

… To demonstrate the AI-based science discovery process, and to pay our respect to some of the greatest minds in human history, we show how Kepler’s laws of planetary motion and Newton’s law of universal gravitation can be rediscovered by (Explainable) AI based on Tycho Brahe’s astronomical observation data, whose works were leading the scientific revolution in the 16-17th century.

Rediscovering orbital mechanics with machine learning
Pablo Lemos, Niall Jeffrey, Miles Cranmer, Shirley Ho, Peter Battaglia

We present an approach for using machine learning to automatically discover the governing equations and hidden properties of real physical systems from observations. We train a “graph neural network” to simulate the dynamics of our solar system’s Sun, planets, and large moons from 30 years of trajectory data. We then use symbolic regression to discover an analytical expression for the force law implicitly learned by the neural network, which our results showed is equivalent to Newton’s law of gravitation. The key assumptions that were required were translational and rotational equivariance, and Newton’s second and third laws of motion. Our approach correctly discovered the form of the symbolic force law. Furthermore, our approach did not require any assumptions about the masses of planets and moons or physical constants. They, too, were accurately inferred through our methods. Though, of course, the classical law of gravitation has been known since Isaac Newton, our result serves as a validation that our method can discover unknown laws and hidden properties from observed data.

Sure, there are also some non-neural-network algorithms for symbolic regression. E.g.

Combining data and theory for derivable scientific discovery with AI-Descartes
Cristina Cornelio, Sanjeeb Dash, Vernon Austel, Tyler R. Josephson, Joao Goncalves, Kenneth L. Clarkson, Nimrod Megiddo, Bachir El Khadir & Lior Horesh
https://www.nature.com/articles/s41467-023-37236-y

Scientists aim to discover meaningful formulae that accurately describe experimental data. Mathematical models of natural phenomena can be manually created from domain knowledge and fitted to data, or, in contrast, created automatically from large datasets with machine-learning algorithms. The problem of incorporating prior knowledge expressed as constraints on the functional form of a learned model has been studied before, while finding models that are consistent with prior knowledge expressed via general logical axioms is an open problem. We develop a method to enable principled derivations of models of natural phenomena from axiomatic knowledge and experimental data by combining logical reasoning with symbolic regression. We demonstrate these concepts for Kepler’s third law of planetary motion, Einstein’s relativistic time-dilation law, and Langmuir’s theory of adsorption. We show we can discover governing laws from few data points when logical reasoning is used to distinguish between candidate formulae having similar error on the data.

The discussion of empiricism (inferring something from data) versus rationalism (guessing some ontology before working with the data) is omitted here; see Rationalism vs. Empiricism (Stanford Encyclopedia of Philosophy). Sure, here we are speaking about rationalism: guessing a function and then fitting it to the data.

I wonder why we are speaking here about tensors and neural networks as capable or not capable of something in their pure form. This is too low a level for the discussion. At that level there is no emergent property of symbolic discovery/symbolic regression (emergence in the systems thinking sense: not “something appearing from nothing”, but the properties of the whole changing in comparison with the properties of its parts).

By the way, symbolic discovery is applied to neural networks too. Here it is:

Symbolic Discovery of Optimization Algorithms
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le

We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. Different from adaptive optimizers, its update has the same magnitude for each parameter calculated through the sign operation. We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. On vision-language contrastive learning, we achieve 88.3% zero-shot and 91.1% fine-tuning accuracy on ImageNet, surpassing the previous best results by 2% and 0.1%, respectively. On diffusion models, Lion outperforms Adam by achieving a better FID score and reducing the training compute by up to 2.3x. For autoregressive, masked language modeling, and fine-tuning, Lion exhibits a similar or better performance compared to Adam. Our analysis of Lion reveals that its performance gain grows with the training batch size. It also requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function. Additionally, we examine the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. Lion is also successfully deployed in production systems such as Google search ads CTR model.
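The two properties stressed in the abstract (only a momentum buffer is stored, and every parameter moves by the same magnitude because of the sign operation) can be illustrated with a short sketch of a sign-momentum update. The interpolation form, the default coefficients and the name `lion_step` are assumptions of this sketch and should be checked against the paper before any real use.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 0.01,
    beta1: float = 0.9,          # assumed default, see the paper
    beta2: float = 0.99,         # assumed default, see the paper
    weight_decay: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """One sign-momentum step; the momentum list is the only optimizer state."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Direction: sign of an interpolation between momentum and gradient.
        direction = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (direction + weight_decay * p))
        # The stored momentum is updated with a second coefficient.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# f(x, y) = x**2 + 10 * y**2 at (1, -1): gradients differ by 10x in magnitude,
# yet both coordinates move by exactly lr.
params, momentum = [1.0, -1.0], [0.0, 0.0]
grads = [2 * params[0], 20 * params[1]]
params, momentum = lion_step(params, grads, momentum)
print(params)
```

Because the step size does not depend on the gradient magnitude, the update norm is larger than a comparable adaptive step, which is consistent with the abstract's remark that a smaller learning rate is needed.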

Here is another example of combining a genetic algorithm (randomness/noise/chaos as the source of novelty is in the mutation step here) with a neural network:

Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions
Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, Kenneth O. Stanley

Creating open-ended algorithms, which generate their own never-ending stream of novel and appropriately challenging learning opportunities, could help to automate and accelerate progress in machine learning. A recent step in this direction is the Paired Open-Ended Trailblazer (POET), an algorithm that generates and solves its own challenges, and allows solutions to goal-switch between challenges to avoid local optima. However, the original POET was unable to demonstrate its full creative potential because of limitations of the algorithm itself and because of external issues including a limited problem space and lack of a universal progress measure. Importantly, both limitations pose impediments not only for POET, but for the pursuit of open-endedness in general. Here we introduce and empirically validate two new innovations to the original algorithm, as well as two external innovations designed to help elucidate its full potential. Together, these four advances enable the most open-ended algorithmic demonstration to date. The algorithmic innovations are (1) a domain-general measure of how meaningfully novel new challenges are, enabling the system to potentially create and solve interesting challenges endlessly, and (2) an efficient heuristic for determining when agents should goal-switch from one problem to another (helping open-ended search better scale). Outside the algorithm itself, to enable a more definitive demonstration of open-endedness, we introduce (3) a novel, more flexible way to encode environmental challenges, and (4) a generic measure of the extent to which a system continues to exhibit open-ended innovation. Enhanced POET produces a diverse range of sophisticated behaviors that solve a wide range of environmental challenges, many of which cannot be solved through other means.

The source of creativity/novelty in genetic algorithms (which are used in symbolic regression) is mutation. An LLM can be used to get «smart mutations».

Evolution through Large Models
Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, Kenneth O. Stanley

This paper pursues the insight that large language models (LLMs) trained to generate code can vastly improve the effectiveness of mutation operators applied to programs in genetic programming (GP). Because such LLMs benefit from training data that includes sequential changes and modifications, they can approximate likely changes that humans would make. To highlight the breadth of implications of such evolution through large models (ELM), in the main experiment ELM combined with MAP-Elites generates hundreds of thousands of functional examples of Python programs that output working ambulating robots in the Sodarace domain, which the original LLM had never seen in pre-training. These examples then help to bootstrap training a new conditional language model that can output the right walker for a particular terrain. The ability to bootstrap new models that can output appropriate artifacts for a given context in a domain where zero training data was previously available carries implications for open-endedness, deep learning, and reinforcement learning. These implications are explored here in depth in the hope of inspiring new directions of research now opened up by ELM.
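A minimal sketch of where a “smart mutation” sits in an evolutionary loop, under stated assumptions: this is not the ELM method, the task (fitting the exponent in T = a**p to a few rounded planetary values) is a toy, and `looks_plausible` is a hypothetical stub standing in for an LLM that discards obviously bad mutants before they are evaluated.

```python
import random

# (semi-major axis in AU, period in years), rounded textbook values
DATA = [(0.387, 0.241), (1.000, 1.000), (1.524, 1.881), (5.203, 11.862)]


def error(p: float) -> float:
    """Mean relative error of the law T = a**p on the data."""
    return sum(abs(a ** p - t) / t for a, t in DATA) / len(DATA)


def random_mutation(p: float, rng: random.Random) -> float:
    """Noise is the source of novelty: perturb the parent at random."""
    return p + rng.gauss(0.0, 0.1)


def looks_plausible(p: float) -> bool:
    """Stub for an LLM judgement (assumption): reject absurd exponents."""
    return 0.0 < p < 3.0


def evolve(generations: int = 200, offspring: int = 8,
           seed: int = 0, start: float = 0.5):
    rng = random.Random(seed)
    best, best_err = start, error(start)
    evaluations = 0
    for _ in range(generations):
        for _ in range(offspring):
            child = random_mutation(best, rng)
            if not looks_plausible(child):   # the "smart" filter
                continue
            evaluations += 1                 # only plausible mutants are tested
            child_err = error(child)
            if child_err < best_err:         # selection: keep improvements only
                best, best_err = child, child_err
    return best, best_err, evaluations


best, best_err, evaluations = evolve()
print(round(best, 2), evaluations)
```

In this toy the filter rarely fires, because the search stays in a sensible region; in program evolution, where most random edits break the code, a filter or generator informed by an LLM is what makes the mutation step affordable.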

Thus neural networks in general and LLMs in particular are now used with a variety of genetic algorithms for discovering symbolic laws like Kepler's laws, for inventing algorithms (like optimization algorithms), or for providing smart mutations to the symbolic evolution of some code.

In all my examples no vanilla LLM is in use; it is always in some system environment. The solution always comes from a suprasystem in which the LLM is one of the systems that computes a partial result. E.g. a combustion engine cannot fly, sure. But an airplane with a combustion engine can fly. For me your thesis “LLMs themselves cannot do any reasoning” is like “motors cannot fly”. Sure, motors cannot fly, this is trivial. But without a motor an airplane cannot fly! This is the question.

I am always speaking about cognitive architectures that today include ANNs, with LLMs as specially trained ANNs. All these architectures with novelty search use some kind of genetic algorithm with possibly smart (due to the LLM) mutations. Mutations have noise/chaos/randomness as a source of novelty. LLMs always work with prompts, so with randomness in the prompts they will produce some novelty, which you can enhance by clever use of some kind of evolutionary algorithm (with memory of LLM results and some kind of loop around the LLM call). Kepler had memory on paper and called himself in a loop for different mutations of the suggested simple forms. LLM usage in symbolic discovery is the same, completely analogous to what Kepler did.

By the way, there are multiple frameworks that deal with LLM reasoning. Here are two of them; you can try them, they are open source:

LLM Reasoners – a library for advanced reasoning with Large Language Models

GitHub - Ber666/llm-reasoners: A library for advanced large language model reasoning

LLM Reasoners is a library to enable LLMs to conduct complex reasoning, with advanced reasoning algorithms. It approaches multi-step reasoning as planning and searches for the optimal reasoning chain, which achieves the best balance of exploration vs exploitation with the idea of “World Model” and “Reward”. Paper: «Reasoning with Language Model is Planning with World Model», arXiv:2305.14992

Given any reasoning problem, simply define the reward function and an optional world model (explained below), and let LLM reasoners take care of the rest, including Reasoning Algorithms, Visualization, LLM calling, and more!
Why Choose LLM Reasoners?
• Cutting-Edge Reasoning Algorithms: We offer the most up-to-date search algorithms for reasoning with LLMs, such as RAP-MCTS, Tree-of-Thoughts, Guided Decoding, and more. These advanced algorithms enable tree-structure reasoning and outperform traditional chain-of-thoughts approaches.
• Intuitive Visualization and Interpretation: Our library provides visualization tools to aid users in comprehending the reasoning process. Even for the most complex reasoning algorithms like Monte-Carlo Tree Search, users can easily diagnose and understand what occurred with one line of python code.
• Compatibility with any LLM libraries: Our framework is compatible with any LLM frameworks, e.g. Huggingface transformers, OpenAI API, etc. Specifically, we integrated LLaMA with the option of using fairscale backend for improved multi-GPU performance or LLaMA.cpp backend with lower hardware requirements.
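The idea of treating reasoning as search over chains of steps, guided by a reward, can be shown with a generic beam search. This is not the llm-reasoners API: `beam_search`, `expand`, `reward` and `is_goal` are hypothetical names, and the toy arithmetic “world model” below replaces what would be LLM-generated steps in a real system.

```python
from typing import Callable, List, Optional, Tuple


def beam_search(
    start: int,
    expand: Callable[[int], List[Tuple[str, int]]],  # state -> (step, next state)
    reward: Callable[[int], float],                  # higher is better
    is_goal: Callable[[int], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[List[str]]:
    """Keep the beam_width best partial chains at every depth."""
    beam: List[Tuple[int, List[str]]] = [(start, [])]
    for _ in range(max_depth):
        candidates = [
            (next_state, chain + [step])
            for state, chain in beam
            for step, next_state in expand(state)
        ]
        for state, chain in candidates:
            if is_goal(state):
                return chain
        # Exploit the reward: only the most promising chains survive.
        candidates.sort(key=lambda item: reward(item[0]), reverse=True)
        beam = candidates[:beam_width]
    return None


TARGET = 11


def toy_expand(state: int) -> List[Tuple[str, int]]:
    """Toy world model: two possible reasoning steps from any state."""
    return [("+3", state + 3), ("*2", state * 2)]


chain = beam_search(
    start=1,
    expand=toy_expand,
    reward=lambda state: -abs(TARGET - state),
    is_goal=lambda state: state == TARGET,
)
print(chain)  # ['+3', '*2', '+3']
```

A wider beam explores more alternatives at higher cost; Monte-Carlo Tree Search, mentioned in the list above, balances the same exploration/exploitation trade-off in a more adaptive way.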

By the way, prompt engineering can do wonderful things with LLMs. E.g. simply adding “Let’s think step by step” before each answer gives an LLM more reasoning ability. Prompt engineering (how to ask an LLM about something and how to call an LLM to get more reasoning from it) is an important part of the topic. Declarations such as “motors cannot fly because the iron they are made of cannot fly” and “LLMs cannot reason because the tensors they are made of cannot reason” are wrong. The systems approach requires discussing the full systems stack because of emergent properties at every level. “Emergence” has many meanings. In the systems approach it means new properties, and new disciplines that study these properties, at every systems level (level of the part-whole/system-environment hierarchy). Moreover, Kepler already had the “common sense” of the math and physics of his time in his brain. An LLM has the same: it is not a vanilla ANN, it is a trained ANN.

Here is a paper that was published before the GPT-4 launch; it is about general principles of prompting reasoning in LLMs. It covers a single reasoning call to the LLM, but remember that you can always chain multiple such reasoning steps, e.g. in the LLM Reasoners framework mentioned earlier:

Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.</blockquote>
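The prompting trick from the abstract is easy to sketch. The code below builds a reasoning prompt ending in “Let's think step by step.” and then makes a second call to extract the final answer. The function name `zero_shot_cot`, the exact wording of the answer trigger and the stub model are assumptions of this sketch; `llm` stands for any text-in, text-out callable.

```python
from typing import Callable, List

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"  # assumed wording


def zero_shot_cot(question: str, llm: Callable[[str], str]) -> str:
    """Two calls: first elicit reasoning, then extract the final answer."""
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = llm(reasoning_prompt)
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    return llm(answer_prompt).strip()


# A stub model (assumption) so the sketch runs without any API access.
seen_prompts: List[str] = []


def stub_llm(prompt: str) -> str:
    seen_prompts.append(prompt)
    if prompt.endswith(REASONING_TRIGGER):
        return "There are 3 + 5 apples in total."
    return " 8"


answer = zero_shot_cot("I have 3 apples and buy 5 more. How many do I have?", stub_llm)
print(answer)  # 8
```

No examples are placed in the prompt, which is what “zero-shot” refers to; the extra reasoning comes only from the trigger phrase and from calling the model twice.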