Doctoral Student in Computer Science (Ref no: ORU 2.1.1-01867/2023)
Applicant: Konark Karna, MSc Advanced Computer Science, Northumbria University.
PDF: Doctoral student in Computer Science[Orebro University Application]
Aim
To develop novel techniques and methodologies that improve the reasoning abilities of large language models.
Background
Scaling up Large Language Models (LLMs) has been shown to give rise to emergent abilities that are absent in smaller models (Wei et al., 2022a). Ideally, however, the goal is not to build an ever-larger language model but an LLM with the reasoning ability to dissect and solve complex tasks.
To take advantage of LLMs’ strengths, few-shot prompting is now a standard approach. In few-shot prompting, a set of k input-output examples (typically fewer than 10) is concatenated into a prompt together with a test instance to improve response generation. Building on this, Wei et al. (2022b) proposed the Chain-of-Thought (CoT) method for solving complex reasoning tasks with LLMs. In CoT, each input-output exemplar is accompanied by additional “thoughts”, or a chain of thought, so that the model generates text describing the reasoning process along with the final answer to a given query. CoT improves LLMs’ performance on arithmetic, commonsense, and symbolic reasoning tasks.
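As a minimal sketch of how such a prompt could be assembled, the following Python snippet concatenates exemplars with their rationales and appends the test question. The `Exemplar` class, the `build_cot_prompt` function, the "Q:/A:" layout, and the toy questions are illustrative assumptions, not the exact format used by Wei et al. (2022b).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    """One worked example (hypothetical container, for illustration only)."""
    question: str
    rationale: str  # the intermediate "thoughts" shown to the model
    answer: str


def build_cot_prompt(exemplars: List[Exemplar], test_question: str) -> str:
    """Concatenate k worked exemplars and append the unanswered test question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The test instance ends with an open "A:" so the model continues with
    # its own reasoning chain followed by a final answer.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


# Toy exemplar written for this sketch.
exemplars = [
    Exemplar(
        question="Tom has 3 apples and buys 2 more. How many apples does he have?",
        rationale="Tom starts with 3 apples. 3 + 2 = 5.",
        answer="5",
    ),
]
prompt = build_cot_prompt(
    exemplars,
    "A shelf holds 4 books and 6 more are added. How many books are on the shelf?",
)
print(prompt)
```

Dropping the `rationale` field from each block would turn the same prompt into a plain few-shot prompt, which makes the two settings easy to compare.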
When combined with the self-consistency decoding method, CoT shows further improvement in reasoning ability. This method samples multiple reasoning paths and takes a majority vote over all generated answers to select the most consistent one as the final answer (Wang et al., 2022). Among CoT-based methods, however, Faithful CoT performs best. Faithful CoT is a two-stage reasoning method. In the first stage, Translation, the problem is translated into a reasoning chain that interleaves Natural Language (NL) and Symbolic Language (SL) components: the NL component decomposes the problem into subproblems, and the SL component handles each subproblem separately. In the second stage, Problem Solving, a deterministic solver is used to reach the final solution (Lyu et al., 2023). Faithful CoT performs even better than Least-to-Most prompting, discussed below.
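The majority-vote step of self-consistency can be sketched in a few lines of Python. The answer-extraction pattern ("The answer is N") and the hand-written completions below are assumptions made for illustration; in practice the completions would be sampled from the model.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed output convention: each completion ends with "The answer is <number>."
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final numeric answer out of one generated reasoning path."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1) if match else None


def self_consistent_answer(completions: List[str]) -> Optional[str]:
    """Return the answer that the largest number of reasoning paths agree on."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for sampled reasoning paths (not model output).
sampled = [
    "3 + 2 = 5. The answer is 5.",
    "He has 3, then 2 more, so 6. The answer is 6.",
    "3 plus 2 is 5. The answer is 5.",
]
final = self_consistent_answer(sampled)
print(final)
```

Completions from which no answer can be extracted are simply dropped from the vote in this sketch.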
CoT, however, performs poorly when the problems presented to the LLM are harder than the given exemplars. The Least-to-Most (LtM) prompting method was proposed to resolve this generalization problem. In LtM, each problem is decomposed into a list of subproblems, which are then solved sequentially, with the answer to each subproblem appended to the prompt for the next (Zhou et al., 2022).
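The sequential solving loop of LtM can be sketched as follows. This is a simplified illustration: the subproblems are supplied by hand rather than produced by a decomposition prompt, and `toy_model` is a hypothetical stand-in for an LLM call that answers from a lookup table.

```python
from typing import Callable, List, Tuple


def solve_least_to_most(
    problem: str,
    subproblems: List[str],
    ask_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Solve subproblems in order, feeding each answer into the next prompt."""
    prompt = problem
    answer = ""
    for sub in subproblems:
        prompt += f"\nQ: {sub}\nA:"
        answer = ask_model(prompt)
        # Append the answer so that later subproblems can build on it.
        prompt += f" {answer}"
    return answer, prompt


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: looks up the last question asked."""
    canned = {
        "How many apples does Anna have?": "Anna has 2 * 3 = 6 apples.",
        "How many apples do they have together?": "Together they have 3 + 6 = 9 apples.",
    }
    last_question = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    return canned[last_question]


answer, transcript = solve_least_to_most(
    "Elsa has 3 apples. Anna has twice as many apples as Elsa.",
    ["How many apples does Anna have?", "How many apples do they have together?"],
    toy_model,
)
print(transcript)
```

The returned transcript shows that the answer to the first subproblem is already part of the prompt when the second one is asked.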
Another method proposed to address CoT’s weak generalization is Program of Thoughts (PoT). Chen et al. (2022) proposed PoT prompting to delegate computation steps to an external Python interpreter. This procedure significantly reduces the lines of code necessary and still performs better than CoT under both few-shot prompting and zero-shot prompting (where no exemplar is provided in the prompt). Similar to PoT, the Program-aided Language (PAL) model uses the Python interpreter as an intermediate reasoning step. PAL is somewhat congruent with Faithful CoT: Gao et al. (2023) proposed PAL as a one-stage method in which the solution to each problem is written as interleaved NL and Programming Language (PL) statements.
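To illustrate the idea of delegating computation to the interpreter, the sketch below executes a short program of the kind an LLM might generate and reads back the result. The program text is hand-written for this example, and the convention that the result is stored in a variable named `ans` is an assumption of the sketch. Note that `exec` with emptied builtins is not a real sandbox; untrusted model output would need proper isolation.

```python
from typing import Any, Dict


def run_generated_program(program: str, result_name: str = "ans") -> Any:
    """Execute model-written Python and return the value bound to result_name."""
    namespace: Dict[str, Any] = {}
    # Emptying __builtins__ only limits what the program can call by accident;
    # it does not make execution of untrusted code safe.
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[result_name]


# Hand-written stand-in for a generated program: comments carry the natural
# language reasoning, statements carry the computation.
generated = """
# Olivia has $23 and buys five bagels at $3 each.
money_initial = 23
bagels = 5
bagel_cost = 3
# Money left is the initial amount minus the total spent.
ans = money_initial - bagels * bagel_cost
"""

result = run_generated_program(generated)
print(result)
```

The model is only responsible for expressing the reasoning as code; the arithmetic itself is carried out by the interpreter.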
To further enhance reasoning with LLMs, Multimodal Chain-of-Thought also takes a vision component into account. It is another two-stage method. In the first stage, rationale generation, the vision and language inputs are combined to generate a rationale; in the second stage, the rationale is concatenated with the original language input to produce the final answer (Zhang et al., 2023).
The general perception is that improving an LLM’s accuracy depends on three factors: the amount of computation, the number of model parameters, and the size of the training dataset. However, the aforementioned methods show aspiring researchers that intuitive thinking can bolster reasoning ability without continually increasing the size of LLMs.
Methodology
First, a more thorough literature review will be essential to gain a complete grasp of each proposed method for handling linguistic complexity with LLM reasoning.
Second, it has been reported that CoT prompting can even be detrimental when the model has fewer than 100B parameters (Qiao et al., 2022). Therefore, future experiments aimed at developing advanced reasoning abilities must be conducted on larger-scale models.
In live interaction with autonomous agents, incorporating a vision component should boost reasoning abilities with fewer prompts. In this scenario, self-supervised learning methods such as SimCLR can help learn representations of objects in view, even when they are partially hidden or distorted. Another option is to address the problem of CoT generalization probabilistically. With fewer prompts, this approach can provide an uncertainty measure over reasoning paths. The path with maximum likelihood can then be passed to the Python interpreter to reach the final answer.
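One possible way to make the probabilistic idea concrete is sketched below. It assumes that per-token log-probabilities are available for each sampled reasoning path, scores each path by its length-normalised log-likelihood, converts the scores into a distribution with a softmax, and reports the entropy of that distribution as a simple uncertainty measure. The function names, the scoring rule, and the numbers are illustrative assumptions, not a settled design.

```python
import math
from typing import Dict, List, Tuple


def path_log_likelihood(token_logprobs: List[float]) -> float:
    """Length-normalised log-likelihood of one sampled reasoning path."""
    return sum(token_logprobs) / len(token_logprobs)


def rank_paths(paths: Dict[str, List[float]]) -> Tuple[str, Dict[str, float], float]:
    """Return the most likely path, a distribution over paths, and its entropy."""
    scores = {name: path_log_likelihood(lp) for name, lp in paths.items()}
    # Softmax over path scores (shifted by the maximum for numerical stability).
    max_score = max(scores.values())
    weights = {name: math.exp(s - max_score) for name, s in scores.items()}
    total = sum(weights.values())
    probs = {name: w / total for name, w in weights.items()}
    # Entropy near 0 means one path dominates; higher entropy means the
    # candidate paths are about equally plausible.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    best = max(probs, key=probs.get)
    return best, probs, entropy


# Made-up token log-probabilities for two candidate paths (illustration only).
paths = {
    "path_a": [-0.1, -0.2, -0.1],
    "path_b": [-1.2, -0.9, -1.5],
}
best, probs, entropy = rank_paths(paths)
print(best, probs, entropy)
```

Only the selected path would then be handed to the interpreter, and a high entropy could be used as a signal to sample more paths first.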
As for benchmarking, multiple datasets already exist for each category of reasoning (arithmetic, symbolic, commonsense, logical, and multimodal), with the most in arithmetic reasoning, at almost 20.
We should perhaps first identify the 2-3 most suitable datasets in each category of reasoning, for instance Big Bench Hard (Suzgun et al., 2022) in arithmetic reasoning and StrategyQA (Geva et al., 2021) in commonsense reasoning. A comparative study of existing methods on these benchmark datasets would then be key to understanding the models’ reasoning ability. It can further indicate which strategies are likely to work and which may be irrelevant in future studies. In addition, a few more complex benchmarking datasets can be created to further test LLMs’ reasoning ability, for instance in multimodal reasoning with distorted, blurred, or incomplete images.
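A comparative study of this kind could start from a scoring routine like the sketch below, which computes exact-match accuracy for each prompting method on one benchmark. The method names and the yes/no strings are toy placeholders chosen to exercise the code; they are not measured results, and exact match after lowercasing is only one possible scoring rule.

```python
from typing import Dict, List


def exact_match_accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions equal to the gold answer, ignoring case and outer spaces."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold answers must have the same length")
    if not gold:
        raise ValueError("the benchmark must contain at least one item")
    correct = sum(
        p.strip().lower() == g.strip().lower() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


def compare_methods(outputs: Dict[str, List[str]], gold: List[str]) -> Dict[str, float]:
    """Score every prompting method against the same gold answers."""
    return {method: exact_match_accuracy(preds, gold) for method, preds in outputs.items()}


# Toy placeholders, not experimental data.
gold = ["yes", "no", "yes", "no"]
outputs = {
    "method_a": ["yes", "no", "no", "no"],
    "method_b": ["Yes", "no", "yes", "no "],
}
table = compare_methods(outputs, gold)
print(table)
```

Keeping the gold answers fixed across methods ensures that differences in the table come from the prompting strategy rather than from the evaluation setup.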
References
- Chen, W., Ma, X., Wang, X. and Cohen, W.W., 2022. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint. Available at: https://arxiv.org/abs/2211.12588
- Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J. and Neubig, G., 2023, July. PAL: Program-aided language models. In International Conference on Machine Learning (pp. 10764-10799). PMLR.
- Geva, M., Khashabi, D., Segal, E., Khot, T., Roth, D. and Berant, J., 2021. Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9, pp.346-361.
- Lyu, Q., Havaldar, S., Stein, A., Zhang, L., Rao, D., Wong, E., Apidianaki, M. and Callison-Burch, C., 2023. Faithful chain-of-thought reasoning. arXiv preprint. Available at: https://arxiv.org/abs/2301.13379
- Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F. and Chen, H., 2022. Reasoning with language model prompting: A survey. arXiv preprint. Available at: https://arxiv.org/abs/2212.09597
- Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., Chowdhery, A., Le, Q.V., Chi, E.H., Zhou, D. and Wei, J., 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint. Available at: https://arxiv.org/abs/2210.09261
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A. and Zhou, D., 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint. Available at: https://arxiv.org/abs/2203.11171
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D. and Chi, E.H., 2022a. Emergent abilities of large language models. arXiv preprint. Available at: https://arxiv.org/abs/2206.07682
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V. and Zhou, D., 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, pp.24824-24837.
- Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G. and Smola, A., 2023. Multimodal chain-of-thought reasoning in language models. arXiv preprint. Available at: https://arxiv.org/abs/2302.00923
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q. and Chi, E., 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint. Available at: https://arxiv.org/abs/2205.10625