Abstract
Large Language Models (LLMs) have catalyzed a paradigm shift in Natural Language Processing (NLP). Building on the Transformer architecture, large generative models such as GPT-3.5, LLaMA 2, and PaLM exhibit emergent behaviors that enable zero-shot and few-shot learning, grounded generation, and cross-modal understanding. This paper provides a comprehensive review of LLMs, covering their architectural foundations, training paradigms, emergent abilities, evaluation benchmarks, and diverse applications, including mental health monitoring, assistive technologies for the visually impaired, and novel regularization techniques. We also highlight current challenges such as interpretability, bias, and resource efficiency, and discuss future directions for research and responsible deployment.