Abstract
The field of artificial intelligence (AI) is advancing rapidly, and at its core are transformative models that have redefined natural language processing: transformers and large language models (LLMs). This book, Transformers and Large Language Models, is written to help learners build a clear, practical understanding of the concepts, architectures, and techniques that drive these powerful systems. My journey into this field began over two decades ago with a degree in Applied Mathematics. I started my career as a statistical data analyst, eventually moving into data mining and, later, data science. Along the way, I witnessed firsthand how crucial mathematical and computational foundations are, not only for understanding how these models work, but for applying them effectively to real-world problems.