At this week’s MCQLL meeting, Austin Kraft will be presenting [[Small language] corpus] problems, and Amirhossein Kazemnejad will be presenting Enhancing Length Generalization in Transformers: The Role of Positional Encoding.
This is the last lab meeting of the Fall semester. See you in Winter 2024!
All are welcome to attend.
- Austin Kraft
- [[Small language] corpus] problems
This presentation describes an in-progress collaboration to build a corpus of Semarangan Javanese (a.k.a. Peranakan Javanese; Cole et al. 2007) of Semarang, Central Java, Indonesia. In pursuing the empirical focus of the project—documenting anaphoric expressions like reflexive pronouns—the project has brought to the fore nontrivial design choices about how to represent a language that is underdocumented in corpora (Anand, Chung & Wagers 2020). As a case study from the Semarangan Javanese project, I discuss how existing part-of-speech tagsets might not neatly map to part-of-speech categories in the language, with consequences for how grammatical structure can be encoded in and retrieved from the corpus.
I situate these challenges within an early-stage typology of “small language corpus problems”: conceptual and technological issues that are made acutely relevant when creating a corpus for a small language. I compare the design needs of corpora for small languages with those of small corpora for extensively documented languages, such as a corpus of English dedicated solely to a niche discourse type or subject matter (Koester 2022). Alongside the Semarangan Javanese project, my ongoing literature review aims to identify and categorize common challenges in corpus-building for small languages.
- Amirhossein Kazemnejad
- Enhancing Length Generalization in Transformers: The Role of Positional Encoding
The advent of long-context large language models (LLMs) has unlocked numerous advantages, including extended in-context learning, longer text generation, and a greater number of conversational turns. Such advances have only recently become feasible, largely due to engineering breakthroughs like FlashAttention, which allow long sequences to be processed within memory constraints. Transformer models, however, have traditionally been limited in their ability to generalize beyond the context sizes encountered during training. In our NeurIPS 2023 paper, we empirically investigate prevalent positional encodings and their impact on length extrapolation. Furthermore, we propose and examine the possibility of removing positional encoding altogether from the standard decoder-only Transformer. This exploration encompasses both theoretical and practical analyses. Our findings provide insights into length generalization and lay groundwork for the evolution of Transformer architectures in next-generation LLMs.
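As a rough illustration of what removing positional encoding from a decoder-only Transformer can look like, the sketch below implements single-head causal self-attention with no positional signal added to the inputs. It is a toy example under stated assumptions, not the paper's code: the function names (`causal_attention_nope`, `position_signal`) and the one-dimensional inputs are hypothetical and chosen only for illustration. The helper `position_signal` shows one intuition for why such a model is not position-blind. With uniform attention over the visible prefix, a marker on the first token is averaged down to 1/(t+1) at position t, so the causal mask alone leaks position information.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_nope(queries, keys, values):
    """Single-head causal self-attention with no positional encoding.

    queries, keys, values: one float vector per token. Nothing about a
    token's position is added to these vectors (illustrative assumption).
    """
    dim = len(keys[0])
    outputs = []
    for t in range(len(queries)):
        # Causal mask: position t attends only to positions 0..t.
        scores = [
            sum(q * k for q, k in zip(queries[t], keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        out = [
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


def position_signal(seq_len):
    """Toy setup: zero queries/keys give uniform attention; only the
    first token carries a 1.0 marker in its value vector."""
    q = [[0.0] for _ in range(seq_len)]
    k = [[0.0] for _ in range(seq_len)]
    v = [[1.0]] + [[0.0] for _ in range(seq_len - 1)]
    return [o[0] for o in causal_attention_nope(q, k, v)]


if __name__ == "__main__":
    # Each output shrinks as the position index grows.
    print(position_signal(4))
```

Because each position only sees its prefix, the outputs for the first few tokens stay the same when the sequence is extended. This property is one reason length extrapolation is a natural question to ask of such models.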