ACL Workshop on Computation and Written Language (CAWL)

To be held in conjunction with ACL 2023
Toronto, Canada, July 14, 2023
Pier 2, Westin Harbour Castle, Toronto

[Image credit for Proto-Sinaitic ‘alp 𐤀 used in logo: https://en.wikipedia.org/wiki/Proto-Sinaitic_script#/media/File:Proto-semiticA-01.svg (CC-BY-2.5, Author: Pmx)]

The CAWL proceedings are available on ACL Anthology.

Contact: cawl.workshop.2023@gmail.com

Most work on NLP focuses on language in its canonical written form. This has often led researchers to ignore the differences between written and spoken language or, worse, to conflate the two. Instances of conflation are statements like “Chinese is a logographic language” or “Persian is a right-to-left language”, variants of which can be found frequently in the ACL Anthology. These statements confuse properties of the language with properties of its writing system. Ignoring differences between written and spoken language leads, among other things, to conflating different words that are spelled the same (e.g., English bass), or to treating a single word with multiple spellings as several different words (e.g., Japanese umai ‘tasty’, which can be written 旨い, うまい, ウマい, or 美味い).

Furthermore, methods for dealing with written language issues (e.g., various kinds of normalization or conversion) or for recognizing text input (e.g., OCR and handwriting recognition, or text entry methods) are often regarded as precursors to NLP rather than as fundamental parts of the enterprise, despite the fact that most NLP methods rely centrally on representations derived from text rather than (spoken) language. This general lack of consideration of writing has led much of the research on such topics to appear largely outside of ACL venues, in conferences or journals of neighboring fields such as speech technology (e.g., text normalization) or human-computer interaction (e.g., text entry).

We will invite submissions on the relationship between written and spoken language, the properties of written language, the ways in which writing systems encode language, and applications specifically focused on characteristics of writing systems.

Schedule

9:00-9:05 Organizers Opening remarks
9:05-9:15 Position paper: Kyle Gorman and Richard Sproat, “Myths about writing systems in speech & language technology”
9:15-10:15 Invited speaker: Mark Aronoff, “Paradise lost: how the alphabet fell from perfection”
10:15-10:30 Manex Agirrezabal, Sidsel Boldsen, and Nora Hollenstein, “The hidden folk: linguistic properties encoded in multilingual contextual character representations”
10:30-11:00 Coffee break
11:00-11:20 Christian Gold, Ronja Laarmann-Quante, and Torsten Zesch, “Preserving the authenticity of handwritten learner language: annotation guidelines for creating transcripts retaining orthographic features”
11:20-11:40 Kurt Micallef, Fadhl Eryani, Nizar Habash, Houda Bouamor, and Claudia Borg, “Exploring the impact of transliteration on NLP performance: Treating Maltese as an Arabic dialect”
11:40-12:00 Elizabeth Nielsen, Christo Kirov, and Brian Roark, “Distinguishing Romanized Hindi from Romanized Urdu”
12:00-1:30 Lunch break
1:30-2:30 Invited speaker: Amalia Gnanadesikan, “How linguistic are writing systems?”
2:30-2:45 Yuying Ren, “Back-transliteration of English loanwords in Japanese”
2:45-3:05 Wen Zhang, “Pronunciation ambiguities in Japanese kanji”
3:05-3:25 Shigeki Karita, Richard Sproat, and Haruko Ishikawa, “Lenient evaluation of Japanese speech recognition: modeling naturally occurring spelling inconsistency”
3:25-4:00 Coffee break
4:00-4:20 Logan Born, M. Monroe, Kathryn Kelley, and Anoop Sarkar, “Disambiguating numeral sequences to decipher ancient accounting corpora”
4:20-4:40 Fabio Tamburini, “Decipherment of lost ancient scripts as combinatorial optimisation using coupled simulated annealing”
4:40-5:00 Logan Born, M. Monroe, Kathryn Kelley, and Anoop Sarkar, “Learning the character inventories of undeciphered scripts using unsupervised deep clustering”
5:00-5:15 Noah Hermalin, “A mutual information-based approach to quantifying logography in Japanese and Sumerian”
5:15-5:30 Organizers Closing remarks

Organization

Organizing Committee:
Program Committee:

Sponsorship

CAWL is supported by Google.