ACL Workshop on Computation and Written Language (CAWL)
To be held in conjunction with ACL 2023
Toronto, Canada, July 14, 2023
Pier 2, Westin Harbour Castle, Toronto
[Image credit for Proto-Sinaitic ‘alp 𐤀 used in logo: https://en.wikipedia.org/wiki/Proto-Sinaitic_script#/media/File:Proto-semiticA-01.svg (CC-BY-2.5, Author: Pmx)]
The CAWL proceedings are available on ACL Anthology.
Most work on NLP focuses on language in its canonical written form. This has often led researchers to ignore the differences between written and spoken language or, worse, to conflate the two. Instances of conflation are statements like “Chinese is a logographic language” or “Persian is a right-to-left language”, variants of which can be found frequently in the ACL Anthology. These statements confuse properties of the language with properties of its writing system. Ignoring differences between written and spoken language leads, among other things, to conflating different words that are spelled the same (e.g., English bass), or to treating a single word with multiple spellings as several different words (e.g., Japanese umai ‘tasty’, which can be written 旨い, うまい, ウマい, or 美味い).
Furthermore, methods for dealing with written language issues (e.g., various kinds of normalization or conversion) or for recognizing text input (e.g., OCR & handwriting recognition or text entry methods) are often regarded as precursors to NLP rather than as fundamental parts of the enterprise, despite the fact that most NLP methods rely centrally on representations derived from text rather than (spoken) language. This general lack of consideration of writing has led much of the research on such topics to appear largely outside of ACL venues, in conferences or journals of neighboring fields such as speech technology (e.g., text normalization) or human-computer interaction (e.g., text entry).
We will invite submissions on the relationship between written and spoken language, the properties of written language, the ways in which writing systems encode language, and applications specifically focused on characteristics of writing systems.
Schedule
9:00-9:05 | Organizers | Opening remarks
9:05-9:15 | Position paper: Kyle Gorman and Richard Sproat | Myths about writing systems in speech & language technology
9:15-10:15 | Invited speaker: Mark Aronoff | Paradise lost: how the alphabet fell from perfection
10:15-10:30 | Manex Agirrezabal, Sidsel Boldsen, and Nora Hollenstein | The hidden folk: linguistic properties encoded in multilingual contextual character representations
10:30-11:00 | Coffee break
11:00-11:20 | Christian Gold, Ronja Laarmann-Quante, and Torsten Zesch | Preserving the authenticity of handwritten learner language: annotation guidelines for creating transcripts retaining orthographic features
11:20-11:40 | Kurt Micallef, Fadhl Eryani, Nizar Habash, Houda Bouamor, and Claudia Borg | Exploring the impact of transliteration on NLP performance: Treating Maltese as an Arabic dialect
11:40-12:00 | Elizabeth Nielsen, Christo Kirov, and Brian Roark | Distinguishing Romanized Hindi from Romanized Urdu
12:00-1:30 | Lunch break
1:30-2:30 | Invited speaker: Amalia Gnanadesikan | How linguistic are writing systems?
2:30-2:45 | Yuying Ren | Back-transliteration of English loanwords in Japanese
2:45-3:05 | Wen Zhang | Pronunciation ambiguities in Japanese kanji
3:05-3:25 | Shigeki Karita, Richard Sproat, and Haruko Ishikawa | Lenient evaluation of Japanese speech recognition: modeling naturally occurring spelling inconsistency
3:25-4:00 | Coffee break
4:00-4:20 | Logan Born, M. Monroe, Kathryn Kelley, and Anoop Sarkar | Disambiguating numeral sequences to decipher ancient accounting corpora
4:20-4:40 | Fabio Tamburini | Decipherment of lost ancient scripts as combinatorial optimisation using coupled simulated annealing
4:40-5:00 | Logan Born, M. Monroe, Kathryn Kelley, and Anoop Sarkar | Learning the character inventories of undeciphered scripts using unsupervised deep clustering
5:00-5:15 | Noah Hermalin | A mutual information-based approach to quantifying logography in Japanese and Sumerian
5:15-5:30 | Organizers | Closing remarks
Organization
Program Committee:
- Manex Agirrezabal, University of Copenhagen, Denmark
- Sina Ahmadi, George Mason University, USA
- Cecilia Alm, Rochester Institute of Technology, USA
- Steven Bedrick, Oregon Health & Science University, USA
- Taylor Berg-Kirkpatrick, UC San Diego, USA
- Dan Garrette, Google, USA
- Alexander Gutkin, Google, UK
- Nizar Habash, NYU Abu Dhabi, United Arab Emirates
- Yannis Haralambous, IMT Atlantique & CNRS Lab-STICC, France
- Cassandra Jacobs, University at Buffalo, USA
- George Kiraz, Princeton University, USA
- Christo Kirov, Google, USA
- Grzegorz Kondrak, University of Alberta, Canada
- Martin Jansche, Amazon, UK
- Yang Li, Northwestern Polytechnical University, China
- Constantine Lignos, Brandeis University, USA
- Zoey Liu, University of Florida, USA
- Gerald Penn, University of Toronto, Canada
- Yuval Pinter, Ben-Gurion University of the Negev, Israel
- William Poser, independent scholar, Canada
- Emily Prud’hommeaux, Boston College, USA
- Shruti Rijhwani, Carnegie Mellon University, USA
- Maria Ryskina, MIT, USA
- Lane Schwartz, University of Alaska, Fairbanks, USA
- Djamé Seddah, Sorbonne University & Inria, France
- Shuming Shi, Tencent, China
- David Smith, Northeastern University, USA
- Kumiko Tanaka-Ishii, University of Tokyo, Japan
- Annalu Waller, University of Dundee, UK
Sponsorship
CAWL is supported by Google.