Updated: May 26
This post discusses the ACL 2022 paper Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling by Elena Álvarez-Mellado and Constantine Lignos. You can find the full text here.
Who should know about it
Lexical borrowing is the process of bringing words from one language into another.
This happens in most languages, not only Spanish, and nowadays most borrowed words come from English (you may have heard the term anglicism). If you are not a native English speaker, think about how you say podcast, app, smartphone, big data, or fake news in your native language. Is there a dedicated translation, or do you simply use the English term?
The automatic detection of such words can be useful for a variety of tasks, e.g., text-to-speech (language-specific pronunciation rules) or machine translation (maybe you can just copy over the word if the borrowing also exists in the target language?). So if you think that your text data contains such words “borrowed” from another language (and if your texts are not in English, the chances are quite high), you might think about treating them differently. And if you then start implementing an automatic detection algorithm, you should definitely know about the new publicly available data and insights described in this paper.
Why we want to know about it
We at Celebrate company, and especially our brands kartenmacherei and faireparterie, accompany the celebration of a special moment with our print products at every step. Be it a colorfully announced save the date, the description of a birthday party dress code, or creatively named entries on a wedding menu, it has become almost impossible to imagine these texts without at least some lexical borrowings (especially from English). And that is true for both our French- and German-writing customers.
What the paper is about
The paper improves on previous research on borrowing detection in two ways:
Collection of a larger, more diverse data set (called COALAS) with emphasis on evaluating the generalization capabilities of machine learning models
Experiments with modern approaches based on transfer learning and the Transformer architecture
Aiming for the most challenging data set possible, the authors took special care to avoid any overlap between the training, development, and test splits of their data. They only consider news texts, but both the sources and the time periods of the texts differ between the three splits. To make it extra challenging, the authors aim for evaluation sets rich in OOV (out-of-vocabulary) words, i.e., words that do not occur in the train set: even before manual annotation, they preferentially select sentences containing such words for the dev and test sets.
And this process pays off: 92% of the borrowings in the test set do not occur in the train set. Best conditions for testing the generalization capabilities of our models!
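If you want to check how demanding your own splits are, the share of unseen borrowings is easy to compute. The following is a minimal sketch, assuming the borrowings of each split are available as plain lists of strings. The function name `oov_rate`, the case-insensitive matching, and the toy lists are assumptions made for illustration and are not taken from the paper or from COALAS.

```python
def oov_rate(train_terms, test_borrowings):
    """Share of test borrowings (counted per occurrence) never seen in train."""
    # Lowercasing before comparison is an assumption of this sketch.
    seen = {term.lower() for term in train_terms}
    if not test_borrowings:
        return 0.0
    unseen = [b for b in test_borrowings if b.lower() not in seen]
    return len(unseen) / len(test_borrowings)


# Hypothetical toy data, not taken from the corpus.
train = ["podcast", "app", "fake news"]
test = ["Podcast", "streaming", "big data", "influencer"]
print(f"{oov_rate(train, test):.0%} of test borrowings are unseen in train")
```

A high value means a model cannot get far by memorizing the borrowings it saw during training.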
If you are currently thinking about annotating your own data, I want to highlight two things here:
The authors recommend separating the borrowings into two classes: ENG for anglicisms and OTHER for the rest.
There are great annotation guidelines in the appendix of the paper. You can see that the authors put a lot of thought into them.
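To make the two-class recommendation concrete, here is a minimal sketch of how annotated spans could be turned into per-token labels for a sequence tagger. It assumes the common BIO scheme; the helper `spans_to_bio` and the example sentence are invented for illustration, and whether a given word actually counts as a borrowing is decided by the annotation guidelines, not by this sketch.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of a multiword borrowing
    return tags


# Invented sentence with one ENG span and one OTHER span.
tokens = ["Las", "fake", "news", "y", "el", "sushi", "dominan", "el", "debate"]
spans = [(1, 3, "ENG"), (5, 6, "OTHER")]
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```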
Modern modeling approaches
Experiments start with an existing baseline from previous work: A CRF with handcrafted binary features. Unfortunately, this approach does not perform too well (55% F1 on test).
Next comes a Spanish BERT model called BETO and multilingual BERT (mBERT). The vanilla approach of fine-tuning these two without any custom architecture already performs quite well with mBERT beating BETO slightly (82% F1 on test).
Next, the CRF makes a re-entrance, this time equipped with a bidirectional LSTM (BiLSTM) and several types of embeddings as input features instead of handcrafted ones. Unsurprisingly, the authors find that contextualized mBERT embeddings beat word-type-based fastText embeddings as input. Using the BiLSTM+CRF on top of the mBERT embeddings does not, however, improve overall performance. It seems the LSTM and CRF layers just add unnecessary complexity. The authors then stack English BERT and Spanish BETO embeddings, feed the result to the BiLSTM+CRF, and obtain better results than with mBERT. It seems that combining the information from the two specialized monolingual models (with English being by far the most common source for borrowings in COALAS) provides an advantage over the general multilinguality in mBERT.
They play around with a few other embedding combinations but keep the English BERT and Spanish BETO embeddings as the base setting for best performance. Interestingly, adding pretrained BPE embeddings for both English and Spanish from the Flair library consistently boosts performance further.
They present results with two other types of embeddings: Character embeddings (also from Flair) and mBERT embeddings fine-tuned on the task of language identification with code-switched¹ data. Both of these, however, do not always lead to improvements.
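Stacking embeddings, as used in these experiments, simply means concatenating the vectors that different embedders produce for the same token, so the tagger sees all views side by side. The sketch below illustrates only that idea with made-up numbers; `stack_embeddings` is a hypothetical helper written for this post, not the Flair API and not the authors' code.

```python
def stack_embeddings(*per_model_vectors):
    """Concatenate, token by token, the vectors produced by several embedders."""
    lengths = {len(vectors) for vectors in per_model_vectors}
    if len(lengths) != 1:
        raise ValueError("all embedders must cover the same tokens")
    return [
        [value for vector in token_vectors for value in vector]
        for token_vectors in zip(*per_model_vectors)
    ]


# Made-up 2- and 3-dimensional vectors for a two-token sentence.
english_side = [[0.1, 0.2], [0.3, 0.4]]
spanish_side = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2]]
stacked = stack_embeddings(english_side, spanish_side)
print(len(stacked), len(stacked[0]))  # number of tokens, stacked dimension
```

Adding a further embedding type, such as subword or character embeddings, just appends more dimensions to each token vector.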
The paper also has a very informative error analysis section. One super important finding: Recall is a weak point for all models! When experimenting with our own data, I also found that memorizing some common expressions (save the date, dress code, …) can go a long way and will, of course, yield a very high precision (except perhaps for words that are spelled the same in the two languages but aren’t actually borrowings, see below). But for good recall, you will need to generalize well. You will need to really understand what generally distinguishes an English word from a Spanish word. And that is especially true for the COALAS dataset with its high rate of OOVs among the borrowings in the evaluation splits.
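The precision/recall trade-off of pure memorization is easy to see in a toy setting. The following is a minimal sketch of a lookup baseline with exact-match span scoring; the lexicon, the example sentence, the gold spans, and both function names are invented for illustration and do not come from the paper or from our production data.

```python
def lexicon_tagger(tokens, lexicon, max_len=3):
    """Greedy longest-match lookup of known borrowings; returns (start, end) spans."""
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in lexicon:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return spans


def precision_recall(predicted, gold):
    """Exact-match span precision and recall."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Invented example: two memorized expressions, three gold borrowings.
lexicon = {"save the date", "dress code"}
tokens = "Save the Date : Feier mit Dress Code und Candy Bar".split()
gold = [(0, 3), (6, 8), (9, 11)]
predicted = lexicon_tagger(tokens, lexicon)
print(predicted, precision_recall(predicted, gold))
```

Everything the lookup finds is correct here, but the unseen expression is missed, which is exactly the gap a model has to close by generalizing.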
There are more results in the paper but here are a few highlights: Casing seems to be a problem as both unexpectedly upper-case borrowings (Big Data vs. big data) and borrowings that are upper-cased only because they are at the beginning of a sentence frequently pose problems. This could be due to the explicit exclusion of proper nouns during annotation. It makes the task harder to additionally require the model to distinguish between English-looking names and English-looking common nouns when it is already challenging to distinguish between Spanish- and English-looking words.
An interesting source of errors also stems from words in English and Spanish that accidentally share the same orthography. In English, you can have a primer on some subject, but in Spanish the word primer means first. So it is extra-challenging to distinguish between these two words only based on the context.
What we found inspiring
Neither character-level information nor fine-tuning on code-switching helps as much as we had expected, and neither leads to consistent improvements.
Interestingly, BPE information can be a strong signal for distinguishing words of different origins.
Proper nouns are a big confounding factor for the automatic detection of lexical borrowings. A joint approach with an NER component might benefit both tasks. Also, if your use case allows it, it can make sense to annotate proper noun borrowings in the same way as common noun borrowings and thus remove this source of error entirely.
More details on the paper
The full text of the paper
Video on YouTube
Background on modeling
Conditional Random Field (CRF)
Combining BiLSTM with CRF
Background on linguistics
¹ Code-switched data means text that mixes more than one language, e.g., you start a sentence in English and you finish in Spanish. The fine-tuning task then is about classifying each and every word according to the language it belongs to (e.g., English vs. Spanish).