Happy holidays! Here’s a long post on AI, small languages, and my recent trip to the University of Tartu in Estonia.
When tech companies and academics talk about writing with large language models (LLMs), it’s usually about how they write in English. And LLMs write pretty well in English.
It makes sense: the United States is the center of current AI research, English is the most spoken language in the world, and training data for LLMs is mostly in English because English absolutely dominates other languages online. For example, Spanish has just under half the speakers of English, but it has only 8% of the online representation of English. Common Crawl, which provides web data to many AI companies, has 43% of its pages in English and 4.5% in Spanish. Russian, at 6%, is a distant second to English. Not only are LLMs primarily trained on data in English, but they’re also generally finetuned in English: prompts and responses are rated by English speakers.
When LLMs write in languages other than English, it often sounds translated: the grammar or word choice is slightly off and the sensibilities are distinctly American. In a discussion among international writing researchers at WRAB 2023 in Norway (which I co-ran with Tim Laquintano), Israelis said they laughed when ChatGPT closed out a professional letter with the Hebrew equivalent of “Have a nice day!”
When I traveled to Estonia this October, hosted by University of Tartu faculty Djuddah Arthur Joost Leijen and Helen Hint and sponsored by a grant from the Baltic American Freedom Foundation, I got to step away from my assumptions about AI writing in English. Having traveled to Norway for WRAB and run workshops with faculty in China and with faculty who teach other languages, I had of course thought about how AI works outside of English. But a week in Estonia taught me a lot about AI for so-called small languages. (It also, delightfully, taught me to appreciate sauna culture and how salmon can be served at every meal. And if you make it to the end of this long post, sauna and salmon are the reward.)
Small languages and the threat of “digital death”
There are 1.1 million Estonian speakers in the world. It’s not an endangered language, and from the vantage point of the 7000 languages spoken around the world, it’s doing just fine. But from the perspective of AI, it’s a small language—a term I learned from the Estonians. Only 0.1% of websites are in Estonian. That’s partly because most Estonians speak another language—those born in the 1980s or later speak English and those born earlier are more likely to speak Russian as their second language. (Estonia left the USSR just ahead of its dissolution in 1991.) Estonians often use English or Russian online. Estonian is a difficult language to learn, with 14 noun cases and linguistic ties only to Finnish. It’s possible to get around the country without knowing any Estonian—as I found out—but Estonians value their language and teach it to their children. Estonians have an exceptionally high literacy rate (99.9% versus 79% in the US) and the highest rate of book ownership in the world.
Estonia is also on the cutting edge of digital development, thanks in part to a program in the 1990s called Tiger Leap. Rebuilding government from scratch after leaving the Soviet Union meant Estonia—which has a smaller population than the Pittsburgh metro area where I live—could think fresh about how to allocate resources. The Tiger Leap program began directing significant funds to information technology infrastructure and education. Subsequent investment in e-banking, e-governance, data and privacy, blockchain, and e-health has earned the country a nickname it embraces: E-Estonia. While I was in Tartu, Djuddah gave me a tour of the supercomputer and impressive robotic labs at the university. He also showed me his ID, which is used for pretty much everything in Estonia, and mentioned it takes him about 15 minutes(!!) to file his taxes. To an American, this sounds like sorcery.
Despite all this investment, however, it’s hard to contend with the very low representation of Estonian in the datasets training AI. Linda Heimisdóttir, an Icelandic speaker and CEO of an AI company in Reykjavik, worries about the “digital death” of small languages. Icelandic has only about a third of the speakers of Estonian but the country has invested a lot in language preservation (a supercool music scene helps, too). Heimisdóttir explains that Icelanders use English online because the technical infrastructure is better: Siri doesn’t understand Icelandic, autocomplete works better in English, and most of the search and social features of the web are tailored to English. So, the low data problem for Icelandic compounds. And the small market size of Icelandic means it’s hard to solve the problem commercially.
As AI is integrated into education, governance, and social applications, Heimisdóttir sees English encroaching in areas where Icelandic was previously safe. Kids in Iceland learn English alongside Icelandic, but if their educational chatbots only do English—or perhaps worse, do Icelandic poorly—then English will become the kids’ dominant language. The digital death of a language can lead to erosion of culture, history, and values alongside grammatical knowledge. Heimisdóttir hopes that “cross lingual transfer learning” will provide a solution for small languages such as Icelandic. It means models don’t have to be trained from scratch in every language, so smaller languages could have better models than their limited data would otherwise allow. Her company, Miðeind, has developed an AI leaderboard for Icelandic benchmarks alongside AI apps for municipal parking, grammar-checking, speech-to-text, and question-answering in Icelandic. The hope is that, if there are high-quality AI apps in Icelandic, Icelanders are less likely to use more dominant English options.
Annette visits Estonia
Djuddah and Helen had reached out to me in the spring when they were putting together the grant: was I interested in potentially traveling to Estonia for a series of workshops and talks on AI? Well, yes, of course! In October, they hosted an “Inspiration week on AI in higher education” in Tartu and Viljandi, Estonia, and I got to keynote. They were great hosts and I learned far more than I can talk about here—thank you, Djuddah and Helen! Everyone else: check out their recent article in Written Communication (more on that below).
I gave a keynote talk, “Can AI Writing be Good?” in their beautiful and iconic main university building (pictured in the University of Tartu holiday card). I began the talk by noting many of the negative aspects of AI for writing, including its misalignment with educational values, and then shifted to consider how we might align it to our local educational contexts. Attendees asked great questions about data privacy, historical resistance to technologies, and how we will need to change our teaching for students who grow up writing with AI. You can watch the whole keynote and Q&A here: https://uttv.ee/naita?id=35978.
Later the same day, I joined a panel of University of Tartu faculty from English, computer science, and law. Djuddah moderated. We talked about the importance of education about AI for future employment, whether AI will aid creativity, how AI might undermine learning, and the role of the university in preparing students to use AI critically. Aleksei Kelli introduced legal controversies about contemporary authorship and technologies. Raili Marling advised caution. She made a distinction between learning and using, noting we should be wise to surface-level performance of knowledge. Heiko Pääbo argued that AI is the future. He reminded us to keep focused on curriculum and why students might want to use AI. Marling asked, “Is it time to stop giving assignments that AI can do?” “Yes,” Pääbo answered. Yurii Kondratyk, a recent graduate, said that we must use AI but also be aware of the hype: there are always students who will cheat, but we should be thinking about what critical engagement actually means. I was grateful to be among so many diverse and considered viewpoints on AI and learning. You can watch the panel here: https://uttv.ee/naita?id=35985.
Djuddah, Helen and I then traveled about an hour away to Viljandi, where I ran a workshop for students and another for faculty of the Viljandi Culture Academy. The participants there made these workshops so great! The Viljandi Culture Academy teaches applied research and traditional Estonian crafts, art, and music. Viljandi, a town of only about 17,000, hosts a popular folk music festival every summer.
AI tries to sing about Estonia
In the workshop, we had fun talking about songs I’d asked ChatGPT to write about Estonia, first in English and then in Estonian. I put the English lyrics into SunoAI and, since it won’t let me ask for Taylor Swift (who I figured everyone would know), I asked for “melodic, strong female vocal, slight country, vulnerable, storytelling, catchy chorus.” “Land of a Northern Dream” sounds like a Taylor Swift knockoff with a strong female vocal and a catchy chorus:
In the heart of the Baltic, where the sea meets the sky,
There's a place where the winds whisper ancient lullabies.
Tallinn’s towers shine like stars in the night,
Cobblestone streets under soft, golden light.
From the forests deep to the shorelines wide,
In every corner, there’s a story that hides.
With a flag of blue, black, and white waving free,
Oh, Estonia, you're calling out to me.
We noticed that the song sounded like an outsider writing about Estonia: Oh, Estonia, you're calling out to me… What if we prompted ChatGPT to write in Estonian?
ChatGPT wrote “Meie Maa, Meie Süda,” and SunoAI did its best to make it sound like Baltic pop folk. I admit I kinda like the song, although it sounds nothing like Trad.Attack!, the Estonian group we tried to prompt the AI to emulate. Here’s the chorus in both Estonian and in an English translation:
Meie maa, meie süda, siin me elame,
Iga hingetõmme tuules kajab vaikselt edasi.
Meie veri, meie juured, tugevaks jääme,
Sest Eesti maa on meie hinges igavesti.

Our land, our heart, here we live,
Every breath in the wind echoes softly.
Our blood, our roots, we stay strong,
Because the land of Estonia is in our soul forever.
When ChatGPT writes about Estonia in Estonian, it sounds like an insider: Our land, our heart, here we live… It uses tropes of song, nature, and story that the Estonians said felt closer to their history and values. Yet they said it still felt translated, and the pronunciation in SunoAI sounded awkward, with some vowels closer to Finnish.
So, can AI write in Estonian?
Sort of. It’s clearly not as good in Estonian as it is in English. The Estonian students and faculty I talked to used AI for English, but rarely Estonian. Helen, a native Estonian speaker who teaches and researches Estonian, noted it’s not good at all for academic Estonian.
Which makes sense: the datasets for academic Estonian would be very limited, since many fields and journals require English for publication. For example, writing studies is dominated by English, in part because of the field’s predominance in Anglophone universities, especially in the US. Also, English is a lingua franca among researchers whose native language is primarily defined by their own national borders. Studies focusing on lesser-used Baltic and Scandinavian languages often build on English academic writing approaches (Hint et al., Skar et al.). Norway has its own tradition of writing research, but scholars increasingly publish in English and make reference to other Scandinavian languages. The B-Write project combines research across Baltic languages, despite their dissimilarity. Academic Estonian is rarely studied, and Helen Hint et al. point out that “smaller writing communities are at risk of losing their agency, voice, and identity amid the global English-centered academic discourse.” So, we might worry about the “academic death” of small languages alongside their “digital death.”
But I learned of some other uses for not-great Estonian AI in talking with some of the students after the workshop in Viljandi. One student was teaching herself Finnish, which is closely related to Estonian. Some language apps offer Finnish lessons—but they’re always based in English. Linguistically, it doesn’t make sense for an Estonian speaker to work through English to learn Finnish. So, together, we prompted ChatGPT to provide Finnish lessons with explanations in Estonian. The result was pretty good, she said—and at least better than having to think in English while she was learning Finnish. Perhaps the “cross lingual transfer learning” that Heimisdóttir advocates for will make applications like this more feasible.
Finally, saunas and salmon—and mushrooms!
Enough AI. Here’s the travelogue part of the post!
When I checked into the Lydia Hotel in Tartu, the host mentioned there was a spa in the basement. I asked if I had to pay for it. The response was a gentle no, with an implied, only an American would ask that question. Nearly every house in Estonia has a sauna, either inside or outside. That’s another thing Estonians share with Finns. When we passed apartment buildings, I asked Helen and Djuddah: are there saunas in each of those apartments, too?? Helen laughed: probably, yes, implying again, only an American would ask that question. I used the hotel sauna and pool every night I was there and reported back to my hosts in the morning. I think they got a kick out of my obsession with sauna.
The food in Estonia was amazing. I suppose this is another American contrast—here, the portions are huge but the quality is mediocre in midlevel restaurants. I prefer the European approach of deliciousness instead of leftovers. And they eat vegetables for breakfast! The hotel breakfast wasn’t the warmed eggs and waffles affair I’m used to (and which my kids love). Instead, it was roasted tomatoes and mushrooms, yogurt, fresh juice, brown bread, tiny sausages, and, of course, salmon. Over the course of a week, I had salmon on salad, salmon on pancakes, salmon on brown bread, roasted salmon, and—my favorite—salmon, potato and dill soup. Coupled with the sauna and endless walking, I was feeling quite virtuous and healthy by the end of my week in Estonia.
I walked miles every day to restaurants, museums and around neighborhoods. In Tallinn, I visited the Kiek in de Kök, the towers and walls surrounding the medieval city that is now a museum. The underground tunnels were used as a bomb shelter during World War II. The history of Russian and German occupation was palpable. In Tartu, I saw a Jewish cemetery next to a Russian Orthodox one. The dates on the Jewish headstones were all from the early 20th century. Estonia was one of the first countries the Nazis declared to be “Judenfrei,” a dark time in Estonian history that was detailed in a somber museum exhibit on Jews in Estonia. Ukrainian flags flew on the House of Parliament in Tallinn and the central walking bridge in Tartu was lit up yellow and blue. Several history museums told of the censorship and deprivation associated with Soviet life. Old Lenin and Stalin statues were relegated to the backyard. Estonia was on the edge of the USSR, a stone’s throw from independent Finland and, more importantly, within radio distance of it. Estonians were often more aware of life outside the USSR. Wandering through a large central park carved out by the Soviets in Tartu, I ran into an 800-year-old cathedral, part of it in well-maintained ruins and lit up artfully and the other part reserved as a museum. In another park, I saw the only statue I’ve ever seen of a literary critic.
On the last day, after our workshops were over, Helen and Djuddah took me on a hike around a lake, where I photographed at least a dozen gorgeous mushrooms. Like many Estonians, Helen had grown up foraging mushrooms such as chanterelles in these woods. This is another Baltic tradition I could get behind. My favorite part of visiting Latvia 15 years ago was mushroom and blueberry foraging with Latvians who taught me, through hand signals, which mushrooms were safe. (I occasionally forage around Pittsburgh, bringing home what my husband calls “yard food.”) The landscape in Estonia reminded me of Minnesota: the lake, the trees, the level ground and the low sun. We had dinner at an out-of-season ski resort, heavy wood and sheepskins lending a cozy vibe to the modern high ceilings and spare décor.
Djuddah and Helen have said I’m welcome back to Estonia. It's been nearly impossible to find a sauna I can use here in Pittsburgh. The only salmon I eat is what I make at home (I’m a merely mediocre cook). And my daughter keeps stealing the Tartu Ülikool sweatshirt my hosts gave me, which is a perfect weight for winter. As ChatGPT said, Oh, Estonia, you're calling out to me. I’m hoping to return.