The Triple Helix

Examining Languages as a Data Science

2/15/2019

By Holly Zheng, '22

Do you tend to say “y’all” or “you guys”? A 2013 dialect quiz in The New York Times might be able to tell you which geographic region your linguistic idiosyncrasies match. Dialectology is a branch of sociolinguistics that focuses on regional language variation. In 1939, Hans Kurath, then a linguistics professor at Brown, compiled with colleagues the “Linguistic Atlas of New England,” the first comprehensive English linguistic atlas of a large region. The result of interviews with more than 400 people, this collection of 734 maps described phonetic variations in American English dialects across the six New England states. [1]

A sociolinguistics paper published in September 2018 described lexical innovation, the emergence of new words, in American English in recent years. Interviews of the kind Kurath conducted seem inefficient, almost useless, next to the database that Jack Grieve and his fellow linguists used for their research: Twitter posts. The corpus comprised 8.9 billion words from geo-coded and time-stamped Twitter posts made between October 2013 and November 2014. From this database, the researchers identified emerging words in the English language and correlated their usage with factors such as age group, region, and race. [2]
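
To make the idea of spotting emerging words in time-stamped posts concrete, here is a minimal Python sketch. It is not the method used by Grieve and colleagues; it simply compares each word's relative frequency before and after a chosen date and flags words whose share of the text has grown. The function names, the `min_ratio` threshold, the `floor` value, and the toy posts are all assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import date


def relative_frequencies(texts):
    """Map each word to its share of all words in the given texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def emerging_words(dated_posts, split, min_ratio=2.0, floor=1e-6):
    """Return words whose relative frequency grew by at least min_ratio
    after the split date, ordered from fastest-growing to slowest.

    dated_posts is a list of (date, text) pairs. floor (an assumed
    smoothing value) stands in for words unseen in the early period.
    """
    early = relative_frequencies([t for d, t in dated_posts if d < split])
    late = relative_frequencies([t for d, t in dated_posts if d >= split])
    growth = {}
    for word, late_freq in late.items():
        ratio = late_freq / max(early.get(word, 0.0), floor)
        if ratio >= min_ratio:
            growth[word] = ratio
    return sorted(growth, key=growth.get, reverse=True)


# Invented toy posts, used only to exercise the functions.
toy_posts = [
    (date(2013, 10, 5), "my friends are great"),
    (date(2013, 11, 2), "friends are the best"),
    (date(2014, 9, 1), "my friends are baeless"),
    (date(2014, 10, 3), "so baeless my friends"),
]
print(emerging_words(toy_posts, split=date(2014, 4, 1)))
```

A real study would also need the geographic codes attached to each post, a far larger sample, and statistical controls; the sketch covers only the time dimension.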

Time it took the word “baeless” to reach a certain level of popularity (Grieve et al.)

For most of the 20th century, linguistics was mainly considered a social science discipline that tried to understand languages as markers of cultural identity, alongside their sound and word structure. With the rise of computing power over the past few decades, however, linguistics is rapidly becoming more computational. “Linguistics is shifting from a social science to a data science, where linguists are increasingly analysing massive amounts of natural language harvested online,” commented Jack Grieve, one of the main authors of the Twitter lexicon research. [2]

The shift toward data mining has helped linguists study human speech patterns. One benefit of using a large linguistic database is that it minimizes the “Observer's Paradox,” the phenomenon in which interviewees unconsciously alter their speech patterns slightly because an observer is present.

Scientists should also be alert to the potential drawbacks of data mining in linguistic studies. Privacy is one main concern. The Twitter data consisted of real-time posts that users made online without intending their words to be used in a scientific study. This lack of awareness does increase the utility of the lexical data, since the posts represent the real lexicon, but it places user privacy in a gray area: whether researchers had the right to access these speech data is an issue that remains to be addressed.

The field of computational linguistics has grown beyond using data mining to study language change. The ongoing AI revolution has brought linguistics to a new level: the goal of computational linguistics has become modeling the human language capacity in machines and computers.

Like the data-based approach in sociolinguistic research, natural language models have also advanced tremendously over the past decades. In 1966, ELIZA, an artificial psychotherapist designed by computer scientists at MIT, became one of the first programs capable of attempting the “Turing Test” by behaving like a conversational partner. [3] This empathic listener, however, returned sentences built from repeated phrases and syntactic patterns in the user input; ELIZA did not comprehend speech in the way that human beings understand semantics.

In October 2018, Amazon patented technology that allows Alexa to detect emotional states and sickness in its user’s voice. [4] Relying on algorithms that analyze the user’s pitch, intonation, and voice frequency, Alexa could become a “psychiatrist” far more accurate and precise than ELIZA.
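
The pattern-based behavior attributed to ELIZA above can be illustrated with a short Python sketch. It is not the original ELIZA script: the rules, the pronoun swaps, and the fallback reply are hypothetical examples chosen to show how a program can echo parts of the user's sentence back without understanding what it means.

```python
import re

# Hypothetical rules for illustration; not taken from the original ELIZA.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]

# First-person words are swapped so the echoed fragment addresses the user.
PRONOUN_SWAPS = {"i": "you", "me": "you", "my": "your", "am": "are"}

FALLBACK = "Please go on."


def reflect(fragment):
    """Lowercase the fragment, drop trailing punctuation, swap pronouns."""
    words = fragment.lower().rstrip(".!?").split()
    return " ".join(PRONOUN_SWAPS.get(word, word) for word in words)


def respond(user_input):
    """Return the reply for the first matching rule, else a stock phrase."""
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(reflect(match.group(1)))
    return FALLBACK


print(respond("I feel sad about my job."))
print(respond("Hello there"))
```

Every reply is assembled from the surface form of the input, which is why a program of this kind can seem attentive while having no model of meaning.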

Language no longer represents what would traditionally be categorized as a distinctively human capacity. In addition to deconstructing sentences and utterances, natural language models can now accomplish complicated and sometimes even creative tasks. Recently, researchers at Microsoft Research Asia added a poetic touch to their language models: they developed a training process for auto-generation of poetry based on images. [5]

One of the Microsoft team’s favorite poems, generated from an image:
The sun is shining
The tree moves
Naked trees
You dance

As sentimental as poetry seems as an art form, the science behind the AI poet is nevertheless meticulous. From describing linguistic innovations to processing meaning and providing feedback, data-driven linguistic research has made many tasks easier and more realistic. Computational power, however, by no means replaces the linguistic potential and creativity innate to human beings. Training methods for language models still have a long way to go in areas such as overcoming barriers between languages.

“The point of this research is not to have AI replace poets. It’s about the myriad applications that can augment creative activity and achievement that the existence of even mildly creative AI could represent,” commented the scientists behind the poet AI model. [5]

Sources:
[1] Encyclopedia Britannica, “Hans Kurath” [cited: November 11, 2018] https://www.britannica.com/biography/Hans-Kurath#ref46917
[2] Grieve, J., Nini, A., Guo, D., “Mapping Lexical Innovation on American Social Media” Journal of English Linguistics, September 10, 2018 [cited: November 11, 2018] http://journals.sagepub.com/doi/10.1177/0075424218793191#_i2
[3] Shum, H., He, X., Li, D., “From Eliza to XiaoIce: Challenges and Opportunities with Social Chatbots” [cited: November 11, 2018] https://arxiv.org/pdf/1801.01957.pdf
[4] Brodkin, J., “Amazon patents Alexa tech to tell if you’re sick, depressed and sell you meds” Ars Technica, October 11, 2018 [cited: November 11, 2018] https://arstechnica.com/gadgets/2018/10/amazon-patents-alexa-tech-to-tell-if-youre-sick-depressed-and-sell-you-meds/
[5] Microsoft Blog editors, “The poet in the machine: Auto-generation of poetry directly from images through multi-adversarial training – and a little inspiration” Microsoft Research Blog, October 18, 2018 [cited: November 11, 2018] https://www.microsoft.com/en-us/research/blog/the-poet-in-the-machine-auto-generation-of-poetry-directly-from-images-through-multi-adversarial-training-and-a-little-inspiration/
