Hello, my name is Kunal Katke, and I’d like to welcome you to Tech-blogging.com!
We spoke about computer vision in the previous episode, which is the capacity for computers to sense and analyze visual information.
Today, we’re going to talk about how to teach computers to understand language.
You may argue that they’ve always been able to do so.
We discussed machine language instructions and higher level programming languages in Episodes 9 and 12.
While they undoubtedly fit the criteria of a language, they also have tiny vocabulary and adhere to rigid standards.
Only code that is completely free of spelling and grammatical mistakes will compile and execute.
Naturally, this differs from human languages, which have enormous, diverse vocabulary, words with multiple meanings, speakers with various accents, and all sorts of wonderful word play.
People also make linguistic mistakes when writing and speaking, such as slurring words together, omitting essential facts, and mispronouncing terms.
But, for the most part, people are capable of overcoming these obstacles.
The ability to communicate effectively is an important aspect of what makes us human.
As a result, the desire for computers to comprehend and speak human language has existed since the invention of computers.
This led to Natural Language Processing, or NLP, an interdisciplinary field that combines computer science and linguistics.
Words can be arranged in an almost endless number of ways in a sentence.
We can’t give computers a dictionary of all possible sentences to help them decipher what people are saying.
Deconstructing sentences into bite-sized pieces, which could be processed more easily, was an early and fundamental NLP problem.
Nouns, pronouns, articles, verbs, adjectives, adverbs, prepositions, conjunctions, and interjections are the nine basic categories of English words you studied in school.
These are referred to as “parts of speech.”
There are subcategories as well, such as singular vs. plural nouns and superlative vs. comparative adverbs, but we won’t delve into that right now.
Knowing a word’s part of speech is helpful, but unfortunately, many words have several meanings, such as “rose” and “leaf,” which can be used as nouns or verbs.
Because a digital dictionary alone is not enough to resolve this ambiguity, computers must also know some grammar.
To that end, phrase structure rules were developed, which capture the grammar of a language.
In English, for example, one rule states that a sentence can be made up of a noun phrase followed by a verb phrase.
A noun phrase can be an article, such as “the,” followed by a noun, or an adjective followed by a noun.
You can write rules like this for an entire language.
Then, using these rules, it’s fairly simple to build a parse tree, which not only tags each word with a likely part of speech but also shows how the sentence is put together.
For example, we now know that the noun focus of this sentence is “the Mongols,” and that it’s about their “rising” from something, in this instance “leaves.”
These smaller chunks of data make it easier for computers to access, process, and respond to information.
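To make the parse tree idea concrete, here is a minimal Python sketch of a parser driven by phrase structure rules.
The sentence and noun phrase rules come from the description above; the verb phrase and prepositional phrase rules, the tiny lexicon, and all of the names are assumptions added for illustration, not a real grammar of English.

```python
# Toy lexicon: each word maps to the parts of speech it can take.
# "rose" and "leaves" are deliberately ambiguous (noun or verb).
LEXICON = {
    "the": {"Article"},
    "mongols": {"Noun"},
    "rose": {"Noun", "Verb"},
    "leaves": {"Noun", "Verb"},
    "from": {"Preposition"},
}

# Toy phrase structure rules. The VP and PP rules are assumptions.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Article", "Noun"], ["Adjective", "Noun"]],
    "VP": [["Verb", "PP"], ["Verb"]],
    "PP": [["Preposition", "NP"]],
}


def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can match words[pos:]."""
    if symbol not in RULES:
        # A part of speech: it matches exactly one word, if the lexicon allows it.
        if pos < len(words) and symbol in LEXICON.get(words[pos], set()):
            yield (symbol, words[pos]), pos + 1
        return
    for expansion in RULES[symbol]:
        yield from expand(symbol, expansion, words, pos, [])


def expand(symbol, remaining, words, pos, children):
    """Match the remaining symbols of one rule, backtracking over alternatives."""
    if not remaining:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(remaining[0], words, pos):
        yield from expand(symbol, remaining[1:], words, next_pos, children + [child])


def parse_sentence(text):
    """Return the first parse tree that covers the whole sentence, or None."""
    words = text.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None


tree = parse_sentence("the mongols rose from the leaves")
```

Here the grammar, not the dictionary, settles the ambiguity: “rose” is tagged as a verb and “leaves” as a noun because that is the only way the rules fit the whole sentence.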
Every time you do a voice search, such as “where is the nearest pizza,” a similar process is taking place.
The machine understands that this is a “where” question, that the thing you want is “pizza,” and that the dimension you care about is “nearest.”
The same logic applies to questions like “who sang thriller?” and “what is the largest giraffe?”
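As a rough sketch of how such a question might be broken into parts, the Python below pulls out the question word, a superlative, and whatever is left as the topic.
The word lists, the function name, and the “ends in -est” test are all crude assumptions made for illustration; a real system would lean on a parser and far richer rules.

```python
# Hypothetical word lists for this sketch only.
QUESTION_WORDS = {"where", "who", "what", "when"}
FILLER_WORDS = {"is", "are", "the", "a", "an"}


def analyze_query(text):
    """Split a short question into question word, superlative, and topic."""
    words = text.lower().strip("?").split()
    question = words[0] if words and words[0] in QUESTION_WORDS else None
    content = [w for w in words[1:] if w not in FILLER_WORDS]
    # Crude assumption: treat the first word ending in "est" as the superlative.
    modifier = next((w for w in content if w.endswith("est")), None)
    topic = " ".join(w for w in content if w != modifier)
    return {"question": question, "modifier": modifier, "topic": topic}


pizza = analyze_query("where is the nearest pizza")
giraffe = analyze_query("what is the largest giraffe?")
```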
Computers can be fairly proficient at natural language tasks if they treat language like Lego.
They can respond to queries and execute directions such as “set an alarm at 2:20” or “play T-Swizzle on Spotify.”
However, as you’ve undoubtedly noticed, they break down when you get too sophisticated, and they can no longer parse the sentence correctly or capture your intent.
Hey Siri, do you think the mongols roam too much? What are your thoughts on this most pleasant mid-summer day?
Siri: I’m not sure I understand.
I should also mention that computers can generate natural language text using phrase structure rules and other methods of codifying language.
This works especially well when data is kept in a web of semantic information, where items are linked to one another in meaningful connections, giving you all the tools you need to create informative statements.
Michael Jackson sang the song Thriller, which was released in 1983.
The Knowledge Graph is Google’s version of this.
At the end of 2016, it contained almost seventy billion facts about different entities and the relationships between them.
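Here is a minimal Python sketch of generating a sentence from linked facts.
The three facts, the relation names, and the sentence template are assumptions invented for this example; they say nothing about how Google’s Knowledge Graph is actually stored or queried.

```python
# A hypothetical miniature web of facts: (subject, relation, object).
FACTS = [
    ("Thriller", "is_a", "song"),
    ("Thriller", "sung_by", "Michael Jackson"),
    ("Thriller", "released_in", "1983"),
]


def lookup(subject, relation):
    """Return the object linked to `subject` by `relation`, or None."""
    for s, r, o in FACTS:
        if s == subject and r == relation:
            return o
    return None


def describe(subject):
    """Fill a fixed sentence template from the facts about `subject`."""
    kind = lookup(subject, "is_a")
    singer = lookup(subject, "sung_by")
    year = lookup(subject, "released_in")
    if None in (kind, singer, year):
        return None  # not enough linked facts to build the sentence
    return f"{singer} sang the {kind} {subject}, which was released in {year}."


sentence = describe("Thriller")
```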
Natural language chatbots, which are computer programs that converse with you, rely on these two processes to parse and generate text.
Early chatbots were mostly rule-based: experts would encode hundreds of rules mapping what a user might say to how the program should respond.
This was obviously unwieldy to maintain and limited the sophistication that could be achieved.
ELIZA, created at MIT in the mid-1960s, was a well-known early example.
This was a chatbot that pretended to be a therapist and used basic syntactic rules to pick out content in written exchanges, which it then asked the user about.
It seemed human-like at times, but it also made basic, even funny errors at other times.
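The rule-based approach can be sketched in a few lines of Python.
These three patterns and the fallback reply are made up for illustration and are not taken from the real ELIZA script; they only show the idea of mapping what the user says to a canned reply that reuses part of the input.

```python
import re

# Hypothetical rules: a pattern to look for, and a reply template that
# reuses the captured text.
RULES = [
    (re.compile(r"\bi feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (.+)", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."


def respond(message):
    """Return the reply for the first rule that matches, else a stock phrase."""
    text = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK


reply = respond("I feel tired today.")
```

A message such as “My job is stressful” comes back as “Tell me more about your job is stressful.”, which is the kind of basic error a purely pattern-based bot makes.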
Chatbots and complex dialog systems have come a long way in the last fifty years, and they can now be pretty convincing!
Modern approaches are based on machine learning, where gigabytes of real human-to-human conversations are used to train chatbots.
Today, the technology is being used in customer service applications, where there are already a plethora of sample dialogues from which to learn.
People have also been getting chatbots to talk with one another, and in a Facebook experiment, chatbots even started to evolve their own language.
This experiment received a lot of scary headlines, although it was merely the computers working out a simple procedure for negotiating with each other.
It wasn’t malicious; instead, it was practical.
But what happens when something is spoken: how can a computer decipher words from sound?
This is the field of voice recognition, which has been the subject of decades of research.
In 1952, Bell Labs unveiled Audrey, the automated digit recognizer, the world’s first voice recognition system.
If you pronounced them slowly enough, it could recognize all 10 numerical digits.
What are the numbers five, nine, and seven?
Because it was considerably faster to enter phone numbers with a finger, the project never got off the ground.
Ten years later, at the 1962 World’s Fair, IBM showed a shoebox-sized system capable of identifying sixteen words.
In 1971, DARPA launched a five-year funding drive to increase research in the field, which resulted in the creation of Harpy at Carnegie Mellon University.
Harpy was the first computer program to identify over 1,000 words.
However, transcription was frequently 10 times or more slower than natural speech on computers of the day.
Real-time speech recognition became possible thanks to substantial increases in CPU performance in the 1980s and 1990s.
Simultaneously, natural language processing algorithms evolved, moving away from hand-crafted rules and toward machine-learning approaches that could learn automatically from existing human language datasets.
Deep neural networks, which we discussed in Episode 34, are now used by the most accurate voice recognition systems on the market.
Let’s look at some speech, specifically the acoustic signal, to get an idea of how these techniques work.
Let’s begin with vowel sounds such as aaaaa… and Eeee…
These are the waveforms of those two sounds, as captured by a computer’s microphone.
This signal represents the amplitude of diaphragm displacement within a microphone when sound waves force it to oscillate, as we described in Episode 21, Files and File Formats.
The horizontal axis represents time, while the vertical axis represents the degree of movement, or amplitude, in this representation of sound data.
Although we can discern differences between the waveforms, it’s not immediately apparent where you’d point to say, “aha! here is certainly an eee sound.”
We need to look at the data in a new way: a spectrogram, to truly make this stand out.
We still have time on the horizontal axis, but instead of amplitude on the vertical, we plot the magnitude of the many frequencies that make up each sound in this representation of the data.
The louder the frequency component, the brighter the color.
A really amazing method called the Fast Fourier Transform is used to convert waveforms to frequencies.
If you’ve ever stared at the EQ visualizer on a stereo system, it’s pretty much the same thing.
A spectrogram plots that information over time.
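To show what “converting a waveform to frequencies” means, here is a small Python sketch that builds a spectrogram from a pure tone.
For readability it uses the plain discrete Fourier transform, which gives the same magnitudes as the Fast Fourier Transform but far more slowly; the sample rate, frame size, and test tone are assumptions chosen for this example.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin (0 up to N/2) for one frame of samples."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(samples, frame_size):
    """One row of magnitudes per non-overlapping frame; rows advance in time."""
    return [
        dft_magnitudes(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME_SIZE = 64     # samples per frame (assumed)

# A 1000 Hz sine wave standing in for a recorded sound.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(256)]

rows = spectrogram(tone, FRAME_SIZE)
loudest_bin = max(range(len(rows[0])), key=rows[0].__getitem__)
loudest_hz = loudest_bin * SAMPLE_RATE / FRAME_SIZE
```

Each row is one column of the picture: the larger a magnitude, the brighter that frequency would be drawn at that moment.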
You may have noticed that the signals have a ribbed pattern to them; that’s because they contain all of my vocal tract’s resonances.
I squeeze my vocal cords, lips, and tongue into different shapes to generate different sounds, which amplifies or dampens different resonances.
This can be seen in the signal, where there are brighter and darker patches.
We can observe that the two sounds have quite different arrangements if we work our way up from the bottom, labeling where we detect peaks in the spectrum — what are called formants.
This applies to all vowel sounds.
This is the kind of data that allows computers to detect spoken vowels, and even full words.
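A rough Python sketch of that idea: find the peaks in a spectrum, then pick the vowel whose expected peaks are closest.
The two formant pairs below are approximate, illustrative values assumed for this example, not measurements from the episode, and real recognizers use far more robust features than two numbers.

```python
def find_peaks(magnitudes, threshold):
    """Indices of bins above `threshold` that are louder than both neighbours."""
    return [
        i
        for i in range(1, len(magnitudes) - 1)
        if magnitudes[i] > threshold
        and magnitudes[i] > magnitudes[i - 1]
        and magnitudes[i] > magnitudes[i + 1]
    ]


# Assumed, approximate first and second formant frequencies in Hz.
VOWEL_TEMPLATES = {
    "eee": (270, 2290),
    "aaa": (730, 1090),
}


def closest_vowel(peaks_hz, templates=VOWEL_TEMPLATES):
    """Return the vowel whose template peaks are nearest to the observed peaks."""
    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(peaks_hz, templates[name]))
    return min(templates, key=distance)


spectrum = [0, 1, 5, 1, 0, 2, 9, 2, 0]  # made-up magnitudes per frequency bin
peaks = find_peaks(spectrum, threshold=3)
guess = closest_vowel((300, 2200))
```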
Let’s look at a more complex example, such as when I say, “she was happy.”
Here is where we can see our “eee” and “aaa” sounds.
We can also see a variety of other distinctive sounds, such as “shhh” in “she,” “wah” and “sss” in “was,” and so on.
Phonemes are the sound fragments that make up words.
All of these phonemes are recognized by speech recognition software.
There are around forty-four in English, so it largely boils down to fancy pattern matching.
Then you have to separate words from one another, figure out where sentences begin and end, and eventually you’ll have speech converted to text, which allows you to use the techniques we covered at the start of the episode.
Because people utter words in slightly different ways owing to factors like accents and mispronunciations, using a language model, which incorporates data about word sequences, dramatically improves transcription accuracy.
“She was,” for example, is more likely to be followed by an adjective such as “glad.”
It’s unusual for “she was” to be followed immediately by a noun.
If the speech recognizer couldn’t decide between “happy” and “harpy,” it’d go with “happy” because the language model said it was the most likely option.
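The simplest version of such a language model just counts which word pairs occur in example text.
This Python sketch uses a four-sentence corpus invented for illustration; a real model would be trained on vastly more text and would work with probabilities rather than raw counts.

```python
from collections import Counter

# A tiny made-up corpus standing in for a large collection of real text.
CORPUS = [
    "she was happy",
    "she was happy today",
    "she was tired",
    "the harpy was angry",
]


def bigram_counts(sentences):
    """Count how often each pair of adjacent words appears."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_word(previous, candidates, counts):
    """Choose the candidate seen most often right after `previous`."""
    return max(candidates, key=lambda word: counts[(previous, word)])


COUNTS = bigram_counts(CORPUS)
best = pick_word("was", ["happy", "harpy"], COUNTS)
```

In this corpus “was happy” appears twice and “was harpy” never, so the recognizer’s tie is broken in favor of “happy”.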
Finally, we must discuss speech synthesis, or the capacity of computers to produce speech.
This works much like speech recognition, but in reverse.
We can deconstruct a piece of text into its phonetic components and play those sounds back to back on a computer speaker.
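The chaining step can be sketched in Python as follows.
The pronunciation dictionary is a hypothetical stand-in, and each phoneme is represented by a plain sine tone at an arbitrary frequency rather than a recorded sound, so the output will not resemble speech; the sketch only shows text being broken into phonemes whose sounds are then joined end to end.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed)

# Hypothetical pronunciation dictionary for three words.
PRONUNCIATIONS = {
    "she": ["sh", "ee"],
    "saw": ["s", "aw"],
    "me": ["m", "ee"],
}

# Placeholder "recordings": one arbitrary tone per phoneme, in Hz.
PHONEME_TONES_HZ = {"sh": 2500, "ee": 300, "s": 3000, "aw": 600, "m": 250}


def to_phonemes(text):
    """Break text into a flat list of phonemes, word by word."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes


def synthesize(text, phoneme_seconds=0.1):
    """Concatenate one fixed-length chunk of samples per phoneme."""
    per_phoneme = round(SAMPLE_RATE * phoneme_seconds)
    samples = []
    for phoneme in to_phonemes(text):
        hz = PHONEME_TONES_HZ[phoneme]
        samples.extend(
            math.sin(2 * math.pi * hz * t / SAMPLE_RATE) for t in range(per_phoneme)
        )
    return samples


phonemes = to_phonemes("She saw me")
audio = synthesize("She saw me")
```

Because each chunk starts and stops abruptly, the joins are discontinuous, which is the same weakness that gives simple phoneme chaining its choppy sound.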
You can plainly hear this chaining of phonemes in older speech synthesis devices, such as this hand-operated machine from Bell Labs in 1937.
Say “She saw me” without emotion.
She saw me.
Now say it in response to the following questions.
Who was it that saw you?
She saw me.
Who was it that she saw?
She saw me.
Was she able to see or hear you?
She saw me.
Although this had improved a great deal by the 1980s, the discontinuous and awkward blending of phonemes still produced the trademark robotic sound.
Michael Jackson sang the song Thriller, which was released in 1983.
Today’s artificial voices, such as Siri, Cortana, and Alexa, have improved significantly, but they’re still not quite human.
But we’re so close, and it’ll almost certainly be a solved problem shortly.
Especially now that we’re witnessing an explosion of speech user interfaces on our phones, in our vehicles and houses, and, perhaps eventually, in our ears.
This ubiquity is creating a positive feedback loop in which people are using voice interaction more frequently, which provides more data for companies like Google, Amazon, and Microsoft to train their systems on… which leads to greater accuracy, which leads to increased use of voice, which leads to even greater accuracy… and so on.
Many people believe that in the future, voice technology will be as prevalent as screens, keyboards, trackpads, and other physical input-output devices.
That’s especially excellent news for robots who don’t want to have to carry around keyboards to speak with people.
But we’ll talk about them in greater detail next week.
I’ll see you then.