(T)oday (I) (L)earn reading room: seven small microphones

── acoustic sensor system

── seven small microphones

── far─field speech recognition

── far─field acoustic sensor system

Anne M. Jacobsen, The pentagon's brain : an uncensored history of DARPA, America's top secret military research agency, 2015

p.383

Boomerang was DARPA's response to sniper threats.

It was an acoustic sensor system made up of seven small microphones that attached to a military vehicle, listened for shooter information, and notified soldiers precisely where the fire was coming from, all in less than a second. The Boomerang system was able to detect shock waves from a sniper's incoming bullets, as well as muzzle blast, then relay that information to soldiers.
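The excerpt does not say how Boomerang turns seven microphone signals into a direction. One textbook approach is to compare arrival times across the array, so here is a minimal sketch of that idea. It is not the Boomerang algorithm: the 2-D geometry, the hexagonal layout, the 343 m/s speed of sound, and every function name are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C (assumed)


def bearing_from_arrival_times(mic_positions, arrival_times, c=SPEED_OF_SOUND):
    """Estimate the bearing (degrees, counter-clockwise from +x) toward a distant source.

    Far-field model: a plane wave from unit direction u (pointing at the source)
    reaches microphone i at t_i = t0 - (p_i . u) / c. Subtracting microphone 0
    gives (p_i - p_0) . u = -c * (t_i - t_0), solved for u by least squares.
    """
    x0, y0 = mic_positions[0]
    t0 = arrival_times[0]
    # Accumulate the 2x2 normal equations (A^T A) u = A^T b.
    sxx = sxy = syy = bx = by = 0.0
    for (x, y), t in zip(mic_positions[1:], arrival_times[1:]):
        dx, dy = x - x0, y - y0
        rhs = -c * (t - t0)
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
        bx += dx * rhs
        by += dy * rhs
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("microphones are collinear; bearing is ambiguous")
    ux = (syy * bx - sxy * by) / det
    uy = (sxx * by - sxy * bx) / det
    return math.degrees(math.atan2(uy, ux)) % 360.0


def simulate_arrival_times(mic_positions, bearing_deg, c=SPEED_OF_SOUND):
    """Ideal, noise-free arrival times for a plane wave coming from bearing_deg."""
    ux = math.cos(math.radians(bearing_deg))
    uy = math.sin(math.radians(bearing_deg))
    return [-(x * ux + y * uy) / c for x, y in mic_positions]


# Hypothetical layout: one centre microphone plus six on a 10 cm circle.
mics = [(0.0, 0.0)] + [
    (0.1 * math.cos(k * math.pi / 3), 0.1 * math.sin(k * math.pi / 3))
    for k in range(6)
]
times = simulate_arrival_times(mics, 60.0)
estimate = bearing_from_arrival_times(mics, times)
print(round(estimate, 1))  # 60.0
```

With noise-free timings the least-squares fit recovers the simulated bearing. A real system would also have to pick the shock wave and the muzzle blast out of background noise before any timing comparison is possible.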

([ a far─field acoustic sensor system should get more accurate as it gets more sample data; if you have the funding, you should keep working on the program; consider deploying the sensors on freeways and highways to collect as much data as you can; ... ])

p.383

a more advanced Boomerang-based technology called ...

... was a vehicle-mounted system that fused radar and signal-processing technologies to quickly detect much larger projectiles coming at coalition vehicles, including rocket-propelled grenades, antitank guided missiles, and even direct mortar fire. A sensor system inside the ... would be able to identify where the shot came from and relay that information to all other vehicles in the convoy.

383 “Shot. Two o'clock”: Raytheon news release, BBN Technologies, Products and Services, Boomerang III. [p.494]

383 CROSSHAIRS: DARPA, news release, “DARPA's CROSSHAIRS Counter Shooter System”, October 5, 2010. [p.494]

Anne M. Jacobsen, The pentagon's brain : an uncensored history of DARPA, America's top secret military research agency, 2015

____________________________________

[[ case study 3: natural speech interface [NSI]: far─field speech recognition, natural voice speaker, Skills Kit, which allowed other companies to build voice-enabled apps ]]

• case study: Amazon echo (4 years)

Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021

• [ seven omnidirectional microphones ] at the top

a cylinder elongated to create separation between the array of seven omnidirectional microphones at the top and the speakers at the bottom, with some 14 hundred holes punctured in the metal tubing to push out air and sound.

• The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent [ 3% ] increase in Alexa's accuracy. (p.37, Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021)

p.23

The initiative was originally designated inside Lab126 as Project D. It would come to be known as the Amazon Echo, and by the name of its virtual assistant, Alexa.

p.24, p.45

Project D, also known as ‘Amazon Alexa’, later named ‘Amazon Echo’

January 4, 2011, first email from Bezos on Project D, p.24

November 6, 2014, product launch, p.45

([

within a four─year time horizon, Amazon developed a voice-enabled user interface inside a real─world working product:

─ developed far─field speech recognition

─ refined speech communication (speaking and sounding like a natural voice)

─ backoffice technical development

─ developed the plan to gather enough data for the far─field speech recognition

─ the heavy lifting of the speech recognition and other sensory data processing happens at the data center

─ needs an internetwork [Internet or VPN] connection with the data center

─ (( I would be interested to know: if you connected an Amazon Echo inside a corporate network and configured the device to reach the Amazon servers through a proxy server, what else would the Echo need to connect to in order to work properly, and how would a corporate firewall react to this new traffic? ))

─ port number for Amazon Echo (Alexa)

─ for example, the port number for e─mail (SMTP) is 25, not 24

• The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent [3%] increase in Alexa's accuracy. (p.37, Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021)

])
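The firewall question in the notes above can be framed as an egress allowlist check. The sketch below is only that framing: these notes do not say which hosts or ports the Echo really uses, so the host names and the 8443 entry are hypothetical placeholders. Only the well-known port numbers (SMTP 25, HTTP 80, HTTPS 443) are standard assignments.

```python
# Well-known TCP ports (standard assignments).
WELL_KNOWN_TCP_PORTS = {"smtp": 25, "http": 80, "https": 443}


def blocked_connections(required, allowed_ports):
    """Return the (host, port) pairs that an egress allowlist would refuse."""
    return [(host, port) for host, port in required if port not in allowed_ports]


# Hypothetical requirements: placeholders, NOT the Echo's real endpoints.
device_needs = [
    ("voice-service.example", WELL_KNOWN_TCP_PORTS["https"]),
    ("device-setup.example", 8443),  # hypothetical non-standard port
]

# A corporate firewall that only lets web traffic out.
firewall_allows = {WELL_KNOWN_TCP_PORTS["http"], WELL_KNOWN_TCP_PORTS["https"]}

print(blocked_connections(device_needs, firewall_allows))
# prints [('device-setup.example', 8443)]
```

Anything the list returns is traffic the firewall team would have to approve, or the device would have to send through the proxy instead.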

p.462 Index

Amazon Alexa, 26─38

AMPED and, 43─44

beta testers

Bezos's sketch for,

bug in,

as Doppler project, 26─38, 40, 42─47

Evi and, 34─36

Fire tablet and, 44

language─specific version of, 60

launch of, 44─46

name of, 32

Skills Kit, 44─46

social cue recognition in, 34─35

speech recognition in,

voice of, 27─30

voice service, 47

see also Amazon Echo

far─field speech recognition, 27─28

p.24

Greg Hart

([ in 2010, Greg Hart pointed out to Jeff Bezos that speech recognition technology was getting good at dictation and search; he did this by showing Jeff Google's voice search on an Android phone; ])

speech recognition 2010

Google's voice search, Android phone

technology was finally getting good at dictation and search

p.24

Hart remembered talking to Bezos about speech recognition one day in late 2010 at Seattle's Blue Moon Burgers. Over lunch, Hart demonstrated his enthusiasm for Google's voice search on his Android phone by saying, “pizza near me”, and then showing Bezos the list of links to nearby pizza joints that popped up on-screen. “Jeff was a little skeptical about the use of it on phones, because he thought it might be socially awkward”, Hart remembered. But they discussed how the technology was finally getting good at dictation and search.

p.24

January 4, 2011

Greg Hart,

Ian Freed, device vice president,

Steve Kessel

Amazon's HQ, Day 1 North building

p.25

voice-activated cloud computer

speaker, microphone, a mute button

Fiona, the Kindle building

p.26

One early recruit, Al Lindsay,

Al Lindsay, who in a previous job had written some of the original code for telco US West's voice-activated directory assistance. Lindsay spent his first three weeks on the project on vacation at his cottage in Canada, writing a six-page narrative that envisioned how outside developers might program their own voice-enabled apps that could run on the device.

p.26

internal recruit,

John Thimsen, director of engineering

p.26

To speed up development

Hart and his crew started looking for startups to acquire.

p.27

Yap, a twenty-person startup based in Charlotte, North Carolina, automatically translated human speech such as voicemails into text, without relying on a secret workforce of human transcribers

p.27

though much of Yap's technology would be discarded, its engineers would help develop the technology to convert what customers said into a computer-readable format.

p.27

industry conference in Florence, Italy

Amazon's newfound interest in speech technology

p.27

Jeff Adams, Yap's VP of research

two-decade veteran of the speech industry

pp.27-28

after the meeting, Adams delicately told Hart and Lindsay that their goals were unrealistic. Most experts believed that true “far-field speech recognition” ── comprehending speech from up to 32 feet away, often amid crosstalk and background noise ── was beyond the realm of established computer science, since sound bounces off surfaces like walls and ceilings, producing echoes that confuse computers.

“They basically told me, ‘We don't care. Hire more people. Take as long as it takes. Solve the problem,’” recalled Adams. “They were unflappable.”
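Part of why an array of microphones helps with distant, noisy speech can be shown with delay-and-sum beamforming, a standard array-processing technique. It is offered here as a general illustration, not as the method Amazon used; these notes do not describe Amazon's signal processing. The delays, noise level, and test tone are all made-up values.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Undo each channel's known integer sample delay, then average the channels."""
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(usable)
    ]


def mean_squared_error(a, b):
    """Mean squared difference over the overlapping part of two sequences."""
    n = min(len(a), len(b))
    return sum((x - y) ** 2 for x, y in zip(a, b)) / n


random.seed(0)  # fixed seed so the run is repeatable

# A clean test tone standing in for speech.
clean = [math.sin(2 * math.pi * 5 * n / 1000) for n in range(1000)]

# Hypothetical arrival delays (in samples) at seven microphones.
delays = [0, 3, 5, 2, 7, 4, 1]

# Each microphone hears the tone late by its delay, plus its own independent noise.
channels = [
    [0.0] * d + [s + random.gauss(0.0, 0.5) for s in clean]
    for d in delays
]

beamformed = delay_and_sum(channels, delays)
single_mic_error = mean_squared_error(channels[0], clean)
array_error = mean_squared_error(beamformed, clean)
print(array_error < single_mic_error)  # True
```

Aligning the channels makes the speech add up coherently while the independent noise partly cancels. Echoes off walls and ceilings are harder, because they are delayed copies of the speech itself rather than independent noise.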

p.28

Polish startup Ivona generated computer-synthesized speech that resembled a human voice.

Ivona was founded in 2001 by Lukasz Osowski, a computer science student at the Gdansk University of Technology. Osowski had the notion that so-called “text-to-speech”, or TTS, could read digital texts aloud in a natural voice and help the visually impaired in Poland appreciate the written word.

Michael Kaszczuk

he took recordings of an actor's voice and selected fragments of words, called diphones, and then blended or “concatenated” them together in different combinations to approximate natural-sounding words and sentences that the actors might never have uttered.
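The concatenation idea described above can be sketched in a few lines. This is a toy model, not Ivona's implementation: the phoneme symbols, the `_` silence marker, and the tiny `UNIT_DATABASE` of made-up sample values are all hypothetical. A real system stores recorded audio for each diphone and smooths the joins between them.

```python
def to_diphones(phonemes):
    """Split a phoneme sequence into diphones (adjacent pairs), padded with silence '_'."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def synthesize(phonemes, unit_database):
    """Concatenate the stored fragment for each diphone, in order."""
    units = to_diphones(phonemes)
    missing = [u for u in units if u not in unit_database]
    if missing:
        raise KeyError(f"no recording for diphones: {missing}")
    samples = []
    for unit in units:
        samples.extend(unit_database[unit])
    return samples


# Hypothetical database: diphone name -> short list of audio samples.
UNIT_DATABASE = {
    "_-k": [0.0, 0.1],
    "k-ae": [0.3, 0.5],
    "ae-t": [0.4, 0.2],
    "t-_": [0.1, 0.0],
}

print(to_diphones(["k", "ae", "t"]))
# prints ['_-k', 'k-ae', 'ae-t', 't-_']
cat = synthesize(["k", "ae", "t"], UNIT_DATABASE)
```

Because the database is keyed by sound pairs rather than whole words, the same recordings can be recombined into words the actor never spoke, as long as every needed pair was recorded.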

p.28

While students, they paid a popular Polish actor named Jacek Labijak to record hours of speech to create a database of sounds. The result was their first product, Spiker, which quickly became the top-selling computer voice in Poland.

Over the next few years, it was used widely in subways, elevators, and for robocall campaigns.

p.29

annual Blizzard Challenge, a competition for the most natural computer voice, organized by Carnegie Mellon University.

p.29

Gdansk R&D center were put in charge of crafting Doppler's voice.

p.29

the team considered lists of characteristics they wanted in a single personality, such as trustworthiness, empathy, and warmth, and determined those traits were more commonly associated with a female voice.

pp.29-30

Atlanta area-based voice-over studio, GM Voices, the same outfit that had helped turn recordings from a voice actress named Susan Bennett into Apple's agent, Siri.

p.30

To create synthetic personalities, GM Voices gave female voice actors hundreds of hours of text to read, from entire books to random articles, a mind-numbing process that could stretch on for months.

p.30

voice artist behind Alexa

professional voice-over community: Boulder-based singer and voice actress Nina Rolle.

warm timbre of Alexa's voice

Nina Rolle (Boulder-based singer and voice actress)

p.32

Bezos also suggested “Alexa”, an homage to the ancient library of Alexandria, regarded as the capital of knowledge.

p.32

[ seven omnidirectional microphones ] at the top

p.34

In 2012, inspired by Siri's debut, Tunstall-Pedoe pivoted and introduced the Evi app for the Apple and Android app stores. Users could ask it questions by typing or speaking. Instead of searching the web for an answer like Siri, or returning a set of links, like Google's voice search, Evi evaluated the question and tried to offer an immediate answer. The app was downloaded over 250,000 times in its first week and almost crashed the company's servers.

p.34

Evi employed a programming technique called knowledge graphs, or large databases of ontologies, which connect concepts and categories in related domains. If, for example, a user asked Evi, “What is the population of Cleveland?” the software interpreted that question and knew to turn to an accompanying source of demographic data. Wired described the technique as a “giant treelike structure” of logical connections to useful facts.
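The Cleveland example can be illustrated with a toy knowledge graph stored as subject, predicate, object triples. This is a sketch of the general technique, not Evi's design: the triple layout, the single question pattern, and the function names are assumptions. The population value is deliberately a placeholder string, since these notes give no figure.

```python
import re

# Toy knowledge graph as (subject, predicate, object) triples.
TRIPLES = [
    ("cleveland", "is_a", "city"),
    ("cleveland", "population", "<figure from a demographic data source>"),
]

# Recognises only questions shaped like "What is the <attribute> of <entity>?"
QUESTION_PATTERN = re.compile(
    r"what is the (?P<attribute>[\w ]+?) of (?P<entity>[\w ]+)\??$",
    re.IGNORECASE,
)


def lookup(entity, attribute, triples=TRIPLES):
    """Return the object of the first triple matching (entity, attribute), else None."""
    for subject, predicate, obj in triples:
        if subject == entity and predicate == attribute:
            return obj
    return None


def answer(question, triples=TRIPLES):
    """Interpret the question, then fetch the fact directly instead of returning links."""
    match = QUESTION_PATTERN.match(question.strip())
    if match is None:
        return None
    entity = match["entity"].strip().lower()
    attribute = match["attribute"].strip().lower()
    return lookup(entity, attribute, triples)


print(answer("What is the population of Cleveland?"))
# prints <figure from a demographic data source>
```

The difference from web search is visible in `answer`: the question is mapped to an entity and an attribute, and the reply is a single fact rather than a list of pages.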

Putting Evi's knowledge base inside Alexa helped with the kind of informal but culturally common chitchat called phatic speech.

p.35

Integrating Evi's technology helped Alexa respond to factual queries, such as requests to name the planets in the solar system, and it gave the impression that Alexa was smart. But was it? Proponents of another method of natural language understanding, called deep learning, believed that Evi's knowledge graphs wouldn't give Alexa the kind of authentic intelligence that would satisfy Bezos's dream of a versatile assistant that could talk to users and answer any question.

p.35

In the deep learning method, machines were fed large amounts of data about how people converse and what responses proved satisfying, and then were programmed to train themselves to predict the best answers.

p.35

The chief proponent of this approach was an Indian-born engineer named Rohit Prasad. “He was a critical hire”, said engineering director John Thimsen. “Much of the success of the project is due to the team he assembled and the research they did on far-field speech recognition.”

p.35

BBN Technologies (later acquired by Raytheon)

Cambridge, Massachusetts-based defense contractor

At BBN, he [Rohit Prasad] worked on one of the first in-car speech recognition systems and automated directory assistance services for telephone companies.

p.37

For years, Google also collected speech data from a toll-free directory assistance line, 800-GOOG-411.

p.37

Hart, Prasad, and their team created graphs that projected how Alexa would improve as data collection progressed. The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent increase in Alexa's accuracy.
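The doubling rule above implies exponential growth in data for linear gains in accuracy. The sketch below works through the arithmetic under two assumptions the book excerpt does not state: that each "3 percent" is three additive percentage points, and that the rule holds uniformly at every scale. The function names are illustrative.

```python
import math


def data_multiplier_for_gain(gain_points, points_per_doubling=3.0):
    """How many times more data is needed for a given accuracy gain (in points)."""
    return 2.0 ** (gain_points / points_per_doubling)


def gain_from_multiplier(multiplier, points_per_doubling=3.0):
    """Accuracy gain (in points) implied by multiplying the data by `multiplier`."""
    return points_per_doubling * math.log2(multiplier)


for gain in (3, 6, 9, 30):
    print(gain, data_multiplier_for_gain(gain))
# prints:
# 3 2.0
# 6 4.0
# 9 8.0
# 30 1024.0
```

Under these assumptions a 30-point improvement needs about a thousand times more data, which gives a sense of why the team concluded it needed a much larger collection effort.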

p.37

“How will we even know when this product is good?”

early 2013

p.38

“First tell me what would be a magical product, then tell me how to get there.”

p.38

Bezos's technical advisor at the time, Dilip Kumar,

p.38

they would need thousands of more hours of complex, far-field voice commands.

p.38

Bezos apparently factored in the request to increase the number of speech scientists and did the calculation in his head in a few seconds.

“Let me get this straight. You are telling me that for your big request to make this product successful, instead of it taking forty years, it will only take us twenty?”

p.42

the resulting program, conceived by Rohit Prasad and speech scientist Janet Slifka over a few days in the spring of 2013

p.42

Rohit Prasad and speech scientist Janet Slifka

spring of 2013

p.42

answer a question that later vexed speech experts ──

how did Amazon come out of nowhere to leapfrog Google and Apple in the race to build a speech-enabled virtual assistant?

pp.42-43

internally the program was called AMPED

Amazon contracted with an Australian data collection firm, Appen, and went on the road with Alexa, in disguise.

p.43

Appen rented homes and apartments, initially in Boston, and then Amazon littered several rooms with all kinds of “decoy” devices: pedestal microphones, Xbox gaming consoles, televisions, and tablets. There were also some twenty Alexa devices planted around the rooms at different heights, each shrouded in an acoustic fabric that hid them from view but allowed sound to pass through.

p.43

Appen then contracted with a temp agency, and a stream of contract workers filtered through the properties, eight hours a day, six days a week, reading scripts from an iPad with canned lines and open-ended requests

p.43

The speakers were turned off, so that Alexa didn't make a peep, but the seven microphones on each device captured everything and streamed the audio to Amazon's servers. Then another army of workers manually reviewed the recordings and annotated the transcripts, classifying queries that might stump a machine,

p.43

so that next time, Alexa would know.

p.43

The Boston test showed promise, so Amazon expanded the program, renting more homes and apartments in Seattle and ten other cities over the next six months to capture the voices and speech patterns of thousands more paid volunteers. It was a mushroom-cloud explosion of data about device placement, acoustic environments, background noise, regional accents, and all the gloriously random ways a human being might phrase a simple request to hear the weather, for example, or play a Justin

p.44

by 2012

multimillion-dollar cost.

p.44

By 2014, it had increased its store of speech data by a factor of ten thousand and largely closed the gap with rivals like Apple and Google.

p.47

over the next few months, Amazon would roll out the Alexa Skills Kit, which allowed other companies to build voice-enabled apps for the Echo, and Alexa Voice Service, which let the makers of products like lightbulbs and alarm clocks integrate Alexa into their own devices.

p.47

a smaller, cheaper version of Echo, the hockey puck-sized Echo Dot,

a portable version with batteries, the Amazon Tap.

Echo

Echo dot

Amazon Tap (a portable, battery-powered version of Echo)

p.24

January 4, 2011

p.45

November 6, 2014

Brad Stone, Amazon unbound: Jeff Bezos and the invention of a global empire, 2021

____________________________________

··<────────────────────────────────────────────────────────────────────────────>

(T)oday (I) (L)earn reading room

Sunday, October 30, 2022
