The second is the fast response speed of voice commands, measured as the time from the end of the user's speech to the moment Xiao P begins executing the instruction. From the video comparison, the current ultra-fast dialogue version reduces the voice-control response delay from the original 1.5s to about 0.9s. For in-car voice products, 0.9s is an excellent figure: most products today sit around 1.5s, and the better ones reach roughly 1.2s.
In addition, every video emphasizes the ability to understand multi-intent instructions, but the P7 already has this capability. The real improvement is that the TTS reply to a multi-intent instruction is now a single consolidated response, rather than announcing the execution of each instruction one by one.
Full-time dialogue
After the full-time dialogue switch is turned on, Xiao P keeps listening, and commands can be spoken and executed at any time without waking it up (no need to say "Hello Xiao P"). At present only some commands are supported, presumably mainly vehicle-control commands. During a full-time conversation the car does not respond to unsupported instructions, but within 5 seconds the user can append "Xiao P", and Xiao P will then recognize and execute the instruction that was just ignored. This product design neatly solves the fragmented experience caused by full-time dialogue covering only some domains, and it requires only "Xiao P" instead of "Hello Xiao P". Personally, I think this is the most brilliant functional update of the G9. It is like asking someone to do something for you: if he does not move, you call him again by name, and calling "Xiao P" is more natural than calling "Hello Xiao P".
In the video demonstration, we can see that the G9 combines this with a oneshot interaction mode, shortening the four-character wake-up word "Hello Xiao P" to the two-character "Xiao P", which is significant progress for a wake word of half the length. At present, two-character wake-word technology is immature: used alone it introduces a large number of false triggers, but introducing the two-character wake word in oneshot form, attached to the instruction, alleviates this problem well. Compared with four characters, a two-character wake word is more natural and convenient and reduces the awkwardness users feel to some extent. The same design has been applied to Baidu's smart fitness mirror, and Apple is reportedly planning to shorten "Hey Siri" to "Siri" in the same way.
When the full-time dialogue switch is turned on, only the driver's seat supports full-time dialogue by default. Xiao P's eye animation also changes in this mode, a product-design detail that improves the user experience.
Multi-person dialogue
When multi-person dialogue and full-time dialogue are enabled at the same time, full-time dialogue works in all four seats, and occupants in the four seats can speak alternately or simultaneously without interfering with one another, meeting the needs of multi-person dialogue.
The G9 also realizes multi-turn dialogue across sound zones, with different zones sharing the same multi-turn state. When the driver says "turn on seat heating", the co-driver only needs to say "me too" to turn on heating for the co-driver seat. This mainly optimizes multi-turn inheritance for function points bound to a specific seat.
The ASR results of the four seats are displayed in the four corners of the screen, the reply content is shown on screen, and the reply is locked to the originating sound zone (sometimes no TTS reply is given). The video emphasizes several of these product details.
Figure 2 Four-way full-time dialogue screen display
Functional analysis
Ultra-fast dialogue
Simply put, the eternal pursuit of voice interaction technology can be condensed into two words: fast and accurate. Fast and accurate voice technology is a necessary condition for building voice interaction products that truly satisfy users, and the goal of ultra-fast dialogue is the "fast" half.
Fig. 3 voice interaction data flow diagram
Fig. 3 shows a simplified flow from the user's speech to the reply given by the head unit. The yellow recording module is responsible for data acquisition, the blue part processes the collected speech data and understands the user's intent, the purple part replies to the user based on the understood instruction, and the orange part is the vehicle executing the instruction. What users perceive as voice speed is essentially the time from recording to instruction execution, which involves hardware, algorithms and other modules. In reality, the internal modules and interaction logic of a complete voice interaction product are far more complicated than shown here. Optimizing the speed of voice interaction can be analyzed from three aspects: the interaction link, the algorithms, and the system and hardware.
1. Interaction link
Interaction-link optimization means shortening the data transmission path or speeding up data transmission when designing the interaction logic, so that feedback reaches the user faster. Possible approaches include (a minimal sketch of the streaming and parallel-processing idea follows this list):
Use an offline scheme, or optimize the online/offline fusion logic.
Use streaming processing to reduce the absolute waiting time of each algorithm module.
Process algorithm modules in parallel, finding the shortest path for data transmission.
Merge algorithm modules to shorten the data transmission link.
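To make the streaming and parallel-processing ideas concrete, here is a minimal Python sketch in which NLU consumes ASR partial results as they are produced instead of waiting for the final transcript. `asr_partials` and `nlu_parse` are hypothetical stand-ins, not any vendor's actual API.

```python
# Minimal sketch of streaming ASR feeding NLU in parallel (hypothetical
# asr_partials / nlu_parse stand-ins; not XPeng's actual implementation).
import queue
import threading

def asr_partials(audio_frames):
    """Yield growing partial transcripts as audio arrives (stand-in)."""
    text = ""
    for frame in audio_frames:
        text += frame          # pretend each frame decodes to some text
        yield text

def nlu_parse(text):
    """Toy intent parser: returns an intent once the command is complete."""
    return "open_window" if "open the window" in text else None

def streaming_pipeline(audio_frames):
    partial_q = queue.Queue()

    def asr_worker():
        for partial in asr_partials(audio_frames):
            partial_q.put(partial)
        partial_q.put(None)            # end-of-utterance marker

    threading.Thread(target=asr_worker, daemon=True).start()

    # NLU consumes partials as they arrive instead of waiting for final ASR,
    # so the intent can be ready the moment the last word is recognized.
    while True:
        partial = partial_q.get()
        if partial is None:
            return None
        intent = nlu_parse(partial)
        if intent is not None:
            return intent

print(streaming_pipeline(["open ", "the ", "window"]))  # -> open_window
```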
2. Algorithm
There are many modules in the voice interaction technology chain. If each algorithm module introduces a delay of tens of milliseconds, the total can accumulate to hundreds of milliseconds, so optimizing and polishing every algorithm module is essential for improving interaction speed. For algorithm engineers working on products, the perennial problem is how to simplify the algorithm and increase its speed as much as possible without degrading performance or increasing compute (CPU/NPU) usage. Dancing on a knife's edge while wearing shackles may be the highest demand placed on product-facing algorithm engineers. Optimizing the algorithm modules is not only closely tied to the product experience; a simplified algorithm can also directly reduce hardware cost. In the voice technology chain, the modules with the most direct impact on interaction speed are:
Signal processing: includes three core computation modules, AEC, separation and noise reduction, plus sound-zone localization and voice isolation.
VAD: the delay of the VAD algorithm itself is generally small; the larger delay comes from the post-processing strategy, which is tied to product design and must be traded off against other aspects of the experience (an illustrative endpointing sketch follows this list).
ASR: delay comes from the data accumulated for model scoring, the dependence on future context, the peak shift of CTC-style algorithms, and the pruned search strategy.
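As an illustration of the VAD post-processing trade-off mentioned above, here is a toy energy-based endpointer with a configurable hangover window. The thresholds, frame size and hangover values are assumptions for demonstration; production systems use model-based VADs.

```python
# Illustrative energy-based VAD with a "hangover" post-processing window.
# The hangover avoids cutting speech off too early but adds end-of-speech
# latency, which is exactly the trade-off discussed above. Thresholds and
# frame sizes are arbitrary; production VADs are model-based.
import numpy as np

FRAME_MS = 20          # frame length
ENERGY_THRESH = 0.01   # speech/non-speech energy threshold (arbitrary)
HANGOVER_MS = 600      # how long to wait after the last speech frame

def end_of_speech_ms(frames, hangover_ms=HANGOVER_MS):
    """Return the time (ms) at which the endpoint is declared, or None."""
    silence_ms = 0
    seen_speech = False
    for i, frame in enumerate(frames):
        is_speech = float(np.mean(frame ** 2)) > ENERGY_THRESH
        if is_speech:
            seen_speech, silence_ms = True, 0
        elif seen_speech:
            silence_ms += FRAME_MS
            if silence_ms >= hangover_ms:
                return (i + 1) * FRAME_MS   # endpoint = speech end + hangover
    return None

rng = np.random.default_rng(0)
speech = [rng.normal(0, 0.3, 320) for _ in range(25)]      # ~0.5 s of "speech"
silence = [rng.normal(0, 0.001, 320) for _ in range(50)]   # ~1 s of silence
print(end_of_speech_ms(speech + silence, hangover_ms=600))  # ~500 + 600 ms
print(end_of_speech_ms(speech + silence, hangover_ms=200))  # endpoint 400 ms sooner
```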
3. System and hardware
Hardware is the foundation and the system is the support; a smooth underlying system is a prerequisite for excellent software. A voice interaction system not only depends on the hardware and operating system, it also controls body hardware through them. If the in-vehicle system itself stutters, no amount of voice algorithm optimization helps. Hardware and system factors that affect the voice experience include:
Recording hardware and recording drivers.
The system's resource allocation policy and the priority of voice-related processes.
The response speed of the controlled body hardware.
The response speed of the in-vehicle system.
The G9's ultra-fast dialogue function reduces the voice-control delay from 1.5s to about 0.9s. For such a large improvement, two reasons are emphasized in every experience video:
Replacing the cloud voice solution with an integrated offline solution eliminates the upload and download of data required by the cloud solution, shortening the interaction time.
Support for streaming understanding, so ASR and NLU are processed in parallel and NLU waiting time is shortened (see the back-of-envelope model below).
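A rough back-of-envelope model of these two claims, with purely assumed component latencies (none of them measured on the G9), shows how removing the network round trip and overlapping NLU with ASR could plausibly move total delay from the ~1.5s range toward ~0.9s:

```python
# Back-of-envelope latency model for the two claims above. All component
# numbers are illustrative assumptions, not measurements from the G9.
def total_latency(network_ms, vad_hangover_ms, asr_tail_ms, nlu_ms,
                  exec_ms, streaming_nlu):
    # With streaming NLU, parsing overlaps ASR, so only a small residual
    # NLU cost remains after the endpoint; otherwise NLU runs serially.
    nlu_cost = 30 if streaming_nlu else nlu_ms
    return network_ms + vad_hangover_ms + asr_tail_ms + nlu_cost + exec_ms

cloud_sequential = total_latency(network_ms=150, vad_hangover_ms=600,
                                 asr_tail_ms=150, nlu_ms=200, exec_ms=300,
                                 streaming_nlu=False)
offline_streaming = total_latency(network_ms=0, vad_hangover_ms=400,
                                  asr_tail_ms=150, nlu_ms=200, exec_ms=300,
                                  streaming_nlu=True)
print(cloud_sequential, offline_streaming)   # 1400 vs 880 ms
```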
But this is the 5G era. Is network delay really that large? Skeptical, the author analyzed the experience video in detail, measuring three key intervals: from the end of speech to the first word appearing on screen, from the end of speech to the full recognition result on screen, and from the recognition result to the start of the vehicle's response. The conclusions are:
With ultra-fast dialogue on, the full recognition result appears about 0.15s earlier, but the first word actually appears on screen slightly later. The improvement here is most likely related to the offline ASR algorithm scheme, with network delay accounting for only a small share.
The major gain of ultra-fast dialogue comes from improvements to the VAD post-processing strategy and from the offline NLU's streaming understanding.
Because the published experience videos are post-edited, they may differ from the real experience, so the analysis will be revisited and corrected after a real-car test. Readers interested in speed optimization can jump to the appendix for the analysis process.
Full-time dialogue
Full-time dialogue is a disruptive interaction mode that breaks the tradition, in place since the iPhone 4S introduced Siri, that a voice interaction system must carry a wake-up word. Following the development of voice interaction logic, the evolution toward full-time dialogue can be deduced from two directions; its essence is to improve interaction efficiency and make human-machine voice interaction more natural and convenient, closer to the logic of conversation between people.
Figure 4 Evolution Diagram of Full-time Dialogue
As we all know, the wake-up word is effectively the on/off switch of the voice system: recording starts when it opens and stops when it closes. With the wake-up word removed, a full-time dialogue system keeps listening continuously. Once the explicit switch is gone, the privacy and security of the voice interaction system deserve much more attention. To build full-time dialogue, the following must be done well:
1. Use an offline voice scheme.
The offline voice scheme has the following advantages:
All data is processed locally, protecting user privacy. This covers not only the voice data, which contains biometric characteristics, but also the recognized text, which contains plenty of private information.
Data does not need to be uploaded to the cloud, saving traffic costs.
All the work is done locally, saving the cost of cloud services.
The carefully polished offline voice scheme on the G9 makes the full-time dialogue function feasible.
2. Do a good job of voice separation and isolation.
The goal of voice separation is to separate the target speaker's voice from other voices; the goal of voice isolation is to exclude non-target voices so that only the target voice is sent to the speech recognition engine. The G9 adopts a distributed four-microphone hardware configuration, which reduces the difficulty of separation and isolation at the hardware level, but the algorithms still have to perform well on both, especially on leakage when the target seat is silent while other seats are speaking (a toy gating sketch follows).
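As a toy illustration of sound-zone isolation with a distributed four-microphone layout, the sketch below gates each zone by energy dominance. A real system would rely on beamforming or neural separation, so this only shows the idea and the leakage risk, not the actual algorithm.

```python
# Toy sound-zone gate for a distributed four-microphone layout: only pass
# audio to a zone's recognizer when that zone's mic clearly dominates in
# energy. Real systems use beamforming / neural separation; this only
# illustrates the "isolation" idea and the leakage risk mentioned above.
import numpy as np

ZONES = ["driver", "co_driver", "rear_left", "rear_right"]

def gate_zones(frames_by_zone, dominance_db=6.0):
    """frames_by_zone: dict zone -> 1-D numpy frame. Returns active zones."""
    energies = {z: float(np.mean(f ** 2)) + 1e-12
                for z, f in frames_by_zone.items()}
    active = []
    for zone, energy in energies.items():
        others = max(e for z, e in energies.items() if z != zone)
        # A zone is "speaking" if it exceeds the loudest other mic by a margin.
        if 10 * np.log10(energy / others) > dominance_db:
            active.append(zone)
    return active

rng = np.random.default_rng(1)
frames = {z: rng.normal(0, 0.01, 320) for z in ZONES}      # background noise
frames["driver"] = rng.normal(0, 0.3, 320)                  # driver speaking
print(gate_zones(frames))   # -> ['driver']
```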
3. Do a good job of false-alarm control.
False-alarm control is the most difficult and most critical part of full-time dialogue, and it directly determines the user experience of the feature. Anyone working on voice knows that voice wake-up also produces false triggers; perhaps 80% of the bad cases every wake-up practitioner has to solve are false-trigger optimizations. False alarms in full-time dialogue and in voice wake-up are essentially the same thing: speech the in-car system should not have responded to. But they differ in two important ways. First, the impact on the user is different. The wake-up word is just a switch; a false trigger merely means Xiao P answers and turns to look at you. In full-time dialogue, however, every sentence is a voice-control instruction with a real action behind it. Imagine driving in the rain while telling your wife on the phone that you will be home late because of a traffic jam, and the sunroof inexplicably opens. How would you feel? If you know it was the full-time dialogue feature, you will close the sunroof immediately and probably never turn the feature on again; if you don't know, you may be baffled the first time and heading to the 4S dealership for a check-up the second. Second, the false-alarm frequency and the difficulty of control are different. A wake-up word is a fixed four-character phrase with a definite target, yet controlling its false triggers is already hard; full-time dialogue has to cover hundreds of function points and thousands of sentence patterns. A similar false-alarm problem already exists in today's continued-listening mode (listening for a while after a wake-up), but because that window usually lasts only a few tens of seconds, the time dimension greatly reduces the chance of a false trigger. False alarms in full-time dialogue fall into two categories. The first is misrecognition by the algorithms: ASR recognizes irrelevant speech as a valid instruction, or NLU parses irrelevant text into a valid instruction. The remedy is to push algorithm performance as far as possible and to detect and block such erroneous instructions with additional strategies (a sketch of one such strategy follows). The second is distinguishing human-machine dialogue from human-human dialogue: a sentence you say while chatting with a friend may happen to be an instruction that can trigger a vehicle action, even though you were talking to your friend, not to the car. This may be the hardest problem in full-time dialogue.
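One simple false-alarm control strategy of the first kind can be sketched as confidence gating plus an intent whitelist for wake-free commands. The thresholds and whitelist below are illustrative assumptions, not XPeng's actual policy.

```python
# Sketch of one false-alarm control strategy for wake-word-free commands:
# only execute an instruction when ASR and NLU confidences are high AND the
# intent is on a whitelist of low-risk vehicle controls. The thresholds and
# whitelist are assumptions for illustration, not XPeng's actual policy.
from dataclasses import dataclass

WAKE_FREE_WHITELIST = {"open_window", "close_window", "seat_heating_on",
                       "ac_temperature_set"}

@dataclass
class Hypothesis:
    intent: str
    asr_confidence: float   # 0..1, from the recognizer
    nlu_confidence: float   # 0..1, from the intent parser

def should_execute(hyp: Hypothesis, wake_word_heard: bool) -> bool:
    if wake_word_heard:
        # Explicitly addressed ("Xiao P"): accept with a normal threshold.
        return hyp.nlu_confidence > 0.5
    # Wake-free mode: stricter thresholds plus an intent whitelist, because a
    # false trigger here performs a real action (e.g. opening the sunroof).
    return (hyp.intent in WAKE_FREE_WHITELIST
            and hyp.asr_confidence > 0.9
            and hyp.nlu_confidence > 0.9)

print(should_execute(Hypothesis("open_window", 0.95, 0.93), False))   # True
print(should_execute(Hypothesis("navigate_home", 0.95, 0.93), False)) # False
print(should_execute(Hypothesis("navigate_home", 0.7, 0.6), True))    # True
```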
4. Avoid the fragmentation of the user experience.
From the perspective of safety design and the maturity of current technology, the function points supported by full-time dialogue will for a long time be only a subset of all voice function points. This raises the user's learning cost, because users do not know which functions are supported and which are not, creating a fragmented experience. The author thinks the XPeng G9 handles this problem well: its product and engineering teams solve it elegantly with a postfix wake-up ("Xiao P" appended after the instruction). Personally, I guess "Xiao P" is realized through ASR rather than a dedicated wake-word system. Besides the G9, two other cars currently support full-time dialogue. The first is Geely's Xingyue L, where it is called geek mode in the settings; once turned on you can talk at any time, but the experience is very poor and basically unusable, because almost anything you say triggers the voice function. The second is the Chery Tiggo 8 Pro, where full-time dialogue is on by default and is marketed as a full-time wake-free function. That solution is provided by Horizon and is the industry's first full-time dialogue system based on a fully offline scheme, currently the best experience on the market. I hope to experience the G9's full-time dialogue as soon as possible, and I hope the G9 can catch up and push the development of full-time dialogue further.
Multi-person dialogue
Multi-person dialogue on the G9 has two main capabilities: first, people in different seats can use voice at the same time, independently and without interfering with each other; second, conversations from different seats can inherit from one another. Technically, multi-person dialogue is simpler than ultra-fast dialogue and full-time dialogue.
1. Multi-person parallel use
To realize multi-person parallel use, two things must be done well. The first is powerful signal processing, especially voice separation and isolation; front-end signal schemes based on distributed four-microphone arrays are by now fairly mature, though some difficult scenarios remain. The second is sufficient compute to support four concurrent voice interaction pipelines, the core being the concurrency of four ASR and four NLU instances (a concurrency sketch follows).
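A minimal sketch of the concurrency requirement, running one recognition pipeline per sound zone in parallel (`recognize_zone` is a stand-in for a real per-zone ASR plus NLU pipeline):

```python
# Sketch of running four independent recognition pipelines in parallel, one
# per sound zone, so occupants can speak simultaneously. recognize_zone is a
# stand-in for a real per-zone ASR + NLU pipeline.
from concurrent.futures import ThreadPoolExecutor

ZONES = ["driver", "co_driver", "rear_left", "rear_right"]

def recognize_zone(zone, audio):
    # Placeholder: a real implementation would run ASR + NLU on this zone's
    # separated audio stream and return the parsed intent.
    return zone, f"intent_for({audio})"

def recognize_all(audio_by_zone):
    with ThreadPoolExecutor(max_workers=len(ZONES)) as pool:
        futures = [pool.submit(recognize_zone, z, a)
                   for z, a in audio_by_zone.items()]
        return dict(f.result() for f in futures)

print(recognize_all({z: f"{z}_audio" for z in ZONES}))
```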
2. Multi-person multi-turn dialogue
The core of this function is inheriting multi-turn state across sound zones, which falls under dialogue management and already has reasonably good solutions in the industry (a toy example follows).
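A toy dialogue manager illustrating cross-zone multi-turn inheritance, where the last intent is shared across zones while the seat slot binds to the speaker's own zone. This is purely a guess at the behavior described above, not the production design.

```python
# Sketch of cross-zone multi-turn inheritance: the dialogue state (last
# intent) is shared across zones, while the seat slot defaults to the
# speaker's own zone. So after the driver says "turn on seat heating",
# a co-driver "me too" reuses the intent but binds it to the co-driver seat.
class DialogueManager:
    def __init__(self):
        self.last_intent = None        # shared across all sound zones

    def handle(self, zone, utterance):
        if utterance == "me too":
            intent = self.last_intent  # inherit the previous intent
        else:
            intent = "seat_heating_on" if "seat heating" in utterance else None
            self.last_intent = intent
        if intent is None:
            return None
        return {"intent": intent, "seat": zone}   # slot bound to the speaker

dm = DialogueManager()
print(dm.handle("driver", "turn on seat heating"))
# {'intent': 'seat_heating_on', 'seat': 'driver'}
print(dm.handle("co_driver", "me too"))
# {'intent': 'seat_heating_on', 'seat': 'co_driver'}
```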
Summary
Based on the experience videos, the author summarizes two kinds of interaction logic on the G9 (just a personal guess).
Fig. 5 Schematic diagram of the internal algorithm modules when "Hello Xiao P" initiates voice interaction
Fig. 6 Logic diagram of internal algorithm module of full-time dialogue voice interaction
The launch of the XPeng P7 pushed in-car voice assistants to a new height and became a benchmark for many automakers. I hope the G9 can raise car voice to yet another level, bring more convenience to users, and create more opportunities and room for growth for voice practitioners. Finally, I hope to experience all of the G9's functions as soon as possible.
Appendix: Delay Analysis
From the experience video, the author picks the "open the window" example, aligns the on-screen text with the audio and the instruction-execution state, and marks the time point of each key event.
Figure 2-1 Key event time points with ultra-fast dialogue off
Figure 2-2 Key event time points with ultra-fast dialogue on
Taking the on-screen recognition result as the dividing point, the voice interaction delay can be roughly split into two parts, TD1 and TD2; see the table for detailed definitions. In addition, because the real-time display of recognition results also affects the user's perception, the delay until the first word appears on screen after the speech ends is recorded as TD3.
| Name | Definition | Included modules | Ultra-fast dialogue off | Ultra-fast dialogue on (improvement) |
| --- | --- | --- | --- | --- |
| TD1 | End of speech to full recognition result on screen | 1. recording delay; 2. front-end signal processing delay; 3. VAD algorithm delay; 4. network transmission delay (cloud solution); 5. ASR algorithm delay | 0.608s (9.732s ~ 10.340s) | 0.467s (21.0s ~ 21.467s), +23.2% |
| TD2 | Full recognition result to start of instruction execution | 1. VAD post-processing strategy delay; 2. NLU algorithm delay; 3. system delays such as instruction dispatch and hardware startup | 0.947s (10.340s ~ 11.287s) | 0.407s (21.467s ~ 21.874s), +57.0% |
| TD3 | End of speech to first word on screen | 1. recording delay; 2. front-end signal processing delay; 3. VAD algorithm delay (data accumulation); 4. network transmission delay (cloud solution); 5. ASR algorithm delay | 0.335s (9.732s ~ 10.067s) | 0.367s (21.0s ~ 21.367s), -9.5% |
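The table's values can be reproduced from the annotated timestamps; the short script below only repeats that arithmetic (the timestamps themselves come from the author's frame-by-frame reading of the two videos).

```python
# Recomputing the table's TD1/TD2/TD3 values and improvement ratios from the
# event timestamps read off the two videos (seconds).
def delays(speech_end, first_word, full_text, action_start):
    return {"TD1": full_text - speech_end,     # speech end -> full text on screen
            "TD2": action_start - full_text,   # full text -> action starts
            "TD3": first_word - speech_end}    # speech end -> first word on screen

off = delays(speech_end=9.732, first_word=10.067, full_text=10.340,
             action_start=11.287)              # ultra-fast dialogue off
on = delays(speech_end=21.0, first_word=21.367, full_text=21.467,
            action_start=21.874)               # ultra-fast dialogue on

for key in ("TD1", "TD2", "TD3"):
    gain = (off[key] - on[key]) / off[key] * 100
    print(f"{key}: off={off[key]:.3f}s on={on[key]:.3f}s gain={gain:.1f}%")
# TD1: off=0.608s on=0.467s gain=23.2%
# TD2: off=0.947s on=0.407s gain=57.0%
# TD3: off=0.335s on=0.367s gain=-9.6%  (the table rounds this to -9.5%)
```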
Note: a single utterance is only of rough reference value; more data would be needed to prove the effect. Based on these statistics, the reasons for the speed-up in ultra-fast dialogue can be inferred as follows:
| Module | Optimized in ultra-fast dialogue? |
| --- | --- |
| Recording delay | Recording sits at the low-level driver layer; no change expected whether ultra-fast dialogue is on or off. |
| Signal processing delay | Signal processing already runs on the device side; presumably unchanged. |
| VAD algorithm delay | The VAD model runs on the device side; its scoring-data accumulation and dependence on future context are presumably unchanged. |
| ASR delay | Changes. TD1's improvement most likely comes from the offline ASR scheme: model-level optimization on one hand, and a smaller search space with faster decoding on the other. Relevant factors: scoring-data accumulation, dependence on future context, decoding delay, CTC peak shift. |
| Network transmission delay | Judging from TD3, in the cloud solution the network transmission of uploaded audio and returned recognition results contributes little. |
| VAD post-processing strategy delay | Large impact. VAD post-processing usually extends the endpoint backward by a fixed time after the algorithm output; ultra-fast dialogue appears to truncate the instruction earlier. |
| NLU algorithm delay | For the instruction "open the window", whether parsing runs in the cloud or in a device-side rule engine should make little speed difference; the gain comes from combining it with streaming semantic understanding. |
| System delay (instruction dispatch, hardware startup) | No change; the hardware and system are identical. |
In the traditional voice interaction flow, to ensure that recognition is not cut off prematurely (for example when the user pauses mid-sentence, or when the VAD algorithm is not robust), a post-processing strategy is added after the VAD output, usually extending the endpoint backward by a fixed time, which introduces considerable delay in many scenarios. As shown in the figure below, although a complete recognition result is already available at t3, the VAD segment is not sent to NLU for parsing until t4. After introducing streaming semantic understanding, the ASR text is sent to NLU for analysis in real time, so the NLU result is available at t7 instead of only after t4, greatly reducing the delay. Interestingly, with ultra-fast dialogue off, t3 to t6 takes 0.947s. Assuming the system's VAD post-processing extends backward by 0.6s and hardware execution consumes 0.1s, the NLU would still take 0.247s, which is hard to believe for an instruction as simple as "open the window". One can only say that the big improvement owes a lot to how slow the previous generation was.