Patent Document Retrieval Tool (system) is a translation system that can automatically translate patent documents.
This paper introduces a practical Chinese-English machine translation system for patent documents, covering its overall design and the main translation technologies it uses. As awareness of intellectual property rights grows in China and the need for international exchange becomes more urgent, manual translation by patent translators can no longer keep up with the rapidly growing demand for patent translation, which to some extent hinders the promotion and exchange of Chinese patent technology. Machine translation and computer-assisted translation are effective ways to address this problem. In recent years machine translation has made major advances, especially statistical machine translation, which has greatly improved translation quality and offers a powerful new tool for translating patent documents.

Characteristics of patent document translation

Compared with the translation of ordinary texts, the translation of patent documents has the following characteristics:

● Many professional fields are involved. Patent documents are strongly domain-specific, so general-purpose translation software rarely gives satisfactory results when applied directly. However, patent documents can be divided into fields fairly cleanly by their International Patent Classification number. In addition, years of accumulated translations make bilingual parallel corpora for specific fields relatively easy to obtain, which simplifies corpus collection and domain partitioning for machine translation.

● Many technical and legal terms are used. Patent documents contain a large number of technical and legal terms, which demands broad expertise from translators. Patent translation is correspondingly expensive. For example, abroad the fee for translating from one's native language into a foreign language is about $30 to $50 per 100 source words, and higher still for less common languages. Using automatic or assisted translation to handle technical and legal terminology can therefore greatly reduce the cost of patent translation.

● Many languages are involved. Because patents are national in character, patent documents often need to be translated between many different languages. Building a separate translation system for every translation direction would be very costly, so language-independent translation technology is a reasonable choice.

● The document format is standardized and the language is rigorous. Patent documents share some characteristics of legal documents, so compared with news text or spoken language their wording is relatively fixed and their language relatively standardized. They often contain fixed sentence patterns, commonly known as "sentence sets", such as "the purpose of the invention is X" and "X is characterized by Y in claim N", where X and Y can be any words or sentences and N is any sequence of digits. Such sentence pattern templates are well suited to machine translation.

The characteristics above suggest that machine translation can achieve good results on patent documents, whose form is standardized and whose fields are clearly delimited. In particular, statistical machine translation, which has developed rapidly in recent years, offers good language independence, good domain portability, convenient knowledge acquisition and a short development cycle, which makes it well suited to building a patent document translation system.

The Multilingual Interactive Technology Laboratory of the Institute of Computing Technology, Chinese Academy of Sciences, has many years of experience in machine translation research and has achieved good results in statistical machine translation in recent years. Beijing Dongfang Lingdun Technology Co., Ltd. has a large demand for patent document translation and hopes to improve translation quality and efficiency with the help of automatic translation software. Commissioned by Dongfang Lingdun, researchers at the Institute of Computing Technology designed and implemented a domain-specific Chinese-English patent document translation system, drawing on the laboratory's statistical machine translation technology and on the characteristics of patent translation. The system currently covers patent documents on traditional Chinese medicine. Because it is built on statistical machine translation, it can readily be ported to patent translation in other technical fields.

System overall design

In order to meet the needs of large-scale, multi-user and concurrent tasks, this system adopts server/client network service mode and multi-thread scheduling. The physical structure and logical flow of the system are as follows:

1. Physical structure

The physical structure of the Chinese-English patent document machine translation system consists of two parts:

● Translation engine server: responsible for providing translation services and managing translation resources.

● Client: responsible for presenting translation results to users, providing auxiliary translation tools and submitting user requests to the server.

The server hosts the core translation decoder and the resources it needs, such as the phrase table, language model, template library, dictionaries and translation memory. It manages these resources centrally and schedules access to them. The server is also responsible for scheduling user threads and allocating their time slices, and it coordinates the priority of the tasks that different users submit.

Clients are divided into ordinary user clients and administrator user clients, and different users have different permissions. The client provides users with a convenient interface for editing and modifying, and at the same time provides users with the function of viewing task status and server status, and can access and modify some resources on the server in real time. Through the client, users can conveniently upload files in batches for translation, modify the returned results, resubmit the translation, and export the translation results in batches.

Both the server and the client can run independently, and they are connected with each other through the network.
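The server-side scheduling described above can be illustrated with a minimal sketch. It assumes a priority queue drained by a small pool of worker threads; the class name `TranslationServer`, the `translate` callable and the numeric priority scheme are hypothetical and are not taken from the system itself.

```python
import queue
import threading


class TranslationServer:
    """Toy server: a priority queue of jobs drained by worker threads."""

    def __init__(self, translate, workers=2):
        self._translate = translate            # hypothetical decoder callable
        self._jobs = queue.PriorityQueue()     # lower number = served first
        self._results = {}
        self._lock = threading.Lock()
        self._next_id = 0
        for _ in range(workers):
            threading.Thread(target=self._work, daemon=True).start()

    def submit(self, user, text, priority=10):
        """Queue one job and return its id; the id also breaks priority ties in FIFO order."""
        with self._lock:
            self._next_id += 1
            job_id = self._next_id
        self._jobs.put((priority, job_id, user, text))
        return job_id

    def _work(self):
        while True:
            _priority, job_id, user, text = self._jobs.get()
            result = self._translate(text)
            with self._lock:
                self._results[job_id] = (user, result)
            self._jobs.task_done()

    def wait(self):
        """Block until every queued job is finished, then return all results."""
        self._jobs.join()
        with self._lock:
            return dict(self._results)


server = TranslationServer(str.upper)          # str.upper stands in for the decoder
first = server.submit("user_a", "abc")
urgent = server.submit("admin", "xyz", priority=1)
results = server.wait()
```

Because clients only submit jobs and collect results, batch translation can run in the background without blocking other client-side work, which is the behaviour the text attributes to the real system.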

2. Logical flow

The logical structure of the system is its overall business framework. It describes the whole process from data input, through internal processing, to final output (see Figure 1 for the logical flow diagram of the system).

Specifically, the main flow of the system is described as follows:

● Translation service: responsible for translating sentences or text files submitted by users and outputting the results. During translation, the translation memory, dictionary and template library management programs are called and the statistical translation model library is accessed.

● Translation memory management: responsible for organizing and managing the translation memory, including querying, adding, modifying, deleting and exporting translation samples. When a user or translator submits a memory operation request, the memory management module accesses the translation memory, performs the corresponding operation and returns the result.

● Dictionary management: responsible for organizing and managing all dictionaries in the system, and conducting dictionary query, addition, deletion, batch import and export, etc. When a user or translator submits a dictionary operation request, the dictionary management module accesses the system dictionary database, performs corresponding operations and feeds back the results.

● Template library management: responsible for organizing and managing the template library, and performing operations such as querying, adding, modifying, deleting, importing and exporting templates. When a user or translator submits a template operation request, the template management module accesses the template library, performs corresponding operations and feeds back the results.

● User management: responsible for receiving and executing operations such as adding, deleting and setting permissions of users.
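The management modules listed above share the same shape: a store plus query, add, modify, delete and export operations. As a rough sketch for the translation memory only, assuming an in-memory dictionary keyed by whitespace-normalized source sentences (the class and method names are hypothetical, and the real system's storage format is not described in the text):

```python
import json


class TranslationMemory:
    """Minimal exact-match translation memory."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _key(source):
        # Collapse runs of whitespace so trivially different inputs still match.
        return " ".join(source.split())

    def add(self, pairs):
        """Add or overwrite (source, target) pairs in batch."""
        for source, target in pairs:
            self._entries[self._key(source)] = target

    def query(self, source):
        """Return the stored translation, or None when there is no exact match."""
        return self._entries.get(self._key(source))

    def delete(self, source):
        """Remove an entry; return True if it existed."""
        return self._entries.pop(self._key(source), None) is not None

    def export(self):
        """Serialize all entries as JSON text."""
        return json.dumps(self._entries, ensure_ascii=False, sort_keys=True)


tm = TranslationMemory()
tm.add([("a  b", "X")])
tm.add([("a b", "Y")])   # same normalized key, so this modifies the entry
```

A dictionary or template library manager could follow the same pattern with a different record type.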

Main translation techniques used in the system

The system is based mainly on statistical translation technology, combined with template-based and memory-based translation methods.

1. Statistics-based translation

Statistical machine translation is currently the leading machine translation technology worldwide, and it overcomes the main shortcomings of traditional rule-based methods. In rule-based machine translation, translation knowledge is embodied mainly in dictionaries and rules written by human experts. The main problems of that approach are: writing linguistic knowledge by hand takes a great deal of manpower, material resources and time; hand-written knowledge can hardly cover the variety of problems found in real translation settings; hand-written knowledge has no good way of resolving conflicts between rules; and hand-written knowledge is hard to port to other languages and fields. In statistical machine translation, all translation knowledge comes from real parallel corpora and is learned automatically through statistical modeling, which avoids the main problems of expert-written knowledge. To sum up, statistical machine translation has the following advantages:

(1) It can be easily ported to different knowledge fields. Once a bilingual parallel corpus for a new field is available, a translation system for that field can be built quickly. Patents have a standardized classification system and translated patent texts in different fields are easy to obtain, so this feature makes statistical machine translation particularly suitable for patent translation systems.

(2) It is easy to port to different languages. Statistical machine translation is highly language-independent, and a translation system for a new language pair can be built with very little language-specific processing. This greatly reduces development cost for patents that must be translated into multiple languages.

(3) There is no need to write rules manually. All translation knowledge is obtained automatically from the bilingual parallel corpus, which greatly reduces the manpower, material resources and time required for system development. Because the system is built on a statistical model, it also resolves conflicts between pieces of knowledge in a principled way.

(4) The translation quality of the system can be gradually improved with the increase of training data. With the use of patent translation system, more and more bilingual parallel corpora can be produced, which can further improve the translation performance and quality of the system.

In the system implementation, the researchers adopted a phrase-based statistical machine translation model. This model takes phrases as the basic translation unit. It automatically extracts candidate phrase translations from the bilingual corpus together with the translation probabilities between phrases, which form the translation model. A target language model is also obtained in the training stage. During translation, a decoding algorithm uses the trained translation model and language model to select the most probable combination of candidate phrase translations as the translation of the whole sentence.
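The decoding step described above can be illustrated with a toy example. The sketch below is not the system's decoder: it assumes monotone decoding (no reordering), a hand-written phrase table and a tiny bigram language model with a fixed floor probability for unseen bigrams. All entries and probabilities are invented for illustration.

```python
import math

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
PHRASE_TABLE = {
    ("治疗",): [("treating", 0.7), ("treatment", 0.3)],
    ("风湿性", "心脏病"): [("rheumatic heart disease", 0.9)],
    ("风湿性",): [("rheumatic", 0.8)],
    ("心脏病",): [("heart disease", 0.8)],
}

# Hypothetical bigram language model; unseen bigrams get a small floor value.
BIGRAM = {
    ("<s>", "treating"): 0.3,
    ("treating", "rheumatic"): 0.4,
    ("rheumatic", "heart"): 0.6,
    ("heart", "disease"): 0.8,
}
FLOOR = 1e-3


def lm_logprob(prev, words):
    """Score `words` given the previous word; return (log-probability, last word)."""
    total = 0.0
    for word in words:
        total += math.log(BIGRAM.get((prev, word), FLOOR))
        prev = word
    return total, prev


def decode(source, max_phrase_len=3):
    """Monotone phrase-based decoding by dynamic programming over source positions."""
    # best[i] maps the last emitted target word to the best (score, words) covering source[:i].
    best = [dict() for _ in range(len(source) + 1)]
    best[0]["<s>"] = (0.0, [])
    for i in range(len(source)):
        for prev, (score, words) in best[i].items():
            for j in range(i + 1, min(len(source), i + max_phrase_len) + 1):
                for target, prob in PHRASE_TABLE.get(tuple(source[i:j]), []):
                    new_words = target.split()
                    lm_score, last = lm_logprob(prev, new_words)
                    candidate = (score + math.log(prob) + lm_score, words + new_words)
                    if last not in best[j] or candidate[0] > best[j][last][0]:
                        best[j][last] = candidate
    if not best[-1]:
        return None  # no sequence of known phrases covers the input
    return " ".join(max(best[-1].values(), key=lambda c: c[0])[1])


translation = decode(["治疗", "风湿性", "心脏病"])
```

With these toy numbers the decoder prefers the single phrase pair for "风湿性 心脏病" over translating the two words separately, because the combined translation-model and language-model score is higher. A production decoder would also handle reordering, pruning and additional feature weights.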

2. Template-based translation

Template-based method is convenient for the system to translate sentences with similar patterns. Patent documents in specific fields often contain some fixed sentence patterns. For example, the following are the titles of several patents in the field of traditional Chinese medicine:

Traditional Chinese medicine for treating rheumatic heart disease

A medicated bag for treating hyperosteogeny

A sugar-free Chinese medicinal composition with tranquilizing effect and its preparation method

A pasty health food with weight reducing effect and its preparation method

It can be seen that these titles have great similarities in sentence patterns, which can be summarized by two templates: "A Y for treating X" and "A Y with X function and its preparation method". In the translation system, a complete translation template includes "the source language part of the template" and "the target language part of the template", and each part is divided into "the constant part of the template" and "the variable part of the template". For example, the above two templates are represented in the translation system as follows:

##2{…} is used to treat ##1{…}

==> A ##2 for treating ##1

##2{…} with ##1{…} function and preparation method thereof

==> ##2 with ##1 effect and its preparation method

Here "##N" marks a variable part of the template, and the number N links each source-side variable to its counterpart on the target side. Inside the "{…}" that follows a variable, constraints can be added to restrict what the variable may match, such as the length of the matched string, the matching position (at the beginning or at the end of a clause), and words that the variable must or must not contain. This increases the expressive power of the template. A template can match either a whole sentence or a clause.

After template matching, the above example is translated into the following form:

Traditional Chinese medicine for treating rheumatic heart disease

A medicated bag for treating hyperosteogeny

A sugar-free Chinese medicinal composition with tranquilizing effect and its preparation method

A pasty health food with weight reducing effect and its preparation method

It can be seen that sentence pattern template matching not only translates fixed sentence patterns well but also achieves some long-distance reordering, which compensates for the weakness of phrase-based statistical translation in long-distance reordering. In addition, after template matching the constant parts of the template are already translated correctly, so the statistical decoder only needs to translate the remaining fragments, which reduces its workload to some extent.

The sentence pattern template defined by the system is intuitive and easy for language workers to understand. Users can add translation templates according to the sentence pattern characteristics of the text to be translated, which greatly increases the flexibility of the system.
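The template mechanism described in this section can be sketched as follows. This is an illustration under stated assumptions, not the system's template engine: the Chinese source-side template, the `max=` length constraint syntax and the `translate_fragment` callable (standing in for the statistical decoder) are all hypothetical.

```python
import re

# A variable is "##N", optionally followed by a "{...}" constraint block.
VARIABLE = re.compile(r"##(\d+)(?:\{([^}]*)\})?")


def compile_template(source_side):
    """Turn a source-side template into a regex with one named group per variable."""
    pattern, pos = "", 0
    for m in VARIABLE.finditer(source_side):
        pattern += re.escape(source_side[pos:m.start()])    # constant part
        limit = ""
        if m.group(2) and m.group(2).startswith("max="):    # hypothetical constraint syntax
            limit = m.group(2)[len("max="):]
        pattern += f"(?P<v{m.group(1)}>.{{1,{limit}}}?)"     # variable part
        pos = m.end()
    pattern += re.escape(source_side[pos:])
    return re.compile(f"^{pattern}$")


def apply_template(source_side, target_side, sentence, translate_fragment):
    """Return the templated translation, or None if the sentence does not match."""
    m = compile_template(source_side).match(sentence)
    if m is None:
        return None
    # Fill each target-side variable with the translation of the matched source fragment.
    return re.sub(
        r"##(\d+)",
        lambda v: translate_fragment(m.group("v" + v.group(1))),
        target_side,
    )


# Toy lexicon standing in for the statistical decoder.
LEXICON = {"风湿性心脏病": "rheumatic heart disease", "中药": "traditional Chinese medicine"}

result = apply_template(
    "治疗##1{max=10}的##2",
    "A ##2 for treating ##1",
    "治疗风湿性心脏病的中药",
    LEXICON.get,
)
```

The variable numbers carry the reordering: `##1` precedes `##2` on the source side but follows it on the target side, which is how a template moves material over a long distance that a phrase-based decoder would handle poorly.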

3. Memory-based translation

While using the system, users can add correctly translated sentences to the translation memory in batches. During translation, if the same sentence already exists in the translation memory, the system can quickly retrieve its correct translation. Once the translation memory has grown to a certain size, it can be added to the training corpus to further improve the quality of automatic translation.

In addition, the translation system also allows users to add domain translation dictionaries and user translation dictionaries as needed, which enhances the user's ability to control the system.

Figure 2 shows the main translation process of the system, taking the translation of Chinese text as an example. It shows the role and position of the translation techniques above within the overall process. For an input Chinese text, the system first searches the translation memory through the memory management module and, if a translation already exists, returns it directly. Otherwise, the system calls the word segmentation tool to segment the Chinese text and post-processes the segmentation results, then calls the template matching module, and finally performs statistical translation. Statistical translation draws on the statistical translation model library, that is, the translation model and the language model.
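The order of operations just described can be summarized in a short control-flow sketch. The function and parameter names are hypothetical, each stage is passed in as a callable so the sketch stays self-contained, and segmentation post-processing is assumed to be folded into `segment`.

```python
def translate_text(text, memory, segment, match_templates, decode):
    """Return (translation, route) following the order described in the text."""
    # 1. Translation memory: an exact hit is returned directly.
    hit = memory.get(text)
    if hit is not None:
        return hit, "memory"
    # 2. Chinese word segmentation.
    tokens = segment(text)
    # 3. Template matching yields pieces that are either already-translated
    #    constants ("translated") or untranslated token fragments ("source").
    pieces = match_templates(tokens)
    # 4. Statistical decoding handles the remaining fragments.
    output = [value if kind == "translated" else decode(value) for kind, value in pieces]
    return " ".join(output), "statistical"


# Stub stages, used only to exercise the control flow.
def stub_templates(tokens):
    return [("translated", "X"), ("source", tokens)]


def stub_decode(tokens):
    return "-".join(tokens)


cached = translate_text("abc", {"abc": "ABC"}, list, stub_templates, stub_decode)
fresh = translate_text("abc", {}, list, stub_templates, stub_decode)
```

Checking the translation memory first means a sentence that a user has already corrected never reaches the decoder again.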

Main functions and performance of the system

Through the user interface, users can open files for revision and dynamically add translation terms and translation templates to guide the translation results. They can also look up unfamiliar words in the dictionary while revising, and add corrected results to the translation memory in batches. While revising, users can still submit translation tasks to the server in batches for queuing, and they are prompted to download the result files once a task is complete. The system is designed for concurrent execution by multiple users and multiple tasks: batch translation tasks are processed in the background on the server and do not interfere with other, non-translation tasks on the client.

1. Translation quality

The system was trained on 80,000 sentence pairs (average sentence length 31 words) provided by Dongfang Lingdun Technology Co., Ltd. Translation quality is evaluated with the internationally accepted BLEU metric, using the widely used scoring tool mteval-v11b.pl. On a test set of 200 sentences held out from the training corpus, with a single reference translation per sentence, the BLEU score of the automatic translation is 0.3020.

For comparison with the state of the art: in the large-data track of the 2006 NIST Chinese-English machine translation evaluation, the best score on the NIST subset (four reference translations per sentence) was 0.3393, and the best score on the GALE subset (one reference translation per sentence) was 0.1470. The training and test data in the NIST evaluation all come from the news domain, and its training data is far larger than that used for this patent translation system. Although the two results are not directly comparable, they suggest that, with only a small training corpus, the system's translation quality in the patent domain has reached or even exceeded that of the best systems in the news domain.
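To make the BLEU figures above more concrete, here is a minimal sketch of corpus-level BLEU with one reference per sentence: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It is a simplified illustration, not a reimplementation of the NIST scoring script, whose tokenization and other details may lead to different numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """BLEU with one reference per sentence; both inputs are lists of token lists."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matched[n - 1] += sum(min(count, r[gram]) for gram, count in h.items())
            total[n - 1] += sum(h.values())
    if min(matched) == 0:
        return 0.0  # some n-gram order has no match at all
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
score = corpus_bleu([hyp], [ref])
```

Because matches are counted against the references, a test set with four references per sentence gives a hypothesis more chances to match than a single-reference test set, which is one reason why scores on differently constructed test sets are not directly comparable.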

2. Translation speed

Translation speed is measured in words translated per hour. The system currently translates 140,000 words per hour. Assuming an average of 20 words per patent title and 200 words per patent abstract, the system can automatically translate 84,000 titles or 8,400 abstracts in 12 hours of operation. This speed fully meets the needs of day-to-day assisted translation work.
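The throughput figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above (12 hours, 84,000 titles of 20 words, abstracts of 200 words) and derives the hourly rate they imply.

```python
WORDS_PER_TITLE = 20
WORDS_PER_ABSTRACT = 200
HOURS = 12
TITLES_IN_12_HOURS = 84_000

# Total words processed in the 12-hour window, from the title figure.
total_words = TITLES_IN_12_HOURS * WORDS_PER_TITLE

# Implied hourly rate and the equivalent number of abstracts.
words_per_hour = total_words // HOURS
abstracts_in_12_hours = total_words // WORDS_PER_ABSTRACT

print(words_per_hour, abstracts_in_12_hours)
```

The title and abstract totals are consistent with each other: both correspond to 1,680,000 words in 12 hours, or 140,000 words per hour.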

To sum up, the system adopts internationally leading statistical translation technology and combines it with template-based and memory-based translation methods to realize a practical Chinese-English patent document translation system. The system provides not only automatic translation but also convenient assisted translation functions. Users can revise the results of automatic translation, dynamically add dictionaries and templates to guide translation, and add corrected results to the translation memory in batches. The system has now entered the trial stage, and its translation quality and speed meet the basic needs of users.

(Author Fu Lei, He, graduate student of Institute of Computing Technology, Chinese Academy of Sciences)