Practice and Thinking of Diversity Algorithm in 58 Tribes
Overview

Taking "individual-level diversity optimization of the recommendation system" as its theme, this article walks through the overall architecture and the optimization details at the recall layer, the rule layer and the diversity layer. For the MMR and DPP algorithms it covers both principle and practice, compares their effects with charts, and describes a custom distance designed around the business's own characteristics.

Background

In a recommendation system, diversity is, alongside relevance, one of the important indicators of system quality. The two, however, often conflict. Starting from business metrics, this article explores how to balance diversity and relevance, introduces a practical diversity-algorithm scheme, and shows how diversity can ultimately be used as a means of improving business metrics.

1. Significance and Difficulties of Diversity Algorithm

In recommendation systems, accuracy has long been the most important measure of quality, and most work studies how to improve it. But recommendation quality is multi-dimensional: a growing body of research shows that purely accuracy-driven recommendation does not necessarily improve user experience and satisfaction, and can easily foster information cocoons (filter bubbles). How to balance accuracy and diversity so as to improve the business's overall metrics has therefore become another optimization direction for recommendation systems.

In practice, we summarized several difficulties of diversity algorithm:

1) The model's optimization objective is fuzzy

As is well known, various user behaviors (clicks, conversions, dwell time, shares, etc.) can serve as targets for optimizing accuracy: we can directly collect user behavior as the model's target label and design and optimize the model around it. Diversity, by contrast, is a statistic of a set of items, so it is hard to find a direct user behavior to serve as the model's optimization target.

2) Business metrics and diversity metrics conflict

The metrics the business cares about (conversion rate, dwell time, etc.) and diversity metrics are not simply positively or negatively correlated. If we pursue diversity purely to raise the diversity metric, the final results will drift away from the business goal and recommendation quality will drop.

To sum up, in the diversity practice of the 58 Tribes recommendation system we ruled out evaluating algorithms by diversity metrics alone. Given the tribe's business characteristics, our goal is to use diversity as a means of improving business metrics, evaluating each algorithm comprehensively against multiple business metrics together with diversity metrics, so that both kinds of metric improve together and the user experience improves with them.

2. Diversity Algorithm Practice in 58 Tribes

In the research of diversity algorithm, diversity is usually divided into two types:

Individual-user diversity, which aims to avoid recommending overly similar items to a single user, thereby improving experience and satisfaction;

All-user diversity, which aims to improve the distribution of long-tail items.

Because our main goal at this stage is the individual user's experience, we chose individual-user diversity as our direction in practice.

1. Recommendation system architecture

Online layered architecture of the recommendation system

In order to ensure the diversity of recommended data, we optimized it in three places:

1) Recall layer: supply diverse candidate sets from the data sources, ensuring source diversity by improving the coverage of topics, categories and authors in the coarse-ranking candidate pool;

2) Rule layer: with little loss of relevance, ensure the diversity of the data entering the fine-ranking pool by improving the coverage of topics and categories in the fine-ranking candidate pool;

3) Diversity layer: re-rank the merged data for diversity, ensuring the final output is diversified.

Data determines the upper bound of the effect; algorithms only approach that bound. The diversity layer can, at best, only guarantee the diversity of the data coming out of fine ranking, and its result is constrained by the diversity of the upstream candidate pools, so it is better still if diversity is already maintained in the recall layer's data sources.

2. Diversified data sources

2.1 Diversification at the recall layer

Recall layer architecture diagram

In the recall stage we use multi-channel recall. To guarantee data diversity, we enrich the data sources with several different recall strategies, for example:

a diversity algorithm that derives each user's personalized yet diverse topics, categories and authors for recall, so that recall balances diversity and relevance;

long-tail, time-based and serendipity recalls that add novelty and diversity to the data;

collaborative recall branches based on relationships and attributes that further improve data diversity.

By adding these diversity and novelty recalls, topic-based coverage of the coarse-ranking candidate pool increased by about 120% and category-based coverage by about 100%.

2.2 Diversification at the rule layer

Rule layer architecture diagram

The items we recommend are heterogeneous entities of several types, so before entering fine ranking (the rule layer) they are bucketed by type. Each type goes through a coarse-to-fine truncation step whose main criterion is the coarse-ranking relevance score. To keep truncation from destroying the source diversity built up by the recall layer, we first bucket each type's coarse-ranking results by category and apply diversity control within the buckets, then adjust the proportions of the types, backfill data, and do any necessary filtering. With little loss of relevance, this improves the diversity of the fine-ranking data.

Because the rule layer's diversity logic runs per type bucket, it is unaffected by the blending of heterogeneous entities and can use any of the common diversity algorithms.

After the rule layer's diversity intervention, topic-based coverage of the fine-ranking candidate pool increased by about 80% and category-based coverage by about 70%.

The changes in online ctr and uvctr after adding diversity at the rule layer are shown in the following figure:

Comparison of the effects of rule layer algorithms

As the figure shows, the differences between the algorithms are small, with some performing only slightly better than others. At the rule layer the main gain is that diversity algorithms replace the previous heuristic rules, automating the diversification of the data sources.

3. Diversity layer based on re-ranking

For algorithm parameter adjustment, we mainly refer to three business indicators:

pvctr: pv click-through rate; uvctr: uv click-through rate; avgpv: average page views per user.

Common diversity algorithms such as Binomial, EE, DPP, xQuAD, PM2, Bayes and MMR are applied within a single type at the rule layer. They are not directly suitable for the heterogeneous entities of the diversity layer, however: 58 Tribes' entities are diverse and heterogeneous, a single vector-generation method struggles to measure heterogeneous items in one dense space, and the interest distributions of different entity types overlap little.

3.1 Diversity layer architecture

Diversity layer architecture diagram

Limited by space, and weighing business fit and latency, the following focuses on the application of MMR and DPP in engineering practice.

3.2 MMR principle

MMR stands for Maximal Marginal Relevance, a greedy re-ranking algorithm that reduces redundancy while preserving relevance. In a recommendation scenario it lets us recommend relevant items to the user while keeping the result list diverse. The formula is:

MMR = argmax_{Di ∈ R\S} [ λ · Sim1(Di, Q) − (1 − λ) · max_{Dj ∈ S} Sim2(Di, Dj) ]

where:

S is the ordered set of already-selected documents, generally the output of relevance ranking;

R is the candidate set;

Di is the next candidate document;

Dj is a document already in S;

Sim1, Sim2 are similarity measures, e.g. cosine similarity;

Sim1(Di, Q) is the relevance of candidate Di to the query Q;

max_{Dj ∈ S} Sim2(Di, Dj) is Di's maximum similarity to the documents already selected;

λ is a weight coefficient that trades off relevance against diversity in the results.

As a greedy algorithm, each round it evaluates the formula above, moves the maximizer into the ordered result set, deletes it from the candidate set, updates the intermediate state, and continues to the next round until the result set is large enough or no candidates remain.
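The greedy loop above can be sketched in a few lines. All data below (scores, similarities, the λ value) are made up for illustration:

```python
# Illustrative sketch of the MMR greedy loop described above.
# All data below (scores, similarities, lambda) are made up.

def mmr_rerank(candidates, score, sim, lam=0.7, k=10):
    """Greedily pick the candidate maximizing
    lam * relevance - (1 - lam) * (max similarity to already-picked items)."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_value(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * score[d] - (1 - lam) * redundancy
        best = max(pool, key=mmr_value)
        selected.append(best)   # move the winner into the ordered result set
        pool.remove(best)       # and delete it from the candidate set
    return selected

# Items 'a' and 'b' are near-duplicates; with lam=0.5 MMR interleaves
# topics instead of returning the two top-scored duplicates first.
scores = {"a": 0.9, "b": 0.88, "c": 0.5}
sims = {frozenset("ab"): 0.95, frozenset("ac"): 0.1, frozenset("bc"): 0.1}
sim = lambda x, y: sims.get(frozenset(x + y), 0.0)
print(mmr_rerank(["a", "b", "c"], scores, sim, lam=0.5, k=3))  # ['a', 'c', 'b']
```

With λ = 1 the loop degenerates to plain relevance ranking, which is a handy sanity check when wiring this into a pipeline.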

Implementation process

MMR algorithm flow chart

Experimental effect

Comparison chart of MMR parameter effect

To unify the algorithm flow we swapped the roles of λ and 1 − λ relative to the original formula, and to plot several indicators in one chart, pvctr and uvctr are scaled. As the figure shows, as the diversity parameter is adjusted downward the results become more relevant, and uvctr peaks at a particular parameter value. Meanwhile average page views per user rise steadily, indicating that as diversity increases some users are attracted to click and browse more content; the overall effect is best around the uvctr peak.

Engineering implementation notes

When computing the distance term, we use the average distance between the current document and the already-selected documents rather than the maximum, which avoids the later documents in the result list being highly similar to one another.

The inter-item similarity function can be the cosine similarity mentioned above, but that requires a unified item vector space across the candidate list so that any two vectors are comparable. Our recommendation results span many heterogeneous types, and in that situation such a similarity is hard to interpret. Our approach is a business-driven custom distance, introduced in detail below.

3.3 DPP principle

DPP stands for Determinantal Point Process, a probability distribution defined over the power set of a discrete ground set. If Y is a random subset drawn from the process, then for any subset A of the ground set:

P(A ⊆ Y) = det(K_A)

The variables in the formula:

Y: the random subset drawn by the determinantal point process;

A: any subset of the ground set;

det(K_A): the probability that every element of A appears in the sample;

K: a real symmetric matrix called the kernel matrix. Each off-diagonal element K_ij can be read as the similarity between items i and j, and is inversely related to the probability of sampling them together; each diagonal element K_ii can be read as item i's relevance to the user, and is directly related to its sampling probability;

K_A: the principal submatrix of K indexed by A, whose order is the number of elements in A.

The following figure illustrates the determinantal point process:

Schematic diagram of the determinantal point process

For a two-element set, P({i, j} ⊆ Y) = K_ii · K_jj − K_ij² is the probability that items i and j are sampled together. As the formula shows, the more similar two items are, the smaller the probability that they are sampled at the same time.
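A tiny numerical check of this property, using a made-up marginal kernel:

```python
import numpy as np

# Numerical check of the property above, with a made-up marginal kernel K:
# P({i, j} ⊆ Y) = det(K_{ij}) = K_ii * K_jj - K_ij**2,
# so a similar pair is sampled together less often than a dissimilar one.
K = np.array([[0.6, 0.5, 0.0],
              [0.5, 0.6, 0.0],
              [0.0, 0.0, 0.6]])   # items 0 and 1 are similar (K_01 = 0.5)

def pair_prob(K, i, j):
    sub = K[np.ix_([i, j], [i, j])]     # 2x2 principal submatrix
    return float(np.linalg.det(sub))

print(pair_prob(K, 0, 1))   # ≈ 0.11, suppressed by similarity
print(pair_prob(K, 0, 2))   # ≈ 0.36, dissimilar pair
```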

The idea is to treat re-ranking as subset selection: choose a high-quality yet diverse subset from the original set, with DPP maintaining the balance between quality and diversity. DPP is an efficient probabilistic model that collapses complex probability computations into determinants. It can also be read as a sampling process in which the probability that two elements both enter the subset depends on each element's own probability and on the similarity between them.

DPP implementation schemes

The first is the window-based re-ranking scheme proposed by Google (YouTube). The kernel construction in that paper is

L_ii = q_i², and L_ij = α · q_i · q_j · exp(−D_ij / (2σ²)) for i ≠ j

where q_i is item i's quality (relevance) score, D_ij is the distance between items i and j, and α and σ are free parameters. When α = 1 this is equivalent to a standard RBF kernel. When α < 1 the off-diagonal of the matrix shrinks, which roughly corresponds to diversifying all items; when α > 1 the off-diagonal grows proportionally, with the opposite effect of treating all items as more similar.

As α grows, small sets become more probable and large sets less so. Since α > 1 may make L non-positive-semidefinite, and to guarantee a PSD kernel each time, the paper projects the matrix: compute the eigendecomposition of L and replace any negative eigenvalue with 0.
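A minimal sketch of this kernel construction plus the PSD projection; `q` (quality scores) and `D` (pairwise distances) are made-up example data, not the paper's code:

```python
import numpy as np

# Sketch of the kernel described above and the PSD repair used when alpha > 1.
# q (quality scores) and D (pairwise distances) are made-up example data.
def build_kernel(q, D, alpha=1.0, sigma=1.0):
    q = np.asarray(q, dtype=float)
    L = alpha * np.outer(q, q) * np.exp(-D / (2.0 * sigma ** 2))
    np.fill_diagonal(L, q ** 2)          # L_ii = q_i^2 (pure relevance)
    return L

def project_psd(L):
    # Eigendecompose and replace any negative eigenvalue with 0.
    w, V = np.linalg.eigh(L)
    return (V * np.clip(w, 0.0, None)) @ V.T

q = [1.0, 0.8, 0.6]
D = np.array([[0.0, 0.1, 2.0],
              [0.1, 0.0, 2.0],
              [2.0, 2.0, 0.0]])          # items 0 and 1 are nearly identical
L = project_psd(build_kernel(q, D, alpha=1.5))
print(np.linalg.eigvalsh(L).min() >= -1e-9)   # True: kernel is now PSD
```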

The paper also proposes learning the kernel with a deep model (a deep Gram kernel), reducing the cost of grid-searching the parameters.

The second is the scheme from Gillenwater et al. (University of Pennsylvania), a general solution to DPP maximum a posteriori (MAP) inference; but the kernel matrix must be recomputed for each item added, and the latency is not promising.

The third is the MAP inference solution proposed by Hulu. Addressing the high time complexity of traditional MAP inference, it proposes an improved greedy algorithm for fast solving:

By constructing the log-determinant function

f(Y) = log det(L_Y)

MAP inference becomes a submodular-maximization problem, and the greedy algorithm is used to approximately solve this NP-hard maximization.

Each step selects one item and adds it to the result set Y_g (initialized to the empty set) until the stopping condition is met. The selected item must satisfy:

j = argmax_{i ∈ Z \ Y_g} [ log det(L_{Y_g ∪ {i}}) − log det(L_{Y_g}) ]

Because computing determinants directly is expensive, the paper Cholesky-decomposes L_{Y_g} = V·Vᵀ (with V initialized empty) and, through a series of transformations, obtains

log det(L_{Y_g ∪ {i}}) = log det(L_{Y_g}) + log(d_i²), where V·c_i = L_{Y_g, i} and d_i² = L_ii − ‖c_i‖²

The authors further derive an incremental update (see the original paper for the derivation): after item j is selected, each remaining candidate i is updated by e_i = (L_{ji} − ⟨c_j, c_i⟩) / d_j, c_i ← [c_i  e_i], d_i² ← d_i² − e_i².
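The greedy loop with these incremental updates can be sketched as follows; this is a simplified illustration of the paper's algorithm, not our production Java (EJML) code:

```python
import numpy as np

# Simplified sketch of the fast greedy MAP loop above (not our EJML/Java code).
# Each round adds the item with the largest gain log(d_i^2) and updates every
# remaining candidate's Cholesky row c_i and gain d_i^2 incrementally.
def dpp_greedy(L, k, eps=1e-10):
    n = L.shape[0]
    c = np.zeros((n, k))          # growing rows of the Cholesky factor
    d2 = np.diag(L).copy()        # marginal gains d_i^2, initially L_ii
    selected = []
    j = int(np.argmax(d2))
    while len(selected) < k:
        selected.append(j)
        if len(selected) == k:
            break
        m = len(selected) - 1
        for i in range(n):
            if i in selected:
                continue
            # e_i = (L_ji - <c_j, c_i>) / d_j ; then d_i^2 -= e_i^2
            ei = (L[j, i] - c[j, :m] @ c[i, :m]) / np.sqrt(d2[j])
            c[i, m] = ei
            d2[i] -= ei * ei
        d2[j] = -np.inf
        j = int(np.argmax(d2))
        if d2[j] < eps:           # kernel is rank-deficient: stop early
            break
    return selected

L = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 0.5]])
print(dpp_greedy(L, 2))   # [0, 2]: skips the near-duplicate item 1
```

The early stop when the gain falls below `eps` is also where the "fewer results than expected on a rank-deficient kernel" behavior discussed below comes from.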

We implemented all three schemes. The second has too much latency to be used online. Our straightforward implementation of the first computes determinants directly, with higher latency than the third. The third does not repair the kernel matrix, so when the matrix is rank-deficient the re-ranked result may contain fewer items than expected. Weighing data-volume requirements against efficiency, we finally chose the third scheme, proposed by Hulu.

DPP implementation process

DPP algorithm flow chart

Experimental effect

Effect diagram corresponding to DPP parameter change

To allow several indicators to be compared in the same chart, pvctr and avgpv are scaled. As the figure shows, as the diversity parameter is adjusted, pvctr grows and peaks at about 0.7. uvctr and average page views per user rise steadily, indicating that as diversity increases, users who would not otherwise click are attracted to click and browse more content; uvctr also peaks at 0.7, so 0.7 gives the best effect. The final parameter value on each DPP curve is very low because at 0.999 the α used to construct the kernel becomes very large; after the exponential transform many kernel entries saturate, the matrix becomes rank-deficient, and only a small amount of data is returned. This is an abnormal state in which every indicator drops, and something to avoid when tuning the algorithm.

Engineering implementation notes

In the implementation we mainly use EJML, an efficient open-source Java matrix library, the most efficient of the libraries we have tried so far.

In the engineering implementation we mainly follow the paper's approach of replacing the relevance score r_i with

exp(α · r_i)

to trade off diversity against relevance, where

α = θ / (2(1 − θ))

The positive-semidefiniteness guarantee mentioned in the paper,

S_ij = (1 + ⟨f_i, f_j⟩) / 2

has little impact in our application, mainly because our similarity matrix is custom-defined.
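A minimal sketch of this trade-off kernel, L = Diag(exp(α·r)) · S · Diag(exp(α·r)); the `S` here is an arbitrary made-up similarity matrix standing in for our custom one:

```python
import numpy as np

# Sketch of the construction above: L = Diag(g) @ S @ Diag(g) with
# g_i = exp(alpha * r_i) and alpha = theta / (2 * (1 - theta)).
# S is a stand-in similarity matrix; ours is business-defined.
def tradeoff_kernel(r, S, theta=0.5):
    alpha = theta / (2.0 * (1.0 - theta))
    g = np.exp(alpha * np.asarray(r, dtype=float))
    return g[:, None] * S * g[None, :]

r = [1.0, 0.5]
S = np.array([[1.0, 0.2],
              [0.2, 1.0]])
L = tradeoff_kernel(r, S, theta=0.5)      # alpha = 0.5
print(L[0, 0])   # exp(2 * 0.5 * 1.0) = e ≈ 2.718
```

Larger θ inflates the diagonal relative to the off-diagonal, pushing the greedy selection toward relevance; smaller θ favors diversity.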

The main optimization point of DPP is efficiency. Using Hulu's optimization, which avoids recomputing determinants, the average latency of re-ranking a full list of 100 items is only about 4 ms.

For kernel construction, because we want a custom distance that is interpretable and closely tied to the business, both the similarity matrix and the kernel matrix are custom-defined.

The full list is re-ranked in batches by window. By the properties of submodular functions, small sets are more mutually exclusive than large ones, so windowing actually works better.

For DPP parameter tuning, we first fix the custom-distance parameters and search for a suitable θ, then fix θ and tune the distance parameters, refining the parameters by grid search.

Because the kernel matrix may be rank-deficient, the number of returned items can be less than expected. This works fine for us and has little business impact. If the returned count must be exact, the kernel matrix needs repairing; Google's repair method (eigenvalue clamping) can be used, at some latency cost, with the delay roughly doubling.

3.4 Diversity layer algorithm effect comparison

Comparison of original algorithm, MMR and DPP

The diversity layer mainly weighs stability and latency. Most of our traffic goes to the two implicit diversity algorithms, MMR and DPP, while the original algorithm was controlled by heuristic rules. Comparing the algorithms at their optimal parameters, DPP is better overall than MMR, and both are a large improvement over the original algorithm. The table below shows each algorithm's changes relative to the original:

Indicator \ Algorithm    MMR      DPP
pvctr                    +3.4%    +5.8%
uvctr                    +5.4%    +7.9%
avgpv                    +4.2%    +6.0%

Changes in business indicators for each algorithm

Custom distance

To improve interpretability and tie in more closely with the business, we use a custom distance. Its benefits are:

Strong interpretability: the custom distance is built around notions people can understand, making the distance explainable;

Closer business integration: the custom distance directly uses business-related information;

Generality: it can be used between different heterogeneous entities, which other distances cannot do.

The custom distances we use in practice are as follows:

1. Jaccard / Hamming distance

This custom distance is an approximation built by flattening business attributes into a Hamming-style vector.

Custom distance: Jaccard / Hamming distance

Concretely, we construct a binary vector from the business attributes, first compute the Hamming-style agreement, and then normalize it as a Jaccard similarity.
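A hedged sketch of the idea; the attribute sets below are invented examples, not our real business fields:

```python
# Hedged sketch of the idea; the attribute sets below are invented examples.
# Business attributes are flattened into set "bits"; agreement is counted
# Hamming-style and then normalized as a Jaccard similarity.
def jaccard_distance(attrs_a, attrs_b):
    a, b = set(attrs_a), set(attrs_b)
    if not a and not b:
        return 0.0
    inter = len(a & b)              # bits where both vectors agree
    union = len(a | b)
    return 1.0 - inter / union      # distance = 1 - Jaccard similarity

print(jaccard_distance({"job", "beijing"}, {"job", "shanghai"}))  # ≈ 0.667
```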

2. Custom distance: tree-model distance

The custom tree-model distance is built by combining business attributes top-down in a tree with per-level attenuation.

Custom distance: tree model

The leaves of the tree give the discrete distance values.

We have tried the custom distances above; in terms of interpretability, business fit and online experiment results, the tree-model distance is currently the best fit.

Summary and prospect

There are many dimensions to evaluating the diversity of a recommendation system. This article measures algorithm quality not by diversity metrics alone but in combination with business metrics, weighing diversity against the metrics the business cares about, so that the diversity algorithm ultimately improves the business metrics. On the engineering side, it describes diversity work at the recall, rule and re-ranking stages, especially the crucial re-ranking stage, where computation was optimized on top of the MMR and DPP algorithms and a distance was customized around the business data's characteristics, so that item similarity can be measured along multiple dimensions as the business requires.

At present, learned diversity algorithms tend to outperform non-learned ones, and explicit methods tend to outperform implicit ones; but the relationships among our many heterogeneous entity types are not so intuitive, and we will explore this in combination with business characteristics in the future. Diversity algorithms based on reinforcement learning are another research direction that we have not yet tried.

References:

1. C.-N. Ziegler, S. M. McNee, J. A. Konstan, G. Lausen. Improving Recommendation Lists Through Topic Diversification. WWW 2005.

2. J. Carbonell, J. Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries. SIGIR 1998.

3. J. Gillenwater, A. Kulesza, B. Taskar. Near-Optimal MAP Inference for Determinantal Point Processes. NIPS 2012.

4. M. Wilhelm, A. Ramanathan, A. Bonomo, S. Jain, E. H. Chi, J. Gillenwater. Practical Diversified Recommendations on YouTube with Determinantal Point Processes. CIKM 2018.

5. L. Chen, G. Zhang, H. Zhou. Fast Greedy MAP Inference for Determinantal Point Process to Improve Recommendation Diversity. NeurIPS 2018.

6. Caoys (CSDN blogger). An Introduction to Determinantal Point Processes. CSDN, 2019.

About the author:

Liu Fashuai, senior algorithm engineer, 58 Group.

Zhou Jianbin, algorithm architect and member of the technical committee, 58 Group.
