Quick vs. Optimal Algorithms in Data Fusion
Data fusion is the integration of two separate databases into a single database with joint information. This is achieved by the statistical matching of the records in the two databases based upon similarity on a list of variables. Within this framework, there exists a variety of algorithms to achieve this goal.
In the paper An Anatomy of Data Fusion, presented at the Worldwide Readership Research Symposium, we described how the stepping stone algorithm for the transportation problem in operations research can be applied to data fusion. This algorithm has three important properties. First, it preserves the distributions of the variables in each database. Second, it retains all the cases in each database. Third, within the constraints imposed by the first two properties, it achieves the highest possible success in matching cases. In this sense, the stepping stone algorithm is said to be optimal, since no other algorithm can achieve a better matching rate subject to those two sets of constraints.
But, as we have shown in the aforementioned paper, the stepping stone algorithm is time-consuming to implement and execute. This motivated us to seek a near-optimal but quick algorithm for data fusion. In this paper, we discuss one such algorithm and compare its properties against those of the optimal algorithm.
To best illustrate the quick algorithm, we will construct a small artificial example. We assume that we have two databases. The first database consists of persons in a people meter television panel from whom we collect television viewing information. Following standard terminology, we call this the Television Audience Measurement (TAM) sample. The second database consists of persons in a consumer survey from which print readership and product usage information are collected. We call this the Target Group Index (TGI) sample. At present, this TAM-TGI data fusion configuration exists, or will come into existence very shortly, in six countries in the Americas (Argentina, Brazil, Colombia, Mexico, Puerto Rico and the USA).
For simplicity, we assume that we have 10 TAM persons and 8 TGI persons. Each TAM person and each TGI person is assigned a case weight in their respective studies. The case weight is the number of persons in the universe that the sample respondent is supposed to represent. The example is displayed below; note that the weights in each sample sum to 20,000 because the two samples are supposed to represent the same universe.
TAM person   TAM weight      TGI person   TGI weight
TAM 1        2000            TGI 1        1500
TAM 2        2500            TGI 2        4000
TAM 3        3000            TGI 3        3000
TAM 4        1300            TGI 4        2000
TAM 5        2200            TGI 5        2500
TAM 6        3500            TGI 6        1800
TAM 7        1400            TGI 7        2200
TAM 8        2600            TGI 8        3000
TAM 9        1000            TOTAL        20000
TAM 10       500
TOTAL        20000
The stepping stone algorithm requires an arbitrary initial feasible solution; it then runs through a series of iterations to determine whether different matching solutions offer further improvement. If no improvement is possible, the current solution can be proven to be optimal. There are a number of ways to construct the initial feasible solution. A very simple one is the so-called Northwest Corner Rule, which we will illustrate here.
To begin, we construct a matrix with the TAM persons in the rows and the TGI persons in the columns, with the case weights being written into the marginal totals.
               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (1500)  (4000)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1  (2000)
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10 (500)
The Northwest Corner Rule begins at the northwest corner of this matrix, at the confluence of TAM 1 and TGI 1. Into this cell we write the smaller of the case weights of TAM 1 and TGI 1. Since TAM 1 has a weight of 2000 and TGI 1 has a weight of 1500, the minimum is 1500.
               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (1500)  (4000)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1  (2000)  1500
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10 (500)
To complete this first step, we subtract the cell entry from the row and column totals. Thus, for TGI 1, everything has been transported into the cell and nothing is left. For TAM 1, 1500 has been transported into the cell and 500 remains.
               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (0)     (4000)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1  (500)   1500
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10 (500)
In the next step, we must move either to the east (right) or to the south (down) of the current cell. In this case, since TGI 1 has been totally exhausted, we cannot move down. TAM 1 still has 500 left, so we move right. Again, we enter the minimum of the current row and column totals. Since TAM 1 has 500 and TGI 2 has 4000, the minimum is 500. After we enter 500 in the TAM 1-TGI 2 cell, TAM 1 is completely exhausted while TGI 2 is reduced from 4000 to 3500.
               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (0)     (3500)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1  (0)     1500    500
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10 (500)
The algorithm continues from the current working cell, proceeding either to the east (right) or to the south (down) according to which choice is available, and there should be one and only one choice. In rare cases we may encounter the degenerate situation in which the working cell exactly exhausts both its row and its column; the algorithm then continues simply by moving east (or, since the choice is arbitrary, south), assigning a zero weight to the new working cell. Eventually, this eastward-southward serpentine path must end in the far southeast (bottom-right) corner with everything accounted for. The next table shows the final solution for our example.
               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (0)     (0)     (0)     (0)     (0)     (0)     (0)     (0)
TAM 1  (0)     1500    500
TAM 2  (0)             2500
TAM 3  (0)             1000    2000
TAM 4  (0)                     1000    300
TAM 5  (0)                             1700    500
TAM 6  (0)                                     2000    1500
TAM 7  (0)                                             300     1100
TAM 8  (0)                                                     1100    1500
TAM 9  (0)                                                             1000
TAM 10 (0)                                                             500
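As an aside, the serpentine procedure just described is mechanical enough to sketch in a few lines of code. The following is our own illustrative sketch (not the implementation used in any production fusion system); given the example weights, it reproduces the allocation in the final table.

```python
def northwest_corner(row_weights, col_weights):
    """Trace the serpentine path, returning (row, col, weight) triples.

    Assumes the two weight lists sum to the same universe total.
    """
    rows, cols = list(row_weights), list(col_weights)
    i = j = 0
    cells = []
    while True:
        w = min(rows[i], cols[j])
        cells.append((i, j, w))
        rows[i] -= w
        cols[j] -= w
        if i == len(rows) - 1 and j == len(cols) - 1:
            return cells  # far southeast corner reached
        # Move east if the current column is exhausted, south if the
        # current row is; in the degenerate case (both exhausted at
        # once) we arbitrarily move east, assigning a zero weight next.
        if cols[j] == 0 and j < len(cols) - 1:
            j += 1
        else:
            i += 1

tam = [2000, 2500, 3000, 1300, 2200, 3500, 1400, 2600, 1000, 500]
tgi = [1500, 4000, 3000, 2000, 2500, 1800, 2200, 3000]
solution = northwest_corner(tam, tgi)
# The first cells reproduce the steps worked above:
# (0, 0, 1500), (0, 1, 500), (1, 1, 2500), ...
```

Note that the path visits at most one new row or column per step, which is the source of the linear running time discussed below.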
Physically, the fused database consists of records which correspond to the occupied cells in this table. Thus the first record carries a case weight of 1500, the television information from person TAM 1 and the magazine/product usage information from person TGI 1; the second record carries a case weight of 500, the television information from person TAM 1 and the magazine/product usage information from person TGI 2; and so on. It is easy to verify that (1) all the original TAM and TGI persons are present in this fused database with their original case weights, although a person may have been split across more than one record; and (2) the frequency distributions of the original TAM and TGI variables are preserved, including all intra-TAM and intra-TGI relationships (such as reach, frequency, duplication, etc.).
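The weight-preservation claim can be checked mechanically. A small sketch, using the (TAM index, TGI index, weight) triples read off the final table:

```python
# (TAM index, TGI index, weight) triples from the final solution table.
cells = [
    (0, 0, 1500), (0, 1, 500),
    (1, 1, 2500),
    (2, 1, 1000), (2, 2, 2000),
    (3, 2, 1000), (3, 3, 300),
    (4, 3, 1700), (4, 4, 500),
    (5, 4, 2000), (5, 5, 1500),
    (6, 5, 300), (6, 6, 1100),
    (7, 6, 1100), (7, 7, 1500),
    (8, 7, 1000),
    (9, 7, 500),
]
tam_weights = [2000, 2500, 3000, 1300, 2200, 3500, 1400, 2600, 1000, 500]
tgi_weights = [1500, 4000, 3000, 2000, 2500, 1800, 2200, 3000]

# Summing each person's fused-record weights recovers the original
# case weight exactly, on both the TAM side and the TGI side.
tam_recovered = [sum(w for i, _, w in cells if i == r) for r in range(10)]
tgi_recovered = [sum(w for _, j, w in cells if j == c) for c in range(8)]
assert tam_recovered == tam_weights
assert tgi_recovered == tgi_weights
```

Because every person's weight is recovered in full, any distribution computed from the TAM (or TGI) side of the fused file coincides with the distribution in the original database.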
The Northwest Corner Rule has time complexity linear in the sum of the sample sizes (that is, O(m+n), where m and n are the respective sample sizes), and its solution contains at most m+n-1 cells, so its space complexity is likewise linear in the sum of the sample sizes. Therefore, it can be implemented and executed very rapidly. Unfortunately, as we have shown above, the Northwest Corner Rule makes no reference to any matching considerations, and may produce a matching solution that is far from optimal. After all, it could just as well have been a Southwest Corner Rule working from the bottom-left corner to the upper-right corner.
In the classification of algorithms, the Northwest Corner Rule is a greedy algorithm: it accepts the next available candidate into the solution without ever considering whether switching some matches might improve the overall welfare. By contrast, the stepping stone algorithm re-evaluates over and over again whether any switch can bring about an improvement. We quote the words of one of the original investigators of the transportation problem, the Russian Leonid V. Kantorovich, who drew an analogy between the capitalist and socialist systems:
I want to emphasize again, that the greater part of the problems of which I shall speak, relating to the organization and planning of production, are connected specifically with the Soviet system of economy and in the majority of cases do not arise in the economy of a capitalist. There the choice of output is determined not by the plan but by the interests and profits of individual capitalists. The owner of the enterprise chooses for production those goods which at a given moment have the highest price, can most easily be sold, and therefore give the largest profit. The raw material used is not that of which there are huge supplies in the country, but that which the entrepreneur can buy most cheaply. The question of the maximum utilization of equipment is not raised; in any case, the majority of enterprises work at half capacity.
In the USSR the situation is different. Everything is subordinated not to the interests and advantages of the individual enterprise, but to the task of fulfilling the state plan. The basic task of an enterprise is the fulfillment and overfulfillment of its plan, which is a part of the general state plan. Moreover, this not only means fulfillment of the plan in aggregate terms (i.e. total value of output, total tonnage, and so on), but the certain fulfillment of the plan for all kinds of output; that is, the fulfillment of the assortment plan (the fulfillment of the plan for each kind of output, the completeness of individual items of output, and so on).
With the Northwest Corner Rule, we can still exercise some control over the matching performance by arranging the TAM and TGI cases in a suitable manner. There are two principal approaches. The more effective one is to define mutually exclusive and exhaustive strata (e.g. ABC+ males, ABC+ females, etc.) within which matching takes place; this guarantees that the defining strata are preserved completely. Unfortunately, the use of such a stratification plan is limited by sample size considerations.
The second approach to improving the Northwest Corner Rule is to sort each sample by a number of variables (for example, by age first, by head of household status within age, by housewife status within head of household status within age, and so on). If, say, half of each sample were young people, then after sorting by age the young people would appear in either the left-hand columns or the top rows, and the Northwest Corner Rule would bring them together. But while this may work for the variables that rank high in the sort priority, the performance weakens significantly as we move down the sort order.
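Such a sort amounts to ordering each sample by a tuple of keys. A minimal sketch, with hypothetical respondent records whose field names and sort priority are illustrative only, not those of the actual TAM or TGI studies:

```python
# Hypothetical respondents; fields are illustrative assumptions.
sample = [
    {"id": "P1", "age": 52, "head_of_household": True,  "housewife": False},
    {"id": "P2", "age": 19, "head_of_household": False, "housewife": False},
    {"id": "P3", "age": 23, "head_of_household": False, "housewife": True},
    {"id": "P4", "age": 47, "head_of_household": True,  "housewife": True},
]

def sort_key(person):
    # Age first, then head-of-household status within age, then
    # housewife status within head-of-household within age.
    # "not" makes True sort before False within each age.
    return (person["age"],
            not person["head_of_household"],
            not person["housewife"])

sample.sort(key=sort_key)
```

Sorting both samples with the same key before applying the Northwest Corner Rule places similar people in nearby rows and columns, so the serpentine path tends to pair them, with the effect strongest for the first key and progressively weaker for later keys.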
We will now present some empirical comparisons from the TAM-TGI Argentina fusion project. This project consists of fusing six months' worth of TAM data from 3,187 persons in the IBOPE Argentina people meter panel with 5,914 TGI respondents in the Gran Buenos Aires area. For the fusion, the samples were divided into ten mutually exclusive and exhaustive strata (sex by five socioeconomic levels). Within each stratum, the matching variables are (in order of priority): age, pay television status, head of household status, housewife status, presence of infant, presence of children 2-5 and presence of children 6-11.
Here are the empirical performance statistics for the two algorithms.
COMPUTATIONAL TIME
MATCHING SUCCESS
After the fusion, the matched cases were compared with respect to the matching variables. The following table gives the percentage of times in which the value of the TAM variable was the same as that of the corresponding TGI variable on the matched records in the fused database.
Variable                               Northwest Corner Rule   Stepping Stone algorithm
Sex                                    100.0%                  100.0%
Socioeconomic level                    100.0%                  100.0%
Age                                    81.9%                   87.7%
Pay television status                  68.2%                   93.7%
Head of household status               72.4%                   90.5%
Housewife status                       80.6%                   91.8%
Presence of infant (0-23 months old)   84.3%                   91.1%
Presence of children 2-5 years old     69.9%                   87.2%
Presence of children 6-11 years old    62.3%                   83.9%
The first two variables are guaranteed to have 100% successful match rates by definition. For all the other variables, the stepping stone algorithm was significantly more successful in producing matches. To the extent that these matching variables are important in the determination of the accuracy of fusion, failure to make matches results in poorer fusion.
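A match-rate statistic of this kind can be computed directly from the fused file. The sketch below is our own illustration with toy records, not the project's actual tabulation code, and it assumes the rate is weighted by case weight (the original text does not specify whether the published figures are weighted or unweighted):

```python
def match_rate(fused, var):
    """Share of fused case weight on which the TAM and TGI donors agree."""
    total = sum(rec["weight"] for rec in fused)
    agree = sum(rec["weight"] for rec in fused
                if rec["tam"][var] == rec["tgi"][var])
    return 100.0 * agree / total

# Toy fused records with hypothetical values, for illustration only.
fused = [
    {"weight": 1500,
     "tam": {"age_group": "18-24"}, "tgi": {"age_group": "18-24"}},
    {"weight": 500,
     "tam": {"age_group": "18-24"}, "tgi": {"age_group": "25-34"}},
]
rate = match_rate(fused, "age_group")  # 1500 of 2000 agree -> 75.0
```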
DISCUSSION
The two pieces of empirical performance data confirm our sense of economic justice: the better quality product comes from investing greater resources and time. Since the TAM-TGI Argentina fusion is meant to be conducted once every six months, without the pressure of real-time or near-real-time delivery, there is no compelling need to trade away accuracy for speed. What we have shown, though, is that the extra effort we are taking should result in a significant improvement in the accuracy of the fusion.
Finally, for the sake of completeness, we note in passing that there exist other algorithms whose complexity lies between the two presented here (for example, Russell's approximation method and Vogel's approximation method), but there does not appear to be any algorithm that simultaneously offers the near accuracy of the optimal algorithm and the computational speed of the Northwest Corner Rule.
(posted by Roland Soong, 1/5/2002)