Quick vs. Optimal Algorithms in Data Fusion

Data fusion is the integration of two separate databases into a single database containing their joint information.  This is achieved by statistically matching the records in the two databases based upon their similarity on a list of common variables.  Within this framework, there exists a variety of algorithms to achieve this goal.

In the paper An Anatomy of Data Fusion, presented at the Worldwide Readership Research Symposium, we described how the stepping-stone algorithm for the transportation problem in operations research can be used in the context of data fusion.  This algorithm has three important properties.  First, it preserves the distributions of the variables in each database.  Second, it retains all the cases in each of the databases.  Third, within the constraints imposed by the first two properties, it achieves the highest possible success in finding matches among the cases.  In this sense, the stepping-stone algorithm is said to be an optimal algorithm, since no other algorithm can achieve a better matching rate subject to these two sets of constraints.

But, as we showed in that paper, the stepping-stone algorithm is time-consuming to implement and execute.  This motivated us to explore the possibility of a near-optimal, but quick, algorithm for data fusion.  In this paper, we discuss one such algorithm and compare its properties against those of the optimal algorithm.

To illustrate the quick algorithm, we construct a small artificial example.  We assume that we have two databases.  The first database consists of persons in a people meter television panel from whom we collect television viewing information.  Following standard terminology, we call this the Television Audience Measurement (TAM) sample.  The second database consists of persons in a consumer survey from whom print readership and product usage information are collected.  We call this the Target Group Index (TGI) sample.  At present, this TAM-TGI data fusion configuration exists, or will shortly come into existence, in six countries in the Americas (Argentina, Brazil, Colombia, Mexico, Puerto Rico and the USA).

For simplicity, we assume that we have 10 TAM persons and 8 TGI persons.  Each TAM person and each TGI person is assigned a case weight in the respective study.  The case weight is the number of persons in the universe that the sample respondent is supposed to represent.  The example is displayed below; note that the weights in each sample sum to 20,000 because the two samples are supposed to represent the same universe.

TAM person   TAM weight      TGI person   TGI weight
TAM 1              2000      TGI 1              1500
TAM 2              2500      TGI 2              4000
TAM 3              3000      TGI 3              3000
TAM 4              1300      TGI 4              2000
TAM 5              2200      TGI 5              2500
TAM 6              3500      TGI 6              1800
TAM 7              1400      TGI 7              2200
TAM 8              2600      TGI 8              3000
TAM 9              1000      TOTAL             20000
TAM 10              500
TOTAL             20000

The stepping-stone algorithm requires an arbitrary initial feasible solution, and then proceeds through a series of iterations to determine whether a different matching solution would be an improvement.  If no improvement is possible, then the current solution can be proven to be optimal.  There are a number of different ways to construct the initial feasible solution.  A very simple one is the so-called Northwest Corner Rule, which we illustrate here.

To begin, we construct a matrix with the TAM persons in the rows and the TGI persons in the columns, with the case weights being written into the marginal totals.

               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (1500)  (4000)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1  (2000)
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10  (500)

The Northwest Corner Rule begins at the northwest (top-left) corner of this matrix, which is the cell at the confluence of TAM 1 and TGI 1.  Into this cell, we write the smaller of the case weights of TAM 1 and TGI 1.  Since TAM 1 has a weight of 2000 and TGI 1 has a weight of 1500, the minimum is 1500.

               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (1500)  (4000)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1  (2000)  1500
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10  (500)

To complete this first step, we subtract the cell entry from the row and column totals.  Thus, for TGI 1, everything has been transported into the cell and nothing is left.  For TAM 1, 1500 has been transported into the cell and 500 remains.

               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (0)     (4000)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1   (500)  1500
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10  (500)

In the next step, we must move either to the east (=right) or to the south (=downwards) of the current cell.  Since TGI 1 has been totally exhausted, we cannot move down.  TAM 1 still has 500 left, so we move right.  Again, we enter the minimum of the current row and column totals.  Since TAM 1 has 500 and TGI 2 has 4000, the minimum is 500.  After we enter 500 into the TAM 1-TGI 2 cell, TAM 1 is completely exhausted while TGI 2 is reduced from 4000 to 3500.

               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (0)     (3500)  (3000)  (2000)  (2500)  (1800)  (2200)  (3000)
TAM 1     (0)  1500    500
TAM 2  (2500)
TAM 3  (3000)
TAM 4  (1300)
TAM 5  (2200)
TAM 6  (3500)
TAM 7  (1400)
TAM 8  (2600)
TAM 9  (1000)
TAM 10  (500)

The algorithm continues from the current working cell, proceeding either to the east (=right) or to the south (=down) according to which of the row and the column remains unexhausted; there should be one and only one choice.  Rarely, we may encounter the degenerate case in which the working cell exactly exhausts both its row and its column; the algorithm then continues simply by moving east (or, since the choice is arbitrary, south), assigning a zero weight to the new working cell.  Eventually, this eastward-southward serpentine path must end in the far southeast (bottom-right) corner with everything accounted for.  In the next table, we show the final solution for our example.

               TGI 1   TGI 2   TGI 3   TGI 4   TGI 5   TGI 6   TGI 7   TGI 8
               (0)     (0)     (0)     (0)     (0)     (0)     (0)     (0)
TAM 1     (0)  1500    500
TAM 2     (0)          2500
TAM 3     (0)          1000    2000
TAM 4     (0)                  1000    300
TAM 5     (0)                          1700    500
TAM 6     (0)                                  2000    1500
TAM 7     (0)                                          300     1100
TAM 8     (0)                                                  1100    1500
TAM 9     (0)                                                          1000
TAM 10    (0)                                                          500

Physically, the fused database consists of records corresponding to the occupied cells in this table.  Thus the first record carries a case weight of 1500, the television information from person TAM 1 and the magazine/product usage information from person TGI 1; the second record carries a case weight of 500, the television information from person TAM 1 and the magazine/product usage information from person TGI 2; and so on.  It is easy to verify that (1) all of the original TAM and TGI persons are present in the fused database with their original case weights, although a person may have been split across more than one record; and (2) the frequency distributions of the original TAM and TGI variables are preserved, including all intra-TAM and intra-TGI relationships (such as reach, frequency, duplication, etc.).
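This serpentine allocation is simple enough to carry out mechanically.  The following is a minimal sketch in Python (the function name and data layout are our own, not taken from any fusion software) that reproduces the table above from the two weight vectors:

    def northwest_corner(tam_weights, tgi_weights):
        # Returns the fused records as (tam_index, tgi_index, weight) triples,
        # one triple per occupied cell of the allocation matrix.
        tam = list(tam_weights)        # remaining TAM (row) totals
        tgi = list(tgi_weights)        # remaining TGI (column) totals
        assert sum(tam) == sum(tgi), "samples must represent the same universe"
        fused = []
        i = j = 0
        while i < len(tam) and j < len(tgi):
            w = min(tam[i], tgi[j])    # transport as much weight as possible
            fused.append((i, j, w))
            tam[i] -= w
            tgi[j] -= w
            if tam[i] == 0:
                i += 1                 # row exhausted: move south (in the
            else:                      # degenerate case, the next cell simply
                j += 1                 # receives a zero weight)
        return fused

    tam = [2000, 2500, 3000, 1300, 2200, 3500, 1400, 2600, 1000, 500]
    tgi = [1500, 4000, 3000, 2000, 2500, 1800, 2200, 3000]
    for i, j, w in northwest_corner(tam, tgi):
        print("TAM %d + TGI %d: weight %d" % (i + 1, j + 1, w))
    # First two lines of output: "TAM 1 + TGI 1: weight 1500"
    #                            "TAM 1 + TGI 2: weight 500"

Each pass through the loop exhausts either a row or a column, which is why, as noted below, the running time is linear in the combined sample size.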

The Northwest Corner Rule has time complexity of linear order in the sum of the sample sizes (that is, O(m+n), where m and n are the respective sample sizes), because each step exhausts either a row or a column, so that at most m+n-1 cells are ever occupied; its space complexity is likewise of linear order in the sum of the sample sizes, O(m+n).  Therefore, it can be implemented and executed very rapidly.  Unfortunately, as we have seen above, the Northwest Corner Rule makes no reference to any matching considerations, and may produce a matching solution that is far from optimal.  After all, it could just as well have been a Southwest Corner Rule, working from the bottom-left corner to the top-right corner.

In the classification of algorithms, the Northwest Corner Rule is a greedy algorithm: it accepts the next available candidate into the solution without ever considering whether switching some matches might improve the overall welfare.  By contrast, the stepping-stone algorithm re-evaluates, over and over again, whether any switching can bring about an improvement.  We quote the words of one of the original investigators of the transportation problem, the Russian mathematician Leonid V. Kantorovich, who drew an analogy between the capitalist and socialist systems:

I want to emphasize again, that the greater part of the problems of which I shall speak, relating to the organization and planning of production, are connected specifically with the Soviet system of economy and in the majority of cases do not arise in the economy of a capitalist.  There the choice of output is determined not by the plan but by the interests and profits of individual capitalists.  The owner of the enterprise chooses for production those goods which at a given moment have the highest price, can most easily be sold, and therefore give the largest profit.  The raw material used is not that of which there are huge supplies in the country, but that which the entrepreneur can buy most cheaply.  The question of the maximum utilization of equipment is not raised; in any case, the majority of enterprises work at half capacity.

In the USSR the situation is different.  Everything is subordinated not to the interests and advantages of the individual enterprise, but to the task of fulfilling the state plan.  The basic task of an enterprise is the fulfillment and overfulfillment of its plan, which is a part of the general state plan.  Moreover, this not only means fulfillment of the plan in aggregate terms (i.e. total value of output, total tonnage, and so on), but the certain fulfillment of the plan for all kinds of output; that is, the fulfillment of the assortment plan (the fulfillment of the plan for each kind of output, the completeness of individual items of output, and so on).

With the Northwest Corner Rule, we can still exercise some control over the matching performance by arranging the TAM and TGI cases in a suitable manner.  There are two principal approaches.  The more effective one is to define mutually exclusive and exhaustive strata (e.g. ABC+ males, ABC+ females, etc.) within which matching takes place; this guarantees that the stratum-defining variables will be matched perfectly.  Unfortunately, the use of such a stratification plan is limited by sample size considerations.

The second approach to improving the Northwest Corner Rule is to sort each sample by a number of variables (for example, by age first, by head of household status within age, by housewife status within head of household status within age, and so on), as in the sketch below.  If, say, half of each sample were young people, then after sorting by age the young people would appear in the top rows and the left-hand columns, and the Northwest Corner Rule would bring them together.  But while this may work well for the variables that rank high in the sort priority, the performance weakens significantly as we move down the sort order.
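As an illustration of this second approach, here is a small sketch in Python that continues the northwest_corner function above; the miniature samples, record layout and field names are hypothetical, chosen only for the illustration:

    # Hypothetical miniature samples: each person is a dict with a case weight
    # and the matching variables (here, age and head-of-household status).
    tam_sample = [{"weight": 2000, "age": 25, "hoh": 0},
                  {"weight": 2500, "age": 60, "hoh": 1},
                  {"weight": 3000, "age": 40, "hoh": 1}]
    tgi_sample = [{"weight": 4000, "age": 62, "hoh": 1},
                  {"weight": 1500, "age": 24, "hoh": 0},
                  {"weight": 2000, "age": 41, "hoh": 0}]

    sort_key = lambda p: (p["age"], p["hoh"])   # age first, then HoH within age
    tam_sample.sort(key=sort_key)
    tgi_sample.sort(key=sort_key)

    # With both samples in the same order, similar people occupy nearby rows
    # and columns, and the serpentine path tends to pair them together.
    fused = northwest_corner([p["weight"] for p in tam_sample],
                             [p["weight"] for p in tgi_sample])

The sorting does not change any case weights; it only rearranges the rows and columns of the matrix so that the serpentine path passes through cells whose row and column persons resemble one another.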

We will now present some empirical comparisons from the TAM-TGI Argentina fusion project.  This project consists of fusing six months' worth of TAM data from 3,187 persons in the IBOPE Argentina people meter panel with 5,914 TGI respondents in the Gran Buenos Aires area.  For the fusion, the samples were divided into ten mutually exclusive and exhaustive strata (sex by five socio-economic levels).  Within each stratum, the matching variables are (in order of priority): age, pay television, head of household status, housewife status, presence of infant, presence of children 2-5 and presence of children 6-11.

Here are the empirical performance statistics for the two algorithms.

COMPUTATIONAL TIME

MATCHING SUCCESS

After the fusion, the matched cases were compared with respect to the matching variables.  The following table gives the percentage of cases in which the value of the TAM variable was the same as that of the corresponding TGI variable on the matched records in the fused database.

Variable                               Northwest Corner Rule   Stepping-stone algorithm
Sex                                                   100.0%                     100.0%
Socio-economic level                                  100.0%                     100.0%
Age                                                    81.9%                      87.7%
Pay television status                                  68.2%                      93.7%
Head of household status                               72.4%                      90.5%
Housewife status                                       80.6%                      91.8%
Presence of infant (0-23 months old)                   84.3%                      91.1%
Presence of children 2-5 years old                     69.9%                      87.2%
Presence of children 6-11 years old                    62.3%                      83.9%

The first two variables are guaranteed to have 100% match rates because they define the fusion strata.  For all the other variables, the stepping-stone algorithm was significantly more successful in producing matches.  To the extent that these matching variables are important determinants of the accuracy of fusion, failure to make matches results in poorer fusion.
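For concreteness, here is a sketch of how such an agreement rate can be computed from the fused records produced by northwest_corner above; we are assuming (our assumption, not stated in the project documentation) that the rate is weighted by the fused case weights:

    def agreement_rate(fused, tam_values, tgi_values):
        # fused: (tam_index, tgi_index, weight) triples; tam_values/tgi_values
        # hold the value of one matching variable for each person in each sample.
        agreed = sum(w for i, j, w in fused if tam_values[i] == tgi_values[j])
        total = sum(w for i, j, w in fused)
        return 100.0 * agreed / total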

DISCUSSION

The two pieces of empirical performance data confirm our sense of economic justice: the better quality product comes from investing greater resources and time.  Since the TAM-TGI Argentina fusion is meant to be conducted once every six months without the pressure of real-time or near-real-time delivery, there is no compelling need to trade away accuracy for speed.  What we have shown, though, is that the extra effort that we are taking should result in significant improvement in the accuracy of fusion.

Finally, for the sake of completeness, we note in passing that there exist other algorithms whose complexity lies between the two presented here (for example, Russell's approximation method and Vogel's approximation method), but there does not appear to be any algorithm that simultaneously offers nearly the accuracy of the optimal algorithm and the computational speed of the Northwest Corner Rule.

 (posted by Roland Soong, 1/5/2002)

