Varieties of Data Integration Experiences


This is a highly technical article that addresses methodological issues in the integration of information from different databases.  Ordinarily, such an article would have been submitted to a professional journal for publication.  We have opted instead to publish it on the World Wide Web.  The major reason is that data integration has assumed a central place in media research in the Americas in recent months, so that easy, rapid access to research articles will help disseminate information and knowledge about this subject.  By comparison, going through the usual path of formal publication would have resulted in significant publication delays, and the journals may not be readily available to all those who are interested.


In the study of media behavior, we are fortunate to have research studies that collect detailed information in various areas.  Some examples are television people meter panels, radio and print audience surveys, and product usage studies such as the TGI Colombia survey.

Each database is presumed to have been fine-tuned to do the best job within its domain.  Very often, we are required to correlate information that exists in different databases.  For example, we may wish to know the duplication between sports television programs and sports magazines; or we may want to know what the best television programs are for reaching potential purchasers of laptop personal computers.  Data integration is the process whereby information from two or more databases is integrated into the analytical process.

There is in fact a variety of ways by which one might go about integrating data from multiple databases.  It is the purpose of this article to describe some of these processes, and to put them into a common perspective.


The simplest data integration method is to use a surrogate in one database to stand in for something that is absent in this database and present elsewhere.  In another article (Defining Target Groups in Media Planning), we described a detailed example using data from the TGI Colombia study.  We will recapitulate the situation here in brief.

Our stated plan was to advertise to the group of cinema goers via a suitable schedule of television programs.  Unfortunately, the television viewing database contains only demographic information (age, sex, socio-economic level, geography, etc) and television usage, and nothing about cinema attendance.  From other qualitative and quantitative sources, one knows that young people (persons 16-34) are more likely to attend cinema than other groups (32% versus 22% in the rest of the population).  Accordingly, one might opt to use persons 16-34 as the surrogate target group in the television database.

In our previous article, we documented the error rates associated with this decision.  While this surrogate group is significantly more likely to attend cinema, the facts are that most persons 16-34 are not regular cinema goers, and most cinema goers are not between 16 and 34 years old.  So while this is a convenient choice, it carries a significant amount of error.
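These error rates can be made concrete with a back-of-the-envelope calculation.  The 32% and 22% incidences come from the article; the population split (here, 35% of persons 12+ falling in the 16-34 group) is our illustrative assumption, not a figure from the TGI study:

```python
# Illustrative only: assume 35% of the population aged 12+ is 16-34.
POP = 1_000_000
young = 0.35 * POP          # surrogate target group: persons 16-34
old = POP - young
goers_young = 0.32 * young  # incidence among 16-34, from the article
goers_old = 0.22 * old      # incidence among 35+, from the article
goers = goers_young + goers_old

# Most persons 16-34 are not cinema goers...
share_young_who_go = goers_young / young            # 0.32
# ...and most cinema goers are not 16-34.
share_goers_who_are_young = goers_young / goers
print(f"{share_goers_who_are_young:.0%}")           # 44% under these assumptions
```

Under this assumed population split, fewer than half of all cinema goers fall inside the surrogate group, which is exactly the kind of error rate documented in the previous article.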

We note that the use of surrogates has historically been regarded as unpleasant but unavoidable, and it is rarely subjected to the probing scrutiny of error rates that is accorded to the more explicit methods of data integration.


The use of a single surrogate is an all-or-none proposition that results in the inclusion of all those qualified (e.g. all persons 16-34 years old) and the exclusion of all those who did not qualify (e.g. all persons 35 years or older).  The harshness of this 'hard thresholding' can be softened by segmenting people into mutually exclusive and exhaustive segments, and then applying suitably derived adjustment factors.

To illustrate this, we continue with the example of cinema goers.  

The simple segmentation scheme consisted of two segments --- persons 16 to 34 years old, versus persons 35 years or older, with cinema attendance incidences of 32% and 22% respectively.

Suppose a television program has a total audience of 400,000 persons 12 years or older, with 200,000 persons 16-34 and 200,000 persons 35+; the estimated number of cinema goers among its audience is then given by (200,000 x 0.32) + (200,000 x 0.22) = 64,000 + 44,000 = 108,000.  

Suppose another television program has a total audience of 400,000 persons 12+, with 300,000 persons 16-34 and 100,000 persons 35+; the estimated number of cinema goers among its audience is then given by (300,000 x 0.32) + (100,000 x 0.22) = 96,000 + 22,000 = 118,000.  

The second program therefore delivers 10,000 more cinema goers.  This is how the data from one database can be linked to another via segmentation.
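The arithmetic above can be sketched as a small routine.  The segment incidences and audience figures are those of the example; the function name is our own:

```python
# Cinema-goer incidence by segment, from the TGI Colombia example:
# persons 16-34 -> 32%, persons 35+ -> 22%.
INCIDENCE = {"16-34": 0.32, "35+": 0.22}

def estimated_cinema_goers(audience_by_segment):
    """Estimate cinema goers in a program's audience via segmentation linkage."""
    return sum(INCIDENCE[seg] * n for seg, n in audience_by_segment.items())

program_a = {"16-34": 200_000, "35+": 200_000}
program_b = {"16-34": 300_000, "35+": 100_000}

print(round(estimated_cinema_goers(program_a)))  # 108000
print(round(estimated_cinema_goers(program_b)))  # 118000
```

The same routine extends directly to finer segmentations: add entries to the incidence table and the audience breakdown.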

The segmentation can be constructed from anything that is available.  Certainly, we can expand our two segments above into finer age/sex groups in order to capture more detailed behavior.  It is worthwhile to mention that, in the USA, there are geodemographic segmentation systems based upon the demographic characteristics (e.g. median income, median age, residential density, etc) of geographic areas such as postal zip codes and census blocks.

The limitation of such a segmentation scheme is that the number of segments must be small (or, at least, is restricted by the number of cases present in each proposed segment).  Thinking about the case of the cinema goers in Colombia, we remind the reader that the reference study is the TGI Colombia survey of 7,035 respondents.  There would be no problem using 14 age/sex groups (male 12-19, male 20-24, male 25-34, male 35-44, male 45-54, male 55-64, female 12-19, female 20-24, female 25-34, female 35-44, female 45-54, female 55-64) with an average of 500 respondents per segment.  Upon further thought, given the cost of movie admission, we would be inclined to include socio-economic level (upper, middle, lower) in the segmentation.  And when we think about movie availability, we remember that large cities such as Bogotá have many cinema houses whereas rural areas are under-serviced, so we would want to include some type of geographical consideration (such as urban versus rural).  Before we know it, we are looking at 14 x 3 x 2 = 84 segments with fewer than 90 respondents per segment.  Necessarily, we have to use a small number of segments for the sake of statistical reliability and manageable complexity, at the cost of discarding some potentially useful information.
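The sample-size arithmetic above is simple enough to check mechanically (7,035 is the TGI Colombia sample size quoted in this article):

```python
respondents = 7_035
age_sex_groups = 14   # the 14 age/sex cells listed above
ses_levels = 3        # upper, middle, lower
geography = 2         # urban versus rural

segments = age_sex_groups * ses_levels * geography
print(segments)                  # 84
print(respondents / segments)    # 83.75 respondents per segment on average
```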

Let us now consider the error rates involved in segmentation linkage.  If 32% of all persons 16-34 are cinema goers and a television program's viewers include 300,000 persons 16-34, then does that mean 32% of those 300,000 viewers are cinema goers?  This is true only if one assumes independence between cinema attendance and television viewing, an assumption we would expect to be false in many cases (what if that program were an entertainment news program, for instance?).


In predictive modeling, we use one database (such as a product usage database) to build a descriptive structural model relating a list of predictor variables (such as age, sex, socio-economic level, geography) to one or more outcome variables (such as cinema attendance).  For the model, one can use any favorite statistical method --- discriminant analysis, multiple linear regression, probit regression, logistic regression, CHAID, classification and regression trees, multivariate adaptive regression splines, kernel methods, nearest neighbor matching, support vector machines, neural networks, etc.  As such, segmentation linkage may also be considered a simple form of predictive modeling.

The structural model can then be applied to the target database (such as a television people meter panel).  Each person is 'scored' on the basis of the predictor variables (age, sex, socio-economic level, geography, etc., which are assumed to be defined in an identical way as in the other database) to obtain a predicted value.  Very often, this predicted value takes the form of a probability (such as the likelihood of attending cinema).  This predicted probability can then be carried forward as a weighting factor.  Alternatively, a person can be classified randomly as either a 'yes' or a 'no' as follows --- use a uniform pseudo-random number generator to obtain a real number between 0 and 1; if this is less than or equal to the predicted probability, then this person is a 'yes'; otherwise, this person is a 'no.'
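The scoring step can be sketched as follows.  The logistic coefficients here are invented for illustration, not estimates from any actual survey; only the mechanics (score on the predictors, convert to a probability, then either weight or randomly classify) follow the text:

```python
import math
import random

# Hypothetical logistic model for cinema attendance; coefficients are invented.
COEF = {"intercept": -1.5, "is_16_34": 0.55, "is_upper_ses": 0.40, "is_urban": 0.30}

def predicted_probability(person):
    """Score a person on the common predictor variables."""
    score = COEF["intercept"]
    score += COEF["is_16_34"] * person["is_16_34"]
    score += COEF["is_upper_ses"] * person["is_upper_ses"]
    score += COEF["is_urban"] * person["is_urban"]
    return 1.0 / (1.0 + math.exp(-score))  # logistic link

def classify(person, rng=random):
    """Randomly assign 'yes'/'no' in proportion to the predicted probability."""
    return "yes" if rng.random() <= predicted_probability(person) else "no"

viewer = {"is_16_34": 1, "is_upper_ses": 0, "is_urban": 1}
p = predicted_probability(viewer)  # this value can be carried as a weight
```

The random classification preserves the predicted incidence in expectation, while the weighting approach avoids the extra sampling variability.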

Predictive modeling has an advantage over segmentation linkage in that it is not limited by segmentation cell sizes, as long as the number of parameters (such as regression coefficients) to be estimated is relatively small compared to the total sample size.

In predictive modeling, there is a chance of model misspecification, both in terms of the list of predictor variables and the structural form of the model.  The myriad of applicable statistical methods reflects the fact that there are many different possible data assumptions.  For example, a logistic regression model is often specified in terms of main effects among the predictor variables, out of parsimony or absence of knowledge, so that important interaction effects may be omitted.

Let us now consider the error rates in predictive modeling.  The easiest way to measure this is through a cross-validation procedure --- the initial database from which the predictive model was built is split into two mutually exclusive and exhaustive portions; the predictive model is built on one portion (known as the training sample), and applied to the other portion (known as the validation sample).  The error rate will be based upon the comparison of the actual versus the predicted values in the validation sample. 
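A minimal split-sample cross-validation might look like the sketch below.  The generated data, the 50/50 split, and the trivial majority-class 'model' are all illustrative stand-ins for a real database and a real predictive model:

```python
import random

rng = random.Random(0)

# Toy data: (is_16_34, is_cinema_goer) pairs, generated for illustration
# using the 32%/22% incidences from the article and an assumed 35% young share.
data = [(a, int(rng.random() < (0.32 if a else 0.22)))
        for a in [rng.random() < 0.35 for _ in range(2_000)]]

rng.shuffle(data)
half = len(data) // 2
training, validation = data[:half], data[half:]   # mutually exclusive, exhaustive

# Stand-in "model": within each segment, predict the training-sample majority.
def majority(segment_value):
    outcomes = [y for a, y in training if a == segment_value]
    return int(sum(outcomes) * 2 >= len(outcomes))

model = {True: majority(True), False: majority(False)}

# Error rate: actual versus predicted outcomes in the validation sample.
errors = sum(1 for a, y in validation if model[a] != y)
error_rate = errors / len(validation)
```

The same skeleton applies unchanged when the stand-in model is replaced by logistic regression, CHAID, or any of the other methods listed above.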


This method of data integration is commonly referred to as data fusion.  Basically, the idea is to take two databases of individuals, all of whom have a common set of variables (e.g. age, sex, socio-economic level, geography, etc).  Each individual in one database is matched with one (or more) individual(s) in the other database on the basis of the proximity of their common variables.  When a match is made, the information from the matching individuals is copied over to the fused database, which will now contain the common variables plus the information unique to the separate databases.
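A nearest-neighbour sketch of the matching step follows, ignoring for the moment any currency-preservation constraints.  The toy records, the distance function, and its weights are our illustrative choices, not part of any production fusion procedure:

```python
# Two toy databases sharing common variables (age, sex, ses), each carrying
# its own unique variable; all records are invented for illustration.
tv_panel = [
    {"age": 22, "sex": "F", "ses": 2, "watches_sports_tv": 1},
    {"age": 48, "sex": "M", "ses": 1, "watches_sports_tv": 0},
]
product_survey = [
    {"age": 25, "sex": "F", "ses": 2, "cinema_goer": 1},
    {"age": 50, "sex": "M", "ses": 1, "cinema_goer": 0},
    {"age": 19, "sex": "M", "ses": 3, "cinema_goer": 1},
]

def distance(a, b):
    """Proximity on the common variables; the weights are arbitrary choices."""
    return (abs(a["age"] - b["age"]) / 10
            + (a["sex"] != b["sex"])
            + abs(a["ses"] - b["ses"]))

def fuse(recipients, donors, donor_vars):
    """Copy each nearest donor's unique variables onto the recipient."""
    fused = []
    for r in recipients:
        donor = min(donors, key=lambda d: distance(r, d))
        fused.append({**r, **{v: donor[v] for v in donor_vars}})
    return fused

fused = fuse(tv_panel, product_survey, ["cinema_goer"])
# fused records now carry both watches_sports_tv and cinema_goer
```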

Strictly speaking, statistical matching is a special form of predictive modeling.  Given an individual in one database, the predicted value is the value of the closest matching individual in the other database.  As such, it is a nonparametric technique that is less susceptible to model misspecification than some of the strongly parametric methods (such as discriminant analysis), but at the potential cost of losing some statistical power.

We separate statistical matching from predictive modeling because it is presently being used to tackle a much larger problem than the latter has been used for.  Most typically, predictive modeling has been applied to a single outcome variable at a time (e.g. cinema attendance, leisure air travel, baby diaper purchases, etc).

The grand problem for data integration is that of a multimedia database that can be used for the optimization of advertising schedules that incorporate multimedia elements (e.g. television, print, radio, internet, outdoor, etc).  Predictive modeling would be hard pressed to simultaneously and consistently estimate the audiences of thousands of television programs, radio stations and press titles, while preserving the correlations between these entities.

An equally important issue is the preservation of the currency values of the audience estimates in the different databases.  The term currency is used to denote the fact that these audience estimates are the basis by which buying and selling of advertising space are made.  It would be unacceptable for the data integration process to significantly increase or decrease the audiences for entire media or specific titles in any systematic or even random fashion.  Most predictive modeling methods are hard pressed to preserve these currency values.

In a previous note (Data Fusion in Latin America), we presented a particular method of statistical matching that is able to fuse two or more databases in such a way that the currency values are preserved and the maximum success rate in matching on common variables under these constraints is achieved.

The error rates associated with statistical matching can be addressed with a split-sample cross-validation process.  The details of that error analysis are worthy of a separate lengthy article.  Since statistical matching has only recently been introduced in the Americas for data integration purposes, its error analysis has been subjected to intense scrutiny.  Here, we wish to point out that the alternatives to statistical matching are the other varieties of data integration that we have discussed here, and their corresponding error rates have not been subjected to the same level of scrutiny.  It has apparently not been widely recognized that the alternate procedures currently in use have significantly higher error rates than statistical matching.

(posted by Roland Soong, 5/28/2001)
