Data Fusion: Integrating Information from Multiple Studies
(Based upon a speech delivered at the Sixth CASRO Annual Technology Conference, New York City, June 22nd, 2001)
CASRO is the Council of American Survey Research Organizations, which has more than 300 member survey research organizations. Although we do not know the exact number, it would be fair to say that these survey research organizations conduct thousands and thousands of surveys each year. Is this kind of expenditure and effort necessary? Not all of the time perhaps, but we have to acknowledge that different studies are designed and executed to serve different purposes.
These surveys can be classified into custom and syndicated research studies. A custom study is one that is designed for the specific needs of one (or more) user(s). A syndicated study is one which is designed to fit the needs of a community of users who may use the study for different purposes. This distinction is somewhat blurred as a custom study may become syndicated. Here are some of the major syndicated studies in the United States of America:
|Syndicated study|Coverage|
|---|---|
|Nielsen Media Research: NTI|National television viewing|
|Nielsen Media Research: NSI|Local television viewing|
|Nielsen Media Research: NHTI|National Hispanic television viewing|
|Nielsen Media Research: NHSI|Local Hispanic television viewing|
|Nielsen Media Research: NHI|New television technologies such as cable, pay cable, VCRs, DVD players, satellite dishes|
|Arbitron: Arbitron Radio|National and local market radio listening|
|Statistical Research: RADAR|National radio listening|
|Mediamark: National Study|Magazine readership, multimedia usage, product usage|
|Simmons: National Consumer Study|Magazine readership, multimedia usage, product usage|
|Mendelsohn Media: Affluent Study|Affluent market: Magazine readership, product usage|
|Millward-Brown IntelliQuest: Computer Industry Media Study|Computer industry media usage, product usage|
|J.D. Power & Associates: Car & Truck Media Studies|Media usage, automobile information|
|MARS: OTC/DTC Pharmaceutical Study|Magazine readership, multimedia usage, OTC/DTC pharmaceutical usage|
|Scarborough Research|Newspaper readership, multimedia usage, product usage, lifestyle|
|Jupiter Media Metrix|Internet usage|
|Net Ratings|Internet usage|
These large syndicated studies are financed by different industries (radio, television, advertising agencies, advertisers, etc). In light of the interests of the sponsors, these studies are designed and optimized to serve specific purposes. For example, the relative simplicity of local markets may mean that it is feasible to collect newspaper readership information over the telephone; the complexity of the television environment now requires the use of people meter technology where this is affordable; the even greater complexity of the internet means that software tracking is the only viable approach.
Nevertheless, there will come a time when data users find that the information that they need to make decisions exists across multiple studies. We will offer two simple examples:
These cross-study applications would be easy if we had a single-source study that collected all the required data from the same people. In practice, such a single-source study would require a massive number of respondents from whom a massive amount of data must be collected. Imagine that each respondent must:
The set of people who are willing and able to complete these tasks regularly must be considered NOT representative of the general population. Someday, perhaps, a single source study may be possible with a new technology such as the portable people meter watch. Until then, we are faced with the more immediate problem of how to integrate results from multiple separate studies.
Now there are many different ways to integrate information from multiple studies. We find it convenient to talk about two approaches (note: we do not imply that these two approaches exhaust all the possibilities).
As an example of the modeling approach, we will describe how one might tackle the problem of producing target group ratings.
On one hand, we have a television study (such as the Nielsen Television Index people meter panel) for which we have detailed television viewing information. On the other hand, we have a product usage study (such as MRI or NPD) for which we have detailed product brand preferences, volumetrics and attitudes. We are now interested in the television ratings for a target group defined in terms of product usage (such as frequent business travelers who make more than 12 trips per year). The two studies have a list of variables in common: age, sex, region, presence of children, pet ownership, household size, auto ownership, presence of pay television services, etc.
Step 1: Build a statistical model from the product usage study, relating the product usage to the set of common variables. Here, one can choose from among a number of statistical methods (e.g. simple cross-tabulations, linear regression, discriminant analysis, logistic regression, probit regression, classification trees, neural networks, k-nearest neighbor methods, kernel methods, support vector machines, etc.).
Step 2: Apply the statistical model to the television usage study to 'score' the respondents. This predicted 'score' is typically the probability of using the product; it can sometimes be a volumetric measure (such as the predicted number of business trips). This predicted score can be used to create target group ratings, either by using the predicted score as a weight adjustment or by classifying people randomly according to the probability.
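The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the simplest method on the list (a weighted cross-tabulation over the common variables), and the records, the variable names (`traveler`, `show`) and the function names are all hypothetical. It shows the weight-adjustment route; the random-classification route is not shown.

```python
from collections import defaultdict

COMMON = ("age", "sex")  # hypothetical common variables shared by both studies

def fit_crosstab(rows, target):
    """Step 1: weighted incidence of `target` within each cell of the common variables."""
    users = defaultdict(float)
    totals = defaultdict(float)
    for r in rows:
        cell = tuple(r[k] for k in COMMON)
        totals[cell] += r["weight"]
        users[cell] += r["weight"] * r[target]
    return {cell: users[cell] / totals[cell] for cell in totals}

def score(rows, model, default=0.0):
    """Step 2: attach the predicted probability to each television respondent."""
    return [model.get(tuple(r[k] for k in COMMON), default) for r in rows]

def target_rating(rows, scores, program):
    """Rating within the target group, using the score as a weight adjustment."""
    base = sum(r["weight"] * p for r, p in zip(rows, scores))
    viewers = sum(r["weight"] * p * r[program] for r, p in zip(rows, scores))
    return viewers / base

# Hypothetical product usage study (traveler = 1 for frequent business travelers).
product_study = [
    {"age": "25-54", "sex": "M", "weight": 1000, "traveler": 1},
    {"age": "25-54", "sex": "M", "weight": 1000, "traveler": 0},
    {"age": "25-54", "sex": "F", "weight": 1000, "traveler": 1},
    {"age": "25-54", "sex": "F", "weight": 3000, "traveler": 0},
    {"age": "55+", "sex": "M", "weight": 2000, "traveler": 0},
    {"age": "55+", "sex": "F", "weight": 2000, "traveler": 0},
]

# Hypothetical television study (show = 1 if the person watched the program).
tv_study = [
    {"age": "25-54", "sex": "M", "weight": 2000, "show": 1},
    {"age": "25-54", "sex": "F", "weight": 2000, "show": 0},
    {"age": "55+", "sex": "M", "weight": 2000, "show": 0},
    {"age": "25-54", "sex": "F", "weight": 2000, "show": 1},
]

model = fit_crosstab(product_study, "traveler")
scores = score(tv_study, model)
rating = target_rating(tv_study, scores, "show")
total_rating = sum(r["weight"] * r["show"] for r in tv_study) / sum(r["weight"] for r in tv_study)
print(f"all persons: {total_rating:.2f}, predicted travelers: {rating:.2f}")
```

In this toy data the program rates higher among predicted travelers than among all persons, but only because age and sex happen to separate travelers from non-travelers; with weaker common variables the two ratings would converge.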
What are the limitations of the modeling approach? It goes without saying that the modeling approach, as in fact with any form of data integration, is limited by the predictive power of the common variables. In our experience, the availability of the 'right' variables has a much greater impact on the outcome than the choice of a statistical technique.
We are actually more interested in the multimedia planning problem: On one hand, we have a television study for which we have detailed television viewing information that is analyzable by a limited set of demographics (such as age, sex, region, etc). On the other hand, we have a magazine readership survey for which we have detailed magazine reading information that is analyzable by a limited set of demographics (such as age, sex, region, etc). A media planner designs an advertising schedule consisting of television spots and magazine insertions, and wishes to quantify the characteristics of this schedule (e.g. reach and frequency).
Typically, the media planner would want to evaluate different combinations of target groups, television programs and magazines. Since we cannot print all conceivable combinations ahead of time, the information is best delivered in the form of an interactive database. In so doing, there are a number of properties that one might reasonably demand from this integrated database:
It would be very difficult for any modeling method to meet these requirements. Instead, the approach known as 'data fusion' is used for this type of problem. In a nutshell, data fusion is a form of statistical matching where respondents from one database are matched against respondents from the other database on the basis of similarity of common variables. The end result is a respondent-level database.
Data fusion is not brand new. For the multimedia planning application here, my company (Kantar Media Research) has been using a form of data fusion in the United Kingdom since the late 1980's. Currently, we are using data fusion to roll out integrated databases in Latin America (Mexico, Brazil and Argentina). There are no commercial fused databases in the USA as yet, but this will change very soon. Data fusion also has a long and illustrious history in social policy analysis, where the subject area is known as 'microsimulation.'
We will now provide some details about how statistical matching enables us to meet the four requirements stated above. For illustrative purposes, we will use the Mexican databases.
On one hand, we have the TGI (Target Group Index) study. This is a two-phase study of a probability sample of Mexicans between the ages of 12 and 64. The first phase of the study is a personal interview, in which media usage and demographic information is collected. At the end of the interview, a product booklet is left behind to be collected later. Ascription is used to fill in the product information for those respondents who did not return booklets. This sample is weighted to account for disproportionate sample allocations and differential response rates. This study is therefore very similar to the MRI national study in the USA.
On the other hand, we have the TAM (Television Audience Measurement) panel operated by IBOPE Mexico. This is a panel drawn by probability sampling of Mexicans age 4 or older, whose household television sets are equipped with people meters. Daily viewing information is collected electronically by telephone from the households. This panel is weighted daily to account for disproportionate sample allocations and differential intab rates. This database is therefore very similar to the NTI panel in the USA.
There are actually many variants of statistical matching. To meet the four requirements stated above, we used the form known as 'constrained statistical matching.' The basic technique was originally developed to solve a problem in operations research known as the transportation problem --- how to ship the supply from m warehouses to meet the demand from n retail stores with minimum total shipping cost. The problem was formulated by Hitchcock in the 1940's, and a feasible solution known as the stepping stone algorithm was published in the 1950's. The method was used by the Internal Revenue Service in the 1970's to study the implications of tax policies on fused databases of tax returns and demographic databases.
In the context of data fusion here, the transportation problem can be recast as follows: how to ship m TGI respondents to n TAM respondents, with their case weights representing supply and demand respectively, and the shipping cost equal to the badness of fit between individual persons. The next chart shows a simple example of how the method works.
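In symbols, the recast problem can be written as a small linear program. The notation is ours, introduced only for illustration: s_i is the case weight of TGI respondent i (supply), d_j is the case weight of TAM respondent j (demand), c_ij is the badness of fit between the two persons, and x_ij is the amount of weight shipped from i to j, which becomes the case weight of the fused record (i, j).

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{m}\sum_{j=1}^{n} c_{ij}\,x_{ij} \\
\text{subject to}\quad & \sum_{j=1}^{n} x_{ij} = s_i, \qquad i = 1,\dots,m, \\
& \sum_{i=1}^{m} x_{ij} = d_j, \qquad j = 1,\dots,n, \\
& x_{ij} \ge 0 .
\end{aligned}
```

The two sets of equality constraints can only hold together if the total weights agree, that is, if the s_i and the d_j sum to the same number, as they do in the example that follows (6,000 on each side).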
For the TAM database, we assume that there are only two persons, with IDs TAM1 and TAM2. The first person has a case weight of 3,000 and the second person has a case weight of 3,000 as well. The total weight is 6,000. On the matching variable MV (e.g. head of household status), the first person is a YES and the second person is a NO. On the television variable TV1 (e.g. watch the Oscar show), the first person is a YES and the second is a NO.
For the TGI database, we assume that there are only three persons with IDs TGI1, TGI2 and TGI3. Each of these persons has a case weight of 2,000. The total weight is 6,000, same as in the TAM database. On the matching variable MV, the first two persons are YES and the third person is NO. On the product usage variable P1 (e.g. own credit card), the first person is YES and the other two persons are NO.
EXAMPLE OF CONSTRAINED STATISTICAL MATCHING
For the constrained statistical matching, we can start with person TAM1. We look for a matching person on the TGI side and we find that TGI1 is a perfect match on MV (note: TGI2 is also a match and such ties can be broken arbitrarily). However, TAM1 has a case weight of 3,000 while TGI1 has a case weight of 2,000. So it is possible only to 'ship' 2,000 for now. In the fused database, the first fused record comes from TAM1 and TGI1 with a case weight of 2,000 and their respective contributed data.
At this point, TGI1 is completely accounted for but there are still 1,000 units of TAM1 left. If we look for a match again, we find that TGI2 is a perfect match. In the fused database, the second fused record came from TAM1 and TGI2 with case weight of 1,000 with their respective contributed data.
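At this point, TAM1 is completely accounted for. As for TAM2, the fusion is necessarily done with TGI3 (case weight 2,000) and the remainder of TGI2 (case weight 1,000). Note that the TAM2-TGI2 record is not a perfect match on MV; it is the price of using every unit of weight exactly once. This completes the fused database.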
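The walk-through above can be reproduced with a short Python sketch. Please note the assumptions: this is a greedy pass that follows the order of the example, not the stepping stone algorithm, and it is not guaranteed to find the minimum total cost in general. The function name and the 0/1 cost on MV are ours, chosen for illustration.

```python
def greedy_constrained_match(recipients, donors, cost):
    """Fuse two weighted files so that every unit of case weight is used exactly once."""
    if sum(r["weight"] for r in recipients) != sum(d["weight"] for d in donors):
        raise ValueError("total case weights must agree on both sides")
    remaining = {d["id"]: d["weight"] for d in donors}
    fused = []
    for r in recipients:
        need = r["weight"]
        while need > 0:
            # Donors with weight left; min() keeps file order when costs tie.
            candidates = [d for d in donors if remaining[d["id"]] > 0]
            best = min(candidates, key=lambda d: cost(r, d))
            shipped = min(need, remaining[best["id"]])
            fused.append({
                "tam_id": r["id"], "tgi_id": best["id"], "weight": shipped,
                "TV1": r["TV1"], "P1": best["P1"],
            })
            need -= shipped
            remaining[best["id"]] -= shipped
    return fused

def mv_cost(tam_person, tgi_person):
    """Badness of fit: 0 for agreement on the matching variable MV, 1 otherwise."""
    return 0 if tam_person["MV"] == tgi_person["MV"] else 1

tam = [
    {"id": "TAM1", "weight": 3000, "MV": "YES", "TV1": "YES"},
    {"id": "TAM2", "weight": 3000, "MV": "NO", "TV1": "NO"},
]
tgi = [
    {"id": "TGI1", "weight": 2000, "MV": "YES", "P1": "YES"},
    {"id": "TGI2", "weight": 2000, "MV": "YES", "P1": "NO"},
    {"id": "TGI3", "weight": 2000, "MV": "NO", "P1": "NO"},
]

fused = greedy_constrained_match(tam, tgi, mv_cost)
for record in fused:
    print(record)
```

The fused file has four records (TAM1-TGI1 at 2,000, TAM1-TGI2 at 1,000, TAM2-TGI3 at 2,000 and TAM2-TGI2 at 1,000). Summing the weights by TAM person gives back 3,000 each, and summing by TGI person gives back 2,000 each, so the original weighted totals on both sides are unchanged.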
The properties of this method are:
In terms of the validation of this method, it is relatively easy to check that the overall television, magazine and product usage information is indeed preserved. The larger issue is: Are the television data being fused correctly with magazine data?
Hypothetically, let us assume that we have a true single source database (i.e. television and magazine data from the same set of people). The most commonly used paradigm for validation is the split-sample fold-over design. We split the single source data into two portions, and we fuse the two portions together using the same statistical matching method. Due to the single source nature of the database, we can actually compare the original data against the fused data using suitable measures. Unfortunately, we do not have such a single source database. After all, if we had one, we wouldn't need to do data fusion, would we?
Instead, we note that the TGI database contains a number of television-related variables (such as television daypart viewing and program type preferences). These TGI television variables are not substitutes for TAM data, because they are based upon imperfect recall. Nevertheless, we can use the TGI television as surrogate variables. Under the split-sample fold-over design, we split the TGI sample into two portions and we fuse these two portions together. Afterwards, we can compare the original TGI television data against the fused data using suitable measures.
The limitation of the within-TGI split-sample analysis is that surrogate variables are surrogates after all. Furthermore, this is not necessarily a realistic simulation of the TAM-TGI fusion with respect to sample design and sample size. Therefore, this is not a complete, unequivocal validation.
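The mechanics of a split-sample fold-over check can be sketched as follows. This is a toy illustration only: the eight records and the variable names are hypothetical, the split is a fixed even/odd split rather than a random one, the matching is a plain nearest-neighbor lookup rather than the constrained matching described earlier, and the agreement rate is just one of many possible 'suitable measures.'

```python
def distance(a, b, keys):
    """Number of common variables on which two respondents disagree."""
    return sum(a[k] != b[k] for k in keys)

def fold_over_agreement(sample, keys, surrogate):
    """Split one study in two, fuse the halves, and compare donated vs. actual values."""
    recipients = sample[0::2]  # first half keeps its own answers as the 'truth'
    donors = sample[1::2]      # second half donates its answers
    hits = 0
    for r in recipients:
        donor = min(donors, key=lambda d: distance(r, d, keys))
        hits += donor[surrogate] == r[surrogate]
    return hits / len(recipients)

# Hypothetical single study with a surrogate television variable (primetime viewer).
sample = [
    {"age": "young", "sex": "M", "primetime": 1},
    {"age": "young", "sex": "M", "primetime": 1},
    {"age": "young", "sex": "F", "primetime": 0},
    {"age": "young", "sex": "F", "primetime": 0},
    {"age": "old", "sex": "M", "primetime": 1},
    {"age": "old", "sex": "M", "primetime": 0},
    {"age": "old", "sex": "F", "primetime": 1},
    {"age": "old", "sex": "F", "primetime": 1},
]

rate = fold_over_agreement(sample, ("age", "sex"), "primetime")
print(f"person-level agreement: {rate:.2f}")
```

Here every recipient finds a donor who matches perfectly on age and sex, yet one of the four donated values is still wrong. That is the point of the exercise: it measures how much the common variables actually tell us about the variable being fused.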
The need for integrating information from multiple studies is ubiquitous. Various forms of data integration are practiced all the time, formally or informally. Consider this example: you are interested in targeting cinema goers; your television database does not tell you about cinema-attendance habits; you look at a lifestyle database and you see that cinema goers are disproportionately young; therefore, you target young people in your television database. Well, you have just performed a crude form of data integration. If you had the means and the will to conduct an error analysis, you might be appalled at the inefficiency of such a decision --- while cinema goers are disproportionately young, most young people are not regular cinema goers and most regular cinema goers are not young!
By contrast, data fusion permits you to consider age as well as many other relevant variables (How much money do you have to spend --- it may cost $40 for a family of four to go to a movie these days? Where do you live --- I used to live in a place where I couldn't find a good movie to go to? What other forms of audio-visual entertainment do you have access to --- you wouldn't pay $30 a month on premium cable channels just to not watch them? etc). The data fusion technology that I have described today has been readily available for quite some time. The process can be extended to fuse three or more studies.
If you are thinking about data fusion yourself, I would urge you to look beyond data fusion algorithms. There are three parts to the data fusion process: (1) data preparation; (2) data algorithm; (3) data applications. When you check out technical papers on data fusion from conferences and symposia, virtually 100% of them are about data algorithms. As a data fusion practitioner, I spend 50% of my time on data preparation, nearly 0% on data algorithms, and 50% on data applications. With this in mind, I offer these practical lessons:
Lesson #1: Pay attention to the details
When I begin a new data fusion project, I make sure that I have a full understanding of the details. Any mistake on these details can have a devastating impact on your project. So I ask:
- What is the coverage of the original studies? What pieces of geography? Who is eligible to be surveyed?
- How is the sample allocated? Is it a proportionate allocation by design? What are the actual intab distributions?
- Are all persons in a household surveyed? If not, then how are the survey respondents selected?
- What sort of weighting is applied to the intab samples? What are the weighted totals and distributions? Do they make sense?
- Are the so-called common variables really identical across the surveys? First, is the same question phrasing being used? (For example, just what is meant by the number of 'television sets' in the household? This is not a simple question because you have to consider ownership, working condition, active usage, portability, technical feasibility of metering, etc.) Second, do the outcomes appear to be similar? (For example, you may have asked the same question, but your results are radically different, such as 70% cable television incidence in one study and 55% in another study.)
- Who are the sponsors and subscribers of the individual studies? Which interests must be protected? How will the fused data be used?
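Some of these checks are easy to automate. As a sketch of the 'do the outcomes appear to be similar?' question, the snippet below compares the weighted incidence of one common variable across two studies and flags a large gap. The records reproduce the 70% versus 55% cable example; the function names and the five-point tolerance are arbitrary choices of ours, not a recommended standard.

```python
def weighted_incidence(rows, variable, value):
    """Share of total case weight held by respondents with `variable == value`."""
    total = sum(r["weight"] for r in rows)
    hit = sum(r["weight"] for r in rows if r[variable] == value)
    return hit / total

def compare_common_variable(study_a, study_b, variable, value, tolerance=0.05):
    """Return both incidences and whether they differ by more than `tolerance`."""
    a = weighted_incidence(study_a, variable, value)
    b = weighted_incidence(study_b, variable, value)
    return a, b, abs(a - b) > tolerance

# Hypothetical summaries of two studies that asked the 'same' cable question.
study_1 = [
    {"cable": "YES", "weight": 700},
    {"cable": "NO", "weight": 300},
]
study_2 = [
    {"cable": "YES", "weight": 550},
    {"cable": "NO", "weight": 450},
]

a, b, flagged = compare_common_variable(study_1, study_2, "cable", "YES")
print(f"study 1: {a:.0%}, study 2: {b:.0%}, needs investigation: {flagged}")
```

A flag of this kind only tells you where to look. Whether the gap comes from question wording, coverage, weighting or timing still has to be worked out by reading the study documentation.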
Lesson #2: You are limited by the quality of the original studies
Data fusion is downstream processing, and we know what flows downstream is not always the most healthy and appetizing product. Data fusion cannot improve upon the original studies with respect to qualities such as sample representativeness, sample size, response rates, response errors, etc. If you believe that these are important issues, you should try to get them fixed upstream within the original studies.
Lesson #3: You can improve the performance of data fusion
Data fusion is limited by the power of the available matching variables. Due to the 'flat maximum' effect, you are unlikely to achieve a great deal of improvement from better statistical modeling on the same set of matching variables. But you can gain significant improvements by adding more powerful matching variables. In the long run, this is a matter of determining what appropriate common questions can be inserted into the contributing studies. In the short run, you need to think 'out of the box' for creative solutions.
Let me give you an example. Suppose you wish to target people who need to make home improvements. The television database contains only television viewing and limited demographics (age, sex, income, geography, etc). The product usage database contains home improvement data and the same demographics. You are handicapped by the inability of your demographic variables to predict the need for home improvement. In this case, you can send out the names and addresses of your respondents to a database supplier (such as Acxiom or Experian), and you can obtain some very detailed data (single/multi-dwelling unit, lot size, the number of units in an apartment building, mortgage, length of residence, housing value, etc). Using this data overlay for matching is bound to improve your accuracy in the data fusion for this purpose.
Lesson #4: You must consider organizational factors
A data fusion project has multiple actors:
- the research supplier for study #1
- the subscribers and sponsors for study #1
- the research supplier for study #2
- the subscribers and sponsors for study #2
- the data fusion supplier
- third-party software suppliers
These actors have different degrees of commitment, motivation, involvement, knowledge, expertise and resources. To pull this coalition of actors together, you have to address issues such as:
(1) integrity of the original studies --- you need to assure and show the research suppliers, the sponsors and subscribers of the studies that their original data will not be distorted or devalued as a result of data fusion
(2) technical feasibility --- you need to show that it is technically feasible to conduct a valid data fusion
(3) commercial viability --- you need to convince the research suppliers that there are financial rewards
(4) understanding benefits/drawbacks --- you need to tell people what a fused database can and cannot do for them
(5) organization leadership --- you need someone to step forward to steward the data fusion project
(posted by Roland Soong, 6/22/2001)