Until 2010: Continuous Exploitation of Census Data Throughout the Intercensal Period
by
Griffith Feeney and Sam Suharto
1998-08-26
Abstract
Developments in information technology have rendered traditional approaches to the utilization of census results inadequate. To exploit the unique comparative advantages of the population census and to justify the tremendous expenses incurred we need to (i) produce the final edited set of census records and associated documentation as soon as possible after the enumeration and (ii) utilize these data continuously throughout the intercensal decade. This paper presents the case for continuous exploitation and addresses various practical questions and problems pertinent to its implementation.
Not so long ago, census “results” referred to published census volumes containing tabulations of individual and household records, pertinent documentation, and other contextual information. Developments in information technology are pointing toward a different conception, in which the primary product of the census is not tables, but the individual and household records themselves. This paper elaborates this idea and draws out some of its more important implications.
Our essential message, for the impatient, is that the traditional census publications will constitute only a small part of the 2000 round census product, and by no means the most important part. We will increasingly come to regard the individual and household records as the fundamental product of the census, to be exploited again and again as new information needs arise, until the results of the next census become available.
While we urge continuous exploitation in this paper, we want to emphasize before beginning that it is critical to maintain the standard procedure of releasing basic census products as soon as possible after the census. Though it is to be expected that these will represent an increasingly small fraction of the value of the census as time goes on, they will remain important for the foreseeable future.
Before proceeding, a qualification is necessary. Some of the things we are suggesting are already implemented in some countries, and we do not pretend that they are wholly new or original. There is considerable variability in national practice, however, and there are many countries in the region that will benefit from considering this perspective on the utilization of census data.
Census Records: The Fundamental Census Product
The first requisite for continuous exploitation of the full information content over the intercensal period is to have the edited census records for individuals and households (“microdata” or “unit record data”) available for further utilization. While this might seem too obvious to bother pointing out, it is not uncommon for the census office not to have the census records in its possession, or not to have them in readily usable form.
In the future, geographically distributed processing of census data outside the central statistical office poses the risk that census records may not be forwarded to the national census office, but remain only in possession of the lower level offices that process them. Clearly this should not be allowed to happen.
Given the information technology of the past, it was understandable that the census office might not hold the census records itself. The media on which the individual and household records were stored were bulky, required special storage and handling, and could be processed only with expensive equipment that might not be available in the census office.
Given current technology, there is no technological justification for the census office not having the census records readily available to appropriate staff. All census records and documentation may be stored on CD-ROMs that cost very little, weigh very little, take up very little space, and can be read on virtually any personal computer sold today.
The essential principle here is simple but profound. The primary product of the census is not tables, or publications, or public use samples, or any of the other end-user products, or all of these products taken together. The primary product of the census is the individual and household records themselves, together with pertinent documentation. No matter how many products are produced, the totality of their information content is a small fraction of the information content of the census records themselves.
Samples of census records, though valuable, are no substitute for the complete record set because they do not allow tabulation for very small groups or very small geographic areas. The defining characteristic of the census is that it is a complete enumeration of the population. This means that census data can be used to secure information about any identifiable sub-national aggregate. This is the unique comparative advantage of the census. It is not shared by samples of census records, even very large samples.
In particular, censuses generally allow tabulations for very small geographic units covering the entire national territory. These data are invaluable for many purposes, including, for example, market research by business and research on population impacts on the environment. Use of data for very large numbers (tens of thousands, hundreds of thousands, millions) of small geographic units has been made feasible by decreasing costs of data storage and computation. For most countries there is no alternative source for the “high spatial resolution” population data provided by the census, i.e. for the ability to provide numbers and characteristics of population for arbitrarily small geographic units. Moreover, the emergence of Geographic Information Systems (GIS) is creating a whole new class of uses for census results involving spatial analysis.
For all of these reasons, the census organization that does not have the primary census records readily available (and backed up in multiple locations to protect against physical destruction) has lost the “crown jewels” of the census. It is not in the interest of the census organization to allow this to happen.
This new scenario does raise an important problem, to be sure. The many disadvantages of the media and equipment of the past had as a side effect the benefit of helping to keep the census records secure against unauthorized use. The simplicity, low cost and wide availability of modern media and equipment means that the security of census records must be assured by appropriate administrative measures. The media containing these records must be physically secured and access to them limited to appropriate personnel.
Census Information on Demand
In the distant past, census tables were produced by programmers who wrote a separate program for each table. Compared with this extremely slow and expensive process, the development of the first general purpose tabulation programs was a tremendous advance. The earliest general purpose programs still required considerable expertise to use, however, making the production of tabulations a relatively slow and costly process.
We are rapidly approaching, or may already have arrived at, a stage where users—not programmers—can produce any desired census tabulation with general purpose software requiring no more than a few hours of training.
A weakness of the traditional approach to producing census results was the heavy burden put on the persons who decide what tables will be produced. As the experienced census taker knows, the number of tabulations that may be produced is effectively infinite. However many tables are produced, time and experience will inevitably uncover other useful tables that weren't produced. It simply isn't possible for any human being, no matter how knowledgeable, intelligent and hardworking, to anticipate all possible uses for so rich an information source.
Given the development of information technology in past decades, there was no way around this weakness. Persons responsible for the tabulation plan did the best they could and hoped for the best.
The current level of information technology has made the tabulation plan far less critical because information can be produced from the census records throughout the intercensal decade as new uses are recognized.
Organizational Requirements
Moving continuous exploitation of census data throughout the intercensal decade from concept to reality requires either that this function be assigned to an existing unit or that a new unit be created, with appropriate staffing in either case.
It is not sufficient for the census office merely to respond to requests for extracting information from the census records. Many existing users of census data will not immediately realize the opportunity. More importantly, there are almost certainly a large number of potential users of census data who do not realize they are potential users. The unit responsible for continuous exploitation should actively seek out new users and uses of census data.
General Purpose Census Samples
Because there are many uses of sampling in census work it is necessary first to define terms. By a general purpose census sample we mean a representative sample of census records, for persons and/or households, drawn by computer and consisting of a certain fraction of the total census data set. Because of the importance of associating persons with the households in which they are enumerated, it is customary and appropriate to sample households and include all persons in the sampled households.
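The household-based selection described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each person record is a dictionary carrying a hypothetical household_id field, and it uses simple random selection of households; an actual census sample design may well be systematic or stratified instead.

```python
import random


def sample_households(person_records, fraction, seed=0):
    """Draw a household-level sample and keep every person in each
    sampled household.

    Assumptions (not from the paper): records are dicts with a
    'household_id' key, and households are chosen by simple random
    sampling without replacement.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")

    # Sort the distinct household identifiers so that a given seed
    # always yields the same sample.
    household_ids = sorted({rec["household_id"] for rec in person_records})
    n_sampled = max(1, round(len(household_ids) * fraction))

    rng = random.Random(seed)
    chosen = set(rng.sample(household_ids, n_sampled))

    # Every person enumerated in a chosen household is retained.
    return [rec for rec in person_records if rec["household_id"] in chosen]
```

Sampling households rather than persons keeps each person associated with the other members of the household, which is the property the definition above calls for.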
There are two main advantages to a general purpose census sample. The first is to reduce computing demands for applications in which the full census data set is not required. This is more important for larger countries, obviously, and for very small countries the value may be negligible, depending on the level of computing infrastructure.
For most investigations at the national level a modest fraction of census records will serve, and results may be produced more quickly and easily and with simpler equipment with a sample of one percent (say) of the census records than with the full data set.
Very small general purpose samples will be useful for exploratory work of various kinds, undertaken with the intention of replicating the results for a larger sample, or the whole census, when the results of the exploratory work are in.
Census samples are an essential element of any strategy of continuous utilization of census data, whether or not they are made publicly available. In case of public availability, measures must be taken to respect the confidentiality of information about individual persons.
The second advantage of general purpose samples comes into play when these samples are made public. Confidentiality of the individual information may be very easily maintained simply by removing low level geographic identifier information from the records. And it is, of course, essential to maintain the confidentiality of individual information.
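Removing low level geographic identifiers from a sampled record might look like the following sketch. The field names (province, district, village, enumeration_area) and the choice to retain only the province are hypothetical; each census office would substitute its own geographic hierarchy and its own rule for the lowest level that may be released.

```python
# Hypothetical geographic hierarchy, from highest to lowest level.
GEOGRAPHIC_FIELDS = ("province", "district", "village", "enumeration_area")


def coarsen_geography(record, keep=("province",)):
    """Return a copy of the record without low level geographic identifiers.

    Only the geographic fields named in 'keep' survive; all
    non-geographic fields are passed through unchanged. The input
    record is not modified.
    """
    return {
        field: value
        for field, value in record.items()
        if field not in GEOGRAPHIC_FIELDS or field in keep
    }
```

For example, a record with age, sex, province, district, village and enumeration area would be released with age, sex and province only.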
New Tabulation Possibilities
Complete individual census records cannot be made publicly available for confidentiality reasons, but much of their value may be realized without making them public. What users want, after all, is not the individual information, but tables produced from this information. Computer networking will allow users to make tabulations from data sets securely held within the census office without having direct access to the records and without compromising confidentiality.
Requests for tabulations might be submitted by email, for example, or over the world wide web. A census office computer could receive the request, check it for syntactical correctness and expected processing time, produce the tabulation, vet the tabulation for confidentiality restrictions, and, if appropriate, email the result back to the user, all without intervention of census office staff.
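The unattended sequence just described (check the request, produce the tabulation, vet it, return it) can be sketched as follows. This is an illustration under stated assumptions, not a description of any existing system: the list of permitted variables, the minimum cell count of 3, and the rule of suppressing small cells are all hypothetical stand-ins for whatever confidentiality restrictions a census office actually applies. The check on expected processing time and the email transport are omitted.

```python
from collections import Counter

# Hypothetical: variables that outside users are permitted to tabulate.
ALLOWED_VARIABLES = {"sex", "age_group", "district"}

# Hypothetical confidentiality rule: suppress cells with fewer persons.
MIN_CELL_COUNT = 3


def handle_tabulation_request(records, variables):
    """Validate a request, cross-tabulate the records, and vet the result.

    Returns a dict mapping each cell (a tuple of category values) to
    its count, or to None where the count falls below MIN_CELL_COUNT.
    """
    # Step 1: check the request before doing any work.
    unknown = [v for v in variables if v not in ALLOWED_VARIABLES]
    if not variables or unknown:
        raise ValueError(f"invalid request, unknown variables: {unknown}")

    # Step 2: produce the tabulation from the securely held records.
    table = Counter(tuple(rec[v] for v in variables) for rec in records)

    # Step 3: vet the tabulation, suppressing cells that are too small.
    return {
        cell: (count if count >= MIN_CELL_COUNT else None)
        for cell, count in table.items()
    }
```

The user receives only the vetted table; the individual records never leave the census office.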
Special Purpose Census Data Sets
Utilization of census data will be promoted also by the production of special purpose, deterministically defined subsets of census records. These are “samples” in the sense of being subsets of the complete set of census records, but not in the sense of involving random or pseudo-random selection of records. We therefore discuss them separately.
The strongest case for special purpose census data sets occurs for population subgroups that are important for one reason or another, but are so small that (i) making tabulations or otherwise processing them from the full set of records would be grossly inefficient and (ii) even a large general purpose sample would contain too few cases to provide reliable results.
Minority groups of various kinds are perhaps the most obvious example, but there are many possibilities. In a census with a disability question, for example, a special data set of disabled persons, or perhaps more appropriately of households including at least one disabled person, would be valuable.
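A deterministically defined subset of this kind can be sketched in a few lines. The household_id and disabled field names are hypothetical, and the selection rule is passed in as a function so that the same sketch serves for other subgroups.

```python
def households_with_any(person_records, is_member):
    """Select all persons living in households that contain at least
    one person for whom is_member(record) is true.

    Selection is deterministic: no random sampling is involved.
    Assumes each record is a dict with a 'household_id' key.
    """
    # First pass: find the households containing a qualifying person.
    matching = {
        rec["household_id"] for rec in person_records if is_member(rec)
    }
    # Second pass: keep every member of those households.
    return [rec for rec in person_records if rec["household_id"] in matching]
```

Calling `households_with_any(records, lambda rec: rec["disabled"])` would yield the data set of households including at least one disabled person.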
Cartographic Information from the Census
Until very recently, cartographic operations carried out in connection with a census served primarily or exclusively as adjuncts to carrying out the enumeration. They were not likely to be regarded, even within the census organization itself, as products of the census deserving of broader dissemination. A constellation of circumstances—the volume of cartographic information collected in connection with the census enumeration, the emergence of GIS software, greatly increased computing power among users, the ability of census data to provide very high resolution spatially referenced data—is conspiring to change this. Cartographic information may in the future come to be regarded as one of the most important outputs of the census.
Cost and Pricing for Public Access
The information produced by the census is a public good, and it will usually be in the interest of the census organization to provide significant quantities of information to the public at minimal or no cost. There will always be a limit to what the census organization can provide without charge, however, for all services provided incur expenses, and some much more than others. Policies will inevitably vary from country to country, but there are a few general considerations that are broadly relevant.
The speed with which information technology is developing means that current practice is a poor guide to the future. Planning for the use of the 2000 round census data means looking ahead roughly five years, and that is a very, very long time in the rapidly changing world of information technology.
Users want “user friendly” data, but many of them also want inexpensive (or free) data. Since the development of user-friendly interfaces can be very expensive, it is reasonable to provide a rudimentary interface at no or low cost and to charge users willing to pay more for a more user-friendly interface.
The same principle may be applied to turn-around time. Real-time access to tabulations is relatively costly to provide, and users that want this may be charged accordingly. Users content to submit tabulation requests for processing during off-peak hours might be provided with the same data at low or no cost.
The Importance of Archiving
Continuous exploitation of census data is in part a matter of archiving. We have made much of the opportunities of new information technology, but it is equally appropriate to note some of the dangers. Very rapid technological change virtually ensures that existing data formats will become obsolete in time. It is therefore critical for every national statistical office to have a systematic program for “refreshing” the format of census (and other) data files to ensure that they continue to be accessible in the future as computer hardware and software change.
There is also the issue of the stability of storage media. It is generally agreed that CD-ROMs are among the most stable media available today, but the medium is too new for us to know much about long term stability. Periodically refreshing critical data resources will serve not only to identify and correct the problem of outdated formats, but also to avoid losing data due to media deterioration.
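One way to check that a refreshed copy has not lost data is to compare checksums of the old and new files. The sketch below is an assumption about how such a check might be done, not a procedure taken from any census office, and it covers only the case of copying a file unchanged onto new media. A conversion to a new format changes the bytes and would need a comparison at the level of records instead.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file as a hexadecimal string.

    The file is read in chunks so that very large census files do not
    have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_is_intact(original_path, copy_path):
    """True if the copy is byte-for-byte identical to the original."""
    return file_sha256(original_path) == file_sha256(copy_path)
```

Recording the digest of each file at the time of archiving also allows later deterioration of a medium to be detected, since a damaged file will no longer match its recorded digest.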
Conclusion
New information technology has vastly increased the potential uses of census data to address public issues of all kinds. We believe that it is essential for census offices to cultivate a broader constituency of census data users to support the very heavy costs involved in conducting a census. It is essential to educate existing users of census data about the new possibilities created by information technology, and to educate ourselves about how these possibilities can meet their needs. It is even more important to educate the many groups of non-users of census data who should be users, and to educate ourselves about how we can meet their needs.
Acknowledgments
We are grateful to Michael J. Levin for comments on an earlier version of this paper.
Griffith Feeney is a Senior Fellow at the East-West Center Program on Population in Honolulu. He may be contacted by email at gfeeney@hawaii.edu, by fax at 1-808-944-7456, or by mail at Program on Population, East-West Center, 1601 East-West Road, Honolulu, Hawaii 96848 USA.
Sam Suharto is Chief, Demographic and Social Statistics Branch, Statistics Division, United Nations, New York. He may be contacted by email at suharto@un.org, by fax at 1-212-963-1940, by phone at 1-212-963-8493, or by mail at Two UN Plaza, DC2-1520, United Nations, New York, NY 10017 USA.