Fundamentals of Statistical Analysis and Data Collection
Nature of Statistics
The fundamental question regarding Statistics is whether it is a science or an art. Professor Tippett rightly observed that “Statistics is both a science as well as an art.”
- As a science, Statistics studies numerical data in a scientific or systematic manner.
- As an art, Statistics relates quantitative data to real-life problems.
By using statistical data, we are able to analyse and understand real-life problems much better than otherwise. Thus, the problem of unemployment in India is more meaningfully analysed when the size of unemployment is supported with quantitative data.
Descriptive and Inferential Statistics
(1) Descriptive Statistics
Descriptive Statistics refers to those methods used for the collection, presentation, and analysis of data. These methods relate to estimations such as:
- Measurement of central tendency (mean, median, mode)
- Measurement of dispersion (mean deviation, standard deviation, etc.)
- Measurement of correlation
Example: Descriptive statistics is used when you estimate the average height of the secondary students in your school, or when you find that the marks in science and mathematics of the students in all classes are closely related to each other.
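The measures listed above can be computed directly from a small data set. The following is a minimal sketch using Python's standard `statistics` module; the heights are hypothetical values chosen only for illustration, not data from any real school.

```python
import statistics

# Hypothetical heights (cm) of ten secondary students; illustrative values only.
heights = [150, 152, 155, 155, 158, 160, 160, 160, 165, 170]

mean_height = statistics.mean(heights)      # arithmetic mean
median_height = statistics.median(heights)  # middle value of the ordered data
mode_height = statistics.mode(heights)      # most frequently occurring value
std_dev = statistics.pstdev(heights)        # standard deviation of the whole group

print(mean_height, median_height, mode_height, round(std_dev, 2))
```

Here `pstdev` is used because the ten students are treated as the entire group being described; `statistics.stdev` would be the usual choice if they were only a sample.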
(2) Inferential Statistics
Inferential Statistics refers to all methods by which conclusions are drawn relating to the universe or population on the basis of a given sample. (In Statistics, the term universe or population refers to the aggregate of all items or units relating to any subject.)
Example: If your class teacher estimates the average weight of the entire class (the population) based only on the average weight of a sample of students, he is using inferential statistics.
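The class-weight example can be sketched in a few lines of Python. The weights below are generated at random purely for illustration, and the class size of 40 and sample size of 10 are assumptions, not figures from the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical weights (kg) of a class of 40 students: the population.
population = [random.randint(40, 70) for _ in range(40)]

# Weigh only 10 students chosen at random: the sample.
sample = random.sample(population, 10)

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # the estimate actually computed

print(f"sample estimate: {sample_mean:.1f} kg, class mean: {population_mean:.1f} kg")
```

The sample mean will generally differ somewhat from the population mean; inferential statistics is concerned with drawing conclusions about the population despite that difference.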
Limitations of Statistics
In modern times, Statistics has emerged to be of crucial significance in all walks of life. However, it has certain limitations. As Newsholme writes, “Statistics must be regarded as an instrument of research of great value but having severe limitations which are not possible to overcome.”
(1) Study of Numerical Facts Only
Statistics studies only such facts as can be expressed in numerical terms. It does not study qualitative phenomena like honesty, friendship, wisdom, health, patriotism, or justice.
(2) Study of Aggregates Only
Statistics studies only the aggregates of quantitative facts. It does not study statistical facts relating to any particular unit.
Example: It may be a statistical fact that your class teacher earns ₹50,000 per month. But, as this fact relates to an individual, it is not deemed a subject matter of Statistics. However, it becomes a subject matter of Statistics if we study the income of school teachers across all parts of the country to find regional differences in income.
(3) Homogeneity of Data, an Essential Requirement
To compare data, it is essential that statistics are uniform in quality. Data of diverse qualities and kinds cannot be compared. For example, the production of food grains (measured in tonnes) cannot be compared directly with the production of cloth (measured in metres). Nevertheless, it is possible to compare their value instead of the volume.
Data Sources and Types
Primary Source of Data
If you want to know about the quality of life of the people in your town, perhaps by ascertaining the per capita expenditure of different households, and you decide to collect the basic data yourself through a statistical survey (with the help of investigators or field workers), you are relying on a primary source of data.
Thus, a primary source implies collection of data from its source of origin, offering you first-hand quantitative information relating to your statistical study. You or your team contact the respondents (the people offering basic information) to obtain the desired quantitative information.
Secondary Source of Data
A Secondary Source of data collection implies obtaining the relevant statistical information from an agency or institution already in possession of that information.
To continue the previous example, data relating to the quality of life or per capita expenditure in your town may have already been collected by the State Government. You can approach the concerned Government department and request the desired information. This will be a Secondary Source for you. Thus, a secondary source implies that the desired statistical information already exists, and you are simply collecting it from the concerned agency; you are not conducting the survey or contacting the respondents yourself. You are not getting first-hand information.
Primary Data
Data collected by the investigator for his own purpose, for the first time, from beginning to end, are called primary data. These are collected from the source of origin.
In the words of Wessel, “Data originally collected in the process of investigation are known as primary data.” Primary data are original. The concerned investigator is the first person to collect this information; therefore, primary data are first-hand information.
Illustration: If you are interested in studying the socio-economic state of students in your Class XI who secured first division in their matriculation examination, and you collect information regarding their pocket allowance, family income, and educational status, all this information would be termed primary data since you are the first person to collect it from the source of its origin.
Principal Differences between Primary and Secondary Data
The following are some principal differences between primary and secondary data:
- Difference in Originality: Primary data are original because they are collected by the investigator from the source of their origin. Against this, secondary data already exist and are therefore not original.
- Difference in Objective: Primary data are always related to a specific objective of the investigator and do not need adjustment for the concerned study. On the other hand, secondary data have already been collected for some other purpose; therefore, these data need to be adjusted to suit the objective of the study at hand.
- Difference in Cost of Collection: Primary data are costlier in terms of time, money, and effort involved than secondary data, as they are collected for the first time from their source of origin. Secondary data are simply collected from published or unpublished reports and are much less expensive.
It may be noted that there are no fundamental differences between primary data and secondary data; data are data, whether primary or secondary. They are classified based on collection: first-hand or second-hand. Thus, a particular set of data collected by an investigator for a specific purpose from the source of origin is primary data. The same set of data, when used by some other investigator for his own purpose, is known as secondary data.
Qualities of a Good Questionnaire
Following are some of the desired qualities of a good questionnaire:
- Limited Number of Questions: The number of questions should be as limited as possible, relating only to the purpose of the enquiry.
- Simplicity: The language of the questions should be simple, lucid, and clear. Questions should be short, not long or complex. Mathematical questions must be avoided.
- Proper Order of the Questions: Questions must be placed in a proper sequence.
- No Undesirable Questions: Undesirable or personal questions must be avoided. The questions should not offend the respondents.
- Non-Controversial: Questions should be answerable impartially; no controversial questions should be asked.
- Calculations: Questions involving calculations by the respondents must be avoided. The investigator should perform the calculation job.
- Pre-Testing (Pilot Survey): Some questions should be put to a few respondents on a trial basis. If the respondents find any question difficult to answer, that question can be reframed accordingly. Such testing is technically called a pilot survey.
- Instructions: A questionnaire must show clear instructions for filling in the form.
Census Method vs. Sample Method
Census Method
A method in which an investigator collects data related to the problem under investigation by covering every item of the population or universe is known as the Census Method of Collecting Data.
Example: If an investigator wants to investigate the colour composition of TATA cars in India, using the Census Method, he will have to collect data on the colour of each TATA car sold in India.
This method implies a complete enumeration of the population. The Census of Population is the most prominent example of this method. For the census of population, investigators count the country’s population by conducting enquiries in every house and by covering people living on the roadside as well. In India, the Census of Population is conducted every ten years. The last census was held in February 2011; the next was due in 2021 but was postponed due to the pandemic.
Sample Method
The Sample Method is the method in which data are collected about a sample (a group of items taken from the population for examination) and conclusions about the population are drawn on the basis of that sample.
The sample method is widely used in our day-to-day life. A lady in the kitchen, for example, tests only a grain or two of rice to know whether the rice is boiled or not. Similarly, by examining only a few drops of blood, a doctor determines a person’s blood group.
Random Sampling
Random sampling is that method of sampling in which each and every item of the universe has an equal chance of being selected in the sample. In other words, there is an equal probability for every item of the universe being selected.
Which items get selected is beyond the control of the investigator; the selection is left entirely to chance factors. This method is used particularly when various items of the universe are homogeneous or identical to each other. This method is impartial and economical. Random Sampling may be done in the following ways:
1. Lottery Method
In this method, paper slips are made for each item of the universe. These slips are shuffled in a box, and then, impartially, some slips are drawn to form a sample of the universe.
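The lottery method can be imitated in Python as a rough sketch. The function name `lottery_sample` and the universe of 50 students are hypothetical, introduced only for illustration.

```python
import random

def lottery_sample(items, sample_size, seed=None):
    """Draw sample_size items without replacement, each item equally likely."""
    rng = random.Random(seed)
    slips = list(items)          # one slip per item of the universe
    rng.shuffle(slips)           # shuffle the slips in the box
    return slips[:sample_size]   # draw the required number of slips

# Hypothetical universe of 50 students.
students = [f"Student {i}" for i in range(1, 51)]
print(lottery_sample(students, 5, seed=42))
```

Passing a `seed` only makes the draw repeatable for demonstration; in an actual survey the draw would be left unseeded so that the selection depends on chance alone.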
2. Tables of Random Numbers
Some statisticians have prepared sets of tables called Tables of Random Numbers. A sample is framed with reference to these tables. Tippett’s Table is the most widely used. Using 41,600 digits, Tippett formed 10,400 numbers of four digits each.
For this method, all items of the universe are first arranged in an order and given serial numbers. Then, using Tippett’s Table, the required number of items needed for a sample is selected.
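One simple convention for using such a table is to read its numbers in order, skip any number larger than the size of the universe or already chosen, and stop when the sample is complete. The sketch below assumes that convention; the function name is hypothetical, and the table entries are made-up four-digit numbers standing in for a printed table, not Tippett's actual figures.

```python
def select_with_random_table(population_size, sample_size, table):
    """Pick distinct serial numbers in 1..population_size by reading a table in order."""
    chosen = []
    for number in table:
        # Skip numbers outside the range and numbers already chosen.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen

# Made-up four-digit entries for illustration only (NOT taken from any real table).
illustrative_table = [4821, 1370, 9055, 2210, 6604, 1370, 3378, 8120, 4560, 1999]

# Universe of 5,000 serially numbered items; a sample of 5 is required.
selected = select_with_random_table(5000, 5, illustrative_table)
print(selected)
```

The items bearing the selected serial numbers then form the sample. If the table is exhausted before the sample is complete, the function simply returns the numbers found so far.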
Concept and Definition of Correlation
The statistical methods studied so far focus on the analysis of one variable or one statistical series only. In real life, however, two or more statistical series may be mutually related. For instance, a change in price leads to a change in quantity demanded; an increase in the supply of money causes an increase in the price level; an increase in the level of employment results in an increase in output.
Such situations necessitate the simultaneous study of two or more statistical series. The focus of study in such situations is on the degree of relationship between different statistical series. The statistical technique that studies the degree of such relationships is called the technique of correlation.
1. Simple Correlation
Simple correlation implies the study of the relationship between two variables only, such as the relationship between price and demand or the relationship between money supply and price level.
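The degree of relationship between two series is commonly summarised by a single coefficient between -1 and +1. The text does not name a particular measure, so the sketch below assumes Karl Pearson's coefficient of correlation, one widely used choice; the price and demand figures are hypothetical.

```python
import math

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation for two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the respective means.
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each series.
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Hypothetical price (rupees) and quantity demanded (units); illustrative only.
price = [10, 12, 14, 16, 18]
demand = [50, 45, 40, 35, 30]

r = pearson_r(price, demand)
print(round(r, 2))
```

In this made-up data, demand falls by exactly 5 units for every 2-rupee rise in price, so the coefficient comes out at -1, a perfect negative correlation. Real data would normally give a value strictly between -1 and +1.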
2. Multiple Correlation
When the relationship among three or more variables is studied simultaneously, it is called multiple correlation. In the case of such correlation, the entire set of independent and dependent variables is studied simultaneously. For instance, the effects of rainfall, manure, water, etc., on the per-hectare productivity of wheat are studied simultaneously.
Concept and Definition of Index Numbers
The concept of the index number can be best understood through an illustration. Consider a situation of rising prices during the year 2023. We face three basic questions:
- Compared to which year have the prices risen during 2023?
- How do we handle the situation when the prices of some goods rise more than others?
- Can prices of different goods be expressed in terms of any standard unit, or must different units be used (e.g., rupees per litre for milk, rupees per metre for cloth, rupees per kilogram for sweets)?
The study of Index Numbers answers all these questions. First, the rise in prices during 2023 would be studied only with reference to some previous year, like 2004 or 2011. Otherwise, the mere statement that prices during 2023 have tended to rise makes no sense. 2023 will be treated as the current year and 2004 or 2011 as the base year. Prices during the base year are taken as 100.
(1) Relative Changes
Index numbers measure relative or percentage changes in the variable(s) over time. An index number of prices, for example, is not simply a statement of prices at different dates; it presents estimates of percentage changes in prices over years with reference to some selected base year. If the index of prices stands at 200 in 2023 compared to 100 in 2011–12 (the base year), it suggests that compared to the base year, prices have risen by 100 per cent.
(2) Quantitative Expression
Index numbers offer a precise measurement of the quantitative change in the concerned variable(s) over time. The index of prices, for example, will tell us that between the years 2022 and 2023 prices have risen by 7 per cent; an index of industrial production will similarly tell us by what percentage production has risen or declined.
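The arithmetic behind these statements can be sketched as follows, assuming the simplest case in which the cost of the same basket of goods is known for both the base year and the current year. The function names and the cost figures are hypothetical.

```python
def price_index(current_cost, base_cost):
    """Index for the current year, with the base year taken as 100."""
    return current_cost / base_cost * 100

def percentage_change(index_value):
    """Percentage change relative to the base year (base = 100)."""
    return index_value - 100

# Hypothetical cost of the same basket of goods in the two years.
base_year_cost = 500.0      # assumed cost in the base year
current_year_cost = 1000.0  # assumed cost in the current year

index_value = price_index(current_year_cost, base_year_cost)
print(index_value, percentage_change(index_value))
```

With these assumed figures the index stands at 200, matching the example above in which prices have risen by 100 per cent compared to the base year.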
Difficulties or Problems in the Construction of Index Numbers
(1) Purpose of Index Number
There are various types of index numbers constructed with different objectives. Before constructing an index number, one must define the objective, as construction is significantly influenced by it.
Thus, for example, if the objective is to study the impact of the change in the value of money on consumers, one should construct a consumers’ price index number. If the objective is to study the impact of the change in the purchasing power of money on producers, one shall construct an index number based on wholesale prices.
Haberler rightly pointed out that, “Different index numbers are constructed to fulfil different objectives, and before setting to construct a particular index number, one must clearly define one’s object of study because, it is on the objective of the study that the nature and format of the index number depends.”
(2) Selection of Base Year
Selection of the Base Year is another problem. The Base Year is the reference year with which prices of the current year are compared. As far as possible, the Base Year should be a normal year—one without much fluctuation—otherwise, the index values would fail to capture the real change in the variable.
(3) Selection of Goods and Services
Having defined the objective, the next problem is the selection of goods or services to be included in the index number. To construct the Consumers’ Price Index, for example, not all commodities are included.
