Social Research Principles: Methods, Ethics, and Big Data Analysis

Research Methods: Questions and Answers

Flaws in the Original Google Flu Trends Method

There are two flaws in the original Google Flu Trends method:

  1. It combined multiple queries into a single variable, ignoring how the tendencies of individual search queries vary over time. In particular, it did not account for the fact that the language used in searches changes over time.
  2. The underlying algorithm dynamics were unstable. As a result, Google Flu Trends was an unstable reflection of the prevalence of the flu.

Mitchell Duneier’s Use of an Audio Recorder

In the study he writes about in his book Sidewalk, why did Mitchell Duneier use an audio recorder to capture conversations as opposed to simply taking notes?

When Duneier wrote his first draft, he had collected all of his data by writing it down. However, because the meanings of a culture are embodied mainly in its language, he eventually decided to use an audio recorder to capture conversations. Recording helped him avoid misunderstanding words in ways that could change their meaning. It also required no extra effort beyond turning the recorder on.

Soda Consumption Point Estimate and Confidence Interval

You are interested in soda consumption in a particular city with a population of 2.1 million people. You take a random sample of 100 people from this population and ask each person in the sample how many liters of soda they drank during the previous day. The mean response is 0.6, with a standard deviation of 0.1. Provide a point estimate and a 95% confidence interval for the mean soda consumption in the population. For purposes of this question, assume a normal sampling distribution and use 2 as the critical value (z) defining the 95% confidence interval.

  • Point Estimate: The point estimate for the mean soda consumption is 0.6 liters.
  • 95% Confidence Interval Calculation:
    • n = 100, s = 0.1, sample mean (x̄) = 0.6
    • Standard Error (se) = s / (√n) = 0.1 / (√100) = 0.01
    • Confidence Interval (Assuming z=2):
    • Lower Bound: 0.6 – (2 * 0.01) = 0.58
    • Upper Bound: 0.6 + (2 * 0.01) = 0.62

Thus, assuming a normal sampling distribution and a critical value of 2, the 95% confidence interval for mean soda consumption runs from 0.58 to 0.62 liters.
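
The calculation above can be reproduced with a short script. This is only a sketch: the function name mean_confidence_interval and its parameter names are illustrative choices rather than part of the original exercise, and the critical value of 2 is the rounded value the question asks for, not the more precise 1.96.

```python
import math


def mean_confidence_interval(sample_mean, sample_sd, n, z=2.0):
    """Return (standard error, lower bound, upper bound) for a sample mean.

    Assumes a normal sampling distribution; z is the critical value.
    """
    standard_error = sample_sd / math.sqrt(n)
    margin = z * standard_error
    return standard_error, sample_mean - margin, sample_mean + margin


# Values from the soda consumption question: mean 0.6, sd 0.1, n = 100.
se, lower, upper = mean_confidence_interval(0.6, 0.1, 100, z=2.0)
print(f"SE = {se:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
# prints "SE = 0.01, 95% CI = [0.58, 0.62]"
```

Note that the population size (2.1 million) does not enter the calculation; only the sample size affects the standard error here.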

Interpreting Multivariate Regression Results

To investigate the relationship between soda consumption and sleep problems, you ask respondents to record the amount of soda they drink and the number of minutes of sleep they get each day over the course of a week. You then use multivariate regression to model total weekly sleep time, in minutes, as a function of the total number of liters of soda consumed during the week, controlling for age and gender. The estimated coefficient on the soda consumption variable is -32 with a standard error of 2.5. Briefly interpret this result.

The estimated coefficient of -32 means that each additional liter of soda a person consumes per week is associated with a predicted decrease of 32 minutes in weekly sleep time, holding age and gender constant. Given the standard error of 2.5, and assuming an approximately normal sampling distribution (Central Limit Theorem), the 95% confidence interval for the coefficient is -32 ± 2 * 2.5, that is, a decrease of between 27 and 37 minutes. Because this interval does not include zero, the estimate is statistically significant at the 5% level.
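
The arithmetic behind this interpretation can be checked with a few lines of code. This is a minimal sketch: the helper coefficient_summary is a hypothetical name introduced here for illustration, and it uses the same rounded critical value of 2 as the answer above.

```python
def coefficient_summary(estimate, std_error, z=2.0):
    """Summarize a regression coefficient under a normal approximation.

    Returns the confidence interval bounds, the ratio of the estimate
    to its standard error, and whether the interval excludes zero.
    """
    lower = estimate - z * std_error
    upper = estimate + z * std_error
    ratio = estimate / std_error
    excludes_zero = lower > 0 or upper < 0
    return lower, upper, ratio, excludes_zero


# Coefficient on weekly soda consumption from the question above.
lower, upper, ratio, excludes_zero = coefficient_summary(-32.0, 2.5)
print(f"95% CI = [{lower}, {upper}] minutes per liter")
print(f"estimate / SE = {ratio:.1f}, excludes zero: {excludes_zero}")
```

The estimate is almost 13 standard errors away from zero, which is why the interval sits well below zero.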

Ethics for Researchers (European Commission 2013)

Chapter I: Research Ethics

True research excellence and ethical research conduct imply the application of fundamental ethical principles to scientific research. All possible domains of scientific research can raise ethical issues. Ethics is everywhere. Researchers often misunderstand ethics as hindering the scientific process. It does set boundaries, but it is not intended to regulate research or to go against research freedom.

History and Legal Basis of Research Ethics

Malpractices exposed during the Nuremberg trials led to the Nuremberg Code (1947) and, later, the Declaration of Helsinki (1964) for medical research on human subjects. This established that research ethics is of crucial importance for all scientific domains, leading to various Codes of Ethics. Research ethics and human rights influence each other, as seen in the Oviedo Convention. Compliance with human rights is pivotal for all European policy domains, as outlined in the European Charter of Fundamental Rights.

Chapter II: The 7th Framework Programme

The purpose of the Ethics Reviews is to ensure that all research activities carried out under the 7th Framework Programme are conducted in compliance with fundamental ethical principles. The 7th Framework Programme is a significant source of public funding dedicated to supporting a sound research community for a better future for Europe.

The Ethics Review Procedure

All research proposals submitted to the European Commission are evaluated on their scientific merit and their ethical and social impact. Proposals retained by experts for funding but identified as raising ethical issues will be submitted to the Ethics Review.

Chapter III: Ethical Issues

Although many of the rules and principles outlined seem evident, several recent examples indicate we need to remain alert and cautious when it comes to research ethics.

Data Protection and Privacy

Data protection is meant to guarantee our right to privacy. It includes both measures regarding access to data and the conservation of data. When preparing a proposal for the 7th Framework Programme, one should pay careful attention to privacy and data compliance.

Informed Consent

Informed consent guarantees the voluntary participation in research and is probably the most important procedure to address privacy issues. It consists of three components: adequate information, voluntariness, and competence. This implies that, prior to consenting, participants should be clearly informed of the research goals, possible adverse events, and their right to refuse participation or withdraw at any time without consequences. Special attention must be paid to children, vulnerable adults, and people with certain cultural backgrounds.

Research on Human Embryos and Fetuses

Research with human embryonic stem cells (hESC) offers the prospect of addressing medical needs but raises fundamental ethical questions because research implies the use and destruction of human embryos.

Dual Use

Dual use is understood as the potential misuse of research. This means that research activities might involve or generate materials, methods, or knowledge that could be misused. The possible dual use of new technologies and scientific results creates ethical problems for the scientist and the community regarding the responsibility to prevent such misuse.

Animal Research

This refers to the use of animals in experiments, governed by the principles of Reduction, Replacement, and Refinement. To comply with these principles, animal research must be systematically evaluated, assessing pain, distress, and lasting harm. The researcher must provide details of the species used, justify their use, explain why the anticipated benefits justify using animals, and why methods avoiding animal use cannot be employed.

Research Involving Developing Countries

A particular situation arises when research is conducted in or with non-EU countries. Special attention is required for developing countries and emerging economies. Collaboration can raise ethical concerns due to the country’s overall development level and the potential vulnerability of participants, demanding attention to the specific characteristics of the situation.

Three Tips to be More Ethically Prepared

  1. Try to integrate ethical and societal expertise into your research projects.
  2. Use existing codes of conduct for researchers.
  3. Do not hesitate to seek advice.

Social Research Methods by Alan Bryman

Chapter 1: The Nature of Social Research

A variety of considerations enter into the process of doing social research:

  1. Theory and Research Relationship: This concerns whether theory guides research (a deductive approach) or if theory is an outcome of research (an inductive approach).
  2. Epistemological Issues: These relate to what is regarded as appropriate knowledge about the social world. A crucial question is whether a natural science model is suitable for studying the social world.
    • Positivism: An epistemological position advocating the application of natural science methods to social reality. Its principles include:
      1. Knowledge is confirmed by the senses (phenomenalism).
      2. The purpose of theory is to generate testable hypotheses (deductivism).
      3. Knowledge is gathered from facts that form the basis for laws (inductivism).
      4. Science must be value-free (objective).
      5. A clear distinction exists between scientific and normative statements.
    • Realism: Shares two features with positivism: a belief that natural and social sciences should apply the same approaches and a commitment to an external reality separate from our descriptions of it.
      1. Empirical realism asserts that reality can be understood through appropriate methods.
      2. Critical realism aims to identify the underlying structures that generate social events and discourses to understand and change the social world.
    • Interpretivism: An alternative to positivism, it requires a strategy that respects the differences between people and the objects of natural sciences. The social scientist must grasp the subjective meaning of social action. Its heritage includes Weber’s notion of Verstehen, the hermeneutic-phenomenological tradition, and symbolic interactionism.
  3. Ontological Issues: These concern whether the social world is external to social actors or something people are in the process of fashioning.
    • Objectivism: An ontological position that social phenomena and their meanings exist independently of social actors.
    • Constructionism (or Constructivism): An ontological position that social phenomena and their meanings are continually being accomplished by social actors, implying they are produced through social interaction and are in a constant state of revision.

  4. Research Strategy: These issues relate to the distinction between quantitative and qualitative research. While they represent different approaches, we should be wary of driving a wedge between them.
  5. Values and Practical Issues: These also impinge on the social research process.

Key Definitions

  • Deductive: theory → observations/findings
  • Inductive: observations/findings → theory
  • Middle-range theories: Theories that sit between general theories, which are too remote from particular social behaviors to account for observations, and detailed descriptions of particulars, which are not generalized.
  • Grand theories: Operate at a more abstract and general level.
  • Empiricism: (1) An approach suggesting only knowledge gained through experience and the senses is acceptable. (2) The belief that accumulating “facts” is a legitimate goal in itself (“naïve empiricism”).

Chapter 2: Research Designs and Methods

A research design provides a framework for the collection and analysis of data. A research method is simply a technique for collecting data, such as a questionnaire or participant observation.

Criteria in Social Research

  • Reliability: Concerned with whether the results of a study are repeatable. It is often connected with quantitative research.
  • Replication: Closely related to reliability. Researchers must spell out their procedures in detail so that others can replicate the findings to assess the study’s reliability.
  • Validity: Concerned with the integrity of the conclusions generated from research. Main types include:
    • Measurement validity: Does a measure truly reflect the concept it is supposed to denote? (Primarily for quantitative research).
    • Internal validity: Does a conclusion about a causal relationship between variables hold water? The causal factor is the independent variable, and the effect is the dependent variable.
    • External validity: Can the results of a study be generalized beyond the specific research context?
    • Ecological validity: Are the findings applicable to people’s everyday, natural social settings?

Five Major Research Designs

  1. Experimental: True experiments are unusual but strong in internal validity. They require manipulation of the independent variable.
    • Classical experimental design: Two groups (experimental and control) with random assignment. It is the foundation of the randomized controlled trial (RCT).
    • Laboratory experiment: High researcher control but may lack mundane realism, though it can have experimental realism.
    • Quasi-experiment: Has some characteristics of experimental designs but does not fulfill all internal validity requirements.
  2. Cross-Sectional: Often called a survey design. It involves collecting data on more than one case at a single point in time, usually quantitative data, to find patterns of association.
    • Reliability: Depends on the quality of measures.
    • Replicability: Likely to be present.
    • Internal validity: Typically weak, as it is difficult to establish causal direction.
    • External validity: Strong when the sample is randomly selected.
    • Ecological validity: Research instruments can disrupt the “natural habitat.”
  3. Longitudinal: A sample is surveyed on at least two occasions. It is associated with quantitative research.
    • Panel study: The same sample (e.g., households, organizations) is the focus of data collection on at least two occasions.
    • Cohort study: A sample of people who share a certain characteristic (e.g., born in the same week) is selected for data collection.
  4. Case Study: A detailed and intensive analysis of a single case, focusing on its complexity and particular nature. It often favors qualitative methods and is associated with a location. The approach can be inductive (qualitative) or deductive (quantitative).
    • Critical case: Chosen to understand when a well-developed theory will or will not hold.
    • Extreme or unique case: Chosen for its intrinsic and unique interest.
    • Representative, typical, or exemplifying case: Captures the circumstances of an everyday situation.
    • Revelatory case: An opportunity to analyze a phenomenon previously inaccessible to scientific investigation.
    • Longitudinal case: Investigated at two or more points in time.
  5. Comparative: Entails studying two contrasting cases using more or less identical methods. It embodies the logic of comparison to better understand social phenomena.

Key Points

  1. There is an important distinction between a research method and a research design.
  2. It is necessary to be familiar with the criteria for evaluating research: reliability, validity (measurement, internal, external, ecological), and replicability.
  3. It is necessary to be familiar with the five major research designs: experimental, cross-sectional, longitudinal, case study, and comparative.
  4. There are various potential threats to internal validity in non-experimental research.
  5. The case study has several forms, and it is important to be aware of issues concerning its external validity (generalizability).

Big Data for Development (UN Global Pulse 2012)

Google Flu Trends was created to predict reports from the Centers for Disease Control and Prevention (CDC) using big data. Two issues contributed to its mistakes:

  1. Big Data Hubris: The assumption that big data is a substitute for traditional data. Given the small number of cases being fitted, the odds of finding search terms that track flu prevalence but are structurally unrelated to it are high, and such terms do not predict the future. In other words, the model overfit a small number of cases, a standard issue in data analysis. The method missed the nonseasonal 2009 influenza pandemic, which suggests it was part flu detector and part winter detector. Its errors are not randomly distributed: their magnitude varies with the season, which is information that traditional data could provide.
  2. Algorithm Dynamics: All empirical research stands on a foundation of measurement, so an unstable measuring instrument undermines the results. Algorithm dynamics are the changes made by engineers to improve the commercial service and by consumers in using that service, and they affected the Google search algorithm. The most common explanation for the error in Google Flu Trends (GFT) was a media-stoked panic during the preceding flu season, but this cannot explain why GFT kept missing by wide margins. The Google search algorithm is not static; it is constantly being tested and improved.
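
The overfitting problem described under Big Data Hubris can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the Google Flu Trends model: the "flu" target and every candidate "search term" series are pure random noise, and the function names, the number of candidates, and the series lengths are arbitrary choices made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_match(n_candidates=1000, n_train=20, n_test=20, seed=0):
    """Pick the noise series that best matches a noise target in-sample.

    Returns the in-sample correlation of the winning series and its
    correlation with the target on held-out data.
    """
    rng = random.Random(seed)
    total = n_train + n_test
    target = [rng.gauss(0, 1) for _ in range(total)]
    best_r, best_series = -1.0, None
    for _ in range(n_candidates):
        series = [rng.gauss(0, 1) for _ in range(total)]
        r = pearson(series[:n_train], target[:n_train])
        if r > best_r:
            best_r, best_series = r, series
    holdout_r = pearson(best_series[n_train:], target[n_train:])
    return best_r, holdout_r


in_sample_r, holdout_r = best_spurious_match()
print(f"best in-sample correlation: {in_sample_r:.2f}")
print(f"same series on held-out data: {holdout_r:.2f}")
```

Because the winning series is selected from many candidates using only a few data points, its in-sample correlation is high by chance alone, while nothing ties it to the target on new data. Checking a selected predictor against held-out or traditional data is one way to catch this.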

Big Data Opportunities and Challenges

Big data offers enormous possibilities for understanding human interactions at a societal scale, with rich spatial and temporal dynamics, and for detecting complex interactions and nonlinearities among variables. However, traditional “small data” often offer information not contained in big data. The very factors that have enabled big data are also enabling more traditional data collection.

Transparency and Replicability

Replication is a growing concern across academia. Scientists need to replicate findings using these data sources across time and with other data sources to ensure they are observing robust patterns and not evanescent trends. Science is a cumulative endeavor; scientists must be able to continually assess the work on which they are building. A granular view can provide powerful input into generative models of flu propagation and more accurate predictions months ahead of time.

Big Data Definition and Strategy

Global Pulse's three-fold strategy consists of:

  1. Researching innovative methods for analyzing real-time digital data to detect emerging vulnerabilities.
  2. Assembling free and open-source technology tools for analyzing real-time data and sharing hypotheses.
  3. Establishing a global network of Pulse Labs.

“Big Data” describes a massive volume of both structured and unstructured data that is difficult to process with traditional databases. It is often defined by the “3 V’s”: more volume, variety, and velocity. It comes from everywhere and is huge in scope and power. When a systemic shock occurs, it prompts individuals to react in roughly similar ways, and these reactions can be measured across different data sources.

Taxonomy of Digital Data Sources

Global Pulse has developed a loose taxonomy of new, digital data sources relevant to global development:

  1. Data Exhaust: Passively collected transactional data from people’s use of digital services. These services create networked sensors of human behavior.
  2. Online Information: Web content and usage considered as a sensor of human intent, sentiments, perceptions, and wants.
  3. Physical Sensors: Satellite data focusing on remote sensing of changes in human activity.
  4. Citizen Reporting or Crowd-sourced Data: While not passively produced, this is a key information source for verification and feedback.