Essential Statistical Concepts: Data Analysis and Modeling
Statistics: techniques for collecting, organizing, analysing, and interpreting data.
Data may be:
- quantitative: values expressed numerically
- qualitative: characteristics being tabulated
Descriptive statistics: techniques that summarize and describe numerical data to make interpretation easier; they can be graphical or involve computational analysis.
Inferential statistics: techniques by which decisions about a statistical population or process are made based only on an observed sample, using probability concepts.
VARIABLES:
Time Series Analysis and Regression Modeling in R
R Setup and Initial Data Handling
Setting the working directory:
setwd("/Users/hajdumarcell/Downloads/Öko. II. R Jegyzet")
Data inspection and preparation:
str(Titanic)
PS4$Date <- as.Date(PS4$Date)
Basic visualization using ggplot2:
ggplot(PS4, aes(x=Date, y=Google_PS4)) + geom_line()
Regression Modeling with Dummy Variables
The general regression model structure, including trend ($t$) and quarterly dummy variables ($DQ$):
$$Y_t = \beta_0 + \beta_1 \times t + \beta_2 \times DQ_1 + \beta_3 \times DQ_
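To illustrate a model of this shape, the sketch below fits a linear trend plus quarterly dummies by ordinary least squares in plain Python. It assumes three dummies (DQ1 to DQ3, with the fourth quarter as the baseline), which the truncated formula does not confirm. The data and the helper names `design_row`, `solve`, and `fit_ols` are hypothetical and are not part of the original R notes.

```python
def design_row(t):
    """Design-matrix row for period t (t = 1, 2, ...), with quarters cycling 1..4."""
    quarter = (t - 1) % 4 + 1
    # Assumption: three dummies, with Q4 as the baseline category.
    dummies = [1.0 if quarter == q else 0.0 for q in (1, 2, 3)]
    return [1.0, float(t)] + dummies


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(y):
    """Estimate the coefficients from the normal equations (X'X) beta = X'y."""
    X = [design_row(t) for t in range(1, len(y) + 1)]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical noise-free series: intercept 10, trend 0.5, Q1 +3, Q2 -1, Q3 +2.
true_beta = [10.0, 0.5, 3.0, -1.0, 2.0]
y = [sum(b * x for b, x in zip(true_beta, design_row(t))) for t in range(1, 13)]
beta_hat = fit_ols(y)
print([round(b, 4) for b in beta_hat])
```

Because the series is generated without noise, the estimates should reproduce the coefficients used to build it. With real data, each dummy coefficient reads as the average shift of that quarter relative to the baseline quarter, after accounting for the trend.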
Research Sampling Methods: Probability vs. Non-Probability Techniques
Probability and Non-Probability Sampling Methods
Probability (Random) Sampling
Probability sampling, also known as random sampling, is a method in which every member of the wider population has a known, nonzero probability of being selected; in simple random sampling, every member has an equal chance of being included. The primary aim is generalizability and wide representation.
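As a small illustration of simple random sampling, the special case in which every member has the same chance of selection, the sketch below draws a sample without replacement using Python's standard `random` module. The population of student IDs, the sample size, and the seed are all hypothetical.

```python
import random

# Hypothetical sampling frame: 1,000 student IDs.
population = list(range(1, 1001))
sample_size = 50

# A seeded generator makes the draw reproducible.
rng = random.Random(42)

# Simple random sampling without replacement: each ID has the same
# inclusion probability, sample_size / len(population).
sample = rng.sample(population, sample_size)
inclusion_probability = sample_size / len(population)

print(sorted(sample)[:5], inclusion_probability)
```

Knowing the inclusion probability in advance is what separates this from a convenience sample, where the chance of selection cannot be stated.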
Purpose and Example
- Purpose: To select a group of subjects representative of the larger population from which they are selected.
- Example: A university randomly selects
Essential Statistical Concepts and Formulas Reference
Descriptive Measures: Center and Variability
Measures of Variation
- Standard Deviation (SD): A measure of the typical distance between data points and the mean, computed as the square root of the variance. It indicates how far the data lie, on average, from the mean.
- Calculation: Find the variance and take its square root.
- Coefficient of Variation (CV): Used to compare the variability of data sets that have different means or units. Expressed as a percentage, it measures the standard deviation relative to the mean.
- Formula: CV = (Standard Deviation
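The measures above can be sketched with Python's standard `statistics` module. The two data sets are hypothetical, and the example assumes the sample versions of the variance and standard deviation (n - 1 denominator), since the notes do not say whether population or sample formulas are intended. The CV is computed as the standard deviation divided by the mean, expressed as a percentage.

```python
import statistics

# Hypothetical data sets measured on different scales.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]
weights_kg = [55.0, 60.0, 70.0, 80.0, 85.0]


def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(data) / statistics.mean(data) * 100


for name, data in [("height", heights_cm), ("weight", weights_kg)]:
    variance = statistics.variance(data)  # sample variance (n - 1 denominator)
    sd = statistics.stdev(data)           # square root of the variance
    cv = coefficient_of_variation(data)
    print(f"{name}: variance={variance:.2f}, sd={sd:.2f}, cv={cv:.1f}%")
```

Because the CV is unit-free, it allows the spread of heights in centimetres to be compared with the spread of weights in kilograms, which the raw standard deviations do not.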
Key Concepts in Probability Distributions and Statistical Analysis
Continuous Probability Distributions
A continuous distribution is a type of probability distribution in which the random variable can take any value within a given range or interval. Unlike discrete distributions, which deal with countable outcomes, continuous distributions describe quantities measured on a continuous scale, such as height, weight, temperature, or time.
These distributions are represented using a Probability Density Function (PDF). Probabilities are calculated over intervals, since the probability
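As a minimal sketch of an interval probability, the example below uses `statistics.NormalDist` from Python's standard library (available from Python 3.8). The normal model for heights and its parameters are hypothetical.

```python
from statistics import NormalDist

# Hypothetical model: heights follow a normal distribution, mean 170 cm, sd 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Probability of an interval = difference of the CDF at its endpoints,
# i.e. the area under the PDF between 160 and 180.
p_interval = heights.cdf(180) - heights.cdf(160)

# The PDF value at a point is a density, not a probability.
density_at_mean = heights.pdf(170)

print(round(p_interval, 4), round(density_at_mean, 4))
```

The interval here spans one standard deviation on either side of the mean, so the result should be close to the familiar 68% figure for a normal distribution.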
Ensemble Methods Comparison: Bagging, Boosting, and Stacking Techniques
Bagging Classifier Implementation
Base Model Performance
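from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

# Baseline: a single decision tree (X_train, y_train, X_test, y_test are assumed to be defined earlier)
base_model = DecisionTreeClassifier(random_state=42)
base_model.fit(X_train, y_train)
y_pred_base = base_model.predict(X_test)
base_recall = recall_score(y_test, y_pred_base)
print("Base model recall: {:.4f}".format(base_recall))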
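For reference, the recall metric reported above is the share of actual positives that the model identifies. The sketch below computes it by hand for a binary problem. The `recall` helper and the labels are hypothetical and are not scikit-learn's implementation, which also handles multiclass averaging and the case of no positive samples differently.

```python
def recall(y_true, y_pred, positive=1):
    """Recall = true positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


# Hypothetical labels and predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Recall: {:.4f}".format(recall(y_true, y_pred)))
```

Three of the four actual positives are predicted correctly in this toy data. False positives, such as the seventh prediction, do not affect recall.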
Hyperparameter Tuning (Grid Search)
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "bootstrap": [True]
}
bagging = BaggingClassifier(
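A grid search evaluates every combination of the values listed in `param_grid`, so it helps to count the candidates before running it. The sketch below enumerates them with the standard library only. The number of cross-validation folds is an assumption, as the source does not state it.

```python
from itertools import product

# Same grid as in the snippet above.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "bootstrap": [True],
}

# Every combination of values, one dict per candidate configuration.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: the source does not state the number of CV folds
print(len(candidates), "candidates,", len(candidates) * n_folds, "model fits")
```

Each added value multiplies the number of fits, which is why grids for ensemble models are usually kept small.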