Understanding Ordered Factors in a Linear Model
Consider the following data from the text Design and Analysis of Experiments, 7th ed. (Montgomery, 2009, Table 3.1). It has two variables: power and rate. power is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. rate is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate.
Ask Better Code Questions (and Get Better Answers) With Reprex
Note: This article was written about version 2.0.0 of the reprex package.
Getting Started with Generalized Estimating Equations
Generalized estimating equations, or GEE, is a method for modeling longitudinal or clustered data. It is usually used with non-normal data such as binary or count data. The name refers to a set of equations that are solved to obtain parameter estimates (i.e., model coefficients). If interested, see Agresti (2002) for the computational details. In this article we simply aim to get you started with implementing and interpreting GEE using the R statistical computing environment.
Getting Started with Binomial Generalized Linear Mixed Models
Binomial generalized linear mixed models, or binomial GLMMs, are useful for modeling binary outcomes for repeated or clustered measures. For example, let’s say we design a study that tracks what college students eat over the course of 2 weeks, and we’re interested in whether or not they eat vegetables each day. For each student, we’ll have 14 binary events: eat vegetables or not.
Getting Started with Web Scraping in Python
"Web scraping," or "data scraping," is simply the process of extracting data from a website. This can, of course, be done manually: You could go to a website, find the relevant data or information, and enter that information into some data file that you have stored locally. But imagine that you want to pull a very large dataset or data from hundreds or thousands of individual URLs. In this case, extracting the data manually sounds overwhelming and time-consuming.
A Brief on Brier Scores
Not all predictions are created equal, even if, in categorical terms, the predictions suggest the same outcome: “X will (or won’t) happen.” Say that I estimate that there’s a 60% chance that 100 million COVID-19 vaccines will be administered in the US during the first 100 days of Biden’s presidency, but my friend estimates that there’s a 90% chance of that outcome.
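To make the comparison concrete, here is a minimal sketch of how the two forecasts could be scored. It assumes the usual definition of the Brier score as the mean squared difference between forecast probabilities and observed 0/1 outcomes; the function name brier_score is a hypothetical helper, not taken from the article.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    Lower is better: 0 is a perfect forecast, 1 is the worst possible.
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# The two forecasts from the example above, scored under both possible outcomes.
my_forecast = [0.6]
friend_forecast = [0.9]

for label, outcome in [("event happens", [1]), ("event does not happen", [0])]:
    mine = brier_score(my_forecast, outcome)
    friend = brier_score(friend_forecast, outcome)
    print(f"{label}: mine={mine:.2f}, friend={friend:.2f}")
```

If the event happens, the 90% forecast scores about 0.01 against about 0.16 for the 60% forecast. If it does not happen, the ordering flips: about 0.81 against about 0.36. The more confident forecast is rewarded more when right and penalized more when wrong.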
Getting Started with pandas in Python
The pandas package is an open-source software library written for data analysis in Python. Pandas allows users to import data from various file formats (comma-separated values, JSON, SQL, fits, etc.) and perform data manipulation operations, including cleaning and reshaping the data, summarizing observations, grouping data, and merging multiple datasets. In this article, we'll briefly explore some of the most commonly used functions and methods for understanding, formatting, and visualizing data with the pandas package.
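As a rough illustration of the operations listed above (importing, cleaning, grouping, and merging), the following sketch uses a tiny made-up dataset. The column names, values, and labels are hypothetical and exist only to make the example self-contained; they do not come from the article.

```python
import io

import pandas as pd

# Hypothetical CSV content, read from a string instead of a file on disk.
csv_text = "student,group,score\nA,x,90\nB,x,\nC,y,70\nD,y,80\n"
scores = pd.read_csv(io.StringIO(csv_text))

# Cleaning: drop rows where the score is missing (student B).
cleaned = scores.dropna(subset=["score"])

# Grouping and summarizing: mean score within each group.
group_means = cleaned.groupby("group")["score"].mean()

# Merging: attach a hypothetical lookup table of group labels.
labels = pd.DataFrame({"group": ["x", "y"], "label": ["control", "treatment"]})
merged = cleaned.merge(labels, on="group", how="left")

print(group_means)
print(merged)
```

Reading from io.StringIO keeps the sketch runnable without any external file; with real data, the same pd.read_csv call would take a file path instead.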
Understanding Multiple Comparisons and Simultaneous Inference
When it comes to confidence intervals and hypothesis testing there are two important limitations to keep in mind.
The significance level, \(\alpha\), or the confidence interval coverage, \(1 - \alpha\),
Data Scientist as Cartographer: An Introduction to Making Interactive Maps in R with Leaflet
Note: This version of the article contains static images of maps generated with Leaflet. You can view a version with interactive maps here.
Understanding Robust Standard Errors
What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.