StatLab Articles

How to Apply a Graduated Color Symbology to a Layer Using Python for QGIS 3

I was recently working on a project in QGIS 3 with a member of UVA Health's Oncology department. This person wanted to take a set of patient data (after identifying info had been removed) and after doing some other stuff, apply a graduated color scheme to the results, shading them from light to dark based on intensity.

You can find a sample dataset for this project here:

https://github.com/epurpur/PyQGIS-Scripts/blob/master/TestZipCodes.zip

Python, visualization, QGIS, Erich Purpur

How to Use the Field Calculator in Python for QGIS 3

Recently, I have taken the dive into python scripting in QGIS. QGIS is a really nice open source (and free!) alternative to ESRI's ArcGIS. While QGIS is a little quirky and generally not quite as user friendly as ArcGIS, it still provides nearly the same functionality. Personally, I've become a fan of it and now have even taught a short, 1 credit course in the University of Virginia's Batten School of Public Policy titled: GIS for Public Policy.

Python, data wrangling, QGIS, Erich Purpur

A Beginner's Guide to Text Analysis with quanteda

A lot of introductory tutorials to quanteda assume that the reader has some base of knowledge about the program's functionality or how it might be used. Other tutorials assume that the user is an expert in R and on what goes on under the hood when you're coding. This introductory guide will assume none of that. Instead, I'm presuming a very basic understanding of R (like how to assign variables) and that you've just heard of quanteda for the first time today.

quanteda, R, text analysis, text mining, Leah Malkovich

Assessing Type S and Type M Errors

The paper Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors by Andrew Gelman and John Carlin introduces the idea of performing design calculations to help prevent researchers from being misled by statistically significant results in studies with small samples and/or noisy measurements.

R, power analysis, statistical methods, type m error, type s error, Clay Ford

Interpreting Log Transformations in a Linear Model

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the right due to Alaska, California, Texas and a few others.

R, linear regression, statistical methods, log transformations, diagnostic plots, Clay Ford

Getting Started with Matching Methods

Note: This article demonstrates how to use propensity scores for matching data. However, propensity scores have come under fire in recent years. In their 2019 article, Why Propensity Scores Should Not Be Used for Matching, King and Nielsen argue that propensity scores increase imbalance, inefficiency, model dependence, and bias. Others argue that while propensity scores may be sub-optimal, they can be useful in certain situations.

R, statistical methods, matching, propensity scores, Clay Ford

Getting Started with Moderated Mediation

In a previous post we demonstrated how to perform a basic mediation analysis. In this post we look at performing a moderated mediation analysis. The basic idea is that a mediator may depend on another variable called a "moderator". For example, in our mediation analysis post we hypothesized that self-esteem was a mediator of student grades on the effect of student happiness. We illustrate this below with a path diagram.

R, statistical methods, mediation, Clay Ford

Getting started with Multivariate Multiple Regression

Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does.

R, statistical methods, MANOVA, multivariate multiple regression, Clay Ford

Visualizing the Effects of Proportional-Odds Logistic Regression

Proportional-odds logistic regression is often used to model an ordered categorical response. By "ordered", we mean categories that have a natural ordering, such as "Disagree", "Neutral", "Agree", or "Everyday", "Some days", "Rarely", "Never". For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model.

R, effect plots, statistical methods, visualization, proportional odds logistic regression, Clay Ford

Getting Started with the purrr Package in R

If you're wondering what exactly the purrr package does, then this blog post is for you.

R, data wrangling, computational methods, purrr, Clay Ford