Working with Big Data in R (In-Person)

This intermediate-level workshop provides social science researchers with essential skills for analyzing datasets that exceed typical computer memory limitations. Participants will learn to distinguish between datasets and databases, implement efficient data storage solutions using Apache Arrow and Parquet files, and build robust Extract-Transform-Load (ETL) pipelines for large-scale data processing. The workshop covers partitioning strategies for optimal performance, writing custom functions using both the dplyr API for Acero and SQL syntax, and creating local analytical databases with DuckDB. Through hands-on exercises using real voter file data, researchers will develop practical skills in out-of-core processing, database management, and scalable data analysis workflows. The workshop also emphasizes best practices for data quality assurance and statistical considerations when working with large datasets, addressing common challenges such as computational efficiency, memory management, and maintaining data integrity across complex processing pipelines.

Prerequisites: The frameworks and packages used in this workshop are designed to be used with tidy syntax. We will use chained operations and high-level control structures, and we will write custom functions. Working proficiency in both R and tidy code is strongly recommended.

Please note that registrants for the "Working with Big Data in R" workshop will need access to the L2 Political dataset. To access this data, please complete the required form at least one week prior to the workshop date.

Date:
Tuesday, February 24, 2026
Time:
9:00am - 11:30am
Time Zone:
Eastern Time - US & Canada
Location:
RKZ Library Classroom 01
Campus:
Science Hill
Categories:
Marx Science and Social Science Library, StatLab

Registration is required. There are 3 seats available.

Event Organizer

Ted Ellsworth