GPU in EDU | Mid-Atlantic HPC User Group

GPU in EDU Track:

9:15 AM – 3:20 PM

[PRESENTATION] Ensuring Reproducibility in AI: Declarative Tools for Managing Software Stacks and Workflows

Presenter: Dr. Christopher S. Simmons

As AI workloads increasingly move into academic research environments, the challenge of maintaining reproducible computational environments becomes critical for scientific rigor, collaboration, and scalability. This presentation addresses a fundamental problem in scientific computing: how to effectively manage the complex software dependencies that modern AI workflows require while ensuring portability across different systems. This talk will explore three complementary approaches to declarative software stack management: conda/mamba for cross-language package management, Spack for HPC-optimized builds, and containerization technologies including Docker, Singularity, and Apptainer.

The session will cover best practices for version-controlling software environments, avoiding common pitfalls such as the metadata load that Anaconda installations, with their many small files, place on parallel filesystems, and leveraging modern tools like repo2docker to build reproducible environments automatically. By the end of this presentation, participants will be equipped to select the most appropriate tool for their specific research needs and implement reproducible software management practices that enhance collaboration and ensure long-term accessibility of their computational work.

Target Audience: Academic researchers currently managing software environments on HPC systems via command line who possess basic Linux proficiency and seek to implement more robust, reproducible software management practices for their AI and computational research workflows.

[TUTORIAL] The Reproducible Data Pipeline Playbook: Using GitOps and DataOps to Scale AI and Scientific Research

Presenter: Dr. Christopher S. Simmons

Building on the declarative and stateless principles introduced in software stack management, this session tackles the equally critical challenge of data reproducibility in AI workflows. While version-controlling code has become standard practice, most researchers still manage datasets through ad-hoc file copying and manual organization, an approach that breaks down quickly at the hundreds of gigabytes to tens of terabytes typical of modern AI and scientific computing. This presentation introduces the concept of "treating data like code" through GitOps and DataOps methodologies, extending the same declarative principles that ensure software reproducibility to data management.

Attendees will explore a progressive toolkit for data versioning, starting with Git LFS for seamless integration with existing Git workflows, advancing to DVC (Data Version Control) for sophisticated management of ML pipelines, and culminating with LakeFS for enterprise-scale data lake versioning with branching and merging capabilities.

The session will demonstrate how S3-compatible object storage, including resources like the Open Storage Network (OSN) available to academic researchers, serves as the foundation for scalable, collaborative data management that transcends institutional boundaries. By the end of this talk, participants will be equipped to implement basic data versioning workflows immediately and understand the pathway to more sophisticated data management as their research scales.

Target Audience: Academic researchers working with substantial datasets in AI and scientific computing who seek to apply software development best practices to data management and establish reproducible, collaborative data workflows that complement their existing Git-based development processes.