EuroPython 2025

Tal Sofer

Tal Sofer is a product manager at Treeverse, the company behind lakeFS, an open-source platform that delivers a Git-like experience to object-storage-based data lakes. Tal is a former engineering manager who led engineering teams building scalable tools for developers and started her journey at Treeverse as an R&D team lead.

Tal holds a B.Sc. in Computer Science and Chinese studies from the Hebrew University of Jerusalem. In her free time you can find her running, cooking, or brushing up on her Chinese.


Session

07-16
10:45
45min
Computer Vision Data Version Control and Reproducibility at Scale
Tal Sofer, Itai Gilo

Petabytes of unstructured data are the foundation on which successful Machine Learning (ML) models are built. To train models, researchers commonly extract subsets of that data to their local environments by simply copying and pasting it. This approach allows for iterative experimentation, but it also makes data management inefficient when developing ML models: experiments are hard to reproduce, data transfer is inefficient, and local compute power is limited.

Data version control technologies can help computer vision researchers overcome these challenges. In this workshop we'll cover:

  • How to use open source tooling to version control your data when working with data locally.
  • Best practices for working with data that avoid the need to copy data locally while enabling the training of models at scale directly on the cloud. This will be demoed with an OSS stack:
    - LangChain
    - TensorFlow
    - PyTorch
    - Keras

You will come away with practical methods, built for modern computer vision research, to improve your data management when developing and iterating on Machine Learning models.

Machine Learning: Research & Applications
South Hall 2B