BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//programme.europython.eu//europython-2024//talk//7DF7VC
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-europython-2024-7DF7VC@programme.europython.eu
DTSTART;TZID=CET:20240710T121000
DTEND;TZID=CET:20240710T125500
DESCRIPTION:Selecting the optimal text embedding model is often guided by b
 enchmarks such as the Massive Text Embedding Benchmark (MTEB). While choos
 ing the best model from the leaderboard is a common practice\, it may not 
 always align perfectly with the unique characteristics of your specific da
 taset. This approach overlooks a crucial yet frequently underestimated ele
 ment - the tokenizer.\n\nWe will delve deep into the tokenizer's fundament
 al role\, shedding light on its operations and introducing straightforward
  techniques to assess whether a particular model is suited to your data ba
 sed solely on its tokenizer. We will explore the significance of the token
 izer in the fine-tuning process of embedding models and discuss strategic 
 approaches to optimize its effectiveness.
DTSTAMP:20260418T103514Z
LOCATION:North Hall
SUMMARY:Deconstructing the text embedding models - Kacper Łukawski
URL:https://programme.europython.eu/europython-2024/talk/7DF7VC/
END:VEVENT
END:VCALENDAR
