Posted 5d ago

Internship: Testing Data Migrations with Synthetic Data: AI-powered approach

@ ELCA
Pully, Canton of Vaud, Switzerland
Onsite, Full Time
Responsibilities: Design strategy, Generate test datasets, Evaluate approaches
Requirements Summary: Strong Python, SQL, data modeling; experience with LLMs, data anonymization, CI/CD; problem-solving and documentation skills.
Technical Tools Mentioned: Python, SQL, CI/CD, data modeling, LLMs, agentic systems
Description

Data platform migrations, which move workloads from legacy systems to modern infrastructure while preserving business logic, are common in enterprise environments. The technical challenge isn't just syntax translation; it's validation. When developers migrate SQL scripts or data pipelines between platforms, they face different execution environments, modified data access permissions, and no safe way to test against production data.

This internship tackles synthetic data generation for migration script testing. You'll design and implement a system that generates realistic test datasets mirroring production structure and behavior without exposing sensitive information. Approaches vary: the result could be a small dataset living in a git repository or a full-fledged synthetic data warehouse. Either way, the data must be realistic enough to catch real bugs.
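As a rough illustration of the "small dataset" end of that spectrum, here is a minimal sketch that builds a seeded, in-memory SQLite table from a schema description. The `customers` table, its columns, and the value generators are hypothetical and not taken from any real system; a real project would derive them from schema documentation or query analysis.

```python
import random
import sqlite3

# Hypothetical schema description: column name -> (SQL type, value generator).
# Each generator receives a seeded RNG and the 1-based row index.
SCHEMA = {
    "customer_id": ("INTEGER PRIMARY KEY", lambda rng, i: i),
    "segment": (
        "TEXT NOT NULL",
        lambda rng, i: rng.choice(["retail", "corporate", "public"]),
    ),
    # Roughly 10% NULLs and some negative values, so edge cases are present.
    "balance": (
        "REAL",
        lambda rng, i: None if rng.random() < 0.1 else round(rng.uniform(-500, 5000), 2),
    ),
}


def build_synthetic_db(n_rows: int, seed: int = 42) -> sqlite3.Connection:
    """Create an in-memory SQLite database filled with reproducible fake rows."""
    rng = random.Random(seed)  # fixed seed keeps the dataset stable across CI runs
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"{name} {sql_type}" for name, (sql_type, _) in SCHEMA.items())
    conn.execute(f"CREATE TABLE customers ({cols})")
    placeholders = ", ".join("?" for _ in SCHEMA)
    rows = [
        tuple(gen(rng, i) for _, gen in SCHEMA.values())
        for i in range(1, n_rows + 1)
    ]
    conn.executemany(f"INSERT INTO customers VALUES ({placeholders})", rows)
    conn.commit()
    return conn
```

Because the generator is seeded, the same dataset can be rebuilt on every test run instead of being stored, which is one way to keep a git-hosted fixture small.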

The challenge goes beyond simple data mocking. You'll need to decide whether to generate from real data (anonymization risks), from query analysis alone (requires good documentation), or from a hybrid of the two. Should categorical values match production exactly, or can we substitute them and adapt the scripts? Can we extend unit testing to end-to-end testing, and what dataset properties would that require?
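To make the "substitute and adapt the scripts" option concrete, here is a minimal sketch, assuming categorical values appear in scripts only as plain single-quoted literals. The function names and the `CAT_` prefix are hypothetical, and the regex deliberately ignores escaped quotes and comments, which a real SQL parser would have to handle.

```python
import re


def build_mapping(values, prefix="CAT"):
    """Map each distinct categorical value to a stable surrogate token."""
    # Sorting makes the mapping deterministic regardless of input order.
    return {v: f"{prefix}_{i:03d}" for i, v in enumerate(sorted(set(values)), start=1)}


def substitute_rows(rows, column, mapping):
    """Return copies of the rows with the given column replaced by its surrogate."""
    return [{**row, column: mapping[row[column]]} for row in rows]


def adapt_script(sql, mapping):
    """Rewrite single-quoted literals that exactly match a mapped value.

    Simplification: does not handle escaped quotes ('') or SQL comments.
    """
    def repl(match):
        literal = match.group(1)
        return "'" + mapping.get(literal, literal) + "'"

    return re.sub(r"'([^']*)'", repl, sql)


rows = [
    {"id": 1, "segment": "Gold"},
    {"id": 2, "segment": "Silver"},
    {"id": 3, "segment": "Gold"},
]
mapping = build_mapping(row["segment"] for row in rows)
safe_rows = substitute_rows(rows, "segment", mapping)
safe_sql = adapt_script("SELECT id FROM t WHERE segment = 'Gold' AND note = 'other'", mapping)
```

Applying the same mapping to data and scripts keeps filters and joins consistent, at the cost of having to rewrite every script that references a substituted value.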

Part of the work involves establishing an evaluation methodology, potentially by collecting a reference set of migration scripts and their expected behaviors to measure how well different synthetic data approaches catch real issues. There's potential to explore multi-agent architectures where specialized agents handle different aspects: schema analysis, constraint extraction, data generation, anonymization verification, and test validation. This is applied research with immediate production impact.
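One possible shape for such an evaluation, sketched under assumptions: take pairs of legacy and migrated queries with a known regression, run both against a candidate dataset, and count how many regressions produce a visible difference. The `orders` table, the two query pairs, and the "detection rate" metric are all hypothetical examples, not an established methodology.

```python
import sqlite3


def make_db(rows):
    """Load candidate test rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def results_differ(conn, legacy_sql, migrated_sql):
    """True if the two queries return different result sets (order-insensitive)."""
    legacy = sorted(conn.execute(legacy_sql).fetchall(), key=repr)
    migrated = sorted(conn.execute(migrated_sql).fetchall(), key=repr)
    return legacy != migrated


# Hypothetical reference set: each migrated query contains a known regression.
BUGGY_PAIRS = [
    # NULL handling lost during migration.
    ("SELECT COUNT(*) FROM orders WHERE COALESCE(amount, 0) <= 100",
     "SELECT COUNT(*) FROM orders WHERE amount <= 100"),
    # Off-by-one on a boundary comparison.
    ("SELECT COUNT(*) FROM orders WHERE amount >= 100",
     "SELECT COUNT(*) FROM orders WHERE amount > 100"),
]


def detection_rate(rows):
    """Fraction of known regressions that the dataset makes observable."""
    conn = make_db(rows)
    caught = sum(results_differ(conn, old, new) for old, new in BUGGY_PAIRS)
    return caught / len(BUGGY_PAIRS)


naive_rows = [(1, 50.0), (2, 150.0)]
edge_rows = naive_rows + [(3, None), (4, 100.0)]  # adds a NULL and a boundary value
```

Here `detection_rate(naive_rows)` is 0.0 and `detection_rate(edge_rows)` is 1.0: the naive dataset looks plausible but exposes neither regression, which is the kind of gap an evaluation methodology would need to surface.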

 

Objectives

  • Design a strategy for migration script testing that balances realism, anonymization, and practical constraints
  • Implement a proof-of-concept system that generates test datasets from schema documentation, existing queries, or (carefully) sampled production data
  • Define testing strategies: unit tests vs. end-to-end tests, minimum viable data sizes, etc.
  • Develop an evaluation methodology to measure the effectiveness of different synthetic data generation approaches
  • Explore multi-agent architectures for decomposing the generation pipeline into specialized components (schema analysis, constraint satisfaction, validation)
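The decomposition in the last objective could be prototyped as plain functions passing a shared context before any LLM or agent framework is involved. The sketch below is only that: the stage names, the `Context` fields, and the naive DDL parsing (which assumes one simple `CREATE TABLE` with no commas inside type definitions) are hypothetical stand-ins for specialized agents.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Context:
    """Shared state handed from one pipeline stage to the next."""
    ddl: str
    column_defs: dict = field(default_factory=dict)  # column name -> definition text
    not_null: set = field(default_factory=set)
    rows: list = field(default_factory=list)
    violations: list = field(default_factory=list)


def analyze_schema(ctx):
    """Stage 1: extract column definitions from a simple CREATE TABLE statement."""
    body = ctx.ddl[ctx.ddl.index("(") + 1 : ctx.ddl.rindex(")")]
    for part in body.split(","):
        name, *rest = part.split()
        ctx.column_defs[name] = " ".join(rest).upper()
    return ctx


def extract_constraints(ctx):
    """Stage 2: record which columns are declared NOT NULL."""
    ctx.not_null = {n for n, d in ctx.column_defs.items() if "NOT NULL" in d}
    return ctx


def generate_rows(ctx, n_rows=20, seed=0):
    """Stage 3: generate rows, injecting NULLs only into nullable columns."""
    rng = random.Random(seed)
    for i in range(n_rows):
        row = {}
        for name, definition in ctx.column_defs.items():
            if name not in ctx.not_null and rng.random() < 0.3:
                row[name] = None
            elif definition.startswith("INTEGER"):
                row[name] = rng.randint(1, 1000)
            else:
                row[name] = f"{name}_{i}"
        ctx.rows.append(row)
    return ctx


def validate(ctx):
    """Stage 4: independently check the generated rows against the constraints."""
    ctx.violations = [
        (i, name)
        for i, row in enumerate(ctx.rows)
        for name in ctx.not_null
        if row[name] is None
    ]
    return ctx


def run_pipeline(ddl):
    ctx = Context(ddl=ddl)
    for stage in (analyze_schema, extract_constraints, generate_rows, validate):
        ctx = stage(ctx)
    return ctx


ctx = run_pipeline("CREATE TABLE accounts (id INTEGER NOT NULL, owner TEXT NOT NULL, note TEXT)")
```

Keeping validation as a separate stage means it can later be swapped for a stricter checker, or used to reject output from a less predictable generator, without touching the other stages.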

 

Our offer

  • A dynamic, collaborative work environment with a highly motivated, multicultural team spread across international sites
  • The chance to make a difference in people's lives by building innovative solutions
  • Various internal coding events (hackathons, brown-bag sessions); see our technical blog
  • Monthly after-work events organized at each location

 

Skills required

  • Strong Python programming: data processing, testing patterns, CI/CD integration
  • Understanding of relational databases, SQL, and data modeling concepts
  • Experience with LLMs and agentic systems: prompting, tool use, multi-agent orchestration
  • Familiarity with data security and data anonymization concepts
  • Problem-solving mindset: comfort with ambiguous requirements and making justified technical trade-offs
  • Clear technical writing and documentation skills


     

Company

We are ELCA, one of the largest Swiss IT tribes, with over 2,300 experts. We are multicultural, with offices in Switzerland, Spain, France, Vietnam and Mauritius. Since 1968, our team of engineers, business analysts, software architects, designers and consultants has provided tailor-made and standardized solutions to support the digital transformation of major public administrations and private companies in Switzerland. Our activity spans multiple fields of leading-edge technology, such as AI, machine and deep learning, BI/BD, RPA, blockchain, IoT and cybersecurity.