AI Data Engineer
Remote work
Full-time
The Role
We're looking for an AI Data Engineer to build and maintain the data infrastructure powering our AI-driven healthcare platform. This role focuses on implementing robust data pipelines, managing our data lakehouse architecture, and ensuring high-quality data processing for our AI systems.
Responsibilities:
- Design and implement scalable data pipelines for diverse healthcare data sources
- Build and maintain data lakehouse architecture on AWS for storing structured and unstructured medical data
- Create efficient ETL processes for handling medical transcriptions, clinical documentation, and practice data
- Implement data quality monitoring systems and validation frameworks
- Develop and maintain data crawlers for collecting domain-specific medical content
- Support RAG system implementation with optimized data storage and retrieval mechanisms
Ideal Candidate:
- Strong experience with AWS data services (S3, RDS, Glue, EMR Serverless, Athena, DataZone, Lake Formation, DynamoDB)
- Expertise in data orchestration tools (Dagster, Apache Airflow, AWS MWAA, Step Functions)
- Proficiency in Python, SQL, and PySpark with experience in data processing frameworks
- Experience with data lakehouse architectures, ETL pipeline development, and SageMaker Feature Store
- Strong background with AWS analytics services (Glue Catalog, Glue ETL/EMR Serverless, Athena)
- Experience with the Apache Iceberg table format for organizing data in a data lakehouse architecture, including time travel, ACID transactions, and schema evolution
- Experience with PostgreSQL and vector databases (pgvector, OpenSearch, etc.)
- Proficiency in data transformation tools like dbt
- Experience implementing data quality frameworks (Great Expectations, Glue Data Quality, PyDeequ)
- Knowledge of healthcare data structures and medical terminology preferred
- Experience with data preprocessing for LLM applications (NLP libraries like spaCy, web scraping tools, text extraction, semantic chunking, etc.) strongly preferred
- Understanding of data security and HIPAA compliance requirements
- Collaborative mindset and ability to work in a fast-paced startup environment
- Bachelor's degree in Computer Science, Engineering, or a related field
Maria Bilo