Certification Exam Guide

Section 1: Designing data processing systems

1.1 - Designing flexible data representations.

Considerations include:

  • future advances in data technology
  • changes to business requirements
  • awareness of current state and how to migrate the design to a future state
  • data modeling
  • tradeoffs
  • distributed systems
  • schema design

1.2 - Designing data pipelines.

Considerations include:

  • future advances in data technology
  • changes to business requirements
  • awareness of current state and how to migrate the design to a future state
  • data modeling
  • tradeoffs
  • system availability
  • distributed systems
  • schema design
  • common sources of error (e.g., removing selection bias)

1.3 - Designing data processing infrastructure.

Considerations include:

  • future advances in data technology
  • changes to business requirements
  • awareness of current state and how to migrate the design to a future state
  • data modeling
  • tradeoffs
  • system availability
  • distributed systems
  • schema design
  • capacity planning
  • different types of architectures: message brokers, message queues, middleware, service-oriented

Section 2: Building and maintaining data structures and databases

2.1 - Building and maintaining flexible data representations.

2.2 - Building and maintaining pipelines.

Considerations include:

  • data cleansing
  • batch and streaming
  • transformation
  • acquiring and importing data
  • testing and quality control
  • connecting to new data sources

2.3 - Building and maintaining processing infrastructure.

Considerations include:

  • provisioning resources
  • monitoring pipelines
  • adjusting pipelines
  • testing and quality control

Section 3: Analyzing data and enabling machine learning

3.1 - Analyzing data.

Considerations include:

  • data collection and labeling
  • data visualization
  • dimensionality reduction
  • data cleaning/normalization
  • defining success metrics

3.2 - Machine learning.

Considerations include:

  • feature selection/engineering
  • algorithm selection
  • debugging a model

3.3 - Machine learning model deployment.

Considerations include:

  • performance/cost optimization
  • online/dynamic learning

Section 4: Modeling business processes for analysis and optimization

4.1 - Mapping business requirements to data representations.

Considerations include:

  • working with business users
  • gathering business requirements

4.2 - Optimizing data representations, data infrastructure performance and cost.

Considerations include:

  • resizing and scaling resources
  • data cleansing, distributed systems
  • high performance algorithms
  • common sources of error (e.g., removing selection bias)

Section 5: Ensuring reliability

5.1 - Performing quality control.

Considerations include:

  • verification
  • building and running test suites
  • pipeline monitoring

5.2 - Assessing, troubleshooting, and improving data representations and data processing infrastructure.

5.3 - Recovering data.

Considerations include:

  • planning (e.g., fault tolerance)
  • executing (e.g., rerunning failed jobs, performing retrospective re-analysis)
  • stress testing data recovery plans and processes

Section 6: Visualizing data and advocating policy

6.1 - Building (or selecting) data visualization and reporting tools.

Considerations include:

  • automation
  • decision support
  • data summarization (e.g., translation up the chain, fidelity, trackability, integrity)

6.2 - Advocating policies and publishing data and reports.

Section 7: Designing for security and compliance

7.1 - Designing secure data infrastructure and processes.

Considerations include:

  • Identity and Access Management (IAM)
  • data security
  • penetration testing
  • Separation of Duties (SoD)
  • security control

7.2 - Designing for legal compliance.

Considerations include:

  • legislation (e.g., Health Insurance Portability and Accountability Act (HIPAA), Children’s Online Privacy Protection Act (COPPA))
  • audits