AI prompts
base on This is a repo with links to everything you'd ever want to learn about data engineering
# The Data Engineering Handbook
This repo has all the resources you need to become an amazing data engineer!
Make sure to check out the [projects](projects.md) section for more hands-on examples!
Make sure to check out the [interviews](interviews.md) section for more advice on how to pass data engineering interviews!
## Resources
Great books:
- [Fundamentals of Data Engineering](https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302/)
- [Designing Data-Intensive Applications](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321/)
- [Designing Machine Learning Systems](https://www.amazon.com/Designing-Machine-Learning-Systems-Production-Ready/dp/1098107969)
- [The Hundred Page Machine Learning Book](https://www.amazon.com/Hundred-Page-Machine-Learning-Book/dp/199957950X)
- [Kimball - The Data Warehouse Toolkit](https://ia801609.us.archive.org/14/items/the-data-warehouse-toolkit-kimball/The%20Data%20Warehouse%20Toolkit%20-%20Kimball.pdf)
- [Data Mesh](https://www.oreilly.com/library/view/data-mesh/9781492092384/)
- [Machine Learning System Design Interview](https://www.amazon.com/Machine-Learning-System-Design-Interview/dp/1736049127)
- [Streaming Systems](https://www.amazon.com/Streaming-Systems-Where-Large-Scale-Processing/dp/1491983876)
- [High Performance Spark](https://www.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203)
- [Building Evolutionary Architectures, 2nd Edition](https://www.oreilly.com/library/view/building-evolutionary-architectures/9781492097532/)
- [Data Management at Scale, 2nd Edition](https://www.oreilly.com/library/view/data-management-at/9781098138851/)
- [Deciphering Data Architectures](https://www.oreilly.com/library/view/deciphering-data-architectures/9781098150754/)
- [97 Things Every Data Engineer Should Know: Collective Wisdom from the Experts](https://www.amazon.com/Things-Every-Data-Engineer-Should/dp/1492062413)
- [Data Governance: The Definitive Guide](https://www.oreilly.com/library/view/data-governance-the/9781492063483/)
- [Trino: The Definitive Guide](https://trino.io/trino-the-definitive-guide.html)
- [Delta Lake: The Definitive Guide](https://www.oreilly.com/library/view/delta-lake-the/9781098151935/)
- [Hadoop: The Definitive Guide](https://www.oreilly.com/library/view/hadoop-the-definitive/9781491901687/)
- [Modern Data Engineering with Apache Spark: A Hands-On Guide for Building Mission-Critical Streaming Applications](https://www.amazon.com/Modern-Engineering-Apache-Spark-Hands/dp/1484274512)
- [Data Engineering with dbt: A practical guide to building a dependable data platform with SQL](https://www.amazon.com/Data-Engineering-dbt-cloud-based-dependable-ebook/dp/B0C4LL19G7)
- [Data Engineering with AWS](https://www.oreilly.com/library/view/data-engineering-with/9781804614426/)
- [Practical DataOps: Delivering Agile Data Science at Scale](https://www.amazon.com/Practical-DataOps-Delivering-Agile-Science/dp/1484251032)
- [Data Engineering Design Patterns](https://www.dedp.online/)
- [Snowflake Data Engineering](https://www.manning.com/books/snowflake-data-engineering)
- [Unlocking dbt](https://www.amazon.com/Unlocking-dbt-Design-Transformations-Warehouse/dp/1484296990/)
- [Learning Spark, Second Edition](https://pages.databricks.com/rs/094-YMS-629/images/LearningSpark2.0.pdf)
Communities:
- [Seattle Data Guy Discord](https://discord.gg/ah95MZKkFF)
- [EcZachly Data Engineering Discord](https://discord.gg/JGumAXncAK)
- [AdalFlow Discord (LLM Library)](https://discord.com/invite/ezzszrRZvT)
- [Chip Huyen MLOps Discord](https://discord.gg/dzh728c5t3)
- [Data Engineer Things Community](https://www.dataengineerthings.org/aboutus/)
- [dbt Community](https://www.getdbt.com/community/join-the-community/)
- [r/dataengineering](https://www.reddit.com/r/dataengineering)
- [Microsoft Fabric Community](https://community.fabric.microsoft.com/)
- [r/MicrosoftFabric](https://www.reddit.com/r/MicrosoftFabric/)
- [Data Talks Club Slack](https://datatalks.club/slack)
- [Data Engineering Wiki](https://dataengineering.wiki/)
Companies:
- Orchestration
  - [Mage](https://www.mage.ai)
  - [Astronomer](https://www.astronomer.io)
  - [Prefect](https://www.prefect.io)
  - [Dagster](https://www.dagster.io)
  - [Airbyte](https://airbyte.com)
  - [Kestra](https://kestra.io/)
  - [Shipyard](https://www.shipyardapp.com/)
  - [Hamilton](https://github.com/dagworks-inc/hamilton)
- Data Lake / Cloud
  - [Tabular](https://www.tabular.io)
  - [Microsoft](https://www.microsoft.com)
  - [Databricks](https://www.databricks.com/company/about-us)
  - [Onehouse](https://www.onehouse.ai)
  - [Delta Lake](https://delta.io/)
- Data Warehouse
  - [Snowflake](https://www.snowflake.com/en/)
  - [Firebolt](https://www.firebolt.io/)
- Data Quality
  - [dbt](https://www.getdbt.com/)
  - [Gable](https://www.gable.ai)
  - [Great Expectations](https://www.greatexpectations.io)
  - [Streamdal](https://streamdal.com)
  - [Coalesce](https://coalesce.io/)
  - [Soda](https://www.soda.io/)
  - [DQOps](https://dqops.com/)
- Education Companies
  - [DataExpert.io](https://www.dataexpert.io)
  - [LearnDataEngineering.com](https://www.learndataengineering.com)
  - [AlgoExpert](https://www.algoexpert.io)
  - [ByteByteGo](https://www.bytebytego.com)
- Analytics / Visualization
  - [Preset](https://www.preset.io)
  - [Starburst](https://www.starburst.io)
  - [Metabase](https://www.metabase.com/)
  - [Looker Studio](https://lookerstudio.google.com/overview)
  - [Tableau](https://www.tableau.com/)
  - [Power BI](https://powerbi.microsoft.com/)
  - [Apache Superset](https://superset.apache.org/)
- Data Integration
  - [Cube](https://cube.dev)
  - [Fivetran](https://www.fivetran.com)
  - [Airbyte](https://airbyte.io)
  - [dlt](https://dlthub.com/)
  - [Sling](https://slingdata.io/)
  - [Meltano](https://meltano.com/)
- Modern OLAP
  - [Apache Druid](https://druid.apache.org/)
  - [ClickHouse](https://clickhouse.com/)
  - [Apache Pinot](https://pinot.apache.org/)
  - [Apache Kylin](https://kylin.apache.org/)
  - [DuckDB](https://duckdb.org/)
- LLM application library
  - [AdalFlow](https://github.com/SylphAI-Inc/AdalFlow)
Data Engineering blogs of companies:
- [Netflix](https://netflixtechblog.com/tagged/big-data)
- [Uber](https://www.uber.com/blog/houston/data/?uclick_id=b2f43229-f3f4-4bae-bd5d-10a05db2f70c)
- [Databricks](https://www.databricks.com/blog/category/engineering/data-engineering)
- [Airbnb](https://medium.com/airbnb-engineering/data/home)
- [Amazon AWS Blog](https://aws.amazon.com/blogs/big-data/)
- [Microsoft Data Architecture Blogs](https://techcommunity.microsoft.com/t5/data-architecture-blog/bg-p/DataArchitectureBlog)
- [Microsoft Fabric Blog](https://blog.fabric.microsoft.com/)
- [Oracle](https://blogs.oracle.com/datawarehousing/)
- [Meta](https://engineering.fb.com/category/data-infrastructure/)
- [Onehouse](https://www.onehouse.ai/blog)
Data Engineering Whitepapers:
- [A Five-Layered Business Intelligence Architecture](https://ibimapublishing.com/articles/CIBIMA/2011/695619/695619.pdf)
- [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf)
- [Big Data Quality: A Data Quality Profiling Model](https://link.springer.com/chapter/10.1007/978-3-030-23381-5_5)
- [The Data Lakehouse: Data Warehousing and More](https://arxiv.org/abs/2310.08697)
- [Spark: Cluster Computing with Working Sets](https://dl.acm.org/doi/10.5555/1863103.1863113)
- [The Google File System](https://research.google/pubs/the-google-file-system/)
- [Building a Universal Data Lakehouse](https://www.onehouse.ai/whitepaper/onehouse-universal-data-lakehouse-whitepaper)
- [XTable in Action: Seamless Interoperability in Data Lakes](https://arxiv.org/abs/2401.09621)
- [MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/)
Great YouTube Channels:
- 100k+ subscribers
  - [E-learning Bridge](https://www.youtube.com/@shashank_mishra)
  - [TrendyTech](https://www.youtube.com/c/TrendytechInsights)
  - [Darshil Parmar](https://www.youtube.com/@DarshilParmar)
  - [Andreas Kretz](https://www.youtube.com/c/andreaskayy)
  - [ByteByteGo](https://www.youtube.com/c/ByteByteGo)
  - [The Ravit Show](https://youtube.com/@theravitshow)
  - [Guy in a Cube](https://www.youtube.com/@GuyInACube)
  - [Adam Marczak](https://www.youtube.com/@AdamMarczakYT)
  - [nullQueries](https://www.youtube.com/@nullQueries)
  - [TECHTFQ by Thoufiq](https://www.youtube.com/@techTFQ)
- 10k+ subscribers
  - [Data with Zach](https://www.youtube.com/c/datawithzach)
  - [Seattle Data Guy](https://www.youtube.com/c/SeattleDataGuy)
  - [Azure Lib](https://www.youtube.com/@azurelib-academy)
  - [Advancing Analytics](https://www.youtube.com/@AdvancingAnalytics)
  - [Kahan Data Solutions](https://www.youtube.com/@KahanDataSolutions)
  - [Ankit Bansal](https://youtube.com/@ankitbansal6)
  - [Mr. K Talks Tech](https://www.youtube.com/channel/UCzdOan4AmF65PmLLks8Lmww)
- 1k+ subscribers
  - [Eric Roby](https://www.youtube.com/@codingwithroby)
Great Podcasts
- [The Data Engineering Show](https://www.dataengineeringshow.com/)
- [Data Engineering Podcast](https://www.dataengineeringpodcast.com/)
- [DataTopics](https://www.datatopics.io/)
- [The Engineering Side of Data](https://podcasts.apple.com/us/podcast/the-engineering-side-of-data/id1566999533)
- [DataAware](https://www.ascend.io/dataaware-podcast/)
- [The Data Coffee Break Podcast](https://www.deezer.com/us/show/5293247)
- [The Data Stack Show](https://datastackshow.com/)
- [Intricity101 Data Sharks Podcast](https://www.intricity.com/learningcenter/podcast)
- [Drill to Detail with Mark Rittman](https://www.rittmananalytics.com/drilltodetail/)
- [Analytics Power Hour](https://analyticshour.io/)
- [Catalog & Cocktails](https://listen.casted.us/public/127/Catalog-%26-Cocktails-2fcf8728)
- [DataTalks.Club](https://datatalks.club/podcast.html)
- [Data Brew by Databricks](https://www.databricks.com/discover/data-brew)
- [The Data Cloud Podcast by Snowflake](https://rise-of-the-data-cloud.simplecast.com/)
- [What's New in Data](https://www.striim.com/podcast/)
- [Open||Source||Data by DataStax](https://www.datastax.com/resources/podcast/open-source-data)
- [Streaming Audio by Confluent](https://developer.confluent.io/podcast/)
- [The Data Scientist Show](https://podcasts.apple.com/us/podcast/the-data-scientist-show/id1584430381)
- [MLOps.community](https://podcast.mlops.community/)
- [Monday Morning Data Chat](https://open.spotify.com/show/3Km3lBNzJpc1nOTJUtbtMh)
- [The Data Chief](https://www.thoughtspot.com/data-chief/podcast)
Newsletters:
- [DataEngineer.io Newsletter](https://blog.dataengineer.io)
- [Seattle Data Guy](https://seattledataguy.substack.com)
- [Joe Reis](https://joereis.substack.com)
- [Data Engineering Weekly](https://www.dataengineeringweekly.com)
- [Data Engineering Central](https://dataengineeringcentral.substack.com)
- [Dutch Engineer](https://dutchengineer.substack.com)
- [ByteByteGo](https://blog.bytebytego.com)
- [Start Data Engineering](https://www.startdataengineering.com)
- [Developing Dev](https://www.developing.dev)
- [High Growth Engineer](https://careercutler.substack.com/)
- [Learn Analytics Engineering](https://learnanalyticsengineering.substack.com/)
- [Marvelous MLOps](https://marvelousmlops.substack.com/)
- [Medium Data Engineering Newsletter](https://medium.com/data-engineering-weekly)
- [Benn Stancil](https://benn.substack.com/)
- [Metadata Weekly](https://metadataweekly.substack.com/)
- [Technically](https://technically.substack.com/)
- [Blef.fr Data News](https://www.blef.fr/blog/)
- [All Hands on Data](https://allhandsondata.substack.com/)
- [Modern Data 101](https://moderndata101.substack.com/)
- [SELECT Insights](https://newsletter.ssp.sh/)
- [Interesting Data Gigs](https://newsletter.interestinggigs.com)
- [Ju Data Engineering Weekly](https://juhache.substack.com/)
- [From An Engineer Sight](https://fromanengineersight.substack.com/)
Glossaries:
- [Data Engineering Vault](https://www.ssp.sh/brain/data-engineering/)
- [Airbyte Data Glossary](https://glossary.airbyte.com/)
- [Data Engineering Wiki by Reddit](https://dataengineering.wiki/Index)
- [Secoda Glossary](https://www.secoda.co/glossary/)
- [Glossary Databricks](https://www.databricks.com/glossary)
- [Airtable Glossary](https://airtable.com/shrGh8BqZbkfkbrfk/tbluZ3ayLHC3CKsDb)
- [Data Engineering Glossary by Dagster](https://dagster.io/glossary)
LinkedIn
- 100k+ Followers
  - [Zach Wilson](https://www.linkedin.com/in/eczachly)
  - [Ben Rogojan](https://www.linkedin.com/in/benjaminrogojan)
  - [Sumit Mittal](https://www.linkedin.com/in/bigdatabysumit/)
  - [Shashank Mishra](https://www.linkedin.com/in/shashank219/)
  - [Chip Huyen](https://www.linkedin.com/in/chiphuyen/)
  - [Alex Xu](https://www.linkedin.com/in/alexxubyte)
  - [Deepak Goyal](https://www.linkedin.com/in/deepak-goyal-93805a17/)
  - [Andreas Kretz](https://www.linkedin.com/in/andreas-kretz)
- 50k+ Followers
  - [Joe Reis](https://www.linkedin.com/in/josephreis)
  - [Darshil Parmar](https://www.linkedin.com/in/darshil-parmar/)
  - [Ankit Bansal](https://www.linkedin.com/in/ankitbansal6/)
  - [Marc Lamberti](https://www.linkedin.com/in/marclamberti)
- 10k+ Followers
  - [Li Yin](https://www.linkedin.com/in/li-yin-ai/)
  - [Joseph Machado](https://www.linkedin.com/in/josephmachado1991/)
  - [Eric Roby](https://www.linkedin.com/in/codingwithroby/)
  - [Simon Whiteley](https://www.linkedin.com/in/simon-whiteley-uk/)
  - [Simon Späti](https://www.linkedin.com/in/sspaeti/)
- 5k+ Followers
  - [Dipankar Mazumdar](https://www.linkedin.com/in/dipankar-mazumdar/)
  - [Daniel Ciocirlan](https://www.linkedin.com/in/danielciocirlan)
  - [Hugo Lu](https://www.linkedin.com/in/hugo-lu-confirmed/)
  - [Tobias Macey](https://www.linkedin.com/in/tmacey)
  - [Marcos Ortiz](https://www.linkedin.com/in/mlortiz)
  - [Julien Hurault](https://www.linkedin.com/in/julienhuraultanalytics/)
- 1k+ Followers
  - [Shruti Mantri](https://www.linkedin.com/in/shruti-mantri-88527a67/)
  - [Volker Janz](https://www.linkedin.com/in/vjanz/)
  - [Benoit Pimpaud](https://www.linkedin.com/in/pimpaudben/)
Twitter / X
- [Zach Wilson](https://www.twitter.com/EcZachly)
- [Seattle Data Guy](https://www.twitter.com/SeattleDataGuy)
- [Sumit Mittal](https://www.twitter.com/bigdatasumit)
- [Joseph Machado](https://twitter.com/startdataeng)
- [Alex Xu](https://twitter.com/alexxubyte/)
- [Eric Roby](https://twitter.com/codingwithroby)
- [Andreas Kretz](https://twitter.com/andreaskayy)
- [Marc Lamberti](https://twitter.com/marclambertiml)
- [Dipankar Mazumdar](https://twitter.com/Dipankartnt)
- [Start Data Engineering](https://twitter.com/startdataeng)
- [Data Cyborg](https://twitter.com/data_cyborg)
- [Simon Späti](https://twitter.com/sspaeti)
- [Marcos Ortiz](https://twitter.com/marcosluis2186)
Instagram
- [Zach Wilson](https://www.instagram.com/eczachly)
- [Andreas Kretz](https://www.instagram.com/learndataengineering)
- [Seattle Data Guy](https://www.instagram.com/seattledataguy)
TikTok
- [Zach Wilson](https://www.tiktok.com/@eczachly)
- [Alex The Analyst](https://www.tiktok.com/@alex_the_analyst)
- [Marcos Ortiz](https://www.tiktok.com/@marcosluis2186)
Design Patterns
- [Cumulative Table Design](https://www.github.com/EcZachly/cumulative-table-design)
- [Microbatch Deduplication](https://www.github.com/EcZachly/microbatch-hourly-deduped-tutorial)
- [The Little Book of Pipelines](https://www.github.com/EcZachly/little-book-of-pipelines)
- [Data Developer Platform](https://datadeveloperplatform.org/architecture/)
Courses / Academies
- [DataExpert.io course](https://www.dataexpert.io) Use code **HANDBOOK10** for a discount!
- [LearnDataEngineering.com](https://www.learndataengineering.com)
- [Technical Freelancer Academy](https://www.technicalfreelanceracademy.com/) Use code **zwtech** for a discount!
- [IBM Data Engineering for Everyone](https://www.edx.org/learn/data-engineering/ibm-data-engineering-basics-for-everyone)
- [Qwiklabs](https://www.qwiklabs.com/)
- [DataCamp](https://www.datacamp.com/)
- [Udemy Courses from Shruti Mantri](https://www.udemy.com/user/shruti-mantri-5/)
- [Rock the JVM](https://rockthejvm.com/) teaches Spark (in Scala), Flink and others
- [Data Engineering Zoomcamp by DataTalksClub](https://datatalks.club/)
- [Efficient Data Processing in Spark](https://josephmachado.podia.com/efficient-data-processing-in-spark)
- [Scaler](https://www.scaler.com/)
Certification Courses
- [Google Cloud Certified - Professional Data Engineer](https://cloud.google.com/certification/data-engineer)
- [Databricks - Data Engineer Professional](https://www.databricks.com/learn/certification/data-engineer-professional)
- [Azure Data Engineer Associate](https://learn.microsoft.com/credentials/certifications/azure-data-engineer/)
- [Microsoft Fabric Analytics Engineer Associate](https://learn.microsoft.com/credentials/certifications/fabric-analytics-engineer-associate/)
- [Exam DP-203: Data Engineering on Microsoft Azure](https://learn.microsoft.com/en-us/credentials/certifications/exams/dp-203/?tab=tab-learning-paths)
- [AWS Certified Data Engineer - Associate](https://aws.amazon.com/certification/certified-data-engineer-associate/)
Conferences
- [Trino Summit - December 13-14, 2023 - Virtual](https://www.starburst.io/info/trinosummit2023/)
- [Data Universe - April 10-11, 2024 - New York City](https://www.datauniverseevent.com/)
- [Data Nova @ Data Universe - April 10-11, 2024 - New York City](https://www.starburst.io/datanova/)
- [DataTune Conference - March 8-9, 2024 - Nashville, TN](https://www.datatuneconf.com/)
", Assign "at most 3 tags" to the expected json: {"id":"8755","tags":[]} "only from the tags list I provide: [{"id":77,"name":"3d"},{"id":89,"name":"agent"},{"id":17,"name":"ai"},{"id":54,"name":"algorithm"},{"id":24,"name":"api"},{"id":44,"name":"authentication"},{"id":3,"name":"aws"},{"id":27,"name":"backend"},{"id":60,"name":"benchmark"},{"id":72,"name":"best-practices"},{"id":39,"name":"bitcoin"},{"id":37,"name":"blockchain"},{"id":1,"name":"blog"},{"id":45,"name":"bundler"},{"id":58,"name":"cache"},{"id":21,"name":"chat"},{"id":49,"name":"cicd"},{"id":4,"name":"cli"},{"id":64,"name":"cloud-native"},{"id":48,"name":"cms"},{"id":61,"name":"compiler"},{"id":68,"name":"containerization"},{"id":92,"name":"crm"},{"id":34,"name":"data"},{"id":47,"name":"database"},{"id":8,"name":"declarative-gui "},{"id":9,"name":"deploy-tool"},{"id":53,"name":"desktop-app"},{"id":6,"name":"dev-exp-lib"},{"id":59,"name":"dev-tool"},{"id":13,"name":"ecommerce"},{"id":26,"name":"editor"},{"id":66,"name":"emulator"},{"id":62,"name":"filesystem"},{"id":80,"name":"finance"},{"id":15,"name":"firmware"},{"id":73,"name":"for-fun"},{"id":2,"name":"framework"},{"id":11,"name":"frontend"},{"id":22,"name":"game"},{"id":81,"name":"game-engine 
"},{"id":23,"name":"graphql"},{"id":84,"name":"gui"},{"id":91,"name":"http"},{"id":5,"name":"http-client"},{"id":51,"name":"iac"},{"id":30,"name":"ide"},{"id":78,"name":"iot"},{"id":40,"name":"json"},{"id":83,"name":"julian"},{"id":38,"name":"k8s"},{"id":31,"name":"language"},{"id":10,"name":"learning-resource"},{"id":33,"name":"lib"},{"id":41,"name":"linter"},{"id":28,"name":"lms"},{"id":16,"name":"logging"},{"id":76,"name":"low-code"},{"id":90,"name":"message-queue"},{"id":42,"name":"mobile-app"},{"id":18,"name":"monitoring"},{"id":36,"name":"networking"},{"id":7,"name":"node-version"},{"id":55,"name":"nosql"},{"id":57,"name":"observability"},{"id":46,"name":"orm"},{"id":52,"name":"os"},{"id":14,"name":"parser"},{"id":74,"name":"react"},{"id":82,"name":"real-time"},{"id":56,"name":"robot"},{"id":65,"name":"runtime"},{"id":32,"name":"sdk"},{"id":71,"name":"search"},{"id":63,"name":"secrets"},{"id":25,"name":"security"},{"id":85,"name":"server"},{"id":86,"name":"serverless"},{"id":70,"name":"storage"},{"id":75,"name":"system-design"},{"id":79,"name":"terminal"},{"id":29,"name":"testing"},{"id":12,"name":"ui"},{"id":50,"name":"ux"},{"id":88,"name":"video"},{"id":20,"name":"web-app"},{"id":35,"name":"web-server"},{"id":43,"name":"webassembly"},{"id":69,"name":"workflow"},{"id":87,"name":"yaml"}]" returns me the "expected json"