# Mehdio's Tech Corner > Personal website of Mehdi Ouazza — data engineering, AI, and tech content. ## Blog Posts - [ctrl+r #11: IDEs are dead, Deep Fake is too easy](https://mehdio.com/blog/ctrlr-11-ides-are-dead-deep-fake): Blog post - [ctrl+r #10: AI & cognition, Obsidian+Claude Code ](https://mehdio.com/blog/ctrlr-10-ai-and-cognition-obsidianclaude): Training brains in the age of shortcuts - [ctrl+r #09: The generalist comeback, Cursor's hidden gem](https://mehdio.com/blog/ctrlr-09-the-generalist-comeback): Why niching down might be yesterday's advice, plus plan mode changed how I build - [ctrl+r #08: 2026's AI wake-up call, voice-first workflows, and the data engineer identity crisis](https://mehdio.com/blog/ctrlr-08-2026s-ai-wake-up-call-voice): From Karpathy's grief cycle to Hyprnote workflows, plus why companies are locking down their APIs - [ctlr+r #07: TypeScript’s AI advantage, junior skill debt, and trusting AI with real money](https://mehdio.com/blog/ctlrr-07-typescripts-ai-advantage): From GitHub Octoverse shifts to disk cleanup, finance hacks, and human debt - [ctlr+r #06: Auto-Dubbing the Web & "Serverless" RAG](https://mehdio.com/blog/ctlrr-06-auto-dubbing-the-web-and): Testing Gemini's File Search, the end of language barriers, and why RAM is getting expensive - [ctlr+r #05: Less UI, more chats, faster containers](https://mehdio.com/blog/ctlrr-05-less-ui-more-chats-faster): Blog post - [ctlr+r #04: AI’s Energy Bill , Free Localhost Tunnels](https://mehdio.com/blog/ctlrr-04-ais-energy-bill-free-localhost): Blog post - [ctlr+r #03: OSS fatigue & Orchestrating LLMs](https://mehdio.com/blog/ctlrr-03-oss-fatigue-and-orchestrating): Blog post - [ctlr+r #02: How to not get stupid & The Parquet Killer?](https://mehdio.com/blog/ctlrr-02-how-to-not-get-stupid-and): Blog post - [ctlr+r #01: Toon, LLM CLIs](https://mehdio.com/blog/ctlrr-01-toon-llm-clis): Blog post - [An actually useful MCP for web development](https://mehdio.com/blog/an-actually-useful-mcp-for-web-development): Eliminates the copy-paste hell with browser-tools - [Is Gemini CLI worth it for Cursors users ?](https://mehdio.com/blog/is-gemini-cli-worth-it-for-cursors): Yes. - [Apple’s new "Container" Engine (Bye Docker?)](https://mehdio.com/blog/apples-new-container-engine-bye-docker): Hands-on review of Apple's new container framework announced at WWDC 2 - [The Slow Death of Medium-Sized Software Companies](https://mehdio.com/blog/the-slow-death-of-medium-sized-software): What if scaling was no longer the goal? And what would that mean for software engineers? - [Making Cursor smarter (and up to date)](https://mehdio.com/blog/making-cursor-smarter-and-up-to-date): Context is king — and documentation context rules. - [macOS: Essential Productivity Hacks for Developers — No AI Needed](https://mehdio.com/blog/macos-essential-productivity-hacks): A fast, distraction-free workflow powered by open-source tools and keyboard-driven automation. - [Local LLMs, 0 cloud cost : is WebGPU key for next-gen browser AI app?](https://mehdio.com/blog/local-llms-0-cloud-cost-is-webgpu): Understand WebGPU through a real-world AI demo with code, and understand the technology powering browser compute - [How to use AI to create better technical diagrams](https://mehdio.com/blog/how-to-use-ai-to-create-better-technical): A practical look at how LLMs can help you generate architecture diagrams without the fluff - [DuckDB goes distributed? 
DeepSeek’s smallpond takes on Big Data](https://mehdio.com/blog/duckdb-goes-distributed-deepseeks): DeepSeek is pushing DuckDB beyond its single-node roots with smallpond, a new, simple approach to distributed compute. But does it solve the scalability challenge—or introduce new trade-offs? - [15 Python Libraries Every Data Engineer Needs](https://mehdio.com/blog/15-python-libraries-every-data-engineer): Reduce complexity and improve your data engineering work - [One year, One challenge: win money if I fail](https://mehdio.com/blog/one-year-one-challenge-win-money): 52 videos incoming. - [I deleted data in prod and received a T-shirt; what's next?](https://mehdio.com/blog/i-deleted-data-in-prod-and-received): Sharing how one critical mistake taught me key lessons - [LLMs For Builders : Jargons, Theory & History](https://mehdio.com/blog/llms-for-builders-jargons-theory): Equipping you with the knowledge to start building AI applications - [Dancing your way through the pathless data career](https://mehdio.com/blog/dancing-your-way-through-the-pathless): Exploring the how and why behind my non-linear journey in data and its resonance with many in the field. - [Revitalizing Your Tech Career: My 30-Day Marathon Through 20+ Interviews and 5 Job Offers](https://mehdio.com/blog/revitalizing-your-tech-career-my): Insights from another interview marathon for a full remote DevRel position - [The Most Painful And Repetitive Job Of A Data Engineer](https://mehdio.com/blog/the-most-painful-and-repetitive-job): Why we should do something about JDBC - [10 Lessons Learned In 10 Years Of Data [2/2] ](https://mehdio.com/blog/10-lessons-learned-in-10-years-of-c34): From 2020 to 2022 aka the explosion of tools era - [10 Lessons Learned In 10 Years Of Data [1/2]](https://mehdio.com/blog/10-lessons-learned-in-10-years-of): From 2012 to 2022, what went wrong in the data world ? - [You Don't Have Big Data; You Have Bad Data Lifecycle Management](https://mehdio.com/blog/you-dont-have-big-data-you-have-bad-data-lifecycle-management-e459b0e1e84f): Storage is not always cheap - [Data Contracts — From Zero To Hero](https://mehdio.com/blog/data-contracts-from-zero-to-hero-343717ac4d5e): A pragmatic approach to data contracts - [What Open Source Can Do For Your Data Career](https://mehdio.com/blog/what-open-source-can-do-for-your-data-career-53ecb747c111): You don’t need to code to get started. 
- [Meet Your Future Data Mentors](https://mehdio.com/blog/meet-your-future-data-mentors-6cb4066db83a): Story of datacreators.club, a hub to discover 100+ data content creators - [Testing Your Terraform Infrastructure Code With Python](https://mehdio.com/blog/testing-your-terraform-infrastructure-code-with-python-a3f913b528e3): Let’s cover an API use case with Terraform HCL & Python - [Job Hopping As A Software Engineer — Should You Do It?](https://mehdio.com/blog/job-hopping-as-a-software-engineer-should-you-do-it-c71a39390a29): Why Job Hopping Now Is Intentional, Not Impatient (And What You Need To Know) - [The Key Feature Behind Lakehouse Data Architecture](https://mehdio.com/blog/the-key-feature-behind-lakehouse-data-architecture-c70f93c6866f): Understanding the modern table formats and their current state - [The Battle for Data Engineer’s Favorite Programming Language Is Not Over Yet](https://mehdio.com/blog/the-battle-for-data-engineers-favorite-programming-language-is-not-over-yet-bb3cd07b14a0): Let's discuss the next contender for 2022 - [Your Next Container Strategy: From Development to Deployment](https://mehdio.com/blog/your-next-container-strategy-from-development-to-deployment-66167c0d028a): Learn how to manage dockerfiles and version through a working Python API project - [Stop Using The Term “Data Engineer”, There’s Something Better](https://mehdio.com/blog/five-overused-definitions-of-a-data-engineer-f0d9059a174): 5 overused definitions of the hottest job of the year - [7 Things You Need To Know If You Want to Become a Data Engineer ☄](https://mehdio.com/blog/7-hacks-to-get-your-first-data-engineer-job-4b3e44bb35fd): Strategies to help you to land your first Data Engineer job - [Why you should try something else than Airflow for data pipeline orchestration](https://mehdio.com/blog/why-you-should-try-something-else-than-airflow-for-data-pipeline-orchestration-7a0a2c91c341): Let’s evaluate AWS step functions, Google workflows, Prefect next to Airflow - [Highlights from DATA+AI Summit 2021 💥](https://mehdio.com/blog/highlights-from-data-ai-summit-2021-3abfd9aaccaa): Takeaways from one of the biggest DATA/AI conference - [Why & how to market yourself as a data engineer](https://mehdio.com/blog/why-how-to-market-yourself-as-a-data-engineer-98633371ea7b): Understand the values of marketing ourself and how to get started! 
- [I did 25+ interviews at 8 different tech companies for a data engineer position in 1 month.](https://mehdio.com/blog/i-did-25-interviews-at-8-different-tech-companies-for-a-data-engineer-position-in-1-month-feab3e465f13): Here is what I learned from this marathon and the current data market - [A day in the life of a data engineer](https://mehdio.com/blog/a-day-in-the-life-of-a-data-engineer-d65293272121): Breaking down the main activities of a data engineer in 2021 - [What are the most requested technical skills in the data job market?Insights from 35k+ datajobs ads](https://mehdio.com/blog/what-are-the-most-requested-technical-skills-in-the-data-job-market-insights-from-35k-datajobs-ads-d8642555f89e): Insights from the data skills radar, scanning daily data jobs ads - [Why and how you should dockerize your development environment (with VS Code 💙)](https://mehdio.com/blog/dockerize-your-development-environment-with-vs-code-cac9e7a60751): In this blog post, I will cover a few elements that should motivate you to dockerize your development environment and give you a repo… - [Highlights from Spark+AI Summit 2020 for Data engineers](https://mehdio.com/blog/highlights-from-spark-ai-summit-2020-for-data-engineers-359211b1eec2): In these takeaways focusing on the data engineering topics, I’ll provide as resources, the most interesting talks I've seen. ## Projects - [Projects](https://mehdio.com/projects): Open-source projects and side projects ## About - [About](https://mehdio.com/about): About Mehdi Ouazza --- # Full Blog Content ## ctrl+r #11: IDEs are dead, Deep Fake is too easy URL: https://mehdio.com/blog/ctrlr-11-ides-are-dead-deep-fake Date: 2026-02-04T21:00:21.001 ## 🧠 IDEs as we know them are dead I’ve been using Cursor as my IDE, and I’ve realized that I barely use most of its features anymore—simply because I’m writing less code manually. I think IDEs can afford to be much more lightweight when it comes to editing features. Instead, they should be designed around agents, workflows, and review. GitHub actually feels closer to this setup already. You have an Agent tab, you can orchestrate tasks through GitHub Actions, review through PRs, and spin up a Codespace that’s sandboxed to your repository. [![](https://substackcdn.com/image/fetch/$s_!1gbS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb51afa-ef0a-410c-9dfa-7e2deb1ec6f7_1330x814.png)](https://substackcdn.com/image/fetch/$s%5F!1gbS!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7eb51afa-ef0a-410c-9dfa-7e2deb1ec6f7%5F1330x814.png) I can even imagine “projects” becoming a way to launch agent tasks and track their progress. So yeah, in the long run, I think **GitHub is well positioned here**. Editors built on VS Code are going to have a hard time—or will need some drastic changes. ## 🛠️ Deepfakes are too easy now Last week I had to get creative for some marketing content. If you haven’t seen it: Beyond being a fun experiment, it gave me a firsthand look at the challenges involved. How good and easy is it to create a deepfake? First, you have to realize that most platforms have security checks—so deepfaking a celebrity is harder if you’re not using open-source tools. At least, that’s what I thought. Turns out, many of these tools have barriers when you use the AI directly, but far fewer when you use their API. 
😅 Once you have a voice clone and a video, there are tools like sync.so that sync a given voice audio with video footage. It does a decent job. But speaking of open source—cloning a voice is actually really easy now. Jeff Geerling [did a video](https://www.youtube.com/watch?v=dQ841Pd6YvQ) about how scary the new Qwen3-TTS (text-to-speech) model is. No long speech sample needed—just a small voice extract and you get really good results. Sure, intonation and emotion still need work. But if you want to change a few words in a given speech? Really easy. Try it yourself on Hugging Face: All in all, just chaining a few tools with some open-source models can get you surprisingly far in a matter of minutes—at okayish quality. It’s only a matter of time before the results become unrecognizable from real footage. 😬 ## 📚What I Read/Watched * [ElevenLabs just got nuked by open source](https://www.youtube.com/watch?v=dQ841Pd6YvQ) — Jeff Geerling shares his story about people faking his voice and the evolution of these open-source models * [Inside OpenAI’s in-house data agent](https://openai.com/index/inside-our-in-house-data-agent/) — this will be the playbook for companies building data agents to lower the technical barrier for internal analytics * [GitHub just announced Claude & Codex available as coding agents](https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/?utm%5Fsource=mario-twitter-blog-3p-agents-amp&utm%5Fmedium=social&utm%5Fcampaign=agent-3p-platform-feb-2026) * Maxime Beauchemin (creator of Airflow) on [the semantic layer’s role in the context of AI](https://preset.io/blog/semantic-layer-is-back/?utm%5Fsource=substack&utm%5Fmedium=email) * [AI Expert: Here Is What The World Looks Like In 2 Years!](https://www.youtube.com/watch?v=BFU1OCkhBwo) — ex-Google insider Tristan Harris shares some nuggets about the AI race and its consequences for our economy and possibly democracy Mehdio's Tech (Data) Corner is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. --- _I was in Amsterdam last week and my hotel had an American waffle maker?! First time I’ve seen this in the EU!!_ [![](https://substackcdn.com/image/fetch/$s_!9j7v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea2e069-1d81-4017-9dc8-2049ac43439a_1296x1728.jpeg)](https://substackcdn.com/image/fetch/$s%5F!9j7v!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffea2e069-1d81-4017-9dc8-2049ac43439a%5F1296x1728.jpeg) --- ## ctrl+r #10: AI & cognition, Obsidian+Claude Code URL: https://mehdio.com/blog/ctrlr-10-ai-and-cognition-obsidianclaude Date: 2026-01-29T11:18:24.327 ## 🧠 The paradox of getting “smarter with AI” With many people debating whether AGI is near—or even whether LLMs are truly intelligent—there’s an interesting paradox: will we stay intelligent while using them? A French documentary (don’t worry, there’s auto-dub) called [“La Fabrique à Idiots”](https://www.youtube.com/watch?v=4xq6bVbS-Pw) (The Idiot Factory) explores the impact of AI in schools—how students use it for cheating and why it’s becoming nearly impossible to assign homework anymore. My kid is 6, and it makes me think: how can I help him keep training his brain while still embracing the technology? 
Someone online told me, “Well, LLMs are just compressed knowledge, like a book.” That feels like an understatement, as LLMs can go much further than simply delivering information. And there’s science behind this. As mentioned in the documentary, learning requires three neural stages: * **Encoding**: the hippocampus creates new connections * **Retrieval practice**: the basal ganglia strengthen pathways through repeated recall * **Error correction**: dopaminergic neurons fire prediction error signals that tell the brain what to reinforce or prune When AI handles tasks for you, none of these mechanisms engage. We should watch out for ending up with an atrophied brain—and the choice is up to us. ## 🛠️ Obsidian and Claude Code I’ve been playing with the [Terminal community plugin for Obsidian](https://github.com/polyipseity/obsidian-terminal) to leverage more AI within Obsidian. Not necessarily for writing, but rather for linting and searching. Yes, Obsidian has good search mechanisms and Dataview (so the query I show in the screenshot is actually pretty basic). But I’m looking more for skills like “add a summary of this large note” or “fill in the missing metadata from the source URL.” [![](https://substackcdn.com/image/fetch/$s_!Zw1F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e180b2-0154-4d42-8ee6-657e72dd3ffe_2912x1574.png)](https://substackcdn.com/image/fetch/$s%5F!Zw1F!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6e180b2-0154-4d42-8ee6-657e72dd3ffe%5F2912x1574.png) ## 📚 What I read/watched * [Why paying DevRel matters](https://www.dewanahmed.com/why-paying-devrel/): A reminder that DevRel is NOT an entry-level role. Yes, you can create marketing content as a junior, but true DevRel work requires deep technical knowledge and industry experience. * [Creator economy’s abundance crisis](https://open.spotify.com/episode/1dBVRilbq9TfXKhDLWXLS4?si=f9a6b95576ad4675): Some good observations about how content abundance has surpassed demand (this applies to tech too) and how you can stay “authentic” when AI can generate so much content so easily. * [OpenAI went ALL IN on this Rust framework](https://www.youtube.com/watch?v=LGrx9ueO3y0): Fun to see that TUI in Rust is so hot right now, given all the AI hype. * [Clawdbot is a security nightmare](https://www.youtube.com/watch?v=kSno1-xOjwI): Clawdbot (now renamed “[Moltbot](https://github.com/moltbot/moltbot)“), an open-source framework for running your personal AI assistant, gained more than 50k stars in just a few weeks. But as with many new open-source projects gaining quick traction, there are often security holes—so be careful! * [MCP vs Agent Skills](https://x.com/kaxil/status/2014433441497100346/?rw%5Ftt%5Fthread=True): Kaxil (from Astronomer.io) explains the main differences between both in this thread. Will steal and reuse! --- Mehdio's Tech (Data) Corner is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. --- _Most important AI note from one of my latest meetings. 
Yes, I’m lucky I can get decent sleep already._ [![](https://substackcdn.com/image/fetch/$s_!K2dq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb745b6a-d36d-46b6-95b1-78791ba7cd64_1600x117.jpeg)](https://substackcdn.com/image/fetch/$s%5F!K2dq!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb745b6a-d36d-46b6-95b1-78791ba7cd64%5F1600x117.jpeg) --- ## ctrl+r #09: The generalist comeback, Cursor's hidden gem URL: https://mehdio.com/blog/ctrlr-09-the-generalist-comeback Date: 2026-01-19T16:58:21.317 ## 🧠 Should I double down on being a generalist now that there’s AI? I recently received this question on socials: “Given what’s possible with AI, do you think it’s important for data engineers to branch into software engineering too?” Honestly, if you’d asked me this two years ago, I would’ve said no. A key thing in my career has been to niche down (data engineer → devrel in data engineering). But I’ve been building way more stuff outside of data engineering these past months. So what’s the model can’t replicate? I think deep niche understanding about the business and domain expertise still matters—but the execution layer is getting commoditized. The question isn’t “should I learn React?” anymore. It’s “do I understand the problem well enough to direct an AI to build the right thing?” ## 🛠️ Cursor’s plan mode I don’t know how I missed it, but back in October 2025, Cursor released [its plan mode](https://cursor.com/blog/plan-mode). It enables the model to research your codebase to find relevant files, review docs, and ask clarifying questions before writing any code. This is basically what I used to do manually (and it still works if you aren’t using Cursor’s plan mode). Whenever I’d start a new project where I wasn’t sure about the architecture, UX, or technical implementation, I’d use a prompt like this while pinning relevant docs I’d previously indexed within Cursor settings: ``` Don't start any coding implementation yet. I'm thinking about X—ask clarifying questions and use @docs to verify what's possible. ``` Now plan mode does roughly that, and the UX within Cursor is nice since I can easily edit the plan and give the green light to build. [![](https://substackcdn.com/image/fetch/$s_!jX_m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1c1be43-2a6b-445b-9678-f847b6638a91_1739x1124.png)](https://substackcdn.com/image/fetch/$s%5F!jX%5Fm!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1c1be43-2a6b-445b-9678-f847b6638a91%5F1739x1124.png) ## 📚 What I read/watched * [AI’s trillion-dollar opportunity: Context graphs](https://foundationcapital.com/context-graphs-ais-trillion-dollar-opportunity/): Foundation Capital (VC firm from the Bay Area) discusses how the advantage in industry at the macro level is shifting from “systems of record” to where decisions are actually being made. Agents won’t replace systems of record, but the key missing piece is decision traces—recorded exceptions, approvals, and precedents—so agents can show _why_ decisions were made. * [Can a VFX Artist Beat AI?](https://www.youtube.com/watch?v=HUX-LRNrsv4): Zach King, the “magician of the internet,” made a fun video that reminded me you can still be really creative _with_ AI rather than letting it beat you. 
* [The Thinking Game](https://youtu.be/d95J8yzvjbQ?si=KZX%5Fod19pvVuGfxK): A nice documentary about DeepMind (acquired by Google) and the breakthrough that would eventually win a Nobel Prize. * [The Big Constraint Flip](https://thiagoghisi.substack.com/p/the-big-constraint-flip): I know I talked about this in my last issue, but here’s another signal that resonates with me: we just passed a flipping point. --- _OK, I don’t know how I missed that UFO, but Raye’s live performance of “Where Is My Husband” is absolutely INSANE. EVERY live version. Such musicians, such a performance._ [![](https://substackcdn.com/image/fetch/$s_!3iCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831a8751-3bcf-4cea-a200-44b7eb5a103c_1214x1086.png)](https://substackcdn.com/image/fetch/$s%5F!3iCH!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F831a8751-3bcf-4cea-a200-44b7eb5a103c%5F1214x1086.png) Mehdio's Tech (Data) Corner is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. --- ## ctrl+r #08: 2026's AI wake-up call, voice-first workflows, and the data engineer identity crisis URL: https://mehdio.com/blog/ctrlr-08-2026s-ai-wake-up-call-voice Date: 2026-01-13T21:02:56.934 ## 🧠 2026 is when you can’t ignore AI in your work I’ve been playing with AI since 2024\. I remember trying out Cursor for the first time and being mind-blown by how good it was... but it was actually BAD in absolute terms. It could only edit one file, and code quality was meh. In 2026, we reached a tipping point where AI is so good—especially at coding—that you can’t ignore it in your work anymore. I’ve always experimented with AI and different workflows, mostly around prompts and some limited MCP services. In 2026, I need to master more of these workflows. Specifically, I’m interested in diving deeper into: * [GitHub spec kit](https://github.com/github/spec-kit) * [Claude skills](https://code.claude.com/docs/en/skills) * Building various MCPs for my daily work * Working with multiple sub-agents and pushing work async as a reviewer * How to manage memory for AI > _Some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind. - Andrej Karpathy_ ## 🛠️ More voice, less typing I’ve been experimenting with voice recording tools, specifically for: * Meetings where I’m not in control of recording notes * Meetings where I don’t want to send the AI summary * For myself—spending more time speaking than writing * Not having a big red button “Recording in progress” I think the exercise of writing is still important, of course, but there’s something nice about speaking freely and getting a summary or reordering of your thoughts. I stumbled on two nice open-source products: [meetly](https://www.getmeetly.ai/) and [hyprnote](https://hyprnote.com/). I went with Hyprnote because it’s just markdown (yeah!) and I love their philosophy on not wanting to compete with tools like Obsidian, as described in their latest blog post [“Filesystem is the cortex”](https://hyprnote.com/blog/filesystem-is-coretex). I’m still figuring out my workflows, but all in all, it’s working pretty nicely! 
## **📚 What I read/watched** * [Welcome to Gas Town](https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04): A rather crazy take on how to run multiple agents with tmux. Note that I’ve seen more people managing agents through tmux lately. * [Full Course: The AI Stack We Actually Use for Prototyping, Strategy, and Personal OS (2026)](https://www.youtube.com/watch?v=VXdrfvqiH-0&feature=youtu.be): Another example that there’s no manual for the best workflows—you need to figure it out. Good examples here using text files and markdown. * [Defense comes to software](https://www.linkedin.com/pulse/defense-comes-software-tomasz-tunguz-quabc?utm%5Fsource=share&utm%5Fmedium=member%5Fandroid&utm%5Fcampaign=share%5Fvia): Tomasz mentions that a lot of companies are restricting their APIs to own the full stack due to the rise of AI. * [Data Engineers are going to be whatever they want to be](https://sungwc.substack.com/p/data-engineers-are-going-to-be-whatever) : Good take from on how the role of data engineer is shifting into multiple subroles given the rise of AI. --- _Last saturday was my last day of skiing with full powder before going back to work after my parental leave!_ [![](https://substackcdn.com/image/fetch/$s_!eRZP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d40c7f-bde1-4c3d-9174-f91c3d54cdbe_2302x1726.jpeg)](https://substackcdn.com/image/fetch/$s%5F!eRZP!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0d40c7f-bde1-4c3d-9174-f91c3d54cdbe%5F2302x1726.jpeg) Mehdio's Tech (Data) Corner is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. --- ## ctlr+r #07: TypeScript’s AI advantage, junior skill debt, and trusting AI with real money URL: https://mehdio.com/blog/ctlrr-07-typescripts-ai-advantage Date: 2026-01-05T15:52:53.226 _This is a double edition, slightly longer than usual as I took a week off for the New Year._ ## 🧠 Will AI kill Python? The GitHub Octoverse results are out, and **TypeScript (TS) is now the most used language.** For context, on GitHub Octoverse 2024, Python dethroned JavaScript as the most used language across all work. There is no doubt that today, Python is still the undisputed king of AI and Data. But here’s the irony: I’m a Python guy, yet I’ve been writing more JS/TS than ever. I’m vibecoding apps with background jobs, and most of them rely on APIs, so TS fits perfectly this use case. Mehdio's Tech (Data) Corner is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. **Why is TS gaining traction?** TypeScript provides a much safer feedback loop for AI generation. Because LLMs can hallucinate APIs or import non-existent methods, TypeScript’s static typing acts as immediate “guardrails.” The AI (and the compiler) detects errors at the linting stage, not at runtime. It essentially allows the AI to self-correct before you even run the code. Tools like **Lovable** and **Replit** that enable you to build apps with AI are targeting the web, so naturally, it’s all JS/TS. You can see why usage is exploding. That being said, **[Wasm](https://webassembly.org/)** might eventually shift the balance, but today, if I want to write for the web, it’s TS/JS. ## 🛠️ Holiday laptop cleaning with `dua` Here we go again. 
My laptop drive is full, I have urgent things to do, and I need to quickly delete large files I don’t need (or worse, hidden caches). I’m a macOS user (though `dua` is multi-platform), and I used to rely on an OG app called Disk Space Utility. The downside? The indexing was slow. While it gave a good map, I missed the speed of the terminal.

`dua` is a fast, open-source terminal tool for viewing and managing disk space, written in Rust: you navigate through big files in the terminal, mark them, and delete everything with keyboard commands. I usually just start from the root `/` folder of my Mac and launch `dua i`.

**Quick `dua` Cheat Sheet:**

* `dua i`: Launch interactive mode.
* `j/k`: Move up/down the current directory list.
* `h/l`: Exit/enter a folder.
* `Space`: Mark/unmark the entry under the cursor.
* `Ctrl + r`: Remove all marked entries (careful!).
* `Ctrl + t`: Move the marked entries to the trash bin.
* `?`: Show all keyboard shortcuts.

[![](https://substackcdn.com/image/fetch/$s_!1nC2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea5a1af2-e7ff-44a1-b106-a3a1e4e79516_3006x1814.png)](https://substackcdn.com/image/fetch/$s%5F!1nC2!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea5a1af2-e7ff-44a1-b106-a3a1e4e79516%5F3006x1814.png)

## 🧠 Seniors are great with AI coding; Juniors are hurting themselves

Because juniors can vibecode their way through problems with AI, they skip the fundamentals. They get quick results now at the cost of long-term skills—specifically, the technical fundamentals required for critical thinking.

I watched a great video from **Lee Robinson** (DevRel at Cursor) about the future of AI coding, and one comment struck me:

> “To young engineers, don’t skip learning the basics.”

Another writer raised a similar alarm in a recent blog:

> “I fear we’re building a competency debt bubble. Just as we accrue technical debt by cutting corners in code and systems, we are now accruing human debt.”

## 🛠️ AI did my personal finance for 2026

OK, so I went wild. I know I could use local open models with Ollama for privacy reasons, but I was lazy and decided to give Gemini/Claude a go with some static CSV exports of my bank transactions. My biggest challenge is always figuring out categories and getting the “bigger picture” view. There are tons of tools to manage budgets, but I wondered how far I could get with just two CSVs and a good prompt.

**The two main challenges I faced:**

1. **The devil is in the details:** I had to set specific conditions and ask the AI to flag “unknown transactions” for me to comment on. It was a “human-in-the-loop” workflow, but with no UI—just Markdown. It gave me a list with a column where I could add comments on larger miscellaneous transactions.
2. **Weird numbers:** Of course, it initially hallucinated some recurring subscriptions, assuming I had them running for the full year when I didn’t. Other things didn’t add up either: it looks at a few transactions and then quickly jumps to assumptions. I asked it to back up all the numbers with actual sums using **[SQL and the DuckDB MCP server](https://github.com/motherduckdb/mcp-server-motherduck)**. It ran about ~30 queries locally to double-check its own math against the CSVs.

Honestly, it’s crazy how far you can get with such a minimal local setup.
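For the curious, the verification step boils down to plain aggregate queries over the CSVs. Here’s a minimal sketch using the `duckdb` Python package directly (my actual run went through the DuckDB MCP server); the file name and the `category`/`amount` columns are assumptions for illustration.

```
import duckdb

con = duckdb.connect()  # in-memory database, nothing leaves the machine

# Per-category totals, computed straight from the CSV export.
rows = con.sql("""
    SELECT category,
           SUM(amount) AS total_spent,
           COUNT(*)    AS n_transactions
    FROM read_csv_auto('transactions_2025.csv')
    GROUP BY category
    ORDER BY total_spent
""").fetchall()

for category, total_spent, n in rows:
    print(f"{category:<25} {total_spent:>12.2f} ({n} transactions)")

# Cross-check: per-category totals should add up to the overall sum.
overall = con.sql(
    "SELECT SUM(amount) FROM read_csv_auto('transactions_2025.csv')"
).fetchone()[0]
assert abs(sum(r[1] for r in rows) - overall) < 0.01
```

The point isn’t the query itself; it’s that the LLM’s narrative numbers get grounded in sums DuckDB actually computed.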
## **📚 What I read/watched**

* **[The future of coding: Idan Gazit breaks down Octoverse 2025](https://youtube.com/watch?v=MQOaBXwRfYo&si=G0vc6uSWCsSSRseS)**: A good breakdown of which language is being used for what, all in the context of AI. Great takes for the future.
* **[Dawn of the Brain-Rotted Zombies](https://joereis.substack.com/p/dawn-of-the-brain-rotted-zombies)**: Joe Reis raising the alarm on how we should teach (and learn) in the age of AI.
* **[AI codes better than me. Now what?](https://www.youtube.com/watch?v=UrNLVip0hSA)**: A developer shares his experience coding with AI over the past year. He shipped a lot of projects he never would have been able to finish otherwise.
* **[Will AI replace human thinking?](https://www.ssp.sh/brain/will-ai-replace-humans/)**: Another blog on the same theme as above, but with much more depth, evidence, and data.
* **[Critical n8n Flaw (CVSS 9.9)](https://thehackernews.com/2025/12/critical-n8n-flaw-cvss-99-enables.html)**: Another week, another vulnerability exposed. This time, arbitrary code execution on n8n.
* **[SSDs Have Become Ridiculously Fast, Except in the Cloud](https://databasearchitects.blogspot.com/2024/02/ssds-have-become-ridiculously-fast.html)**: An older blog (Feb 2024), but I think the take is still valid. NVMe drives in particular are crazy fast and still lap what’s available in the cloud. It’s a good reminder to keep some hardware local.

---

_My 6-year-old riding into 2026. Earlier this week, we went through a scary ER visit (not ski-related). First time for me—terrifying. But as you can see, he’s all good now and ready to take on 2026._

_Honestly, the only thing I wish for you this year is **health**._

[![](https://substackcdn.com/image/fetch/$s_!ihVu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F290d3f6c-61df-447a-9a7b-9d06f225a706_1489x1489.png)](https://substackcdn.com/image/fetch/$s%5F!ihVu!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F290d3f6c-61df-447a-9a7b-9d06f225a706%5F1489x1489.png)

---

## ctlr+r #06: Auto-Dubbing the Web & "Serverless" RAG

URL: https://mehdio.com/blog/ctlrr-06-auto-dubbing-the-web-and

Date: 2025-12-22T17:16:31.346

Weekly updates from the command line of my brain: **🧠 Thoughts, 🛠 Tools, and 📕 Takes** on software engineering, data, AI, and tech.

## 🧠 Breaking down language barriers for education

Translation features have been around for a while. It’s pretty standard to see them implemented across different social platforms, letting you translate a comment, a Reddit thread, or an entire web page. However, what has changed recently is that some of these websites are **now translating automatically**, making content **searchable** across languages.

For instance, Reddit has been rolling out [automatic translation features](https://redditinc.com/news/bringing-reddit-to-more-people-around-the-world-machine-learning-powered-localization-and-translation-launching-in-more-than-35-new-countries), causing quite a stir around mid-2024. This leads to users replying in their native language without realizing the original post was auto-translated, resulting in awkward interactions where people have to clarify, “Hey, we’re speaking English here.”

YouTube is also pushing heavily in this direction.
They have been rolling out features to **automatically translate video titles and descriptions**, making them searchable in other languages. Furthermore, YouTube officially launched **multi-language audio tracks** and, in **September 2025**, an AI-powered dubbing service.

This trend likely kicked off with MrBeast, who [originally created multiple localized channels and dubbed them manually](https://mrbeast.fandom.com/wiki/International%5Fchannels). He realized that if he wanted to be the biggest YouTuber in the world, he had to capture the giant markets that don’t speak English.

While AI auto-dubbing can still sound a bit rough, I noticed a few people from Brazil following MotherDuck/DuckDB content using the Portuguese auto-dubbing. So, even for technical videos, it seems to be working! Alec from ElevenLabs recently shared that creators got [significantly more views](https://www.linkedin.com/posts/alecwilcock%5Fdubbing-is-one-of-the-highest-roi-strategies-activity-7407732797311991808-4l-M?utm%5Fsource=share&utm%5Fmedium=member%5Fdesktop&rcm=ACoAAA0tl2QBJUocRMpCGqvWI8N%5FYbcsbmkLctY) when using their service to upload higher-quality dubs. The beauty of this is that you can easily clone your own voice with ElevenLabs, so it could be auto-dubbed with YOUR voice.

I think it’s great; we are tearing down barriers to amazing content that used to be language-specific. The only weird part is that we are evolving to view the world through an artificial layer that translates everything for us.

## 🛠️ “Serverless” RAG from Gemini

A couple of weeks back, Google released their [“serverless RAG” through Gemini File Search](https://ai.google.dev/gemini-api/docs/file-search#chunking%5Fconfiguration). In short, instead of having to chunk and embed data yourself and manage a vector database, they do most of the heavy lifting for you. You just upload a file and start querying it. Nice, right?

I’m a big fan of Gemini’s API (at the moment—meaning this past week, because god knows what AI model is going to be top-tier next week), so I gave it a go. There are basically three API calls: create a store, upload a file, and query it. In Python, [per their documentation,](https://ai.google.dev/gemini-api/docs/file-search) it looks like this:

```
from google import genai
from google.genai import types
import time

# create client
client = genai.Client()

# 1. Create the file search store
# (the display name will be visible in citations)
file_search_store = client.file_search_stores.create(
    config={'display_name': 'your-fileSearchStore-name'}
)

# 2. Upload a file into the store; chunking and embedding happen server-side
operation = client.file_search_stores.upload_to_file_search_store(
    file='sample.txt',
    file_search_store_name=file_search_store.name,
    config={
        'display_name': 'display-file-name',
    }
)

# Poll until the upload/indexing operation completes
while not operation.done:
    time.sleep(5)
    operation = client.operations.get(operation)

# 3. Query the store through the FileSearch tool
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="""Can you tell me about [insert question]""",
    config=types.GenerateContentConfig(
        tools=[
            types.Tool(
                file_search=types.FileSearch(
                    file_search_store_names=[file_search_store.name]
                )
            )
        ]
    )
)
```

I tried this on a couple of large PDFs, and one big issue is that you can’t easily update or delete existing documents/embeddings. If you have static information, that’s fine, but otherwise, it’s a major blocker. Another thought is that because the context windows of the latest models are so large, I feel that a lot of use cases actually don’t need RAG anymore.
Sometimes you want the whole corpus of a document in context, and it might not be smart to chunk it (e.g., “Give me the source blog of...”). Other than that, it’s pretty cheap, so it’s worth trying! ## **📚 What I read/watched** * **[The RAM Shortage Comes for Us All](https://www.jeffgeerling.com/blog/2025/ram-shortage-comes-us-all):** RAM prices are going up. AI datacenters being built are the culprit, and it seems the consumer line is getting hit. Manufacturers have a choice, but isn’t this a consequence of so much hardware being bought but not yet deployed for the AI giga-centers? * **[Column Storage for the AI Era](https://sympathetic.ink/2025/12/11/Column-Storage-for-the-AI-era.html):** A really good overview on the state of Parquet (vs. other file formats) from the creator of Parquet. It discusses what’s missing in Parquet today for the AI era, noting that the hardest part isn’t the file format itself, but getting the community and ecosystem to agree on specifications. * **[Building an answering machine](https://motherduck.com/blog/analytics-agents/):** Of course I’m biased as I work at MotherDuck, but with LLMs getting better, it feels like we are finally getting somewhere with “chat with your database” for analytics. Check the blog above for some neat examples. * **[Project managers will be the new developers](https://www.youtube.com/watch?v=TyNtEOcvER8):** WebDev Cody demoed his new project, [Automaker](https://github.com/AutoMaker-Org/automaker). I should definitely experiment with tools where you just input a task, branch it out as an agent, and then review the PR. * **[GitHub Actions Pricing Update](https://www.reddit.com/r/devops/comments/1po8hj5/github%5Factions%5Fintroducing%5Fa%5Fperminute%5Ffee%5Ffor/):** GitHub Actions is probably in my top 5 favorite cloud tools, so I’m really sad to see this nonsensical pricing: paying **per minute** on self-hosted GHA!? I don’t mind paying for the service, but that pricing model seems unfair. Mehdio's Tech (Data) Corner is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. --- _I’m on parental leave. No, it’s not a playground. Yes, the MotherDuck AMS office has a swing AND a slide._ [![](https://substackcdn.com/image/fetch/$s_!l3lF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55905ff7-4b57-4e31-bd69-46dded98527b_1806x1296.jpeg)](https://substackcdn.com/image/fetch/$s%5F!l3lF!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55905ff7-4b57-4e31-bd69-46dded98527b%5F1806x1296.jpeg) --- ## ctlr+r #05: Less UI, more chats, faster containers URL: https://mehdio.com/blog/ctlrr-05-less-ui-more-chats-faster Date: 2025-12-14T15:04:58.077 Weekly updates from the command line of my brain: **🧠 Thoughts, 🛠 Tools, and 📕 Takes** on software engineering, data, AI, and tech. ## 🧠 Are chat interfaces replacing traditional UIs? What if we used less UI and more code going forward? What if, instead of clicking through interfaces, you had a simple chat that handled all the interactions for you? [Lee Robinson](https://x.com/leerob) (DevRel at Cursor, previously at Vercel) shared his experience of [removing the entire CMS from the Cursor website](https://leerob.com/agents). A CMS is another layer of interface and complexity. 
In the age of AI-assisted development, it’s much easier to work with raw code so AI can easily modify things without navigating through additional layers. So, chat everywhere? Not quite. Lee’s example shows we could certainly remove some layers, but he also mentioned still wanting a basic GUI to manage assets, for instance. For website content management, you could definitely survive with minimalist features. However, many workflows will still need specific, opinionated UIs. A chatbot won’t always cut it. A good example is video editing. Sure, you could use a chatbot to remove all the “silence” from a video. But what the editor actually wants is the ability to easily adjust the silence threshold and—most importantly—preview changes so they can always roll back (or cut more). This requires tight integration with a classic video timeline and other controls. A chatbot won’t provide an efficient workflow here. ## 🛠 OrbStack: A better container solution for macOS? It’s weird this project didn’t fall onto my radar earlier. [OrbStack](https://orbstack.dev/), per their definition, is “the fast, light, and simple way to run containers and Linux machines. OrbStack offers excellent performance and seamless integration with macOS.” There are lots of alternatives to Docker Desktop. In the past, I’ve tried (in order of age/maturity) [Rancher Desktop](https://rancherdesktop.io/), [Podman](https://podman.io/), and recently [Apple’s own container runtime](https://blog.mehdio.com/p/apples-new-container-engine-bye-docker), simply called “Container” (yes, they missed the “iContainer” branding opportunity). All of them were nice, but sometimes felt clunky and, most importantly, didn’t always have good support for [devcontainers](https://containers.dev/), which is how I use containers locally every day for all development purposes. Docker Desktop may feel heavier and slower, but its support is generally solid. I decided to give OrbStack a spin with a devcontainer (one container) running a Node.js app. [![](https://substackcdn.com/image/fetch/$s_!gT35!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f83a20d-d9cd-43c8-82a0-24d8d1a9723f_1308x206.png)](https://substackcdn.com/image/fetch/$s%5F!gT35!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4f83a20d-d9cd-43c8-82a0-24d8d1a9723f%5F1308x206.png) [![](https://substackcdn.com/image/fetch/$s_!gnvC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7405a065-95db-4357-9b5c-3a3381ade27a_1170x324.png)](https://substackcdn.com/image/fetch/$s%5F!gnvC!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7405a065-95db-4357-9b5c-3a3381ade27a%5F1170x324.png) So yes, OrbStack has a significantly smaller footprint and includes nice features around mounted volumes and SSH to containers that I haven’t explored much yet. That said, it’s closed source (like Docker Desktop) and backed by a much younger company. I’ll continue using it as my default for the coming months and will report back. ## 📚 What I read/watched * [Reducing BigQuery Costs: How We Fixed A $1 Million Query](https://shopify.engineering/reducing-bigquery-costs) \- Shopify’s experience with a single query that would have cost nearly $1M. Old blog post, but a good reminder that small optimizations at scale can yield massive savings. 
* [this is the worst case scenario](https://www.youtube.com/watch?v=s81dVUM-cQM) - Low Level shows pragmatically how bad the critical vulnerability affecting both React and Next.js is ([CVE-2025-55182](https://cloud.google.com/blog/topics/threat-intelligence/threat-actors-exploit-react2shell-cve-2025-55182)). If you have any React or Next.js projects, patch them now!
* [The End of Coding Tutorials for Tech Creators?](https://www.youtube.com/watch?v=c58bMmlelow) - Francesco Ciulla and [Maximilian Schwarzmüller](https://www.youtube.com/channel/UCNxUdsuH8-kEGIwSD0r8RhQ), two creators with solid experience making successful coding tutorials, discuss the trend toward “tech entertainment” over deep-dive coding tutorials. IMO we’ll need both in the future, and there are too many “tech entertainment” videos at the moment.
* [Donating the Model Context Protocol and establishing the Agentic AI Foundation](https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation) - Anthropic announced they’re donating MCP to the Linux Foundation. This could be a double-edged sword—only time will tell. But it’s probably a win for AI consumers if it leads to broader adoption beyond just Anthropic.
* [Context Engineering the @mistercrunch Way](https://agor.live/blog/context-engineering) - Maxime Beauchemin ([@mistercrunch](https://github.com/mistercrunch), creator of Airflow) shares his insights on organizing rules for LLMs without bloating your `AGENTS.md` and `CLAUDE.md` files. Useful patterns, similar to what I have right now.

---

As I moved back to my hometown after 6 years, I opened some very old boxes and found my old consoles. After 20+ years, my save file was still there! The game is at 95% completion, and I have a sudden urge to finish it at 100%, but I can’t remember anything from back then lol. Can you guess which game it is, retro lovers?

[![](https://substackcdn.com/image/fetch/$s_!Myio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba66610a-ddb5-4a7d-be4c-b7df68fa8155_1356x996.jpeg)](https://substackcdn.com/image/fetch/$s%5F!Myio!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba66610a-ddb5-4a7d-be4c-b7df68fa8155%5F1356x996.jpeg)

---

## ctlr+r #04: AI’s Energy Bill , Free Localhost Tunnels

URL: https://mehdio.com/blog/ctlrr-04-ais-energy-bill-free-localhost

Date: 2025-12-06T16:09:51.666

A weekly recall from the terminal of my mind: **Thoughts 🧠, 🛠 Tools, and 📕 Takes.**

## 🧠 Is AI resource usage killing the planet?

I recently stumbled upon a good essay arguing that [ChatGPT is not bad for the environment](https://andymasley.substack.com/p/a-cheat-sheet-for-conversations-about). Then, I read a jarring counter-perspective claiming that a single [Sora AI video burns 1 kWh](https://reclaimedsystems.substack.com/p/every-sora-ai-video-burns-1-kilowatt?utm%5Fsource=share&utm%5Fmedium=android&r=1g8h3p&triedRedirect=true). Meanwhile, we see massive investments in new infrastructure, like [Anthropic’s $50 billion project](https://www.anthropic.com/news/anthropic-invests-50-billion-in-american-ai-infrastructure).

So, are we killing the planet or not? The truth lies in the details. First, we need to distinguish what “AI cost” actually means.
While data on this has historically been imprecise, [Google’s August 2025 report](https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference) finally gives us solid numbers. * **Text is cheap:** A query costs about **0.24 Wh** : roughly equivalent to **9 seconds of watching TV**. * **Video is expensive:** Generating video is exponentially more energy-intensive. While tools like Kling 2.6 offer utility for artists, the energy cost of flooding the internet with “fun” Sora-style clips is massive compared to text. Second, we must differentiate **Training** vs. **Inference**. * **Training** is like building a stadium. Massive energy goes into the construction, but that cost is “fixed” regardless of whether 1 person or 10,000 people sit in it. This is the scary part: we are currently making a massive energy bet : burning resources now to build these “stadiums” and hoping that AI will improve life enough to justify the cost. * **Inference** is the daily usage (the ticket to the stadium). Finally, there is a surprising trade-off between **Cloud vs. Local** inference. Technically, the cloud is much better at “math-per-watt” due to specialized cooling and efficient hardware. Your laptop is less efficient at the math, _but_ it represents “sunk carbon”. The environmental cost to manufacture your phone or laptop has already been paid. Using the “idle compute” we already own locally might be the smarter move to avoid building endless new data centers. The future will likely need to be a hybrid of both. ## 🛠 Tunnels: exposing localhost to the world I recently needed a quick tunneling service to test a few things locally. Specifically, I had a webhook I needed to register to verify that my app could receive it and trigger the correct actions. To do this, you typically need to register a public URL. This is where tunneling comes in: it makes your local machine (aka `localhost:3000`) available to the internet via a secure public link. I used to use **ngrok**, but it has become increasingly restrictive for free usage (bandwidth limits, no static domains without paying). After checking [awesome-tunneling](https://github.com/anderspitman/awesome-tunneling), I found [Cloudflare Tunnel](https://github.com/cloudflare/cloudflared) (`cloudflared`). It is incredibly simple to start and arguably the best free alternative right now. In a setup with AI tools like **Lovable** or **Replit**, this is becoming less critical because they often provide their own cloud previews out of the box. However, if you want a true local development environment, a tunnel is still essential. **I** suspected Cursor might eventually host apps for development purpose too. ## 📚 What I read / watched * **[Driving Xiaomi’s Electric Car: Are we Cooked?](https://www.youtube.com/watch?v=Mb6H7trzMfI)**: It’s great to see more competition in the EV space. Xiaomi (yes, the one that makes phones and tablets) seems to have really nailed it with this model. * **[Anthropic acquires Bun](https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone?s=09)**: Anthropic acquiring an OSS JS runtime is... interesting. I’m curious to see the future of this, but to me, it proves that we still need good engineers. Generating code might be becoming a commodity, but _engineering_? That isn’t going anywhere. Otherwise, they would have just forked the project. 
* **[Why Replicate is joining Cloudflare](https://www.google.com/search?q=https://blog.cloudflare.com/why-replicate-joining-cloudflare/%23:~:text%3DThis%2520is%2520why%2520we%27re,lives%2520entirely%2520on%2520the%2520network.)**: Replicate (think Vercel, but for AI model hosting) has been acquired. Cloudflare continues to build a really neat catalog of products, but I feel the one thing they are missing is strong developer branding. I see AWS, GCP, and Vercel champions everywhere. Cloudflare? Not so much. --- Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. Two months ago, I challenged myself to make the best Sunday morning pancakes for my family. Given that my cooking skills hover around zero, this was ambitious, but I’m finally getting somewhere! 🧑‍🍳 [![](https://substackcdn.com/image/fetch/$s_!L9Rl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa594d476-f5bb-4061-9073-5d3297aac893_1293x1341.jpeg)](https://substackcdn.com/image/fetch/$s%5F!L9Rl!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa594d476-f5bb-4061-9073-5d3297aac893%5F1293x1341.jpeg) --- ## ctlr+r #03: OSS fatigue & Orchestrating LLMs URL: https://mehdio.com/blog/ctlrr-03-oss-fatigue-and-orchestrating Date: 2025-11-30T14:12:16.38 A weekly recall from the terminal of my mind: **Thoughts 🧠, 🛠 Tools, and 📕 Takes.** ## 🧠 Open Source projects are becoming less convincing The business models and sustainability of open source have always been challenging. In the past, if you wanted to push something into the open for others to use and contribute to, the bar was incredibly high. You basically had to sacrifice your evenings and weekends just to write the code. That effort was a signal: it showed a genuine commitment to, at the very least, maintain and move the project forward. Now, with AI, anyone can spin up an open source project in a few minutes, throwing together code they’ve likely never reviewed. As a result, the commitment to maintainability has plummeted. So, I’m looking at new open source projects with much more skepticism now—unless they are actually backed by a company. There are still great new OSS projects, like [Ghostty](https://ghostty.org/) or [Omarchy](https://omarchy.org/), but these are exceptions led by people who [don’t need the money.](https://x.com/dhh/status/1964776333965427110) [![](https://substackcdn.com/image/fetch/$s_!QY-J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8913530-ba5d-444a-9cc6-83b9ec9b55bf_1070x656.png)](https://substackcdn.com/image/fetch/$s%5F!QY-J!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8913530-ba5d-444a-9cc6-83b9ec9b55bf%5F1070x656.png) ## 🛠 Modern orchestration for LLM workflows As a data engineer, I’m familiar with orchestrators like Airflow, Dagster, and [Kestra](https://kestra.io/). These are designed for data dependencies and heavy batch compute. They are primarily declarative DAGs: “here’s the whole pipeline, go run it.” Designing LLM workflows (or agents) requires a different approach. I need dynamic execution rather than raw compute power, as the system is mostly waiting for API responses. This is where event-driven, state-machine-style orchestrators shine. 
I’ve been testing [trigger.dev](https://trigger.dev/), a lightweight serverless tool where you define tasks directly within your code. You define tasks within your API, and the service calls those endpoints following a specific workflow while giving you clear observability. It’s nice because the orchestration is “built in” to your app; it’s just another service calling your endpoints.

[![](https://substackcdn.com/image/fetch/$s_!fuXc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03295559-9422-48de-b6f0-2be86da83358_3420x2024.png)](https://substackcdn.com/image/fetch/$s%5F!fuXc!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F03295559-9422-48de-b6f0-2be86da83358%5F3420x2024.png)

The trigger.dev panel for monitoring workflows

An honorable mention goes to [Inngest](https://www.inngest.com/), while [Temporal](https://temporal.io/) offers a heavier enterprise alternative. I’ll share a deep dive on these soon, once I’ve gathered more experience.

## 📚 What I read / watched

* **[650GB of Data (Delta Lake on S3): Polars vs DuckDB vs Daft vs Spark](https://dataengineeringcentral.substack.com/p/650gb-of-data-delta-lake-on-s3-polars):** A pragmatic showdown of processing 650GB with different compute engines. It takes a “no tuning” approach, acknowledging that most devs don’t read the docs anyway.
* **[What if you don’t need MCP at all?](https://mariozechner.at/posts/2025-11-02-what-if-you-dont-need-mcp/):** An interesting take suggesting we don’t always need MCP. Often, simply letting LLMs run bash scripts is enough.
* **[Knowledge Management in the Digital Age](https://youtu.be/BOJFHMtyqNs?si=OVl0B%5FTK19clsS6j):** A hands-on look at one developer’s “second brain.” He’s used Obsidian for years, so despite the length, there are great nuggets here for your workflow.
* **[My biggest programming regret](https://www.youtube.com/watch?v=XTzoBHECfXY):** A reminder to build what excites you, not just what pays. If you are passionate and get good at it, the career and money will follow naturally.

---

We are constantly bombarded with online success stories and “the grind.” Here’s a photo of my new daughter to remind you that success is worthless if you can’t share it with real people. Take time with your family and friends. Life is short.

[![](https://substackcdn.com/image/fetch/$s_!2UH2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa85c89-e33e-41c5-bf31-8db5a6361628_923x1050.jpeg)](https://substackcdn.com/image/fetch/$s%5F!2UH2!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa85c89-e33e-41c5-bf31-8db5a6361628%5F923x1050.jpeg)

---

## ctlr+r #02: How to not get stupid & The Parquet Killer?

URL: https://mehdio.com/blog/ctlrr-02-how-to-not-get-stupid-and

Date: 2025-11-24T17:22:22.24

_A weekly recall from the terminal of my mind: **Thoughts 🧠, 🛠 Tools, and 📕 Takes.**_

## 🧠 The LLM paradox: preserving critical thinking

This has been on my mind for months. With LLMs at our fingertips, we risk becoming lazy thinkers. We ask a question, get the answer, and copy-paste, often without understanding the “how” or engaging our own brains first. It feels like our short- and long-term memory is taking a hit.
**Do you actually remember most of the answers you get from LLMs?** That said, AI is a powerful learning tool. [Harvard](https://www.youtube.com/watch?v=6rAWxGAG6EI) has an internal AI chat for their CS50 students. It used to just “quack” (seriously, based on the [rubber duck debugging technique](https://en.wikipedia.org/wiki/Rubber%5Fduck%5Fdebugging)), but now it speaks English with what they call “pedagogical guardrails.” Instead of just giving answers, it **nudges students to think critically, ask questions**, and work their way toward the solution. Anthropic also announced [Claude for Education](https://www.anthropic.com/news/introducing-claude-for-education) back in April 2025 (aka “learning mode”). It’s not publicly available yet, but I suspect it follows the same Socratic approach. I tried Gemini’s [“Guided learning”](https://blog.google/outreach-initiatives/education/guided-learning/) announced in August 2025, but I was a bit disappointed. [![](https://substackcdn.com/image/fetch/$s_!4D1_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a601a0e-8be2-425a-bfb0-c4bf3ca7200f_676x664.png)](https://substackcdn.com/image/fetch/$s%5F!4D1%5F!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a601a0e-8be2-425a-bfb0-c4bf3ca7200f%5F676x664.png) right in the chat under “tools” I asked, “Can you explain to me what Apache Kafka is and how it works?” It jumped straight into a technical answer without assessing my background or the level of depth I wanted. It did eventually ask a follow-up, e.g., “What part of the Kafka system would you like to explore next?”, but I feel there is definitely space to improve how AI scaffolds learning rather than just retrieving facts. [Subscribe now](https://blog.mehdio.com/subscribe?) ## 🛠 Vortex: file format rethought for our era Parquet is legendary, but it was built for a different hardware era: spinning disks, HDFS, and large sequential scans. Today, we read data from fast NVMe and S3, mixing analytics with ML workloads that require quick, selective access. Vortex rethinks the layout for this modern environment: it aligns directly with Arrow’s in-memory format, ditches Parquet’s heavy row-group structure, and uses flexible encodings. The result is **10–20× faster table scans and up to 100× faster random access**, without sacrificing compression. For data engineers building retrieval-heavy systems, Vortex reduces I/O and CPU overhead while keeping the familiar columnar model. For any new standard, adoption is key. Vortex already supports Arrow, DataFusion, DuckDB, Spark, Pandas, and Polars, with Apache Iceberg support coming soon (!). It was recently added as a core [extension](https://duckdb.org/docs/stable/core%5Fextensions/vortex) in DuckDB, so I gave it a spin against a \~1GB Parquet file. The results were mixed: Vortex was faster, but not _crazy_ faster, and the file size was only 3% smaller. I need to do more rigorous testing, but it looks promising! ## 📚 What I Read / Watched * **[Agent Design Is Still Hard](https://www.youtube.com/watch?v=UAK6dQbnknE):** Armin Ronacher (creator of [Flask](https://flask.palletsprojects.com/en/stable/)) shares gems on building agents with pragmatic takeaways. A must-read if you are building an agent! * **[Librepods](https://github.com/kavishdevar/librepods):** Someone reverse-engineered Apple’s protocol to unlock all AirPod features on Android.
Note: you need a rooted phone unless you are on Oppo/OnePlus. * **[Post-mortem of Cloudflare’s outage](https://blog.cloudflare.com/18-november-2025-outage/):** The “Internet was down” again thanks to Cloudflare. Hats off to them for the detailed post-mortem so quickly. The root cause was a change to ClickHouse permissions that made a metadata query return duplicates, generating a bad config file. Insane how small issues can escalate in front of the whole world. * **[Fixing standup the only way I know how](https://www.youtube.com/@dreamsofcode):** Dreams of Code tries to automate stand-ups using n8n. TBH, it might be over-engineered, but it’s definitely an interesting attempt to solve the “problem” of the standup. Comments are interesting too. --- I was speaking at the [Forward Data conference](https://forward-data-conference.com/) today in Paris and the vibe was great! Here’s me spreading the Ducklife (made with [duckify.ai](https://duckify.ai/)) [![](https://substackcdn.com/image/fetch/$s_!xql0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75930ce-a61f-4568-b6ba-f86481ee7579_1184x864.png)](https://substackcdn.com/image/fetch/$s%5F!xql0!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb75930ce-a61f-4568-b6ba-f86481ee7579%5F1184x864.png) Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## ctlr+r #01: Toon, LLM CLIs URL: https://mehdio.com/blog/ctlrr-01-toon-llm-clis Date: 2025-11-16T14:49:53.35 [![](https://substackcdn.com/image/fetch/$s_!j6fa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69bc21-1e50-42ed-bddd-913199088558_410x283.png)](https://substackcdn.com/image/fetch/$s%5F!j6fa!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b69bc21-1e50-42ed-bddd-913199088558%5F410x283.png) Hey there, I’m starting a new habit that I hope to stick with and that will help you: **ctrl+r**: a weekly recall from the terminal of my mind. Each issue includes: * 🧠 something I’ve been thinking about * 🛠 something I’ve built, tried, or found useful * 📚 something I read or watched, with my take No fluff. Just highlights from my dev brain across data, AI, tools, and indie building. ## 🧠 Toon vs JSON Last week, the [TypeScript SDK for TOON hit 1.0](https://github.com/toon-format/toon): a compact, token-efficient alternative to JSON designed for LLM prompts. In short, instead of passing data to your LLM API through JSON like this:
```
{
  "users": [
    { "id": 1, "name": "Alice", "role": "admin" },
    { "id": 2, "name": "Bob", "role": "user" },
    { "id": 3, "name": "Charlie", "role": "admin" }
  ]
}
```
You would use TOON encoding like this:
```
users[3]{id,name,role}:
1,Alice,admin
2,Bob,user
3,Charlie,admin
```
As you can see: * JSON repeats keys ("id", "name", "role") **every row** → more tokens, more cost. * TOON declares keys **once**, then streams values row-by-row → significantly fewer tokens, especially for large arrays (10k+ rows). It’s interesting to see that while TOON **has been around for a month**, the trigger for the hype was the 1.0 release of the TypeScript SDK. Indeed, a large share of LLM API calls today are made from TypeScript.
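To see how mechanical the encoding is, here is a tiny hand-rolled TypeScript sketch that produces the tabular layout above from the JSON-style array. To be clear, this is not the official TOON TypeScript SDK (which handles nesting, quoting, and edge cases); it only shows why the keys are emitted once.
```
// Illustrative only: a naive TOON-style table encoder for flat, uniform rows.
// Use the official TOON TypeScript SDK for anything real.
type Row = Record<string, string | number>;

function toToonTable(name: string, rows: Row[]): string {
  const keys = Object.keys(rows[0]);
  const header = `${name}[${rows.length}]{${keys.join(",")}}:`;
  const lines = rows.map((row) => keys.map((key) => row[key]).join(","));
  return [header, ...lines].join("\n");
}

const users = [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
  { id: 3, name: "Charlie", role: "admin" },
];

console.log(toToonTable("users", users));
// users[3]{id,name,role}:
// 1,Alice,admin
// 2,Bob,user
// 3,Charlie,admin
```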
Meanwhile, the Python SDK is still very early and under heavy development, so Python-heavy data pipelines aren’t adopting it yet. ## 🛠 Multi-LLM CLI Like many of us, I’ve been juggling LLM tools (be it UIs or CLIs like Claude Code or Gemini CLI), and it always presents the same challenges: finding past chats and having a common repository for prompts instead of framework-specific ones. I’ve looked at multiple projects, and [aichat](https://github.com/sigoden/aichat) comes closest to what I need. It’s missing a built-in TUI (I opened an issue [here](https://github.com/sigoden/aichat/issues/1438) and prototyped one) to easily search for past conversations, but it has a decent REPL, and I’m already using it for some workflows.
```
# proofreader is a prompt+model definition, here fixing grammar and making lightweight edits
aichat --role %proofreader% "here's my sentence that could use some grammar edits"
```
You can then, of course, wrap it in a bash alias or even a [Raycast](https://www.raycast.com/) keybinding so you just write and get the result back. Other projects I checked that are worth mentioning: * [Opencode](https://github.com/sst/opencode) * [llm](https://github.com/simonw/llm) by Simon Willison ## 📚 What I Read / Watched * **[TUIs Are Perfect for LLM Chat](https://www.youtube.com/watch?v=UAK6dQbnknE)** Code to the Moon built [shore](https://github.com/MoonKraken/shore) for the same problem I faced and explained above. Aside from the project itself, he made really good points, and his channel consistently produces high-quality software videos (especially if you like Rust!). * **[90-percent](https://lucumr.pocoo.org/2025/9/29/90-percent/?ref=dailydev)** Armin Ronacher (creator of [Flask](https://flask.palletsprojects.com/en/stable/) and a popular developer in the OSS world) shares his experience with coding assistants, their strengths, and where they fail. Great insights from a _truly_ experienced developer! * [Python 3.14 will change the way you parallelise code](https://valatka.dev/2025/10/11/on-python-3-14-parallelization.html) There’s a lot of noise about 3.14 and free threading that could significantly speed up Python for some tasks. However, as the author said: _“I don’t think **data pipelines** will benefit from no-GIL though. Most don’t have latency requirements (rather throughput), and we already offload the bulk of CPU work.”_ * [How Not to Partition Data in S3 (And What to Do Instead)](https://luminousmen.substack.com/p/how-not-to-partition-data-in-s3-and) Interesting points highlighting that it’s generally more efficient to use `datetime` for partitioning large datasets on S3 (e.g., `my_path/dt=2026-11-01/events.parquet`) versus the common knowledge around Hive partitioning (`my_path/year=2026/month=11/day=01`). * [Gemini released their fully managed RAG service](https://blog.google/technology/developers/file-search-gemini-api/) also known as the “File Search Tool.” This is the next level of abstraction: you no longer have to manage vector databases and other components yourself. It will probably be useful for many use cases, as it’s often overkill to set up such infrastructure for the value it provides. --- Hope you all had a great week. I’m still figuring out what to program for my new NuPhy Air75 V3 keyboard keycaps... any suggestions???
[![](https://substackcdn.com/image/fetch/$s_!QDdq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f987f52-aa08-4553-b8e1-3b81eb4cbef8_1520x994.png)](https://substackcdn.com/image/fetch/$s%5F!QDdq!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f987f52-aa08-4553-b8e1-3b81eb4cbef8%5F1520x994.png) Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## An actually useful MCP for web development URL: https://mehdio.com/blog/an-actually-useful-mcp-for-web-development Date: 2025-07-19T14:46:05.661 [![](https://substackcdn.com/image/fetch/$s_!cwhY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48262b56-01b5-433d-b9a5-f9de3f16b5ee_1604x902.png)](https://substackcdn.com/image/fetch/$s%5F!cwhY!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48262b56-01b5-433d-b9a5-f9de3f16b5ee%5F1604x902.png) The MCP ecosystem is exploding —there are far more MCPs than actual users. We're drowning in noise. How do you figure out which MCP deserves your attention? [mcp.so](https://mcp.so/) lists 16,024 MCPs as of today 😱. Some are weekend hackathon projects, others are official implementations backed by companies pushing their product's MCP integration. There are also countless duplicates cluttering the space. Most of them are useless. Most of them will die. > As a reminder, [MCP (Model Context Protocol)](https://www.anthropic.com/news/model-context-protocol) is Anthropic's open standard that lets AI assistants securely connect to external data sources and tools. Think of it as a bridge between your AI chat and the real world—databases, APIs, file systems, and more. Instead of manually feeding information back and forth, MCPs let your AI assistant directly interact with these systems. _What makes an MCP worth your time?_ Here's a simple test: look at what you're copy-pasting into your AI chat. If the AI could make decisions automatically and eliminate that copy-paste step, you've probably found a use case worth building an MCP for. ## Web development's copy-paste problem Web development has two major pain points that create endless copy-paste cycles: **Runtime debugging**: Issues that only surface when you're running the local server. Your code compiles fine, but something breaks at runtime. You find yourself constantly copying error messages from the console, network request failures, and stack traces into your AI chat. **UI iteration hell**: Describing visual changes is inherently difficult. "Make this button there next to x" doesn't translate well to precise CSS modifications. You end up taking screenshots, uploading them to your AI assistant, describing what's wrong, getting code suggestions, implementing changes, taking another screenshot, and repeating the cycle. I sometimes create quick mockups in Figma as reference points, but I still end up in the screenshot → feedback → code → screenshot loop. It can be exhausting. In fact, the core problem is rather simple : it's the friction between your development environment and your AI assistant. They exist in separate worlds, connected only by your manual copy-paste bridge. 
## Browser Tools MCP [browser-tools-mcp](https://github.com/AgentDeskAI/browser-tools-mcp) is one of the genuinely useful MCPs out there (5.8k stars and counting). It eliminates the copy-paste friction by giving your AI assistant direct access to your browser environment. Here's what I've been using it for : **Console and network debugging**: read console logs, inspect network requests, and analyze runtime errors. It sees the same error messages you do, but faster. **Visual debugging**: It can take screenshots of your application and analyze the UI directly. **SEO analysis**: flag on-page SEO issues, check meta tags, analyze page structure, and suggest improvements without you having to audit everything manually. ### Setup The browser tools MCP works by establishing a bridge between your browser and the MCP server. Here's how the architecture works: The browser extension acts as a client that exposes browser APIs to the MCP server. When your AI assistant needs to interact with your webpage, it sends commands through the MCP protocol to the server, which then communicates with the browser extension. This architecture is nice because it maintains security boundaries—the AI can't directly execute arbitrary JavaScript in your browser. Instead, it goes through the controlled MCP interface, which exposes only specific, safe operations. [![](https://substackcdn.com/image/fetch/$s_!bJzB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6976970f-4c65-4f21-bf83-58797a7e0a44_1407x1422.png)](https://substackcdn.com/image/fetch/$s%5F!bJzB!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6976970f-4c65-4f21-bf83-58797a7e0a44%5F1407x1422.png) Overview of the process between AI assistant and your browser Getting started requires a few components: 1. **Browser extension**: Install the browser tools extension and enable it. This creates the bridge between your browser and the MCP server. 2. **Active browser tab**: You need a running application to inspect. The extension works with any webpage, but it's most useful with your local development server. 3. **MCP server**: Install and run the server with `npx @agentdeskai/browser-tools-mcp@latest`. This handles the communication between Claude and your browser. 4. **Server service**: Run `npx @agentdeskai/browser-tools-server@latest` to start the service that manages browser interactions. This setup process reveals something important about MCP adoption: it's still early days. The fact that you need to run multiple services and install browser extensions shows that MCPs aren't quite plug-and-play yet. But for developers who deal with the copy-paste problem daily, the setup overhead is worth it. After that, you can just prompt it to check whatever logs (or screenshots) are needed. More data and context means better code. 
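To give a feel for what "exposing only specific, safe operations" means on the server side, here is a minimal sketch using the official MCP TypeScript SDK (`@modelcontextprotocol/sdk`). The tool name and the canned response are purely illustrative; this is not browser-tools' actual implementation, just the general shape of an MCP tool definition.
```
// Minimal MCP server sketch with one narrowly scoped tool.
// Illustrative only: browser-tools-mcp's real tools and wiring differ.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "browser-demo", version: "0.1.0" });

// The AI can only call this specific operation, never run arbitrary JavaScript.
server.tool(
  "get-console-logs",
  { limit: z.number().optional() },
  async ({ limit }) => ({
    content: [
      { type: "text", text: `(last ${limit ?? 50} console log lines would go here)` },
    ],
  })
);

await server.connect(new StdioServerTransport());
```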
[![](https://substackcdn.com/image/fetch/$s_!G78e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ba4eb-3b90-4351-81d4-1a853972065e_468x596.png)](https://substackcdn.com/image/fetch/$s%5F!G78e!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F535ba4eb-3b90-4351-81d4-1a853972065e%5F468x596.png) example of interactions with the MCP in Cursor ## The bigger picture Browser tools MCP represents what good MCP development looks like: solving a real, specific problem that developers face daily. It's not trying to be everything to everyone—it focuses on browser interaction. The key is identifying those repetitive, mechanical tasks that create friction between you and your AI assistant. Every copy-paste operation is a potential MCP waiting to be built. Yes, setup for MCP tools like this one is not easy, and there is plenty of room for improvement. That being said, it saves me countless copy-paste cycles. Here's to the next MCP that makes me less of a monkey worker 🥂 Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## Is Gemini CLI worth it for Cursors users ? URL: https://mehdio.com/blog/is-gemini-cli-worth-it-for-cursors Date: 2025-07-08T20:54:18.197 --- ## Apple’s new "Container" Engine (Bye Docker?) URL: https://mehdio.com/blog/apples-new-container-engine-bye-docker Date: 2025-06-15T13:23:25.689 [![](https://substackcdn.com/image/fetch/$s_!JJSZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eaf7195-1d04-4f73-b435-5672232a025a_1604x902.png)](https://substackcdn.com/image/fetch/$s%5F!JJSZ!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6eaf7195-1d04-4f73-b435-5672232a025a%5F1604x902.png) Apple just dropped something no one expected at WWDC (their annual developer conference) with the announcement of their own containerization framework for macOS. And yes, everyone's asking the same question: **is this the end of [Docker Desktop](https://www.docker.com/products/docker-desktop/) and [Podman](https://podman-desktop.io/) on macOS?** These are two popular alternatives that developers use to run local linux containers today. Let's dive into what Apple has built, explore its features and current limitations, and get our hands dirty with some actual code examples. If you prefer watching over reading : ## Meet "Container" (yes, that's really the name) I have to say, Apple's developers aren't quite as creative as their marketing team when it comes to naming. The project is simply called **Container**, though if you check their [official announcement page](https://developer.apple.com/videos/play/wwdc2025/346/), it's referred to as the "Containerization Framework." 
The framework consists of two main repositories: * **Containerization**: The core virtualization engine * **Container**: The CLI tool that developers will use to create and manage lightweight VMs One fun fact that caught my attention: I know for certain that Apple actually hand-wrote this code because a [PR was submitted to fix typos](https://github.com/apple/container/pull/122/files) in variable names and comments - something an AI agent will never do 😅 [![](https://substackcdn.com/image/fetch/$s_!eWsY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79750be3-fb9e-4b59-a8f8-85a271ca1b41_3326x1322.png)](https://substackcdn.com/image/fetch/$s%5F!eWsY!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79750be3-fb9e-4b59-a8f8-85a271ca1b41%5F3326x1322.png) Kudos to Apple for the handcrafted code 👏 ## The new architecture: one Virtual Machine (VM) per container The framework is built as an **open-source Swift** framework, released under the **Apache 2.0 license**. This is particularly interesting, considering Docker Desktop has not been free for commercial use for some time. If you’re using Docker Desktop at work, you probably owe Docker some money. > Docker Desktop is free for small businesses (fewer than 250 employees AND less than $10 million in annual revenue), personal use, education, and non-commercial open source projects. Here's where things get really interesting. A traditional container engine like Docker Desktop runs one large Linux virtual machine in the background, even when no containers are running. All your containers share this single VM, which can introduce security risks since files must pass through the shared virtual machine before reaching your container. [![](https://substackcdn.com/image/fetch/$s_!SB2r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b535d4c-725a-472a-83fa-abfd62a0219d_4466x2548.png)](https://substackcdn.com/image/fetch/$s%5F!SB2r!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b535d4c-725a-472a-83fa-abfd62a0219d%5F4466x2548.png) Traditional container engine like Docker Desktop Apple's approach is fundamentally different: **each container gets its own dedicated virtual machine**. This provides: * **Strong security isolation** * **Complete container separation** * **Individual IP addresses for each container** [![](https://substackcdn.com/image/fetch/$s_!W_W4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564ee76-2e89-4a53-8269-7413afbcdb42_1628x879.png)](https://substackcdn.com/image/fetch/$s%5F!W%5FW4!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa564ee76-2e89-4a53-8269-7413afbcdb42%5F1628x879.png) Apple’s container framework That last point is nice - no more port forwarding headaches! You can access your local web services directly via the container's IP address. Apple claims startup times are within seconds, which is pretty impressive for full VM isolation. But let's confirm this and run the thing.
## Installing Container on macOS To install it, simply download and run the package installer file (`.pkg`) from their [releases page](https://github.com/apple/container/releases). Once done, you first have to start the service:
```
container system start
```
If you are familiar with Docker Desktop or Podman, you'll feel right at home, as the commands are pretty similar. And yes, it does support building from a `Dockerfile`.
```
$ container --help
OVERVIEW: A container platform for macOS

USAGE: container [--debug]

OPTIONS:
  --debug              Enable debug output [environment: CONTAINER_DEBUG]
  --version            Show the version.
  -h, --help           Show help information.

CONTAINER SUBCOMMANDS:
  create               Create a new container
  delete, rm           Delete one or more containers
  exec                 Run a new command in a running container
  inspect              Display information about one or more containers
  kill                 Kill one or more running containers
  list, ls             List containers
  logs                 Fetch container stdio or boot logs
  run                  Run a container
  start                Start a container
  stop                 Stop one or more running containers

IMAGE SUBCOMMANDS:
  build                Build an image from a Dockerfile
  images, image, i     Manage images
  registry, r          Manage registry configurations

SYSTEM SUBCOMMANDS:
  builder              Manage an image builder instance
  system, s            Manage system components
```
To pull an image, you would do:
```
container image pull python:3.12
```
And to run, for instance, a `python` shell within the above image:
```
container run -it python:3.12 python
```
[Subscribe now](https://blog.mehdio.com/subscribe?) ## Current limitations Before you get too excited, there are some important caveats. The framework is designed primarily for **macOS 26** (the upcoming 2025 fall release, still in beta at the time of this writing), though it works on macOS 15 with limitations. > Yes, the version jump from **macOS 15 to 26** is definitely confusing. Apple decided to align macOS version numbers with the current year. The transition might sound even more confusing at first, but it will make more sense over time—you’ll be able to tell which year the OS is from just by its version name! Key limitations as of now on [macOS 15](https://github.com/apple/container/blob/main/docs/technical-overview.md#macos-15-limitations): * **Container-to-container networking isn't fully supported yet**. This is a major limitation if you're running multi-container setups like a web server with a database. * **Container IP Address Management**: there are still some rough edges around how container IP addresses are handled and accessed. ## Performance comparison: Container vs Docker Desktop I ran some quick tests comparing Container with Docker Desktop by running a simple Python shell through a `python:3.12` image, and the results were interesting: Surprisingly, pulling images was noticeably faster with Docker Desktop compared to Container. I'm not entirely sure why, but one likely reason is that I already had some layers cached in the local Docker registry.
Looking at Activity Monitor: **Docker Desktop:** * Large shared VM consuming \~3.5GB RAM * Multiple background processes * Consistent CPU usage even when idle **Apple Container:** * Smaller individual VM footprints \~ 200MB RAM per container * Lower baseline resource usage * More efficient per-container resource allocation [![](https://substackcdn.com/image/fetch/$s_!PxJa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9647934-4893-422d-ae27-e3fd34567c47_4174x1376.png)](https://substackcdn.com/image/fetch/$s%5F!PxJa!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9647934-4893-422d-ae27-e3fd34567c47%5F4174x1376.png) However, keep in mind that Docker’s shared VM model means the overhead is amortized across all containers. With Apple’s approach, each container incurs its own overhead—but in practice, **this is usually still lower** than the Docker Desktop engine, since most developers typically run only 2–3 containers at a time. ## Exciting but early Apple's containerization framework represents a great approach to linux container isolation on macOS. The performance benefits and security improvements are nice, especially the individual VM architecture and direct IP addressing. However, it's still very early days. The missing multi-container networking, uncertain Docker Compose support, and [devcontainer](https://code.visualstudio.com/docs/devcontainers/containers) integration (for VSCode/Cursor) questions make it hard to recommend for use today. That being said, I expect these feature gaps to be filled by the time macOS 26 reaches general availability—so let’s take it for another test drive in a few months! Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## The Slow Death of Medium-Sized Software Companies URL: https://mehdio.com/blog/the-slow-death-of-medium-sized-software Date: 2025-06-01T16:00:26.57 [![](https://substackcdn.com/image/fetch/$s_!NCSI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c63cf0-0f41-491e-b9e9-fb97c1f4e003_1280x720.png)](https://substackcdn.com/image/fetch/$s%5F!NCSI!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51c63cf0-0f41-491e-b9e9-fb97c1f4e003%5F1280x720.png) Here's my prediction for the future : **Medium-sized software companies are dying.** And it won’t be because of market crashes, VC winter, or the next AI wave killing jobs (well, maybe a little). They’ll die because, for the first time ever, _they may become unnecessary_. When I say _“software companies,”_ I mean exactly that: businesses built purely on software. No factories. No physical supply chain. No custom silicon or overseas production dependencies. You might still get an AWS bill that looks like a ransom note, but that’s about as “operational” as it gets. Software is light. And in 2025, with all things serverless, cloud-native, and generous free tiers, it’s lighter than ever. AI has dramatically reduced the cost of shipping features and building MVPs. But I think the really wild shift isn’t in cost — it’s in **team size**. Medium-sized companies are in the worst spot. Startups are nimble. Giants have capital and reach. But the ones in the middle? Too big to stay scrappy, too small to compete on scale. 
Let’s unpack that — and what it means for the future of software engineering if that prediction plays out. ## Small teams, big power Thanks to AI and improved automation, small teams are now playing in the big leagues. I actually know a few solo-preneurs and/or small teams landing enterprise clients. Small teams aren’t just fast at building products. They’re close to their customers. They have ultra-tight feedback loops. They can tailor solutions with speed and precision — while larger orgs are still scheduling the kickoff meeting for the kickoff meeting. Everyone who's worked at a fast-growing company knows that scaling is painful. Hiring takes time. Onboarding takes longer. Context sharing gets messy. And suddenly, the scrappy startup that shipped a major feature every week is now bogged down in process, sync meetings, and permissions management. If you've ever been frustrated at work just because of a missing permission or a ticket stuck in limbo… I feel you. With fewer people, you ship faster. You stay closer to your users. You kill features (or even whole products) quickly without causing internal revolts. And you let AI fill in the gaps that used to require entire departments. What if staying small is not only _possible_, but preferable? ## A Better world for customers — and Engineers? Now that we acknowledge the possibility of this shift, what happens when the market is full of these small-but-mighty teams? From a customer’s perspective, it might feel like a golden age. More options. More competition. Faster updates. Closer support. Yes, discovery could get overwhelming. But honestly, that's a solvable problem — and far better than the alternative of stale monopoly software. Think of how exciting the indie game scenes are compared to their corporate counterparts (if you are a gamer like me, you know). Software starts to feel alive again. From a software engineer's perspective, it’s a little more nuanced. If the trend holds, we’ll see **more jobs — but in smaller companies**. That means: * You’ll probably wear more hats. And that’s not a bad thing. * Job security in any one company could be lower — but you’ll gain experience faster, with more chances to work on different products throughout your career. * You'll be closer to the business, with a tighter feedback loop. You’ll learn about infrastructure, product decisions, customer feedback loops, and maybe even business models. You’ll feel the impact of what you build for real users. This feels like a great opportunity for continuous learning. Sure, it won't be your cozy corporate job, but the growth opportunity will be endless. ## Generalists first If small companies become the new default, **software engineers will need to think more like product builders** — or even founders. As in any small startup, you won’t just be writing code. You’ll be choosing tools, managing infrastructure, talking to users, prioritizing features and yes, sometimes answering support emails. That means **generalist skills are going to matter more**: full-stack engineering, rapid prototyping, even basic design and copywriting. Playing devil's advocate — there will still be a need for deep specialists. There are plenty of domains where depth still wins. If you’re working on database systems, fintech, or medical software, you’ll need real expertise. AI can help with boilerplate, but it can’t replace judgment or domain knowledge. 
For example, if you’re building a new payment platform, knowing how settlements work, how fraud detection operates, or how different regulatory environments interact is _not_ something ChatGPT can fully solve for you. Especially in R&D, fundamentals matter. The difference is that specialists may increasingly become **contracted or embedded**, rather than hired into an in-house team. Small companies will bring in that deep knowledge when they need it, not assume they need a full-time expert for every vertical. ## What should you do now? If you're a software engineer today, this shift opens up real opportunities — but also calls for a mindset change. First, embrace AI tools to increase your productivity. My key takeaway with AI is that you shouldn’t delegate the entire process. Even “vibe coding” requires domain knowledge. Keep learning the fundamentals and stay curious about how things work. The boring implementation code? Sure, let AI handle that. Second, don’t fear small companies. You’ll probably learn more in 6 months at a 5-person startup than you will in 2 years at a large corporation. Learn to wear multiple hats. Finally, the good news is : you can prototype all of this by yourself, as a solo-preneur. The best way to learn is to get your hands dirty. My own side project right now is [subtldr](https://subtldr.com/) (yes, a small plug — but hey, you're getting this content for free, so I figured it's fair). Maybe I’m overly optimistic. Or maybe we’re finally entering a more sustainable, more human phase of software? Who knows, but let’s talk again in 2030. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## Making Cursor smarter (and up to date) URL: https://mehdio.com/blog/making-cursor-smarter-and-up-to-date Date: 2025-05-29T12:31:35.709 [![](https://substackcdn.com/image/fetch/$s_!VTPx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a59050-813e-45a8-b9e7-7a09bc83044d_1397x821.png)](https://substackcdn.com/image/fetch/$s%5F!VTPx!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a59050-813e-45a8-b9e7-7a09bc83044d%5F1397x821.png) A lot of software engineers use AI daily to write code, but how many are actually taking advantage of all the features available to them? Sure, you can craft solid prompts and rely on your technical knowledge to navigate the output. But in today’s world of rapid innovation, some teams — like Cursor — are shipping features at lightning speed. Should you care about all of them? Probably not. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. Could some of them drastically improve your workflow with AI? Definitely. If your AI often writes outdated API code or you’re constantly living on the bleeding edge (because hey, things move fast), then this blog post is for you. > **Why [Cursor](http://cursor.com)?** > > I’ve been using Cursor for almost a year now, after switching back and forth with VSCode. While VSCode is starting to adopt features from Cursor, such as [agent mode](https://code.visualstudio.com/updates/v1%5F99), the implementation still lags. With recent layoffs, even if the VSCode team isn’t directly affected, I bet Cursor will continue to innovate at a faster pace, making it my preferred editor. 
## Docs context is king Let’s look at what I believe are the two most underused Cursor features: documentation context and Cursor rules. In Cursor, under `Settings > Cursor Settings > Features > Docs`, you can add documentation sources to be used as context in your prompts. These sources are crawled and indexed. They can be: * Documentation websites * API docs * Raw GitHub code (if open-source) When you add a custom documentation URL, you give it a name (an alias for your prompts), and Cursor crawls and indexes it for you. [![](https://substackcdn.com/image/fetch/$s_!Xf4I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570a7080-e14f-41cb-befe-7936602424c0_1822x314.png)](https://substackcdn.com/image/fetch/$s%5F!Xf4I!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570a7080-e14f-41cb-befe-7936602424c0%5F1822x314.png) Once these are added, you can reference them in your prompt using `@docs`. [![](https://substackcdn.com/image/fetch/$s_!zm0j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffe2aa5-6742-4b8b-89ae-cea358322a95_1084x306.png)](https://substackcdn.com/image/fetch/$s%5F!zm0j!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ffe2aa5-6742-4b8b-89ae-cea358322a95%5F1084x306.png) Now, there’s actually something better than just adding the plain root URL of the documentation website: something that makes it easier for Cursor to crawl and index the documentation. ## llms.txt While `robots.txt` and `sitemap.xml` are designed for search engines, [LLMstxt.org](https://llmstxt.org/) is a new standard optimized for LLMs. It provides site information in a format LLMs can easily parse. It's an evolving standard, and many developer tool docs have started adopting it this past year as LLM usage grew drastically. It solves a real problem: when AI scrapes raw HTML, it gets a lot of noise — navigation bars, JavaScript, CSS 🤢. This is especially important now because we will soon have (or may already have) more LLMs reading the docs than humans. [Andrej Karpathy](https://x.com/karpathy) highlighted this trend shift in his [post](https://x.com/karpathy/status/1914494203696177444). [![](https://substackcdn.com/image/fetch/$s_!VLb1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa873d76d-e048-4694-ae76-ee0818d18856_1182x1056.png)](https://substackcdn.com/image/fetch/$s%5F!VLb1!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa873d76d-e048-4694-ae76-ee0818d18856%5F1182x1056.png) `llms.txt` offers clean, structured information optimized for AI, making updates much faster than re-crawling entire websites. The specification defines two files: 1. `/llms.txt`: A structured view of your documentation navigation (like a Markdown-based `sitemap.xml`) 2.
`/llms-full.txt`: A single file containing all your documentation in one place **Quick example with Supabase documentation** A lot of documentation websites may highlight the `llms.txt` link directly as a button, but in general, you can just try `awesometool.com/docs/llms.txt`, `docs.awesometool.com/llms.txt`, or `myawesometool.com/llms.txt`. Supabase, interestingly, has different `llms.txt` files at `supabase.com/llms.txt` depending on the client API. [![](https://substackcdn.com/image/fetch/$s_!WLbN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1fe6fa-9518-4bcb-9b00-79a2184426bd_1406x520.png)](https://substackcdn.com/image/fetch/$s%5F!WLbN!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c1fe6fa-9518-4bcb-9b00-79a2184426bd%5F1406x520.png) This shows that while there’s some standardization around `llms.txt`, the format is still flexible and evolving. ## In action with rules Cursor rules live in `.cursor/rules`, and you can scope them using path patterns. Each rule is written in a `.mdc` file — a kind of supercharged Markdown designed for Cursor. Unfortunately, as of today, you can’t reference `@docs` in rules — only static files in your repo, as per the discussion in the forum [here](https://forum.cursor.com/t/can-we-reference-docs-files-in-the-rules/23300). You can find more Cursor rules at: https://cursor.directory/ I use Cursor rules to explain high-level setup and tech stack. Without them, the LLM might suggest alternatives instead of using what’s already in place. Cursor rules prevent that. Example Cursor rule for [my personal website](https://mehdio.com/):
```
You are an expert full-stack web developer focused on producing clear, readable Next.js code. You always use the latest stable versions of Next.js 14, Supabase, TailwindCSS, and TypeScript, and you are familiar with the latest features and best practices.
Prompt Generation Rules:
- Analyze the component requirements thoroughly
- Include specific DaisyUI component suggestions
- Specify desired Tailwind CSS classes for styling
- Mention any required TypeScript types or interfaces
- Include instructions for responsive design
- Suggest appropriate Next.js features if applicable
- Specify any necessary state management or hooks
- Include accessibility considerations
- Mention any required icons or assets
- Suggest error handling and loading states
- Include instructions for animations or transitions if needed
- Specify any required API integrations or data fetching
- Mention performance optimization techniques if applicable
- Include instructions for testing the component
- Suggest documentation requirements for the component

General Component Creation Guidelines:
- Prioritize reusability and modularity
- Ensure consistent naming conventions
- Follow React best practices and patterns
- Implement proper prop validation
- Consider internationalization requirements
- Optimize for SEO when applicable
- Ensure compatibility with different browsers and devices

General Rules:
- Enable strict TypeScript (strict: true in tsconfig.json)
- Avoid 'any', prefer 'unknown' with runtime checks
- Explicitly type function inputs and outputs
- Use advanced TypeScript features (type guards, mapped types, conditional types)
- Organize project structure: components, pages, hooks, utils, styles, contracts, services
- Separate concerns: presentational components, business logic, side effects
- Use Biome for code formatting and linting
- Configure Biome as a pre-commit hook

Next.js Rules:
- Use dynamic routes with bracket notation ([id].tsx)
- Validate and sanitize route parameters
- Prefer flat, descriptive routes
- Use getServerSideProps for dynamic data, getStaticProps/getStaticPaths for static
- Implement Incremental Static Regeneration (ISR) where appropriate
- Use next/image for optimized images
- Configure image layout, priority, sizes, and srcSet attributes

TypeScript Rules:
- Enable all strict mode options in tsconfig.json
- Explicitly type all variables, parameters, and return values
- Use utility types, mapped types, and conditional types
- Prefer 'interface' for extendable object shapes
- Use 'type' for unions, intersections, and primitive compositions
- Document complex types with JSDoc
- Avoid ambiguous union types, use discriminated unions when necessary

TailwindCSS and shadcn/ui Rules:
- Use TailwindCSS utility classes for styling
- Avoid custom CSS unless absolutely necessary
- Maintain consistent order of utility classes
- Use Tailwind’s responsive variants for adaptive designs
- Leverage shadcn/ui components for rapid development
- Customize shadcn/ui components only when necessary
- Define and use design tokens in tailwind.config.js
```
You can go pretty far with multiple rules, but even a high-level overview of your tech stack will already help a lot. It stops the LLM from blindly guessing your stack or suggesting duplicate libraries. With a combination of rules and `@docs`, your answers will improve significantly. Again: **context is king**. LLMs are like pets — they need to be fed properly. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.
--- ## macOS: Essential Productivity Hacks for Developers — No AI Needed URL: https://mehdio.com/blog/macos-essential-productivity-hacks Date: 2025-05-04T17:45:40.56 [![](https://substackcdn.com/image/fetch/$s_!GtDY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4786b187-99e4-417c-b86e-15657f5ae31e_1600x896.png)](https://substackcdn.com/image/fetch/$s%5F!GtDY!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4786b187-99e4-417c-b86e-15657f5ae31e%5F1600x896.png) Yes - this is a macOs setup. I’ve never been this productive on my Mac in the past two years. Sure, AI tools have helped a lot — but one of the key changes I made had nothing to do with AI. It’s much simpler than that. I'm saving a ton of time just managing my laptop: opening new apps, switching tabs, or finding things is blazingly fast. It almost feels like mind control — but in reality, it's just a focused keyboard workflow. That’s the real key. To do this, I use three open-source tools. In this blog, I’ll break down my setup step by step, how I use it, and why: * [Aerospace](https://github.com/nikitabobko/AeroSpace) * [SketchyBar](https://github.com/FelixKratz/SketchyBar) * [Raycast](https://www.raycast.com/) (bonus) If you want a complete reference, you’ll find my full configuration in my [dotfiles repository](https://github.com/mehd-io/dotfiles). ## Level 0: Keyboard First Let me repeat: I don’t think you can be truly productive as a developer (or any app-intensive user) if you rely too much on your mouse. The keyboard unlocks so many underrated possibilities. We often underestimate how slow it is to click through UI elements. > Fun fact: Some banks and insurance companies still use old mainframe software where operators _only_ use keyboards. I’ve worked on projects revamping those systems, and surprisingly, many operators were frustrated at being forced to use a mouse — even if the UI looked more modern and intuitive. Here’s a simple test: if you’re right-handed, try using the mouse with your left hand for a few days. It’ll be so painful that you’ll naturally start figuring out faster ways to navigate — with the keyboard. Once you get past the initial friction of memorizing keyboard shortcuts, you’ll never go back. ## Level 1: Free Up Your Screen So you’ve just booted up your Mac, and you see that ugly dock taking up precious screen space. [![](https://substackcdn.com/image/fetch/$s_!AUN5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd11967ff-ad49-48fd-9adc-e8145e93d6b7_3018x1690.png)](https://substackcdn.com/image/fetch/$s%5F!AUN5!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd11967ff-ad49-48fd-9adc-e8145e93d6b7%5F3018x1690.png) A lot of vertical space lost As developers, we spend most of our time in editors and browsers — every extra vertical pixel counts. The first smart move? Shift the dock to the left and set it to auto-hide. Yes, it feels weird and empty at first. You might wonder: _How do I switch apps now?_ ## Level 2: App Shortcuts There are multiple ways to handle app shortcuts — even built-in ones on macOS. [Aerospace](https://github.com/nikitabobko/AeroSpace), a tiling window manager, also supports assigning shortcuts to apps, all from a single config file. 
To install Aerospace with [Homebrew](https://brew.sh/), you simply run:
```
brew install --cask nikitabobko/tap/aerospace
```
Once launched, you’ll find its icon in the menu bar. The configuration file is located at: `~/.config/aerospace/aerospace.toml` To create an app shortcut, use `exec-and-forget` and map it to a keybinding:
```
alt-b = 'exec-and-forget open -a /Applications/Brave\ Browser.app'
```
I typically use the first letter of the app or context: `alt+b` (browser), `alt+c` (code) and `alt+t` (terminal) Now you have a clean, keyboard-based app launcher — and more screen space. On to the next level: window management. ## Level 3: Window Management This is where a tiling window manager shines. It eliminates floating window chaos and allows for a structured, keyboard-driven layout. > What about [Yabai](https://github.com/koekeishiya/yabai)? I used Yabai for more than a year, and honestly, it integrates more natively. But since macOS 15.2, [SIP](https://github.com/koekeishiya/yabai/wiki/Disabling-System-Integrity-Protection) needs to be disabled for some key features, so it was a sad goodbye. If disabling SIP is not a blocker for you, I would recommend watching [my video on Yabai setup](https://www.youtube.com/watch?v=J4SXh8UhiCQ). That being said, Aerospace now covers most of the same features. To improve visual focus, I also recommend [JankyBorders](https://github.com/FelixKratz/JankyBorders) — a lightweight tool that highlights the currently focused window. [![](https://substackcdn.com/image/fetch/$s_!S3VE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4de41e-d2f3-4004-805a-c58f1e200f5b_3000x1618.png)](https://substackcdn.com/image/fetch/$s%5F!S3VE!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4de41e-d2f3-4004-805a-c58f1e200f5b%5F3000x1618.png) Border cue with JankyBorders You can install it through Homebrew too, and the default configuration is already pretty good.
```
brew tap FelixKratz/formulae
brew install borders
```
Now back to `aerospace.toml`. Here are some useful shortcuts for managing windows:
```
# toggle window zoom
alt-f = 'fullscreen'

# toggle window split type
alt-e = 'layout tiles horizontal vertical'

# focus window
alt-h = 'focus left'
alt-j = 'focus down'
alt-k = 'focus up'
alt-l = 'focus right'
```
For instance, whenever I open a new window in my browser, it tiles in nicely next to the existing one. I can then easily switch focus to another window, still using only the keyboard. That's where a visual cue from JankyBorders, showing which window is currently focused, helps. You can also change the split or zoom temporarily on one window. It starts to feel… blazingly fast. 🧢 Subscribe for free to receive new posts and support my work. ## Level 4: Workspaces With smart workspace management, I barely need a second monitor. I use just one 90% of the time. macOS has a built-in _Spaces_ feature, but there’s no clean API to control it. Yabai can hack into it, but again — SIP needs to be disabled. Instead, Aerospace introduces its own _virtual_ workspaces. Why use workspaces? 1. Dedicated apps on dedicated workspaces 2.
Minimize clutter — aim for 1–2 apps per workspace You can automate app-to-workspace assignment like so:
```
[[on-window-detected]]
if.app-id = 'com.brave.Browser'
run = ['move-node-to-workspace 1']
```
Navigate or move windows between workspaces with:
```
alt-1 = 'workspace 1'
alt-2 = 'workspace 2'
alt-3 = 'workspace 3'
alt-4 = 'workspace 4'
alt-5 = 'workspace 5'
alt-6 = 'workspace 6'
alt-7 = 'workspace 7'

# See: https://nikitabobko.github.io/AeroSpace/commands#move-node-to-workspace
alt-shift-1 = 'move-node-to-workspace 1'
alt-shift-2 = 'move-node-to-workspace 2'
alt-shift-3 = 'move-node-to-workspace 3'
alt-shift-4 = 'move-node-to-workspace 4'
alt-shift-5 = 'move-node-to-workspace 5'
alt-shift-6 = 'move-node-to-workspace 6'
alt-shift-7 = 'move-node-to-workspace 7'
```
I usually assign 5 workspaces to my main screen and 2 to a secondary:
```
[workspace-to-monitor-force-assignment]
1 = 'main'
2 = 'main'
3 = 'main'
4 = 'main'
5 = 'main'
6 = 'secondary'
7 = 'secondary'
```
Now I can either go to a given workspace using `alt` + the workspace number, or use the app shortcut I set up initially if I want to jump to a specific app, so I don't even have to remember the workspace. If I want to move an app to another screen, it’s as simple as moving it to a workspace assigned to that screen. For example, to move an app from the main screen to the secondary screen: 1. Focus the app using its shortcut — for instance, `alt+t` for the terminal. 2. Move it to workspace 6 or 7 (which are mapped to the secondary screen) using `alt+shift+6`. Now you might be wondering: _“But Mehdi, how do you keep track of which apps are running in which workspace — especially if you’re constantly moving them around? Do you just memorize everything?”_ That’s exactly where a custom status bar comes in. > If you want a deeper dive into Aerospace, check out my walkthrough video ## Level 5: Custom Status Bar [Sketchybar](https://github.com/FelixKratz/SketchyBar) replaces the default macOS menu bar with a customizable one — and frees up space. Be honest: how often do you use the default menu bar? Once a week? That’s wasted space. First, we'll autohide the macOS menu bar in `System Settings > Control Center`: [![](https://substackcdn.com/image/fetch/$s_!s50f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73f3b11-6b50-4d73-aef5-318a91956293_930x86.png)](https://substackcdn.com/image/fetch/$s%5F!s50f!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe73f3b11-6b50-4d73-aef5-318a91956293%5F930x86.png) System Settings for autohide menu bar That way, the default menu bar is still available if needed. We'll use SketchyBar to display: * Workspaces and applications running * Focused windows * System info (battery, Wi-Fi, sound) Install SketchyBar:
```
brew tap FelixKratz/formulae
brew install sketchybar
```
Start with someone else’s config — [there are plenty of examples](https://github.com/FelixKratz/SketchyBar/discussions/47?sort=top). My config is also available in my [dotfiles](https://github.com/mehd-io/dotfiles). The setup consists of: * `sketchybarrc` (main config file) * Plugins (Bash/Lua scripts) Thanks to SketchyBar, I now know which workspace each of my apps lives in, plus which app and workspace are currently active. ## Level S: Special God Mode That covers the full window management workflow: app shortcuts, workspace assignments, tiling layout, and a smart status bar. One last tool I rely on a lot is [Raycast](https://www.raycast.com/).
Think of it as a command palette on steroids. With Raycast, I: * Search files * Insert emojis * Launch specific git projects * Pick colors from my palette * And much more Honestly, Raycast deserves its own blog post. If you’re a productivity nerd, let me know — I’d love to dive deeper. In the meantime, take care of yourself — and your keyboard. 🧢 Subscribe for free to receive new posts and support my work. --- ## Local LLMs, 0 cloud cost : is WebGPU key for next-gen browser AI app? URL: https://mehdio.com/blog/local-llms-0-cloud-cost-is-webgpu Date: 2025-04-14T11:29:16.576 [![](https://substackcdn.com/image/fetch/$s_!ivzf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64f89b67-c03b-47ca-b371-cdbd019695ce_1024x1024.png)](https://substackcdn.com/image/fetch/$s%5F!ivzf!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64f89b67-c03b-47ca-b371-cdbd019695ce%5F1024x1024.png) For April Fools’ Day, I built an AI app [QuackToSQL](https://motherduck.com/quacktosql) — just quack into your mic, and it instantly transcribes the "quack" and generates SQL. Who needs to type prompts anymore, right? The beauty of this app? It starts by downloading the model directly into your browser, and after that, _everything_ happens locally. Real-time speech-to-text powered by your browser, leveraging your local GPU. No server-side processing needed. This black magic is possible thanks to WebGPU. WebGPU also enables impressive graphical demos like this [Ocean simulation](https://webgpu-ocean.netlify.app/) to run entirely in your browser: [![](https://substackcdn.com/image/fetch/$s_!agPb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea85a24-acff-4a0f-813d-663044953b94_600x375.gif)](https://substackcdn.com/image/fetch/$s%5F!agPb!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feea85a24-acff-4a0f-813d-663044953b94%5F600x375.gif) In this blog post, we'll explore where WebGPU came from, dive into the technical aspects of the 'QuackToSQL' project (source code available [here](https://github.com/motherduckdb/quacktosql)), and revisit what WebGPU is and the significant opportunities it presents, especially for LLMs and leveraging local compute power. By the end of this blog, you’ll get a feel for the power of WebGPU—and how libraries like [transformers.js](https://github.com/huggingface/transformers.js) let you run powerful AI models efficiently, right in your users’ browsers. Yes, you might save some money along the way, with less cloud computing and more local muscle. ## Bringing graphics to the web To understand the story behind WebGPU, we need to go down memory lane to the world of game development in the 90s. The gaming industry was gaining momentum, and developers wanted to tap into the full potential of graphics hardware. Before widespread GPU acceleration APIs, graphics capabilities were limited, primarily because developers had to write very specific code for different graphics cards, and CPU processing was often a major bottleneck for complex scenes. Two major developments occurred: * In **1992**, Silicon Graphics introduced **OpenGL (Open Graphics Library)**, providing a standardized way to harness the power of GPUs across different hardware.
* **Microsoft** followed suit with **[DirectX](https://en.wikipedia.org/wiki/DirectX)** for Windows, and the two became dominant graphics APIs, primarily for PC and console game development. As the web grew, the need for a web-based graphics solution emerged. In **2011**, the Khronos Group—the consortium then responsible for OpenGL—released **WebGL (Web Graphics Library)**. WebGL allowed JavaScript to communicate _with_ the computer's GPU directly from the browser, enabling 3D graphics on web pages without plugins. However, WebGL's architecture, based on the older OpenGL ES 2.0 standard, faced **limitations when trying to fully leverage modern graphics hardware**. This was primarily because its design wasn't optimized for modern multi-core CPUs and advanced GPU features like parallel command submission. ## WebGPU : graphics + compute WebGL also lacked dedicated support for general-purpose computations on the GPU, which limited its use primarily to graphics rendering. So yes, we could get amazing games and visualizations running directly in our browsers by leveraging our GPUs—but that was about it. WebGPU is much more than that. It supports both graphics rendering _and_ **general-purpose compute workloads (GPGPU)**. And with the current AI boom, I probably don't need to emphasize the importance of GPUs for general-purpose computation. WebGPU officially reached a stable release point around April 2023 after collaboration between major players like Google, Mozilla, Apple, Intel, and Microsoft. WebGPU aims to bring modern, low-level GPU access to web applications. It's designed based on the concepts of newer native APIs like Vulkan, Metal, and DirectX 12. But why does this matter in the end in the context of LLMs? Anyone can run a Python script locally or use tools like [Ollama](https://ollama.com/) to run AI models leveraging their GPU, right? Well, not quite so easily across the board. ## Making GPU computing accessible to everyone GPUs can be tricky beasts. Native GPU programming often involves dealing with platform-specific APIs and drivers. If you look at NVIDIA's ecosystem, many powerful tools and libraries rely specifically on their hardware architecture and **CUDA (Compute Unified Device Architecture)**. For example, code written using NVIDIA's CUDA framework will only run efficiently (or at all) on NVIDIA GPUs. If you want your application to support AMD, Intel, or Apple Silicon GPUs, you often need to write separate code paths using different technologies like ROCm (AMD), oneAPI/OpenCL (Intel), or Metal (Apple). This adds significant development complexity and maintenance overhead. [![](https://substackcdn.com/image/fetch/$s_!V5G7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6fae5f3-e06e-442a-bc41-b467fbdeac77_3437x2334.png)](https://substackcdn.com/image/fetch/$s%5F!V5G7!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6fae5f3-e06e-442a-bc41-b467fbdeac77%5F3437x2334.png) The challenge when building native GPU app So while you can develop native AI applications targeting specific operating systems, **you often end up tied to particular GPU vendor integrations** or managing multiple complex build targets. On macOS, Apple Silicon provides a somewhat unified target, but across the wider PC ecosystem (Windows, Linux), the hardware landscape is diverse. 
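To make that fragmentation concrete: even in plain Python, native GPU code usually has to branch on whichever backend happens to be available at runtime. Here's a minimal sketch using PyTorch (purely illustrative; this is not code from the QuackToSQL project):

```
import torch

# Pick a backend depending on the hardware the user happens to have.
if torch.cuda.is_available():             # NVIDIA GPUs (CUDA)
    device = torch.device("cuda")
elif torch.backends.mps.is_available():   # Apple Silicon (Metal)
    device = torch.device("mps")
else:                                      # CPU fallback
    device = torch.device("cpu")

# The same tiny model runs on whichever device was selected.
model = torch.nn.Linear(8, 2).to(device)
x = torch.randn(4, 8, device=device)
print(device, model(x).shape)
```

And that's only the dispatch logic; packaging the right native dependencies for each platform is where most of the maintenance cost hides.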
Web browser applications, using JavaScript and now WebGPU, offer a unique abstraction layer. They make these GPU-accelerated applications potentially runnable on _any_ device with a compatible browser, accessible via a simple URL. The browser, through its WebGPU implementation, handles the communication with the underlying native graphics drivers (Vulkan, Metal, DirectX 12).

[![](https://substackcdn.com/image/fetch/$s_!3gCu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e7aa7d-1392-46f5-8d4b-9d6e9f442417_3437x2334.png)](https://substackcdn.com/image/fetch/$s%5F!3gCu!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90e7aa7d-1392-46f5-8d4b-9d6e9f442417%5F3437x2334.png) How webdev and WebGPU simplify development

But how well is WebGPU supported? As of early 2025, WebGPU enjoys solid support in the latest versions of Chromium-based browsers like Google Chrome and Microsoft Edge on desktop platforms (Windows, macOS, ChromeOS, Linux). Firefox support is progressing well and already available, while Safari has support in Technology Previews and is expected in stable releases soon. Since many browsers build upon the Chromium project, adoption tends to propagate relatively quickly once features land there. In short, if you're using an up-to-date version of Chrome, you're likely good to go for exploring WebGPU features. Adding to this, we now have AI frameworks specifically designed for the web that integrate WebGPU support, all within the JavaScript ecosystem (who said Python was our only savior?!). A prime example is `transformers.js`. The [core transformers library](https://github.com/huggingface/transformers) is written in Python and powers most model training and inference workflows. transformers.js builds on top of it by enabling selected models to run in JavaScript—often after converting the original Python models to formats like ONNX.

## Building your first AI app using WebGPU with transformers.js

For the 'QuackToSQL' April Fools' project, I needed a speech-to-text system. The goal was simple: recognize the word 'quack' in real-time and generate a random SQL query in response. While cloud-based streaming services exist from providers like [OpenAI](https://platform.openai.com/docs/guides/realtime) and [Google](https://cloud.google.com/speech-to-text/docs/transcribe-streaming-audio), I was skeptical about the network latency for the feedback loop I wanted. Even a one-second delay would feel sluggish for the demo, especially since I wanted a visual gauge reacting instantly to the 'quack'. I tried a couple briefly, but the latency wasn't ideal. These cloud services are often optimized for conversational turn-taking, where studies suggest humans tolerate latencies up to a few hundred milliseconds or even close to a second before the interaction feels unnatural. However, for my use case requiring immediate visual feedback, even sub-second delays felt too long. I needed something faster, something local. The solution? Cut out the network round-trip entirely and process everything directly in the browser! Enter `transformers.js`. As per their definition:

> _transformers.js is a JavaScript library for running 🤗 Transformers directly in your browser, with no need for a server!
It is designed to be functionally equivalent to the original Python library, meaning you can run the same pretrained models using a very similar API_

`transformers.js` has been around for a while. Originally developed by [Joshua Lochner](https://www.linkedin.com/in/xenova/) ([Xenova](https://x.com/xenovacom)), it's now officially maintained under the Hugging Face umbrella. While the library existed before, robust **WebGPU support was added in 3.x in late 2024**, enabling significant performance improvements for model inference compared to the previous CPU/WASM backend. Of course, performance depends heavily on your model and device, but on capable GPUs, certain models and tasks have shown significant speedups—often exceeding 10×—compared to earlier WASM-based execution! You can run the benchmark directly yourself with the Hugging Face space [here](https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark).

[![](https://substackcdn.com/image/fetch/$s_!HT76!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264065f9-27e5-48b5-bec0-e94f1d67c8b4_1706x2118.png)](https://substackcdn.com/image/fetch/$s%5F!HT76!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F264065f9-27e5-48b5-bec0-e94f1d67c8b4%5F1706x2118.png)

[Subscribe now](https://blog.mehdio.com/subscribe?)

### Using models in the browser

So, how do you get an AI model into the browser using this library? Let's step back quickly and look at the process. When a new model architecture is released, for it to be usable in `transformers.js`, it typically needs to be converted and potentially added to the library's supported model list. At a high level, this often involves **converting the model weights to the [ONNX](https://onnx.ai/)** (Open Neural Network Exchange) format, an open standard for representing machine learning models. Depending on the complexity of the model's architecture, this conversion process might require contributions to the underlying Python Transformers library first before it can be easily exported to ONNX. Luckily, many popular models are already converted and available directly through the Hugging Face Hub integration in `transformers.js`. For my project, the [openai/whisper-base](https://github.com/openai/whisper) model was readily available and perfectly suited my need for real-time voice transcription.

### A bit of JavaScript on how it works

The core logic for the speech recognition in QuackToSQL resides in `app/worker.ts` (running in a Web Worker to avoid blocking the main UI thread). It uses the `@huggingface/transformers` package. Here's a simplified overview of the pipeline initialization:

```
import {
  AutoTokenizer,
  AutoProcessor,
  WhisperForConditionalGeneration,
} from "@huggingface/transformers";

class AutomaticSpeechRecognitionPipeline {
  static model_id = "onnx-community/whisper-base";
  static tokenizer: any = null;
  static processor: any = null;
  static model: any = null;

  static async getInstance(progress_callback?: (progress: any) => void) {
    // Lazily load each component once and cache it on the class
    this.tokenizer ??= AutoTokenizer.from_pretrained(this.model_id, {
      progress_callback,
    });
    this.processor ??= AutoProcessor.from_pretrained(this.model_id, {
      progress_callback,
    });
    this.model ??= WhisperForConditionalGeneration.from_pretrained(
      this.model_id,
      {
        dtype: {
          encoder_model: "fp32",
          decoder_model_merged: "q4", // 4-bit quantization for the decoder
        },
        device: "webgpu",
        progress_callback,
      },
    );
    return Promise.all([this.tokenizer, this.processor, this.model]);
  }
}
```

This code sets up the pipeline using the `onnx-community/whisper-base` model (the ONNX conversion of OpenAI's `whisper-base`).
Key points:

* `AutoTokenizer`**,** `AutoProcessor`**,** `WhisperForConditionalGeneration`**:** These classes handle loading the necessary components (text tokenizer, audio preprocessor, and the actual Whisper model).
* `from_pretrained(model_id, ...)`**:** This is the core function that downloads (if needed) and loads the specified model components.
* `dtype: { ... "q4" }`**:** We specify 4-bit quantization (`q4`) for the decoder part of the model. This significantly reduces the model size and memory usage, often with minimal impact on accuracy for this task, making it more suitable for browser environments.
* `device: "webgpu"`**:** This crucial line tells `transformers.js` to attempt running the model inference using the WebGPU backend.
* `progress_callback`**:** Allows updating the UI during the potentially long model download/initialization phase.
* `await this.model.forward(this.model.dummy_inputs);`: This warm-up step (done in the worker, not shown in the simplified snippet above) uses dummy data to proactively compile WebGPU shaders and allocate resources; while not strictly necessary, it confirms the model setup and minimizes latency during the user's first interaction.

The actual transcription generation looks something like this:

```
async function generate({ audio, language }: GenerateParams) {
  // ... (callback_function and token_callback_function used by the streamer
  // are defined elsewhere in the worker)
  const [tokenizer, processor, model] =
    await AutomaticSpeechRecognitionPipeline.getInstance();

  const streamer = new TextStreamer(tokenizer, {
    skip_prompt: true,
    skip_special_tokens: true,
    callback_function,
    token_callback_function,
  });

  const inputs = await processor(audio);

  const outputs = await model.generate({
    ...inputs,
    max_new_tokens: MAX_NEW_TOKENS,
    language,
    streamer,
  });
  // ...
}
```

* The audio data (likely a Float32Array) is processed.
* `model.generate()` performs the inference using WebGPU.
* The `TextStreamer` allows receiving transcribed text incrementally, enabling the real-time effect.

The rest of the frontend code handles audio capture from the microphone using the Web Audio API and updates the UI based on the streamer callbacks. You can explore the complete implementation in the [project repository](https://github.com/motherduckdb/quacktosql).

## WebGPU: a WIP standard

While building this, I encountered two main practical challenges with this setup:

1. **Model download size:** The `whisper-base` model is roughly 200MB (even more before quantization). While cached by the browser after the first download, this initial load can be significant, especially on slower or mobile connections. Quantization helps, but larger models remain a challenge for web delivery.
2. **Browser configuration & compatibility:** While WebGPU adoption is growing rapidly, it's still relatively new tech. This means some users might encounter issues or need to enable specific browser flags, especially on older browser versions or less common operating systems/driver combinations. This compatibility aspect was the most common friction point users hit when trying the app. I quickly added instructions to help them:

```
This model requires WebGPU support. For the best experience:
- Use **Chrome** browser (recommended)
- Enable **"WebGPU"** flag in `chrome://flags`
- For **Linux** users: Ensure **Vulkan support** is installed
```

## The Promise of local processing with WebGPU

I've been following WebGPU for a while, but this hands-on project clearly demonstrates that this exciting architecture is getting closer to reality.
We can leverage the hardware already present in our laptops and desktops – including those expensive MacBooks and gaming PCs with capable GPUs – directly from the web. The key benefits are clear:

* **Easy distribution:** accessible via a URL, no complex installation.
* **Zero installation:** runs directly in the user's browser.
* **Enhanced privacy/security:** sensitive data (like raw audio in this case) can be processed locally without ever leaving the user's machine.
* **Reduced server costs:** offloads computation to the client side.
* **Potential for offline functionality:** once models are cached, apps can work without constant connectivity.

Regardless of the evolution of specific libraries like `transformers.js`, WebGPU itself is a foundational technology enabling this shift towards more powerful local computation within the browser. Imagine a future where **commonly used AI models might even be bundled with browsers** or efficiently cached across websites, eliminating download times entirely for many applications. This could revolutionize what's possible on the web with AI. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

---

### 📓Resources

* WebGPU Specification:
* Transformers.js Documentation:
* Transformers.js Repository: [https://github.com/huggingface/transformers.js](https://github.com/huggingface/transformers.js)
* Xenova (Joshua Lochner) talk about transformers.js
* WebGPU Samples:
* Can I use WebGPU?:
* QuackToSQL Project Repository: [https://github.com/motherduckdb/quacktosql](https://github.com/motherduckdb/quacktosql)
* Core transformers library: [https://github.com/huggingface/transformers](https://github.com/huggingface/transformers)

---

## How to use AI to create better technical diagrams

URL: https://mehdio.com/blog/how-to-use-ai-to-create-better-technical

Date: 2025-03-29T09:55:04.029

[![](https://substackcdn.com/image/fetch/$s_!oAog!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c4a70a-4b9a-4a9c-928f-ab84436a9f3a_1920x1080.png)](https://substackcdn.com/image/fetch/$s%5F!oAog!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59c4a70a-4b9a-4a9c-928f-ab84436a9f3a%5F1920x1080.png)

As AI tools continue to grow around software engineering—whether it’s for debugging or writing code—one area that still feels a bit overlooked is **creating technical diagrams**. “A picture is worth a thousand words,” and in engineering, being able to communicate clearly—whether for future projects (RFPs) or existing architecture—is key. The challenge is that it’s hard to create technical diagrams that are both clear and understated, without being overly flashy or confusing. Fortunately, AI tools can help you make great diagrams, if you know how.
## AI is bad at generating image diagrams (on its own) The common trap when creating technical diagram is relying on an LLM with a vague prompt like this: ``` design a technical diagram to helps visualize how microservices or components communicate through events and message queues ``` If we run this with ChatGPT from OpenAI : [![](https://substackcdn.com/image/fetch/$s_!i37O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab72549-2b40-44e0-8953-aaeac6814cf0_1024x1024.png)](https://substackcdn.com/image/fetch/$s%5F!i37O!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ab72549-2b40-44e0-8953-aaeac6814cf0%5F1024x1024.png) What you’ll get is usually a _generated image_—and often, it’s… well, **ugly**. To be fair, the image above was generated **before March 25, 2025**, which was prior to [OpenAI’s 4o image generation release](https://openai.com/index/introducing-4o-image-generation/). Here’s the _same prompt_ using the updated model: [![](https://substackcdn.com/image/fetch/$s_!tLM3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea6f83e8-8934-49ca-97d7-ba76832e1733_1024x1024.png)](https://substackcdn.com/image/fetch/$s%5F!tLM3!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea6f83e8-8934-49ca-97d7-ba76832e1733%5F1024x1024.png) Already a big improvement—but honestly, we can still do **much** better! These AI models aren’t great at design. But what they _are_ great at is generating **code** and understanding the relationships between components. That’s what makes them useful for generating diagrams. We need an **intermediate layer** between our prompt and the final image: we need **code that generates diagrams**. ## Diagrams as code There are several frameworks that let you generate diagrams using code. One popular open-source project is [Diagrams](http://diagrams.mingrammer.com/), a Python framework that lets you create beautiful diagrams using official cloud icons from AWS, Azure, GCP, and more. [![](https://substackcdn.com/image/fetch/$s_!0oVv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553ec0e2-387c-49df-aed1-77b8543692d3_2172x1312.png)](https://substackcdn.com/image/fetch/$s%5F!0oVv!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553ec0e2-387c-49df-aed1-77b8543692d3%5F2172x1312.png) The downside is that it’s **Python-based**, which can feel a bit heavy depending on your workflow. In this case, you could prompt an LLM to generate Diagrams code for your desired architecture, then run the Python script to export it as a .png. 
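To make this workflow concrete, here's a minimal sketch of what such generated Diagrams code could look like for the event-driven example above (the service names are placeholders I made up; you need `pip install diagrams` plus Graphviz installed for the export to work):

```
from diagrams import Diagram
from diagrams.aws.compute import Lambda
from diagrams.aws.database import Dynamodb
from diagrams.aws.integration import SQS

# Renders architecture.png with official AWS icons
with Diagram("Event-driven microservices", filename="architecture", show=False):
    producer = Lambda("order service")
    queue = SQS("order events")
    consumer = Lambda("billing service")
    store = Dynamodb("billing table")

    producer >> queue >> consumer >> store
```

Running the script exports the diagram as a .png you can drop straight into your docs or RFP.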
While it’s possible to streamline the process with CI/CD, there are simpler workflows that offer a **better developer feedback loop.**

> 💡 **You don’t always need AI to generate diagrams.**
> If you’re already using Terraform, aside from the built-in [terraform graph](https://developer.hashicorp.com/terraform/cli/commands/graph), check out these tools—they generate diagrams based on your Terraform graph:
> • [Terravision](https://github.com/patrickchugh/terravision) – uses the Diagrams framework
> • [Terramaid](https://github.com/RoseSecurity/Terramaid) – uses Mermaid for lighter diagrams

## Faster iteration with Mermaid

If you’re looking for something more lightweight, [Mermaid](https://mermaid.js.org/) is a great alternative. It’s another popular open-source framework that lets you write diagrams in a simple syntax. Because it’s **JavaScript-based**, it’s easier to integrate into markdown renderers and live editors. There are extensions for [Obsidian](https://obsidian.md/), [VS Code/Cursor](https://marketplace.visualstudio.com/items?itemName=MermaidChart.vscode-mermaid-chart), and more. Let’s try a prompt:

```
Create a Mermaid schema that explains the high-level architecture and design of Apache Spark. Include the key components such as the driver, executors, cluster manager, and DAG scheduler. Describe how Spark handles job execution, fault tolerance, and memory management. Provide a conceptual overview suitable for a software engineer familiar with distributed systems but new to Spark.
```

You can inspect the result using any Mermaid previewer (here in Cursor):

[![](https://substackcdn.com/image/fetch/$s_!wly8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc55ffef-3b85-4917-88fa-92f678b40692_3010x1892.png)](https://substackcdn.com/image/fetch/$s%5F!wly8!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc55ffef-3b85-4917-88fa-92f678b40692%5F3010x1892.png)

Any additional edits still have to be made through code. If you want a live editor on the schema itself, you can use [mermaid.live](https://mermaid.live/) or a [Cursor/VSCode extension](https://marketplace.visualstudio.com/items?itemName=corschenzi.mermaid-graphical-editor). But let’s be honest—the editing experience isn’t amazing when it comes to fine-tuning diagrams manually. You’re still dealing mostly with code.

## Best human feedback loop with Excalidraw & Cursor

Mermaid is great for simple, lightweight diagrams—but the **editor experience** is limited because it was built for code → diagram. On the other side, there's [Excalidraw](https://excalidraw.com/): an open-source tool built the other way around—**from sketching → code**. It’s great for quick, hand-drawn-style technical diagrams with just enough structure. I personally use the [Obsidian plugin](https://github.com/zsviczian/obsidian-excalidraw-plugin), and there’s also a great [VS Code/Cursor extension](https://marketplace.visualstudio.com/items?itemName=pomdtr.excalidraw-editor). The cool part? Excalidraw diagrams are just **JSON**:

```
{
  "id": "VNytcZzbA1rTn2Q5GfXmM",
  "type": "text",
  "x": 622,
  "y": 476,
  "text": "Task dispatch",
  "fontSize": 20,
  ...
}
```

This structure makes it **very AI-friendly**. Here’s the same prompt as above that you can try:

```
Create an Excalidraw schema that explains the high-level architecture and design of Apache Spark.
Include the key components such as the driver, executors, cluster manager, and DAG scheduler. Describe how Spark handles job execution, fault tolerance, and memory management. Provide a conceptual overview suitable for a software engineer familiar with distributed systems but new to Spark.
```

Once the .excalidraw JSON is generated, you can open it in the editor and **interact with it manually**. This is where AI really shines: it gives you a **solid first draft**, and you provide the **human feedback loop** to adjust layout, naming, and connections. I think this is exactly where creating technical diagrams with AI shines: a simple, clear human feedback loop where you can quickly make manual adjustments. Sometimes, drawing is faster than getting the perfect prompt. Until the next experiment 🫡 Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

---

## DuckDB goes distributed? DeepSeek’s smallpond takes on Big Data

URL: https://mehdio.com/blog/duckdb-goes-distributed-deepseeks

Date: 2025-02-28T15:01:27.125

[![](https://substackcdn.com/image/fetch/$s_!89CK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09a8f44b-d040-454e-8c01-999fe6121507_1600x900.png)](https://substackcdn.com/image/fetch/$s%5F!89CK!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09a8f44b-d040-454e-8c01-999fe6121507%5F1600x900.png)

DeepSeek has made a lot of noise lately. Their R1 model, released in January 2025, outperformed competitors like OpenAI’s O1 at launch. But what truly set it apart was its highly efficient infrastructure—dramatically reducing costs while maintaining top-tier performance. Now, they're coming for data engineers. [DeepSeek](https://github.com/deepseek-ai) released a bunch of small repositories as independent code modules. [Thomas Wolf](https://www.linkedin.com/in/thom-wolf/), Co-founder and Chief of Product at HuggingFace, [shared some of his highlights](https://www.linkedin.com/posts/thom-wolf%5Fi-want-to-share-bit-of-context-on-todays-activity-7300872211794440192-QJIb?utm%5Fsource=share&utm%5Fmedium=member%5Fdesktop&rcm=ACoAAA0tl2QBJUocRMpCGqvWI8N%5FYbcsbmkLctY), but we're going to focus on one particularly important project that went unmentioned—**[smallpond](https://github.com/deepseek-ai/smallpond)**, a distributed compute framework built on **[DuckDB](https://duckdb.org/).** DeepSeek is pushing DuckDB beyond its single-node roots with smallpond, a new, simple approach to distributed computing. First, the fact that DeepSeek, a hot AI company, is using DuckDB is a significant statement, and we'll understand why. Second, we'll dive into the repository itself, exploring their smart approach to enabling DuckDB as a distributed system, along with its limitations and open questions. I assume you're familiar with DuckDB. I've created [tons of content around it](https://www.youtube.com/playlist?list=PLIYcNkSjh-0wlrFUE2VvQilLU2aBPns0K). But just in case, here's a high-level recap.

> For transparency, at the time of writing this blog, I’m a data engineer and DevRel at [MotherDuck](https://motherduck.com/). MotherDuck provides a cloud-based version of DuckDB with enhanced features. Its approach differs from what we’ll discuss here, and while I’ll do my best to remain objective, just a heads-up! 🙂
## DuckDB Reminder

DuckDB is an in-process analytical database, meaning it runs within your application without requiring a separate server. You can install it easily in multiple programming languages by adding a library—think of it as the SQLite of analytics, but built for high-performance querying on large datasets. It's built in C++ and contains all the integrations you might need for your data pipelines (AWS S3/Google Cloud Storage, Parquet, Iceberg, spatial data, etc.), and it's damn fast. Besides working with common file formats, it has its own efficient storage format—a single ACID-compliant file containing all tables and metadata, with strong compression. In Python, getting started is as simple as:

```
pip install duckdb
```

Then, load and query a Parquet file in just a few lines:

```
import duckdb

conn = duckdb.connect()
conn.sql("SELECT * FROM '/path/to/file.parquet'")
```

It also supports reading and writing to [Pandas](https://pandas.pydata.org/docs/index.html) and [Polars](https://pola.rs/) DataFrames with zero copy, thanks to Arrow.

```
import duckdb
import pandas

# Create a Pandas dataframe
my_df = pandas.DataFrame.from_dict({'a': [42]})

# Query the Pandas DataFrame "my_df"
# Note: duckdb.sql connects to the default in-memory database connection
results = duckdb.sql("SELECT * FROM my_df").df()
```

## DuckDB is coming into AI companies?

We talk a lot about LLM frameworks, models, and agents, but we often forget that the first step in ANY AI project comes down to data. Whether it's for training, RAG, or other applications, it all comes down to feeding systems with good, clean data. But how do we even accomplish that step? Through data engineering. Data engineering is a crucial step in AI workflows but is less discussed because it's less "sexy" and less "new." Regarding DuckDB, we've already seen other AI companies like [HuggingFace](https://huggingface.co/) using it behind the scenes to quickly serve and explore their datasets library through their [dataset viewer](https://huggingface.co/docs/hub/en/datasets-viewer). Now, DeepSeek is introducing _smallpond_, a lightweight open-source framework leveraging DuckDB to process terabyte-scale datasets in a distributed manner. Their benchmark states: _“Sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.”_

[![](https://substackcdn.com/image/fetch/$s_!rz7r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b3f4a5-7e53-40da-9044-86c44af20dcd_2628x546.png)](https://substackcdn.com/image/fetch/$s%5F!rz7r!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75b3f4a5-7e53-40da-9044-86c44af20dcd%5F2628x546.png) Source: https://github.com/deepseek-ai/smallpond

While we've seen DuckDB crushing 500GB on a single node easily ([clickbench](https://benchmark.clickhouse.com/)), this enters another realm of data size.

[![](https://substackcdn.com/image/fetch/$s_!KfV4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823ded79-b00e-46ee-a9d0-6739ae6beb56_4092x724.png)](https://substackcdn.com/image/fetch/$s%5F!KfV4!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F823ded79-b00e-46ee-a9d0-6739ae6beb56%5F4092x724.png) Clickbench benchmark

But wait, isn't DuckDB single-node focused? What's the catch here?
Let's dive in.

## smallpond's internals

### DAG-Based Execution Model

smallpond follows a **lazy evaluation** approach when performing operations on DataFrames (`map()`, `filter()`, `partial_sql()`, etc.), meaning it doesn’t execute them immediately. Instead, it constructs a **logical plan** represented as a **directed acyclic graph (DAG)**, where each operation corresponds to a node in the graph (e.g., `SqlEngineNode`, `HashPartitionNode`, `DataSourceNode`). Execution is only triggered when an action is called, such as:

* `write_parquet()` – Write data to disk
* `to_pandas()` – Convert to a pandas DataFrame
* `compute()` – Explicitly request computation
* `count()` – Count rows
* `take()` – Retrieve rows

This approach optimizes performance by deferring computation until necessary, reducing redundant operations and improving efficiency. When execution is triggered, the logical plan is converted into an execution plan. The execution plan consists of tasks (e.g., `SqlEngineTask`, `HashPartitionTask`) that correspond to the nodes in the logical plan. These tasks are the actual units of work that will be distributed and executed through [Ray](https://ray.io/).

### Ray Core and Distribution Mechanism

The important thing to understand is that the distribution mechanism in smallpond operates at the Python level with help from [Ray](https://www.ray.io/), specifically [Ray Core](https://docs.ray.io/en/latest/ray-core/walkthrough.html), through partitions. A given operation is distributed based on manual partitioning provided by the user. smallpond supports multiple partitioning strategies:

* Hash partitioning (by column values)
* Even partitioning (by files or rows)
* Random shuffle partitioning

For each partition, a separate DuckDB instance is created within a Ray [task](https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d006a80861d1305a1c22cabdd359/smallpond/execution/task.py#L4). Each task processes its assigned partition independently using SQL queries through DuckDB. Given this architecture, you might notice that the framework is tightly integrated with Ray, which comes with a trade-off: it prioritizes **scaling out** (adding more nodes with standard hardware) over **scaling up** (improving the performance of a single node). Therefore, you need a Ray cluster. Multiple options exist, but today most of them mean managing your own cluster on AWS/GCP compute or Kubernetes. Only [Anyscale](https://www.anyscale.com/), the company founded and led by the creators of Ray, offers a fully managed Ray service. Even then, you have the overhead of monitoring a cluster. The developer experience is nice, though: you work on a single local node and only scale out when you need to. But the question is: do you actually need to scale out and take on the cluster overhead, given that the largest machine on [AWS today provides 24TB of memory](https://aws.amazon.com/ec2/instance-types/high-memory/)?

### Storage Options

Ray Core is just for compute - where does the storage live? While smallpond supports local filesystems for development and smaller workloads, the ~110TiB benchmark mentioned above actually uses the custom DeepSeek [3FS framework](https://github.com/deepseek-ai/3FS): Fire-Flyer File System, a high-performance distributed file system designed to address the challenges of AI training and inference workloads. To put it simply, compared to AWS S3, **3FS is built for speed, not just storage**.
While S3 is a reliable and scalable object store, it comes with higher latency, making it less ideal for AI training workloads that require fast, real-time data access. 3FS, on the other hand, is a high-performance distributed file system that leverages **SSDs and RDMA** networks to deliver low-latency, high-throughput storage. It supports **random access to training data**, **efficient checkpointing**, and **strong consistency**, eliminating the need for extra caching layers or workarounds. For AI-heavy workloads that demand rapid iteration and distributed compute, 3FS offers a more optimized, AI-native storage layer—**trading off some cost and operational complexity for raw speed and performance**. Because this is a specific framework from DeepSeek, you would have [to deploy your own 3FS cluster](https://github.com/deepseek-ai/3FS/blob/main/deploy/README.md) if you want to reach the same performance. There's no fully managed option there... or maybe this is an idea for a spinoff startup from DeepSeek? 😉 One interesting experiment would be to test performance at the same scale using AWS S3. However, this implementation is currently missing in smallpond. Such an approach would be much more practical for an average company needing 100TB of processing capability. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

### Key differences from other frameworks like Spark/Daft

Unlike systems like [Spark](https://spark.apache.org/) or [Daft](https://www.getdaft.io/) that can distribute work at the query execution level (breaking down individual operations like joins or aggregations), smallpond operates at a higher level. It distributes entire partitions to workers, and each worker processes its entire partition using DuckDB. This makes the architecture simpler but potentially less optimized for complex queries that would benefit from operation-level distribution.

[![](https://substackcdn.com/image/fetch/$s_!49iq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ca0e10-89c4-40a1-81e9-f1bd84cda1be_1092x808.png)](https://substackcdn.com/image/fetch/$s%5F!49iq!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1ca0e10-89c4-40a1-81e9-f1bd84cda1be%5F1092x808.png) Distributed compute levels - Image by the Author

## Summary of smallpond's Architecture

Let's recap the features of smallpond (a minimal usage sketch follows right after):

* **Lazy evaluation with DAG-based execution** – Operations are deferred until explicitly triggered.
* **Flexible partitioning strategies** – Supports hash, column-based, and row-based partitioning.
* **Ray-powered distribution** – Each task runs in its own DuckDB instance for parallel execution.
* **Multiple storage layer options** – Benchmarks have primarily been conducted using 3FS.
* **Cluster management trade-off** – Requires maintaining a compute cluster, though fully managed services like Anyscale can mitigate this.
* **Potential 3FS overhead** – Self-managing a 3FS cluster introduces significant additional complexity.
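To make that recap concrete, here is a minimal sketch of what a smallpond job looks like, based on the example in the project's README (the file paths and column names are placeholders, and scaling this out assumes a Ray cluster plus a shared storage layer such as 3FS):

```
import smallpond

# Initializes smallpond (and a local Ray instance if no cluster address is given)
sp = smallpond.init()

# Lazy operations: these only build the logical plan
# (DataSourceNode, HashPartitionNode, SqlEngineNode, ...)
df = sp.read_parquet("prices.parquet")
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Actions trigger execution: each partition is processed by its own DuckDB instance inside a Ray task
df.write_parquet("output/")
print(df.to_pandas())
```

Nothing runs until `write_parquet()` or `to_pandas()` is called, which is exactly the lazy-evaluation behavior described above.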
[![](https://substackcdn.com/image/fetch/$s_!6QKk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69678362-0091-48ca-8308-a19dc40d3352_840x683.png)](https://substackcdn.com/image/fetch/$s%5F!6QKk!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69678362-0091-48ca-8308-a19dc40d3352%5F840x683.png) High-level design of smallpond - Image by the Author

## Other ways of distributed compute with DuckDB

Another approach to distributed computing with DuckDB is through serverless functions like AWS Lambda. Here, the logic is often even simpler than partitioning, typically processing file by file. You could also decide to process per partition with some wrapper, but you won't be able to go much further than file-by-file processing. Okta implemented this approach, and you can read more about it on Julien Hurault's blog: [Okta's Multi-Engine Data Stack](https://juhache.substack.com/p/oktas-multi-engine-data-stack)

[![](https://substackcdn.com/image/fetch/$s_!2Yy8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f51204-1040-41d0-b26a-8f5fbb2ac930_1622x618.jpeg)](https://substackcdn.com/image/fetch/$s%5F!2Yy8!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f51204-1040-41d0-b26a-8f5fbb2ac930%5F1622x618.jpeg) from Julien Hurault’s blog Okta’s Multi-Engine Data Stack

Finally, MotherDuck [is working on dual execution](https://motherduck.com/docs/concepts/architecture-and-capabilities/#dual-execution), balancing local and remote compute to optimize resource usage.

## Scaling DuckDB

All in all, it's exciting to see that DuckDB is being used in AI-heavy workloads and that people are getting creative about how to split the compute when needed. smallpond, while being restricted to a specific tech stack for distributing compute, aims to be simple, which aligns with the philosophy of DuckDB 👏 It's also a good reminder that there are multiple ways to scale DuckDB. Scaling up is always the simpler approach, but with smallpond and the other examples mentioned here, we have plenty of options. This makes more sense nowadays than relying on complex and heavy distributed frameworks by default, "just in case." Those not only hurt your cloud costs when you're starting with small/medium data but also put a tax on developer experience (still love you, Apache Spark ❤️). While we have powerful single-node solutions that would be enough for most use cases, [especially if you're in the 94% of use cases under 10TB according to Redshift](https://www.linkedin.com/posts/mehd-io%5Fdataengineering-activity-7298333190694293504-%5FB%5Ff?utm%5Fsource=share&utm%5Fmedium=member%5Fdesktop&rcm=ACoAAA0tl2QBJUocRMpCGqvWI8N%5FYbcsbmkLctY), we now have even more options to make the Duck fly. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.
---

## 15 Python Libraries Every Data Engineer Needs

URL: https://mehdio.com/blog/15-python-libraries-every-data-engineer

Date: 2024-09-25T11:01:56.339

[![](https://substackcdn.com/image/fetch/$s_!5UMZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce405d6c-27a5-4ee0-84e3-37771809a228_1600x907.png)](https://substackcdn.com/image/fetch/$s%5F!5UMZ!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce405d6c-27a5-4ee0-84e3-37771809a228%5F1600x907.png)

Python's ecosystem is still growing strong, and the explosion of libraries can make anyone getting into data engineering a bit scared. So I sat down and thought, "If I could keep only 15 Python libraries for most of my data engineering work, which ones would I choose?" To make this more digestible, I sorted these into four categories: data ingestion, data transformation, developer tools, and data validation. If you prefer watching over reading, I've got you covered.

### 🌊 Data ingestion

#### 1\. [Requests](https://requests.readthedocs.io/en/latest/)

This is the basic HTTP library in Python. It is essential for querying APIs and fetching data from the web, including web scraping tasks. Mastering more than just a GET request is essential. Understanding how to handle status codes and manage retries will help you build robust pipelines.

#### 2\. [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Used alongside Requests, BeautifulSoup is the Python standard for parsing HTML content. It's a must-have if you do web scraping. To illustrate how the above two libraries work together, here is a short snippet:

```
import requests
from bs4 import BeautifulSoup

# URL to fetch
url = 'https://example.com'

# Send a GET request
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all 'a' tags
    links = soup.find_all('a')

    # Print each link's URL and text
    for link in links:
        print(f"Text: {link.text.strip()}, URL: {link.get('href')}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
```

#### 3\. [Dlt](https://dlthub.com/)

Dlt from Dlthub is a bit more than just a library. It's a framework that follows best practices for creating data pipelines. It supports various sources and destinations, including REST APIs and databases, making it a versatile choice for data ingestion. Check Dlt's documentation around [their core concepts](https://dlthub.com/docs/reference/explainers/how-dlt-works) to understand how things work.

### 🛠️ Data Transformation

#### 4\. [DuckDB](https://duckdb.org/)

DuckDB is an in-process OLAP database written in C++ that acts like a Swiss army knife for data engineering. It supports various data formats like CSV, JSON, and Parquet, and table formats like Iceberg and Delta Lake. It works well with dataframe libraries like Polars and Pandas thanks to Arrow: you can query/process your Pandas/Polars dataframes directly using DuckDB. DuckDB focuses on SQL, offering [many functions to simplify data manipulation tasks](https://duckdb.org/docs/sql/dialect/friendly%5Fsql.html).

_Note: For full disclosure, at this point of the blog I'm working for MotherDuck (DuckDB in the Cloud) - so yes, I'm kind of biased here. If you want to learn more about DuckDB, you can check my work on the [MotherDuck YouTube channel](https://youtube.com/@motherduckdb)._
#### 5\. [Polars](https://docs.pola.rs/)

In contrast to DuckDB's friendly SQL, Polars takes a dataframe approach. Polars is a high-performance library written in Rust with Python bindings. Like DuckDB, it's especially good for single-node computing environments on local machines or in the cloud. It handles different data file types and transformations efficiently, making it ideal for fast data processing tasks.

#### 6\. [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)

PySpark (Apache Spark's Python API) has been the gold standard for the past decade for handling large datasets across distributed systems. Note that truly _large datasets_ are actually a tiny percentage of use cases, given how powerful a single-node machine can be today. Most of your use cases can be handled with other straightforward frameworks like DuckDB/Polars. PySpark can't simply run on any Python runtime (or it would be overkill as a standalone); you still need a Spark cluster, hence the complexity. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

### 🧰 Developer Tools

#### 7\. [Loguru](https://github.com/Delgan/loguru)

I never liked Python's default logging features, and this library provides a simpler approach. It's a one-line setup, and it's designed to replace traditional logging and print statements. Good logging means easier debugging. Easier debugging means more robust pipelines.

#### 8\. [Typer](https://typer.tiangolo.com/)

Typer is an intuitive tool for building command-line interfaces (CLIs). It is based on the principles of FastAPI (by the same author) and simplifies the creation of CLIs. CLIs for data pipelines are also crucial. Running a pipeline with specific parameters (e.g., specific dates) or backfilling are all enabled by a powerful CLI. You don't want to have to modify the code for any custom run; use your CLI parameters!

#### 9\. [Fire](https://github.com/google/python-fire)

For simpler projects that still require a CLI, Fire offers a less powerful but, in my opinion, easier-to-bootstrap option compared to Typer. It automatically detects function parameters and includes them in the CLI, streamlining the process of setting up and running scripts.

```
import fire

def hello(name="World"):
    return "Hello %s!" % name

if __name__ == '__main__':
    fire.Fire(hello)
```

Then you can run:

```
python hello.py # Hello World!
python hello.py --name=David # Hello David!
python hello.py --help # Shows usage information.
```

#### 10\. [Ruff](https://github.com/astral-sh/ruff)

Ruff is a tool that helps clean up and organize your code. It's a linter and code formatter built using Rust, which means it runs blazingly fast compared to its competitors. As it combines multiple tools in one (linter, formatter), it can replace your pylint/black toolkit. Fewer dependencies, fewer struggles.

#### 11\. [Pytest](https://docs.pytest.org/en/stable/)

Pytest is a common standard for testing. It's very popular because it gives clear information on what parts of your test are failing and why. Pytest is also powerful because it [has many plugins](https://docs.pytest.org/en/stable/reference/plugin%5Flist.html) you can add to help test specific parts of your code more effectively. For instance, the `pytest-compare` plugin helps check if the parts of your code that interact with each other are doing so correctly.

#### 12\. [python-dotenv](https://pypi.org/project/python-dotenv/)
When working on projects, especially locally, you often need to handle sensitive information like passwords or API keys. Python-dotenv helps manage these secrets safely by letting you store them in a `.env` file, which you keep out of your main project files. This means you can use all your secrets in your code without risking them being exposed. When you're ready to move your project to the cloud, the transition is easier because no sensitive information is left in the code; you just have to provision your cloud runtime with the appropriate environment variables to make it work.

### 🚦Data Validation

#### 13\. [Pydantic](https://docs.pydantic.dev/latest/)

This tool is great for making sure the data you receive is exactly what you expect. Think of it as a supercharged version of Python's own data classes. Pydantic lets you define exactly how your data should look and behave, which is especially useful for ensuring that data from the internet or other sources meets your standards. For example, you can set up Pydantic to check that a URL or a user's password meets specific criteria, which helps prevent errors in your data processing.

#### 14\. [Pandera](https://pandera.readthedocs.io/)

This tool checks that data organized in tables (dataframes) fits a specific format. This is very useful when you pass data around different parts of your program to make sure nothing breaks. Pandera allows you to define a schema, or a blueprint, for your data tables, and it checks incoming data against these schemas. You can use Pandera with Pandas dataframes, Polars, or even Pydantic models to ensure your data is correct before you proceed with processing. If you're unsure how Pydantic and Pandera relate, think of Pydantic as handling data validation at the Python object level (typically for `dict` structures), while Pandera focuses on validating data at the dataframe level.

#### 15\. [PyArrow](https://arrow.apache.org/docs/python/index.html)

PyArrow is a bit like the hidden machinery that helps various data tools work together seamlessly by standardizing how they describe and store data in memory. This compatibility is crucial for tools like DuckDB, which can work with data from Pandas or Polars without converting the data formats. While PyArrow isn’t typically used by developers directly for everyday tasks, it plays a vital role in the background. I definitely use it sometimes to explicitly type data and avoid inference, through PyArrow's magic.

---

With these 15 libraries, we cover the essentials. However, data engineering is such a wide domain that even when scoping to these categories, you sometimes need a little extra. What did I miss? Which library do you think should be on this list? Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

---

## One year, One challenge: win money if I fail

URL: https://mehdio.com/blog/one-year-one-challenge-win-money

Date: 2024-08-19T14:17:05.138

[![](https://substackcdn.com/image/fetch/$s_!IoO9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F385f7f34-7e27-45f6-87e8-b45fcf69725f_1024x1024.webp)](https://substackcdn.com/image/fetch/$s%5F!IoO9!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F385f7f34-7e27-45f6-87e8-b45fcf69725f%5F1024x1024.webp)

Dear people on the internet, It’s been a while, at least in this newsletter!
I want to start by saying thank you. Whether you've been here from the beginning or just joined, thank you for reading, watching, and engaging with my content. You are the reason I keep doing this! This is a brief, unusual post to quickly reflect on my past content and announce my biggest challenge. I'll also explain how you could help on this journey or win some free money... I know that sounded like a scam, but believe me, it's not 😅 I've been creating tech content for the past four years. I started blogging but I found quickly that making videos was even more fun for me and now it's also part of my work. However, turning a hobby into a job has its downsides, and I haven't been able to put out as much on my personal channel as I wanted to. I'm going through some big changes in my life right now, some good and others... challenging. I realized once again that life is short. I have so much to tell, and I want to express my creativity without boundaries. So here's the deal: starting today, I'll release one video a week for the next year on my [YouTube channel,](https://www.youtube.com/@mehdio) which is more than I have made. I'll keep the rules pretty simple. In a given month, videos need to be : 1) In a mixed format (long/short). 2) Both educational (in tech) and entertaining. These rules will force me to avoid sticking to one style just for the sake of views and ensure that I'm stepping out of my comfort zone while still bringing value to you. What about the money? I need you to help keep me accountable! If I miss a week, I’ll give $100 to one of my newsletter subscribers. And I’ll add another $100 to the pot for every additional week missed. So, in short, you get free content from me and a lottery ticket. This will be hard but I'm excited to explore things on a new level. [The first video](https://www.youtube.com/watch?v=aeFIj%5FSiDCU), which is at index `0` since I'm counting as a data engineer, is out now. The challenge will end on August 25, 2025. See you next week! (and don't forget to subscribe to get your ticket 😉) Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## I deleted data in prod and received a T-shirt; what's next? URL: https://mehdio.com/blog/i-deleted-data-in-prod-and-received Date: 2024-05-07T09:37:08.307 [![](https://substackcdn.com/image/fetch/$s_!RWUz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F564ab01f-d0e4-4e97-b7e4-25dcb11cd2a5_2981x2981.jpeg)](https://substackcdn.com/image/fetch/$s%5F!RWUz!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F564ab01f-d0e4-4e97-b7e4-25dcb11cd2a5%5F2981x2981.jpeg) the “souvenir” I've been a data engineer for almost 10 years now, but one mistake still haunts me: I accidentally deleted data in production. It was a tough lesson, but I learned a lot from it. As I share my story, I'll explain what happened and the important lessons I learned. My goal is to help you avoid making the same mistake and handle it better if it does happen. Believe me, it can happen to anyone. Even big companies like [AWS have accidentally deleted data in production](https://aws.amazon.com/message/680587/), and that's just one example among many. So, what should you do if it happens to you? ## The call to delete Let's roll back a few years to when on-premise big data clusters were the norm. 
There was no AWS S3 version history and no way to roll back data from built-in cloud services. It's late Friday afternoon, and I'm receiving a ping from a business stakeholder: "Data is not refreshed. Could you help us out? We need to decide for next week's marketing campaign". That's weird; I didn't get any failure notifications. The root cause is a silent failure because we ran out of disk space on our cluster. As it's an on-premise cluster, I can't just extend it with new nodes/disks; I have to free up some space to fix the pipelines immediately. As at many companies back then (and still to this day), the production cluster was also used as a dump for data that we _might_ use in the future, but that wasn't activated (= used by the business) and had no prioritized business case yet. That's the tricky part when working with data. In software engineering, it's common best practice to separate environments. Data teams would like to do the same. However, because the data is so connected, testing things in the development or staging areas can be challenging without using accurate data from production. In addition, it's a no-go for some organizations to have these production datasets available in a development environment for security/[PII](https://en.wikipedia.org/wiki/Personal%5Fdata) reasons. But I'm getting off track. I wanted to point out that it's pretty standard to have unused production data. The above means we could clean that up quickly to unblock active use cases and data pipelines. So here we go, deleting data in production. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

## The surgical operation

I'm an engineer who relies on trusted command lines, so let's dive into the terminal to check what's taking up a lot of space and delete it.

💡Lesson 1: Using the UI for some critical commands is okay. It may be slower, but UIs usually have some safety net, like a confirmation step around critical operations such as deletion. It's harder to make a quick mistake when using one.

I start exploring, and I suddenly find that `/data/marketing/project_1` is taking up most of the space. This one is not used at all, and no projects are planned for it in the short term. I'm doing a quick `cd ..` to explore something else and then performing a delete operation, `rm -rf`, on the _whole_ `data/marketing` folder 🤦‍♂️. This happened in roughly one or two seconds. I went too fast, getting in and out of some folders, and didn't realize my current path was wrong.

💡Lesson 2: when you want to delete files in a critical environment, always run an `ls` command against the path first, and only then replace it with the delete command (`rm`); that helps you double-check that you aren't doing anything wrong.

In hindsight, I felt under pressure. It was the end of the day just before the weekend, and I didn't want to stay too long at work.

💡Lesson 3: Don't let business stakeholders or upper management put you under pressure; approaching a problem that way won't be beneficial. Especially when there's an incident, you may need some help handling communication so that you can focus on the actual problem.

But come on Mehdi, you probably had a backup? Well, we had the trash disabled for another maintenance operation. So, the files that had been deleted were not recoverable. The worst part is that it was almost impossible to recover them from the source for other technical reasons. Yup, I pushed my luck really hard on this one.
💡Lesson 4: While you can't control external circumstances, it's wise to plan for the worst-case scenario, just to be prepared.

## Sharing the bad news

I had a good relationship with my manager, so I didn't expect anything terrible to happen. It was hard to estimate how much "money" my action had cost the company, since this data wasn't being used yet anyway. That said, I booked a 1:1 with him right after I had assessed all the consequences of my action.

💡Lesson 5: This was the right call; I didn't rush to conclusions based on partial information. Take the time to assess the consequences, then spend even more time writing a post-mortem to discuss with the team how it could have been avoided. Together, you can plan mid- to long-term solutions to prevent it from happening again. My manager also didn't hide the information and quickly shared it with the relevant upper management and stakeholders.

## Avoiding Data Deletion in Production

In summary, here are the main takeaways:

* Restrict deletion permissions to a small, relevant group.
* Don't hesitate to use the UI for critical tasks; it may limit damage if things go wrong.
* If an incident occurs, stay calm and ask for help managing external communication so you can focus on solving the problem.
* Don't be too hard on yourself; some things are beyond your control.
* Take time to understand the situation before sharing it with the right people.
* Write post-mortems and discuss mid- and long-term solutions with the team.

Most importantly, remember that your career is like a roller coaster. Since that incident, I've heard many stories like mine, some with even worse outcomes. Luckily, my situation wasn't too bad. I didn't lose my job, maybe because I'd been there for over a year and had done many good things. Sure, one mistake can overshadow all the good, but it wasn't intentional. The T-shirt I got afterward is a reminder that, with the right workplace culture, you shouldn't be afraid of making mistakes; that's how you grow. Your career will have ups and downs. Some things are out of your control, and even the things you can control may not always go as planned. But that's okay. Learn from your mistakes and strive to avoid them in the future. Keep learning.

---

## LLMs For Builders : Jargons, Theory & History

URL: https://mehdio.com/blog/llms-for-builders-jargons-theory

Date: 2023-12-19T14:25:29.502

[![](https://substackcdn.com/image/fetch/$s_!Z2sX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26dc91c6-4f60-40b3-b33e-bd670fd7ccf3_510x510.png)](https://substackcdn.com/image/fetch/$s%5F!Z2sX!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26dc91c6-4f60-40b3-b33e-bd670fd7ccf3%5F510x510.png)

Image by the author (generated with DALL-E)

AI is a hot topic, with abundant consumer-focused content and continuous research breakthroughs. For a typical software or data engineer starting their journey into the world of Large Language Models (LLMs) to build, it can be overwhelming. At least, I know I felt that way. What level of understanding is truly essential? This post aims to demystify just enough theory, terminology, and history so you can grasp how these elements interconnect.
My goal is to provide a comprehensive yet accessible overview, equipping you with the knowledge **to start building**. In this blog, we'll explore the jargon and history around LLMs and cover the key features that define them. To keep things practical and true to our mission of building, we'll conclude by running an LLM on a local machine. Note: this article is the 1st part of a series. [![](https://substackcdn.com/image/fetch/$s_!kFpl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8118bbf-e16c-4f74-99bc-a2caac5cfaf8_2184x1398.png)](https://substackcdn.com/image/fetch/$s%5F!kFpl!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8118bbf-e16c-4f74-99bc-a2caac5cfaf8%5F2184x1398.png) [Subscribe now](https://blog.mehdio.com/subscribe?) Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. ## Where do LLMs come from anyway? LLMs are **machine learning models** that are really good at understanding and generating human language. They are specifically a subset of machine learning known as **deep learning**, which deals with mostly but not only algorithms inspired by the structure and function of the brain called artificial **neural networks**. And... what's a Neural Network? Neural networks consist of layers of neurons, each processing part of a problem. Neurons compute using inputs and weights which are key parameters adjusted during training to improve accuracy. As the network processes data, it fine-tunes these weights to reduce errors in its output. This learning method enables the network to perform tasks such as image recognition or language understanding by efficiently handling complex data. [![](https://substackcdn.com/image/fetch/$s_!8d3N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0241b5d8-4cd0-4e7c-ad2b-30051b05c473_2604x1442.png)](https://substackcdn.com/image/fetch/$s%5F!8d3N!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0241b5d8-4cd0-4e7c-ad2b-30051b05c473%5F2604x1442.png) Image by the Author ## A brief look back: from complexity to accessibility To appreciate where we stand today with technologies like ChatGPT, it's helpful to rewind and see the journey that led to these advancements. #### Up to 2017 - RNNs & LSTMs Initially, deep neural network models such as Recurrent Neural Networks (RNNs) and their advanced variant, Long Short-Term Memory Networks (LSTMs), were predominant. They are sequentially processing text, but they face two challenges : * Handling long sequences and fully understanding broader contexts was difficult. * Due to their sequential nature, RNNs and LSTMs are limited in their ability to be processed in parallel, affecting how much data you can effectively feed the model. #### 2017 - A Paradigm shift with transformers The transformer model, introduced in the "[Attention Is All You Need](https://arxiv.org/abs/1706.03762)" paper, changed the landscape. Unlike RNNs and LSTMs, transformers used parallel processing and an attention mechanism, handling context and long-range dependencies more effectively. In brief, the transformer's attention mechanism allows the model to "focus" on different parts of the input data at once, much like how you focus on different speakers at a noisy cocktail party. 
It can weigh the importance of each part of the input data, no matter how far apart they are in the sequence. #### 2018 - Post-Transformer Transformers enabled us to move away from linear processing to a more dynamic, context-aware approach. Two major milestones : * BERT: This model, focusing only on encoding, was great at getting the context right. It changed the game in areas like figuring out what text means and spotting emotions in words. * GPT: The GPT series, like GPT-3, which focused just on decoding, became famous for creating text that feels like a human wrote it. They're really good at many tasks involving coming up with new text. But hold on, what exactly do we mean by 'encoding' and 'decoding'? Encoder-decoder models combine input understanding and output generation, ideal for machine translation. For instance, in Google Translate, the encoder comprehends an English sentence, and the decoder then produces its French equivalent. Encoder-only models like BERT are geared towards understanding inputs, excelling in tasks like sentiment analysis where deep text comprehension is essential. Decoder-only models, such as GPT, specialize in generating text. They're less focused on input interpretation but excel in creating coherent outputs, perfect for text generation and chatbots. Decoder-only models have become quite the trend because they're versatile and simpler to use. This makes them a favorite for all sorts of tasks, and they keep getting better thanks to improvements in how they're trained and the hardware they run on. [![](https://substackcdn.com/image/fetch/$s_!9C6L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc143791-2216-412a-9b4e-8a2d33399699_795x620.jpeg)](https://substackcdn.com/image/fetch/$s%5F!9C6L!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc143791-2216-412a-9b4e-8a2d33399699%5F795x620.jpeg) source: #### 2021 - Multimodal era In 2021, with DALL-E's release, we saw the expansion of LLM capabilities, similar to those in GPT, into the realm of multimodal applications. 'Multimodal' means these models handle more than just text - they understand images too! DALL-E, built on the foundations of GPT, used its language understanding skills to interpret text and then creatively generate corresponding images. This was a big deal because it showed that the techniques used in text-based models like GPT could also revolutionize how AI interacts with visual content. For reference, the below-left image was pre-DALL-E from a [2020 paper](https://arxiv.org/pdf/2009.11278.pdf), and the one to the right was taken from today’s Midjourney. Things are moving fast. [![](https://substackcdn.com/image/fetch/$s_!cEKY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75251d35-7c7b-445a-a557-484becc3d808_466x263.png)](https://substackcdn.com/image/fetch/$s%5F!cEKY!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75251d35-7c7b-445a-a557-484becc3d808%5F466x263.png) #### 2022 Release of ChatGPT and mass adoption ChatGPT has become the user interface of AI, democratizing access for anyone who can type on a laptop - and it's free. It's also the fastest-growing application in history, reaching 100 million users in just two months. 
[![](https://substackcdn.com/image/fetch/$s_!W3k7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21cc520d-adc2-4024-980a-399d81b3d690_750x650.png)](https://substackcdn.com/image/fetch/$s%5F!W3k7!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21cc520d-adc2-4024-980a-399d81b3d690%5F750x650.png) [Source](https://x.com/petergyang/status/1644392103668224001?s=20) : Peter Yang Since then, there's been an explosion of new models, both open-sourced ([Llama2 ](https://ai.meta.com/llama/), [Mistral](https://mistral.ai/), etc) and proprietary ([Claude](https://www.anthropic.com/index/introducing-claude), [Cohere](https://cohere.com/), etc), and a whole bunch of startups have sprung up. Not only have images become more impressive, but we've also started seeing things like text-to-video or text-to-audio. AI is branching out in so many directions, and this is just the beginning. Big players like Adobe are even [integrating AI into their products](https://blog.adobe.com/en/publish/2023/10/10/next-gen-of-creativity-powered-by-ai#:~:text=Generative%20AI%20tools%20in%20Adobe,by%20Adobe%20Firefly%20generative%20AI.), showing just how mainstream this technology is becoming. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. ## The engineering behind ChatGPT Let's now understand the key features of an LLM. ## The prompt So far, we know that LLMs are NN (Neural Network) that use transformers with specific strategies like attention mechanisms. LLMs are like super-smart guessers. Imagine you're typing a text message, and your phone suggests the next word – that's a basic form of auto-complete. LLMs do this but on a much more complex scale. Here's a simple example: If you type "The cat sat on the...", an LLM can predict the next word might be "mat" because it has learned from a huge amount of text that "mat" often follows this phrase. It's not just guessing randomly; it's using patterns it has learned from all the data it's been trained on. So, a good text input strategy helps LLMs make even better guesses, making conversations or writing more fluid and natural. The text we put into LLMs is called a prompt. Crafting effective prompts can be challenging because there's no one-size-fits-all method; it varies based on the model used. Even the order in which you write your prompt can significantly impact the output, making the process of writing good prompts somewhat unpredictable. However, several techniques can help, such as: * **Zero-Shot Learning**: Giving the model a task or question without any previous examples. For instance, asking "What is the capital of Germany?" without providing any prior context or examples. * **Few-Shot Learning**: Providing the model with a few examples before asking your main question. For example, showing a couple of examples of animals and their habitats, then asking, "What is the habitat of a polar bear?" * **Chain of Thought**: Writing out a step-by-step reasoning process to guide the model. If you ask, “How many hours are in three days?” you might start with "One day has 24 hours, so three days would have..." Next to these techniques, there are many frameworks for creating good prompts. OpenAI just released their [prompt engineering guide](https://platform.openai.com/docs/guides/prompt-engineering). But the best way to master prompt writing is through experimentation! 
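If you want to play with these techniques right away, here is a small illustrative sketch. The prompt wording is mine, not from any official guide; the point is only to see the three styles side by side as plain strings you can paste into ChatGPT or send to any LLM API.

```python
# Illustrative prompts for the three techniques described above.
zero_shot = "What is the capital of Germany?"

few_shot = (
    "Camel -> desert\n"
    "Penguin -> antarctic coast\n"
    "Polar bear -> ?"
)

chain_of_thought = (
    "How many hours are in three days? "
    "Think step by step: one day has 24 hours, so three days would have..."
)

for name, prompt in [("zero-shot", zero_shot),
                     ("few-shot", few_shot),
                     ("chain of thought", chain_of_thought)]:
    print(f"--- {name} ---\n{prompt}\n")
```

Run the same question through each style against your model of choice and compare the answers; the differences are usually the fastest way to build an intuition for prompting.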
### The lifecycle of a model There are roughly 4 steps in the lifecycle of a model * Data collections and preparation : Gathering and processing relevant data to ensure it is clean, representative, and in a format suitable for training the model. * Training : Feeding vast and diverse text datasets into the LLM to learn complex language structures, nuances, and contextual relationships, enabling it to understand and generate human-like text. * Fine-tuning : After initial training, the model is fine-tuned with more specific datasets or for particular tasks. This could involve training on specialized topics or styles to enhance performance in certain areas. * Inference : applying the trained LLM to interpret, generate, or respond to new text inputs. Essentially, ChatGPT is a fine-tuned model! We differentiate between a _base model_ and an _assistant model_, with the latter being a more evolved form. This assistant model represents a refined version of the base model, specifically enhanced or adjusted to excel in certain tasks or contexts. Roughly speaking, the assistant model is a more user-centric and task-specific iteration of the base model, designed to efficiently handle particular interactions or functions. ## The components of an LLM #### Context window The context window refers to the number of tokens (the smallest units of text, like words or parts of words) the model can consider at once, both in its input and output. This window defines the limit on how much prior text (input tokens) the model uses for understanding and how much it can generate (output tokens) in response. The size of this context window is crucial as it defines the model’s ability to comprehend longer contexts and maintain coherence in both its processing and generation of language. For instance, ChatGPT 4 can handle 32k tokens. This means the model can consider approximately 32k tokens at a time, combining both the input and output. You can easily roughly estimate that 1 word =\~ 1.5 token. #### Neural Network Taking Llama2 as an example, the complexity of coding needed for its neural network depends on several factors like implementation details, optimization, and specific features used. Large language models like Llama2 usually use high-level languages and frameworks (like Python with TensorFlow or PyTorch), which simplify much of the complex stuff. Running a Neural Network is similar to executing a `run.c` file, regardless of the programming language. Imagine it as a script around 500 lines long. Once you understand how it works, you'll find operating the neural network quite manageable. ### Parameters The real complexity of an LLM lies in its parameters, which are key to defining the model. These parameters are developed during training. Essentially, training involves feeding the model a vast amount of data, allowing it to learn and adjust its parameters for better performance. These parameters include weights and biases that the model uses to make predictions or generate text. The Llama2 series, like many AI models, comes in different sizes of parameters. The current norm in the world of LLMs is to have models with billions of parameters. When we say a model has billions of parameters, it's like saying it has billions of bits of information or rules to help it understand and predict language. For instance, when we talk about Llama-2-70, the "70" means it has 70 billion parameters. Training this 70 billion parameter model of Llama2 requires over [1 million GPU hours](https://llama-2.ai/llama-2-model-details/). 
If you were to use 1,000 GPUs in parallel, that is still over 1,000 hours, roughly six weeks of continuous training. The cost? A hefty $8 million. The weights (i.e., the parameters) of models like Llama2 are typically saved in specific file formats for sharing. A common format is `.pth`, which is used with PyTorch. There's also a lot going on around newer file formats like [GGUF and GGML](https://medium.com/@phillipgimmi/what-is-gguf-and-ggml-e364834d241c).

[![](https://substackcdn.com/image/fetch/$s_!3Q4k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c90b142-b2b1-41c2-ad3a-fad63e6625c8_3026x1550.png)](https://substackcdn.com/image/fetch/$s%5F!3Q4k!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c90b142-b2b1-41c2-ad3a-fad63e6625c8%5F3026x1550.png)

Image by the Author, inspired by the great [Intro to LLM](https://www.youtube.com/watch?v=zjkBMFhNj%5Fg&t) from Andrej Karpathy

## Running a model on your laptop

Now that we've got a handle on the basics, it's time to run our first model. Thanks to the surge in open-source projects like [Llama2](https://ai.meta.com/llama/) and [Mistral](https://mistral.ai/), we've got many tools to help us run these models right on our laptops. While big cloud platforms and services like [HuggingFace](https://huggingface.co/) are often the go-to for hosting LLMs, there's a lot of progress in making it possible to run them efficiently on your own computer, using the full potential of your CPU and/or GPU. [Ollama](https://ollama.ai/) is a great example: it lets you run, create, and share large language models from a command-line interface. You can think of it as "Docker for LLMs".

#### Setup

If you are on macOS, you can use the `brew` package manager to install it:

```
brew install ollama
```

Or visit their [download page](https://ollama.ai/download) for other distributions.

#### Downloading and running a model

Let's say we want to try the latest [Mixtral 8x7B](https://mistral.ai/news/mixtral-of-experts/) from Mistral; we simply run:

```
ollama run mixtral
```

Time to get your coffee ready! The first time you run the command, it downloads the model, a 26 GB file. Once that's done, you can simply type in your prompt and press enter! Of course, you can run many other supported models; have a look at [their model library](https://ollama.ai/library). Two great resources to help you choose the right model are:

* [LLMs leaderboard by HuggingFace](https://huggingface.co/spaces/HuggingFaceH4/open%5Fllm%5Fleaderboard)
* [lmsys.org](https://chat.lmsys.org/): a user-friendly interface to test different LLMs, which also features its own leaderboard.

## Onward and Upward

Well done! You've navigated through the complex jargon and now have a solid grasp of the key elements in the world of LLMs. Plus, you've even managed to run a model on your own computer! What's next? Only the exciting stuff. In the upcoming blog, I'll explore how to craft effective prompts and various techniques to make the most of LLMs. Stay tuned for level 2 🧗‍♂️ !
---

## Dancing your way through the pathless data career

URL: https://mehdio.com/blog/dancing-your-way-through-the-pathless

Date: 2023-10-30T14:10:46.84

[![](https://substackcdn.com/image/fetch/$s_!V6lX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7780955f-9be6-4439-a625-46fc83412164_1024x1024.png)](https://substackcdn.com/image/fetch/$s%5F!V6lX!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7780955f-9be6-4439-a625-46fc83412164%5F1024x1024.png)

Image by the Author \[Dall-e\]

Data role definitions have been hard to pin down from the beginning. There's no traditional path, people come from different backgrounds, and each company has its own nuanced definitions. Add to that the rapid evolution of our industry and of the roles themselves, and the confusion only gets bigger. In this post, we're going to look at how we got to this point. I'll do my best to offer you a guide through this confusing situation so you can create your own successful career in data without following a set path.

## My pathless career in data

Below is my career path over the last decade. As you can see, I have held several different roles. While it may not be as unconventional as some paths I have seen, where people transition from industries such as HR, filming, or music, it serves as a good example to show that you should not be afraid of taking such a path. In this blog, I will share a couple of personal experiences and tips that will hopefully help you in your data journey.

[![](https://substackcdn.com/image/fetch/$s_!jDOV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48603ae9-88cc-4804-90b0-a65d30b8a34b_1584x610.png)](https://substackcdn.com/image/fetch/$s%5F!jDOV!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48603ae9-88cc-4804-90b0-a65d30b8a34b%5F1584x610.png)

Image by the Author

## The flaws in the classic career scheme

A classic career path seems simple in some industries. Want to be a doctor? You go to med school. Want to be a lawyer? You head to law school. In the world of data, things are a bit more complicated. Through my mentoring experience, I've noticed that young graduates or those shifting careers often express a desire to _work in data_, AI, or ML. However, they usually don't have a clear idea of what they want to do. They might have some guesses about their interests, but mostly, they just want to break into the industry. That career path is not straightforward at all. Another scenario is when you're already employed in the data field and aim to shift to a different role, for example, transitioning from Data Analyst (DA) to Data Engineer (DE). This transition, too, is far from straightforward.
[![](https://substackcdn.com/image/fetch/$s_!RMcm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ae5232f-14c6-4ef2-8345-b46c49a597e0_752x336.png)](https://substackcdn.com/image/fetch/$s%5F!RMcm!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ae5232f-14c6-4ef2-8345-b46c49a597e0%5F752x336.png) Image by the Author If you check out my career journey, you'll see that my first data job was an internship. I left a steady job and took a 20% pay cut to grab a special chance to learn a lot and set myself up for later success. It's alright to start from scratch. Once you're in the industry, moving up becomes a lot easier. ## Let’s draw our plan based on what the market needs When you look at data engineering job posts on LinkedIn, you can see this: [![](https://substackcdn.com/image/fetch/$s_!BaYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8527e1c9-5707-484f-9520-08771e478956_1747x801.png)](https://substackcdn.com/image/fetch/$s%5F!BaYy!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8527e1c9-5707-484f-9520-08771e478956%5F1747x801.png) Even for jobs with the same title, they ask for totally different basic skills. So what’s going on here ? ### Understanding everyone’s data maturity Data maturity can be defined in many ways. I like to think it as your ability to use data to make business decisions that boost profit or productivity. At the very end, it’s not just about the tech tool stack or your code, but it’s for business. One issue is that big tech companies like Netflix and Airbnb are openly sharing their tech know-how, which is generally a good thing. However, this makes other companies think that achieving similar tech levels is easier than it actually is. They often don't realize how wide the tech gap really is. About a decade ago, in 2012, there was a rush to embrace Machine Learning, leading to a surge in hiring Data Scientists. However, many companies soon realized they couldn't effectively utilize their data. [![](https://substackcdn.com/image/fetch/$s_!hWQ1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa8bd60c-5614-4f61-b143-2b44e2c9f77a_2644x966.png)](https://substackcdn.com/image/fetch/$s%5F!hWQ1!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa8bd60c-5614-4f61-b143-2b44e2c9f77a%5F2644x966.png) Hidden Technical Debt in Machine Learning Systems, Google. The image above, from a [2015 Google paper](https://proceedings.neurips.cc/paper%5Ffiles/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf), illustrates the average time a Data Scientist spends on an ML model. Interestingly, despite being a global trend, many companies still find themselves in this scenario today, each experiencing their own "aha" moment when they realize its relevance to them. So, should we simply label ourselves as data fellows and dive into data tasks? That approach doesn't really assist anyone in charting their career path. To some extent, role titles are important. Reflecting on my career journey, I made the leap into the tech scene about four years ago. Prior to that, I was employed in more traditional companies. 
While securing a job in such places might be easier and you might not get to use all the latest tech tools, they offer a solid foundation. They provide a great starting point for building skills and breaking into the data field. ## Dissecting roles Everyone has probably come across a version of the well-known data roles Venn diagram. [![](https://substackcdn.com/image/fetch/$s_!Ed3e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff57924f9-3342-416e-b4cf-eb0223933831_650x473.png)](https://substackcdn.com/image/fetch/$s%5F!Ed3e!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff57924f9-3342-416e-b4cf-eb0223933831%5F650x473.png) source : Different versions of this Venn diagram exist. The issue is that they imply skills are strictly linked to certain jobs. For example, they make it seem like a Data Scientist never does ETL (Extract, Transform, Load) work. But in reality, Data Scientists often do engage in ETL tasks. The reality is, skills often overlap a lot between different roles, but how much they overlap can change from one company to another. This diagram doesn't show how deeply you need to understand each skill for each job. This can be overwhelming, given the huge range of skills involved. A more accurate way to represent this might be through a web chart (or radar chart), as shown below. [![](https://substackcdn.com/image/fetch/$s_!D9Pf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F617fb004-25ff-4f70-9af7-aeaf6469018a_941x523.png)](https://substackcdn.com/image/fetch/$s%5F!D9Pf!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F617fb004-25ff-4f70-9af7-aeaf6469018a%5F941x523.png) source : The further away you go from the center, the higher the value is. This graph is not to be taken for gold as a one source of truth but as a good baseline. This graph enables us to comprehend the variations in job offers for identical roles. So you could potentially have [![](https://substackcdn.com/image/fetch/$s_!O2KW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08181bcc-8965-4769-8ce4-420b0b9b7d61_888x500.png)](https://substackcdn.com/image/fetch/$s%5F!O2KW!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08181bcc-8965-4769-8ce4-420b0b9b7d61%5F888x500.png) And even this : [![](https://substackcdn.com/image/fetch/$s_!rpko!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcef2b3e-8951-4dd1-a686-c51762ae6320_890x494.png)](https://substackcdn.com/image/fetch/$s%5F!rpko!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcef2b3e-8951-4dd1-a686-c51762ae6320%5F890x494.png) Because sometimes, even the company is not sure about which role they want to hire and they put too many requirements. They don’t realize they are not looking at one person role, but a whole data department. Back in 2015, when I began working in the data realm, the term 'data engineer' wasn't really popular. I initially rode the wave of excitement into a Data Scientist role, but soon realized it wasn't quite the right fit for me. 
Don’t hesitate to try different things, as shifting between roles is often smoother than making your initial entry into the field. ## **What You Need to Learn to Break Into Data** "The art of knowing is knowing what to ignore." – Rumi To thrive in the data field, it's essential to focus on a mix of both hard and soft skills. Here's a **non-exhaustive list** that balances technical abilities with personal competencies, offering long-term value and market relevance: * **Data Analysis**: Interpreting and understanding data. * **Programming (SQL/Python/etc)**: Critical for data manipulation. * **Data Visualization**: How to tell a story and present data insights visually. * **Software engineering Basics**: Understanding the software development lifecycle, from building code to deploying it into production. * **Data Warehousing Concepts**: Knowing about data storage and management. * **Critical Thinking**: For solving problems and making decisions. * **Communication**: Data roles are in the center of many stakeholders with different background, communication is challenging. * **Adaptability**: Staying flexible in a rapidly evolving tech landscape. When building your skill set: 1. **Focus on Long-Term Value**: Opt for skills that will stay in demand. Be cautious about becoming too specialized in certain tools or technologies that might not stand the test of time. 2. **Market Relevance**: Align your skills with the current needs of the job market. Regularly check job listings to stay informed about desired skills. These skills can be applied across various data roles. Customize your learning based on these essentials, keeping in mind your career objectives and the ever-changing job market. When I transitioned to a data engineer role, I immediately recognized that my SQL skills were transferable. However, it was also clear that I lacked foundational knowledge in software engineering. I was eager to delve into this area, and now, with a better understanding of how software functions beyond just the data aspect, I can see the immense value of these skills. They're not just useful; they're transferable across various domains. Another piece of enlightening advice in the same line comes from Steph Ango's recent blog, "[Don't Specialize, Hybridize.](https://stephango.com/hybridize)" As illustrated in the image below, he demonstrates that there are multiple pathways to gaining expertise, going beyond the traditional two-path approach. [![](https://substackcdn.com/image/fetch/$s_!gnwE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15962a02-3387-4b5f-b109-6403019a93d9_1445x1435.png)](https://substackcdn.com/image/fetch/$s%5F!gnwE!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15962a02-3387-4b5f-b109-6403019a93d9%5F1445x1435.png) Source : Steph Ango, https://stephango.com/hybridize ## **Beyond just getting a job** We've mostly talked about landing a job, but it's also about finding joy in your work. How do you stand out? How do you find something you truly enjoy? Enter Ikigai, a Japanese concept that's all about finding your purpose. 
[![](https://substackcdn.com/image/fetch/$s_!lTM7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F204934e4-9403-46ad-ab1f-1f6de39317a5_640x640.webp)](https://substackcdn.com/image/fetch/$s%5F!lTM7!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F204934e4-9403-46ad-ab1f-1f6de39317a5%5F640x640.webp) We've focused a lot on the 'what you can be paid for' part, but it's important to discover what you love and what you're good at. So, how can you make your own moves? * Explore industries you're passionate about. * Think about your soft skills. What are your strengths? What do others appreciate in your personality? * Find activities that feel like fun to you but work to others. Blending these with your professional life can really add some excitement. Reflecting on my own career, I noticed I spent a lot of time blogging, making YouTube videos, and teaching. These were not just hobbies; they helped me stand out and led me to a new role in developer relations (devrel). A great way to experiment with this is through side projects, and they don’t have to be tech-related. Anything that lets you play with your interests and figure out what you're good at can be beneficial. The key challenge lies in understanding how these pursuits can enrich your professional life. ## Creating your own map Navigating a career in data can feel like a labyrinth. Unlike traditional paths, it demands a nuanced understanding of both technical and business aspects. The market needs are constantly evolving, and so are the definitions and expectations of various roles within the industry. Moreover, the pace of change varies across organizations. Craft your unique, pathless career and avoid being too fixated on specific job roles: * Focus on transferable skills. * Align these skills with the demands of the market. * Reflect on what activities you genuinely enjoy and mix them with your work. Flexibility, continuous learning, and self-awareness are your keys to unlocking extraordinary opportunities. Good luck! 🍀 Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. [Subscribe now](https://blog.mehdio.com/subscribe?) Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## Revitalizing Your Tech Career: My 30-Day Marathon Through 20+ Interviews and 5 Job Offers URL: https://mehdio.com/blog/revitalizing-your-tech-career-my Date: 2023-04-20T08:56:34.82 [![](https://substackcdn.com/image/fetch/$s_!2XEj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e4720f-f632-48a1-b034-a9b87bc8e3bd_907x914.png)](https://substackcdn.com/image/fetch/$s%5F!2XEj!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5e4720f-f632-48a1-b034-a9b87bc8e3bd%5F907x914.png) The journey through a marathon of data interviews \[Image by the Author\] In January, I began an exciting journey of back-to-back interviews. Over just one month, I applied to 8 companies, and after more than 20 interviews, I received five job offers. All of these positions were fully remote, as I am located in Berlin, Germany. 
This was my second time taking on this challenge, and it's been **a major boost to my career.** In this post, we'll talk about what I learned during this interview marathon and share some updates on the remote data market, focusing on devrel roles since that was my target. You'll discover why embracing such a marathon could be a game-changer for your own software engineering/tech career and learn how to navigate this thrilling endeavor! Let's start with the definition.

## What is a Marathon?

As this is my second marathon, I think it's worth laying out some rules, or at least a definition, of what I consider a marathon of interviews. Like any marathon, you prepare seriously for a race of endurance, mental strength, and physical fitness. The latter may not sound relevant in our case, but believe me, in this era of Zoom meetings, sitting is the new smoking! Note that this marathon was a little different for me because I was changing roles, moving from staff engineer to developer relations (devrel). The story behind this switch will be shared in another blog.

### Preparation

#### Yourself

During this time, review the skills you lack for the target role so that you can either fill the gap during preparation or simply be honest about your limits if you hit a wall during the interviews. It's worth thinking about your answer if you do hit such a wall. Is it because you haven't practiced through a project for a while? Is it because you are lagging behind on new features? Are you willing to learn to fill the gap in that tech, or is it just not your target? Also think about which skills you are strong in for the role so that you can highlight them with concrete past examples. To do so, you can write your [brag document](https://jvns.ca/blog/brag-documents/): the projects and impact you are proudest of in your career.

#### The company

You should spend some time researching which companies are interesting to you. Look online to understand their business and how healthy they are. Websites like Glassdoor and Crunchbase can give you some rough numbers. Then write down any questions you may want to ask the future interviewer. Don't forget that an interview involves both parties; it's also your responsibility to gather all the necessary information about the company, work culture, and technical setup. List the top 10 companies you would like to work for. Tip: it's nice to ask the same questions to multiple interviewers to collect different data points. Finally, update your LinkedIn profile and spell out the projects and impact you delivered for your previous employer. You may also want to update a PDF-style CV, as it is sometimes required. Having a website portfolio also helps to keep everything in one place, but again, you may need to update it. Last but not least, think about a realistic salary range to ask for.

### Applying

Don't start by applying directly to your top 3 favorite companies; do a mix to warm up those interview muscles! Look at your LinkedIn network to see if anyone works at a company you want to apply to. Tip: on LinkedIn, you can search based on company name and connection level. I would avoid cold-reaching out to random people for a referral. But if an engineering manager in your 2nd-degree connections is hiring, it doesn't hurt to contact them directly AFTER you have applied, to let them know you are interested in the role they published.
### Interviews Just follow the flow! Some processes are more painful than others, and you may see red flags that would discourage you from moving further. But as part of a marathon, that’s a whole goal to have time to try the good, the bad, and the ugly. There is always something to learn from a process, company, or interviewer. One important thing to mention to the interviewer is your timeline and context. Giving information that you are on break to do a marathon of interviews and looking to start at a specific date would usually speed up the process. It means that your agenda is flexible, and you’ll be hired whatever happens in a few weeks, hopefully. There’s time pressure from both sides. ### Weighing Options & Sealing the Deal Congratulations! You’ve been through all the talks and tests, and now have some offers. It's time to negotiate based on your input. A tip is that when starting the interview, just mention an expected salary range based on your research and the offer you received. There isn't any problem if you come back with a negotiation with a higher range than expected if you feel you were a bit under the market value. The most important is to justify why you are asking now for a higher salary. Stay humble, and polite and provide good arguments. For this marathon, my requested salary range was lower than what I could get. After the first offers came out, I texted the companies that, based on these new data points, I’d like to update that salary range. It was pretty hard for me to evaluate what I'm worth as I was switching roles and mostly targeting US companies that have a different benchmark than the local Berlin tech scene. Reaching out to people in your network who currently work at or have left the company is also a good idea. Don't be afraid to seek feedback. I've observed from experience that once a company has around 150 or more employees, numerous "micro-cultures" emerge, which means that one team may operate entirely differently from another. Finally, I usually do a [reverse interview](https://blog.pragmaticengineer.com/reverse-interviewing/) with the future hiring manager once I have the offer to ask some additional questions. ### Timeline [![](https://substackcdn.com/image/fetch/$s_!J2zV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d7b5df-7776-46e8-97a4-a21e7e0aaebd_2359x932.png)](https://substackcdn.com/image/fetch/$s%5F!J2zV!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7d7b5df-7776-46e8-97a4-a21e7e0aaebd%5F2359x932.png) Timeline overview \[image by the Author\] A couple of caveats : * Of course, the more senior you are, the easier it will be to get applications. * You may rush it into 1 month, but be aware that some processes can take longer for reasons that are out of your control (holidays, parental leave, etc.) * The 50% ratio highlights that I don’t believe you can work full-time while doing this marathon and not becoming insane. But you can do some work or relax with some other activities. Later one is my recommendation, as discussing with many people drains much energy. ## Targeting companies out of the FAANG hype Disclaimer: I've never worked at FAANG (or MANGA), but I did attend the final rounds at Meta a few years ago. FAANG companies were on a hiring spree during the past decade of the flourishing tech industry. 
Now that times are getting harder and not going away soon, I believe any small company that didn't hire aggressively is in a great position to beat the competition. The reason is that they don't have to deal with a heavy layoff and all the consequences of it, which impact culture and motivation, to name a few. That's the reason why I mostly applied to small companies that had decent runaway money in the bank. Though I couldn't predict some of them had their money at SVB that would crash… 🤦‍♂️ Anyway. Smaller companies offer more flexibility and often have a smaller applicant pool, making them an ideal target. Unlike larger corporations with established remote entities, startups rely on services like [Deel](https://www.deel.com/) and [Remote](https://remote.com/) to hire remote people, providing you with the freedom to work from (almost) anywhere. If you are not familiar with these services, they are EOR (Employer of Record), and they basically act as an intermediate for contracts, payments, and everything else related to HR admin work. They have offices worldwide, and they will give you a local contract based on your location. No worries about taxes; it’s like you would have another job in your country. There are many other reasons why you would or not consider startups over FAANG, but to recap, if you are looking for a flexible remote job with a decent salary, open your mind to other opportunities. [![](https://substackcdn.com/image/fetch/$s_!gmY8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe27e6fe1-22bb-4dbe-ac59-6c26a78ce1f1_1024x1024.jpeg)](https://substackcdn.com/image/fetch/$s%5F!gmY8!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe27e6fe1-22bb-4dbe-ac59-6c26a78ce1f1%5F1024x1024.jpeg) Power of remote working, working from anywhere \[image by the Author\] ## Refine Your Niche Skills As you progress to senior-level engineering roles, it's important to sharpen your skills and knowledge in a particular field. This could be technical (e.g., internal of databases) or domain-driven (e.g., data/analytics), helping you stand out. In my case, I specialized in DevRel within data infrastructure, a niche within a niche. As I'm switching roles, I'll adapt my niche skills. As a data engineer, I was comfortable on the platform side. Now that I'm moving from devrel, I intend to boost my video and teaching skills as key differentiator skills. Yes, you will see more crazy video edits on my YouTube channel soonish. ## US vs. EU Salary So, what about the money? Well, as I mentioned above, this was a big guess when I asked for a salary range because I was applying for US companies, and there are a couple of things you need to consider. As most of you may know, the EU/US systems are completely different. To summarize quickly, we, as Europeans, pay more taxes and have a smaller salary, but we have a lot of things covered with little to no extra cost. This includes health insurance, generous unemployment benefits & PTO, etc. Just to give you an example, in Germany, anyone that has a job can have health insurance for his/her whole family "for free,” as it's included in their salary taxes. It's also common to take 1 year of parental leave for the mother, as the government covers it. It's not your salary at 100%, but it's decent enough to have great family time. I’m not getting here into a debate about what’s the best. 
You can definitely make more money in the US, but that depends on your family situation and a bunch of other things. Do your own research. A fun fact is that my first viral LinkedIn post was about this very subject of US vs. EU salary. Have a [look at the comments](https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6931952695750619137/); they are… interesting.

[![](https://substackcdn.com/image/fetch/$s_!cfRI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f85a97-539b-496a-9507-41005100a080_609x430.png)](https://substackcdn.com/image/fetch/$s%5F!cfRI!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa2f85a97-539b-496a-9507-41005100a080%5F609x430.png)

Me writing a 15min post without thinking that it could lead to 500k views 😅

Next to that, it usually costs a bit more for US companies to hire in the EU (tax reasons), even though Europeans are cheaper. To sum up, if you apply to a US company, you can't just ask for a 1:1 equivalent of a US salary. You need to take the above into account. You can assume that you will be at a "discount" compared to US-level salaries but above the national or even EU market (\*). If you want to know more about how different companies benchmark their salaries, there's the must-read classic from The Pragmatic Engineer about this: [The Trimodal Nature of Software Engineering Salaries in the Netherlands and Europe](https://blog.pragmaticengineer.com/software-engineering-salaries-in-the-netherlands-and-europe/).

_(\*) excluding FAANG salaries_

## About DevRel $$

My strategy (echoed by other devrel folks I've heard from) was to benchmark myself against my software engineering experience because, after all, we are all software engineers. I would add that you can probably apply a multiplier depending on the variety of skills you have. For instance, can you do nice technical writing? Blog tutorials? Hands-on videos? Can you edit videos? Make engaging thumbnails? Give in-person talks? You may not need all of these, as companies have different definitions and requirements for their devrel roles depending on where they are in their journey, but having several of these skills (to a certain level) in one person is more valuable than you think. I positioned myself as somewhat above senior level. But again, this doesn't mean much. In practice, I have 8 years of experience in data, and I've been doing content for the past two years in different forms. So, I'm not a complete newbie. The offers I got were in the range of **160k-195k$ yearly base salary** with equity between 20k-60k. TC (total comp) was, on average, **\~210k$ annually**. PTO was always at least 20 days, the legal minimum in Germany. If you look at US Silicon Valley salaries, that's relatively "cheap" for a Senior Software Engineer. But if you look at Berlin, that's above the tech market. As a staff engineer, you can expect a \~120-150k$ yearly base salary in the Berlin tech scene.

## Take a Break!

I can't stress enough how important it is to take a break whenever you want to change jobs or go through interviews. If you can't afford a month, take at least a week. The power of running multiple interviews simultaneously and landing multiple offers will definitely be worth it, as **you will have way more leverage for negotiation**. Plus, having a **free mind to measure the pros and cons** of opportunities is key.
After all, we spend a crazy amount of time working, so taking a week or a month is nothing compared to the time invested. It's also a good time to just **reflect on what you would like to do and do more interviews than needed** just to test out what you like and what you don't like. Again, this is something you can't afford to do when working full-time. Plus, as time goes on, your motivation will go down. Having a timeboxed time to do your interviews will also push your effort to another level. If you want to catch up on my first experience of such a marathon, applying as a data engineer, you can read about it here! And may the interview force be with you. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## The Most Painful And Repetitive Job Of A Data Engineer URL: https://mehdio.com/blog/the-most-painful-and-repetitive-job Date: 2023-03-15T10:50:56.693 [![](https://substackcdn.com/image/fetch/$s_!x0L-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d32393-fc8c-4f9e-bee5-118c90aaf021_800x800.png)](https://substackcdn.com/image/fetch/$s%5F!x0L-!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d32393-fc8c-4f9e-bee5-118c90aaf021%5F800x800.png) Image by the Author, generated by MidJourney I remember my first job as data analyst was to use Microsoft SSIS, a GUI ETL tool. We had only one data source back then, which was plenty enough for the business use cases. In today’s modern world, that often looks like a joke or a utopia. Even if you have a small business, you quickly end up with many different services and other places to consume, process, and analyze data. You end up doing the most boring data engineering job: moving data. How did we get into this situation? Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. ### Why can’t we have data in a single place? There are multiple reasons for data team to move data from system A to a system B with little to no transformation. Let’s cover a few common tasks : * Getting data from a source is often steps from 0 to 1 for any analytics project * Pulling data from an API and putting it in an Object Storage or Data warehouse * Moving data from an operational OLTP database to an Object Storage/Data Warehouse * Moving data from the Data warehouse to Object storage * Moving data from Object storage to an operational OLTP database * Migration project from data service A to data service B In these use cases, there are transformations but few “business” transformations. They often involve mostly changing the form and type of data rather than effective business logic. It’s mostly about data compatibility: **structure & format**. Some examples : * Flattening some JSON data (coming from an API) into a columnar format for loading into a data warehouse * Transforming into a new file format, typed, efficient for query and consuming less storage (e.g CSV to Parquet) At the very end, we put data into different systems for **performance** reasons (cost, query latency, usability with BI tools). We need to do this because while there is some standard in file format, it’s not widely adopted. 
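To make that "structure & format" work concrete, here is a small, hedged sketch of the two examples above: flattening nested JSON from an API and converting the result into Parquet. The file names and fields are invented for illustration, and writing Parquet with pandas assumes PyArrow (or fastparquet) is installed.

```python
# Illustrative only: file names and fields are made up.
import pandas as pd

# 1) Flatten nested JSON (e.g., an API response) into a columnar-friendly table.
api_records = [
    {"id": 1, "user": {"name": "Ada", "country": "DE"}, "amount": 42.0},
    {"id": 2, "user": {"name": "Linus", "country": "FI"}, "amount": 13.5},
]
flat = pd.json_normalize(api_records)  # columns: id, amount, user.name, user.country

# 2) Write it as Parquet: typed, compressed, and efficient to query.
flat.to_parquet("marketing_events.parquet", index=False)  # uses pyarrow under the hood

print(pd.read_parquet("marketing_events.parquet").dtypes)
```

No business logic in sight; it is purely about making the data compatible with the next system, which is exactly the repetitive work this post is complaining about.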
### The emergence of standards in file formats

File formats like Parquet, and supercharged ACID-supporting formats like [Delta lake](https://delta.io/), [Apache Iceberg](https://iceberg.apache.org/) or [Apache Hudi](https://hudi.apache.org/), make moving data easier. Many cloud data warehouses have been putting effort into making these file formats work with their ecosystems. However, remember that data warehouses like Redshift, Snowflake, or BigQuery all have their own internal file formats, and using these open-source standards today comes with a performance tradeoff.

### The true bottleneck

So why is moving data still so hard? Aren't these file formats enough to conquer the problem space? It all comes down to **data compatibility** and **how data transits** between places. The file format is one piece of the puzzle, but what protocol do we use to move the data? Let's look at how most databases communicate with the outside world to transfer data: our old friend JDBC. JDBC has been around since the **mid-1990s** (!), providing a vendor-neutral interface for accessing databases from Java applications. But as technology has evolved, JDBC has started to show its age. Of course, as you may know, it can also be used by other programming languages through JDBC drivers. These drivers bridge the JDBC API and the programming language, allowing other languages to use JDBC to interact with relational databases. This makes it a versatile technology that can be integrated into different systems and used by multiple programming languages, not just Java. It's a common standard in today's data ecosystem, especially for many BI tools.

### Enter the savior from the shadows: Apache Arrow

Created in 2016, [Apache Arrow](https://arrow.apache.org/) is an open-source, in-memory **data format** and **transport mechanism** that provides a standardized, efficient, and interoperable solution for data processing and exchange across different platforms and programming languages. Arrow aims to improve data transfer and processing performance by providing a columnar data format that enables better compression, vectorization, and SIMD parallelism. Apache Arrow has gained significant popularity in the data processing and analytics communities for these reasons. Some primary reasons why it beats JDBC:

* The JDBC protocol uses a row-based data transfer approach, which can result in high network overhead and increased latency for large datasets.
* JDBC may require multiple round trips between the client and server for data transfer, making it inefficient and slow compared to Apache Arrow's optimized data transfer mechanisms.
* Apache Arrow's columnar data format allows for better compression, vectorization, and SIMD parallelism, resulting in significant performance improvements over JDBC's row-based approach.

[![](https://substackcdn.com/image/fetch/$s_!qd-g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee313e63-4581-427c-afb3-098f09c234df_800x559.png)](https://substackcdn.com/image/fetch/$s%5F!qd-g!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee313e63-4581-427c-afb3-098f09c234df%5F800x559.png)

Image by the Author

Column-based databases are the most common places where we consume analytics today, aka cloud data warehouses (BigQuery, Snowflake, and co.). And so are the file formats used today, like Apache Parquet, Delta Lake, and Apache Iceberg.
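As a tiny, hedged illustration of what "columnar in memory" buys you (the table below is invented), here is PyArrow holding a table as typed columns and handing it to both Parquet and pandas without re-encoding it row by row:

```python
# Minimal PyArrow sketch: one columnar representation, shared with Parquet and pandas.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "event_id": [1, 2, 3],
    "channel": ["web", "email", "web"],
    "revenue": [10.0, 4.2, 7.5],
})

# Each column is a contiguous, typed buffer, friendly to compression and SIMD.
print(table.column("revenue"))

# The same in-memory layout feeds the Parquet writer and the pandas conversion.
pq.write_table(table, "events.parquet")
df = table.to_pandas()  # often zero-copy for plain numeric columns
```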
Arrow is, therefore, in the right place to be used around these tools.

### Practical Tips to get the most out of Arrow 🏹

Consider implementing the following strategies:

1. Evaluate your existing data stack: Assess your current data stack to identify areas where Apache Arrow can be integrated to optimize data movement and processing. Determine which systems and tools are compatible with Arrow and can benefit from its columnar data format.
2. Embrace open-source columnar file formats: Use formats like Parquet, Delta Lake, Apache Hudi or Apache Iceberg to enable better data compatibility and interoperability.
3. Leverage modern data tools: Choose modern data tools that support Apache Arrow, such as [Polars](https://www.pola.rs/), [DuckDB](https://duckdb.org/), Apache Flink or Apache Spark, to take advantage of its performance benefits.
4. Stay informed about new developments: Keep an eye on Apache Arrow's ongoing developments and improvements and its growing adoption in the data community.

### What does the future look like?

The future of database protocols is looking brighter than ever! While using a standard file format does have some performance tradeoffs, Arrow's role in properly interfacing data has huge potential. With its growing adoption, Arrow is expected to simplify moving data between different systems, minimizing the need for extra serialization and deserialization. Its columnar format makes data transfer efficient, and its support for multiple programming languages and platforms makes it incredibly versatile. Soon, you'll be able to spend less time on the mundane task of moving data and more time generating valuable insights for your business. To quote Tristan, CEO at dbt Labs, [during an interview](https://youtu.be/o1pCuTYa%5Fr0?t=595) I did last October: "I want Apache Arrow to take over the world."

In the meantime, may the data be with you.

Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work.

---

## 10 Lessons Learned In 10 Years Of Data [2/2]
URL: https://mehdio.com/blog/10-lessons-learned-in-10-years-of-c34
Date: 2023-01-13T12:43:39.075

[![](https://substackcdn.com/image/fetch/$s_!pSOM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fd1136-0940-42a9-af42-143662ae7560_1024x1024.jpeg)](https://substackcdn.com/image/fetch/$s%5F!pSOM!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43fd1136-0940-42a9-af42-143662ae7560%5F1024x1024.jpeg)

Generated by MidJourney

This is part 2; check out [part one](https://mehdio.substack.com/p/10-lessons-learned-in-10-years-of) if you don't want to get spoiled!

Let's tackle five lessons learned from the past three years.
We have seen an explosion of frameworks, tooling, and SaaS startups. Probably because bootstrapping SaaS products has never been easier. ## 2020 ⏱️ Remember bob, the data engineer? [![](https://substackcdn.com/image/fetch/$s_!0nR_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655daf0-ecba-42a1-a420-081ae2c2ab38_385x497.png)](https://substackcdn.com/image/fetch/$s%5F!0nR%5F!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc655daf0-ecba-42a1-a420-081ae2c2ab38%5F385x497.png) I was a Data Scientist, but the Data Engineering hype was stronger. Bob now has plenty of options in open-source. Plus, they aren’t ridiculously expensive to put in production. After all, many companies put Kafka into production before it was even 1.0! #### ✔️**Lesson #6: Open-source is the new norm** [![](https://substackcdn.com/image/fetch/$s_!3mcg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414cdb49-4ba1-452f-b59a-4d0719898bef_459x461.png)](https://substackcdn.com/image/fetch/$s%5F!3mcg!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F414cdb49-4ba1-452f-b59a-4d0719898bef%5F459x461.png) Having a product (or some part of it) open-sourced enables tech people to try new technology at minimum risk without any commitment. Tech folks don’t like to talk to salespeople. I prefer to try the product first and then return with any questions. And I’m happy to go one step further regarding sales if things get interesting. On a side note, open-source is not needed in this case. Having an online demo without a credit card could solve this. Yes, but it’s still a black box. How mature is the project? How big is the number of contributors? What’s the community traction? All these things can be evaluated easier when a project is open-source. But here’s the trap: _maintenance_ is not free. While an open-source tool can be easy to try, there is sometimes a huge gap between a local playground and something put in production. We sometimes get fooled compared to expensive proprietary vendors, but we should never forget that a huge part of the cost, in the end, is our salary. It’s ends-up with the same classic question of build vs. buy, as there’s always something to build when using an open-source product. ## 2021 ⏱️ The modern data ~~stack~~ mess [![](https://substackcdn.com/image/fetch/$s_!7qSq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb0f6e8f-e215-43f2-a8a3-50ae58904071_1349x691.png)](https://substackcdn.com/image/fetch/$s%5F!7qSq!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb0f6e8f-e215-43f2-a8a3-50ae58904071%5F1349x691.png) We are in an overdose of tooling era. If you are sometimes losing track of what’s happening, don’t worry; you are not alone. Even when doing my daily technology watch, I’m still amazed by all products I’ve never heard of in the above picture. #### ✔️**Lesson #7:** The best cloud provider is AWS GCP Azure AWS had a headstart in the Cloud war, and it’s still a safe bet. But we have seen that each cloud provider somehow plays a different strategy. Azure is great for existing Microsoft customers. 
There are a lot of migration paths, and contract-wise, well, it's just an amendment to an existing contract. For big old corporations that require a ton of security and procurement processes to consider moving to the cloud, it's a big win. Google's GCP has been focusing a lot on Machine Learning products, which makes sense as it's part of their core and initial product. That being said, a lot of companies today have at least two cloud providers just for the sake of being able to negotiate better pricing. Plus, with Kubernetes emerging as a standard and getting easier to manage, I've often seen companies focusing primarily on these services to avoid too much vendor lock-in. Aside from that, we have SaaS startups that are taking a share of the cake with niche services. They partner with cloud providers to offer a best-in-class experience while relying on the big cloud provider for server management. Databricks on Azure and Confluent's Kafka on GCP are some examples.

#### ✔️**Lesson #8:** The data stack needs consolidation

[![](https://substackcdn.com/image/fetch/$s_!UOXe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401b61a5-6e2b-4b50-8850-f79bcca450b2_641x478.png)](https://substackcdn.com/image/fetch/$s%5F!UOXe!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F401b61a5-6e2b-4b50-8850-f79bcca450b2%5F641x478.png)

Having plenty of options is nice, but if we still need to figure out how the blocks talk to each other, it may not be worth it. One big plus we've seen these past years in the data world is the adoption of standard file formats. It mainly started with Parquet, and now we have ACID file formats like Delta Lake, Iceberg, and Hudi. A lot of cloud data warehouses have been pushing support for these. Moving data just because we couldn't use the data as it is was the most painful job to do, especially at scale. Glad we are finally getting away from this with more standards. But no matter how many integrations we put in place, the hard truth is: we have too many tools. Some data vendors will just die. Or get acquired. Two weeks ago, Confluent (Kafka) announced the acquisition of Immerok (managed Flink). I'm sure most of us had never actually heard of this startup.

## 2022 ⏱️ Python and SQL. Everywhere.

A personal confession: I've got Python fatigue.

#### ✔️**Lesson #9: Data engineering/analytics is software engineering**

[![](https://substackcdn.com/image/fetch/$s_!yxxJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1cef10-93fd-4b7c-bf32-6659f27217e3_1002x588.png)](https://substackcdn.com/image/fetch/$s%5F!yxxJ!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3b1cef10-93fd-4b7c-bf32-6659f27217e3%5F1002x588.png)

As my career was mostly in data (BI, etc.) and didn't start with classic software engineering, I always felt I had to catch up on some foundations like CI/CD, unit testing, and observability, just to name a few. Many of these concepts have landed in data by now, but many people were (and some still are) not considering data as a software engineering asset. Your metrics in your beautiful Tableau UI are software engineering assets that should be versioned, tested (in different environments), and included in a CI/CD pipeline.

#### ✔️**Lesson #10: Python & SQL is nice. 
Rust is efficient** Nope, it’s not a war of Python vs. Rust. It’s Python WITH Rust. Python is here to stay, but Rust will impact (and it’s already happening) the data ecosystem at its core. It won’t probably impact the classic data users as they will still use Python binding for their data tasks. This topic can be an entire blog, but if you want to know more, I’m putting one of my latest YouTube video below. I’ll also do an online talk about this very topic on the 18th of January at the event organized by , register [here](https://www.eventbrite.com/e/state-of-data-2023-tickets-468776622497). It's free! ## ⚔️ Brace yourself for 2023. Here is the recap for part 2 : ✔️Lesson #6: Open-source is the new norm ✔️Lesson #7: The best cloud provider is AWS GCP Azure ✔️Lesson #8: The data stack needs consolidation ✔️Lesson #9: Data engineering/analytics is software engineering ✔️Lesson #10: Python & SQL is nice. Rust is efficient The macroeconomic situation is playing in our favor. Only sustainable products will survive. That will clean up the ecosystem a bit. Make sure you evaluate new shiny tools only when needed and double-check the maturity and viability of the product because who knows if it’s still going to be there in a year… May the data be with you. Thanks for reading Mehdio's Tech (Data) Corner! Subscribe for free to receive new posts and support my work. --- ## 10 Lessons Learned In 10 Years Of Data [1/2] URL: https://mehdio.com/blog/10-lessons-learned-in-10-years-of Date: 2022-12-30T12:50:19.092 [![](https://substackcdn.com/image/fetch/$s_!0Rg6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F154a7e7f-19b5-4c8c-9eb4-88f66215c28b_1024x1024.webp)](https://substackcdn.com/image/fetch/$s%5F!0Rg6!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F154a7e7f-19b5-4c8c-9eb4-88f66215c28b%5F1024x1024.webp) Generated by Midjourney It’s the end of 2022, and a common tradition in the data community is to predict trends in 2023\. But what do you need for predictions? Data. And looking solely at 2022 will not help us too much to give accurate predictions. So let’s go back to 2012. I’ll highlight my lessons learned, and you draw your own prediction for 2023\. Don’t worry; some will be obvious. And, of course, there will be memes. ## 2012 ⏱️ Meet Bob, the Big data engineer [![](https://substackcdn.com/image/fetch/$s_!iS8S!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F57c0c6b4-83d8-48fc-941a-c83e6a00d753_186x526.png)](https://substackcdn.com/image/fetch/$s%5F!iS8S!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F57c0c6b4-83d8-48fc-941a-c83e6a00d753%5F186x526.png) Let the data hype start Bob is happy. His company just invested in an on-premise Hadoop cluster. No more proprietary BI tools. They will be dead in a few years, anyway (right!?). Bob is happy to care about distributed systems rather than business value. A few months, system engineers, and thousand of $$$ later, the cluster is finally ready. Bob is thinking: _“Oh, it would be nice to have a service that does that for us, but what will we do then? 
It will steal our jobs!”_ _**✔️ Lesson #1: Cloud didn’t take our job**_ [![](https://substackcdn.com/image/fetch/$s_!gk4i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F220eaa8c-0b62-4bf7-88ff-ad3670fd1b16_680x383.jpeg)](https://substackcdn.com/image/fetch/$s%5F!gk4i!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F220eaa8c-0b62-4bf7-88ff-ad3670fd1b16%5F680x383.jpeg) Technology doesn’t replace people; it rather changes the way we work. So if you are scared about all these ChatGPT highlights, look at the past and think twice. You definitely will need to adapt as many companies did for the cloud, but you will still get a job to do. ## 2013 ⏱️ Another day in Bob’s Big Data Engineer life [![](https://substackcdn.com/image/fetch/$s_!709I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f253d-9037-4c46-aa64-1bf39395623b_1422x902.png)](https://substackcdn.com/image/fetch/$s%5F!709I!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72f253d-9037-4c46-aa64-1bf39395623b%5F1422x902.png) Today Bob has a big batch job to run that will probably take all resources from the cluster for a while. He kindly warns his teammates. They are ready for a long coffee break. _**✔️ Lesson #2: Unlimited cloud resources can be painful**_ [![](https://substackcdn.com/image/fetch/$s_!WLLj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fde0809c3-b1f4-4248-899b-d8aa74757617_800x805.png)](https://substackcdn.com/image/fetch/$s%5F!WLLj!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fde0809c3-b1f4-4248-899b-d8aa74757617%5F800x805.png) Nowadays, you are not anymore bothering your colleague about on-premise resource limits; you are just burning your credit cards. This is a blessing and a curse. Without any limit on your resources, we tend to avoid any data pipeline/SQL query optimization. Until there’s no more money and your CTO is looking back to the data team where they can save money. Feel familiar? Given the tough economic times, I believe it’s here to stay. ## 2016 ⏱️ Bob, ~~the Big Data Engineer~~ Data Scientist [![](https://substackcdn.com/image/fetch/$s_!9Has!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1c56308e-c73a-4bc5-bbb1-5fa30b072a4f_1098x700.png)](https://substackcdn.com/image/fetch/$s%5F!9Has!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F1c56308e-c73a-4bc5-bbb1-5fa30b072a4f%5F1098x700.png) Machine learning, Machine learning Data is liquid gold, and all we need is a bunch of Ph.D. Data Scientists to make this happen. At least, that’s what we thought. 
_**✔️ Lesson #3: Data Science was a dream**_ [![](https://substackcdn.com/image/fetch/$s_!Va__!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fddcffa61-0722-4a27-b078-0d539789cdc6_300x168.jpeg)](https://substackcdn.com/image/fetch/$s%5F!Va%5F%5F!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fddcffa61-0722-4a27-b078-0d539789cdc6%5F300x168.jpeg) I like this meme above because I feel many companies were blinded by the data maturity of big tech companies. They got fooled, thinking they could easily do the same. Note that we should probably talk more about failures rather than successes at conferences and meetups. Today the hard truth is that everybody knows you need a strong data foundation before doing anything fancier than basic analytics. We understood that we need to be humble with our data maturity, and that’s okay. ## 2018 ⏱️ Bob likes Notebooks [![](https://substackcdn.com/image/fetch/$s_!4efF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7735a1-0781-49f3-bf53-c932c8b75472_530x772.png)](https://substackcdn.com/image/fetch/$s%5F!4efF!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F4a7735a1-0781-49f3-bf53-c932c8b75472%5F530x772.png) Bob is happy with Jupyter notebooks. No need to know software engineering, just a few lines of python, and it’s working. _**✔️ Lesson #4: Meet the users where they are but not too much**_ [![](https://substackcdn.com/image/fetch/$s_!JoyG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8122b070-cadf-4835-a687-2fd8f81e9dbb_565x321.jpeg)](https://substackcdn.com/image/fetch/$s%5F!JoyG!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8122b070-cadf-4835-a687-2fd8f81e9dbb%5F565x321.jpeg) The “it works on my machine” black hole. Notebooks are great as they lower the technical barrier to entry to data. But when a tool is easy to use, it often hides complexity elsewhere. Jupyter notebook, in this case, bypasses most of the software engineering best practices like versioning, testing and code reusability. Yes, there are workarounds. Yes, tons of Saas companies are working on this. But in 2018, we just thought it was the holy grail until we tried to go into production. So the bottom line is: yes, we need more tools that are easy to use, but users need to upskill themself at a minimum to understand that handling data needs software engineering foundations. ## 2019 ⏱️ Bob, the ~~data scientist~~ data engineer. 
[![df](https://substackcdn.com/image/fetch/$s_!nJJv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13f8d750-3698-4af6-9300-b48f7fb5de3c_856x512.png "df")](https://substackcdn.com/image/fetch/$s%5F!nJJv!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13f8d750-3698-4af6-9300-b48f7fb5de3c%5F856x512.png) Just a role name change for a salary upgrade At this point, Bob is just following the market buzzword, which is fair. Most of the data engineers in 2022 that started earlier than in 2019 did the same. At least I did. _**✔️ Lesson #5: Data engineer role is too wide**_ Why? Probably because a lot of data engineers that started before 2019 as data scientists ended up taking that part of responsibilities while still keeping the old ones. Add to that the explosion of tooling and frameworks, and data engineer was the default role where we would put all new responsibilities for data needs. Infrastructure? Data engineer. Data pipelines? Data engineer. Analytics? Data engineer. MLops? Data engineer. Data Observability? Data engineer. And the list goes on. If you look at job offers today, you will get a lot of different definitions. I touch down on this topic while explaining which role name we can use to navigate through this mess in the video above. ## 🔄 Recap ✔️ Lesson #1: Cloud didn’t take our job ✔️ Lesson #2: Unlimited cloud resources can be painful ✔️ Lesson #3: Data Science was a dream ✔️ Lesson #4: Meet the users where they are but not too much ✔️ Lesson #5: Data engineer role is too wide Alright, that’s all for Part I, folks. There are already way too many memes in this blog post. Part II will cover from 2019 to 2022, which literally feels like a decade in data as so many things happened… and many lessons learned too. May the data be with you. --- Thanks for reading Mehdio's Tech (data) Corner! Subscribe for free to receive new posts and support my work. --- ## You Don't Have Big Data; You Have Bad Data Lifecycle Management URL: https://mehdio.com/blog/you-dont-have-big-data-you-have-bad-data-lifecycle-management-e459b0e1e84f Date: 2022-11-28T01:08:46.609 #### Storage is not always cheap [![](https://substackcdn.com/image/fetch/$s_!v7xu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c25431-c6d1-4dbc-b145-197374f87052_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!v7xu!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c25431-c6d1-4dbc-b145-197374f87052%5F800x533.jpeg) \[Digital image\] by NeONBRAND Digital Marketing, [https://unsplash.com/photos/KZs5Bt5VDng](https://unsplash.com/photos/dDvrIJbSCkg) We are in the data gold era. [The global data sphere](https://www.statista.com/statistics/871513/worldwide-data-created/) produced around \~2 zettabytes in 2010 and \~64 in 2020 (x32!). The growth of data volume is exponential. As individuals, we produce and consume an insane amount of data. As a company, we want to leverage these to make proper decisions and smarter products. Given this, it’s easy to fall into the trap of storing data just for the sake of storing. With today’s market conditions, data teams will be challenged more and more on their Total Cost of Ownership. “Storage is cheap” can be a myth. Let’s find out why. 
### 🤔 Where all the fuss started

Let's get a small reminder of where "big data" comes from. It refers to the 3 Vs: Velocity (the speed at which you need the data), Volume (size), and Variety (the variety of your data sources). In this article, we will focus on only the most underrated dimension: volume.

### 💨 Data collection & the downfall of ELT

The de facto strategy nowadays is to capture almost `any` data because we have "potential" future use cases but no active users/use cases. And if we don't capture it, it's lost. That's where [ELT](https://glossary.airbyte.com/term/elt#:~:text=elt/) (extract-load-transform) shines and why it took over its older brother [ETL](https://glossary.airbyte.com/term/etl-vs-elt/) (extract-transform-load) these past years: we needed a more flexible way to ingest the data without having to transform and model it first. But abusing this pattern ends up with a lot of data collected but unused.

### 🎌 Common traps

Here are the most obvious ones I've seen repeatedly:

* Too many data copies for development/experimentation purposes without an expiration date: the data stays there as a ghost.
* Not doing any file conversion on raw `csv`/`json` files: we have many file formats more appropriate for storage and querying (e.g., Parquet, ORC, Avro…).
* Too many useless snapshots.
* Too much staging pipeline data not being pruned.
* Too much re-loading into the data warehouse without leveraging the lakehouse/direct queries on the data lake.
* Too much data captured that is not used at all.

The last one is hard to tackle, as we could have a use case in the future. However, when is this future going to happen? Challenge yourself with the following questions:

* Is this use case really valuable versus the operational overhead?
* Is there any way to backfill this data if we don't capture it now?

Having a knowledge-first approach helps you decide whether you need to capture this data in the first place.

> But wait, isn't storage cheap?

Yes, it is. But because it's so easy to store, we tend to abuse it, and bad patterns are also easy to adopt. Storing a huge amount of data still adds up.

### 🫰 Embracing FinOps

As discussed above, there are different ways to solve the issue, but the mindset is to adopt FinOps from day one. That means, for instance:

* Creating cloud budget alerts.
* Creating consumption dashboards.

For the latter, ask yourself the following questions:

* What are the biggest object storage buckets?
* What are the biggest tables?
* When was this big bucket/table last accessed?

Clear monitoring and alerts will help you detect when things go sideways rather than react when the bill is already there. Plus, having a clear view of your costs is always something your leadership will like, as it makes budgets easier to plan (for both headcount and resources). Bonus: if you use AWS S3, you should check out [intelligent tiering](https://aws.amazon.com/s3/storage-classes/intelligent-tiering/). It will automatically move your data to cheaper storage based on your configuration.

### 🩹 ETL to the rescue?

Aside from FinOps, we also saw how the ELT paradigm can be misused. ETL can help here too, because you can do _some_ of the transformations rather than storing the raw data as it is. For example, if you join some sources and just transform them into a more appropriate file format (e.g., Parquet), there is little to no value in keeping the raw data once you've transformed it. In any case, it's good to keep in mind the tradeoff with ELT and rethink things if your storage costs are exploding.
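Coming back to the FinOps mindset, here's a rough sketch of what codifying a couple of these rules can look like with boto3. The bucket name and prefixes are made up, and the exact rules should of course match your own layout and retention needs.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefixes; adapt to your own data layout
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                # Raw landing zone: let S3 pick a cheaper tier automatically
                "ID": "raw-to-intelligent-tiering",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            },
            {
                # Dev/experiment copies: give them the expiration date they never had
                "ID": "expire-dev-copies",
                "Filter": {"Prefix": "dev/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```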
### 💸 Everything is cheap until it's a priority

As you can see, the biggest danger with storage is the common belief that storage is cheap, and therefore none of the bad patterns we listed are a priority to focus on, until your credit card budget is burned. However, today you can start budgeting and monitoring from the very start of your data journey. No need to go crazy on this, but it should evolve as your use cases and data maturity evolve, so that you keep the costs tight and the added value high.

Want to connect? **Follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**, 🔗 **[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content!

---

## Data Contracts — From Zero To Hero
URL: https://mehdio.com/blog/data-contracts-from-zero-to-hero-343717ac4d5e
Date: 2022-09-09T07:05:48.316

#### A pragmatic approach to data contracts

[![](https://substackcdn.com/image/fetch/$s_!DLNf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9507f970-3475-43df-b341-ed085b198802_512x375.png)](https://substackcdn.com/image/fetch/$s%5F!DLNf!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9507f970-3475-43df-b341-ed085b198802%5F512x375.png)

Writing data contracts — Image by the Author, generated with Stable Diffusion.

Recently, there has been a lot of noise around data contracts on social media. Some data practitioners [shared opinions](https://www.youtube.com/watch?v=4BEpYAp3Qu4) about the pros and cons, but mostly about what data contracts are and how to define them. While I think data contracts are a wild topic, I wanted to share my experience with pragmatic tips on how to get started. Data contracts are something real and valuable that you can start leveraging today with less effort than you think. But why do we need them in the first place?

### 🔥 What's the fuss about data contracts?

#### Being proactive instead of reactive

If you work in data, chances are high you have faced this problem multiple times: the data is wrong, and you have no idea why. There seems to be a problem upstream in the data, but none of your internal colleagues knows why. So what do we do? Who should we contact?

> How did we end up there?

With data not being a first-class citizen, data teams mostly start building analytics on an existing infrastructure that serves other, initial goals. They will "plug" their pipelines against an existing operational database, offload data to a warehouse, and handle the rest.

[![](https://substackcdn.com/image/fetch/$s_!j94O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526a5a39-f55d-425f-a4f2-2cd9f6273320_800x525.png)](https://substackcdn.com/image/fetch/$s%5F!j94O!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F526a5a39-f55d-425f-a4f2-2cd9f6273320%5F800x525.png)

Image by the Author

Data teams are stuck between the hammer (the operational databases they have no control over) and the anvil (the business screaming out their needs). They can do some magic to some extent, but [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage%5Fin,%5Fgarbage%5Fout). The more problems you have upstream, the more challenging it will be for data teams. That's where data contracts can help.
Data teams get an explicit way to ask for what they need and a stricter process to handle change management.

### 📎 How do we implement such contracts?

#### What if we could redo everything from scratch?

It seems unrealistic at first, as you rarely have the opportunity to start with greenfield infrastructure. However, with today's cloud technology, it's not so far-fetched. An event-driven architecture can help support data contracts for multiple reasons:

* Events can be strongly typed, and each event can be associated with a schema version.
* It's cheap if you use a serverless event stream, and the infrastructure is self-contained (per topic).
* Event platforms (aka pub/sub) offer built-in connectors for classic downstream data consumption (object storage, data warehouse).

Technologies like AWS Kinesis, Kafka (with managed Kafka like AWS MSK or Confluent), or Cloud Pub/Sub are good options to get you started. The idea is to create a brand-new contract with the backend and agree on what best fits the (data) consumers' needs. Backend folks often have use cases for event-driven patterns outside analytics, for instance communicating between microservices. Two options here:

1. Make compromises on the schema so that it fits both data analytics and their use case
2. Create an event that's dedicated to the data analytics use case

Going for option 1 avoids an explosion of event types created at the source, but changes may be a bit harder to discuss, as more stakeholders will be involved.

[![](https://substackcdn.com/image/fetch/$s_!2SQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd7b2cd-92d1-4dcf-b0de-078a9618cb1c_800x525.png)](https://substackcdn.com/image/fetch/$s%5F!2SQV!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fd7b2cd-92d1-4dcf-b0de-078a9618cb1c%5F800x525.png)

Image by the Author

#### Defining a process for creating/modifying a contract

Most event platforms like Kafka or AWS MSK come with a schema registry (the AWS Glue registry in the case of AWS). For each topic created, you will need to register a schema. An easy way to implement such a process between data producers and data consumers is to reuse a **git process**.

[![](https://substackcdn.com/image/fetch/$s_!AUgP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3a951d-f6c5-48f1-b22b-94fe1a55ddbd_800x306.png)](https://substackcdn.com/image/fetch/$s%5F!AUgP!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f3a951d-f6c5-48f1-b22b-94fe1a55ddbd%5F800x306.png)

Image by the Author

All schema creation/change/deletion can go through a git pull request. With clear ownership of, and consumers for, each topic, you quickly know who can approve changes to the schema. On merge, the CI/CD pipeline picks up the change and deploys the corresponding schema. The beauty of such a process is that **it forces the discussion to happen** before making any change.

### 🚀 Production checklist

Here are a few recommendations when implementing data contracts with an event bus.

#### Tip 1: Please, use a typed schema

It's a pain to maintain JSON schemas. Too much freedom. A common standard for typed events is to use [Avro](https://avro.apache.org/). It's supported by all schema registries and has a lot of interoperability with other processing engines (Flink, Spark, etc.) for further transformation.
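To make Tip 1 concrete, here's a minimal sketch of a typed event schema with Avro, using the `fastavro` library to check that a record actually respects the contract. The event name and fields are made up for the example; your schema registry would store the same definition.

```python
import io
from fastavro import parse_schema, writer

# A made-up "order_created" event contract
schema = parse_schema({
    "type": "record",
    "name": "order_created",
    "namespace": "shop.orders",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "EUR"},
        {"name": "created_at", "type": "long"},  # epoch millis
    ],
})

# Serializing a record fails loudly if it doesn't match the contract
buffer = io.BytesIO()
writer(buffer, schema, [
    {"order_id": "o-123", "amount": 42.0, "currency": "EUR", "created_at": 1678871456000}
])
```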
#### Tip 2: Don't go crazy on nested fields

As we usually analyze data in a columnar format, having too many nested complex fields can be challenging for schema evolution and expensive to process. If you have a lot of nested fields, think about splitting the event into multiple events with specific schemas.

#### Tip 3: BUT you can make some compromise(s) on nested fields

If the producer is unsure about the full schema definition (e.g., it depends on 3rd-party APIs), you can go as far as you can in the definition and leave the unknown rest as a JSON string. It will cost you more compute to explode/access such fields, but it leaves more flexibility on the data producer side.

#### Tip 4: Set up extra metadata fields in the event

Things like `owner`, `domain`, `team_channel`, or identifying PII columns with specific fields will be helpful later for clear ownership, lineage, and access management. [Schemata](https://github.com/ananthdurai/schemata) is a good resource to use or get inspiration from for schema event modeling.

#### Tip 5: Don't change a data type on a given field

It's better to rename a field with a new type. While we can have a mechanism downstream to detect schema versions, allowing a type change on a field without renaming it will always cause headaches. If you accept one case, you will have to handle all the others. So if changing an `int` to a `string` is not hurtful, what happens when you change an `int` to a `float`, or a `float` to an `int`?

#### Tip 6: You can still implement data contracts without an event bus

If you have a place in `git` where you keep all the DDL statements for your operational database, you can still implement most of the above. For instance, on any change done to the database, a git process alerts the consumers, who will need to approve. However, it's a bit harder, as you are putting a contract on something that already exists, where the data team didn't have the opportunity to speak up when the schema was created.

### 🪃 Give back ownership to the data producer

Data contracts are, in the end, a way to **give back ownership to data producers** rather than having data teams suffer from whatever data we throw at them. And this is great; it makes life easier for everything downstream and **avoids silos between products and data**. The biggest challenge is organizational. Data teams must **cross the barrier and talk with the backend about new processes**, which can be scary. Highlighting the current pain points and bringing visibility into how the data is consumed helps drive the discussion. For the tooling itself, things can be set up progressively using an event platform (**pub/sub service**), **a schema registry,** and **git** for the data contracts process. Find a suitable project sponsor within your company and implement the pipeline from end to end. There's no need for a big bang migration; start with a small event and extend the pattern from there!

#### 📚 Further reading

[Implementing Data Contracts: 7 Key Learnings](https://barrmoses.medium.com/implementing-data-contracts-7-key-learnings-d214a5947d5e) by [Barr Moses](https://www.linkedin.com/in/barrmoses/)

[The Rise Of Data Contracts](https://dataproducts.substack.com/p/the-rise-of-data-contracts) by [Chad Sanderson](https://www.linkedin.com/in/chad-sanderson/)

### Mehdi OUAZZA aka mehdio 🧢

Thanks for reading! 
🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## What Open Source Can Do For Your Data Career URL: https://mehdio.com/blog/what-open-source-can-do-for-your-data-career-53ecb747c111 Date: 2022-08-04T08:35:44.376 #### And you don't need to code to get started. [![](https://substackcdn.com/image/fetch/$s_!R-tX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbafc4d6e-8ac6-48b5-b3f6-dbbf45f5b5d2_800x450.jpeg)](https://substackcdn.com/image/fetch/$s%5F!R-tX!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbafc4d6e-8ac6-48b5-b3f6-dbbf45f5b5d2%5F800x450.jpeg) \[Digital image\] by Kushagra Kevat, Today's data world uses many open source tools, from data engineering to data science and deep learning. This is a unique opportunity to grow your career in multiple ways. In a recent podcast of [DataTalks.Club](https://datatalks.club/), [Merve Noyan](https://www.linkedin.com/in/merve-noyan-28b1a113a/), shared how she went from baby steps on GitHub to Developer Advocate Engineer at HuggingFace. Inspired by the talk, I wanted to give my reflection on how open source has helped me throughout my career so far, up to Staff Data Engineer. ### 🍼 Baby steps in Open Source #### Reproducible issue Contributing to open source can be scary. Where do you start with an unknown codebase, a different way of working, and a lot of automation during the PR? Well, you don't have to code to do your first steps! The first time I faced open source was because I had an issue with a python package. After no luck on StackOverflow, I decided to look at the GitHub issues. A similar issue was already there but with poor context, so I commented with an extensive how-to reproduce this one. One day later, a fix came out from the maintainer 🎉 Have a problem with a library? Here's what I usually do : 1. Go directly to the GitHub repo and check the documentation. 2\. If there's nothing helpful in the documentation, I'm searching for any issues. **Don't forget to remove the default filter** `open` and search through all issues. Most of the time, you must dig into the closed issues to find relevant information. 3\. If there's no existing/related issue, I'll open one. I'll spend enough time providing all information needed to reproduce it. This process is so underrated but so valuable for the maintainer. Having multiple data points of a problem with clear steps on how to reproduce is 50% of the work towards a solution. #### Promotion and documentation support There are other ways to get involved without coding : * Update documentation * Helping on StackOverflow * Share it on social media (Twitter/Linkedin) These will also allow you to exchange knowledge and meet incredible people online. ### 🦸 Next Level #### Your first coding tutorial Before committing to someone's code, why not share your knowledge along a “hello world” project? This is less scary because you are in control of everything, and it doesn't need to be crazy in terms of features. The main goal is to teach something. It can be a blog or a video, but it's always better to have the code repository pushed somewhere. 
Here are some personal examples of [written coding tutorials](https://betterprogramming.pub/your-next-container-strategy-from-development-to-deployment-66167c0d028a) and [videos](https://www.youtube.com/watch?v=DxTEzywnBOc&t) I did. Bonus: reach out to the creator of whatever you are covering; sometimes, they will be super happy to re-share and highlight your work!

#### Your first library

It doesn't need to be a great library that millions of users will download. It can be something you created to **solve a specific problem you encountered**. If you face a challenge, chances are high that someone else faces the same one. It doesn't even need to be a library. It can be a framework, a code snippet, or a boilerplate. That's what I did with a [pyspark boilerplate](https://github.com/mehd-io/pyspark-boilerplate-mehdio). I wanted a simple boilerplate I could reuse across different projects. Nothing perfect or fancy, but it solves a problem I have.

#### Your first Pull Request (PR)

Now that you've been contributing solo, you are ready to look at someone else's project. It's worth looking at the `good first issue` label on GitHub and **starting the discussion before implementing anything.** It can be frustrating to have your PR rejected because it's not in line with the design decisions. Merve Noyan highlighted that maintainers will always be happy to discuss with you, as they respect your time and commitment to the project. Several recurring events also promote open-source contributions. Here are a few of them:

* Contribution sprints: many open-source projects have dedicated contribution sprints where maintainers focus their time on onboarding and helping new contributors.
* Hacktoberfest
* Google Summer of Code

[![](https://substackcdn.com/image/fetch/$s_!GG3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57cdc6f8-47f5-4f08-b4ad-3cb4823026b2_643x737.png)](https://substackcdn.com/image/fetch/$s%5F!GG3c!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57cdc6f8-47f5-4f08-b4ad-3cb4823026b2%5F643x737.png)

Image by the Author

### 📣 Promote your project

Nobody wants to git clone your repo and read your README just to set up your project. The last mile is to deploy your project so users can easily use it. If your project is a library, push it to the appropriate places (e.g., PyPI for Python). For other kinds of projects, there are a couple of platforms that can help you:

* [Kaggle](https://www.kaggle.com/) provides a notebook runtime to show off your projects
* [HuggingFace Space](https://huggingface.co/docs/hub/spaces) offers a simple way to host ML demo apps
* [Streamlit](https://streamlit.io/) turns data scripts into web apps in a few minutes

### 🌟 From contributing to Open Source to landing your dream job

There's a great secret about doing work in public: it's public. Anyone can look it up. It could also speed up technical interviews, as you may have already proven your abilities through some PRs. Some companies even offer you the opportunity to do a public PR on an open-source project they own. It's great because your coding test stays visible for your other interviews.

### 🚀 Go contribute!

There has never been a better opportunity to contribute to Open Source. There are tons of projects. Many platforms lower the technical barrier to deploying and showcasing your work. And everything you do will be public, which is gold for future reference. 
So don't hesitate, and make the leap! ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## Meet Your Future Data Mentors URL: https://mehdio.com/blog/meet-your-future-data-mentors-6cb4066db83a Date: 2022-06-09T16:05:53.757 #### Story of datacreators.club, a hub to discover 100+ data content creators [![](https://substackcdn.com/image/fetch/$s_!2BgW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25aff1ec-2bd2-4acc-b3c1-1a8e13266239_800x564.png)](https://substackcdn.com/image/fetch/$s%5F!2BgW!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25aff1ec-2bd2-4acc-b3c1-1a8e13266239%5F800x564.png) screenshot of [datacreators.club ](https://datacreators.club/)website — Image by the Author Technology is moving blazing fast. The open-source era increased the speed of new tools adoption, and when we seek knowledge, it mostly comes from tech individuals online. YouTube has seen tremendous growth, claiming [500+ hours of content uploaded every minute.](https://blog.youtube/press/) Medium reported an increase of 106% in writers in [2020](https://medium.com/creators-hub/2020-by-the-numbers-473c8bf52207). And that’s just a few numbers and platforms among the ocean of information out there. So how can you find relevant data content creators without being drowned? That’s the story of the data creators club. ### 🧙‍♂️Find your mentor I’m an eager learner, and I do daily technology watch. Medium was one of my first sources of “creators learning” outside of online course providers. As I’m working as a data engineer, getting fresh information about the technology is tough as everything is (always?) new. It was difficult for me to find a proper technical mentor where I was working for that specific reason. Therefore, I decided to find my own online. ### [📊](https://emojipedia.org/bar-chart/) More consumption, more platforms Over the past years, I started to spend more time-consuming tech content and tutorials on Medium, YouTube, and, more recently Linkedin. I learned so much from the data folks sharing content, that I decided to start sharing my knowledge too! Almost every week, some data gurus highlight new data content creators and their fantastic work. And there are so many small creators providing high-value content from the start! So why don’t we have a single, sustainable place to find them, no matter their channels? ### 🕸️ Datacreators.club [A simple website](https://datacreators.club/) to search among a list of more than 100+ data content creators online. You can filter by channel (YouTube, Medium, Twitter,…) and topic (e.g : data engineering, data science, machine learning). There’s a form on the website to suggest a data creator if it’s not already there! 
[![](https://substackcdn.com/image/fetch/$s_!HLPz!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9ae697e-424c-4a43-ba7e-fa75184f3356_1370x1192.gif)](https://substackcdn.com/image/fetch/$s%5F!HLPz!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9ae697e-424c-4a43-ba7e-fa75184f3356%5F1370x1192.gif)

Demo of [datacreators.club](https://datacreators.club/) — Image by the Author

### 💙 Positive response from the data community

Since its initial launch, datacreators.club has had 1,000+ views in less than a week and 20+ data creator submissions! I was thrilled to also discover new content that I wasn't aware of.

### 🏗️ What's next

Hopefully, we will get more suggestions of data content creators, so the site becomes the best place online to find your data creators. I also received some requests to list content in languages other than English, so this feature will soon be enabled on the website! For the rest, I'm still listening to feedback, and if you have any ideas, don't hesitate to DM me! On my side, I'll continue to do my best to produce quality data content. Happy learning 👨‍🎓

```
Follow me on 🎥 YouTube,🔗 LinkedIn
```

Subscribe to DDIntel [Here](https://ddintel.datadriveninvestor.com/).

---

## Testing Your Terraform Infrastructure Code With Python
URL: https://mehdio.com/blog/testing-your-terraform-infrastructure-code-with-python-a3f913b528e3
Date: 2022-05-25T15:34:46.25

#### Let's cover an API use case with Terraform HCL & Python

[![](https://substackcdn.com/image/fetch/$s_!vEp6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa575cc0c-f50d-4d5d-965f-918aa60f8cb2_800x450.png)](https://substackcdn.com/image/fetch/$s%5F!vEp6!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa575cc0c-f50d-4d5d-965f-918aa60f8cb2%5F800x450.png)

Image by the author

Today, most infrastructure code is written with [Terraform](https://www.terraform.io/). It's been around for quite a while, has a strong community, and it's multi-cloud. However, things get tricky when it comes to testing Terraform code. While Terraform uses its own language (HCL), its backend is written in Golang. A good pattern for Terraform module tests is [terratest](https://terratest.gruntwork.io/), but as you may have guessed, you will need to write these in Golang. Here we're going to showcase how we can use plain Python with a powerful yet simple library, [tftest](https://github.com/GoogleCloudPlatform/terraform-python-testing-helper/), on existing Terraform HCL code.

### Tftest

[Tftest](https://github.com/GoogleCloudPlatform/terraform-python-testing-helper/) is a small Python library from Google. It enables you to run Terraform actions (plan, apply, destroy) programmatically and retrieve the execution plan, output variables, etc. The power of tftest lies in its combination with `pytest`. Besides, Python has really good SDK support for the different cloud providers, which makes it nice for testing cloud infrastructure.

### Case study

Our setup will involve a simple [Cloud Run API](https://cloud.google.com/run) (serverless container runtime), but you can apply this method to any infrastructure you deploy! 
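Before diving into the two scenarios, here's a minimal sketch of the shape of such a setup with `tftest` and `pytest`. The module path and output names (`infra`, `image`, `service_url`) are illustrative, and the e2e fixture leaves out the Cloud Run auth wrapper for brevity.

```python
import pytest
import requests
import tftest


@pytest.fixture(scope="session")
def plan():
    tf = tftest.TerraformTest("infra")  # path to the Terraform module under test
    tf.setup()                          # runs `terraform init`
    return tf.plan(output=True)         # parsed plan: outputs, variables, resources


def test_image_output(plan):
    # Assert on plan outputs/variables without touching real infrastructure
    assert plan.outputs["image"].startswith("gcr.io/")


@pytest.fixture(scope="session")
def deployed():
    # e2e flavour: apply before the tests run, destroy afterwards
    tf = tftest.TerraformTest("infra")
    tf.setup()
    tf.apply()
    yield tf.output()
    tf.destroy()


def test_endpoint(deployed):
    # A real Cloud Run test would wrap this call in an authenticated session
    response = requests.get(deployed["service_url"])
    assert response.status_code == 200
```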
#### Simple infrastructure test

In the first example, we will do a simple test to get our hands on `tftest`. Let's try the following:

* Create a `plan` fixture
* Assert that the output name of our container image is what we expected
* Check that the output variables are as expected

The plan fixture simply points at the Terraform module/directory we want to test, runs it, and makes the output variables available as an object. From there, you can easily retrieve any outputs and/or variables to perform whatever tests you would like!

#### Advanced e2e infrastructure test

We are now going to do an e2e test that does the following:

* Deploy the API on Cloud Run
* Get the URL of the deployed service
* Generate an auth session ready for requests
* Perform the request and assert the response
* Destroy the API

The first three points can again be put in a fixture, to keep our test function minimal and to be able to reuse it for other requests. This time, our fixture will do an `apply` and a `destroy` at the end of the test. It will also generate an auth session based on the deployed URL to be allowed to perform a request. We wrap this auth session in a `request_wrapper` function. The fixture goes into `conftest.py`, and the test file that performs the request and asserts the response is kept to a minimum, so the fixture can easily be reused to test other endpoints.

### Caveat

This setup works great for any serverless components (like Cloud Run) that don't have a long cold start. Some cloud services can take up to 15–20 minutes to be ready, making it tedious to include them as part of a CI pipeline.

### Isn't there a better solution?

The setup presented here works great on an existing Terraform codebase. However, if you are starting a new project, there are more appropriate solutions. Terraform already has [Terraform CDK](https://www.terraform.io/cdktf) in beta, which allows you to use Python (or any other supported programming language) directly to declare your infrastructure, which makes testing much easier. [Pulumi](https://www.pulumi.com/) is also a great candidate, and it's more mature on the CDK side. If you want to work with AWS only, you can also use the AWS CDK, but you lose the benefit of investing knowledge in an IaC framework that isn't vendor-locked. In any case, it's nice to finally be able to use a standard programming language to manage infrastructure too, as you can directly leverage all the testing toolkits included!

Happy testing.

_Link to the full demo GitHub repository [here](https://github.com/mehd-io/cloudrun-terraform-tftest-demo)._

```
Want to Connect?
Follow me on YouTube or LinkedIn
```

---

## Job Hopping As A Software Engineer — Should You Do It?
URL: https://mehdio.com/blog/job-hopping-as-a-software-engineer-should-you-do-it-c71a39390a29
Date: 2022-04-08T12:52:22.853

#### Why Job Hopping Now Is Intentional, Not Impatient (And What You Need To Know)

[![](https://substackcdn.com/image/fetch/$s_!c6G8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff25346ad-a550-4d9f-8db0-16718c357d85_800x660.png)](https://substackcdn.com/image/fetch/$s%5F!c6G8!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff25346ad-a550-4d9f-8db0-16718c357d85%5F800x660.png)

Even Super Mario sometimes takes shortcuts — Image by the Author. 
“We are in the most heated tech hiring market of all time,” [commented the Pragmatic Engineer](https://blog.pragmaticengineer.com/off-cycle-compensation-adjustments-for-software-engineers-in-2021/) based on incontestable fresh data points. We, software engineers, are in an exceptional situation that may be comparable to the dot-com bubble. As things are moving fast, so are our careers and job tenures. I have had a couple of employers these past years and wanted to reflect on my personal experience and discuss the elephant in the room: is job-hopping a good or bad thing?

### We don't marry our employer anymore 💒

Let's step back for a minute and consider work culture in general. Most of us will have more than one employer during our career, but it's somewhat unclear whether people change jobs more often than they used to. It isn't straightforward to get decent numbers, for the following reasons:

* We don't have enough data (yet): job-hopping is still a relatively new trend
* Covid: an unexpected event that's changing the market heavily and how we work (remotely)

However, according to a [2016 LinkedIn survey](https://blog.linkedin.com/2016/04/12/will-this-year%5Fs-college-grads-job-hop-more-than-previous-grads), job-hopping is the new norm, for millennials at least. Full disclosure: I'm one of them. Past notions of job permanence are gone. We are starting to acknowledge the merits of job-hopping. Based on all the interviews I have done these past years, I can confirm that I never had anyone pointing at the number of jobs on my resume. I even had some big tech companies asking me: "What do we need to do for you to be happy here for two years?" It was a revelation to me, as they had done their homework. Some companies know the tech debt that an employee leaving too fast creates, and when the return on investment is worth it.

### Covid accelerates the ease of change 🏃

Interviews can be done within the day without literally moving from your computer! I remember going on-site and signing the contract with pen and paper 😱. There were a lot of physical actions and processes that required more effort and commitment from you than today. Because remote work is becoming more popular, there are even more opportunities than there used to be. Though also more competition. Job-hunting tools have become more standard and easier to use: LinkedIn, Glassdoor, Indeed. They also give you good insights about the company you would apply to.

### Career advancement ⏭

Changing jobs allows you to potentially pursue a higher-level career at another company. It can also grant you opportunities to learn new skills, gain practical experience, or be given more responsibilities. Changing jobs can help you advance your career without spending months or years waiting for a promotion.

> Apply for jobs that you aren't 100% qualified for. That's how you grow!! \[[Zach Wilson](https://www.linkedin.com/in/eczachly)\]

The best way to evaluate yourself is to get out there and try to reach a specific position through the interview process. You get a better feeling for what to improve. Your employer probably has quarterly/bi-annual reviews, and that's valuable feedback. However, the best is to compare that to the actual market, as it gives you even more valuable data points. You know what you are capable of and the state of the current market. We also increasingly value experience itself over years of experience. While we still see _+years_ of experience on some job ads (which still makes me laugh), I believe things are changing slowly. 
Especially in our field, where things are changing so fast in terms of hard skills. For the same position in a given year, you could:

* Use the same programming language and framework, and reach a level that is just good enough for what you need to do (so tricky to actually get better)
* Learn to code from scratch with free resources and get a job
* Deliver impactful projects that could reach millions of users

So yes, I believe years of experience in technical skills are not really relevant. It’s what you do with them that matters.

### Higher salary 💸

When you apply for a job, you are not respecting a precise promotion cycle, and if you pass the interview for a higher position, congrats, you just got promoted. I would however point out that this is not the main advantage in the short term. I believe taking a job that gives you a significant career opportunity (but maybe a lower salary) will end up paying a higher salary in the long run anyway, if you make good decisions. I myself, in the past, jumped on the big data wave as the hype was starting. I was pretty junior in that field and I didn’t mind losing 20% of my salary as long as I could get a big data project. My salary grew even more rapidly a few years later.

### Adaptability 🦎

You build new relationships with a new team every time you change and learn a whole new way of doing things. You improve your communication and adaptability skills, both of which are considered valuable [soft skills](https://www.indeed.com/career-advice/resumes-cover-letters/soft-skills) for any software engineering job. You also become more resilient to failure. It’s never fun to receive a rejection after already spending a lot of time in an interview process. However, that’s part of the process, and with the right mindset, you will rise again, stronger, like a phoenix from the ashes. It also increases your job security in general, because if something happens (you get laid off, the world is crumbling apart, …), you will be quicker to get back on your feet and find something else.

### The cons ❎

#### Too many different roles

Job hopping across widely different roles may not be a good idea, as you need a consistent stretch of experience in a role to grasp its fundamentals.

#### Jumping too fast

In my opinion, a good rule of thumb (in the tech/software industry) is that leaving before one year may not be good, unless you’re in a toxic environment (in which case you should leave for the sake of your mental health) or you have a golden opportunity that only shows up once in a lifetime.

#### Loss of benefits

* In tech companies, equity is basically a retention tool, so depending on the company's current value and vesting schedule, you may leave a lot of money on the table.
* Loss of opportunity: sometimes there’s a good opportunity in terms of career or projects, but you have to be patient, and leaving too soon may not be worth it in the mid/long run.

### Conclusion

For me, it’s more about _**how you do job hopping**_ rather than _**not doing it.**_ Here are some of the mental models I use when considering a new job:

* Is it giving me a mid/long-term edge on learning and/or salary and/or position?
* Did I learn enough from my current employment and not give up after the first problem?
* What’s my career projection at my current position in the following 6/12 months? Is it worth the jump?

Answering these questions often helps me decide whether to jump or not.
But aside from that, I highly recommend doing job interviews, no matter if you are looking for something else or not. Yes, it can be exhausting, but it will create golden opportunities over time.

``` Follow me on 🎥 YouTube,🔗 LinkedIn ```

---

## The Key Feature Behind Lakehouse Data Architecture

URL: https://mehdio.com/blog/the-key-feature-behind-lakehouse-data-architecture-c70f93c6866f

Date: 2022-02-21T11:56:22.715

#### Understanding the modern table formats and their current state

[![](https://substackcdn.com/image/fetch/$s_!K70o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad7376a-84cf-4fee-a012-5feffc1dcf28_800x378.png)](https://substackcdn.com/image/fetch/$s%5F!K70o!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ad7376a-84cf-4fee-a012-5feffc1dcf28%5F800x378.png)

The Usual Table Format Suspects — '_Hoodie' (Hudi)_, Iceberg, Delta \[Image by the Author\]

The Data Lakehouse is the next-gen architecture presented in a [Databricks paper](https://databricks.com/wp-content/uploads/2020/12/cidr%5Flakehouse.pdf) in December 2020. A Data Lake can run on open formats like Parquet or ORC and leverage Cloud object storage, but it lacks the rich management features of data warehouses, such as ACID transactions, data versioning, and schema enforcement. Today, we have more options than ever in terms of modern table formats. They all aim to solve these issues and power the Lakehouse architecture. Let's understand what these table formats bring to the table… 😉

### Why? Is Parquet not enough? 🤔

The Data Lake offered great flexibility, but at a cost. You can load (almost) whatever you want into your lake (video, images, JSON, CSV, etc.), but governance is lost. The most wanted feature we missed in the Data Lake is the **ACID** transaction. Let's understand this with a few examples:

* **A**tomic: either the transaction succeeds or it fails. It means no reader or writer will ever see a partially successful transaction that would leave the data in a corrupted state.
* **C**onsistency: from a reader's point of view, if a column has unique values, this is preserved no matter which operation is done on the data source (a constraint on values). If a set of transactions has been committed, two readers will see the same data.
* **I**solation: if two concurrent transactions are updating the same source, it is done as if one ran after the other.
* **D**urability: once the transaction is committed, it will remain in the system even if there's a crash right after.

This is something we used to have on a single database, but on a distributed system (in a Data Lake setup) where everything is on object storage, there's no isolation between readers and writers: they work directly on data files, and we have almost **no metadata** that makes sense to help us achieve ACID transactions. The Lakehouse architecture embraces this ACID paradigm and requires a modern table format.

### The top 3 modern table formats 📑

You probably already guessed it: the modern table formats make heavy use of `metadata` files to achieve ACID. With that in place, they enable different features like:

* Time travel
* Concurrent reads/writes
* Schema evolution and schema enforcement

And of course, storage is always independent of the compute engine, so you can plug in any storage option from your favorite Cloud provider. For instance: AWS S3, Azure Blob Storage, GCP Cloud Storage.
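To make the role of those `metadata` files concrete, here is a minimal, hedged sketch using the `deltalake` Python package (the delta-rs Python bindings); the path and column names are invented for illustration, and Iceberg or Hudi expose the same ideas through their own APIs.

```python
# Two writes = two commits in the _delta_log metadata folder, each one a table version.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_uri = "./events_delta"  # could just as well be s3://..., gs://..., abfss://...

write_deltalake(table_uri, pd.DataFrame({"user_id": [1, 2], "amount": [10.0, 20.0]}))
write_deltalake(table_uri, pd.DataFrame({"user_id": [3], "amount": [30.0]}), mode="append")

dt = DeltaTable(table_uri)
print(dt.version())   # latest version (1 after the two commits above)
print(dt.history())   # commit metadata read from the transaction log

# Time travel: read the table as it was at version 0, before the append.
print(DeltaTable(table_uri, version=0).to_pandas())
```

Schema enforcement relies on the same metadata: a write whose schema doesn't match what the transaction log declares is rejected instead of silently corrupting the files.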
#### Apache Hudi Created at Uber in 2016,[ Apache Hudi](https://hudi.apache.org/) focuses more on the streaming process. It has built-in data streamers, and the transaction model is based on a timeline. This one contains all actions on the table at a different time instance. The timeline can provide time-travel through hoodie commit time. * ➕ Different data ingestion engine supported: Spark, Flink, Hive * ➕ Well suited for streaming process * ➕ A lot of reading engines supported: AWS Athena, AWS Redshift, … #### Apache Iceberg [Apache Iceberg](https://iceberg.apache.org/) started in 2017 at Netflix. The transaction model is snapshot-based. A snapshot is a complete list of files and metadata files. It also provides optimistic concurrency control. Time travel is based on snapshot id and timestamp. * ➕ It has great design and abstraction that enables more potential: no dependency on Spark, multiple file formats support. * ➕ It performs well at managing metadata on huge tables (e.g.: changing partition names on +10k partitions) * ➕ A lot of reading engines supported: AWS Athena, AWS Redshift, Snowflake… * ➖ Deletions & data mutation is still preliminary #### Delta Lake [Delta Lake](https://delta.io/), open-sourced in 2019, was created by Databricks (creators of Apache Spark). It is no surprise that it's deeply integrated with Spark for reading and writing. It's now a major product from Databricks, and some part of it (like Delta Engine) is not open-sourced. Other product like the [Delta sharing](https://delta.io/sharing/) sounds really promising. * ➕ It's backed by Databricks, which is one of the top companies in the data space at the moment * ➖ Really tight to Spark (though this is going to change in 2022 according to their announced [roadmap](https://databricks.com/blog/2021/12/01/the-foundation-of-your-lakehouse-starts-with-delta-lake.html)) #### High-level summary 📓 [![](https://substackcdn.com/image/fetch/$s_!SsC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10fc27e3-dfee-4759-9bb5-cb24e408eab8_800x208.png)](https://substackcdn.com/image/fetch/$s%5F!SsC1!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10fc27e3-dfee-4759-9bb5-cb24e408eab8%5F800x208.png) High-level features as 19.02.2022 \[Image by the Author\] ### What's the general interest? A look at Github 👀 As all projects are open-source, a good data source for evaluating the interest and growth is to look at Github itself. [![](https://substackcdn.com/image/fetch/$s_!DcDl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3505aef0-93e7-4d84-b6c8-d81b1f34c321_800x548.png)](https://substackcdn.com/image/fetch/$s%5F!DcDl!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3505aef0-93e7-4d84-b6c8-d81b1f34c321%5F800x548.png) Github Stars history as 19.02.2022 generated through star-history.com \[Image by Author\] As we can see, these table formats are still really young in the eyes of mainstream data users. Most of the traction appeared these past 2 years when the _Lakehouse_ concept started to emerge. Another interesting insight would be to look at the current number of commits, pull requests, and issues. 
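If you want to reproduce this kind of comparison yourself, a small sketch like the one below is enough; it uses the public GitHub REST API through `requests`, the three repository names are the projects discussed above, and everything else is illustrative.

```python
# Pull a few public stats for the three table-format repos from the GitHub REST API.
# Unauthenticated calls are rate-limited, so this is for a one-off look, not polling.
import requests

REPOS = ["apache/hudi", "apache/iceberg", "delta-io/delta"]

for repo in REPOS:
    resp = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    resp.raise_for_status()
    info = resp.json()
    print(
        f"{repo:20} stars={info['stargazers_count']:>6} "
        f"forks={info['forks_count']:>6} open_issues={info['open_issues_count']:>6}"
    )  # note: open_issues_count also counts open pull requests
```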
[![](https://substackcdn.com/image/fetch/$s_!UTqB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06d7405-a71d-4ea7-bb19-f118acbddf94_800x514.png)](https://substackcdn.com/image/fetch/$s%5F!UTqB!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc06d7405-a71d-4ea7-bb19-f118acbddf94%5F800x514.png) Github Public Data as 19.02.2022 \[Image by Author\] As mentioned above, some features of the Delta eco-system are not open-sourced — which explains the low number of commits compared to the traction. ### Everybody wins as the future is interoperability ⚙️ No matter which format you are going to pick, and no matter who's going to win the end game, it's going in the right direction: we need more open standards in terms of data format to enable interoperability and use cases. It's great to see general adoption on all these formats by both compute and reading engines. For instance, AWS Redshift added support for both Delta Lake and Apache Hudi in [September 2020](https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-redshift-spectrum-adds-support-for-querying-open-source-apache-hudi-and-delta-lake/). More recently, Snowflake [announced support for Apache Iceberg](https://www.snowflake.com/blog/expanding-the-data-cloud-with-apache-iceberg/). Another good thing is that all of them are backed by an open-source community. Interoperability is critical, so the more compute engines will support these formats, the less we will need to pick something that will lock us. ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## The Battle for Data Engineer’s Favorite Programming Language Is Not Over Yet URL: https://mehdio.com/blog/the-battle-for-data-engineers-favorite-programming-language-is-not-over-yet-bb3cd07b14a0 Date: 2022-01-27T18:20:30.992 #### Let's discuss the next contender for 2022 [![](https://substackcdn.com/image/fetch/$s_!2hBH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdccef76b-be51-4023-aab6-e7b1749b26c2_800x450.jpeg)](https://substackcdn.com/image/fetch/$s%5F!2hBH!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdccef76b-be51-4023-aab6-e7b1749b26c2%5F800x450.jpeg) \[Digital Image\] by Jaime Spaniol Picking the right programming language for the right job can be challenging as the technology moves fast, and so many frameworks are popping up. Let's break down these past years and understand data engineers' current programming language ecosystem and the ideal candidate for 2022\. Could **Scala**, **Golang**, or **Rust** be our next favorites? Let's find out. ### The Early Days of Scripting and SQL 💾 When Python was still for most of us only a snake, Perl was quite well used, like other scripting languages. Perl was originally developed for text processing, like extracting the required information from a specified text file and converting the text file into a different form. It was well-fitted for data purposes. 
Perl was an excellent way to interface with SQL databases, which were the standard back then. Remember, there was no cloud and no API-dominated era as we have today. It's important to note that what was mainly powering Perl in data use cases was **SQL**. Perl was used to run these SQL commands against databases.

### Python’s Reign of Dominance 🐍

Today, a data engineer's job is not only about interacting with SQL databases and running queries, but also about the following:

* Managing infrastructure (through infrastructure as code with frameworks like Terraform/Pulumi)
* Developing data pipelines
* Developing microservices/APIs/data frameworks
* Interacting with cloud services SDKs

When moving to Big Data, we saw a lot of usage of Java. And it's still there, either behind the scenes or as the first development API (👋 Trino, Flink, Akka, etc.). Scala tried to rise (with Spark from Databricks), but even if it's more performant at scale and probably a more suitable language for data pipelines, it lacks significant adoption outside the Spark use case. Databricks [reported that most of their API calls are done through Python and SQL](https://towardsdatascience.com/highlights-from-data-ai-summit-2021-3abfd9aaccaa), forcing them to provide similar performance on the Python bindings — another downfall for Scala?

**Python** has massive adoption today, and here's why:

* The learning curve for new programmers is pretty gentle (notebooks help a lot).
* The data science ecosystem: machine learning, visualization, deep learning.
* Cloud adoption: all major cloud providers have a well-supported Python SDK.

Is there something that you can't do in Python?

### The Brighter Future and Rust’s Potential ✨

SQL is here to stay for a while. Even with its limits, it's a low technical entry point to democratize data usage in general, and it's still the easiest way to interact with an SQL/analytical database. Golang seemed to be a good competitor. Terraform and Kubernetes have massive adoption, and both are written in Golang. It's also designed and supported by a major cloud provider: Google. That being said, there aren't that many data frameworks built around Golang, and the learning curve is a significant barrier for the average Python data user to catch up. Who would be the next candidate then? Rust. Here are four non-exhaustive reasons:

#### 1\. General popularity

According to a [Stack Overflow study](https://insights.stackoverflow.com/survey/2021#most-loved-dreaded-and-wanted-language-love-dread), Rust has been the most loved programming language for six years in a row! Google Trends also shows steady growth for Rust and a [general fatigue for Python](https://towardsdatascience.com/why-python-is-not-the-programming-language-of-the-future-30ddc5339b66).

[![](https://substackcdn.com/image/fetch/$s_!kAxF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec66d215-0699-4497-9c70-72ddf04627f9_800x238.png)](https://substackcdn.com/image/fetch/$s%5F!kAxF!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec66d215-0699-4497-9c70-72ddf04627f9%5F800x238.png)

Google trends — Python vs. Rust

Another big piece of news is that Rust will be the [second language of the official Linux Kernel!](https://www.zdnet.com/article/rust-in-the-linux-kernel-why-it-matters-and-whats-happening-next/) That will gain it insane traction.

#### 2\. Performance and low-memory footprint

It's not big news that Python can be slow.
Rust's performance is at another level because it's compiled directly into machine code. No virtual machine, no interpreter sitting between your code and the computer. In our cloud computing era, the footprint of your program on your compute system is directly impacting your costs, but also the electricity usage and therefore the impact on the environment. [An interesting study](https://thenewstack.io/which-programming-languages-use-the-least-electricity/) by the New Stack revealed which programming language consumes the least electricity. Rust is at the top of that list, while Python… at the bottom. #### 3\. Interoperability with Python What if you could rewrite some part of your existing Python code base and still use it through your main Python program? That's combining the best of the two worlds. A concrete use case would be to perform specific actions against s3 files, which can be pretty slow in Python. With [AWS announcing recently their AWS SDK](https://aws.amazon.com/about-aws/whats-new/2021/12/aws-sdk-rust-developer-preview/) in Rust in developer preview, this is something you could perform in Rust. Using a Rust binding for a Python library like [PyO3](https://github.com/PyO3/PyO3) enables you to quickly do a simple interface to call your Rust program within Python! Even Microsoft published a [windows crate](https://github.com/microsoft/windows-rs) that enables [you to access Win32 API’s from Rust!](https://blogs.windows.com/windowsdeveloper/2021/01/21/making-win32-apis-more-accessible-to-more-languages/) #### 4\. A lot of data projects are being rebuilt in Rust [Apache Arrow](https://github.com/apache/arrow-rs) is a key common interface to build data processing frameworks. It has a great Rust implementation, and it’s pushing other data projects to rise: * Spark's Rust equivalent called [data fusion](https://github.com/apache/arrow-datafusion) * Delta Lake[ has a native Rust interface](https://github.com/delta-io/delta-rs) with binding in Python and Ruby. Other big players like Confluent Kafka [offer now a Rust binding](https://www.confluent.io/blog/getting-started-with-rust-and-kafka/). There are many new projects to handle data. It's still in the early stages, but since adoption is growing, we could even see Java no longer be the default choice. ### Is It Worth It, Though? 🤔 Initially, both Rust and Python were built with different goals. The learning curve is steeper for Rust, and it will be difficult for some data citizens (data scientists, data analysts) to jump on the boat. You are making a trade-off between performance and simplicity. The data engineer role evolves more strongly as a devops/backend engineer rather than just the “SQL person.” It makes sense to try out Rust for some use cases in that context. Rust's mindset is also valuable for any future programming language you would learn next. If you want to get your hands dirty, one of my favorite resources for Rust is the YouTube channel [Let's Get Rusty](https://www.youtube.com/c/LetsGetRusty). In the very end, programming languages are just part of your toolbelt, and it doesn't hurt to have more than one, especially when you see that the data engineer scope is expanding exponentially lately! 🚀 ``` Want to Connect With the Author? 
``` ``` Follow me on 🎥 YouTube,🔗 LinkedIn ``` --- ## Your Next Container Strategy: From Development to Deployment URL: https://mehdio.com/blog/your-next-container-strategy-from-development-to-deployment-66167c0d028a Date: 2021-12-14T14:42:56.133 #### Learn how to manage dockerfiles and version through a working Python API project [![](https://substackcdn.com/image/fetch/$s_!6nT5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c099638-c5fa-48be-a602-27194332a38e_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!6nT5!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c099638-c5fa-48be-a602-27194332a38e%5F800x533.jpeg) container balls dependencies — Image by the author In today's development world, containers are everywhere. It's a practical way to test things locally and be able to deploy it on any cloud service, as long as it supports a container image. It's easy however to fall into the hell of `dockerfile(s)` and versions. I'll provide in this guide a clear strategy with an example of how to set up your container strategy from development to deployment with an awesome `Makefile`. We will take a python API project deployed on [Cloud Run](https://cloud.google.com/run?utm%5Fsource=google&utm%5Fmedium=cpc&utm%5Fcampaign=emea-de-all-en-dr-bkws-all-all-trial-e-gcp-1010042&utm%5Fcontent=text-ad-none-any-DEV%5Fc-CRE%5F526671526787-ADGP%5FHybrid%20%7C%20BKWS%20-%20EXA%20%7C%20Txt%20~%20Compute%20~%20Cloud%20Run-KWID%5F43700059490201715-kwd-1072793830925-userloc%5F9060640&utm%5Fterm=KW%5Fgoogle%20cloud%20run-NET%5Fg-PLAC%5F&gclsrc=ds&gclsrc=ds&gclid=COqX%5F%5Ffsq%5FQCFQgGGwodvyMBGQ) as a case study and you will have a full working code example to play with. ### The strategy 🗺 [![](https://substackcdn.com/image/fetch/$s_!Qu4B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fec6acd-912f-489b-bb23-cfb8816ba490_652x312.png)](https://substackcdn.com/image/fetch/$s%5F!Qu4B!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fec6acd-912f-489b-bb23-cfb8816ba490%5F652x312.png) Image by the author The goal is to have little to no difference between the development environment (Docker Desktop), the CI processes (e.g [Github Action](https://github.com/features/actions), for testing/deployment), and the application runtime (GCP Cloud Run). In order to do so, we will separate them into multiple Dockerfile. This brings several advantages : * Not keeping the Dockerfile too big * Being able to hash a Dockerfile to detect a change and use it as a container version tag We will make use of a powerful `Makefile` that would help to be a single entry-point for both development and CI. ### Base container #### What's inside? This is basically the root of all other containers (here development/CI and application). This is where you usually define which OS you are going to use and put the baseline for the programming language. In practice here, we are going to put : * Debian 10 * Python 3.9 * A few Debian packages like curl, git, etc. Sometimes your company may have a base image that they want you to use to avoid too much disparity in the Linux OS distribution and better security patches management. Here, we will use directly an image from the official python repo. #### How to manage the version? 
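The next paragraph explains the approach in detail; as a preview, here is a rough Python sketch of the idea of deriving an image tag from a content hash. The actual demo repository uses a small shell script (`version.sh`), so treat the file names below as assumptions rather than the real layout.

```python
# Illustrative only: compute a content-based tag for a Docker layer by hashing
# the files that define it. The file lists here are hypothetical, not the demo repo's layout.
import hashlib
import sys
from pathlib import Path

LAYER_FILES = {
    "base": ["docker/Dockerfile.base"],
    "dev": ["docker/Dockerfile.dev", "requirements-dev.txt"],
    "app": ["docker/Dockerfile.app", "requirements.txt"],
}

def layer_version(layer: str) -> str:
    """Same file contents in, same tag out."""
    digest = hashlib.md5()
    for path in sorted(LAYER_FILES[layer]):
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()[:12]

if __name__ == "__main__":
    print(layer_version(sys.argv[1]))  # e.g. `python version.py dev`
```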
A simple strategy is to do an [md5](https://en.wikipedia.org/wiki/MD5) hash of a list of files or folders. That way, we are sure that the version we generate is tied to the content of the code and is unique. Such a script will look like this.

Script example to manage version (image tag) of each layer

Depending on the Docker layer, we will use the same script. Getting the latest version of the `dev` layer would be: `./version.sh dev` If you have any scripts or files that you use within that base image (for example, creating a custom Linux user), feel free to add them to the hash to keep track of these changes!

### Application container

#### What's inside?

* Everything included in the base image
* Code
* Py packages

#### What would define the version?

It can be a simple hash. Here, we use a git hash of the whole directory.

### Development container 🏗

Our strategy is to leverage the [devcontainer](https://code.visualstudio.com/docs/remote/containers) feature from VSCode so that we can have a fully isolated development environment built on the base image.

#### What’s inside?

* Everything included from the base image
* No code (as it will be a mounted volume so that we can have hot reload while developing)
* CLI tools (GCP CLI)
* Py packages (and dev dependencies)

#### What would define the version?

No need to version this one unless you want to speed up the build time and push the image to a remote Docker registry.

### Packaging everything using Makefile

The only requirements to develop/build/test/deploy are `make` and `docker`! We will have a single command to build with a `DOCKER_LAYER` parameter that can be: base, app, dev. We can dynamically get the latest version and try either to `pull` it from a remote registry or `build` it if it doesn't exist. To respect the dependency (base → dev and base → app) between Docker layers, we can add a simple if/else case so that the base layer is always built upfront, no matter which layer we are requesting. Example: `make get-img DOCKER_LAYER=dev` will first pull or build the latest version of the `base` Docker image. Make target `get-img` looks like this:

Make target example for get-img

### Automating tests & deployment with the CI

Now that all our actions are containerized, our CI actions will be pretty simple. Apart from a dedicated way of authenticating to the Cloud service, everything else is just a `make` command. Notice that we push the images (`base`, `dev`) when preparing the CI images. This will speed up the CI process when doing development on branches (if there's no change in the `base` image).

Example of CI commands for GitHub Actions

### Conclusion

With this example, you have a good overview of how to containerize your next project from development to deployment! Having good consistency on the base image across the different layers will help you move faster when you need to upgrade or test different versions of your dependencies. A lot of things can easily be adapted or extended depending on your needs and the tradeoffs that you want to make. The full code of the project can be found [here](https://github.com/mehd-io/python-api-boilerplate/). Happy Coding!

``` Want to Connect With the Author?
``` ``` Follow me on 🎥 Youtube,🔗 LinkedIn ``` --- ## Stop Using The Term “Data Engineer”, There’s Something Better URL: https://mehdio.com/blog/five-overused-definitions-of-a-data-engineer-f0d9059a174 Date: 2021-11-25T17:06:06.612 #### 5 overused definitions of the hottest job of the year [![](https://substackcdn.com/image/fetch/$s_!MKGJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6956d37b-196e-4b4e-9880-a20bb20a7ecd_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!MKGJ!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6956d37b-196e-4b4e-9880-a20bb20a7ecd%5F800x533.jpeg) Git fork at Standup — Image by the Author The Data Engineer scope **has never been so large and confusing**. Some new roles definitions have been popping up to focus on specific areas, but they are still underused. If you look at Data Engineer job offers, you may get confused as the content and job itself may differ from company to company. In this article, we will remind ourselves **what the close cousins of data engineer roles are** to be able to spot a misnomer. ### 📕 Data Engineer aka Database Administrator (DBA) This role has been fading out as most analytics use cases are based on OLAP systems. Some companies are still heavily relying on OLTP databases (e.g. MySQL, Postgres) for their analytics. They are many reasons for this : * They don't have any pub/sub system to fetch data in real-time and have low-latency query requirements (milliseconds response time) * They still haven't invested in a data platform and just relying on existing software engineers to do analytical jobs * They may need a DBA to maintain their production database: performance tuning, monitoring, migrations, backups, etc. How to spot a DBA data engineer role? You will probably have a job offer that focuses on a specific database with a heavy requirement on SQL and database/query/stored procedures optimizations. ### 📗 Data Engineer aka Data Analyst Data analysts are more focused on business value than internal SQL performance. They usually work with dashboarding tools and provide KPIs, insights directly consumable for the business. The trap here is that a Data Analyst may do some data pipelines as a lack of data engineer's availability, but that doesn't make it a Data Engineer; it's a different beast. Software Engineering is not where Data Analyst shines. If you missed the following in a job offer: CI/CD, Programming knowledge (Testing, Python/Java/Scala), infra topics (Terraform, Docker, etc.) — you are probably looking for a Data Analyst job offer, not a data engineer one. ### 📘 Data Engineer aka Analytics Engineer Analytics engineer is a pretty new named role. With the Cloud data warehouse emerging (Snowflake, BigQuery, Firebolt), a new era of data engineers was born. These engineers are super-powered "Data analysts". They apply software engineer best practices (version control, testing, CICD) and usually focus on SQL pipelines & optimization while using a Cloud Data Warehouse technology. They are usually responsible for data assets (cleaned, transformed data) that are directly used by businesses (at the opposite of just providing "raw" data). Dbt is often part of their primary tool belt. 🔗 Fishtown Analytics did a great comparison of Data Analyst, Analytics Engineer, & Data Engineer [here](https://www.getdbt.com/what-is-analytics-engineering/). 
### 📙 Data Engineer aka Machine Learning Engineer Data maturity is growing and machine learning is getting more democratized. Therefore, we need dedicated people for the challenge! This role focuses his time on the Machine learning lifecycle: deploying models (developed by Data Scientists) and turning them into a live production system. They usually have a strong software engineering background. Examples of some (specifics) tools they use: * ML Platform : AWS SageMaker, GCP Cloud ML * ML Libs : Tensorflows, scikit-learn, Spark ML, Keras, Pytorch * Orchestrations : MLflow, Kubeflow, Airflow ### 📔 Data Engineer aka Data Platform Engineer With the [data mesh ](https://towardsdatascience.com/what-is-a-data-mesh-and-how-not-to-mesh-it-up-210710bb41e0)concept getting traction, we tend to have more of a self-service approach where the centralized data team will mostly focus on providing the data platform and tooling to enable other data citizens to be autonomous in terms of data pipelines. Some of the work of a Data Platform Engineer : * Managing data infrastructure (Kafka, Data Orchestrators, Data Catalog) * Providing ETL tooling (dbt) or framework to rationalize and simplify data pipelines * Creating some microservices/API that provide data. ### So who's the real data engineer? 🕵 Today's big challenge is that the data engineer role in a company often doesn't involve solely one of the definitions above **but a mixed percentage of 2 or more.** However, it's fair to say that if your job is mainly falling into 1 area, then don't use the data engineer title! When you apply for a role, **be sure to ask what's the expected amount of time you will work on these topics**, this could give you a better idea of what to expect. One data engineer can hide another, and your role definition may be completely different from company to company, so watch out! 👀 ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## 7 Things You Need To Know If You Want to Become a Data Engineer ☄ URL: https://mehdio.com/blog/7-hacks-to-get-your-first-data-engineer-job-4b3e44bb35fd Date: 2021-10-22T17:54:26.428 #### Strategies to help you to land your first Data Engineer job [![](https://substackcdn.com/image/fetch/$s_!EqNB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba62d2a6-8e6d-4da7-abba-7cb10f739cd5_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!EqNB!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fba62d2a6-8e6d-4da7-abba-7cb10f739cd5%5F800x533.jpeg) Magic sunrise in the Bromo mountain \[Digital Image\] by Irham Bahtiar There is a ton of knowledge out there on how to be a 10x data engineer, but what's worth taking for you to land your first data engineer job? There are so many tooling and concepts you need to learn: _it's scary_ 😱. In this article, I will give you some **tips on how to hack your learning** journey and **provide you with extra resources** 📌. This is not a shortcut, but it will help you to prioritize things and not get swamped. ### If the bar is too high, try another data role. 
👷 Depending on your experience, getting a first job as Data Engineer may be challenging. If you are new to Software Engineering, then the technical entry barrier will be pretty high. The range of technical skillset required for a Data Engineer is much broader than for a Data Analyst. Therefore, Data Analyst is a good start as it will require less hard skills (yet more domain knowledge sometimes). Knowing SQL and mastering a dashboarding tool ([Tableau](https://www.tableau.com/trial/tableau-software?utm%5Fcampaign%5Fid=2017049&utm%5Fcampaign=Prospecting-CORE-ALL-ALL-ALL-ALL&utm%5Fmedium=Paid+Search&utm%5Fsource=Google+Search&utm%5Flanguage=EN&utm%5Fcountry=DACH&kw=tableau&adgroup=CTX-Brand-Core-EN-E&adused=543201874518&matchtype=e&placement=&gclsrc=ds&gclsrc=ds)/[PowerBi](https://powerbi.microsoft.com/en-au/)/[Metabase](https://www.metabase.com/)) should take you already to a good position. On top of that, a Data Analyst will often work with Data Engineers. Therefore, you have an opportunity to understand what they do and when you feel ready, you can apply for an internal move. It's always easier like this rather than going through the main door. There are many current data engineers I know that followed that path. 📌 Check out [this article ](https://betterprogramming.pub/how-i-went-from-analyst-to-data-engineer-b6cf7f6fb73)if you want to hear a story about such a move and some insights about becoming a data analyst [here](https://towardsdatascience.com/how-to-become-a-data-analyst-in-2020-209d2ed9d130#:~:text=As%20I%20mentioned%20above%2C%20data,job%20of%20a%20data%20analyst.). ### Don't start with Streaming & Machine Learning 🌊 Depending on the company's data maturity, some of these concepts are not must-haves. Please don't get fooled by how many buzz words you would find in their job offer. [Wired](https://www.wired.com/story/ai-why-not-more-businesses-use/) mentioned last year that only 9 percent of firms employ tools like machine learning. While AI adoption is growing fast, there's still a ton of companies that are struggling with basic data engineering. As a junior, there's a baseline in terms of knowledge that would cover a lot of use cases and get you pretty far. If you learn how to write **Python and SQL** using the classic pipeline framework (**Pandas, Spark, dbt**), you will cover most analytical batch use cases. Get experience with an analytical database such as **BigQuery** (or Redshift/Snowflake, but there aren't any free tiers for playground) and pick an orchestration tool. **Airflow** on that side is an industry-standard at the moment. 📌 Have a look at these data engineer roadmaps, this should give you a proper learning path : * The [data engineer roadmap](https://github.com/datastacktv/data-engineer-roadmap) from [datastack.tv](http://datastack.tv/) * [Data engineering roadmap 2021](https://medium.com/coriers/data-engineering-roadmap-for-2021-eac7898f0641) from [SeattleDataGuy](https://medium.com/u/41cd8f154e82) Did you notice that these roadmaps are starting with Software Engineering basics ? ### Software Engineering basics matter. A lot. 💾 We often forget that data engineers, at the very essence, are just another type of software engineer. I believe the reason for that oversight it’s because the job has evolved, and a lot of Data Engineers today come from a non-software engineering background (BI developer, Data Analyst). 
However, if you master these basics, you will shine among your software engineer peers, and you will get an edge in understanding how to deliver a production-ready project. These include (non-exhaustive list):

* CICD concepts & tooling (Github Actions / Jenkins / Circle CI)
* Git (Github / Gitlab)
* Testing (unit/integration/system testing)
* Infrastructure as code (Terraform, Pulumi)
* Devops (k8s, Docker, etc)

📌 Here’s [an excellent article](https://k21academy.com/microsoft-azure/data-engineer/devops-for-data-engineering/) to understand how DevOps relates to Data Engineering and what's in it for you.

### Learn to build things end to end with a side project 🗺

Take a pen and design how you would take data from point A, transform it, consume it (with a dashboarding tool), and make decisions based on it. Try to answer these questions:

* Where is my data coming from? How do I get it? API? Database? Scraping?
* How do I orchestrate the pipeline?
* How will I consume it? Which dashboarding tool can I use? How does the connection work? What are the limitations/costs behind this? How do I model my data?
* What happens if I want to change a feature in the data pipeline? How do I manage access? How do I manage versioning?

Having a good picture of the high-level design and understanding how each component will talk to the others is an excellent start to learning how to turn your skillset into actionable value.

📌 Check [this article](https://medium.com/coriers/5-data-engineering-projects-to-add-to-your-resume-32984d86fd37) to get inspiration about side project ideas for data engineers.

### Focus on one cloud provider, and learn the similarities with the others ☁️

All the cloud providers have a lot of similarities in terms of tooling. The fancy names are just there to get you lost. Focus on one provider and look up online what the equivalent of the service you are using is on another cloud provider. While there may sometimes be significant feature differences, you will grasp how that tool fits into your end-to-end pipeline without having work experience.

**AWS** dominates the market with more than 32% of the market share according to [Statista](https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/). So it's definitely a good bet for landing your first job.

📌 Google keeps an up-to-date comparison table with their competitors [here](https://cloud.google.com/free/docs/aws-azure-gcp-service-comparison).

### Target young companies without too much legacy and with reasonable data maturity 🏢

Based on the previous point, you probably want to focus on a company that's cloud-native. There are many reasons as a junior to do so. First, you have probably already spent quite some time focusing on cloud services. If the company has a lot of old frameworks or on-premise clusters, this is additional knowledge you need to grasp. Next to that, you are ensuring that the time you invest in learning the modern data stack will last at least a couple of years before becoming obsolete.

📌 [Crunchbase](https://crunchbase.com/) is a great resource to quickly see how big or how old a company is. Checking their engineering blog and GitHub organization will also give you a feel for their maturity.

### Soft skills matter as much as hard skills, or rather, even more. 👨‍🏫

> _”Engineering is easy — it's the people problems that are hard.” Google VP Bill Coughran_

Data engineers are NOT technical gurus living in a basement.
They are surrounded by many stakeholders: business, software engineers, data scientists, data analysts, etc. Therefore, **teamwork** and **communication** are key in data in order to break all these silos. Good soft skills (or rather _human skills_, because there isn't any softness in these) will give you powerful leverage once you are in the industry.

📌 There are some articles worth reading to get practical tips about soft skills in a data role.

### Conclusion 🚀

Don't focus on being the next technical superstar. Step back, see the bigger picture, and focus on what you need to strengthen to get your first role in the data world, depending on the market trends and your experience. Don't give up at the first failure. Keep going, and good luck! ❤️

### Mehdi OUAZZA aka mehdio 🧢

Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content!

**Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)**

---

## Why you should try something else than Airflow for data pipeline orchestration

URL: https://mehdio.com/blog/why-you-should-try-something-else-than-airflow-for-data-pipeline-orchestration-7a0a2c91c341

Date: 2021-09-20T12:59:00.144

#### Let’s evaluate [AWS step functions](https://aws.amazon.com/step-functions/), [Google workflows](https://cloud.google.com/workflows), [Prefect](https://www.prefect.io/) next to Airflow.

[![](https://substackcdn.com/image/fetch/$s_!z2I9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecee08c-6887-4707-8847-88fbbb67c1ab_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!z2I9!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faecee08c-6887-4707-8847-88fbbb67c1ab%5F800x533.jpeg)

Fan \[Digital image\] by rajat sarki

While Airflow has dominated the market in terms of usage and community size as a data pipeline orchestrator, it’s pretty old and wasn’t initially designed to meet some of the needs we have today. Airflow is still a great product, but this article's goal is to raise awareness of the alternatives and of what the perfect orchestration tool would look like for your data use case. Let’s evaluate [AWS step functions](https://aws.amazon.com/step-functions/), [Google workflows](https://cloud.google.com/workflows), and [Prefect](https://www.prefect.io/) next to Airflow. So what are the criteria for a good data orchestration tool nowadays?

### API-First design ⚙

As the Cloud providers are API-first, you want your orchestration tool to be the same. Ideally, you want to be able to do a couple of things through the API:

* Create/delete workflows
* Easy DAG serialization & deserialization for non-static/evolving workflows
* Run parameterized workflows
* Handle access management
* Deploy the orchestration tool (if not serverless) through IaC frameworks (Terraform/Pulumi)

All these features will enable you to connect to all your existing cloud services while using event-driven pipelines to their maximum potential. Airflow DAG creation is pretty static and the API is still quite limited compared to the other tools. While you can have a strategy to automate the deployment of the DAGs, you are still tied to generating a static file somewhere at the end.
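To make the "run parameterized workflows" point concrete, here is a minimal sketch of triggering a DAG run with a `conf` payload through Airflow 2's stable REST API; the host, credentials, and DAG id are placeholders.

```python
# Minimal sketch: trigger a parameterized run of an existing DAG via Airflow 2's
# stable REST API. Host, credentials, and dag_id are placeholders.
import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"
DAG_ID = "daily_ingestion"  # hypothetical DAG

resp = requests.post(
    f"{AIRFLOW_URL}/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),  # assumes the basic-auth API backend is enabled
    json={"conf": {"run_date": "2021-09-20", "full_refresh": False}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["dag_run_id"])
```

Note that the DAG itself still has to exist as a static Python file on the scheduler, which is exactly the limitation described above.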
Prefect seems not fully adapted for dynamic dag creation and has a bit of the same pattern as Airflow for DAG creation, see the issue [here](https://github.com/PrefectHQ/prefect/discussions/3772). ### Serverless & Separation of concern with Runtime ☁️ There’s always a paradox in serverless. On one hand, we don’t want to manage services and would rather focus on our use case. On the other hand, when something goes wrong or we need a custom feature, it’s a black-box hell. Nevertheless, managing an Airflow cluster in the past has been a pain and Kubernetes with Airflow v2 have solved many issues. Still, we should not underestimate the maintenance cost of a Kubernetes cluster. Aside from that, you will still need to add a couple of things to make sure it's working smoothly, for example, authentification, secrets management, and monitoring of the K8s cluster. Using a serverless orchestration tool from the cloud provider you are in, this is pretty smooth and built-in. With a Kubernetes Cluster, you are on your own to maintain or enable these. Another thing with the serverless orchestrator tool is that you are forced to have a clear separation of concerns and use that one ONLY for orchestrating tasks, not for actually running them. A dangerous path with Airflow is to use it as runtime. Again, Kubernetes helped a lot to solve this (see article [here](https://medium.com/bluecore-engineering/were-all-using-airflow-wrong-and-how-to-fix-it-a56f14cb0753)) but still, it's on your cluster and the maintenance depends on the tool you have put in place to monitor this one. ### Integration capabilities ⛓ What do you want to trigger? Is there any "connector" that enables you to trigger the target runtime without any custom layer? Airflow has a lot of operators. And if you don’t find what you need, you can always build your own. Prefect has also a good list of [integrations](https://docs.prefect.io/api/latest/tasks/airtable.html#writeairtablerow). Step Functions [has a couple of integrations](https://docs.aws.amazon.com/step-functions/latest/dg/concepts-service-integrations.html) with AWS services, offering even sync job or wait for callback. Google workflows started also to add [connectors ](https://cloud.google.com/workflows/docs/connectors)with GCP services. ### UI features 🔮 When running complex workflows, it's essential to have a clear place to observe what went wrong and quickly take action. You also would like to easily roll back or retry on a specific task/sub-task especially in a data pipeline context. Note that best practice should make your data idempotent. However, nowadays, pipeline dashboarding is not even enough. The problem is that you may have a silent failure, and you may need other alerts/info to be feed in a central tool. For example, you have a pipeline that always discards any data that doesn’t fit the schema. In such a case, you need to have another monitoring tool, such as a data quality tool for data observability. ### Testing 🏗 As a developer, you want an easy way to test your pipelines and have a development cycle as small as possible. At first thought, we may think that data orchestrator that can run anywhere (like Airfow/Prefect) would be the one that provides the best and smooth testing experience right? Well not really because running them locally will probably torture your laptop's CPU and boost your fan generating insane _air flows_ (sorry for the joke, I had to 😏). 
With a managed Airflow (AWS/Astronomer) you can have the possibility to create Airflow Instance on the fly (and automate it through code) for development reasons but the startup time is not negligible. Yes, even a minute or 2 is a lifetime for a developer. So at the very end, having full serverless orchestrators like AWS Step Functions/Workflows enable you to test rapidly your pipelines if you leverage IaC frameworks. Besides, you are testing it directly in the target environment to have little to no side effects. Note that AWS provides an emulator of their Step Functions for testing purposes [here](https://docs.aws.amazon.com/step-functions/latest/dg/sfn-local.html). ### In conclusion, let’s put some stars! 🌟 This table is just a high-level evaluation. Of course, you should consider other factors like the current knowledge of your team, your existing infrastructure, and make your own benchmark! While I really like and I have been (and still am) a long user of Airflow, I must say that my go-to for most of the new use cases would be AWS Step Functions / GCP Workflows depending on the use case. [Dagster](https://dagster.io/) is another great tool to be considered but having no prior experience with it and as they don’t provide a cloud-hosted version (though it's in progress according to their website), I didn't take time to invest in it. [![](https://substackcdn.com/image/fetch/$s_!P6mF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db7392b-9dfd-4c54-968b-368a5e6090be_800x121.png)](https://substackcdn.com/image/fetch/$s%5F!P6mF!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7db7392b-9dfd-4c54-968b-368a5e6090be%5F800x121.png) ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## Highlights from DATA+AI Summit 2021 💥 URL: https://mehdio.com/blog/highlights-from-data-ai-summit-2021-3abfd9aaccaa Date: 2021-06-27T22:26:20.094 [![](https://substackcdn.com/image/fetch/$s_!m_SY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41da1bb-a5e6-4cc5-ab36-e5a8131fc0f9_800x509.jpeg)](https://substackcdn.com/image/fetch/$s%5F!m%5FSY!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff41da1bb-a5e6-4cc5-ab36-e5a8131fc0f9%5F800x509.jpeg) Sublime purple night sky. \[Digital Image\] [https://unsplash.com/](https://unsplash.com/@strong18philip)[@vincentiu](http://twitter.com/vincentiu "Twitter profile for @vincentiu") Here it goes again, one of my favorite free data conferences! And the name has been again rebranded for the better. It was initially the "Spark Summit," then the “Spark+AI Summit,” and now the "Data+AI" Summit. Databricks now offers a couple of products, and it’s not anymore considered only as "the spark company". A lot of what I’m talking about below goes in the same trends as my last year’s highlights [here](https://engineering.klarna.com/highlights-from-spark-ai-summit-2020-for-data-engineers-359211b1eec2). 
But Databricks released a lot of interesting products this year! ### Delta sharing This new Databricks product is defined as the industry's first open protocol for secure data sharing. With companies going multi-cloud and the rise of Cloud Datawarehouse, there are big challenges in terms of data availability. Data engineers are spending a lot of time just moving/copying the data to make it accessible, queryable in a cost-efficient and secure manner in different places. Delta sharing aims to solve this by storing "once" and read it anywhere. It uses a middleware (Delta Share Server) to talk between the reader and the data provider. Delta sharing, on paper, is, in my opinion, the biggest product release since Delta format itself. However, there are some concerns worth mentioning : * Even though Databricks claims query is optimized and cheap, I think we need to take into account egress/ingress cloud cost. What happens if the data recipient is doing an ugly query which is not optimized? * Any open standard sounds promising as long as there's global adoption. Still, Delta has clearly a bit more traction than their other ACID format brothers (like Iceberg and Hudi), so it is definitely the best horse to bet on. ### Delta live table Another Delta Databricks product! You can see it as a super-powered "view" from a delta table where you can use either pure SQL or python for processing. You can create an entire data flow and creating multiple tables based on a single delta live table. The delta live engine is smart enough to do caching and checkpoint to only reprocessing what's needed. This is pretty interesting as instead of considering multiple copies of data as classic data pipelines in a data lake architecture(raw/bronze/silver), you have one source of truth. This enables clear lineage and, therefore good documentation of transformations. Besides, because data quality is hot (see below), Databricks added their own data quality tool with declarative quality expectations. ### Unity Catalog Databricks is launching its own Data Catalog. The data catalog is another trend in the data industry. Development on major opensource projects (like [Amundsen](https://www.amundsen.io/), [DataHub](https://datahubproject.io/), etc.) has been keeping up the pace last year. In the meantime, other big cloud providers, like Google, also released their own [data catalog](https://cloud.google.com/data-catalog). On top of that, as companies are going more and more towards a multi-cloud strategy, it makes data discovery and governance an even bigger concern. An interesting point that Databricks tackle with their own catalog is that they want to simplify the data access management through a higher API. This is something other solutions don't really focus on, and it's a major advantage as managing access at low level (File-based permission like s3 or GCS, for example) can be really tricky. Fine-grained permission is difficult, and the layout of data is not really flexible as often tight to a metastore like Hive/Glue. ### More python The most common denominator between data profiles (Data Scientist, Data Engineer, Data Analyst, ML Engineers, etc.) is probably SQL, and the second would be python. According to Databricks, most of the spark API calls today are done through Python (45%) and then SQL (43%). It's clear that Databricks wants to reduce the gap between "laptop data science" and distributed computing. 
Lowering the entry barrier will enable more users to do AI… and more money for data SaaS companies 😉 As Python has wide adoption and is beginner-friendly, it makes sense to invest in it. Most of the improvements go into the so-called "Project Zen", among them:

* Better readability of PySpark logs
* Type hint improvements
* Smarter autocompletions

#### Pandas in Spark natively

If you are not familiar with the [koalas](https://github.com/databricks/koalas) project, it's the pandas API on top of Apache Spark. The Koalas project will be merged into Spark. Everywhere you have a Spark DataFrame, you will have a pandas DataFrame without needing an explicit conversion. Of course, for small data use cases, Spark will still be an overhead on a standalone single-node cluster, but the fact that it can scale without any change to the codebase is pretty convenient.

### Building low-code tools to democratize Data Engineering/Data Science

It's incredible how many talks there were this year about ETL pipelines and data quality frameworks. This again doubles down on the trends I talked about last year. A lot of companies want to lower the entry barrier for data engineering, with motivations such as:

* Reducing complexity and increasing reusability through ETL frameworks
* Metadata- and configuration-driven ETL, where configurations double up as documentation for your data flows
* Making it easier for SQL devs to write production-ready pipelines and increasing the range of contributors

### Conclusion

It was great to see Databricks' product catalog growing. It feels like they are going more towards an integration strategy rather than trying to be the next platform where you will run everything (even if that's what they are also selling). But again, the major products around Delta also depend on vendor adoption, so let's see how fast the data community will adopt them!

_Resources:_ DATA+AI Keynotes

### Mehdi OUAZZA aka mehdio 🧢

Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content!

**Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)**

---

## Why & how to market yourself as a data engineer

URL: https://mehdio.com/blog/why-how-to-market-yourself-as-a-data-engineer-98633371ea7b

Date: 2021-06-10T17:22:55.982

#### Understand the values of marketing that will help to highlight your strengths

[![](https://substackcdn.com/image/fetch/$s_!hYYp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3cc6f9a-e15c-4906-a886-e82f2d584da0_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!hYYp!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3cc6f9a-e15c-4906-a886-e82f2d584da0%5F800x533.jpeg)

man selling fruits \[Digital image\]

Here are my takeaways from the great [podcast with Swyx](https://datatalks.club/podcast/s03e07-market-yourself.html) at DataTalks.Club. [DataTalks.Club](https://datatalks.club/) is a great community of data enthusiasts with weekly events and top speakers from the industry. In this blog post, you will understand the added value of marketing yourself, and you will get some insights on how to get started.
As I'm on my journey to get better at marketing and learning in public, I felt it would be interesting to share some points from the talk that resonated with me. ### Why you should be able to market (yourself) As a data engineer and technical person, marketing seems like something far from our domain. When we think about marketing, we either see a cliché of guys in suits selling things or a lame TV commercial. However, there are tons of reasons why you should be able to market and market yourself. Here are a few of them : * Recognition and getting the salary you deserve. * Being able to sell your opinion on a technical solution. #### Get recognition From finding jobs to promotions, getting recognition for your skills will obviously get you more money. One astonishing fact is the difference in first salaries among young graduates. We are all unequal on this, and it's because some people sell themselves better than others. Juniors are often scared to negotiate salary, but in fact, there's no harm in doing so, and recruiters are used to it. You need to learn how to do it. #### Data Engineers sell every day. Today, there are tons of options for solving a given technical problem. Tons of programming languages, front-end frameworks, design patterns, and libraries. How do you agree to move forward on a particular solution? Yes, there are some technical facts you can evaluate, but in the end, it's also a matter of selling. Selling to your colleagues, your team, or your manager. In the end, we are selling our services/time to our employer. ### Find your personal brand. While selling yourself may sound like a scam if you overdo it, it's actually about finding what you are good at and what value you bring to the table. When you do that, you need to step back and reflect on your work. Find a good balance between what you would like to do, what you are good at, and what people are interested in. Can people make memes of you? Is there something that you like to talk about often? Whatever you decide to focus on, or whichever profile picture you pick on social media, be consistent. Consistency and repetition will help you grow your audience (or network) step by step. ### Pick a domain you would like to share As you need to prove a certain expertise and spark interest, the narrower the better. It's easier to be an expert in niches. How many people share the same problems that you solve? If you look at existing content on Hacker News, YouTube, or podcasts, you can see what triggers interest and get inspired. ### Learning in public This may sound scary, as we need to overcome the potential negative judgments of others. The truth is that as long as you are honest and humble, a lot of people will follow along on your journey and give you feedback. What are you doing wrong? What are you doing well? > An expert is a man who has made all the mistakes which can be made, in a narrow field. — **[Niels Bohr](https://www.brainyquote.com/authors/niels-bohr-quotes)** The cool thing about doing it in public is that you get more data points in terms of feedback than you would with your employer and standard review processes. There's no shortcut to expertise. You "just" have to learn and earn it. Pick some mentors online, and engage with them. You can also do the same with companies. Writing something for others is more challenging but comes with greater rewards. You will need to master the topic as much as possible, and you will probably remember it better afterward.
### Landing your dream job Depending on your expertise, you may want to focus on a certain type of company. Pre-filtering companies may be a good option. I talk about this in my blog post [here](https://towardsdatascience.com/i-did-25-interviews-at-8-different-tech-companies-for-a-data-engineer-position-in-1-month-feab3e465f13). Don't be afraid to do free work and embrace opportunities. When you show dedication and commitment, opportunities often start to show up. If you are not getting your dream job/position, think about the long term. Is this going to give you at least a foot in the industry? Is it possible to move within the company later? It's worth mentioning that getting into a company through the front door can be more challenging for certain positions than an internal move. ### Internal marketing It's the same thing but with a smaller audience, and people have to listen because you work with them 😏. Swyx gives us a couple of tips. #### Brag document Build a brag document, a 1–2 page summary of what you have accomplished at the company. Whenever you want to ask for a promotion, you will usually have to sell to your manager, who will probably sell your promotion case to someone else. Having a clear written starting point is a good idea to make sure the message is correctly spread. It also helps you to reflect on what you have been doing. #### Use open channels Standups, demo sessions, hackathons, and internal guilds are examples of places where you can engage, share, shine, and cheer people up. #### Take initiatives Be creative, start small, get feedback, and grow. If you create something of interest to others, people will give you more responsibility. ### Practice public speaking Selling yourself on paper is a completely different thing than selling yourself verbally. Your tone, attitude, and body language all increase or decrease the impact of your message. Mastering all of this takes time; it's worth treating it as a separate skill, and it will help you organize your thoughts and communicate better. ### Conclusion Learning how to market yourself, and learning it in public, has insane added value for your career. It takes time and courage, but you shouldn't be afraid to start and take baby steps! ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## I did 25+ interviews at 8 different tech companies for a data engineer position in 1 month. URL: https://mehdio.com/blog/i-did-25-interviews-at-8-different-tech-companies-for-a-data-engineer-position-in-1-month-feab3e465f13 Date: 2021-05-18T12:56:27.158 #### Here is what I learned from this marathon and the current data market [![](https://substackcdn.com/image/fetch/$s_!Izqf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9e32a-779f-4d06-93a8-bfd75553577b_800x533.jpeg)](https://substackcdn.com/image/fetch/$s%5F!Izqf!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9e32a-779f-4d06-93a8-bfd75553577b%5F800x533.jpeg) Don’t forget to stretch.
\[Digital Image\] I got the opportunity to take a break and take time for my family and myself. When I was ready to go back out there for work, I decided to do a **marathon of interviews** and gave myself a maximum of 2 months to find my next best opportunity. I made a shortlist of 8 companies, went through more than **25 interviews in total, and got 3 offers on the table** 🎉. This marathon was only possible because all interviews were conducted remotely due to the global pandemic. This article will share my **thoughts from this experience** and give some insights into the **current data job market** (mostly EU/DE focus). ### About the data engineer market 📊 #### Full-remote options within the country, with an acceptable timezone difference, as a standard While big tech companies like FAANG already made the move, I can clearly see that the trend pushes for a full-remote option in all job offers. There are different levels of remote flexibility, but in general, the minimum is that the company will authorize you to work in the country where they have an office. They assume that you will work in a relatively close timezone to one of your direct teammates. #### More opportunities, more competition A direct effect of the point above is that there are a lot more opportunities at the moment. There is a bigger impact if your country has a "tech hub" city like Berlin, Paris, or Barcelona. Tech companies are now opening up beyond their home city and sometimes open offices in coworking spaces to access local talent. That means you are not forced to live in that actual tech hub city as long as you live in the corresponding country. There are also third-party companies that help startups grow across the EU/UK by providing a local contract to future employees. These third-party services have offices in multiple countries and handle all the paperwork for the employer. As an employee, it's totally transparent, and you should not have any tax concerns. #### Data engineering is booming Even if we consider the previous points, the number of job ads for data engineers is crazy. Why is that so? * Many people realized that before doing fancy machine learning stuff, you need to be mature on your data = you need data engineers. This statement became more mainstream as the job definitions around data became clearer. * The data engineer role has been expanding in scope. They do more than they used to, and inevitably there are more jobs for them. That being said, new job titles are emerging to define a specific area of the work, like analytics engineer or data platform engineer. Over time, these job titles may partly replace the general "Data Engineer" title. #### Equity programs as a standard in the EU for tech jobs For some reason, it's not really in the European culture to give equity programs as part of compensation. While equity is more common in the US tech job market, I can feel that nowadays, at least in EU tech hub cities, it's starting to become a standard part of the offer, and it's no longer reserved for top job titles only. Equity is definitely the most undervalued aspect of compensation in Europe. People are jealous of US salaries, but when you look at the numbers, the biggest part of that compensation comes from equity. It's still not as much as in the US market, but hey, it's an improvement.
If you are not familiar with equity programs or want a better view of software engineer salaries at tech companies within the EU, check out this [video](https://www.youtube.com/watch?v=KF7Hk9AppM8) from the [Pragmatic Engineer YouTube channel](https://www.youtube.com/c/mrgergelyorosz)! ### About the marathon 🏃 #### Shortlisting potential companies will smooth the first rounds You don’t want to spend too much time in interviews only to realize that the company, in general, is not a good fit. I realized I could have avoided such pain if I had done my homework properly on the companies I would like to work for. Here are a couple of questions I was trying to answer before even applying to a company. I would then check whether the answers met my requirements. * How old is the company? * How healthy (financially speaking) is the company? Take a look at [CrunchBase](https://crunchbase.com/). It also gives you insight into funding rounds, etc. * What’s the company size? What’s the past growth? * What’s the company culture? Do they have a blog/podcast about it? What’s the opinion on Glassdoor? * Are they active on open-source projects? Is their GitHub organization active? With this groundwork, I could easily target where I wanted to work and keep some questions in the backlog for the first rounds. #### Doing an interview marathon is great From time to time, I do random interviews to get a feel for the market and evaluate myself. However, I’ve never done so many interviews in parallel in such a short period. Why? Because it’s a full-time job. Despite this, I think that taking a week off to start such a marathon is definitely worth it. * It's easier to present yourself properly and tackle behavioral interviews. Believe me, after doing it 8 times in 1 week, you get better at it as long as you get feedback and improve. And most of the behavioral questions are pretty similar. * You actually get more data points regarding your skillset, your level, and the salary market. * As you start the interviews simultaneously, you will hopefully land offers at roughly the same time. This gives you huge leverage at the end for salary negotiation. #### Be open and honest about your marathon, right from the start In the first rounds, I always mentioned that I was talking to other companies. There's nothing wrong with that, even after a couple of interviews, as you really need all the information in your hands (including an offer) to make the right decision. ### Conclusion Doing this marathon was the best thing I have ever done. I would highly recommend it if you are looking for a new opportunity, rather than just glancing at job ads from time to time. You will spend a lot of your time at that future company, so make sure it's the right one and gather enough data points to make a good decision! ### Mehdi OUAZZA Thanks for reading ! 🤗 🙌 If you enjoyed this, **[follow me on Medium](https://medium.com/@mehdio)** and **[LinkedIn](https://linkedin.com/in/mehd-io/)** for more!
--- ## A day in the life of a data engineer URL: https://mehdio.com/blog/a-day-in-the-life-of-a-data-engineer-d65293272121 Date: 2021-04-20T05:15:31.061 #### Breaking down the main activities of a data engineer in 2021 [![](https://substackcdn.com/image/fetch/$s_!fDdN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22d2c20-af6f-4a2f-9741-61757f994f80_800x450.jpeg)](https://substackcdn.com/image/fetch/$s%5F!fDdN!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb22d2c20-af6f-4a2f-9741-61757f994f80%5F800x450.jpeg) Coding \[Digital Image\] | Spongebob Cleaning \[Digital Image\] The data engineering role in 2021 has been expanding in scope, for better or for worse. Therefore, multiple definitions of the role are popping up. Does the data engineer do more analytics (hence the new role definition, analytics engineer), data pipelines, infrastructure (DevOps), or machine learning engineering? Basically, it’s getting a bit blurry what an average data engineer spends their time on. However, these categories are all technical activities, and we often forget that they represent just a chunk of the time spent. In this article, we will **break down into different activities** what a typical day in the life of a data engineer looks like. ### Coding — 30 to 40% Let’s define what we actually mean by coding: * Development of a data pipeline/API/microservice. * Setting up/maintaining infrastructure * Fixing bugs, improving the code base, documentation Depending on the project phase, you will work on different coding aspects: new features, debugging, maintenance, and stability. It’s also worth remembering that coding is not only about “more” (adding lines of code) but also about “less” — removing code. A good example is to look at the top committers of Apache Spark [here](https://github.com/apache/spark/graphs/contributors). We can see that most of them actually have a negative ratio; they removed more lines than they added! So no, coding is not the main activity! [Multiple studies](https://thenewstack.io/how-much-time-do-developers-spend-actually-writing-code/) tend to show that a software engineer spends 30 to 40% of their day coding. That number matches my experience. ### Project and time management — 20 to 30% This is a challenging part, as it's fairly easy to be unproductive here. Measuring project/time management efficiency is hard, and you are often not the only variable in the equation. These activities fall mainly into 2 types : * Writing: ticket grooming, roadmap, etc. * Meetings: standup, sprint planning, etc. Writing is (almost?) a prerequisite to every meeting. A proper pre-read or agenda speeds up the discussion and gets everyone on the same page. ### Data Evangelism — 10 to 15 % Data engineers sit most of the time between the hammer (data consumers, aka data analysts/data scientists/business/microservices) and the anvil (data producers). If something goes wrong for the data consumer, the first to be blamed will be the data engineer.
[![](https://substackcdn.com/image/fetch/$s_!lXDs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F596b2462-a761-4c4b-aae6-d7cb1cea4b7c_683x384.jpeg)](https://substackcdn.com/image/fetch/$s%5F!lXDs!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F596b2462-a761-4c4b-aae6-d7cb1cea4b7c%5F683x384.jpeg) Angry Lady Cat \[Digital Image\] In that situation, you are the bad cop. You need to play your role by setting the rules and spreading the data culture. You sometimes have to say no. You may have to bring people back to reality. Being able to communicate realistic milestones politely and gently is an invaluable skill. Write best practices, communicate with stakeholders and data producers, and show them that these guidelines are there to help everyone and improve productivity, not to block them. ### Review — 10 to 20 % Review is an important category, as it’s basically the time when you learn the most. When you are on your own learning new things, it’s pretty hard to know if you are on the right track. Getting a tight feedback loop with people (peers, stakeholders) is crucial. You learn what you are doing well and what you need to adapt. Review can be split into 3 different categories: * Code review * Project review * Performance review (team or peer-to-peer review) There are days when I spend more time reviewing code than coding. And it’s not a bad thing. It may be that I need to get familiar with a new code base, or there’s some big feature I would like to double-check. Project reviews can be post-mortems or demos to your stakeholders. It's basically everything related to a specific project: understanding what is/was going wrong and what is/was going well. It's also an opportunity to share best practices and establish conventions: coding style, documentation, etc. ### Technology watch — 5 to 15% Even if it’s not a daily activity, it’s essential for a data engineer today to do technology watch: new tools and frameworks are popping up so fast that you need to follow the trends if you don't want to become outdated. When people think about technology watch, they sometimes think, “that’s the kiddo that just hypes about new toys.” But technology watch is not necessarily about looking at big, breaking new tech; it's also about : * Reading articles, books. * Improving your current setup with new libraries/frameworks or design patterns. * Following new cloud services or features that could simplify your setup or reduce costs. You can read my blog post about our data tech skills radar if you want more insights on the trends [here](https://medium.datadriveninvestor.com/what-are-the-most-requested-technical-skills-in-the-data-job-market-insights-from-35k-datajobs-ads-d8642555f89e). ### Unproductive — 1 to 10 % > "It takes a lot of effort to be this unproductive" Let's be honest. We all have days where we feel that nothing we have been doing falls into any of the categories above. Scrolling through your LinkedIn feed, talking about the last game you played at the coffee break, unproductive meetings: these are all activities that will eat up some of your days. There's nothing wrong with having this and acknowledging it, as long as it doesn't take up too big a part of your time. ### Conclusion As we can guess, **coding is just the tip of the iceberg**. And yes, **communication** is, in the end, key for almost all the sections.
I always try to keep these ratios in mind weekly to be sure I'm spending my time accordingly. These ratios will of course change depending on the culture and the size of your company. Are you missing something? Feel free to share your ratios and/or sections that I may have forgotten! ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## What are the most requested technical skills in the data job market? Insights from 35k+ data job ads URL: https://mehdio.com/blog/what-are-the-most-requested-technical-skills-in-the-data-job-market-insights-from-35k-datajobs-ads-d8642555f89e Date: 2021-03-09T08:51:27.772 [![](https://substackcdn.com/image/fetch/$s_!GtQM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32284692-230c-4cfd-80b8-ede280707ad6_800x501.png)](https://substackcdn.com/image/fetch/$s%5F!GtQM!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32284692-230c-4cfd-80b8-ede280707ad6%5F800x501.png) Data Skills Radar Dashboard The tech scene is evolving fast. Blazingly fast. There are so many new projects, frameworks, and cloud API services popping up that it is just too difficult for a software engineer to stay constantly up to date. To help our community stay on track, we ([Adriaan Slechten](https://medium.com/u/6d2936e22d60), [Grégoire Hornung](https://medium.com/u/d6fcfdc65841), [Vincent Claes](https://medium.com/u/f0f938b799df), and myself) have developed a dashboard that monitors the technical skills that are currently trending. The dashboard is available [here](http://dataskillsradar.amaaai.com/) and has a particular focus on data skills. In this blog post, we will go behind the scenes of the dashboard and share some insights and comments based on our experience. ### How did we do it? We scraped several top job ad websites worldwide, cleaned the data a bit, and processed it using a simple term-frequency matrix model (a rough sketch of the idea follows the list below). We mostly used serverless services on AWS (Glue/Fargate/S3) and GCP BigQuery as our data warehouse. Some points to be aware of: * Scraping is done by location and a pre-defined list of job titles. * We mostly focused on data profiles. * We picked the most prominent cities around the world and the most common data job titles. * We have some datasets that we either pull directly from the relevant source (like programming languages from the GitHub open dataset) or maintain manually to enrich the final output and categorize the technical skills. * The dashboard is refreshed every day.
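To give a rough idea of the processing step, here is a tiny, simplified sketch of the term-frequency idea: count in how many ads each skill appears. The skill list and the sample ads are made up, and the real pipeline runs on Glue/Fargate with a much larger, curated vocabulary; treat this purely as an illustration.

```python
# Simplified illustration of the term-frequency idea (hypothetical skill list and ads).
import re
from collections import Counter

SKILLS = {"python", "sql", "spark", "airflow", "aws", "gcp", "kubernetes"}

def skill_frequencies(job_ads):
    """Count in how many ads each skill is mentioned at least once."""
    counts = Counter()
    for ad in job_ads:
        tokens = set(re.findall(r"[a-z0-9+#]+", ad.lower()))
        counts.update(skill for skill in SKILLS if skill in tokens)
    return counts

ads = [
    "Data Engineer - Python, SQL, Airflow and AWS experience required",
    "ML Engineer - Python, Spark, Kubernetes",
]
print(skill_frequencies(ads).most_common())  # e.g. [('python', 2), ('sql', 1), ...]
```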
### Insights #### AWS and Azure lead the way, but GCP has a good presence for data jobs [![](https://substackcdn.com/image/fetch/$s_!PeYo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddf3c228-532a-4137-8ed8-2bc11582101e_656x271.jpeg)](https://substackcdn.com/image/fetch/$s%5F!PeYo!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddf3c228-532a-4137-8ed8-2bc11582101e%5F656x271.jpeg) AWS is the most popular (more than 50% of the total), followed by Azure and then GCP. What’s interesting to notice is that GCP clearly has a bit more presence for data profiles than for standard software engineering jobs. Some possible factors: * Google BigQuery has far more features than Redshift, and it’s fully serverless. * AutoML features * Pricing: you can basically start for free on the GCP side and scale as you need. Azure is second, probably mainly because of Microsoft's initial footprint at big corporates. Adding a contract for cloud products is then just an amendment :-) #### The top 3 languages differ slightly across continents [![](https://substackcdn.com/image/fetch/$s_!a1QE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bd79e9c-da78-4f9b-9018-558290b01146_647x251.jpeg)](https://substackcdn.com/image/fetch/$s%5F!a1QE!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bd79e9c-da78-4f9b-9018-558290b01146%5F647x251.jpeg) As a software engineer exposed to global information on the internet, it was quite surprising to see that trends differ across continents. And it does make sense, as legacy software arrived at different times and from different places. Europe and North America lean more towards popular languages such as Python. South America and Asia seem to favor more traditional backend tech like Java. #### ML Engineer and Data Engineer: spot the 7 differences [![](https://substackcdn.com/image/fetch/$s_!twrZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdceb7e57-b9da-476b-90d9-da471ba53f31_550x824.jpeg)](https://substackcdn.com/image/fetch/$s%5F!twrZ!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdceb7e57-b9da-476b-90d9-da471ba53f31%5F550x824.jpeg) The programming languages ranking is the same, and we can find a lot of similarities in terms of technical skills: Spark, SQL, K8s, etc. The biggest difference is that TensorFlow and PyTorch sit at #1 and #2 for the ML engineer profile, while they don't even make it into the top 10 for data engineer jobs. #### Data Scientists, get your SQL knowledge up to speed [![](https://substackcdn.com/image/fetch/$s_!UX6v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ac22a-385f-48b3-a385-691e6342f32c_451x341.jpeg)](https://substackcdn.com/image/fetch/$s%5F!UX6v!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f2ac22a-385f-48b3-a385-691e6342f32c%5F451x341.jpeg) SQL ranks #1 for data scientists. Being a data scientist is not only about working on fancy AI topics like machine learning or deep learning. Knowing your basics still matters.
Besides, it still happens that a data scientist ends up doing mostly basic reporting and SQL pipelines because the company's data maturity is not yet ready for machine learning. R comes in at #2 as a programming language, after the perennial winner Python. It's also worth mentioning that the skills again differ around the globe. For example, SAS has a good presence in the US, while in the EU it doesn't make the top 10. #### Data Engineer, Scala is roughly top 3 [![](https://substackcdn.com/image/fetch/$s_!Rlij!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14d5aee-d08b-415e-8629-18e70a6b7553_307x261.jpeg)](https://substackcdn.com/image/fetch/$s%5F!Rlij!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc14d5aee-d08b-415e-8629-18e70a6b7553%5F307x261.jpeg) Let’s face it, the success of Scala within the data engineering ecosystem is due to Spark. But Databricks (creators of Spark) recently shared during their summit keynote that current Spark usage mostly goes through the Python API. Even if Scala beats the Python Spark API on performance and native functional programming, there are multiple reasons why Scala is losing ground to Python : * Most data engineering tools (Pandas, Airflow, cloud APIs, etc.) sit in Python (and Go for K8s, Terraform…) * Data engineers work and discuss a lot with data analysts / data scientists / machine learning engineers. The common denominator among these profiles is Python (and SQL). * The learning curve for Scala is quite steep for beginners. ### What’s next Even if these insights are interesting, we would like to add KPIs around the evolution of each skill (winners/losers) in the future, to be able to see the trends. Is there anything you would like to see on the Radar? Let us know! _Please note that these findings should not be taken as 100% correct. They instead give you a sense of the trends. We also deliberately focused on data profiles so we could sanity-check the findings against our own professional experience._ ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content! **Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## Why and how you should dockerize your development environment (with VS Code 💙) URL: https://mehdio.com/blog/dockerize-your-development-environment-with-vs-code-cac9e7a60751 Date: 2021-01-18T08:00:14.511 [![](https://substackcdn.com/image/fetch/$s_!MZZF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32cc60a4-58d4-4cc9-b63a-e20deffdc5e1_600x327.jpeg)](https://substackcdn.com/image/fetch/$s%5F!MZZF!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32cc60a4-58d4-4cc9-b63a-e20deffdc5e1%5F600x327.jpeg) In this blog post, I will cover a few elements that should **motivate you to dockerize your development environment** and give you a repo example on **how you can smoothly achieve this with VS Code**. The idea here goes further than just _"I have my application dockerized that I can test locally,"_ and creates a **complete development experience entirely (or almost?) in docker**.
I will also share **my general experience** and the limits I have encountered while having my entire development environment dockerized for the past few months. ### So… Why? #### It uses the same runtime environment as your application * A good practice nowadays is to provide a Dockerfile for either the target deployment runtime (e.g. K8s, AWS Fargate, GCP Cloud Run, etc…) or local testing. It's relatively easy to extend this existing image for development purposes. * Managing multiple versions of multiple frameworks/languages is easier. Even if there are tools to help solve this (for example `pyenv` for Python versions), it's easier if it's just a variable to change in your Dockerfile, with no conflicts guaranteed. * No more "it works on my machine". It will work on your machine, and everywhere Docker runs, too. (\*) (\*) Actually, you may have little glitches between Windows and Unix hosts (for path references, for example). #### Standardize development tools Every developer has their own flavors of IDE/extensions/terminal, and I'm not fighting against that. But because docker provides you with a base layer, you can also add all the classic things a developer may need along the way and share best practices for the sake of productivity. Plus, when using VS Code, you can configure **extensions** to be installed for that environment. Aside from this, any developer can still override these base settings with their custom choices. #### Get Ready for Cloud IDEs There have been a couple of initiatives over the years to provide a Cloud IDE experience (Cloud9, Codeanywhere, etc.), and to be honest, I've tried them regularly but was never really satisfied with the whole experience. In the end, they lacked a lot of what I had on my laptop. But with the recent announcement of Github's [Codespaces](https://github.com/features/codespaces) (Visual Studio Online), the space is getting more mature and the experience is moving to another level: * Instead of providing a completely different IDE, they _just_ enable an existing desktop IDE as a web app. This is a different strategy from the one used by the _old_ Cloud IDE initiatives. Therefore it’s not like you have to give up your favorite desktop IDE; your **Cloud IDE is just another option for your development**: same extensions, same shortcuts, same flavors. * Having it on GitHub removes the need for yet another service to host your development experience, and it's quite affordable (see the pricing [here](https://docs.github.com/en/free-pro-team@latest/github/developing-online-with-codespaces/about-billing-for-codespaces)). * It makes contributing to open-source projects even more appealing. Sometimes it can be a pain to set up the development environment, but what if your containerized environment is just a click away? ### Excited now? Let's get started! > Setup (dockerfiles, scripts, etc.) is mostly taken from an official repo from Microsoft [here](https://github.com/microsoft/vscode-dev-containers/tree/master/containers) that contains a set of development container configuration files. You can clone the full repository example (a simple Python 3.8 API) [here](https://github.com/mehd-io/devcontainer-demo) to follow along. #### Introduction You’ll need : * Docker (yes, really) * VS Code * The [Remote-containers](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers) VS Code extension What do we need in our repo?
Here's a good practice example :
```
├── .devcontainer
│   └── devcontainer.json
└── docker
    ├── app.Dockerfile
    ├── dev.Dockerfile
    └── library-scripts
```
#### devcontainer.json The **devcontainer.json** will contain all the _metadata_ for your development environment, meaning: * The path to the Dockerfile to be used * VS Code extensions to be installed * All the default VS Code settings you want to set up (e.g. default linter, testing framework) * All extra docker parameters: which ports to forward, which volumes to mount, etc. The file will look roughly like the sketch at the end of this section. You can find the official documentation [here](https://code.visualstudio.com/docs/remote/devcontainerjson-reference) if you need more information. Note that an extension name is the unique name provided in the extension marketplace. To add these, the easiest way is to go to your extensions panel in VS Code, click on the settings (gear) button, and choose “Add to devcontainer.json”. [![](https://substackcdn.com/image/fetch/$s_!ZdjT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e30a00-9187-4e07-a072-67e15e524df6_800x489.png)](https://substackcdn.com/image/fetch/$s%5F!ZdjT!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37e30a00-9187-4e07-a072-67e15e524df6%5F800x489.png) The most interesting part here is the `mounts` field. By default, the current working directory (where your code is) will be mounted to `/workspace`, but you can override this with `workspaceMount`. So `mounts` refers to extra things you want to mount. Here, we mount what's needed for git (your `.ssh` folder if you use SSH keys). Some other things you may also want to mount are : * cloud credentials (`.aws` for AWS or `.config` for GCP) * the docker registry config `.docker` * `/var/run/docker.sock`, to run docker commands inside the container (it connects to your host's Docker engine). You can have a look at the repo example [here](https://github.com/microsoft/vscode-dev-containers/tree/master/containers/docker-from-docker). However, this increases the size of your image, and you can always open a local terminal with VS Code using `Terminal: Create New Integrated Terminal (Local)` or your favorite terminal. There are also examples in the official Microsoft repo for not running as root inside the container [here](https://github.com/microsoft/vscode-dev-containers/blob/master/containers/python-3/.devcontainer/devcontainer.json#L44). Keep this file under the **.devcontainer folder**, as VS Code will scan for it at launch time. If it’s there, VS Code will notify you and suggest reopening VS Code in that docker environment. Or you can use the Command Palette (Shift+Cmd+P) and pick `Remote-Containers: Reopen in Container`, or `Remote-Containers: Rebuild and Reopen in Container` if you want to force a rebuild of your image. The `postCreateCommand` is quite useful to keep your docker image light and independent from your application package's requirements. Here we used it to install Python packages through `poetry` after the image is built. Note that we deactivated the virtualenv and just installed into the system Python, as a virtualenv doesn't bring much value when our process is already isolated in the container.
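Putting the pieces above together, a minimal devcontainer.json could look roughly like this. It is a sketch in the spirit of the repo example, not the exact file from it: the extension id, settings, mount, port, and poetry command are assumptions for illustration.

```json
{
  // Hypothetical values for illustration; devcontainer.json accepts comments.
  "name": "devcontainer-demo",
  "build": {
    "dockerfile": "../docker/dev.Dockerfile",
    "context": ".."
  },
  "settings": {
    "python.testing.pytestEnabled": true,
    "python.formatting.provider": "black"
  },
  "extensions": ["ms-python.python"],
  "mounts": [
    "source=${localEnv:HOME}/.ssh,target=/root/.ssh,type=bind"
  ],
  "forwardPorts": [8000],
  "postCreateCommand": "poetry config virtualenvs.create false && poetry install"
}
```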
#### dev.Dockerfile The **dev.Dockerfile** will be the docker definition of your development environment. Here we can install everything we need for development. This example includes: * zsh and oh-my-zsh * Standard Python tools (pytest/black/isort/…) * etc. You may also want to install : * IaC frameworks (pulumi/terraform) * Cloud CLIs (aws, gcp) Sometimes you will have installation scripts (`docker/library-scripts` in the repo example) that are used by both your `app.Dockerfile` and `dev.Dockerfile`, so it's good to keep them outside your Dockerfiles to avoid duplicated code. On top of that, it's always good practice not to let your Dockerfile grow too big. #### app.Dockerfile The **app.Dockerfile** will be the definition for your application. You could also have a third Dockerfile, for example `base.Dockerfile`, which serves as a base layer for both app.Dockerfile and dev.Dockerfile. Here, we’ll just simplify and assume you are using the same source image (`FROM xxx`) in both Dockerfiles. ### How does it work in GitHub Codespaces? (VS Code in the cloud) [Github Codespaces](https://github.com/features/codespaces) is still in closed beta, but the same `devcontainer.json` configuration will roughly(\*) work out of the box! On your GitHub repository, click on the `Code` button, then `Open with Codespaces` in the drop-down menu, and VS Code will load as a web app, fully containerized! (\*) Some fields aren’t (yet) available on Codespaces; the limitations are listed in the official documentation. [![](https://substackcdn.com/image/fetch/$s_!dFSB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4063c6-461f-4d2f-afd3-5bce83ae443b_680x550.png)](https://substackcdn.com/image/fetch/$s%5F!dFSB!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d4063c6-461f-4d2f-afd3-5bce83ae443b%5F680x550.png) ### Conclusion You’ve now seen the basics and the added value; go play around with the demo project [here](https://github.com/mehd-io/devcontainer-demo). Why not containerize your next project from the start? If you encounter some limitations/bugs, feel free to share them in the comments! ### Mehdi OUAZZA aka mehdio 🧢 Thanks for reading! 🤗 🙌 If you enjoyed this, **follow me on** 🎥 **[Youtube](https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ)**,✍️ **[Medium](https://medium.com/@mehdio)**, or 🔗**[LinkedIn](https://linkedin.com/in/mehd-io/)** for more data/code content!
**Support my writing** ✍️ by joining Medium through this **[link](https://mehdio.medium.com/membership)** --- ## Highlights from Spark+AI Summit 2020 for Data engineers URL: https://mehdio.com/blog/highlights-from-spark-ai-summit-2020-for-data-engineers-359211b1eec2 Date: 2020-07-23T08:31:00.93 [![](https://substackcdn.com/image/fetch/$s_!2LjF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16f6a454-0920-41cb-9501-fbf395ba1cb6_2500x1500.jpeg)](https://substackcdn.com/image/fetch/$s%5F!2LjF!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16f6a454-0920-41cb-9501-fbf395ba1cb6%5F2500x1500.jpeg) Originally called “Spark Summit” and now drifting towards AI (to follow the hype, of course), the summit, organized by [Databricks](https://medium.com/u/5ae67e7eecef) (creators of Spark, Delta, and MLflow), brings together all the top tech companies with mature experience in **data science and data engineering**, with more than **200 sessions.** So even if you are not a Spark fanboy (no, I won't talk about Spark 3.0), there’s a lot to learn from this event. Bonus this year: the event was online and free, and as usual, all talks + slides are available [here](https://databricks.com/sparkaisummit/north-america-2020/agenda). In these takeaways, focusing on **data engineering topics**, I'll provide as resources the **most interesting talks** I've seen for each highlight and some **extra bonus links**. Each highlight could be a dedicated article on its own; the idea here is to gather a _curated list_ for anyone who would like to be up to date with the latest data engineering news in 2020. [![](https://substackcdn.com/image/fetch/$s_!Xd_4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35a0d898-275c-433b-9b7e-33f0aa6d5751_800x450.jpeg)](https://substackcdn.com/image/fetch/$s%5F!Xd%5F4!,f%5Fauto,q%5Fauto:good,fl%5Fprogressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35a0d898-275c-433b-9b7e-33f0aa6d5751%5F800x450.jpeg) _Copyright Databricks_ #### Spark on Kubernetes (K8s) is getting better and the data engineering world is welcoming it with open arms Kubernetes makes peace between infrastructure for big data and the rest. It's cumbersome to build knowledge around a Yarn/Mesos cluster if it's only for specific big data applications like Spark jobs. As K8s is a general-purpose orchestration framework, every Spark job is _just_ another application living in the cluster. With the [spark on k8s operator from google](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator), each Spark job can be seen as a K8s object that you can describe/delete/restart as you would any other application running on your cluster.
Talks: * **[Running Apache Spark on Kubernetes: Best Practices and Pitfalls](https://databricks.com/session%5Fna20/running-apache-spark-on-kubernetes-best-practices-and-pitfalls)** * **[Running Apache Spark Jobs Using Kubernetes](https://databricks.com/session%5Fna20/running-apache-spark-jobs-using-kubernetes "https://databricks.com/session_na20/running-apache-spark-jobs-using-kubernetes")** * **[Simplify and Boost Spark 3 Deployments with Hypervisor-Native Kubernetes](https://databricks.com/session%5Fna20/simplify-and-boost-spark-3-deployments-with-hypervisor-native-kubernetes "https://databricks.com/session_na20/simplify-and-boost-spark-3-deployments-with-hypervisor-native-kubernetes")** #### Data catalog/lineage/governance: always and forever It seems that almost all top tech companies are building their own data catalog/lineage/governance tool. It's a bit funny, because this problem has existed for years, but only now are more people realizing that it is becoming a must-have. The exponential growth of data over the past years, the need for _fresh_ data, and the _beloved_ GDPR are probably the drivers for this. I was, however, a bit disappointed by the talks around the topic this year, because most of the products presented seem to be commercial. That being said, there are plenty of open-source initiatives (see the extra links). Talks : * **[Case Study and Automation Strategies to Protect Sensitive Data](https://databricks.com/session%5Fna20/case-study-and-automation-strategies-to-protect-sensitive-data) (Immuta)** * **[Find and Protect Your Crown Jewels in Databricks with Privacera and Apache Ranger](https://databricks.com/session%5Fna20/find-and-protect-your-crown-jewels-in-databricks-with-privacera-and-apache-ranger) (Privacera)** Extra : open-source data catalog initiatives from Lyft, LinkedIn, and Netflix. #### **Data engineers need to democratize data pipelines** Data engineers also need to build tools/frameworks to scale productivity on data pipelines and enable other friends (data scientists, data analysts, software engineers) to write production-ready ETLs. Some of the talks make a nice distinction between job code (= the boilerplate to run your job) and business code (the business data transformation logic). When doing data pipelines at scale, in production, there's a lot of boilerplate needed (job code) compared to the actual business code that produces valuable data. There are different ways to achieve this (a toy sketch follows the list below) : * Force SQL to be the unique "business code" and build a DSL (Domain-Specific Language)/boilerplate that runs that SQL code. * Create a DSL where common functions/pipelines can be reused without having to (re)code them * Create/Reuse a shared library with boilerplate generation * …
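As a toy illustration of the job code vs. business code split (not taken from any of the talks; table names and the SQL are hypothetical), the "SQL as the only business code" approach could look roughly like this in PySpark:

```python
# Toy sketch: generic "job code" boilerplate that runs a single SQL "business code" string.
# Table names and the SQL statement are hypothetical.
from pyspark.sql import SparkSession

def run_sql_pipeline(spark, source_table, target_table, business_sql):
    """Job code: read the source, run the contributed SQL, write the result."""
    spark.table(source_table).createOrReplaceTempView("source")
    result = spark.sql(business_sql)  # the only part a contributor has to write
    result.write.mode("overwrite").saveAsTable(target_table)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("sql-pipeline-demo").getOrCreate()
    run_sql_pipeline(
        spark,
        source_table="raw.orders",
        target_table="analytics.daily_orders",
        business_sql="SELECT order_date, count(*) AS orders FROM source GROUP BY order_date",
    )
```

All the scheduling, configuration, and quality checks would live in the framework, while contributors only provide the SQL.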
Talks : * **[Sputnik: Airbnb’s Apache Spark Framework for Data Engineering — Databricks](https://databricks.com/session%5Fna20/sputnik-airbnbs-apache-spark-framework-for-data-engineering "https://databricks.com/session_na20/sputnik-airbnbs-apache-spark-framework-for-data-engineering")** * **[Composable Data Processing with Apache Spark](https://databricks.com/session%5Fna20/composable-data-processing-with-apache-spark "https://databricks.com/session_na20/composable-data-processing-with-apache-spark")** * **[Designing the Next Generation of Data Pipelines at Zillow with Apache Spark](https://databricks.com/session%5Fna20/designing-the-next-generation-of-data-pipelines-at-zillow-with-apache-spark "https://databricks.com/session_na20/designing-the-next-generation-of-data-pipelines-at-zillow-with-apache-spark")** * **[Fugue: Unifying Spark and Non-Spark Ecosystems for Big Data Analytics](https://databricks.com/session%5Fna20/fugue-unifying-spark-and-non-spark-ecosystems-for-big-data-analytics "https://databricks.com/session_na20/fugue-unifying-spark-and-non-spark-ecosystems-for-big-data-analytics")** #### **Big data formats for ACID transactions: Delta Lake seems to take the lead among its peers Apache Iceberg and Apache Hudi** Again this year, there was a talk comparing Delta/Iceberg/Hudi, which are all _new_ data formats created to solve common issues we have with the classic big data formats (Parquet, Avro, …), such as jobs failing mid-way or the difficulty of modifying existing data. As Delta is developed by [Databricks](https://medium.com/u/5ae67e7eecef), there was of course more information about that last one. Talks : * **[A Thorough Comparison of Delta Lake, Iceberg and Hudi](https://databricks.com/session%5Fna20/a-thorough-comparison-of-delta-lake-iceberg-and-hudi "https://databricks.com/session_na20/a-thorough-comparison-of-delta-lake-iceberg-and-hudi")** Extra : * **[ACID ORC, Iceberg, and Delta Lake — An Overview of Table Formats for Large Scale Storage and Analytics](https://databricks.com/session%5Feu19/acid-orc-iceberg-and-delta-lake-an-overview-of-table-formats-for-large-scale-storage-and-analytics) \[Spark Summit 2019\]** #### Apache Arrow is getting to the next level Because Arrow is a bit of a shadow protocol behind certain tools that data engineers use as end users, it can be confusing to understand how Arrow helps and where it fits in the big data landscape. One thing is sure: it's getting more traction, and one use case I wish to see is the replacement of the ancient JDBC protocol, but for that we need databases to do their part of the integration. Talk : * **[Data Microservices in Apache Spark using Apache Arrow Flight](https://databricks.com/session%5Fna20/data-microservices-in-apache-spark-using-apache-arrow-flight)** Extra : * (Understand the basics of Arrow) #### Data quality at scale In the same vein (and probably with the same drivers) as data lineage/catalog/governance, data quality got its own big piece of the cake. [Great expectations](https://github.com/great-expectations/great%5Fexpectations) is a really nice framework for data quality and comes with a bunch of integrations (pandas, Spark, BigQuery, Redshift). There was a dedicated talk about it, so check it out!
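To give a flavour of what declaring expectations looks like, here is a minimal sketch using Great Expectations' pandas integration. The column names and thresholds are made up, and the call style shown is the classic dataset-style API from around the time of the talk.

```python
# Minimal Great Expectations sketch (hypothetical columns and thresholds).
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({
    "user_id": [1, 2, 3],
    "amount": [10.0, 25.5, 7.2],
}))

# Declare expectations next to the data they protect
df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate all declared expectations at once
print(df.validate())
```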
Talk : **[Automated Testing For Protecting Data Pipelines from Undocumented Assumptions](https://databricks.com/session%5Fna20/automated-testing-for-protecting-data-pipelines-from-undocumented-assumptions "https://databricks.com/session_na20/automated-testing-for-protecting-data-pipelines-from-undocumented-assumptions")** Extra : * (Spark Scala — from AWS) * (PySpark) #### Python within the Spark ecosystem has a bright future Even if there's no doubt that Scala has better performance for Spark (and is easier to package), the mass adoption of Python in the data ecosystem makes it worth investing more into it (and into democratizing data pipelines!). Databricks announced a couple of things during their keynote that show they want to go in that direction and improve the usability of the Spark Python API with their "Project Zen". Talk: * [Spark + AI Summit 2020: Wednesday Morning Keynotes](https://databricks.com/session%5Fna20/wednesday-morning-keynotes) (see from 15:50 onwards) Do these points resonate with your thoughts? Feel free to share yours!