21 Unique Reasons Why Apache Hudi Should Be Your Next Data Lakehouse | Apache Hudi
Hard-Earned Lessons from a Year of Building AI Agents by Maya Murad
- LLMs can be harnessed for higher-complexity problem-solving. By combining clever engineering with advanced prompting techniques, models could go beyond what few-shot learning could achieve. Retrieval-Augmented Generation (RAG) was an early example of how models could interact with documents and retrieve factual information. LLMs can also dynamically interact with a designated environment via tool calling. When combined with chain-of-thought prompting, these capabilities laid the foundation for what would later be known as AI agents.
Resilience Best Practices: How Amazon Builds Well-Behaved Clients and Well-Protected Services
- Three operational strategies are suggested: load shedding, auto-scaling, and fairness.
- Rate-limiting algorithms covered: token bucket, leaky bucket, exponentially weighted moving average (EWMA), fixed window, or sliding window.
- To avoid making the situation worse for a dependency that is under stress, AWS suggests two patterns for well-behaved clients: circuit breakers, which prevent sustained overload of a dependency, and retries, which let the client retry each request up to N times using exponential backoff with jitter between attempts.
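The retry pattern above can be sketched in a few lines. This is a minimal illustration of exponential backoff with full jitter, not AWS's actual implementation; the function names are mine:

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full-jitter backoff: a random delay between 0 and
    min(cap, base * 2**attempt), so retries spread out."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(fn, max_attempts=3):
    """Call fn, retrying on failure with jittered backoff;
    re-raise the error after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

The jitter is what keeps a fleet of clients from retrying in lockstep and hammering the dependency again at the same instant.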
Prefix Aliases in SQL by Hannes Mühleisen
Gems of DuckDB 1.2 by The DuckDB team
- Starting with version 1.2.0, DuckDB supports OR and IN expressions for filter pushdown. This optimization comes especially handy when querying remote
Graph Databases after 15 Years – Where Are They Headed? by LDBC Linked Data Benchmark Council
In Praise of “Normal” Engineers
A 10x Faster TypeScript - TypeScript by Anders Hejlsberg
Introducing a SQL-based metrics layer powered by DuckDB by DuckDB
Ibis, DuckDB, and GeoParquet: Making geospatial analytics fast, simple, and Pythonic by DuckDB
GitHub - reloadware/reloadium: Hot Reloading and Profiling for Python by reloadware
Four steps toward building an open source community by Klint Finley
Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses by Ananth Packkildurai
- However, many data ingestion tools don’t natively support compaction, requiring manual intervention or dedicated Spark clusters.
AWS CDK Introduces Garbage Collection to Remove Outdated Assets
Mastering Spark: The Art and Science of Table Compaction
DuckDB goes distributed? DeepSeek’s smallpond takes on Big Data by mehdio
- Let’s recap the features of smallpond:
  - Lazy evaluation with DAG-based execution – Operations are deferred until explicitly triggered.
  - Flexible partitioning strategies – Supports hash, column-based, and row-based partitioning.
  - Ray-powered distribution – Each task runs in its own DuckDB instance for parallel execution.
  - Multiple storage layer options – Benchmarks have primarily been conducted using 3FS.
  - Cluster management trade-off – Requires maintaining a compute cluster, though fully managed services like Anyscale can mitigate this.
  - Potential 3FS overhead – Self-managing a 3FS cluster introduces significant additional complexity.
Announcing AWS Step Functions Workflow Studio for the VS Code IDE - AWS
Definite: Understanding smallpond and 3FS: A Clear Guide
- Smallpond’s distribution leverages Ray Core at the Python level, using partitions for scalability. Partitioning can be done manually, and Smallpond supports:
  - Hash partitioning (based on column values)
  - Even partitioning (by files or row counts)
  - Random shuffle partitioning
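Hash partitioning as described is a generic technique; here is a toy sketch of routing rows by a hashed key column (not Smallpond's actual API — the function and column names are illustrative):

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing a key column.
    Rows with equal keys always land in the same partition, so
    per-key work can run independently on each partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # hash() is fine for a sketch; real systems use a stable
        # hash so the assignment survives across processes.
        idx = hash(row[key]) % num_partitions
        partitions[idx].append(row)
    return partitions

rows = [{"user": u, "value": v}
        for u, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]]
parts = hash_partition(rows, "user", 4)
```

The invariant worth noting: both rows for user "a" end up in the same partition, which is what makes hash partitioning safe for per-key aggregation.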
Data Products: A Case Against Medallion Architecture by Modern Data 101
A non-beginner Data Engineering Roadmap — 2025 Edition by Ernani Castro
Open Source Data Engineering Landscape 2025 - Apache DolphinScheduler - Medium by Apache DolphinScheduler
- Further consolidation in the open table format space
- Continued evolution of zero-disk architectures in real-time and transactional systems
- Quest toward providing a unified lakehouse experience
- The rise of LLMOps and AI Engineering
- The expansion of the data lakehouse ecosystem in areas such as open catalog integration and development of native libraries
- The increasing traction of single-node data processing and embedded analytics
macOS Tips & Tricks - saurabhs.org
The Commitment Inventory: Get More Done By Saying “No”
- Important: When the timer goes off, start the next item immediately, even if you haven’t finished. Otherwise, you’ll get distracted, and your attention will lag
- I’m now working within a 50-minute “burst,” and I have 9:56 left before I have to move on to a different checklist within Pitches, so I am cranking out this draft much faster and with more focus than I otherwise would. If your resistance is high, start with a short burst, then add five minutes as you go. Forster recommends keeping it under 40, but when I’m writing, I use 50-minute bursts. Adapt to fit your needs. When you lose momentum, go back to 5 minutes. You can use bursts for your breaks, too. In 8 minutes, when I finish this burst, I will get a cup of coffee, stretch my legs, pet my dog Duke, and maybe read a quick news article. Forster recommends being strict with the timer, or your attention will drift.
If it is worth keeping, save it in Markdown - Piotr Migdał by Piotr Migdał
Does Ibis understand SQL? – Ibis by Deepyaman Datta
Concurrency Control in Open Data Lakehouse | Apache Hudi
- For cases where multiple writer jobs need to access the same table, Hudi supports multi-writer setups. This model allows disparate processes, such as multiple ingestion writers or a mix of ingestion and separate table service jobs, to write concurrently.
- Apache Iceberg supports multiple concurrent writes through Optimistic Concurrency Control (OCC). The most important part to note here is that Iceberg needs a catalog component to adhere to the ACID guarantees. Each writer assumes it is the only one making changes, generating new table metadata for its operation. When a writer completes its updates, it attempts to commit the changes by performing an atomic swap of the latest metadata.json file in the catalog, replacing the existing metadata file with the new one.
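The optimistic commit described above boils down to a compare-and-swap on the catalog's metadata pointer. A toy sketch (the `Catalog` class is a stand-in for a real Iceberg catalog, not its API):

```python
class Catalog:
    """Toy stand-in for an Iceberg catalog: holds a single pointer
    to the current metadata file and swaps it conditionally."""
    def __init__(self, initial):
        self.current = initial

    def compare_and_swap(self, expected, new):
        # In a real catalog this is one atomic operation (e.g. a
        # conditional update); a plain check suffices for a sketch.
        if self.current != expected:
            return False  # someone else committed first
        self.current = new
        return True

catalog = Catalog("v1.metadata.json")

# Writers A and B both read v1 and prepare new metadata,
# each assuming it is the only one making changes.
base = catalog.current
ok_a = catalog.compare_and_swap(base, "v2-a.metadata.json")
ok_b = catalog.compare_and_swap(base, "v2-b.metadata.json")
# B's swap fails: it must re-read the table state and retry.
```

This is why the catalog is essential to Iceberg's ACID story: without a single place to perform the atomic swap, two writers could both believe their commit succeeded.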
Towards composable data platforms — Jack Vanlightly by Jack Vanlightly
- The OTFs introduce a new abstraction layer that can be used to virtualize table storage. The key is that it allows for the separation of data from metadata and shared storage from compute. Through metadata, one table can appear in two data platforms, without data copying. To avoid further overloading the term data virtualization, I will use the term Table Virtualization.
- The Modern Data Stack (MDS) failed to sustain itself. Few wanted to compose a data architecture from 10-15 different vendors. People want to choose a small number of trusted vendors, and they want them to work together without a lot of toil and headaches.
Common pitfalls when building generative AI applications by Chip Huyen
- Use generative AI when you don’t need generative AI
- Start too complex
Examples of this pitfall:
- Use an agentic framework when direct API calls work.
- Agonize over what vector database to use when a simple term-based retrieval solution (that doesn’t require a vectordb) works.
- Insist on finetuning when prompting works.
- Use semantic caching.
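To make the "simple term-based retrieval" point concrete, it can be as small as this sketch — toy term-overlap scoring standing in for a proper scheme like BM25, with hypothetical documents:

```python
def term_score(query, doc):
    """Score a document by how many distinct query terms it contains."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d)

def search(query, docs, k=2):
    """Return the top-k documents by term overlap — no vector
    database, no embedding model."""
    return sorted(docs, key=lambda d: term_score(query, d), reverse=True)[:k]

docs = [
    "duckdb is an in-process analytical database",
    "postgres is a relational database",
    "the cat sat on the mat",
]
top = search("in-process analytical database", docs, k=1)
```

If a baseline this simple already answers your users' queries, the vector-database decision can wait.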
- Forgo human evaluation
To automatically evaluate AI applications, many teams opt for the AI-as-a-judge (also called LLM-as-a-judge) approach — using AI models to evaluate AI outputs. A common pitfall is forgoing human evaluation to rely entirely on AI judges.
Microsoft Introduces CoRAG: Enhancing AI Retrieval with Iterative Reasoning
The Quest to Understand Metric Movements - Pinterest Engineering Blog - Medium by Pinterest Engineering
- root-cause analysis (RCA)
- Slice and Dice: This approach finds clues for a metric movement by drilling down on specific segments within the metric; it has found successes at Pinterest, especially in diagnosing video metric regressions
- General Similarity: In this approach, we look for clues of why a metric movement happened by scanning through other metrics and finding ones that have moved very “similarly” in the same time period, whether in the same direction (positive association) or in the opposite direction (negative association).
- In practice, we have found that the first two factors, Pearson and Spearman’s rank correlations, work best because:
  - p-values can be computed for both, which helps to gauge statistical significance
  - both have more natural support for measuring negative associations between two time-series
  - non-monotonic (e.g. quadratic) relationships, for which Pearson and Spearman’s rank correlations won’t apply, don’t tend to arise naturally so far in our use-cases / time window of analysis
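A minimal pure-Python sketch of the two correlations (p-value computation omitted for brevity; in practice a library such as scipy returns both the coefficient and the p-value):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: strength of *linear* association."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    # Rank 1 for the smallest value; ties ignored in this sketch.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman's rank correlation: Pearson on the ranks, so it
    captures any monotonic (not just linear) association."""
    return pearson(ranks(x), ranks(y))

# A convex monotonic series: Spearman is exactly 1 while
# Pearson stays below 1, showing what each one measures.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
```

Running both on the same pair of series is a cheap way to tell "moves together linearly" apart from "moves together monotonically".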
- Experiment Effects: This third approach looks for clues of why metric movements happened by looking at what a lot of internet companies have: experiments.
- For each control and treatment group in an experiment, we perform a Welch’s t-test on the treatment effect, which is robust in the sense that it supports unequal variances between control and treatment groups. To further combat noise in the results, we filter experiments by each experiment’s harmonic mean p-value of its treatment effects over each day in the given time period, which helps limit false positive rates. We also detect imbalances in control and treatment group sizes (i.e., when they are being ramped up at a different rate from each other) and filter out cases when that happens.
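The Welch's t-test mentioned above can be computed directly from the two samples; a minimal sketch of the statistic and the Welch–Satterthwaite degrees of freedom (not Pinterest's pipeline, and without the p-value step):

```python
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples,
    robust to unequal variances between the groups."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb
    t = (ma - mb) / sqrt(se2)
    # Welch–Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

Unlike the classic Student's t-test, nothing here assumes the control and treatment groups share a variance, which is exactly why it suits experiment data.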
You Should Use /tmp/ More
How to add a directory to your PATH by Julia Evans
- Bash has three possible config files: ~/.bashrc, ~/.bash_profile, and ~/.profile.
python-build-standalone now has Python 3.14.0a5 by Simon Willison
How to Use a Microphone
Scale Out Batch Inference with Ray
Systematically Improving RAG Applications - Jason Liu by Jason Liu
- Re-Rankers: Instead of (or in addition to) fine-tuning a bi-encoder (embedding model), you might fine-tune a cross-encoder or re-ranker that scores each candidate chunk directly. Re-rankers can be slower but often yield higher precision. Typically, you do a quick vector search, then run re-ranking on the top K results.
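The retrieve-then-re-rank pattern in a toy sketch; both scoring functions are illustrative stand-ins (term overlap for the bi-encoder's vector search, a phrase heuristic for the cross-encoder):

```python
def first_pass_score(query, doc):
    """Cheap retrieval score — stand-in for a bi-encoder /
    vector-search similarity."""
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query, doc):
    """More expensive score — stand-in for a cross-encoder that
    reads query and document together."""
    # Toy heuristic: reward exact phrase containment.
    return 10 if query in doc else first_pass_score(query, doc)

def retrieve_then_rerank(query, docs, k=3):
    """Quick first pass to get the top-k candidates, then apply
    the expensive scorer only to those k candidates."""
    candidates = sorted(docs, key=lambda d: first_pass_score(query, d),
                        reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)

docs = ["brown fox fast", "a fast fox appears", "slow turtle"]
ranked = retrieve_then_rerank("fast fox", docs, k=2)
```

The structure is the point: the slow scorer only ever sees K documents, so its cost stays bounded no matter how large the corpus grows.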
Emerging Patterns in Building GenAI Products
- Self evaluation: Self-evaluation lets LLMs self-assess and enhance their own responses. Although some LLMs can do this better than others, there is a critical risk with this approach. If the model’s internal self-assessment process is flawed, it may produce outputs that appear more confident or refined than they truly are, leading to reinforcement of errors or biases in subsequent evaluations. While self-evaluation exists as a technique, we strongly recommend exploring other strategies.
- LLM as a judge: The output of the LLM is evaluated by scoring it with another model, which can either be a more capable LLM or a specialized Small Language Model (SLM). While this approach involves evaluating with an LLM, using a different LLM helps address some of the issues of self-evaluation. Since the likelihood of both models sharing the same errors or biases is low, this technique has become a popular choice for automating the evaluation process
- Human evaluation: Vibe checking is a technique to evaluate if the LLM responses match the desired tone, style, and intent. It is an informal way to assess if the model “gets it” and responds in a way that feels right for the situation. In this technique, humans manually write prompts and evaluate the responses. While challenging to scale, it’s the most effective method for checking qualitative elements that automated methods typically miss.
- However, embeddings are not ideal for structured or relational data, where exact matching or traditional database queries are more appropriate. Tasks such as finding exact matches, performing numerical comparisons, or querying relationships are better suited for SQL and traditional databases than embeddings and vector stores.
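A small example of that point: exact matches and numeric comparisons map directly onto SQL (here SQLite with a made-up `orders` table), with no embedding or vector store involved:

```python
import sqlite3

# In-memory table of orders; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "acme", 120.0), (2, "acme", 40.0), (3, "globex", 99.5)])

# Exact match plus a numeric comparison: a natural fit for SQL,
# and an exact answer — something similarity search can't guarantee.
rows = conn.execute(
    "SELECT id FROM orders WHERE customer = ? AND total > ?",
    ("acme", 50.0),
).fetchall()
```

An embedding lookup for "acme orders over 50" would return *similar* rows; the SQL query returns *exactly* the rows that satisfy the predicate.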
Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds
Jujutsu VCS Introduction and Patterns by Kuba Martin
- While in Git you generally organize your commits in branches, and a commit that’s not part of a branch is scarily called a “detached HEAD”, in jj it’s completely normal to work on changes that are not on branches. jj log is the main command to view the history and tree of changes, and will default to showing you a very reasonable set of changes that should be relevant to you right now - that is (more or less) any local mutable changes, as well as some additional changes for context (like the tip of your main branch).
The Rise of Single-Node Processing: Challenging the Distributed-First Mindset by Alireza Sadeghi
The End of the Bronze Age: Rethinking the Medallion Architecture by
Bias and Fairness in Natural Language Processing - Thomson Reuters Labs - Medium by Navid Rekabsaz
The Emerging Role of AI Data Engineers - The New Strategic Role for AI-Driven Success by Ananth Packkildurai
Google Releases PaliGemma 2 Vision-Language Model Family by
Catching memory leaks with your test suite by Itamar Turner-Trauring
JavaScript Temporal is coming | MDN Blog by
Amazon (S3) Tables by Daniel Beach
Rill | Designing a Declarative Data Stack: From Theory to Practice by
The Hidden Cost of Over-Abstraction in Data Teams by Zakaria Hajji
Staff Engineer vs Engineering Manager by Alex Ewerlöf
PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts by Brenda Potts
How AI is unlocking ancient texts — and could rewrite history by Jo Marchant
Glue work considered harmful by
Goodbye Github Pages, Hello Coolify · Quakkels.com by
Tools Worth Changing To in 2025 by Matthew Sanabria
Databases in 2024: A Year in Review by
Dismantling ELT: The Case for Graphs, Not Silos — Jack Vanlightly by Jack Vanlightly
Scalar and binary quantization for pgvector vector search and storage by Jonathan Katz
Turbocharge Efficiency & Slash Costs: Mastering Spark & Iceberg Joins with Storage Partitioned Join by Samy Gharsouli
Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality | Amazon Web Services by
Designing data products by
On writing and getting from zero to done — Jack Vanlightly by Jack Vanlightly
Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena | Amazon Web Services by
React v19 – React by
Tech predictions for 2025 and beyond by Dr Werner Vogels - https://www.allthingsdistributed.com/
First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin) by
Use open table format libraries on AWS Glue 5.0 for Apache Spark | Amazon Web Services by
Migrating AWS Glue for Spark jobs to AWS Glue version 5.0 - AWS Glue by
Building Confidence: A Case Study in How to Create Confidence Scores for GenAI Applications - Spotify Engineering by alexandrawei
Storing times for human events by
DataFrames at Scale Comparison: TPC-H by
Enabling compaction optimizer - AWS Glue by
Top Python Web Development Frameworks in 2025 · Reflex Blog by
Building Python tools with a one-shot prompt using uv run and Claude Projects by
The Part of PostgreSQL We Hate the Most // Blog // Andy Pavlo - Carnegie Mellon University
What goes into bronze, silver, and gold layers of a medallion data architecture? | by Lak Lakshmanan | Sep, 2024 | Medium by Lak Lakshmanan
Using DuckDB-WASM for in-browser Data Engineering by Tobias Müller
Go talk to the LLM
The CDC MERGE Pattern. by Ryan Blue | by Tabular | Medium by Tabular
FireDucks : Pandas but 100x faster
#!/usr/bin/env -S uv run by
Amazon Data Firehose supports continuous replication of database changes to Apache Iceberg Tables in Amazon S3 - AWS by
Anthropic’s upgraded Claude 3.5 Sonnet model and computer use now in Amazon Bedrock - AWS
- It will be fun to see how the computer use API will evolve
It’s Not Easy Being Green: On the Energy Efficiency of Programming Languages
NotebookLM | Note Taking & Research Assistant Powered by AI
- Very interesting experiment from Google which provides a summary of any article, video or file. Additionally, it can generate a podcast talking about it!
Talks - Juliana Ferreira Alves: Improve Your ML Projects: Embrace Reproducibility and Production… by PyCon US
- This truly looks like a good improvement to an ML project 👀
Copy-on-Write (CoW) — pandas 2.2.3 documentation
- Pretty big change to pandas 3.0 but one I think will bring a lot of clarity to data transformations
Announcing DuckDB 1.1.0 – DuckDB by The DuckDB team
Introducing Contextual Retrieval \ Anthropic
What is a Vector Index? An Introduction to Vector Indexing by Alejandro Cantarero, Field CTO of AI, DataStax
Chunking · Malaikannan
Chronon - Airbnb’s End-to-End Feature Platform - InfoQ by Nikhil Simha
- This is a very high-level overview of the feature platform, but it allowed me to get a better sense of why to use it. I’d love to test it to understand whether, for teams without streaming, it isn’t a bit overkill
Deno 2.0 Release Candidate
Time spent programming is often time well spent - Stan Bright
What I tell people new to on-call | nicole@web
What is io_uring?
Memory Management in DuckDB – DuckDB by Mark Raasveldt
Yuno | How Apache Hudi transformed Yuno’s data lake
Google Proposes Adding Pipe Syntax to SQL - InfoQ by Renato Losio
PostgreSQL: PostgreSQL 17 Released!
The sorry state of Java deserialization @ marginalia.nu
- Interesting to see that DuckDB is serving as the benchmark for processing data on the order of GBs
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
I Like Makefiles by Sebastian Witowski
- I tend to avoid Makefiles because I hadn’t looked at how they worked and always associated them with Java projects. I might want to take another look at them
Rethinking Analyst Roles in the Age of Generative AI by Ben Lorica 罗瑞卡
No, Data Engineers Don’t NEED dbt. | by Leo Godin | Jul, 2024 | Data Engineer Things by Leo Godin
Got my reading list down from 100+ articles to 58! 🎉
Predicting the Future of Distributed Systems by Colin Breck
Continuous reinvention: A brief history of block storage at AWS | All Things Distributed by Werner Vogels
Iceberg vs Hudi — Benchmarking TableFormats | by Mudit Sharma | Aug, 2024 | Flipkart Tech Blog by Mudit Sharma
Splicing Duck and Elephant DNA by Jordan Tigani, Brett Griffin
How we sped up Notion in the browser with WASM SQLite by Carlo Francisco
Table format comparisons - How do the table formats represent the canonical set of files? — Jack Vanlightly by Jack Vanlightly
Should you be migrating to an Iceberg Lakehouse? | Hightouch by Hugo Lu
How data departments have evolved and spread across English football clubs - The Athletic by Mark Carey
How to use AI coding tools to learn a new programming language - The GitHub Blog by Sara Verdi
How to choose the best rendering strategy for your app – Vercel by Alice Alexandra Moore, Sr. Content Engineer, Vercel
Amazon’s Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2 | AWS Open Source Blog
Don’t Use JS for That: Moving Features to CSS and HTML by Kilian Valkhof by JSConf
AWS Chatbot now allows you to interact with Amazon Bedrock agents from Microsoft Teams and Slack - AWS
Astro 5.0 Beta Release | Astro by Erika
Making progress on side projects with content-driven development | nicole@web
AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables - AWS
Introducing OpenAI o1 | OpenAI
Rewrite Bigdata in Rust by Xuanwo
Recommending for Long-Term Member Satisfaction at Netflix | by Netflix Technology Blog | Aug, 2024 | Netflix TechBlog by Netflix Technology Blog
- TIL: reward engineering. Measure proxy features and optimize for them to get to your actual end goal
We need to talk about ENUMs | boringSQL by Radim Marek
I spent 8 hours learning Parquet. Here’s what I discovered | by Vu Trinh | Aug, 2024 | Data Engineer Things by Vu Trinh
What I Gave Up To Become An Engineering Manager by Suresh Choudhary
Unpacking the Buzz around ClickHouse - by Chris Riccomini by Chris Riccomini
Introducing job queuing to scale your AWS Glue workloads | AWS Big Data Blog
”SRE” doesn’t seem to mean anything useful any more
Microsoft Launches Open-Source Phi-3.5 Models for Advanced AI Development - InfoQ by Robert Krzaczyński
Production-ready Docker Containers with uv by Hynek Schlawack
Many of us can save a child’s life, if we rely on the best data - Our World in Data by Max Roser
- Another article that I find eye opening when we look at data to improve our decisions
Why I Still Use Python Virtual Environments in Docker by Hynek Schlawack
- Suddenly this way of working, very similar to node modules, is being talked about everywhere.
Python Developers Survey 2023 Results
- Was looking for this for a long time. Got to learn a bit more about twine, mlflow and sqlmodel. In terms of keeping up with the Python world, I think I’m good with my current RSS feed and podcast
Elasticsearch is Open Source, Again | Elastic Blog by Shay Banon, 29 August 2024
Monitor data quality in your data lake using PyDeequ and AWS Glue | AWS Big Data Blog
- Data drift detection 👀
- Makes sense. Pydeequ is just a wrapper
How top data teams are structured by Mikkel Dengsøe
- As I work on a team that doesn’t have this kind of distribution I need to reflect a bit on the impact of being the sole data engineer on an ML team
Talks - Amitosh Swain: Testing Data Pipelines by PyCon US
uv: Unified Python packaging
Wow, need to test this out on my side projects.
CSS finally adds vertical centering in 2024 | Blog | build-your-own.org by James Smith
Timeless Skills For Data Engineers And Analysts by SeattleDataGuy
Skimmed the article; although high level, I find the main points very true: understanding the system, and staying on top of the state of the art. That, plus the tips for heads of data
CSS 4, 5, and 6! With Google’s Una and Adam by Syntax
Great episode! Nice to see that css is getting better and better
I’ve Built My First Successful Side Project, and I Hate It by Sebastian Witowski
Wow, I’d love this one-man SaaS, but knowing how it can burn you out…
NodeJS Evolves by Syntax
Nice episode on the long-sought features for Node that were introduced first in Bun and Deno (single-file executables, TypeScript support and top-level await are the big ones for me)
Talks - Brandt Bucher: Building a JIT compiler for CPython by PyCon US
Python Insider: Python 3.13.0 release candidate 1 released
Great to see a new version coming along! Is pdb worth using with vs code? 🤔
Google kills Chromecast, replaces it with Apple TV and Roku Ultra competitor | Ars Technica by Samuel Axon
As a Google device owner, it would be good to know that my working hardware won’t be bricked just because I don’t want to upgrade
Can Reading Make You Happier? | The New Yorker by Ceridwen Dovey
Beyond Hypermodern: Python is easy now - Chris Arderne
Although I already had my eye on rye, I certainly didn’t know it was so full-featured. Gotta try moving some of my projects to it
Visual Studio Code July 2024 by Microsoft
Python support is slowly getting good
Introducing GitHub Models: A new generation of AI engineers building on GitHub - The GitHub Blog by Thomas Dohmke
Creativity Fundamentally Comes From Memorization
Gen AI Increases Workloads and Decreases Productivity Upwork Study Finds - InfoQ by Sergio De Simone
The report raises a good point about an increase in workload: with a big productivity boost, bosses might start assigning workloads even bigger than what the productivity gains cover.
Of course this isn’t good, as people will feel overloaded
tea-tasting: a Python package for the statistical analysis of A/B tests | Evgeny Ivanov by Evgeny Ivanov
Super interesting IMO. Might give it a try on my team when we start deploying solutions to our clients
Data Science Spotlight: Cracking the SQL Interview at Instacart (LLM Edition) | by Monta Shen | Jul 2024 | tech-at-instacart by Monta Shen
Would be interesting to test this out: take an example dataset that can be queried using DuckDB and, given a question, check whether a query is correct, how to fix it, and how to improve its performance. One in SQL and another in PySpark (or Ibis/pandas)
Are you a workaholic? Here’s how to spot the signs | Ars Technica by Chris Woolston, Knowable Magazine
At the Olympics, AI is watching you | Ars Technica by Morgan Meaker, WIRED.com
Crazy to see that with AI it’s now actually possible to surveil an entire city with cameras
Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect by Databricks
The English SDK looks awesome but requires an OpenAI key. Could it be replaced with Ollama?
Let’s Consign CAP to the Cabinet of Curiosities - Marc’s Blog by Marc Brooker
Interesting topics to research a bit more on CAP alternatives
Why Your Generative AI Projects Are Failing by Ben Lorica 罗瑞卡
In summary, to have a good AI product we need quality data, which requires good data governance. With this data we need to define useful products whose value we can measure using data-driven metrics. We must also ensure the product follows good practices, avoiding security or bias issues
Engage your audience by getting to the point, using story structure, and forcing specificity – Ian Daniel Stewart
Slack Conquers Deployment Fears with Z-score Monitoring - InfoQ by Matt Saunders
This is something I would love to implement: allowing teams to define the metrics on which to evaluate a new feature and the expected hypothesis, and to revert the feature (i.e., via feature flags) automatically, with a report on the experiment
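The z-score check at the heart of that idea is small enough to sketch. This assumes a baseline window of pre-deploy metric samples and a fixed threshold (my assumptions, not Slack's implementation):

```python
from math import sqrt

def z_score(value, baseline):
    """How many standard deviations `value` sits from the mean
    of the pre-deploy baseline window."""
    n = len(baseline)
    mean = sum(baseline) / n
    var = sum((x - mean) ** 2 for x in baseline) / (n - 1)
    return (value - mean) / sqrt(var)

def should_rollback(metric_after_deploy, baseline, threshold=3.0):
    """Flag a deploy when the post-deploy metric is an outlier
    relative to the baseline — the trigger for an automatic revert."""
    return abs(z_score(metric_after_deploy, baseline)) > threshold

# Hypothetical error rates sampled before a deploy.
baseline_error_rate = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011]
```

Wiring `should_rollback` to a feature flag, plus logging the z-score, would give exactly the automatic revert-with-report loop described above.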
DuckDB + dbt : Accelerating the developer experience with local power - YouTube
Could I replace Athena with this? I think the main blocker is that I want to work with S3. And I need to check how it runs for a really large dataset…
How Unit Tests Really Help Preventing Bugs | Amazing CTO
Good tip. For any project, define code coverage goals and keep increasing coverage over the life of the project
Mocking is an Anti-Pattern | Amazing CTO
Building an open data pipeline in 2024 - by Dan Goldin by Dan Goldin
Spark-Connect: I’m starting to love it! | Sem Sinchenko by Sem Sinchenko
This article wasn’t properly parsed by Omnivore, but the big takeaways:
- We can add plugins for extended functionality to our Spark server
- Using Spark Connect we can implement any library in any language we want and send gRPC requests to the server (the Spark Connect server needs to be running)
- Spark Connect works on Spark 3.5+ and should be much better on v4.0
- If I truly want to be good at Spark I eventually need to relearn Scala/Java
- Glue is cool but it’s still on version 3.3. All these goodies will take too long to be implemented in Glue
So you got a null result. Will anyone publish it? by Kozlov, Max
After reading the statistics books I can see much more clearly the value of proving a null hypothesis; this is the feeling I am getting from academia: we are seeing more research without any added value. Goodhart’s law.
Maestro: Netflix’s Workflow Orchestrator | by Netflix Technology Blog | Jul, 2024 | Netflix TechBlog by Netflix Technology Blog
Sounds just like an Airflow contender, with the plus of being able to run notebooks 🤔
How to Create CI/CD Pipelines for dbt Core | by Paul Fry | Medium by Paul Fry
Simplify PySpark testing with DataFrame equality functions | Databricks Blog by Haejoon Lee, Allison Wang and Amanda Liu
Good theme for a blog post on the changes in Spark 4; this is really useful for human errors (been there multiple times)
How to build a Semantic Layer in pieces: step-by-step for busy analytics engineers | dbt Developer Blog by Gwen Windflower
So this will be generated on the fly as views by the semantic layer? This looked neat until the moment I understood that the semantic layer requires dbt Cloud
Meta’s approach to machine learning prediction robustness - Engineering at Meta by Yijia Liu, Fei Tian, Yi Meng, Habiya Beg, Kun Jiang, Ritu Singh, Wenlin Chen, Peng Sun
Meta seems to be a couple of years ahead of the industry. The article doesn’t provide a lot of insights but gives a feeling of their model evaluation being mostly automated + having a good AI debug tool
Free-threaded CPython is ready to experiment with! | Labs
First step on a long road before we can run Python without the GIL. Interested in seeing if libraries like pandas will eventually be able to leverage multithreading with this
Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview | AWS Big Data Blog
Mixed feelings here. Great to see Open Lineage implemented at AWS. However it feels again that AWS just created the integration and won’t be driving the development of open lineage
Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack - Slack Engineering by Nilanjana Mukherjee
- What could be improved to help this kind of migration be done in a matter of days?
- Livy might be deprecated in favor of Spark Connect. With their migration to Spark 3 and eventually 3.5 (not clear in this article) they could be interested in moving new jobs to Connect
- They basically solved issues by using the old behaviours. These will need to be migrated eventually; would need to better understand these features
- This looks like an important detail: with no explicit order, Spark can return rows in random order?
- Cool to see these migrations and teams using open source solutions. EMR, although expensive, can prove to be quite cost effective with a good engineering team
DuckDB Community Extensions – DuckDB by The DuckDB team
Visual Studio Code June 2024 by Microsoft
The Rise of the Data Platform Engineer - by Pedram Navid by Pedram Navid
The need to define a data platform is something I see everywhere. It really looks like we are missing a piece here. Netflix’s Maestro, for example, seems like a good contender to solve the issue of (yet another) custom data platform
Lambda, Kappa, Delta Architectures for Data | The Pipeline by Venkatesh Subramanian
Write-Audit-Publish (WAP) Pattern - by Julien Hurault by Julien Hurault
This article brings me to the question: can we improve dbt by using WAP? How does the rollback mechanism work when a process fails?
AWS Batch Introduces Multi-Container Jobs for Large-Scale Simulations - InfoQ by Renato Losio
Data Council 2024: The future data stack is composable, and other hot takes | by Chase Roberts | Vertex Ventures US | Apr, 2024 | Medium by Chase Roberts
Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions | AWS Big Data Blog
Super interesting to see how we can enable data quality visibility
Pyspark 2023: New Features and Performance Improvement | Databricks Blog
Cost Optimization Strategies for scalable Data Lakehouse by Suresh Hasundi
Good use of open data lakes, showing the big cost and speed improvements
Prompt injection and jailbreaking are not the same thing
Flink SQL and the Joy of JARs by Robin Moffatt
Catalogs in Flink SQL—Hands On by Robin Moffatt
Catalogs in Flink SQL—A Primer by Robin Moffatt
Why is this so hard? 😭