Data Impostor

Blog Talks Reads About
Follow me on Mastodon Go to my GitHub repo Go to my RSS feed
  • Jun 4, 2025


    Google Releases LMEval, an Open-Source Cross-Provider LLM Evaluation Tool
  • May 29, 2025


    ClickHouse vs StarRocks vs Presto vs Trino vs Apache Spark™ — Comparing Analytics Engines
  • May 27, 2025


    Making Sense of Apache Iceberg Statistics - Yuval Yogev - Medium by Yuval Yogev
    • data-level (what’s inside the files) and metadata-level (how the files are organized)
    • Theta sketch — a probabilistic data structure for estimating NDV (number of distinct values) for each column.
    Grey box and outcome driven engineering
    • But how can you trust code you haven’t reviewed?” I hear the traditionalists cry.Simple: I don’t trust the code. I trust the verification process.
    • In contrast, the AI-assisted Grey Box paradigm completely separates validation from implementation. You define what correctness looks like through expected outcomes and verification criteria, but delegate both test implementation and solution implementation to AI systems
    • There are still times when opening the box makes sense:When verification fails in unexpected waysWhen you need to modify the implementation for a new use caseWhen you’re curious about how something worksWhen you need to teach others
    14 Advanced Python Features
    • Quick bonus trick: As you probably saw, Python also has support for String Literals. These help assert that only specific string values can be passed to a parameter, giving you even more type safety. Think of them like a lightweight form of Enums!
    • type Vector = list[float]
    • That’s where Protocols come in. Protocols (also known as Structural Subtyping) are typing classes in Python defining the structure or behavior that classes can follow without the use of interfaces or inheritance.
  • May 25, 2025


    Announcing a new IDE for PostgreSQL in VS Code from Microsoft
    Everything You Need to Know About Incremental View Maintenance by Chris
  • May 24, 2025


    Decibels are ridiculous by lcamtuf
    Claude and I write a utility program
  • May 23, 2025


    Announcing DuckDB 1.3.0 by The DuckDB team
    How to Evaluate AI that’s Smarter than Us - ACM Queue by Chip Huyen
    • Functional correctness – evaluating AI by how well it accomplishes its intended tasks.
    • AI-as-a-judge – using AI instead of human experts to evaluate AI outputs.
    • Comparative evaluation – evaluating AI systems in relationship with each other instead of independently.
    • MMLU (Massive Multitask Language Understanding
    • For some applications, figuring out evaluation can take up the majority of the development effort.
    • Because evaluation is difficult, many people settle for word of mouth (e.g., someone says that model X is good) or eyeballing the results (also known as vibe check). This creates even more risk and slows down application iteration. Instead, an investment in systematic evaluation is needed to make the results more reliable
    • your task can be evaluated by functional correctness, that’s what you should do
    • Limitations of AI-as-a-judge
    • Despite the many advantages of AI-as-a-judge, some teams are hesitant to adopt this approach. Using AI to evaluate AI seems tautological. The probabilistic nature of AI makes it seem too unreliable to act as an evaluator. AI judges can potentially introduce nontrivial costs and latency to an application. Given these limitations, some teams see AI-as-a-judge as a fallback option when they don’t have any other way of evaluating their systems
    • One big question is whether the AI judge needs to be stronger than the model being evaluated
    • Using a model to judge itself—self-evaluation or self-critique—sounds like cheating, especially because of self-bias. Self-evaluation can be great for sanity checks, however. If a model thinks its response is incorrect, the model might not be that reliable. Beyond sanity checks, asking a model to evaluate itself can nudge the model to revise and improve its responses
    • With comparative evaluation, you evaluate models against each other and compute a ranking from comparison results. For responses whose quality is subjective, comparative evaluation is typically easier to do than pointwise evaluation
    Reducing Runtime Errors in Spark: Why We Migrated from DataFrame to Dataset by Agoda Engineering
    • Off-heap memory refers to memory that is managed outside the JVM heap. It is directly allocated and managed by Spark, bypassing the JVM’s garbage collection (GC) mechanism
    • Tungsten avoids creating individual JVM objects for each row or column. Instead, it uses a binary format to represent data in memory.For example, instead of creating an object for each record, Spark stores the data as a contiguous block of memory in a serialized format, which is faster to process.
    • How Dataset Reduces Runtime Errors
    • Dataset provides type safety at compile time, while DataFrame does not.
    • No Hard-Coded Column Names
    • Schema Awareness and Readability
    • Encoders for Optimized Serialization/Deserialization
    • Encoders generate schema-specific bytecode, which is then compiled into JVM bytecode. This bytecode is highly optimized for execution.The JVM’s Just-In-Time (JIT) compiler dynamically compiles frequently executed code paths (hot code) into native machine code at runtime, further improving performance.
  • May 22, 2025


    Three AI Design Patterns of Autonomous Agents by Alexander Sniffin
    • ReAct AgentTask-Planner AgentMulti-Agent Orchestration
    • The benefits of implementing the agent this way are:predictabilitytasks are isolated from other stateseasy to troubleshooteasy to add new states
    • Potential problems include:prone to getting stuck in loopscan get side tracked or lose focus from the original request
  • May 21, 2025


    How One Company Secretly Poisoned The Planet by Veritasium
  • May 14, 2025


    Ensuring Data Contracts adoption across an organization by Pierre-Yves BONNEFOY
    Everything Wrong with MCP by Shrivu Shankar
    • MCP initially didn’t define an auth spec and now that they have people don’t like it.
    • MCP has no concept or controls for tool-risk levels.
    • MCP has no concept or controls for costs.
    • MCP makes it easier to accidentally expose sensitive data.
  • May 9, 2025


    What’s new in pip 25.1 - Dependency groups!
    Powering Apache Pinot ingestion with Hoptimator
    • Hoptimator was developed to empower data consumers to create and control their own data pipelines. At LinkedIn, Hoptimator powers subscriptions, which represent an ongoing request for data. Data consumers can create, modify, and delete subscriptions dynamically via the Subscription API. This service leverages Hoptimator to orchestrate end-to-end multi-hop data pipelines that deliver the requested data
    The One-Person Framework in practice
    The One-Person Framework in practice
  • May 8, 2025


    Throwing it all away - how extreme rewriting changed the way I build databases
  • May 7, 2025


    FastAPI-MCP: Simplifying the Integration of FastAPI with AI Agents
  • May 6, 2025


    AWS Introduces MCP Servers for AI-Assisted Cloud Development
    AI assisted search-based research actually works now by Simon Willison
  • May 4, 2025


    Understanding the recent criticism of the Chatbot Arena by Simon Willison
    Making PyPI’s test suite 81% faster
  • Apr 19, 2025


    Manage concurrent write conflicts in Apache Iceberg on the AWS Glue Data Catalog | Amazon Web Services
  • Apr 16, 2025


    Technology Radar | Guide to technology landscape
  • Apr 13, 2025


    CaMeL offers a promising new direction for mitigating prompt injection attacks by Simon Willison
    • So, are prompt injections solved now?
    • No, prompt injection attacks are not fully solved. While CaMeL significantly improves the security of LLM agents against prompt injection attacks and allows for fine-grained policy enforcement, it is not without limitations.
    • Importantly, CaMeL suffers from users needing to codify and specify security policies and maintain them. CaMeL also comes with a user burden. At the same time, it is well known that balancing security with user experience, especially with de-classification and user fatigue, is challenging
    • CaMeL really does represent a promising path forward though: the first credible prompt injection mitigation I’ve seen that doesn’t just throw more AI at the problem and instead leans on tried-and-proven concepts from security engineering, like capabilities and data flow analysis.
    How Meta is Using a New Metric for Developers: Diff Authoring Time
    • Diff Authoring Time (DAT)
    • By tracking the time from the initiation of a code change to its submission, DAT offers insights into the efficiency of the development process and helps identify areas for improvement.
    • For instance, DAT has been instrumental in evaluating the impact of introducing a type-safe mocking framework in Hack, leading to a 14% improvement in authoring time. Additionally, the development of automatic memoization in the React compiler resulted in a 33% improvement, and efforts to promote code sharing have saved thousands of DAT hours annually, achieving over a 50% improvement. ​
    Maximizing Your Delta Scan Performance in DuckDB by Sam Ansmink
  • Apr 12, 2025


    The 2025 AI Index Report | Stanford HAI
    Is not writing tests unprofessional?
    The Best Programmers I Know | Matthias Endler by Matthias Endler
    • Read the Reference
    • Don’t go to Stack Overflow, don’t ask the LLM, don’t guess, just go straight to the source. Oftentimes, it’s surprisingly accessible and well-written
    • Know Your Tools Really Well
    • For example, if you are a backend engineer and you make heavy use of Kafka, I expect you to know a lot about Kafka – not just things you read on Reddit. At least that’s what I expect if you want to be one of the best engineers
    • Read The Error Message
    • Turns out, if you just sit and meditate about the error message, it starts to speak to you
    • Break Down Problems
    • you work as a professional developer, that is the bulk of the work you get paid to do: breaking down problems. If you do it right, it will feel like cheating: you just solve simple problems until you’re done.
    • Don’t Be Afraid To Get Your Hands Dirty
    • Before you know it, they become the go-to person in the team for whatever they touched. Mostly because they were the only ones who were not afraid to touch it in the first place.
    • Always Help Others
    • A related point. Great engineers are in high demand and are always busy, but they always try to help. That’s because they are naturally curious and their supportive mind is what made them great engineers in the first place. It’s a sheer joy to have them on your team, because they are problem solvers.
    • Write
    • Most awesome engineers are well-spoken and happy to share knowledge.The best have some outlet for their thoughts: blogs, talks, open source, or a combination of those.
    • Never Stop Learning
    • Have Patience
    • You need patience with computers and humans. Especially with yourself
    • Never Blame the Computer
    • The best keep digging until they find the reason. They might not find the reason immediately, they might never find it, but they never blame external circumstances.
    • Don’t Be Afraid to Say “I Don’t Know”
  • Apr 10, 2025


    Data Wants to Be Free: Fast Data Exchange with Apache Arrow by David Li, Ian Cook, Matt Topol
    • So generally, you can use Arrow data as-is without having to parse every value.
    • By using Arrow for serialization, data coming off the wire is already in Arrow format, and can furthermore be directly passed on to DuckDB, pandas, polars, cuDF, DataFusion, or any number of systems.
    • Arrow IPC defines how to serialize and deserialize Arrow data
    • And finally, ADBC actually isn’t a protocol. Instead, it’s an API abstraction layer for working with databases (like JDBC and ODBC—bet you didn’t see that coming), that’s Arrow-native and doesn’t require transposing or converting columnar data back and forth. ADBC gives you a single API to access data from multiple databases, whether they use Flight SQL or something else under the hood, and if a conversion is absolutely necessary, ADBC handles the details so that you don’t have to build out a dozen connectors on your own.
    Beyond thumbs up and thumbs down: A human-centered approach to evaluation design for LLM products by shima ghassempour
    • For example, a model may generate output that could technically be accurate, but those suggestions may not always be useful to the people or processes they support. Improving workflows or process efficiency may occur at different stages and require different metrics.
    • Ensure the realization of intended business value by aligning system outputs and performance to real world expectations.Build trust with end users and stakeholders by ensuring solution reliability in diverse scenarios.Identify areas for performance improvement by pinpointing systematic gaps in model performance.Improve user satisfaction by allowing end users to provide feedback, and refining responses accordingly.Adapt to real-world use cases and ensure stable system performance over time, as the iterative nature of evaluation helps it to remain relevant within changing and unexpected real-world conditions and to adjust promptly.
    • Lack of granularity in feedback
    • Binary feedback often fails to capture why a response was unsatisfactory — whether it lacked accuracy, completeness, or the right tone
    • Accounting for variation in human judgment:
    • Collecting human feedback without appropriate context and judgment can introduce variability that is difficult to interpret and understand
    • Bias in feedback
    • Emotions, prior experiences, and context may influence feedback, leading to skewed data. For example, feedback from someone with 10 years of experience versus someone new to the job may vary significantly, influencing evaluation outcomes
    • Automated and human-in-the-loop evaluation
    • Combining automated evaluation metrics (e.g., BLEU, ROUGE, or perplexity scores) with human feedback provides a holistic view on system performance. Periodic human-in-the-loop testing ensures that the model meets quality standards before deployment.
    • A/B testing to compare new model versions or evaluation designs and see which delivers better outcomes.Gradual rollout, which is when new model performance versions are released to a small portion of users and performance metrics are closely monitored.Shadow mode release to allow evaluation of the model on real scenarios without exposing the outputs to the real users
    • Testing with users and experts early and often
    Develop and test AWS Glue 5.0 jobs locally using a Docker container | Amazon Web Services
  • Apr 9, 2025


    How to Write Blog Posts that Developers Read
    Upskilling data engineers | Georg Heiler by Georg Heiler
    How AI will disrupt data engineering as we know it
    • Hardly. Data engineers, one of the hottest jobs of the last decade, will stay hot. But practitioners will be pushed in one of three directions: towards the business domain, towards automation, or towards the underlying data platform.Data platform engineers will become ever-more-important. They don’t spend their time building pipelines, but rather on the infrastructure that pipelines are built on. They are responsible for performance, quality, governance, uptime.Automation engineers will sit side-by-side with data teams and take the insights coming out of data and build business automations around it. As a data leader recently told me: “I’m no longer in the business of insights. I’m in the business of creating action.”Data engineers that are primarily obsessed with business outcomes will have ample opportunity to act as enablement and support for the insight-generation process, from owning and supporting datasets to liaising with stakeholders. The value to the business won’t change, but the way the job is done will.
    Can we make AI less power-hungry? These researchers are working on it. by Jacek Krywko
  • Apr 8, 2025


    AI 2027
    Advanced RAG Techniques: What They Are & How to Use Them by Guy Korland
    • Semantic chunking is a method of dividing text into segments based on their inherent meaning rather than adhering to predetermined lengths. This ensures that each segment, or “chunk,” encapsulates a complete and meaningful portion of information.
    • GraphRAG applications already utilize this technique which contributes to their effectiveness. In these systems, the LLM translates the user query into knowledge graph Cypher queries, which are then used to query the knowledge graph and retrieve relevant information.
    • The retrieval process begins with broader chunks or parent nodes, followed by a more focused search within smaller chunks or child nodes linked to the selected parent nodes. Hierarchical indexing not only improves retrieval efficiency but also minimizes the inclusion of irrelevant data in the final output
    • Self-query retrieval is a technique in which the language model (LLM) generates follow-up queries based on the initial user query to retrieve more precise information. For example, this method allows for the extraction of metadata from the user’s query, enabling a search over filtered data to achieve more accurate results.
    • Refining Through Data: By carefully curating data and using it to train the model, you enable it to differentiate more effectively between relevant and irrelevant information. This process sharpens the model’s ability to retrieve the most pertinent results.Better Performance Metrics: The outcome of fine-tuning is a marked improvement in retrieval accuracy and efficiency, facilitating better user experiences and decision-making
    • Reranker models, such as Cohere’s Rerank3, are specialized AI models that assess and prioritize the relevance of these retrieved documents in relation to a user’s query. Operating on a smaller set of candidate documents, these models focus on fine-tuning rankings based on the context of both the query and the documents. Typically trained on datasets containing examples of relevant and irrelevant documents, rerankers can effectively distinguish high-quality results from less relevant ones.
    • To refine your search results, you can employ Corrective Retrieval-Augmented Generation (Corrective RAG or CRAG). This technique involves scoring and filtering the retrieved documents based on their relevance and accuracy concerning your query.
    • Chain-of-thought prompting is particularly effective when dealing with complex queries, where the LLM needs to reason to generate the final response. Frameworks like DSPy are particularly capable of Chain-of-Thought prompting
    • LangChain, LlamaIndex, and DSPy, which offer powerful modules to help you integrate these advanced RAG strategies into your workflows.
    Comparing Open-Source AI Agent Frameworks
    • Developers who prefer to model AI tasks in stateful workflows often gravitate toward LangGraph. If your application demands robust task decomposition, parallel branching, or the ability to inject custom logic at specific stages, you might find LangGraph’s explicit approach a good fit.
    • you are already deep into OpenAI’s stack and want an officially supported solution to spin up agents that utilize GPT-4o or GPT-o3, the OpenAI Agents SDK might be your first stop.
    A Field Guide to Rapidly Improving AI Products – Hamel’s Blog by Hamel Husain
    • This is a pattern I’ve seen repeatedly: teams build evaluation systems, then gradually lose faith in them. Sometimes it’s because the metrics don’t align with what they observe in production. Other times, it’s because the evaluations become too complex to interpret. Either way, the result is the same – the team reverts to making decisions based on gut feeling and anecdotal feedback, undermining the entire purpose of having evaluations.
    • The teams that maintain trust in their evaluation systems embrace this reality rather than fighting it. They treat evaluation criteria as living documents that evolve alongside their understanding of the problem space. They also recognize that different stakeholders might have different (sometimes contradictory) criteria, and they work to reconcile these perspectives rather than imposing a single standard.
    • The most successful teams take a more measured approach:

    Start with high human involvement: In the early stages, have domain experts evaluate a significant percentage of outputs. Study alignment patterns: Rather than automating evaluation, focus on understanding where automated evaluations align with human judgment and where they diverge. This helps you identify which types of cases need more careful human attention. Use strategic sampling: Rather than evaluating every output, use statistical techniques to sample outputs that provide the most information, particularly focusing on areas where alignment is weakest. Maintain regular calibration: Even as you scale, continue to compare automated evaluations against human judgment regularly, using these comparisons to refine your understanding of when to trust automated evaluations.

    • Instead of defining success as shipping a feature, the capability funnel breaks down AI performance into progressive levels of utility. At the top of the funnel is the most basic functionality – can the system respond at all? At the bottom is fully solving the user’s job to be done. Between these points are various stages of increasing usefulness. For example, in a query assistant, the capability funnel might look like: 1. Can generate syntactically valid queries (basic functionality) 2. Can generate queries that execute without errors 3. Can generate queries that return relevant results 4. Can generate queries that match user intent 5. Can generate optimal queries that solve the user’s problem (complete solution)
    • The most successful teams I’ve worked with structure their roadmaps around experiments rather than features. Instead of committing to specific outcomes, they commit to a cadence of experimentation, learning, and iteration.
    • Perhaps the most counterintuitive aspect of this approach is the emphasis on learning from failures. In traditional software development, failures are often hidden or downplayed. In AI development, they’re the primary source of learning
    • This pattern – long periods of apparent failure followed by breakthroughs – is common in AI development. Traditional feature-based roadmaps would have killed the project after months of “failure,” missing the eventual breakthrough.
    Parquet Bloom Filters in DuckDB by Hannes Mühleisen
    • In the end we end up with a lot of additional reads to find and read the Bloom filter bytes, in principle requiring a careful trade-off between reading the filters and “just” reading the column brute-force.
    • During reading, DuckDB will automatically use constant-comparison filter predicates in the query (e.g., WHERE a = 42) to probe the Bloom filter (if present) and skip row groups where the Bloom filter can guarantee there are no matching rows in the group. Again, this happens transparently to users and there is no configuration that needs to be set.
  • Apr 7, 2025


    How engineers can use one-on-ones with their manager to accelerate career growth by Dalia Abuadas
    • One-on-one meetings with your manager are one of the most valuable tools you have for career growth, problem-solving, and unlocking new opportunities. So if you’re only using them to provide status updates, you’re leaving a lot on the table.
    • Here are a few ideas that stuck with me:

    Your manager isn’t a mind reader.
    You can’t expect guidance if you don’t come with a direction.
    Your growth is a shared effort, but it starts with you.

    Exploring Aging of Programmers: Fostering Inclusive and Age-friendly Workplaces
    • Gregory advised to make friends, and don’t stop making new friends all of your life. Try new things too - as life goes on there will be losses, and the only cure for loss is gain, so you have to give new hobbies, new foods, and new entertainment a chance. Some of them will work out wonderfully, she said.
    Query Engines: Gatekeepers of the Parquet File Format by Laurens Kuiper
    Streaming AI Agents: Why Kafka and Flink are the foundations of AI at scale
    Unlocking graph analytics in DuckDB with SQL/PGQ by DuckDB
  • Apr 6, 2025


    Amazon SES now offers attachments in sending APIs - AWS
    Zero config debugging with Deno and OpenTelemetry by Deno
    The Pragmatic Open Source Contributor
    • They don’t think it’s part of their job. Hopefully I’ve made a brief but decent case for why this is important, both for the wider community, and for your own growth. Familiarity and confidence in this process empowers you to blast through technical barriers, as you might no longer be “blocked” from achieving your goals due to some underlying third-party code not supporting XYZ.
  • Apr 3, 2025


    AI Agents: Less Capability, More Reliability, Please
    The Power of Asymmetric Experiments @ Meta - Analytics at Meta - Medium by Analytics at Meta
    • Asymmetric experiments make sense when (1) you don’t need to run many experiments in parallel, (2) recruiting people for the experiment is cheap, and (3) running the test intervention is expensive
    • Asymmetric experiments make sense when your need for concurrent experiment capacity is low*.
    • The test intervention is expensive. Asymmetric experiments have a smaller test group. This makes asymmetric experiments appealing when the cost of the test intervention is high. For example, if the test intervention has the possibility of negatively impacting the user experience, or would require an increase in compute costs.
    Troubleshooting: The Skill That Never Goes Obsolete
    • It’s easy to get lost in reactive problem whack-a-mole without stopping to think: what’s the real cause of this issue? What, exactly, is going on here?
    • Writers are fond of saying that “writing is thinking”. Here are two ways I use writing as a troubleshooting tool:

    Rubber duck debugging like a pro: I can often solve my problem by drafting a forum post without posting it. The effort required to articulate the salient details of the system and the problem, without looking dumb, is higher than the effort I have usually put in at the point I decide I need help. Corollary: making a forum post without sounding like I haven’t done my homework also tends to put me over my time/energy budget for solving a seemingly-trivial problem.

    Behold the trail of crumbs: I find that writing and diagramming, while helpful for many troubleshooting projects, are essential for multi-session troubleshooting projects. I overestimate how much I will remember about the context, as well as how soon I will get around to continuing the project. A troubleshooting notes file, no matter how obvious or incomplete the information in it seems at the time I write it, leaves a trail of crumbs that I can follow next time. (I have often repeated, verbatim, an entire troubleshooting process, found the problem — and then remembered I troubleshot the exact system, and arrived at the same conclusion, years ago; but there was some hiccup, and I failed to order or install the new part.)

  • Mar 30, 2025


    Build Bigger With Small Ai: Running Small Models Locally by MotherDuck
  • Mar 26, 2025


    Who needs GitHub Copilot when you can roll your own AI code assistant at home
  • Mar 24, 2025


    Amazon Nova expands Tool Choice options for Converse API - AWS
    • Auto leaves tool selection entirely to Nova’s discretion, whether to call a tool or generate text instead. Auto is useful in use cases like chatbots and assistants where you may need to ask the user for more information, and is the current default. Any prompts Nova to return at least one tool call, from the list of tools specified, while allowing it to choose which tool to use. Any is particularly useful in machine to machine interactions where your downstream components may not understand natural language but might be able to parse a schema representation. Tool enable developers to request a specific tool to be returned by Nova. Tool is particularly useful in forcing a structured output by having a tool that has the return type as your desired output schema.
  • Mar 23, 2025


    Reinventing notebooks as reusable Python programs by akshay, dylan, myles
    Highlights from Git 2.49 by Taylor Blau
    Preview: Amazon S3 Tables in DuckDB by Sam Ansmink, Tom Ebergen, Gabor Szarnyas
    In S3 simplicity is table stakes by Dr Werner Vogels - https://www.allthingsdistributed.com/
    • As the team started to look at scaling, they created a test account with an enormous number of buckets and started to test rendering times in the AWS Console — and in several places, rendering the list of S3 buckets could take tens of minutes to complete
    fleetwood.dev
    • Reading and writing from memory is extraordinarily slow when compared to computation
    • Using multiple different approaches, we’ve derived a (non-exhaustive) list of design decisions that should hold for any AI inference accelerator, namely:

    Hardware support for low precision data types Design for asynchronous transfers from day 1 Dedicated hardware for tensor aware memory transfers Replace your cache hierarchy with an outsized scratchpad for AI inference For a single accelerator, turn the memory bandwidth up to 11 Design for scale-out from day 1 Dedicated communication hardware should complement compute hardware

    Performance of the Python 3.14 tail-call interpreter by Nelson Elhage
  • Mar 22, 2025


    “Vibe Coding” vs Reality
    Life Altering Postgresql Patterns
  • Mar 19, 2025


    My Favorite Firefox Extensions
  • Mar 16, 2025


    21 Unique Reasons Why Apache Hudi Should Be Your Next Data Lakehouse | Apache Hudi
    Hard-Earned Lessons from a Year of Building AI Agents by Maya Murad
    • LLMs can be harnessed for higher complexity problem-solving. By combining clever engineering with advanced prompting techniques, models could go beyond what few-shot learning could achieve. Retrieval-Augmented Generation (RAG) was an early example of how models could interact with documents and retrieve factual information. LLMs can also can dynamically interact with a designated environment via tool calling. When combined with chain-of-thought prompting, these capabilities laid the foundation for what would later be known as AI agents.
    Resilience Best Practices: How Amazon Builds Well-Behaved Clients and Well-Protected Services
    • Three operational strategies are suggested: load shedding, auto-scaling, and fairness.
    • token bucket, leaky bucket, exponentially weighted moving average (EWMA), fixed window, or sliding window.
    • To avoid making the situation worse for a dependency that is under stress, AWS suggests two patterns for well-behaved clients: circuit breakers, preventing the sustained overload of a dependency, and retries, letting the client retry every request up to N times using exponential backoff with jitter between requests
    Prefix Aliases in SQL by Hannes Mühleisen
    Gems of DuckDB 1.2 by The DuckDB team
    • Starting with version 1.2.0, DuckDB supports OR and IN expressions for filter pushdown. This optimization comes especially handy when querying remote
  • Mar 15, 2025


    Graph Databases after 15 Years – Where Are They Headed? by LDBC Linked Data Benchmark Council
    In Praise of “Normal” Engineers
    A 10x Faster TypeScript - TypeScript by Anders Hejlsberg
    Introducing a SQL-based metrics layer powered by DuckDB by DuckDB
    Ibis, DuckDB, and GeoParquet: Making geospatial analytics fast, simple, and Pythonic by DuckDB
  • Mar 9, 2025


    GitHub - reloadware/reloadium: Hot Reloading and Profiling for Python by reloadware
    Four steps toward building an open source community by Klint Finley
    Is Apache Iceberg the New Hadoop? Navigating the Complexities of Modern Data Lakehouses by Ananth Packkildurai
    • However, many data ingestion tools don’t natively support compaction, requiring manual intervention or dedicated Spark clusters.
    AWS CDK Introduces Garbage Collection to Remove Outdated Assets
    Mastering Spark: The Art and Science of Table Compaction
    DuckDB goes distributed? DeepSeek’s smallpond takes on Big Data by mehdio
    • Let’s recap the features of smallpond :Lazy evaluation with DAG-based execution – Operations are deferred until explicitly triggered.Flexible partitioning strategies – Supports hash, column-based, and row-based partitioning.Ray-powered distribution – Each task runs in its own DuckDB instance for parallel execution.Multiple storage layer options – Benchmarks have primarily been conducted using 3FS.Cluster management trade-off – Requires maintaining a compute cluster, though fully managed services like Anyscale can mitigate this.Potential 3FS overhead – Self-managing a 3FS cluster introduce significant additional complexity.
    Announcing AWS Step Functions Workflow Studio for the VS Code IDE - AWS
    Definite: Understanding smallpond and 3FS: A Clear Guide
    • Smallpond’s distribution leverages Ray Core at the Python level, using partitions for scalability. Partitioning can be done manually, and Smallpond supports:

    Hash partitioning (based on column values) Even partitioning (by files or row counts) Random shuffle partitioning

  • Mar 4, 2025


    Data Products: A Case Against Medallion Architecture by Modern Data 101
    A non-beginner Data Engineering Roadmap — 2025 Edition by Ernani Castro
    Open Source Data Engineering Landscape 2025 - Apache DolphinScheduler - Medium by Apache DolphinScheduler
    • Further consolidation in the open table format spaceContinued evolution of zero-disk architectures in real-time and transactional systemsQuest toward providing a unified lakehouse experienceThe rise of LLMOps and AI EngineeringThe expansion of the data lakehouse ecosystem in areas such as open catalog integration and development of native librariesThe increasing traction of single-node data processing and embedded analytics
  • Mar 3, 2025


    macOS Tips & Tricks - saurabhs.org
    The Commitment Inventory: Get More Done By Saying “No”
    • Important: When the timer goes off, start the next item immediately, even if you haven’t finished. Otherwise, you’ll get distracted, and your attention will lag
    • I’m now working within a 50-minute “burst,” and I have 9:56 left before I have to move on to a different checklist within Pitches, so I am cranking out this draft much faster and with more focus than I otherwise would.If your resistance is high, start with a short burst, then add five as you go up.  Forster recommends keeping it under 40, but when I’m writing, I use 50-minute bursts. Adapt to fit your needs. When you lose momentum, go back to 5 minutes. You can use bursts for your breaks, too. In 8 minutes, when I finish this burst, I will get a cup of coffee, stretch my legs, pet my dog Duke, and maybe read a quick news article.  Forster recommends being strict with the timer, or your attention will drift.
    If it is worth keeping, save it in Markdown - Piotr Migdał by Piotr Migdał
    Does Ibis understand SQL? – Ibis by Deepyaman Datta
  • Feb 24, 2025


    Concurrency Control in Open Data Lakehouse | Apache Hudi
    • cases where multiple writer jobs need to access the same table, Hudi supports multi-writer setups. This model allows disparate processes, such as multiple ingestion writers or a mix of ingestion and separate table service jobs to write concurrently.
    • Apache Iceberg supports multiple concurrent writes through Optimistic Concurrency Control (OCC). The most important part to note here is that Iceberg needs a catalog component to adhere to the ACID guarantees. Each writer assumes it is the only one making changes, generating new table metadata for its operation. When a writer completes its updates, it attempts to commit the changes by performing an atomic swap of the latest metadata.json file in the catalog, replacing the existing metadata file with the new one.
  • Feb 23, 2025


    Towards composable data platforms — Jack Vanlightly by Jack Vanlightly
    • The OTFs introduce a new abstraction layer that can be used to virtualize table storage. The key is that it allows for the separation of data from metadata and shared storage from compute. Through metadata, one table can appear in two data platforms, without data copying. To avoid overloading data virtualization anymore, I will use the term Table Virtualization.
    • The Modern Data Stack (MDS) failed to sustain itself. Few wanted to compose a data architecture from 10-15 different vendors. People want to choose a small number of trusted vendors, and they want them to work together without a lot of toil and headaches.
    Common pitfalls when building generative AI applications by Chip Huyen
    • Use generative AI when you don’t need generative AI
    • Start too complex

    Examples of this pitfall:

    Use an agentic framework when direct API calls work. Agonize over what vector database to use when a simple term-based retrieval solution (that doesn’t require a vectordb) works. Insist on finetuning when prompting works. Use semantic caching.

    • Forgo human evaluation

    To automatically evaluate AI applications, many teams opt for the AI-as-a-judge (also called LLM-as-a-judge) approach — using AI models to evaluate AI outputs. A common pitfall is forgoing human evaluation to rely entirely on AI judges.

  • Feb 22, 2025


    Redefining Data Engineering with Go and Apache Arrow by Thomas F McGeehan V
  • Feb 16, 2025


    Emerging Patterns in Building GenAI Products
    Mixture of Experts Explained
  • Feb 15, 2025


    Microsoft Introduces CoRAG: Enhancing AI Retrieval with Iterative Reasoning
    The Quest to Understand Metric Movements - Pinterest Engineering Blog - Medium by Pinterest Engineering
    • root-cause analysis (RCA)
    • Slice and DiceThis approach finds clues for a metric movement by drilling down on specific segments within the metric; it has found successes at Pinterest, especially in diagnosing video metric regressions
    • General SimilarityIn this approach, we look for clues of why a metric movement happened by scanning through other metrics and finding ones that have moved very “similarly” in the same time period, whether in the same direction (positive association) or in the opposite direction (negative association).
    • practice, we have found that the first two factors, Pearson and Spearman’s rank correlations, work best because:p-values can be computed for both, which help to gauge statistical significanceboth have more natural support for measuring negative associations between two time-seriesnon-monotonic (e.g. quadratic) relationships, for which Pearson and Spearman’s rank correlations won’t apply, don’t tend to arise naturally so far in our use-cases / time window of analysis
    • Experiment EffectsThis third approach looks for clues of why metric movements happened by looking at what a lot of internet companies have: experiments.
    • For each control and treatment group in an experiment, we perform a Welch’s t-test on the treatment effect, which is robust in the sense that it supports unequal variances between control and treatment groups. To further combat noise in the results, we filter experiments by each experiment’s harmonic mean p-value of its treatment effects over each day in the given time period, which helps limit false positive rates. We also detect imbalances in control and treatment group sizes (i.e., when they are being ramped up at a different rate from each other) and filter out cases when that happens.
    You Should Use /tmp/ More
  • Feb 14, 2025


    How to add a directory to your PATH by Julia Evans
    • Bash has three possible config files: ~/.bashrc, ~/.bash_profile, and ~/.profile.
    python-build-standalone now has Python 3.14.0a5 by Simon Willison
    How to Use a Microphone
  • Feb 9, 2025


    Scale Out Batch Inference with Ray
    Systematically Improving RAG Applications - Jason Liu by Jason Liu
    • 4 Re-Rankers¶ Instead of (or in addition to) fine-tuning a bi-encoder (embedding model), you might fine-tune a cross-encoder or re-ranker that scores each candidate chunk directly. Re-rankers can be slower but often yield higher precision. Typically, you do a quick vector search, then run re-ranking on the top K results.
    Emerging Patterns in Building GenAI Products
    • Self evaluation: Self-evaluation lets LLMs self-assess and enhance their own responses. Although some LLMs can do this better than others, there is a critical risk with this approach. If the model’s internal self-assessment process is flawed, it may produce outputs that appear more confident or refined than they truly are, leading to reinforcement of errors or biases in subsequent evaluations. While self-evaluation exists as a technique, we strongly recommend exploring other strategies.
    • LLM as a judge: The output of the LLM is evaluated by scoring it with another model, which can either be a more capable LLM or a specialized Small Language Model (SLM). While this approach involves evaluating with an LLM, using a different LLM helps address some of the issues of self-evaluation. Since the likelihood of both models sharing the same errors or biases is low, this technique has become a popular choice for automating the evaluation process
    • Human evaluation: Vibe checking is a technique to evaluate if the LLM responses match the desired tone, style, and intent. It is an informal way to assess if the model “gets it” and responds in a way that feels right for the situation. In this technique, humans manually write prompts and evaluate the responses. While challenging to scale, it’s the most effective method for checking qualitative elements that automated methods typically miss.
    • However, embeddings are not ideal for structured or relational data, where exact matching or traditional database queries are more appropriate. Tasks such as finding exact matches, performing numerical comparisons, or querying relationships are better suited for SQL and traditional databases than embeddings and vector stores.
    Why Most Machine Learning Projects Fail to Reach Production and How to Beat the Odds
  • Feb 8, 2025


    Jujutsu VCS Introduction and Patterns by Kuba Martin
    • While in Git you generally organize your commits in branches, and a commit that’s not part of a branch is scarily called a “detached HEAD”, in jj it’s completely normal to work on changes that are not on branches. jj log is the main command to view the history and tree of changes, and will default to showing you a very reasonable set of changes that should be relevant to you right now - that is (more or less) any local mutable changes, as well as some additional changes for context (like the tip of your main branch).
  • Feb 7, 2025


    The Rise of Single-Node Processing: Challenging the Distributed-First Mindset by Alireza Sadeghi
    The End of the Bronze Age: Rethinking the Medallion Architecture by
    Bias and Fairness in Natural Language Processing - Thomson Reuters Labs - Medium by Navid Rekabsaz
    The Emerging Role of AI Data Engineers - The New Strategic Role for AI-Driven Success by Ananth Packkildurai
    Google Releases PaliGemma 2 Vision-Language Model Family by
    Catching memory leaks with your test suite by Itamar Turner-Trauring
  • Jan 27, 2025


    How the Apache Arrow Format Accelerates Query Result Transfer by Ian Cook, David Li, Matt Topol
    ZenML VS Flyte VS Metaflow - MLOps Community by Ankur Tyagi
  • Jan 25, 2025


    JavaScript Temporal is coming | MDN Blog by
    Amazon (S3) Tables by Daniel Beach
    Rill | Designing a Declarative Data Stack: From Theory to Practice by
    The Hidden Cost of Over-Abstraction in Data Teams by Zakaria Hajji
    Staff Engineer vs Engineering Manager by Alex Ewerlöf
    PromptWizard: The future of prompt optimization through feedback-driven self-evolving prompts by Brenda Potts
  • Jan 18, 2025


    Lessons Learned Implementing Metric Trees by Ergest Xheblati
  • Jan 12, 2025


    Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques | Amazon Web Services by
    Apache Iceberg Won the Future — What’s Next for 2025? by Yingjun Wu
    Building effective agents by
    How I run LLMs locally by
    Next.js 15.1 by
  • Jan 11, 2025


    Hugging Face Smolagents is a Simple Library to Build LLM-Powered Agents by
  • Jan 9, 2025


    A Pixel Parable by Facundo Olano
  • Jan 8, 2025


    Hitting OKRs vs Doing Your Job by jessitron
  • Jan 7, 2025


    Goodhart’s Law Isn’t as Useful as You Might Think by
  • Jan 6, 2025


    Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward — Jack Vanlightly by Jack Vanlightly
  • Jan 5, 2025


    How AI is unlocking ancient texts — and could rewrite history by Marchant, Jo
    Glue work considered harmful by
    Goodbye Github Pages, Hello Coolify · Quakkels.com by
    Tools Worth Changing To in 2025 by Matthew Sanabria
    Databases in 2024: A Year in Review by
    Dismantling ELT: The Case for Graphs, Not Silos — Jack Vanlightly by Jack Vanlightly
  • Jan 4, 2025


    Things we learned about LLMs in 2024 by
  • Jan 3, 2025


    Change query support in Apache Iceberg v2 — Jack Vanlightly by Jack Vanlightly
  • Jan 2, 2025


    Collection of insane and fun facts about SQLite - blag by
  • Jan 1, 2025


    Scalar and binary quantization for pgvector vector search and storage by Jonathan Katz
    Turbocharge Efficiency & Slash Costs: Mastering Spark & Iceberg Joins with Storage Partitioned Join by Samy Gharsouli
    Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality | Amazon Web Services by
    Designing data products by
    On writing and getting from zero to done — Jack Vanlightly by Jack Vanlightly
    Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena | Amazon Web Services by
    React v19 – React by
    Tech predictions for 2025 and beyond by Dr Werner Vogels - https://www.allthingsdistributed.com/
    First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin) by
    Use open table format libraries on AWS Glue 5.0 for Apache Spark | Amazon Web Services by
    Migrating AWS Glue for Spark jobs to AWS Glue version 5.0 - AWS Glue by
  • Dec 31, 2024


    Getting to Two Million Users as a One Woman Dev Team by
  • Dec 28, 2024


    Why AI language models choke on too much text by Timothy B. Lee
    AI-generated tools can make programming more fun by
  • Dec 22, 2024


    The 150x pgvector speedup: a year-in-review by Jonathan Katz
    How to generate unit tests with GitHub Copilot: Tips and examples by Greg Larkin
    Introducing AWS Glue 5.0 for Apache Spark | Amazon Web Services by
  • Dec 21, 2024


    Building Confidence: A Case Study in How to Create Confidence Scores for GenAI Applications - Spotify Engineering by alexandrawei
    Storing times for human events by
    DataFrames at Scale Comparison: TPC-H by
    Enabling compaction optimizer - AWS Glue by
    Top Python Web Development Frameworks in 2025 · Reflex Blog by
    Building Python tools with a one-shot prompt using uv run and Claude Projects by
  • Dec 18, 2024


    How to Speed Up Spark Jobs on Small Test Datasets by luminousmen
  • Dec 15, 2024


    A high-velocity style of software development by
    Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt | Amazon Web Services by
    A First Look at S3 (Iceberg) Tables by Nikhil Benesch
  • Dec 11, 2024


    Amazon Bedrock Knowledge Bases now supports RAG evaluation (Preview) - AWS by
  • Dec 8, 2024


    Amazon Aurora now available as a quick create vector store in Amazon Bedrock Knowledge Bases - AWS by
    Why it took a long time to build that tiny link preview on Wikipedia by Jon Robson (David Lyall)
  • Dec 5, 2024


    AWS Glue Data catalog now automates generating statistics for new tables - AWS by
    Introducing AWS Glue 5.0 - AWS by
  • Dec 3, 2024


    Table format comparisons - Change queries and CDC — Jack Vanlightly by Jack Vanlightly
  • Nov 28, 2024


    Amazon S3 adds new functionality for conditional writes - AWS by
    AWS Amplify introduces passwordless authentication with Amazon Cognito - AWS by
  • Nov 26, 2024


    The Part of PostgreSQL We Hate the Most // Blog // Andy Pavlo - Carnegie Mellon University

    What goes into bronze, silver, and gold layers of a medallion data architecture? | by Lak Lakshmanan | Sep, 2024 | Medium by Lak Lakshmanan

    Using DuckDB-WASM for in-browser Data Engineering by Tobias Müller

    Go talk to the LLM

    The CDC MERGE Pattern. by Ryan Blue | by Tabular | Medium by Tabular

    FireDucks : Pandas but 100x faster
    #!/usr/bin/env -S uv run by
    Amazon Data Firehose supports continuous replication of database changes to Apache Iceberg Tables in Amazon S3 - AWS by
  • Nov 25, 2024


    Use data that looks like data - by Thorsten Ball by Thorsten Ball

    Unit Tests As Documentation - by Teiva Harsanyi by Teiva Harsanyi

  • Nov 14, 2024


    Anthropic’s upgraded Claude 3.5 Sonnet model and computer use now in Amazon Bedrock - AWS
    • It will be fun to see how the computer use api will evolve
    It’s Not Easy Being Green: On the Energy Efficiency of Programming Languages

  • Nov 8, 2024


    Embeddings are underrated

  • Oct 28, 2024


    Technology Radar | Guide to technology landscape | Thoughtworks

  • Oct 23, 2024


    AWS announces a seamless link experience for the AWS Console Mobile App - AWS

    init.py files are optional. Here’s why you should still use them | Arie Bovenberg

  • Oct 9, 2024


    Talks - Reuven M. Lerner: Times and dates in Pandas by PyCon US

    Talks - Bruce Eckel: Functional Error Handling by PyCon US

    What’s New In Python 3.13 — Python 3.13.0 documentation

  • Oct 7, 2024


    Visual Studio Code September 2024 by Microsoft

  • Oct 4, 2024


    NotebookLM | Note Taking & Research Assistant Powered by AI
    • Very interesting experiment from google which provides a summary of any article, video or file. Additionaly can generate a podcast talking about it!
  • Oct 1, 2024


    Talks - Juliana Ferreira Alves: Improve Your ML Projects: Embrace Reproducibility and Production… by PyCon US
    • These truly looks like a good improvement to an ml project 👀
  • Sep 30, 2024


    Talks - Krishi Sharma: Trust Fall: Three Hidden Gems in MLFlow by PyCon US
    Table format comparisons - Streaming ingest of row-level operations — Jack Vanlightly by Jack Vanlightly
    Embeddings · Malaikannan
  • Sep 29, 2024


    Copy-on-Write (CoW) — pandas 2.2.3 documentation
    • Ptty big change to pandas 3.0 but one I think will bring a lot of clarity to data transformations
    Announcing DuckDB 1.1.0 – DuckDB by The DuckDB team

    Introducing Contextual Retrieval \ Anthropic

    What is a Vector Index? An Introduction to Vector Indexing by Alejandro CantareroField CTO of AI, DataStax

    Chunking · Malaikannan

    Chronon - Airbnb’s End-to-End Feature Platform - InfoQ by Nikhil Simha
    • This is a very high level overview of the feature platform but allowed me to get a better sense on why use this platform. But I’d love to test it to understand if for those without streaming I’m this isn’t a bit overkill
  • Sep 28, 2024


    Deno 2.0 Release Candidate
    Time spent programming is often time well spent - Stan Bright
    What I tell people new to on-call | nicole@web
    What is io_uring?
    Memory Management in DuckDB – DuckDB by Mark Raasveldt
    Yuno | How Apache Hudi transformed Yuno’s data lake
    Google Proposes Adding Pipe Syntax to SQL - InfoQ by Renato Losio
  • Sep 26, 2024


    PostgreSQL: PostgreSQL 17 Released!
    The sorry state of Java deserialization @ marginalia.nu
    • Interesting to see that duckdb is serving as the benchmark to process data in the order of GB’s
    Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
  • Sep 24, 2024


    @daily_cache implementation in Python • Max Halford
    Good programmers worry about data structures and their relationships by Engineer’s Codex
  • Sep 23, 2024


    I Like Makefiles by Sebastian Witowski
    • I tend to avoid make files because I hadn’t looked at how they worked and always associated either Java projects. I might want to take another look at them
    Rethinking Analyst Roles in the Age of Generative AI by Ben Lorica 罗瑞卡
    No, Data Engineers Don’t NEED dbt. | by Leo Godin | Jul, 2024 | Data Engineer Things by Leo Godin
  • Sep 20, 2024


    Gotten my reading from +100 articles to 58! 🎉

    Predicting the Future of Distributed Systems by Colin Breck
    Continuous reinvention: A brief history of block storage at AWS | All Things Distributed by Werner Vogels
    Iceberg vs Hudi — Benchmarking TableFormats | by Mudit Sharma | Aug, 2024 | Flipkart Tech Blog by Mudit Sharma
    Splicing Duck and Elephant DNA by Jordan Tigani, Brett Griffin
    How we sped up Notion in the browser with WASM SQLite by Carlo Francisco
    Table format comparisons - How do the table formats represent the canonical set of files? — Jack Vanlightly by Jack Vanlightly
  • Sep 14, 2024


    Should you be migrating to an Iceberg Lakehouse? | Hightouch by Hugo Lu
    How data departments have evolved and spread across English football clubs - The Athletic by Mark Carey
    How to use AI coding tools to learn a new programming language - The GitHub Blog by Sara Verdi
    How to choose the best rendering strategy for your app – Vercel by Alice Alexandra MooreSr. Content Engineer, Vercel
    Amazon’s Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2 | AWS Open Source Blog
  • Sep 14, 2024


    Don’t Use JS for That: Moving Features to CSS and HTML by Kilian Valkhof by JSConf
    AWS Chatbot now allows you to interact with Amazon Bedrock agents from Microsoft Teams and Slack - AWS
    Astro 5.0 Beta Release | Astro by Erika
    Making progress on side projects with content-driven development | nicole@web
    AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables - AWS
  • Sep 13, 2024


    Introducing OpenAI o1 | OpenAI
    Rewrite Bigdata in Rust by Xuanwo
    Recommending for Long-Term Member Satisfaction at Netflix | by Netflix Technology Blog | Aug, 2024 | Netflix TechBlog by Netflix Technology Blog
    • Til: reward engineering. Measure proxy features and optimize for them to get to your actual end goal
    We need to talk about ENUMs | boringSQL by Radim Marek
    I spent 8 hours learning Parquet. Here’s what I discovered | by Vu Trinh | Aug, 2024 | Data Engineer Things by Vu Trinh
  • Sep 12, 2024


    What I Gave Up To Become An Engineering Manager by Suresh Choudhary
    Unpacking the Buzz around ClickHouse - by Chris Riccomini by Chris Riccomini
    Introducing job queuing to scale your AWS Glue workloads | AWS Big Data Blog
    ”SRE” doesn’t seem to mean anything useful any more
    Microsoft Launches Open-Source Phi-3.5 Models for Advanced AI Development - InfoQ by Robert Krzaczyński
  • Sep 3, 2024


    Production-ready Docker Containers with uv by Hynek Schlawack
    Many of us can save a child’s life, if we rely on the best data - Our World in Data by By: Max Roser
    • Another article that I find eye opening when we look at data to improve our decisions
    Why I Still Use Python Virtual Environments in Docker by Hynek Schlawack
    • Suddenly this way of working, very similar to node modules, is being talked everywhere.
    Python Developers Survey 2023 Results
    • Was looking for this for a long time. Got to learn a bit more on twine, mlflow and sqlmodel. In terms of getting on track with python world I think I’m good with my current rss feed and podcast
    Elasticsearch is Open Source, Again | Elastic Blog by ByShay Banon29 August 2024
    Monitor data quality in your data lake using PyDeequ and AWS Glue | AWS Big Data Blog
    • Data drift detection 👀
    • Makes sense. Pydeequ is just a wrapper
    How top data teams are structured by Mikkel Dengsøe
    • As I work on a team that doesn’t have this kind of distribution I need to reflect a bit on the impact of being the sole data engineer on an ML team
    Talks - Amitosh Swain: Testing Data Pipelines by PyCon US
  • Aug 26, 2024


    uv: Unified Python packaging

    Wow, need to test this out on my side projects.

    CSS finally adds vertical centering in 2024 | Blog | build-your-own.org by James Smith
    Timeless Skills For Data Engineers And Analysts by SeattleDataGuy

    Skimmed the article but although high level I find the main points very true. Understanding the system, being on top of the state of the art. That and the tips for head of data

    CSS 4, 5, and 6! With Google’s Una and Adam by Syntax

    Great episode! Nice to see that css is getting better and better

  • Aug 23, 2024


    I’ve Built My First Successful Side Project, and I Hate It by Sebastian Witowski

    Wow, I’d love this one man saas but knowing how it can burn you out…

    NodeJS Evolves by Syntax

    Nice episode on the long sought features for node that have been introduced on bun and demo (single file, typescript support and top await async are the big ones for me)

    Talks - Brandt Bucher: Building a JIT compiler for CPython by PyCon US
    Python Insider: Python 3.13.0 release candidate 1 released

    Great to see a new version coming along! Is pdb worth using with vs code? 🤔

    Google kills Chromecast, replaces it with Apple TV and Roku Ultra competitor | Ars Technica by Samuel Axon

    As a google to owner would be good to know that my working hardware wouldn’t brick just because I don’t want to upgrade

  • Aug 6, 2024


    Can Reading Make You Happier? | The New Yorker by Ceridwen Dovey
    Beyond Hypermodern: Python is easy now - Chris Arderne

    Although I had my eyes already on rye I certainly didn’t know it was so full feature. Gotta try and change some of my projects to it ,

    Visual Studio Code July 2024 by Microsoft

    Python support slowly getting good

    Introducing GitHub Models: A new generation of AI engineers building on GitHub - The GitHub Blog by Thomas Dohmke
    Creativity Fundamentally Comes From Memorization
    Gen AI Increases Workloads and Decreases Productivity Upwork Study Finds - InfoQ by Sergio De Simone

    The report raises a good point about an increase in workload. With a big productivity boost bosses might start loading even bigger workloads than what we gained from the productivity boost

    Ofc this isn’t good as people will feel overloaded

    tea-tasting: a Python package for the statistical analysis of A/B tests | Evgeny Ivanov by Evgeny Ivanov

    Super interesting IMO. Might give it a try on my team when we start deploying solutions to our clients

    Data Science Spotlight: Cracking the SQL Interview at Instacart (LLM Edition) | by Monta Shen | Jul 2024 | tech-at-instacart by Monta Shen

    Would be interesting to test this out. Have an example of a dataset that can be queried using duckdb. Given a question understand if a query is correct how to fix it and improve it’s performance. One in Sal and another in pyspark (or ibis/pandas)

  • Jul 31, 2024


    Are you a workaholic? Here’s how to spot the signs | Ars Technica by Chris Woolston, Knowable Magazine
    At the Olympics, AI is watching you | Ars Technica by Morgan Meaker, WIRED.comCrazy to see that now with ai it’s actually possible to survey an entire city either cameras
    Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect by Databricks

    English sdk looks awesome but requires an OpenAI key. Could it be replaced with ollama?

  • Jul 26, 2024


    Let’s Consign CAP to the Cabinet of Curiosities - Marc’s Blog by Marc Brooker

    Interesting topics to research a bit more on CAP alternatives

    Why Your Generative AI Projects Are Failing by Ben Lorica 罗瑞卡

    In summary, to have a good AI product we need to have data with quality which requires good data governance. With this data we need to define useful products that we can measure it’s value using data driven metrics. We must also ensure the product has good practices avoiding security or bias issues

    Engage your audience by getting to the point, using story structure, and forcing specificity – Ian Daniel Stewart

    Resumed by:

    storyline talk

    Slack Conquers Deployment Fears with Z-score Monitoring - InfoQ by Matt Saunders

    This is something I would love to implement. Allowing to define the metrics on which to evaluate a new feature, the expected hypothesis and revert the feature (i.e feature flags) automatically with a report on the experiment

    DuckDB + dbt : Accelerating the developer experience with local power - YouTube

    Could I replace Athena with this? I think the main blocker for me is I want to work with S3. And need to check how it runs for a really large dataset…

    How Unit Tests Really Help Preventing Bugs | Amazing CTO

    Good tip. For any project define the metric of code coverage goals and start increasing on the project

    Mocking is an Anti-Pattern | Amazing CTO
    Building an open data pipeline in 2024 - by Dan Goldin by Dan Goldin
  • Jul 25, 2024


    Spark-Connect: I’m starting to love it! | Sem Sinchenko by Sem Sinchenko

    This article wasn’t properly parsed by omnivore but the big takeaways:

    • We can add plugins for extended functionality to our spark server
    • Using spark connect we can implement any library in any language we want and send grpc requests to the server (spark connect server needs to be running)
    • spark connect works on >3.5. Should be much better on v4.0
    • If I truly want to be good in spark I eventually need to relearn scala/java
    • Glue is cool but it’s still on version 3.3. All these goodies will take too long to be implemented in glue
    So you got a null result. Will anyone publish it? by Kozlov, Max

    After reading the statistics books I can see much more clear the value of proving a null hypophesis, this is the feeling I am getting out of the academia. We are seeing more research without any any added value. Goodhart’s law.

    Maestro: Netflix’s Workflow Orchestrator | by Netflix Technology Blog | Jul, 2024 | Netflix TechBlog by Netflix Technology Blog

    Sounds just like an airflow contender. with the plus of being able to run notebooks 🤔

    How to Create CI/CD Pipelines for dbt Core | by Paul Fry | Medium by Paul Fry
    Simplify PySpark testing with DataFrame equality functions | Databricks Blog by Haejoon Lee, Allison Wang and Amanda Liu

    Good theme for a blog post on the changes of spark 4, this is really useful for human errors (been there multiple times)

    How to build a Semantic Layer in pieces: step-by-step for busy analytics engineers | dbt Developer Blog by Gwen Windflower

    So this will be generated on the fly as views by the semantic layer?, This looked neat until the moment I understood that the semantic layer requires dbt cloud

    Meta’s approach to machine learning prediction robustness #### Engineering at Meta by Yijia Liu, Fei Tian, Yi Meng, Habiya Beg, Kun Jiang, Ritu Singh, Wenlin Chen, Peng Sun

    Meta seems to be a couple of years ahead of the industry. The article doesn’t provide a lot of insights but gives a feeling of their model evaluation being mostly automated + having a good AI debug tool

    Free-threaded CPython is ready to experiment with! | Labs

    First step in a long way before we can get run python wihtout GIL. Interested on seeing if libraries like pandas will be able to leverage multithreading eventually with this

    Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview | AWS Big Data Blog

    Mixed feelings here. Great to see Open Lineage implemented at AWS. However it feels again that AWS just created the integration and won’t be driving the development of open lineage

    Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack - Slack Engineering by Nilanjana Mukherjee

    What could be improved to help this kind of migrations be done in a matter of days?, Livy might be deprecated in favor of spark connect. With their migration to Spark 3 and eventually 3.5 (not clear on this article) they could be interested in moving new jobs to connect , Basically solved issues by using the old behaviours. These will need to be migrated eventually. Would need to better understand these features , This looks like an important detail. With no explicit order spark can have random order of rows?, Cool to see these migrations and teams using open source solutions. EMR although expensive with a good engineering team can prove to be quite cost effective

    DuckDB Community Extensions – DuckDB by The DuckDB team
    Visual Studio Code June 2024 by Microsoft
    The Rise of the Data Platform Engineer - by Pedram Navid by Pedram Navid

    The need to define a data platform is something I see everywhere. It really looks like we are missing a piece here. Netflix maestro for example seems like a good contender for to solve the issue of (yet another data custom platform)

    Lambda, Kappa, Delta Architectures for Data | The Pipeline by Venkatesh Subramanian
    Write-Audit-Publish (WAP) Pattern - by Julien Hurault by Julien Hurault

    This articles brings me the question. Can we improve dbt by using WAP? How does the rollback mechanism work when a process fails?

    AWS Batch Introduces Multi-Container Jobs for Large-Scale Simulations - InfoQ by Renato Losio
    Data Council 2024: The future data stack is composable, and other hot takes | by Chase Roberts | Vertex Ventures US | Apr, 2024 | Medium by Chase Roberts
    Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions | AWS Big Data Blog

    Super interesting to see how we can enable data quality visibility

    Pyspark 2023: New Features and Performance Improvement | Databricks Blog by in Industries
    Cost Optimization Strategies for scalable Data Lakehouse by Suresh Hasundi

    Good to use open data lakes showing the big cost and speed improvements

    Prompt injection and jailbreaking are not the same thing
    Flink SQL and the Joy of JARs by ByRobin MoffattShare this post
    Catalogs in Flink SQL—Hands On by ByRobin MoffattShare this post
    Catalogs in Flink SQL—A Primer by ByRobin MoffattShare this post

    Why is this so hard? 😭

Follow me on Mastodon Go to GitHub repo Go to my RSS feed