Reading Feed

Articles I've read with my notes and highlights

Context Anchoring
Breaking the Microbatch Barrier: The Architecture of Apache Spark Real-Time Mode
  • Microbatch mode processes batches of data called epochs. Epoch boundaries are decided upfront using start and end offsets. Real-time mode instead processes longer-duration epochs but modifies how data flows within each epoch.
  • We essentially evolved the micro-batch in Structured Streaming into a checkpoint interval.
Still Missing Critical Pieces by Julien Simon
Building an MCP Ecosystem at Pinterest by Pinterest Engineering
  • Contrast with the MCP OAuth Standard: The MCP specification defines an OAuth 2.0 authorization flow where users explicitly authenticate with each MCP server, typically involving consent screens and per-server token management. Our approach is different: users already authenticate against our internal auth stack when they open a surface like the AI chat interface, so we piggyback on that existing session. There is no additional login prompt or consent dialog when a user invokes an MCP tool.
Cognitive Helmets for the AI Bicycle Part 2: The Sometimes-Wrong Bot by Cat Hicks
  • developers in my interviews have pointed out that in their first months using Claude Code in a more “raw” way, counting on themselves to manage and monitor every output for sustained hours, they have felt a creeping sense of fatigue. One person called this “over-monitoring,” and multiple people have used the metaphor of “becoming a manager.”
  • If our goal is to help our junior colleagues integrate into organizational goals to use this tooling, we also need to listen to them about their challenges and friction points, and believe in their potential for learning.
Cognitive Helmets for the AI Bicycle: Part 1 by Cat Hicks
  • Avoid the temptation to spin up so many parallel tasks that you are in constant “cram.”
  • Another metacognitive strategy is something with the unglamorous name of pretesting. But it’s actually a fascinating window of insight into that “functional architecture” of our problem-solving minds. Simply put, if we prompt ourselves to try to generate an answer for something we don’t know before we go try to learn it, we learn better.
Your Data Agents Need Context
  • What agentic coding tools such as Claude Code are doing is making data engineers vastly more productive.
Beyond Hypermodern: Python is easy now
  • Or do it dynamically:

    [project]
    name = "postmodern"
    dynamic = ["version"]
    …

    [tool.hatch.version]
    source = "vcs"

Rethinking open source mentorship in the AI era by Abigail Cabunoc Mayes
  • C | Implementation: Comprehension — require an issue before a pull request; host an in-person code sprint for live discussions. Context — add an AI disclosure or AGENTS.md. Continuity — watch who comes back.
  • AI tools are here to stay. The question is whether we adapt our practices to maintain what makes open source work: human relationships, knowledge transfer, and the multiplier effect.
OpenClaw and the Dream of Free Labour by The Daemon
Variant Type in Apache Parquet for Semi-Structured Data
  • Variant type—a feature that brings native support for semi-structured data to Parquet, significantly improving efficiency compared to formats such as JSON.
  • Traditional approaches that store JSON as text strings require full parsing to access any field, making queries slow and resource-intensive. Variant solves this by storing data in a structured binary format that enables direct field access through offset-based navigation. Query engines can jump directly to nested fields without deserializing the entire document, dramatically improving performance.
  • Binary encodings like BSON improve upon plain JSON by storing data in binary format, but they still redundantly store field names like “timestamp”, “user”, and “event” in every row, wasting storage space
  • Variant data can be shredded by extracting frequently accessed fields into separate, strongly-typed columns
  • If the field matches the expected schema, its value is written to the strongly typed field. If the field does not match, the original representation is written as a Variant-encoded binary field and the corresponding strongly typed field is left NULL.
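The shredding rule in the last bullet can be sketched in Python. This is a simplified stand-in: real Variant shredding operates on Parquet’s binary encoding, and the `shred` function, the field names, and the use of JSON text in place of the Variant binary format are all illustrative assumptions.

```python
import json

def shred(rows, field, expected_type):
    """Sketch of Parquet Variant shredding for one field: values matching
    the expected type go into a strongly typed column; mismatches keep
    their original representation (JSON here stands in for the Variant
    binary encoding) and the typed column gets NULL."""
    typed_col, variant_col = [], []
    for row in rows:
        value = row.get(field)
        if isinstance(value, expected_type):
            typed_col.append(value)          # matches schema: typed column
            variant_col.append(None)
        else:
            typed_col.append(None)           # mismatch: keep as Variant
            variant_col.append(json.dumps(value))
    return typed_col, variant_col

rows = [{"ts": 1700000000}, {"ts": "2024-01-01"}, {"ts": 1700000060}]
typed, variant = shred(rows, "ts", int)
# typed   -> [1700000000, None, 1700000060]
# variant -> [None, '"2024-01-01"', None]
```

Queries that filter on `ts` as an integer can then scan the typed column directly and fall back to the Variant column only for the rows where it is NULL.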
The Art of Learning in the AI Age by Jose Blanca
  • Exercises are opportunities to practice. It is through this practice that you develop your problem-solving skills. You will be tempted to let the AI write the code for you, but if you want to grow, you must resist that urge. If your objective is learning, do not use AI to write code you don’t understand—unless you intend to study that code until you do.
  • You won’t learn German or Chinese just by reading a grammar book or a dictionary. Similarly, you won’t become a good programmer just by reading about syntax. At the start of your journey…
The Reviewer Isn’t the Bottleneck by Rishi Baldawa
  • Whether you can systematically extract what a good reviewer knows and run it at CI speed, I genuinely don’t know. Every check you write is one less thing a human has to catch. But reviewers don’t just catch bugs.
  • They catch drift, intent mismatches, architectural decisions that look fine locally and cause problems three services away.
ETL is Dead by Ananth Packkildurai
The Ivy Lee Method: Focus Better with This 100-Year-Old Strategy
Porto (OPO) Airport Delays
Making Retrospectives Effective with Small Concrete Actions and Rotating Facilitators by Ben Linders
  • He also encouraged rotating retrospective facilitators. It is challenging to fairly represent one’s own ideas and opinions while facilitating the retrospective, as every person brings their own unique perspective, Žabkar Nordberg said. Having different people who facilitate retrospectives helps build ownership and engagement
Using Git with coding agents by Simon Willison
  • Git has a mechanism called the reflog which can often capture details of code that hasn’t been committed to a permanent branch. Agents can search that, and search other branches too.
  • When you run a bisect operation you provide Git with some kind of test condition and a start and ending commit range. Git then runs a binary search to identify the earliest commit for which your test condition fails.
Announcing DuckDB 1.5.0 by The DuckDB team
  • DuckDB now natively supports the VARIANT type, inspired by Snowflake’s semi-structured VARIANT data type and available in Parquet since 2025. Unlike the JSON type, which is physically stored as text, VARIANT stores typed, binary data. Each row in a VARIANT column is self-contained with its own type information. This leads to better compression and query performance.
  • DuckDB also supports reading VARIANT types from Parquet files, including shredding (storing nested data as flat values).
I don’t know if my job will still exist in ten years