By Juan Fernandez (data engineer, Amarji Ltd)
At Amarji, we helped a client slash their Microsoft Fabric daily refresh time from around 80 minutes to just 15 — a reduction of roughly 81%.
Instead of throwing compute at the problem, we focused on removing overhead, optimising data flows, and using the right tools for the right job.
1. Full-Load Tables
Problem:
Some wide tables were being fully reloaded each day, even though several large columns weren’t used by analytics. These unnecessary columns increased I/O and slowed extraction.
Change:
We excluded the unused columns directly at the data source.
Result:
Refresh time dropped from 27 minutes to 7 — a 74% improvement.
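The idea of pushing column pruning all the way to the source can be sketched as follows. This is a minimal illustration, not our client's actual code; the table and column names are hypothetical.

```python
# Sketch: name only the analytics columns in the extraction query, so
# unused wide columns never leave the source. Names are hypothetical.

NEEDED_COLUMNS = ["order_id", "customer_id", "order_date", "amount"]

def build_extract_query(table: str, columns: list[str]) -> str:
    """Build a SELECT listing only the needed columns, instead of a
    SELECT * that drags every wide column through the pipeline."""
    column_list = ", ".join(columns)
    return f"SELECT {column_list} FROM {table}"

query = build_extract_query("dbo.Orders", NEEDED_COLUMNS)
print(query)  # SELECT order_id, customer_id, order_date, amount FROM dbo.Orders
```

The same pruning can be done with a view on the source or an explicit query in the Copy activity; the point is that fewer bytes are read, serialized, and written per refresh.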
2. Append-Only Tables with a Monotonic Primary Key
Problem:
Daily deltas were small, but the PySpark notebooks managing the append logic still had to start a Spark cluster each time, adding ~4 minutes per run just for startup.
Change:
We eliminated Spark for this task. Instead, a Lookup activity reads the max key directly, and a Copy activity appends only the new rows.
Result:
Refresh time fell from 17 minutes to 2 — an 88% improvement.
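The Lookup + Copy pattern is configured in the Fabric pipeline rather than written as code, but its logic is simple enough to sketch. Below, sqlite3 stands in for both the source system and the Lakehouse destination; in Fabric, the first query is the Lookup activity and the second drives the Copy activity.

```python
import sqlite3

# Sketch of the incremental-append logic behind Lookup + Copy, using
# sqlite3 as a stand-in for the source and destination stores.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
dst.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

src.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])
dst.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

# "Lookup": read the max key already present in the destination.
max_key = dst.execute("SELECT COALESCE(MAX(id), 0) FROM events").fetchone()[0]

# "Copy": append only rows whose key is newer than that watermark.
new_rows = src.execute(
    "SELECT id, payload FROM events WHERE id > ?", (max_key,)).fetchall()
dst.executemany("INSERT INTO events VALUES (?, ?)", new_rows)

print(dst.execute("SELECT COUNT(*) FROM events").fetchone()[0])  # 3
```

Because the key is monotonic, a single comparison replaces any need for change tracking, and no cluster has to spin up to move a handful of rows.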
3. Tables with Updatable Rows
Problem:
For tables that allowed updates, we ran frequent micro-batches using PySpark notebooks. Again, the compute startup cost outweighed the actual work.
Change:
We switched to Python notebooks using DuckDB for lightweight local joins and merges, writing back to the Lakehouse afterward.
Result:
Refresh time dropped from 35 minutes to 5 — an 86% improvement.
The Overall Impact
Across all patterns, total refresh time dropped from ~80 minutes to 15, an 81% faster pipeline overall.
For the client, that means:
- More frequent Power BI refreshes
- Shorter compute windows
- Lower operating cost per run
By cutting refresh times by 81%, the client didn't just save minutes; they gained flexibility, lower costs, and the ability to make data-driven decisions faster.
Key Takeaways
- Trim early. Remove unnecessary data at the source — fewer bytes moved means less time spent.
- Right-size your engine. Use Spark where parallelism matters; rely on Fabric-native activities for small deltas.
- Lightweight compute wins. DuckDB is a great fit for fast, in-process upserts on modest datasets.
At Amarji, we help teams get more out of their data stack by designing with intent, not excess.