Stop Retrying Everything: Smart Graph Batch Retry Logic

The Day My Batch Requests Started Fighting Back

Picture this: It’s 2 AM, you’re on your third cup of coffee, and you’re watching your perfectly crafted Microsoft Graph batch request fail spectacularly. Again.

You’ve got 25 files to download from SharePoint. Your batch processes 24 of them perfectly, then one lonely file decides to throw a throttling tantrum. What does your retry logic do? It throws away all 24 successful downloads and starts over. From scratch. Like a digital Groundhog Day, but less amusing and more soul-crushing.

Sound familiar? Welcome to the “retry everything” club – where perfectly good API calls go to die unnecessarily. 😅

The Grocery Cart Problem (Or: Why We’re Doing This Wrong)

Let me paint you a picture. You’re at the grocery store with a cart full of 20 items. You get to checkout, and the cashier says, “Sorry, we’re out of milk.”

What would you do?

Option A: Put back everything, go home, and come back later to shop for all 20 items again
Option B: Buy the 19 items you can get, then come back just for the milk

If you picked Option A, congratulations – you think like most API retry logic! If you picked Option B, you’re ready to learn about smart retries.

The “retry everything” approach is like Option A, and here’s why it’s bonkers:

🔄 Wasted effort: You’re re-requesting stuff that already worked perfectly
🐌 Slower performance: Users wait longer while you redo successful work
📈 Throttling amplification: You’re actually making the problem worse by hitting successful endpoints again
🔍 Poor debugging: Can’t easily identify which specific requests are the real troublemakers

I learned this the hard way when I watched a simple file sync turn into an API call avalanche. 20 requests became 40, then 80, then… well, let’s just say Microsoft’s throttling system got very acquainted with my application.

The “Aha!” Moment (It’s Simpler Than You Think)

The solution hit me during one of those 2 AM debugging sessions: What if we only retry the stuff that actually failed?

Revolutionary, right? 😏

Here’s the beautiful thing – this isn’t some PhD-level computer science. It’s just common sense applied to code. Keep the winners, retry the losers. Simple.

But (there’s always a “but”), there’s one sneaky technical challenge that makes this trickier than it sounds. Microsoft’s Graph SDK has a helpful method called NewBatchWithFailedRequests(), but it has a quirk: it generates brand new request IDs. This breaks your ability to map responses back to your original data.

Think of it like this: You order pizza for table 5, but when they bring the replacement slice, they call it table 23. Good luck figuring out who ordered what!
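
To make the mismatch concrete, here is a minimal sketch that uses plain dictionaries instead of the Graph SDK. The class name `RequestIdMappingDemo`, the `Lookup` helper, and the IDs and file names are all made up for illustration. It only shows why a regenerated ID can no longer be mapped back to your data.

```csharp
using System;
using System.Collections.Generic;

public static class RequestIdMappingDemo
{
    // Returns the file name mapped to a response ID, or null when the ID is unknown.
    public static string? Lookup(IReadOnlyDictionary<string, string> fileMapping, string responseId)
        => fileMapping.TryGetValue(responseId, out var fileName) ? fileName : null;
}
```

With a mapping of `"1"` to `report.docx` and `"2"` to `budget.xlsx`, `Lookup(mapping, "2")` finds `budget.xlsx`, while `Lookup(mapping, Guid.NewGuid().ToString())` returns `null`. That second case is the situation a retried request ends up in when it comes back under a new ID.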

If you’re new to Graph batching or request mapping, I’d recommend checking out my post on Graph Batching for File Content: Mapping Requests to Responses first. It’s like the prequel to this story – explains how to keep track of what’s what when dealing with batch responses.

Quick Win Summary (For the Impatient Developers)

The Problem: Your retry logic is like that friend who starts the entire conversation over when they missed one word. Inefficient and annoying.

The Solution: A drop-in extension method that only retries the actual failures while keeping successful responses safe and sound.

The Payoff:

⚡ Faster operations (no more re-downloading working files)
📉 Fewer API calls (your rate limits will thank you)
🎯 Less throttling (stop beating dead endpoints)
😌 Happier users (and happier you at 2 AM)

The Catch: You need to understand request-to-response mapping. Don’t worry, it’s not rocket science, and I’ve got a whole post about it.

Time Investment: About 5 minutes to implement, countless hours of frustration saved.

Ready for the nitty-gritty? Let’s dive in! 👇

The Hero of Our Story: The Smart Retry Extension

Okay, here’s where we get our hands dirty. The main challenge isn’t just filtering out successful requests – it’s that pesky NewBatchWithFailedRequests method that scrambles your request IDs like eggs at Sunday brunch.

Here’s the extension method that saves the day:

internal static class GraphServiceClientExtensions 
{
    internal static async Task<(IReadOnlyDictionary<string, HttpStatusCode> Statuses, Dictionary<string, HttpResponseMessage> BatchResponse)> 
        PostBatchWithFailedDependencyRetriesAsync(this GraphServiceClient graphClient, BatchRequestContentCollection originalBatch) 
    {
        const int maxRetries = 5;
        TimeSpan delay = TimeSpan.FromSeconds(1);

        Dictionary<string, HttpResponseMessage> allResponses = new Dictionary<string, HttpResponseMessage>();
        Dictionary<string, HttpStatusCode> allStatuses = new Dictionary<string, HttpStatusCode>();

        BatchRequestContentCollection batchToSend = originalBatch;

        for (int attempt = 1; attempt <= maxRetries; attempt++) 
        {
            BatchResponseContentCollection batchResponse = await graphClient.Batch.PostAsync(batchToSend);
            Dictionary<string, HttpStatusCode> responses = await batchResponse.GetResponsesStatusCodesAsync();

            // Filter out failures (excluding redirects which are normal for file content)
            Dictionary<string, HttpStatusCode> failedRequests = responses
                .Where(kvp => !BatchResponseContent.IsSuccessStatusCode(kvp.Value) && kvp.Value != HttpStatusCode.Found)
                .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);

            // Collect all responses from this attempt
            foreach (var kvp in responses) 
            {
                var response = await batchResponse.GetResponseByIdAsync(kvp.Key);
                allResponses[kvp.Key] = response;
                allStatuses[kvp.Key] = kvp.Value;
            }

            if (failedRequests.Count == 0 || attempt == maxRetries) 
            {
                break;
            }

            await Task.Delay(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2);

            // The key problem: NewBatchWithFailedRequests creates new request IDs!
            // Pass only the real failures so redirects (302 Found) are not retried.
            batchToSend = batchToSend.NewBatchWithFailedRequests(failedRequests);
            
            // This is why we need this method - restore the original request IDs
            await RestoreOriginalRequestIdsAsync(batchToSend, originalBatch);
        }

        return (allStatuses, allResponses);
    }

    private static async Task RestoreOriginalRequestIdsAsync(
        BatchRequestContentCollection newBatch, 
        BatchRequestContentCollection originalBatch) 
    {
        var stepsSnapshot = newBatch.BatchRequestSteps.ToArray();

        foreach (var kvp in stepsSnapshot) 
        {
            var oldStepId = kvp.Key;
            var step = kvp.Value;
            var requestPath = step.Request.RequestUri!.AbsolutePath;

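            // Find the original request ID by matching the request path.
            // Assumes every request in the batch has a unique path: First() throws
            // if no original request matches and picks the first hit on duplicates.
            var matchingOriginal = originalBatch.BatchRequestSteps
                .First(x => x.Value.Request.RequestUri!.AbsolutePath == requestPath);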

            var originalStepId = matchingOriginal.Key;

            if (oldStepId != originalStepId) 
            {
                newBatch.RemoveBatchRequestStepWithId(oldStepId);

                var newStep = new BatchRequestStep(
                    requestId: originalStepId,
                    httpRequestMessage: step.Request,
                    dependsOn: step.DependsOn);

                newBatch.AddBatchRequestStep(newStep);
            }
        }
    }
}

What’s happening here? Think of it as a diplomatic negotiator for your API calls:

The ID Shuffle Problem: NewBatchWithFailedRequests gives failed requests shiny new IDs, like witness protection for HTTP requests
The Detective Work: RestoreOriginalRequestIdsAsync plays detective, matching requests by their paths to find their original identities
The Happy Reunion: Failed requests get their original IDs back, so your mapping dictionary doesn’t break down in tears

It’s like having a really good wedding planner who makes sure everyone sits at the right table, even after the venue changes.
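
One caveat with matching by path: it only works when every request in the batch has a distinct path. The sketch below is not part of the extension method above. It is a hypothetical helper (`RequestIdLookup.BuildPathIndex`) that builds a path-to-ID index up front and fails loudly on duplicates, assuming all request URIs are absolute.

```csharp
using System;
using System.Collections.Generic;

public static class RequestIdLookup
{
    // Builds a path -> original request ID map and refuses ambiguous paths.
    public static Dictionary<string, string> BuildPathIndex(
        IEnumerable<KeyValuePair<string, Uri>> originalRequests)
    {
        var index = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (var (requestId, uri) in originalRequests)
        {
            // TryAdd returns false when the path is already present.
            if (!index.TryAdd(uri.AbsolutePath, requestId))
            {
                throw new InvalidOperationException(
                    $"Two requests share the path '{uri.AbsolutePath}', so path matching is ambiguous.");
            }
        }
        return index;
    }
}
```

If two requests differ only in their query string, they share an `AbsolutePath` and this helper throws instead of silently mapping a retry to the wrong file.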

Showtime: Watching Smart Retries in Action

Now let’s see our smart retry logic work its magic in a real-world scenario. Imagine you’re building a document sync tool and need to download 25 files from SharePoint. Some will work perfectly, others might throw tantrums due to throttling or network hiccups.

Here’s how the new approach handles it like a champ:

// Let's say you need to download content from 25 SharePoint files
var batch = new BatchRequestContentCollection(graphClient);
var fileMapping = new Dictionary<string, FileContentRequest>();

// Build the batch for file content downloads
var filesToDownload = await GetFilesToProcess(); // Your method to get file list

foreach (var file in filesToDownload) 
{
    var requestInfo = graphClient.Sites[siteId]
                                .Drives[driveId]
                                .Items[file.DriveItemId]
                                .Content
                                .ToGetRequestInformation();
    
    var requestId = await batch.AddBatchRequestStepAsync(requestInfo);
    
    // Map the request ID to your file info (same pattern as previous post)
    fileMapping[requestId] = new FileContentRequest 
    { 
        DriveItemId = file.DriveItemId,
        FileName = file.Name,
        ExpectedSize = file.Size,
        DownloadStartTime = DateTime.Now
    };
}

// 🎯 Here's where the magic happens - just one line change!
var (statuses, responses) = await graphClient.PostBatchWithFailedDependencyRetriesAsync(batch);

// Process results - this is where the retry really shines
var successfulDownloads = new List<FileDownloadResult>();
var failedDownloads = new List<string>();

foreach (var kvp in fileMapping) 
{
    var requestId = kvp.Key;
    var fileRequest = kvp.Value;
    
    if (responses.TryGetValue(requestId, out var response)) 
    {
        var statusCode = statuses[requestId];
        
        if (BatchResponseContent.IsSuccessStatusCode(statusCode)) 
        {
            // Success! Handle the file content
            var contentBytes = await response.Content.ReadAsByteArrayAsync();
            
            // Save to your desired location
            var localPath = Path.Combine(downloadFolder, fileRequest.FileName);
            await File.WriteAllBytesAsync(localPath, contentBytes);
            
            successfulDownloads.Add(new FileDownloadResult 
            { 
                FileName = fileRequest.FileName,
                LocalPath = localPath,
                ActualSize = contentBytes.Length,
                ExpectedSize = fileRequest.ExpectedSize,
                DownloadTime = DateTime.Now - fileRequest.DownloadStartTime
            });
        } 
        else 
        {
            // Even after smart retries, this file failed
            failedDownloads.Add($"{fileRequest.FileName} ({statusCode})");
            
            // Log the specific failure for debugging
            Console.WriteLine($"Failed to download {fileRequest.FileName}: {statusCode}");
        }
        
        // Always clean up the response
        response.Dispose();
    }
}

Console.WriteLine($"Successfully downloaded: {successfulDownloads.Count} files");
if (failedDownloads.Any())
{
    Console.WriteLine($"Failed downloads: {string.Join(", ", failedDownloads)}");
}

The beautiful part? Look at that line with the magic emoji 🎯. That’s literally the only change you need to make to your existing batch processing code. Everything else stays exactly the same.

Here’s what’s happening behind the scenes:

First batch attempt: Say 20 files succeed, 5 fail due to throttling
Smart filtering: Keep those 20 successful responses safe
Targeted retry: Build a new batch with just the 5 failures
ID preservation: Make sure those 5 retries still map to your original file info
Rinse and repeat: Maybe 4 of the 5 succeed on retry, leaving just 1 persistent troublemaker

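The result? Retrying everything would cost up to 125 individual requests (25 files × 5 attempts). In the scenario above, selective retry sends 25 + 5 + 1 + 1 + 1 = 33, because the one persistent failure is the only request that uses all 5 attempts. Your throttling problems become manageable, and files download way faster.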
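
If you want to check the arithmetic for your own batch sizes, here is a tiny sketch. The class and method names are hypothetical, and the numbers come from the scenario above (25 requests, 5 attempts, 5 failures after the first attempt, 1 persistent failure after the second).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RetryCostDemo
{
    // Retry-everything: every attempt resends the whole batch.
    public static int RetryEverything(int totalRequests, int attempts)
        => totalRequests * attempts;

    // Selective retry: each attempt only sends what is still failing.
    // pendingPerAttempt[i] is the number of requests sent on attempt i + 1.
    public static int SelectiveRetry(IEnumerable<int> pendingPerAttempt)
        => pendingPerAttempt.Sum();
}
```

`RetryEverything(25, 5)` gives 125 requests, while `SelectiveRetry(new[] { 25, 5, 1, 1, 1 })` gives 33.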

// Supporting classes for the example above
public class FileContentRequest 
{
    public string DriveItemId { get; set; } = string.Empty;
    public string FileName { get; set; } = string.Empty;
    public long ExpectedSize { get; set; }
    public DateTime DownloadStartTime { get; set; }
}

public class FileDownloadResult 
{
    public string FileName { get; set; } = string.Empty;
    public string LocalPath { get; set; } = string.Empty;
    public long ActualSize { get; set; }
    public long ExpectedSize { get; set; }
    public TimeSpan DownloadTime { get; set; }
}

The Method to the Madness (What’s Really Happening)

I know that extension method looks intimidating – like trying to read assembly instructions in a foreign language. But once you break it down, it’s actually pretty logical. Let me walk you through the step-by-step dance:

Step 1: The First Attempt “Let’s try everything once and see what happens”

Sends your original batch of 25 files and carefully captures every single response and status code. No throwing anything away yet.

Step 2: The Great Sorting “Okay, who succeeded and who’s being difficult?”

Separates the winners from the losers, but (and this is important) ignores redirect responses. Why? Because when you request file content, a redirect is totally normal: Graph answers with a 302 Found that points to the actual download location instead of returning the bytes directly.
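
Since a 302 is not a failure, the caller still has to follow it to get the bytes. Here is a minimal sketch of that step. `RedirectDownload`, `GetRedirectTarget`, and `ReadContentAsync` are hypothetical names, and the sketch assumes the individual batch response exposes the `Location` header on its `HttpResponseMessage`.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class RedirectDownload
{
    // Returns the redirect target for a 302 with a Location header, otherwise null.
    public static Uri? GetRedirectTarget(HttpResponseMessage response)
        => response.StatusCode == HttpStatusCode.Found ? response.Headers.Location : null;

    // Reads the body directly on 2xx, follows the redirect on 302, returns null otherwise.
    public static async Task<byte[]?> ReadContentAsync(HttpResponseMessage response, HttpClient httpClient)
    {
        if (response.IsSuccessStatusCode)
        {
            return await response.Content.ReadAsByteArrayAsync();
        }

        var target = GetRedirectTarget(response);
        return target is null ? null : await httpClient.GetByteArrayAsync(target);
    }
}
```

A plain `HttpClient` is used for the second hop on the assumption that the redirect URL is already authorized. Check that against your own tenant before relying on it.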

Step 3: The Preservation Society “Keep the good stuff safe while we deal with the troublemakers”

All successful responses get stored in a safe place while we build a new, smaller batch containing only the failed requests. It’s like having a really good filing system for your API responses.

Step 4: The Identity Crisis Resolution “Wait, who are you again? Let me check your original ID…”

This is the tricky bit! The NewBatchWithFailedRequests method gives everyone new IDs, like a witness protection program for HTTP requests. Our RestoreOriginalRequestIdsAsync method plays detective, matching requests by their URL paths to restore their original identities.

Step 5: The Polite Wait “Let’s not be pushy – maybe try again in a second?”

Implements exponential backoff – starts with a 1-second wait, then 2, 4, and 8 seconds between the five attempts. This prevents your app from being that annoying person who keeps knocking on the door every second.
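
The wait schedule is easy to reason about once you write it out. This sketch mirrors the doubling logic in the extension method. `BackoffSchedule.Build` is a hypothetical helper, not something the SDK provides.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSchedule
{
    // Delays between attempts: initial, initial * 2, initial * 4, ...
    // There are maxAttempts - 1 waits because nothing follows the last attempt.
    public static IReadOnlyList<TimeSpan> Build(TimeSpan initialDelay, int maxAttempts)
    {
        var delays = new List<TimeSpan>();
        var delay = initialDelay;
        for (int attempt = 1; attempt < maxAttempts; attempt++)
        {
            delays.Add(delay);
            delay = TimeSpan.FromSeconds(delay.TotalSeconds * 2);
        }
        return delays;
    }
}
```

`Build(TimeSpan.FromSeconds(1), 5)` yields waits of 1, 2, 4, and 8 seconds, so a request that fails on every attempt spends 15 seconds waiting in total. Graph also sends a `Retry-After` header on throttled responses. Honoring it when it is longer than the computed delay would be a reasonable refinement, but it is not part of the method above.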

Step 6: The Safety Net “Okay, we tried 5 times. Some files just aren’t meant to be downloaded today.”

Gives up after 5 attempts to prevent infinite retry loops. Because sometimes you need to know when to walk away from the poker table.

Why This Actually Works (The Science Behind the Magic)

Here’s what makes this approach so much better than the “retry everything” strategy:

🎯 Surgical Precision: Only retry what actually failed – it’s like having a really good therapist who focuses on the actual problems instead of rehashing everything from childhood.

No wasted API calls on requests that already succeeded. If 24 out of 25 files downloaded perfectly, why punish them with another round trip?

⚡ Speed Demon: Successful requests don’t get repeated, so everything finishes faster – sometimes dramatically faster.

The responses that succeeded on the first attempt are kept as-is, so only the problematic requests cost extra time on later attempts.

🤝 Throttling-Friendly: Fewer total requests means you’re less likely to hit Microsoft’s rate limits, and when you do, recovery is faster.

Instead of amplifying throttling issues, you’re actually helping to resolve them by reducing load on the endpoints that are already struggling.

🔄 Drop-in Simplicity: Change literally one line of code and you’re done. No architectural rewrites, no complex state management – just swap out the method call.

Your existing error handling, logging, and business logic all stay exactly the same.

🔍 Debug Paradise: Easy to see exactly which requests are consistently failing, making troubleshooting a breeze instead of a nightmare.

When file “ImportantDocument.pdf” fails on every retry attempt, you know there’s something specific about that file, not your entire batch logic.

The “Before and After” Moment

Let me paint you a picture of how this changes your life:

Before Smart Retries:

25 file batch fails on 3 files due to throttling
Retry all 25 files → now 5 files fail due to increased throttling
Retry all 25 files again → now 8 files fail
You’re now in the throttling spiral of doom
Users are staring at loading spinners
You’re questioning your career choices

After Smart Retries:

25 file batch fails on 3 files due to throttling
Keep the 22 successful files, retry only the 3 failures
Maybe 2 of the 3 succeed on retry, leaving 1 stubborn file
Final retry gets the last file, or you log it as a persistent issue
Users get 24/25 files quickly, you sleep better at night

It’s the difference between being stuck in traffic because one lane is blocked (and everyone keeps switching to that lane), versus just using the open lanes and going around the problem.

The Bottom Line (And Why Your Future Self Will Thank You)

I’ll be real with you – when I first started working with Microsoft Graph batching, I thought the built-in retry policies were enough. “How hard could it be?” I thought. “APIs fail sometimes, just retry them!”

Then I built my first real-world document sync application. Suddenly, I was dealing with users uploading hundreds of files, enterprise throttling limits, and the occasional network hiccup that would bring the whole operation to a screeching halt.

That’s when I learned the hard way that “retry everything” is like using a sledgehammer to hang a picture frame. Sure, it might work, but you’re probably going to break some stuff in the process.

This selective retry approach has been a game-changer. Not just for performance (though users definitely notice when their bulk operations actually complete), but for debugging too. When you can see that ImportantReport_v23_FINAL_REALLY_FINAL.docx is the file that keeps failing, you can actually do something about it.

The best part? Once you have this extension method in your toolkit, it becomes muscle memory. You’re not adding complexity to your day-to-day development – you’re just swapping out one method call for a smarter one. It’s like upgrading from a flip phone to a smartphone – you wonder how you ever lived without it.

Pro tip: After implementing this, keep an eye on your application logs. You’ll start to notice patterns in failures that you never saw before. Maybe certain file types are more prone to issues, or maybe there’s a specific time of day when throttling gets worse. This kind of insight is pure gold for optimization.
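
One cheap way to start spotting those patterns is to count the final failures by status code after each run. This is a minimal sketch over the status dictionary the extension method returns. `FailureReport.CountByStatus` is a hypothetical helper, and it treats 302 Found as a non-failure, just like the retry logic does.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Net;

public static class FailureReport
{
    // Counts non-2xx results per status code, ignoring 302 redirects.
    public static Dictionary<HttpStatusCode, int> CountByStatus(
        IReadOnlyDictionary<string, HttpStatusCode> statuses)
        => statuses.Values
            .Where(s => (int)s < 200 || (int)s > 299)
            .Where(s => s != HttpStatusCode.Found)
            .GroupBy(s => s)
            .ToDictionary(g => g.Key, g => g.Count());
}
```

Logging this dictionary per run makes it easier to tell a throttling spike (lots of 429s) from a problem with specific files (the same 404 or 423 every time).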

The moral of the story? Sometimes the biggest performance improvements come not from doing things faster, but from doing fewer unnecessary things. And sometimes, the best debugging tool is just… not breaking the working stuff while you fix the broken stuff.

Your 2 AM debugging sessions will never be the same. 😌

Want to Learn More? (The Reading List)

Microsoft Graph JSON batching - The official documentation (surprisingly readable!)
Graph Batching for File Content: Mapping Requests to Responses - My previous post that sets up the foundation for this one
Microsoft Graph throttling guidance - Understanding what makes Microsoft’s APIs cranky
Graph Explorer - Test your batch requests interactively (great for experimenting)
Exponential Backoff Pattern - The polite way to retry things

Now go forth and batch smarter, not harder! 🚀

Jeppe Spanggaard

A passionate software developer. I love to build software that makes a difference!