How to build an application that processes user-uploaded files using Azure Functions, Remix, and XState
Establishing a Motivating Goal
Given my career I have an interest in keeping up with application development techniques and wanted to learn more about Azure Functions.
I believe you learn much better when you are motivated. In order to be motivated, we need a problem we want to solve.
You may be asking “Why does having a problem improve learning?”
As you make progress solving the problem, you get emotional satisfaction and form a stronger connection to the material, which makes you more likely to retain the information and be able to apply it in the future.
Thus, I began looking for a moderate-complexity problem where Azure Functions would be a great solution.
Problem: Converting Data from XML to JSON
One of my hobbies is playing the game StarCraft 2. For those unfamiliar with the game, it is a real time strategy (RTS) game where players compete in environments with limited resources. Players are either eliminated via combat or surrender and the last player remaining is victorious.
Each player controls a set of units, and each unit has different health, weapons, upgrades, and abilities. Given the high number of units in SC2, the community developed many tools to ease discovering and learning these values in order to develop strategies and “build orders”.
One method of finding these values is from a set of XML files exported by the Map Editor application. I will refer to this set of files as Balance Data.
Issue: Mismatch of export format with consumption format
The Game Editor application exports a folder with an .xml file for each unit. However, most user-facing applications that display this data are web-based. It would be much easier if the data were represented as a unified file in .json format so it could be natively parsed.
The Problem
This conversion of SC2 Balance Data from a set of XML files to a single JSON file is what we want to build.
Sneak Peek at Application
Divide and Conquer
As with most projects, we want to break down the system into smaller problems and solve the most important problems first. After all, if we can’t solve the biggest challenges then the smaller challenges will be irrelevant.
Also, the challenges you perceive as the most difficult often provide the most potential for learning since you are unfamiliar with that domain.
- Application: User Uploading XML Files
- Background Process: Start Execution when Files are Uploaded
- Background Process: Converting XML to JSON
- Background Process: Saving JSON to Storage
- Application: Providing Process State Feedback to User
Problem 2: Starting Background Process when File is Uploaded
We actually start with problem 2 instead of 1 since it is a more critical part of the system and utilizes the features of Azure Functions.
Doing a quick search, you will find that an Azure Function can be triggered by a file/blob upload. This is what made the technology a fitting solution for the problem.
We will use the new V4-preview version of @azure/functions, which provides a much cleaner and more natural API.
Install Azure Function CLI tool using WinGet
https://github.com/Azure/azure-functions-core-tools#to-install-with-winget
winget install Microsoft.Azure.FunctionsCoreTools
Create new Azure Function project using the V4 model
func init azure-functions-v4
Navigate into the project folder
cd azure-functions-v4
Create a new function using the V4 package (I will be using the Node environment and the TypeScript language)
func new convertXmlToJson --model V4
Start with the most basic function to ensure the CLI, Azure connectivity, and deployment are working.
We simply print out a string when a blob is uploaded.
import { app, InvocationContext } from "@azure/functions"
export async function blobTriggerFn(
blob: ReadableStream,
context: InvocationContext,
): Promise<unknown> {
context.log(`context: ${JSON.stringify(context, null, 2)}`)
context.log(`Blob was Uploaded`)
}
app.storageBlob('blobTrigger', {
path: 'sc2-balancedata-xml/{name}',
connection: 'AzureWebJobsStorage',
handler: blobTriggerFn,
})
Note: the value of the path property must use the desired container name before the /. In my case, the container is sc2-balancedata-xml.
The value of the connection property must be the name of the key holding the connection string to your storage account in local.settings.json (we will set that in the next steps)
We have a function locally, but we want to verify it works when deployed.
Create Azure Storage Blob Container
If you don’t have a storage account, create one.
Create a new blob container using the matching name sc2-balancedata-xml
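If you prefer the CLI to the Portal, commands similar to the following should work (a sketch; the variable names are placeholders, not from the original article):
az storage account create -n $STORAGE_ACCOUNT_NAME -g $RESOURCE_GROUP_NAME
az storage container create -n sc2-balancedata-xml --account-name $STORAGE_ACCOUNT_NAME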
Create Azure Function Resource
Create Azure Functions app using V4.
You may do this through the Portal or use a command similar to the following:
az functionapp create `
-g $RESOURCE_GROUP_NAME `
-s $STORAGE_ACCOUNT_NAME `
-p $SERVICE_PLAN_NAME `
-n $FUNCTIONS_ACCOUNT_NAME `
--functions-version 4
Download Settings
Download the settings from the Azure resources to your local repository. This will update the local.settings.json with the appropriate connection strings and other secrets to enable deployment to those resources.
func azure storage fetch-connection-string $STORAGE_ACCOUNT_NAME
func azure functionapp fetch-app-settings $AZURE_FUNCTIONS_ACCOUNT_NAME
It will look something like this:
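(Below is a sketch of a typical local.settings.json; the placeholder values will be replaced by the real connection strings fetched from your account.)
{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net",
    "FUNCTIONS_WORKER_RUNTIME": "node"
  }
}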
Add deploy scripts
The function deployment works by creating a zip file of the repo contents and uploading it to the WebApp. Given I will be using TypeScript, which has a compilation step, we need to ensure the build output is up to date when deploying or we may see the deployed function behave differently than the source code.
For this reason, I chose to add the deployment as an NPM script to ensure it always builds before deploying.
"scripts": {
"build": "tsc",
"watch": "tsc -w",
"clean": "rimraf dist",
"prestart": "npm run clean && npm run build",
"start": "func start",
"predeploy": "npm run prestart",
"deploy": "func azure functionapp publish mattmazzola"
}
- Remember to replace mattmazzola with the name of the Function App you created above.
Test the Function!
Now that the function is deployed, we can test it!
First let’s open up the logs so we can observe the function output when it is triggered.
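One way to do this is to stream them with the Core Tools (my suggestion here; the portal’s Log stream blade works just as well):
func azure functionapp logstream $FUNCTIONS_ACCOUNT_NAME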
Then upload a file to the sc2-balancedata-zip container through the portal.
You should see logs similar to those shown below:
2023-07-18 03:39:07.765
Executing 'Functions.blobTrigger' (Reason='New blob detected(LogsAndContainerScan): sc2-balancedata-zip/balancedata_1689651542631.zip', Id=97aa34e9-b6ac-4b3c-9f2e-660ec94832e4)
2023-07-18 03:39:07.766
Trigger Details: MessageId: 71cc8ff9-594d-4b20-9017-b7ecd4a7a550, DequeueCount: 1, InsertedOn: 2023-07-18T03:39:07.000+00:00, BlobCreated: 2023-07-18T03:39:03.000+00:00, BlobLastModified: 2023-07-18T03:39:03.000+00:00
2023-07-18 03:39:08.473
context: { "invocationId": "97aa34e9-b6ac-4b3c-9f2e-660ec94832e4", "functionName": "blobTrigger", "extraInputs": {}, "extraOutputs": {}, "traceContext": { "traceParent": "00-07c88930abdd7226e2a60264ed71fdb4-361a957296560e27-00", "traceState": "", "attributes": { "OperationName": "blobTrigger" } }, "triggerMetadata": { "blobTrigger": "sc2-balancedata-zip/balancedata_1689651542631.zip", "uri": ...
Problem 4: Background Process: Saving JSON to Storage
We have done a lot. Let’s remember our goal is to process the incoming XML data and output a JSON file.
Similar to before, let’s start with the simplest version by updating the existing function to construct a dummy object with the triggering blob’s name and save the file using a formatted name.
These are the changes we will make:
- Update function to return an object
- Create an output blob binding and set it as the return property
import { app, InvocationContext, output } from "@azure/functions"
export async function blobTriggerFn(
blob: ReadableStream,
context: InvocationContext,
): Promise<unknown> {
context.log(`context: ${JSON.stringify(context, null, 2)}`)
context.log(`Blob was Uploaded`)
const json = {
triggerBlobName: context.triggerMetadata.name
}
return json
}
const jsonBlobOutput = output.storageBlob({
path: 'sc2-balancedata-json/balancedata_{DateTime}.json',
connection: 'AzureWebJobsStorage',
})
app.storageBlob('blobTrigger', {
path: 'sc2-balancedata-xml/{name}',
connection: 'AzureWebJobsStorage',
handler: blobTriggerFn,
return: jsonBlobOutput,
})
Notice:
- We output to a different container, sc2-balancedata-json
- We used a Binding Expression to include the {DateTime} string in the output file name. Example: balancedata_2023-07-16T18-39-59Z.json
- This guarantees that the file name is unique and also includes useful information about its creation time.
At this point, we have the inputs and outputs working. The function is triggered when a blob is uploaded, and we save a file to another container.
Problem 3: Converting XML to JSON
Up until now we have been manually uploading a single file to test our function. We didn’t actually care about the contents of the file since we never read it, but our final application will.
How do we process a set of files as a unit?
Recall the “Balance Data” format is a set of XML files. If we uploaded each file individually, our function would be triggered once per file.
Worse, each invocation would only have context of the single file which caused it to be invoked. What we want is awareness of ALL the XML files.
We want a guarantee that the entire set of XML files is processed as a group.
For the reasons above, I decided it would be best if the uploaded file was a .zip of all the .xml files. Our complexity slightly increases from XML[] → JSON to ZIP → XML[] → JSON; however, we have removed a whole class of errors and made the function more robust.
Decompressing the .zip file
In order to decompress the input, we need to know what format the input is. It isn’t very clear from the documentation.
Possible values are string, binary, or stream.
From my searching, I believe the type of the blob argument is ReadableStream:
export async function blobTriggerFn(
blob: ReadableStream,
context: InvocationContext,
): Promise<unknown> {
I tried various libraries to unzip, such as decompress, but it was unclear how to adapt them to operate on streams. For this reason, I chose AdmZip, which can be constructed directly from the raw blob data.
Below we get the files within the .zip, filter to the .xml files, then get each file’s contents as a string.
import * as admZip from "adm-zip"
...
const zip = new admZip(blob)
const zipEntries = zip.getEntries()
const xmlFileEntries = zipEntries
.flatMap(zipEntry => {
if (zipEntry.entryName.endsWith('.xml')) {
const filename = zipEntry.entryName
const xmlContent = zipEntry.getData().toString('utf8')
return [{ filename, xmlContent }]
}
return []
})
const xmlFileContents = xmlFileEntries.map(xmlFileEntry => xmlFileEntry.xmlContent)
Merging the set of XML files into a single XML file
Each XML file represents a “unit” object in StarCraft. We want to merge them by creating a <units> element containing ALL of the unit files.
export function mergeXmls(unitsStrings: string[]): string {
  // Drop each file's first line (the XML declaration) so the documents
  // can be concatenated inside a single root element
  const unitsWithoutFirstlines = unitsStrings.map(u => {
    return u
      .split('\n')
      .slice(1)
      .join('\n')
      .trim()
  })
  // Wrap all the unit documents in a single <units> root element
  return `
<units>
${unitsWithoutFirstlines.join('\n')}
</units>`.trim()
}
const xmlFile = mergeXmls(xmlFileContents)
Saving the unified XML file
Similar to how we created a new output for the JSON return value, we can create “extra” outputs and associate them with our function as well.
const xmlBlobOutput = output.storageBlob({
path: 'sc2-balancedata-xml/balancedata_{DateTime}.xml',
connection: 'AzureWebJobsStorage',
})
Add this “extra” output to our function registration
app.storageBlob('blobTrigger', {
path: 'sc2-balancedata-zip/{name}',
connection: 'AzureWebJobsStorage',
handler: blobTriggerFn,
extraOutputs: [xmlBlobOutput],
return: jsonBlobOutput,
})
Set the unified XML to this output
context.extraOutputs.set(xmlBlobOutput, xmlFile)
Converting XML to JSON
Most TypeScript libraries I have found to perform this conversion from XML to JSON use libraries compiled from C++ and thus require something like running node-gyp during install. I have experienced many failures due to Python versions and wanted to avoid this unreliability.
I wanted a library with minimal dependencies and ended up using xml-js
import { xml2js } from "xml-js"
...
const json = xml2js(xmlFile)
return json
The full function looks like this:
import { app, InvocationContext, input, output } from "@azure/functions"
import * as admZip from "adm-zip"
import { mergeXmls } from "./utilities"
import { xml2js } from "xml-js"
export async function blobTriggerFn(
blob: ReadableStream,
context: InvocationContext,
): Promise<unknown> {
context.log(`context: ${JSON.stringify(context, null, 2)}`)
try {
const zip = new admZip(blob)
const zipEntries = zip.getEntries()
const xmlFileEntries = zipEntries
.flatMap(zipEntry => {
if (zipEntry.entryName.endsWith('.xml')) {
const filename = zipEntry.entryName
const xmlContent = zipEntry.getData().toString('utf8')
return [{ filename, xmlContent }]
}
return []
})
const xmlFileContents = xmlFileEntries.map(xmlFileEntry => xmlFileEntry.xmlContent)
const xmlFile = mergeXmls(xmlFileContents)
context.extraOutputs.set(xmlBlobOutput, xmlFile)
const json = xml2js(xmlFile)
return json
}
catch (error) {
context.log(`Error: ${error}`)
const outputJson = {
'data': JSON.stringify(error, null, 2)
}
return outputJson
}
}
const xmlBlobOutput = output.storageBlob({
path: 'sc2-balancedata-xml/balancedata_{DateTime}.xml',
connection: 'AzureWebJobsStorage',
})
const jsonBlobOutput = output.storageBlob({
path: 'sc2-balancedata-json/balancedata_{DateTime}.json',
connection: 'AzureWebJobsStorage',
})
app.storageBlob('blobTrigger', {
path: 'sc2-balancedata-zip/{name}',
connection: 'AzureWebJobsStorage',
handler: blobTriggerFn,
extraOutputs: [xmlBlobOutput],
return: jsonBlobOutput,
})
Notice the input container is sc2-balancedata-zip. We save the unified XML to sc2-balancedata-xml and save the converted .json file to sc2-balancedata-json.
Congratulations! 🎉🚀
If you followed along this far, thank you, and congratulations!
You can now apply knowledge learned about Azure Functions for simple file processing operations to many future projects.
We’ve solved problems 2, 3, and 4; now let’s look at problems 1 and 5.
Problem 1: User Uploading XML Files
We need an application the user can access to upload the file.
The complete flow of the application will look like this:
I will be using the Remix framework with Tailwind CSS
There is too much code to put all of it in this article, but I will link to the relevant documentation and include small snippets which you may use to mimic the application.
Creating the Form
Similar to before, we start with the minimal implementation of the most critical components first. We want a single route with a <Form> that allows the user to upload a file.
See the unstable_createFileUploadHandler documentation.
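A minimal sketch of such a route might look like the following (the file location and field name are illustrative assumptions, not the article’s exact source; “files” matches the name read in the action snippet below):
import { Form } from "@remix-run/react"

export default function Index() {
  return (
    <Form method="post" encType="multipart/form-data">
      {/* "files" is the field name the action reads via formData.getAll("files") */}
      <input type="file" name="files" accept=".zip" />
      <button type="submit">Upload</button>
    </Form>
  )
}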
Creating the Action
We will use the @azure/storage-blob package to upload the .zip file to the container.
import { BlobServiceClient } from "@azure/storage-blob"
const blobServiceClient = BlobServiceClient.fromConnectionString(process.env.AZURE_STORAGE_CONNECTION_STRING!)
const zipContainerClient = blobServiceClient.getContainerClient(process.env.AZURE_STORAGE_BLOB_ZIP_CONTAINER_NAME!)
export {
zipContainerClient,
}
When the Form is submitted, we will get the File objects and upload them.
const balanceDataFiles = formData.getAll("files") as File[]
if (balanceDataFiles.length > 0) {
...
for (const balanceDataFile of balanceDataFiles) {
const filename = `balancedata_${Date.now()}.zip`
const fileBuffer = await balanceDataFile.arrayBuffer()
const uploadResponse = await zipContainerClient.uploadBlockBlob(filename, fileBuffer, fileBuffer.byteLength)
...
Now we have the ability to upload a .zip file and have it processed by our function; however, we don’t provide any output to the user, so it is not very useful.
For a good experience we want to give the user feedback about the progress and give them links to the output files so they can download or use them in their applications.
Designing the State Machine
I use the fantastic library XState, and you may view this state machine here
Notice that we simulate the “polling” for the processed blob by first clicking the “blobNotFound” event a few times. Then, when the Azure Function finishes saving the blob, the polling succeeds, and we click “blobFound”, which advances to the terminal/final state “ProcessComplete”
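A minimal sketch of the machine’s core, using only the states and events described above (the real machine has additional states, e.g. for the upload and the expiration case shown later):
import { createMachine } from "xstate"

export const processingMachine = createMachine({
  id: "blobProcessing",
  initial: "PollingForBlob",
  states: {
    PollingForBlob: {
      on: {
        // "blobNotFound" keeps us polling; "blobFound" advances to the final state
        blobNotFound: { target: "PollingForBlob" },
        blobFound: { target: "ProcessComplete" },
      },
    },
    ProcessComplete: { type: "final" },
  },
})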
Now we put that state machine into the Remix application. You will likely want to look at the full source code at the end, but you can view the documentation for using XState with React here
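With @xstate/react, this can be wired into a component roughly like so (a sketch, not the article’s actual code):
import { useMachine } from "@xstate/react"

function ProcessStatus() {
  const [state, send] = useMachine(processingMachine)
  // e.g., call send({ type: "blobFound" }) when a poll request finds the blob
  return <p>Current state: {String(state.value)}</p>
}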
Success Case
You can compare the application below to the sequence diagram above and understand how the process is visualized.
Expiration Case
In the case that the timer expires, the user would see this.
Since we are using a “Consumption” based hosting plan which scales to zero, it can take some time to initialize the resources for the first execution of the function. On this first execution, the timer will often expire before the blob is processed.
Why is there a password input?
Since I planned to host the website publicly, I didn’t want people abusing the application. This is a simple way to prevent unauthorized users from uploading blobs.
Notable Technique
There was also a notable technique I used but didn’t cover above.
Sending Form intent using the query string rather than a form submission value
We wanted to give the user feedback about the process, such as providing the URL to the uploaded file, BEFORE the entire Azure Function completes.
This requires splitting from a single Action, which waits for the full process including the polling operation to finish, into multiple Actions, each returning its own intermediate data. We will have one action that returns immediately after the .zip upload finishes and provides that URL, then invokes another action to start polling for the processed file.
In order to have multiple forms with different inputs, we need a way to tell them apart. As described in the Remix documentation for Handling Multiple Forms, this is usually solved by providing an intent as one of the submitted values:
<button type="submit" name="intent" value="update">
However, there is a problem: this technique requires reading the form data to get the intent, but unstable_parseMultipartFormData is already reading the form data to get the File objects.
const formData = await request.formData();
const intent = formData.get("intent");
This creates an error about trying to read a Stream that has already been read.
Solution: Use query string to send intent instead of element value
This intent value can be read from the request.url property instead of request.formData()
Notice we manually specify upload=true
<Form
method="post"
action="?index&upload=true"
encType="multipart/form-data"
onSubmit={onFormSubmit}
>
See documentation for posting to an index route here
const actionUrl = new URL(request.url)
if (actionUrl.searchParams.has('upload')) {
const formData = await unstable_parseMultipartFormData(...)
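Putting it together, a sketch of the full pattern might look like this (the memory upload handler here is an illustrative assumption; the article’s actual handler arguments are elided above):
import {
  unstable_parseMultipartFormData,
  unstable_createMemoryUploadHandler,
} from "@remix-run/node"

export async function action({ request }: { request: Request }) {
  const actionUrl = new URL(request.url)
  if (actionUrl.searchParams.has("upload")) {
    // Only this branch parses the multipart body. The body is a stream that
    // can be read once, so we never also call request.formData() here.
    const formData = await unstable_parseMultipartFormData(
      request,
      unstable_createMemoryUploadHandler(),
    )
    const files = formData.getAll("files") as File[]
    // ...upload the .zip blobs and return their URLs immediately
  }
  // ...other intents can safely read request.formData() for their values
  return null
}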