Written by Sabr Research · April 2025
In today’s rapidly evolving financial landscape, the ability to efficiently collect, validate, and prepare data is paramount for maintaining a competitive edge in the quantitative finance industry. At Sabr Research, we’ve built a robust financial data collection and processing infrastructure that orchestrates data ingestion from diverse sources, implements rigorous data quality checks, and prepares data for downstream machine learning models.
When it comes to building data infrastructure, starting strong is non-negotiable. Scalability and robustness aren’t features you bolt on later; they’re principles that should guide design from the very beginning. By prioritizing scalable architecture from day one, you lay the groundwork for effortless expansion: integrating new data sources, introducing additional validation checks, or layering in new analytical modules all fit seamlessly onto a strong foundation. This is especially critical in fast-moving environments. Too often, short-term optimizations lead to long-term technical debt; quick fixes may feel efficient, but they come at the expense of flexibility down the road. We’ve chosen to build with the long view in mind, investing in clean architecture and thoughtful design now so we’re not stuck paying down a costly mess later.
We use the AWS CDK (Cloud Development Kit) to define our internal infrastructure as code, not only to foster better collaboration within our team, but also to keep our systems organized and easy to grow. Creating a stack is simple, as shown here:
import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class DataPipelineStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // ... resource definitions go here
  }
}
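A minimal entry point then instantiates the stack so it can be synthesized and deployed with the CDK CLI. The file path and stack id below are illustrative rather than our exact layout:
// bin/app.ts (illustrative): wire the stack into a CDK app
import * as cdk from 'aws-cdk-lib';
import { DataPipelineStack } from '../lib/data-pipeline-stack';

const app = new cdk.App();
new DataPipelineStack(app, 'DataPipelineStack');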
Resources can be created through helper functions that keep the stack modular and easy to update and maintain. For example, many standard market data feeds, such as end-of-day (EOD) prices and the latest market news, can be collected independently and in parallel by simple Lambda functions. These functions can be created consistently with a helper like the one below (a usage example follows the helper):
// Defined inside the stack constructor, so `this` refers to the stack;
// Duration is imported from 'aws-cdk-lib'.
const makeLambda = (
  logicalId: string,
  name: string,
  cmd: string[],
  timeout: Duration,
  memorySize: number,
  envVars: { [key: string]: string }
) => {
  return new lambda.DockerImageFunction(this, logicalId, {
    functionName: name,
    // Build the container image from the repository root, overriding CMD per function
    code: lambda.DockerImageCode.fromImageAsset('../', {
      cmd: cmd,
    }),
    timeout: timeout,
    memorySize: memorySize,
    // Per-function environment variables (e.g. BUCKET_NAME) are supplied by the caller
    environment: {
      ...envVars,
    },
  });
};
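For instance, the EOD price and market news collectors can then be created with two short calls. The function names, container commands, timeouts, and the rawDataBucket variable below are illustrative placeholders rather than our production values:
// Illustrative usage of the helper: one collector per data source
const eodPricesLambda = makeLambda(
  'EodPricesCollector',
  'eod-prices-collector',
  ['eod_prices.handler'], // illustrative container command
  Duration.minutes(5),
  1024,
  { BUCKET_NAME: rawDataBucket.bucketName } // rawDataBucket: an S3 bucket assumed to be defined elsewhere in the stack
);

const marketNewsLambda = makeLambda(
  'MarketNewsCollector',
  'market-news-collector',
  ['market_news.handler'],
  Duration.minutes(5),
  1024,
  { BUCKET_NAME: rawDataBucket.bucketName }
);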
Our data pipeline chains multiple stages and logical steps into a single flow, orchestrated with AWS Step Functions: ingestion runs in parallel across sources, the results are aggregated, and a quality check determines whether the pipeline reports success or failure.
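The individual tasks wired into this flow are Step Functions LambdaInvoke steps that wrap the collector and processing Lambdas. The sketch below shows how they might be defined; the task names match the flow that follows, but the underlying Lambda variables (the collectors above plus a qualityCheckLambda and notificationLambda) are assumptions rather than our exact setup:
// stepfunctions and tasks come from 'aws-cdk-lib/aws-stepfunctions' and
// 'aws-cdk-lib/aws-stepfunctions-tasks'; the Lambda variables are assumed to be
// functions created with makeLambda elsewhere in the stack.
const dataIngestionTask1 = new tasks.LambdaInvoke(this, 'EodPricesIngestion', {
  lambdaFunction: eodPricesLambda,
});
const dataIngestionTask2 = new tasks.LambdaInvoke(this, 'MarketNewsIngestion', {
  lambdaFunction: marketNewsLambda,
});
const qualityCheckTask = new tasks.LambdaInvoke(this, 'QualityCheckTask', {
  lambdaFunction: qualityCheckLambda,
});
const notifySuccessTask = new tasks.LambdaInvoke(this, 'NotifySuccess', {
  lambdaFunction: notificationLambda,
});
const notifyFailureTask = new tasks.LambdaInvoke(this, 'NotifyFailure', {
  lambdaFunction: notificationLambda,
});

With the tasks in place, the flow itself is assembled from a parallel ingestion block, an aggregation step, and a quality gate: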
// Ingest from multiple sources in parallel
const parallelIngestion = new stepfunctions.Parallel(this, 'ParallelIngestion')
  .branch(dataIngestionTask1)
  .branch(dataIngestionTask2);

// Aggregate the parallel results into a single payload
const aggregationStep = new stepfunctions.Pass(this, 'AggregateResults')
  .next(new tasks.LambdaInvoke(this, 'DataAggregationTask', {
    lambdaFunction: aggregationLambda, // assumed: another function created with makeLambda
  }));

// Run quality checks and branch on the result
const dataProcessingFlow = aggregationStep
  .next(qualityCheckTask)
  .next(new stepfunctions.Choice(this, 'CheckQuality')
    .when(stepfunctions.Condition.stringEquals('$.Payload.quality', 'good'), notifySuccessTask)
    .otherwise(notifyFailureTask));

const definition = parallelIngestion
  .next(dataProcessingFlow);

const stateMachine = new stepfunctions.StateMachine(this, 'DataPipelineStateMachine', {
  definition,
  timeout: Duration.minutes(timeout), // overall pipeline timeout, defined in configuration
});
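The quality gate works because the quality-check Lambda returns a payload whose quality field the Choice state inspects. A minimal handler sketch is shown below; the input shape and thresholds are simplified placeholders rather than our production rules, and a Node.js runtime is assumed:
// Sketch of a quality-check handler: the Choice state branches on Payload.quality
interface QualityCheckInput {
  rowCount: number;
  nullFraction: number;
}

export const handler = async (event: QualityCheckInput) => {
  // Placeholder thresholds for illustration only
  const hasEnoughRows = event.rowCount > 0;
  const acceptableNulls = event.nullFraction < 0.05;

  return {
    quality: hasEnoughRows && acceptableNulls ? 'good' : 'bad',
    rowCount: event.rowCount,
    nullFraction: event.nullFraction,
  };
};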
This infrastructure-as-code approach allows for rapid deployment, version control, and consistent environments, ultimately reducing the risk of errors and improving overall system reliability. Data cleaning, which covers issues such as data organization, quality, metric definitions, and feature engineering, is equally crucial for ensuring the reliability of the data used in analysis. This is why we maintain a single common processing layer that harmonizes these steps and enforces alignment across different use cases.
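As an illustration of what such a layer can look like (a simplified sketch, not our actual implementation), every dataset passes through the same ordered set of named steps, so cleaning rules and feature definitions live in one place:
// Simplified sketch of a common processing layer: every dataset flows through
// the same ordered steps, so cleaning and feature logic is defined once.
type DataRow = { [field: string]: number | string | null };
type ProcessingStep = (rows: DataRow[]) => DataRow[];

// One shared step used by every downstream consumer
const dropIncompleteRows: ProcessingStep = (rows) =>
  rows.filter((row) => Object.values(row).every((value) => value !== null));

const commonPipeline: ProcessingStep[] = [
  dropIncompleteRows,
  // ...further steps: deduplication, metric definitions, feature engineering
];

const runPipeline = (rows: DataRow[]): DataRow[] =>
  commonPipeline.reduce((data, step) => step(data), rows);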