Enabling Self-Service and Advanced Analytics

Transport for NSW - Customer Success Story

Data-Driven helped TfNSW build the data foundation for an Operational Data Lake with self-service capability to enable advanced analytics previously not possible.

Transport for NSW (TfNSW) manages one of the largest fleets of vehicles in Australia, including Buses, Ferries, Light Rail, Trains and Metro. The real-time data generated by these vehicles provides valuable analytical opportunities to improve transport services by measuring service performance and optimising routes. However, the sheer volume of data presents data management challenges.

Data-Driven partnered with TfNSW to deliver the Operational Data Lake (ODL) platform on Azure to enable self-service and advanced analytics capabilities previously not possible. Data-Driven’s ODL included CloudMonitor as a tool to reduce and manage consumption costs.

“TfNSW needed a solution to capture real-time data for every vehicle in motion across the state. This solution just gives us that so that we mine nuggets from this data.”

NSW Transport Challenges

Historical operational data too large and costly to store efficiently and analyse

Historical GTFS (General Transit Feed Specification) data has always been too large and costly to store and analyse efficiently. Each TfNSW vehicle sends its location every 10 seconds, which results in a huge stockpile of data containing valuable insights.
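To give a feel for the scale, the 10-second reporting interval can be turned into a rough daily message count. The sketch below is illustrative only: the fleet size of 5,000 is a hypothetical figure (the actual TfNSW fleet size is not stated here), and it assumes every vehicle reports around the clock, so the result is an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60
REPORT_INTERVAL_S = 10  # each vehicle sends its location every 10 seconds


def pings_per_vehicle_per_day(interval_s: int = REPORT_INTERVAL_S) -> int:
    """Location messages one vehicle sends in a day if it reports continuously."""
    return SECONDS_PER_DAY // interval_s


def pings_per_day(fleet_size: int, interval_s: int = REPORT_INTERVAL_S) -> int:
    """Upper bound on daily location messages for a whole fleet."""
    return fleet_size * pings_per_vehicle_per_day(interval_s)


HYPOTHETICAL_FLEET_SIZE = 5_000  # assumption for illustration, not a TfNSW figure

print(pings_per_vehicle_per_day())             # 8640
print(pings_per_day(HYPOTHETICAL_FLEET_SIZE))  # 43200000
```

Even under these simplified assumptions, a single day produces tens of millions of records, which is why storage cost and file management dominate the design.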

In the past, this data was not stored, which made insights from past vehicle trips almost impossible to obtain and ruled out predicting trip delays or optimising trip routes. It also made it difficult to report on and analyse service performance, and limited the ability to improve customer services.

Key Operational Data-related challenges to overcome:

[Image: infrastructure]

The Solution

Building an Operational Data Lake capable of ingesting and storing infinite data and allowing self-service analytics

The Data Foundation solution by Data-Driven was the perfect starting point as it was designed to be highly scalable and extensible whilst providing an analytics platform for citizen data scientists and business users. The DevOps-first mindset meant it was easy to extend for new use-cases.

This met TfNSW’s vision to build the Operational Data Lake (ODL): a unified, next-generation data and analytics platform that leverages native Azure services to enable the continuous collection and curation of diverse transport operational data sets. The platform allows self-service analytics and machine learning to uncover further insights that improve transport customer services.

The ODL platform service offerings include, but are not limited to:

  1. continuous collection/curation of diverse Transport operational data sets,
  2. data management,
  3. self-service analytics, and
  4. platform services for advanced analytics, e.g. AI/ML/DS.

The requirements for the GTFS data self-service analytics were:

  • Allow both internal and external users to access historical GTFS data sets
  • Ability for TfNSW data scientists to perform advanced analytics and run machine learning experiments in a cost-efficient manner on operational data
  • The data platform must allow data discovery and have built-in monitoring and cost management tools
  • Ensure data privacy and security, with robust governance controls supported by the platform
  • Ability to deliver insights from these systems to the organization in an automated, interactive, and near real-time manner

Key Outcomes for the Business

ODL Data Foundation ready for Advanced Analytics

The Operational Data Lake Data Foundation solution by Data-Driven was the perfect starting point for TfNSW as it was designed to be highly scalable and extensible whilst providing a platform for citizen data scientists and business users to perform advanced analytics.

Operational Data Lake

Modern data platform capable of ingesting and storing infinite operational data in real time or batch in a well-governed, secure and cost-effective manner

Infinite Cost Efficient Storage

Real-time positions and telemetry for every TfNSW transport mode are now tracked and stored for analysis

Self-Service Analytics

Internal and public users can perform self-service analytics and data sharing on these operational data sets

AI & ML Readiness

TfNSW data scientists and analysts can perform advanced analytics and machine learning on operational data

The true business value will grow even greater in the future as machine learning and analysis are performed on the data. Here are some examples of insights that are now possible with the right data:

Daniel Yu
Lead Architect for TfNSW Operational Data Lake
“With the ODL platform we are able to ingest and process 500GB, millions of various data files a day, in real time and batch efficiently, which is unprecedented in NSW Transport. The ODL is a great example of building a next-generation Cloud-based data and analytics platform using native Azure services. We can deliver what we had in mind with the Azure ODL because it is a flexible, rich in services and features, high-performant and easily extensible.”

The Technical Solution

Operational Data Lake built on Microsoft Azure + Databricks

The architecture uses native Microsoft Azure technologies to reduce the learning curve for operators of the data platform, which also makes it easy to extend for new use-cases as they come onboard. Azure Data Factory and Azure Functions are used to ingest data, with the choice depending on the data access method and the frequency of incoming data. Storage and data life cycle management are handled by Azure Data Lake Storage Gen2 components.
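Data life cycle management generally means moving older data to cheaper storage tiers as it is accessed less often. In Azure this is normally declared as a policy on the storage account rather than written as code, so the Python below is only a minimal sketch of the underlying rule. The thresholds (30 and 180 days) and the function name are hypothetical; TfNSW's actual policy is not described here.

```python
from datetime import date

# Hypothetical thresholds for illustration; not TfNSW's actual policy.
COOL_AFTER_DAYS = 30
ARCHIVE_AFTER_DAYS = 180


def tier_for(last_modified: date, today: date) -> str:
    """Pick a storage tier from the age of a file, oldest rule first."""
    age_days = (today - last_modified).days
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    if age_days >= COOL_AFTER_DAYS:
        return "cool"
    return "hot"


today = date(2024, 7, 1)
print(tier_for(date(2024, 6, 25), today))  # hot (6 days old)
print(tier_for(date(2024, 5, 1), today))   # cool (61 days old)
print(tier_for(date(2023, 12, 1), today))  # archive (213 days old)
```

Recent vehicle data stays on the fast tier for "hot" analytics, while older history remains available at a lower storage cost.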

Azure Databricks is the compute engine used to transform millions of IoT files into usable big data within the cost and performance constraints. The Unified Data Platform workspaces of Databricks were the perfect solution for the self-service capability, allowing internal data analysts to explore data and run experiments in a secure and controlled manner. Governance around cluster use, data access, data management and security is handled by Azure Databricks RBAC controls to ensure users see only the data they are meant to see.

A key technical challenge to overcome was storing the millions of JSON files sent each day by the IoT devices on each vehicle. Delta Lake was used to process the raw operational data and to provide data integrity, ACID transactions and data versioning, adding a governance layer to the Data Lake. The use of Delta Lake and Databricks allows “Hot” analytics to be performed on the Delta Lake with a rapid response time across vast amounts of data.
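The small-files problem described above is usually tackled by compacting many tiny messages into fewer, larger, partitioned batches. The sketch below shows that idea in plain Python and is not Delta Lake itself, nor the ODL's actual pipeline. The message fields (`vehicle_id`, `mode`, `timestamp`, `lat`, `lon`) and the partitioning by mode and date are assumptions made for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(message: dict) -> tuple:
    """Return (mode, UTC date) for a message; field names are hypothetical."""
    ts = datetime.fromtimestamp(message["timestamp"], tz=timezone.utc)
    return message["mode"], ts.strftime("%Y-%m-%d")


def compact(raw_json_docs: list) -> dict:
    """Group small JSON documents into one newline-delimited batch per partition."""
    batches = defaultdict(list)
    for doc in raw_json_docs:
        message = json.loads(doc)
        batches[partition_key(message)].append(json.dumps(message, sort_keys=True))
    return {key: "\n".join(lines) for key, lines in batches.items()}


# Three tiny "files", as they might arrive from vehicles (made-up values).
raw = [
    '{"vehicle_id": "bus-56", "mode": "bus", "timestamp": 0, "lat": -33.87, "lon": 151.2}',
    '{"vehicle_id": "bus-56", "mode": "bus", "timestamp": 10, "lat": -33.871, "lon": 151.201}',
    '{"vehicle_id": "ferry-1", "mode": "ferry", "timestamp": 5, "lat": -33.85, "lon": 151.21}',
]

compacted = compact(raw)
for key, batch in sorted(compacted.items()):
    print(key, len(batch.splitlines()), "records")
```

Here three input documents become two batches, one per mode and day. A table format such as Delta Lake adds transactional guarantees and versioning on top of this kind of layout.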

About Transport for NSW

Transport for NSW is a government-run enterprise responsible for the delivery and development of safe, integrated and efficient transport systems for the people of NSW, including transport planning, strategy, policy, procurement and other non-service delivery functions across all modes of transport: Buses, Ferries, Light Rail, Trains and Metro.

NSW Transport works hand-in-hand with operating agencies, private operators and industry partners to deliver customer-focused services and projects in order to make NSW a better place to live, work and visit.
