6/21/2023

Automated ETL Processes

As the Western Pennsylvania Regional Data Center grows in its number of publishers, and as those publishers release more datasets, having a system to efficiently and automatically load those datasets into the Regional Data Center's open data portal has become essential. Automating the publication of data provides our users with routine, predictable data updates, and brings efficiencies to the publishing process.

In the early days of the Regional Data Center, our automated publishing processes were highly custom. For our first few automated processes, we would write an Extract, Transform, Load (ETL) script from scratch. ETL processes do three things: extract the raw data from the original source, transform it into a more useful format, and load it into the open data portal. We quickly learned that this ad-hoc approach was labor intensive. Since the bulk of these ETL processes are similar across datasets, we began to think about how to bring greater efficiency to our process. Reusing code, instead of rewriting it, is not only more efficient, it also increases the reliability of our ETL processes.

We started by writing code to streamline the ETL process, beginning with loading data to the open data portal (the "L" in ETL). The Regional Data Center uses the open-source CKAN software for its open data portal. I began improving our ETL process, the load part to be precise, by building a small library of code to handle the loading step for each dataset. However, we were still left with two-thirds of the process (the "E" and "T" in ETL) being rewritten each time. This led us to consider building an all-encompassing ETL framework.

We were fortunate to hire Ben Smithgall, a 2015 Pittsburgh Code for America fellow, as a consultant to help us think through our ETL philosophy and to pair with me on code to support our ETL framework. Ben's contributions were vital in helping us both understand our needs and write code for our reusable framework. The basic requirements we identified at the outset of the process include:

- It needed to run jobs automatically and with arbitrary frequencies.
- It needed to log metadata for each job.
- It needed to provide notifications when a process did not run properly.
- It needed to cut down the work necessary to set up a new ETL job.

Ben got back to us with an extremely helpful white paper describing how we should think about our ETL framework. The white paper provided an outstanding foundation, and he and I then developed the code used in our work. As a younger developer, it was valuable for me to work directly with Ben and draw on his years of experience.

While you can read Ben's white paper for details on our approach, and the ReadTheDocs page for the nuts and bolts, I'll give a brief, high-level run-through of our code below.
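To ground the discussion, here is a minimal sketch of what one of those early, ad-hoc scripts might look like. Everything in it (the source URL, the field names, and the stubbed load step) is hypothetical, for illustration only; it is not the code we actually ran.

```python
import csv
import io

import requests

# Hypothetical source URL and field names, for illustration only.
SOURCE_URL = "https://example.org/source-data.csv"

def extract(url):
    """Extract: pull the raw CSV from the publisher's server."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    return csv.DictReader(io.StringIO(response.text))

def transform(rows):
    """Transform: normalize fields and drop incomplete records."""
    for row in rows:
        if not row.get("id"):
            continue  # skip rows missing a primary key
        yield {"id": row["id"].strip(), "name": row.get("name", "").strip()}

def load(records):
    """Load: hand cleaned records to the portal (stubbed with print here)."""
    for record in records:
        print(record)

if __name__ == "__main__":
    load(transform(extract(SOURCE_URL)))
```

Every new dataset meant writing all three of these steps again, which is exactly the duplication the framework set out to eliminate.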
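Next, the "L" step. The sketch below uses the third-party `ckanapi` client and CKAN's DataStore `datastore_upsert` action to show the kind of loading code a small shared library can centralize; the portal URL, API key, and resource ID are placeholders, and this is not our library's actual interface.

```python
from ckanapi import RemoteCKAN  # third-party client: pip install ckanapi

# Placeholders: supply your portal URL, API key, and DataStore resource ID.
ckan = RemoteCKAN("https://data.wprdc.org", apikey="YOUR-API-KEY")

def load_records(resource_id, records):
    """Upsert a batch of records into a CKAN DataStore resource.

    'upsert' inserts new rows and updates existing ones, which requires
    a primary key to be defined on the DataStore resource.
    """
    ckan.action.datastore_upsert(
        resource_id=resource_id,
        records=records,
        method="upsert",
        force=True,  # permit writes even if the resource is marked read-only
    )

load_records("RESOURCE-ID", [{"id": "1", "name": "Example Row"}])
```

Centralizing the upsert call means every dataset's script shares one tested code path for loading, rather than each script talking to the portal its own way.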
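Finally, the framework itself. The sketch below is purely illustrative: the `Job` class, the `run_forever` loop, and all their names are inventions for this post, not the framework's real API (see the white paper and the ReadTheDocs page for that). It shows one way the four requirements above can map onto code.

```python
import logging
import time
import traceback
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)

class Job:
    """Illustrative wrapper tying one pipeline to the four requirements."""

    def __init__(self, name, pipeline, every_seconds):
        # Setting up a new job is one constructor call (requirement 4).
        self.name = name
        self.pipeline = pipeline            # callable performing E, T, and L
        self.every_seconds = every_seconds  # arbitrary frequency (requirement 1)

    def run(self):
        started = datetime.now(timezone.utc)
        try:
            record_count = self.pipeline()
            # Log metadata for each run (requirement 2).
            logging.info("%s ok: %s records, started %s",
                         self.name, record_count, started.isoformat())
        except Exception:
            # Notify when a process fails (requirement 3); a real framework
            # might send email or a chat message instead of logging an error.
            logging.error("%s failed:\n%s", self.name, traceback.format_exc())

def run_forever(jobs):
    """Naive in-process scheduler; production systems often use cron instead."""
    next_run = {job.name: 0.0 for job in jobs}
    while True:
        now = time.time()
        for job in jobs:
            if now >= next_run[job.name]:
                job.run()
                next_run[job.name] = now + job.every_seconds
        time.sleep(1)
```

A real deployment would replace the sleep loop with a proper scheduler and persist the per-run metadata somewhere queryable, but the shape is the same: one reusable harness, with only the pipeline callable written fresh per dataset.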