Given that “moving data to the cloud” is, rightly or wrongly, currently in vogue in our industry, I have to think that pretty soon there will be a glaring need for tools that help us move data between these heterogeneous sources: a cloud-based ETL tool for cloud-based data, if you will. Perhaps such a thing already exists. I’ve talked about Kapow in the past, which may well be considered a form of cloud ETL tool given that it fetches data from the web. If you know of anything that might fit this very loose description, feel free to let me know in the comments.
I started to ponder what capabilities a cloud ETL tool should have and here’s a quick brainstormed list:
- Data transformation would be done “in the cloud” i.e. I wouldn’t need to own my own hardware in order to run it
- Ability to consume data from/push data to* the following types of data protocols:
- Ability to consume data from/push data to the following MIME types:
- text/html (gives rise to the idea of screenscraping as a source of data)
- Adapters (possibly with a plug-in model) for the following cloud storage and API providers:
- Job scheduler
- Workflow (e.g. do this, then do that; do these things in parallel; only do this if some condition is true; restart from here in case of failure)
- An IDE (open to debate whether the IDE should be “in the cloud” as well)
- Ability to carry out common transformations (join, aggregate, sort, projection) on those heterogeneous data sources
- Ability to authenticate using different authentication mechanisms
- Configurable logging
- Ability to publish transformed data in a manner that makes it consumable rather than insert it into another data store
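To make the "common transformations on heterogeneous data sources" bullet a little more concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not how any particular cloud ETL product works: the sample data, the choice of CSV and JSON as the two source formats, and every function name (`read_customers`, `join`, `aggregate`, `project`, `run_pipeline`) are hypothetical and invented for this example.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample data standing in for two differently formatted sources.
CUSTOMERS_CSV = """customer_id,name,country
1,Alice,UK
2,Bob,US
3,Carol,UK
"""

ORDERS_JSON = """[
  {"order_id": 10, "customer_id": 1, "amount": 25.0},
  {"order_id": 11, "customer_id": 2, "amount": 40.0},
  {"order_id": 12, "customer_id": 1, "amount": 15.0},
  {"order_id": 13, "customer_id": 3, "amount": 5.0}
]"""


def read_customers(text):
    """Parse CSV text into a list of dicts, converting the key to an int."""
    return [
        {"customer_id": int(row["customer_id"]),
         "name": row["name"],
         "country": row["country"]}
        for row in csv.DictReader(io.StringIO(text))
    ]


def read_orders(text):
    """Parse JSON text into a list of dicts."""
    return json.loads(text)


def join(left, right, key):
    """Inner join two lists of dicts on a shared key."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def project(rows, columns):
    """Keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


def aggregate(rows, group_key, value_key):
    """Sum value_key per distinct group_key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return [{group_key: k, "total": v} for k, v in totals.items()]


def run_pipeline():
    """Join orders to customers, project, aggregate by country, sort descending."""
    customers = read_customers(CUSTOMERS_CSV)
    orders = read_orders(ORDERS_JSON)
    joined = join(orders, customers, "customer_id")
    slim = project(joined, ["country", "amount"])
    totals = aggregate(slim, "country", "amount")
    return sorted(totals, key=lambda r: r["total"], reverse=True)


if __name__ == "__main__":
    for row in run_pipeline():
        print(row)
```

The point of the sketch is that once each source has been normalised into the same in-memory shape (here, a list of dicts), join, projection, aggregation and sort no longer care where the data came from. In a real tool the two `read_*` functions would presumably be replaced by the protocol and provider adapters from the list above.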
Any thoughts here? As I said, this is a brainstormed list, so I don’t mind being told that I am approaching this from the wrong angle or even that I’m completely wrong. Should I be concentrating on scenarios rather than technologies? I’m only too aware that, given my ETL heritage, my brain is already wired to consider how traditional ETL tools could be transplanted into the cloud (my mention of a job scheduler sold me out there). Perhaps that is completely wrong too, and my heritage is actually a disadvantage here.
I’m interested to know what people think and hopefully trigger a conversation. I’m especially keen to hear about scenarios that you might have where you need to move and transform data that lives “in the cloud”.
UPDATE. Within seconds of publishing this post I’d already been alerted to InformaticaCloud.com & AWS Data Pipeline. Checking those out now!
It’s been recommended that I check out the following articles: