<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://www2.sqlblog.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Search results matching tag 'Data Integration'</title><link>http://www2.sqlblog.com/search/SearchResults.aspx?o=DateDescending&amp;tag=Data+Integration&amp;orTags=0</link><description>Search results matching tag 'Data Integration'</description><dc:language>en-US</dc:language><generator>CommunityServer 2.1 SP2 (Build: 61129.1)</generator><item><title>ETL is dead, long live AP2 ?</title><link>http://www2.sqlblog.com/blogs/jamie_thomson/archive/2013/02/15/etl-is-dead-long-live-ap2.aspx</link><pubDate>Fri, 15 Feb 2013 15:45:33 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:47734</guid><dc:creator>jamiet</dc:creator><description>&lt;p&gt;Three days ago I posted &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/2013/02/12/what-would-a-cloud-based-etl-tool-look-like.aspx" target="_blank"&gt;What would a cloud-based ETL tool look like?&lt;/a&gt; where I wondered out loud about the sorts of tools data integration dudes like myself would be using in the future. 
I got some good feedback and already have a list of “stuff” to go and look at including:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;a href="http://www.boomi.com/" target="_blank"&gt;Boomi&lt;/a&gt; – They claim 1 million cloud integrations (whatever one of those is) per day&lt;/li&gt;    &lt;li&gt;&lt;a href="http://aws.amazon.com/datapipeline/" target="_blank"&gt;AWS Data Pipeline&lt;/a&gt; – A web service that incorporates a scheduler, a workflow engine and (as the name suggests) a data pipeline engine&lt;/li&gt;    &lt;li&gt;&lt;a href="http://www.informaticacloud.com/" target="_blank"&gt;Informatica Cloud&lt;/a&gt; – An extension to Informatica’s &lt;a href="http://www.informatica.com/uk/company/news-and-events-calendar/press-releases/10222012-gartner-data-integration-tools-magic-quadrant.aspx" target="_blank"&gt;market-leading&lt;/a&gt; PowerCenter for SalesForce.&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;Most interesting to me though was a link that &lt;a href="https://twitter.com/joeharris76" target="_blank"&gt;Joe Harris&lt;/a&gt; provided to a blog post by Mike Reich entitled &lt;a href="http://seabourneinc.com/2013/02/08/rethinking-etl-for-the-api-age/"&gt;Rethinking ETL for the API age&lt;/a&gt;. Mike outlined a number of points that really struck a chord with me; the key one was his message that the Extract-Transform-Load (ETL) mantra that has been trumpeted for years should be replaced by something that is more pertinent for “the cloud” – Mike offers &lt;strong&gt;Acquiring, Processing and Publishing&lt;/strong&gt; (AP2) as a new acronym (we all love acronyms, right?). 
The idea of &lt;em&gt;publishing&lt;/em&gt; data rather than &lt;em&gt;loading&lt;/em&gt; it really resonated with me as making data easily available in non-proprietary formats so that people can consume it in whatever manner they choose has &lt;a href="http://sqlblog.com/blogs/jamie_thomson/archive/2010/06/03/thinking-differently-about-bi-delivery.aspx" target="_blank"&gt;long been an interest of mine&lt;/a&gt;.&lt;/p&gt;  &lt;p&gt;Here are some other bulleted thoughts that came into my head as I read Mike’s blog post:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;“&lt;strong&gt;Flows are fluid and flexible, unlike structured, point-to-point ‘pipelines’&lt;/strong&gt;” – My interpretation of “fluid and flexible” is that these “flows” can be plugged together to create a greater whole. This gives rise to the notion of &lt;em&gt;&lt;a href="http://en.wikipedia.org/wiki/Composability" target="_blank"&gt;composability&lt;/a&gt;;&lt;/em&gt; imagine being able to leverage flows that other people have constructed in your own flows. &lt;a href="http://pipes.yahoo.com/pipes/" target="_blank"&gt;Yahoo Pipes&lt;/a&gt; (which I first blogged about almost five years ago in &lt;a title="http://consultingblogs.emc.com/jamiethomson/archive/2007/05/07/Taking-Yahoo-Pipes-for-a-test-drive.aspx" href="http://consultingblogs.emc.com/jamiethomson/archive/2007/05/07/Taking-Yahoo-Pipes-for-a-test-drive.aspx" target="_blank"&gt;Taking Yahoo Pipes for a test drive&lt;/a&gt;) was an early incarnation of this notion of composability and is a great demonstrator of what the future holds for us.&lt;/li&gt;    &lt;li&gt;&lt;strong&gt;Composability&lt;/strong&gt; further gives rise to the notion of a marketplace where one could sell “flows”. For example, one could build a flow that aggregated data for a given search term from both Google and Bing, deduplicated the results then made them available as a single feed; expose that feed via a marketplace and charge on a pay-per-use basis. 
It’s a simplistic, contrived example but in my opinion aptly demonstrates the opportunity here. I think data marketplaces, perhaps more pertinently &lt;em&gt;data integration marketplaces&lt;/em&gt;, are going to be huge, I really do. Given the technology-agnostic nature that is being proposed here these marketplaces would be totally interoperable too, unlike the hateful app stores that &lt;a href="http://xkcd.com/1174/" target="_blank"&gt;today’s xkcd expertly satirises&lt;/a&gt;.&lt;/li&gt;    &lt;li&gt;“&lt;em&gt;&lt;strong&gt;by using APIs to move information around, we decouple the data from the underlying technology and vendor&lt;/strong&gt;&lt;/em&gt;” Absolutely true. An API is essentially a well-understood interface/abstraction over a proprietary data store so really there’s nothing new here (isn’t this what &lt;a href="http://en.wikipedia.org/wiki/Service-oriented_architecture" target="_blank"&gt;SOA&lt;/a&gt; was all about?) but there’s no harm in reiterating the point.&lt;/li&gt;    &lt;li&gt;“&lt;em&gt;&lt;strong&gt;information is stored in multiple structures and formats. Any effort to manage information should focus on translating between structures rather than trying to develop a common schema&lt;/strong&gt;&lt;/em&gt;” I worked on a project from 2005-2008 where we attempted to adhere to a supposed &lt;a href="http://ppdm.org/about-ppdm" target="_blank"&gt;industry standard schema&lt;/a&gt;. 
Eventually we realised that those attempts were futile given that no business can be fitted neatly into an industry-standard-shaped-box and that dovetails nicely with Mike’s point here.&lt;/li&gt;    &lt;li&gt;“&lt;strong&gt;There are four common processing tasks; combining multiple streams, translating data formats, QA information, integrate third party processing&lt;/strong&gt;” – I wonder if there is a fifth that we might refer to as data caching; after all, if we’re pulling data out of multiple APIs we are at the mercy of the speed at which those APIs can provide the data – is a person going to be prepared to wait for the data or do we need to regularly cache the transformed data for easy retrieval?&lt;/li&gt;    &lt;li&gt;“&lt;strong&gt;Publishing should be application/technology agnostic&lt;/strong&gt;” It would be hard for me to agree more with this point.&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;As you can tell this is an area that I’m particularly interested in and I shall continue to keep a watching brief on it.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@Jamiet&lt;/a&gt;&lt;/p&gt;</description></item><item><title>What would a cloud-based ETL tool look like?</title><link>http://www2.sqlblog.com/blogs/jamie_thomson/archive/2013/02/12/what-would-a-cloud-based-etl-tool-look-like.aspx</link><pubDate>Tue, 12 Feb 2013 14:15:17 GMT</pubDate><guid isPermaLink="false">21093a07-8b3d-42db-8cbf-3350fcbf5496:47664</guid><dc:creator>jamiet</dc:creator><description>&lt;p&gt;Given that “moving data to the cloud” is, rightly or wrongly, currently in vogue in our industry I have to think that pretty soon there will be a glaring need for tools that help us to move data between these heterogeneous sources – a cloud-based ETL tool for cloud-based data if you will. 
Perhaps such a thing already exists – &lt;a href="http://consultingblogs.emc.com/jamiethomson/archive/2009/07/08/kapow-etl-for-html.aspx" target="_blank"&gt;I’ve talked about Kapow&lt;/a&gt; in the past, which may well be considered a form of cloud ETL tool given that it fetches data from the web – if you know of anything that might fit this very loose description feel free to let me know in the comments.&lt;/p&gt;  &lt;p&gt;I started to ponder what capabilities a cloud ETL tool should have and here’s a quick brainstormed list:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;Data transformation would be done “in the cloud” i.e. I wouldn’t need to own my own hardware in order to run it&lt;/li&gt;    &lt;li&gt;Ability to consume data from/push data to* the following types of data protocols:&lt;/li&gt;    &lt;ul&gt;     &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Atom_(standard)" target="_blank"&gt;ATOM&lt;/a&gt; (application/atom+xml)&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/RSS_(file_format)" target="_blank"&gt;RSS&lt;/a&gt; (application/rss+xml)&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.odata.org/" target="_blank"&gt;OData&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="https://developers.google.com/gdata/" target="_blank"&gt;GData&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.json.org/" target="_blank"&gt;JSON&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/ODBC" target="_blank"&gt;ODBC&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/JDBC" target="_blank"&gt;JDBC&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Resource_Description_Framework" target="_blank"&gt;RDF&lt;/a&gt; (application/rdf+xml)&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/File_Transfer_Protocol" target="_blank"&gt;FTP&lt;/a&gt;&lt;/li&gt;   &lt;/ul&gt;    &lt;li&gt;Ability to consume data from/push data to the following &lt;a 
href="http://en.wikipedia.org/wiki/MIME_type" target="_blank"&gt;MIME types&lt;/a&gt;:&lt;/li&gt;    &lt;ul&gt;     &lt;li&gt;text/html&amp;#160; (gives rise to the idea of screenscraping as a source of data)&lt;/li&gt;      &lt;li&gt;text/plain&lt;/li&gt;      &lt;li&gt;text/xml&lt;/li&gt;   &lt;/ul&gt;    &lt;li&gt;Adapters (possibly with a plug-in model) for the following cloud storage and API providers:&lt;/li&gt;    &lt;ul&gt;     &lt;li&gt;&lt;a href="http://aws.amazon.com/s3/" target="_blank"&gt;Amazon S3&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Amazon_SimpleDB" target="_blank"&gt;Amazon SimpleDB&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://aws.amazon.com/rds/" target="_blank"&gt;Amazon RDS&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://aws.amazon.com/redshift/" target="_blank"&gt;Amazon RedShift&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-gb/library/windowsazure/dd179423.aspx" target="_blank"&gt;Azure Tables&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/windowsazure/dd135733.aspx" target="_blank"&gt;Azure BLOBs&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/windowsazure/dd179363.aspx" target="_blank"&gt;Azure Queues&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.windowsazure.com/en-us/home/features/messaging/" target="_blank"&gt;Azure Service Bus&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/windowsazure/ee336279.aspx" target="_blank"&gt;Azure SQL Database&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://datamarket.azure.com/" target="_blank"&gt;Azure Datamarket&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/BigTable" target="_blank"&gt;BigTable&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://wiki.developerforce.com/page/REST_API" target="_blank"&gt;Salesforce API&lt;/a&gt;&lt;/li&gt;      
&lt;li&gt;&lt;a href="http://database.com/en/howitworks/open" target="_blank"&gt;Database.com (from Salesforce)&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://datasift.com/" target="_blank"&gt;Datasift&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.guardian.co.uk/data" target="_blank"&gt;Guardian Datastore&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Hadoop" target="_blank"&gt;Hadoop&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://office365.com/" target="_blank"&gt;Office 365&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Apache_Cassandra" target="_blank"&gt;Cassandra&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.workday.com/" target="_blank"&gt;WorkDay&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="https://www.dropbox.com/" target="_blank"&gt;DropBox&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.skydrive.com" target="_blank"&gt;SkyDrive&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://twitter.com" target="_blank"&gt;Twitter&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="https://developers.facebook.com/docs/reference/api/" target="_blank"&gt;Facebook Graph API&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.concur.com/" target="_blank"&gt;Concur&lt;/a&gt;&lt;/li&gt;      &lt;li&gt;&lt;a href="http://www.programmableweb.com/apis/directory" target="_blank"&gt;thousands more...&lt;/a&gt; &lt;/li&gt;   &lt;/ul&gt;    &lt;li&gt;Job scheduler&lt;/li&gt;    &lt;li&gt;An &lt;a href="http://en.wikipedia.org/wiki/Integrated_development_environment" target="_blank"&gt;IDE&lt;/a&gt; (open to debate whether the IDE should be “in the cloud” as well)&lt;/li&gt;    &lt;li&gt;Ability to carry out common transformations (join, aggregate, sort, &lt;a href="http://en.wikipedia.org/wiki/Projection_(relational_algebra)" target="_blank"&gt;projection&lt;/a&gt;) on those heterogeneous data sources&lt;/li&gt;    &lt;li&gt;Ability to authenticate using 
different authentication mechanisms&lt;/li&gt;    &lt;li&gt;Configurable logging&lt;/li&gt;    &lt;li&gt;Ability to publish transformed data in a manner that makes it consumable rather than insert it into another data store&lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;Any thoughts here? As I said this is a brainstormed list so I don’t mind being told that I am approaching this from the wrong angle or even that I’m completely wrong. Should I be concentrating on scenarios rather than technologies? I’m only too aware that given my ETL heritage my brain is already wired to consider how traditional ETL tools could be transplanted into the cloud (my mention of a job scheduler sold me out there) – perhaps that is completely wrong too and my heritage is actually a disadvantage here.&lt;/p&gt;  &lt;p&gt;I’m interested to know what people think and hopefully trigger a conversation. I’m especially keen to hear about scenarios that you might have where you need to move and transform data that lives “in the cloud”.&lt;/p&gt;  &lt;p&gt;&lt;a href="http://twitter.com/jamiet" target="_blank"&gt;@Jamiet&lt;/a&gt;&amp;#160;&lt;/p&gt;  &lt;p&gt;*where applicable&lt;/p&gt;  &lt;p&gt;UPDATE. Within seconds of publishing this post I’d already &lt;a href="https://twitter.com/BrentO/status/301334110612373505" target="_blank"&gt;been alerted&lt;/a&gt; to &lt;a href="http://www.informaticacloud.com/" target="_blank"&gt;InformaticaCloud.com&lt;/a&gt; &amp;amp; &lt;a href="http://aws.amazon.com/datapipeline/" target="_blank"&gt;AWS Data Pipeline&lt;/a&gt;. 
Checking those out now!&lt;/p&gt;  &lt;p&gt;I’ve also been &lt;a href="https://twitter.com/joeharris76/status/301346114240647169" target="_blank"&gt;advised&lt;/a&gt; to check out the following articles:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;a href="http://seabourneinc.com/2013/02/08/rethinking-etl-for-the-api-age/"&gt;Rethinking ETL for the API age&lt;/a&gt;&lt;/li&gt;    &lt;li&gt;&lt;a href="http://www.apievangelist.com/2013/02/10/bringing-etl-to-the-masses-with-apis/"&gt;Bringing ETL to the Masses with APIs&lt;/a&gt;&lt;/li&gt; &lt;/ul&gt;</description></item></channel></rss>