{"componentChunkName":"component---src-pages-author-author-yaml-id-js","path":"/author/abhilash-k-r/","result":{"data":{"allMarkdownRemark":{"edges":[{"node":{"id":"14efec93-f52d-5d6c-9d1f-61f22d4eb90d","html":"<p>This guide introduces Apache Beam and its fundamental concepts. We begin with the features and advantages of Apache Beam, and then cover its basic concepts and terminology.</p>\n<p>Ever since big data was introduced to the programming world, many different technologies and frameworks have emerged. Data processing falls into two paradigms: batch processing and stream processing.</p>\n<p>Different technologies emerged for each paradigm, each solving particular big data problems: for example, Apache Spark, Apache Flink, and Apache Storm.</p>\n<p>For a developer or a business, maintaining several tech stacks is always challenging. Hence, Apache Beam to the rescue!</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 768px; \"\n    >\n      <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 49.84615384615385%; position: relative; bottom: 0; left: 0; background-image: 
url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAKCAYAAAC0VX7mAAAACXBIWXMAAAsTAAALEwEAmpwYAAACHklEQVQoz2WSW2/TQBCF/VeReOYnIPHEMxL0pVJBkaCiSCBVqSiI0obSplUhbZM0F5qEksS5NY4vseO1vfbHOqagiJFWe5kzc3b3HC12Z4jrE8J+jWjcJhw2SYAkSUjimDTSvZSSWO3TcxknyxnSvJ9hjHcE+w/Q4sAjGDQJVbPotktk9FH4ZZM0HNNgYU2zolj+Icmypl7j+v0XWnuX9PKb6M8fobES0b+l6xJPhpTXcuiFwgrqvCsJUrRep5IrU3nVoLrZovL6Ek0sbBoXH7gq7VIsPuG6U+H8/Ai9VsQ6/kxUvVg2kbNb/FIR7BkPt4Y825/ysdKi16sj5z/xzRae2USzzTGlozd8/7pFu12nPxtyXN/jpLWP4Y1xwwRbqJuHAcbGU+Zrj7k6LbN1afLisE9v6jAY+1Tb4ZJYsy2T8WjAwvNWnlU92+bsIMdxYRPPzz5ehpJBp4lo7qgvaTDs99jJ53m7vceNnmG0KIoUuSCKQiVGpurCc9G7LSajXwT+/C+JUOnclUnvpkLn9CW1coF8YRdh1RRBE0PvpqLc6XlnhQQhFpizKYEQGcHCZTLWcWyTSEHmc5OjTxsc7K5zcbiO+Haf+ek9+qX/VF6NO6ogCLDMmWqsLCZ8jOkto5GOaVkkUihRfiCdhrJkT/nQmyBdHakEkN6I2DeyISziwFnOaRGhkxoRGQl8Z4zn2oSewoULktAniQKFn/MbmpDqVNye7XEAAAAASUVORK5CYII='); background-size: cover; display: block;\"\n  ></span>\n  <img\n        class=\"gatsby-resp-image-image\"\n        alt=\"Timeline of Big Data Frameworks\"\n        title=\"Timeline of Big Data Frameworks\"\n        src=\"/static/1e7f65aaa920379cd8b429236ecf5cb7/e5715/timeline-bigdata-frameworks.png\"\n        srcset=\"/static/1e7f65aaa920379cd8b429236ecf5cb7/a6d36/timeline-bigdata-frameworks.png 650w,\n/static/1e7f65aaa920379cd8b429236ecf5cb7/e5715/timeline-bigdata-frameworks.png 768w,\n/static/1e7f65aaa920379cd8b429236ecf5cb7/07a9c/timeline-bigdata-frameworks.png 1440w\"\n        sizes=\"(max-width: 768px) 100vw, 768px\"\n        style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n        loading=\"lazy\"\n      />\n    </span></p>\n<h2 id=\"what-is-apache-beam\" style=\"position:relative;\"><a href=\"#what-is-apache-beam\" aria-label=\"what is apache beam permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 
3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>What is Apache Beam?</h2>\n<p>Apache Beam is an open-source, unified model for defining parallel data processing pipelines for both batch and streaming data. The Beam programming model simplifies the mechanics of large-scale data processing.</p>\n<p>The Beam model offers abstractions that insulate you from the low-level details of distributed processing, such as coordinating individual workers and sharding datasets. These details are handled entirely by the runner that executes the pipeline (for example, Dataflow).</p>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 768px; \"\n    >\n      <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 76.46153846153847%; position: relative; bottom: 0; left: 0; background-image: 
url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAPCAYAAADkmO9VAAAACXBIWXMAAAsTAAALEwEAmpwYAAAEHElEQVQ4y02UC0xTVxjHv0IpYKaQyEOyMWhLawx0Mpgxm2yyKO/yKM4lsoTNMEbcSwLUIkUID43bkBCJzrBBykBgmDhlXQimKE8FlrmFxwUKLVVMnDQUS3m0vbf322lBspP88v9yT/LL951zcoHRy4BdkIFVmw64kQWIZ+BeyxEgi0PwJngSeASvbTzXS4WALRLAzkjArijAhnBYuyCEhYJQAAcRoiHTbY1K9UD8nIv4LcnT7rmn+N6iYF5A9H6vgJMf+vglv7s7MJzv+fr+N3n+eFHsjn2HuPgohov6WC7eedvjRbHArT/nDQBGJyNkyIl4bnM2fcQylTrm0Mmm2ScfNeOQpI4dCJ9c7TnwiPAQSc0ORdRhRViz5YJwerVUMGYqEYxslAlnsVpUgRdFWx0S4Y+sXoYbM+n4ciINWX0GMoaTE9YHkgH7gwg0qg+4cNbWfkk/VorHzUo+mpRh+FwhQDIuEmE7XhJvCQnBtC7jMD7LjFrVnZcaJ/LicT6Sb78NIXQXSKxDH8TYCLSaK7H/ASH4BfCt11MSN1XZaVgjjsLL4sNMhYi/WS7cEQIznwz4LA9wEGDhBYbNGejITUR/RNylW8FQnQVDnfUa+TZvwahxxGC88hZgLbmcSwJgKkVgLQ8jQtcZyjiMUeVBI8LG+kosso4VB8usOMzTL9FqNhLRMrLMMqLDyDqYZQLtoO13T5CXgM2pbsy1BHe2SsTBKhG4Fo/HA5vN5ma322HNYo5jHUSBDqR73sf1x2o0/LuMBr0OtVotmkwm1rlL0/RDEuBkaWkJRkdHt2RUqwambmo4RsrAdW5azOYklmVdQjvVjfTSItoZFu02G9qsNmRoMgfrEv7pfKt3vq51uyo96ybfl8Ap8o97JewFqukeTHQPw+jy/K6Zudk4al6bNm20yaaeLn42SU3KJ2emz80+0Z3WLuozSMq0Bl10dcQp+CHpS8jfEwvyoAQo/J/Qj2rRCMd/6zv4uKsvpruxU3C3sF6slteE93yviqkXfuqTCfDar/k1MaqsUsnPKYXiX5S1op/OVB29Iv0msmDvMaE8KDGg0P/4jrCWJE619TJU+33854Y6f0DR9HxQ2WwfKFHhcFnLrbbEcw0dqSXYIT1vu5WqNNcczD5bFJSwVhwixaLAeCQjN5KErXFbNZUk16ZaNfNUW+/i39d/z+tXNFGDJaqnwyXNppHymw0diYrvOqVKU0dysaEzVblQE5mdV+B3XFsUEK8nMjOR1ZHc6dCLpC9JH9Kh79jV255/lbXvbsyp2Jt79EToJ++lBCaFvOOfKToiUBz6OECdWbnnsiTLs3BfvI8iONmHyHyJzHurw9bebTQuqLb7sNg6BM051bC9vF/9ZbaTU3/sK7gRnQsF5MycXRUR5ETmFP4HPQyC2KrLXHkAAAAASUVORK5CYII='); background-size: cover; display: block;\"\n  ></span>\n  <img\n        class=\"gatsby-resp-image-image\"\n        alt=\"Beam-Model\"\n        title=\"Beam-Model\"\n        src=\"/static/bcda8196d6c879a95ccdd397242e01fb/e5715/beam_architecture.png\"\n        srcset=\"/static/bcda8196d6c879a95ccdd397242e01fb/a6d36/beam_architecture.png 650w,\n/static/bcda8196d6c879a95ccdd397242e01fb/e5715/beam_architecture.png 
768w,\n/static/bcda8196d6c879a95ccdd397242e01fb/1cfc2/beam_architecture.png 900w\"\n        sizes=\"(max-width: 768px) 100vw, 768px\"\n        style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n        loading=\"lazy\"\n      />\n    </span></p>\n<h2 id=\"features-of-apache-beam\" style=\"position:relative;\"><a href=\"#features-of-apache-beam\" aria-label=\"features of apache beam permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Features of Apache Beam</h2>\n<p>The key features of Apache Beam are as follows:</p>\n<ol>\n<li>Unified - Use a single programming model for both batch and streaming use cases.</li>\n<li>Portable - Execute pipelines in multiple execution environments. Here, an execution environment means a runner. For example: 
Spark Runner, Dataflow Runner, etc.</li>\n<li>Extensible - Write custom SDKs, IO connectors, and transformation libraries.</li>\n</ol>\n<h2 id=\"apache-beam-sdks-and-runners\" style=\"position:relative;\"><a href=\"#apache-beam-sdks-and-runners\" aria-label=\"apache beam sdks and runners permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Apache Beam SDKs and Runners</h2>\n<p>At the time of writing, there are three Apache Beam SDKs:</p>\n<ol>\n<li>Java</li>\n<li>Python</li>\n<li>Go</li>\n</ol>\n<p>Beam runners translate a Beam pipeline into the API of the processing backend of your choice. 
Beam currently supports runners that work with the following backends:</p>\n<ol>\n<li>Apache Spark</li>\n<li>Apache Flink</li>\n<li>Apache Samza</li>\n<li>Google Cloud Dataflow</li>\n<li>Hazelcast Jet</li>\n<li>Twister2</li>\n</ol>\n<p>There is also the Direct Runner, which runs the pipeline on the host machine and is used for testing purposes.</p>\n<h2 id=\"basic-concepts-in-apache-beam\" style=\"position:relative;\"><a href=\"#basic-concepts-in-apache-beam\" aria-label=\"basic concepts in apache beam permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Basic Concepts in Apache Beam</h2>\n<p>Apache Beam has three main abstractions. 
They are</p>\n<ol>\n<li>Pipeline</li>\n<li>PCollection</li>\n<li>PTransform</li>\n</ol>\n<p><span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 768px; \"\n    >\n      <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 13.999999999999998%; position: relative; bottom: 0; left: 0; background-image: url('data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAADCAYAAACTWi8uAAAACXBIWXMAAAsTAAALEwEAmpwYAAAA/klEQVQI1wHzAAz/APXx65U1ZahDOWmr/0Fvr/osX6b/Y4e5ef///z8mWaBRNGWq/zdnq/ktYKf/dJO/bv///zsfVJ5bPWut/0BurvkvYaf/SXSwPvbz7Z3k5ebHAOPg28VLd7VKOmqu/0t2tP8zZKr/cJC8k///73coV5paOGir/0NxsP8xY6n/g57Cjf//73EfUphjPmyu/0l0sv82Zqr/Y4e6TNrX09PS0tP/AHlsWRhcc5MrJU6HZyNSlFUwYqlUDEeaFpWsywAuY60YK16mUyZapE0tX6ZUADWTD5OqygAxZa0cK16lVChcpU0vYadTCEacDf/v1wvf4eMUMeqAEPko5kUAAAAASUVORK5CYII='); background-size: cover; display: block;\"\n  ></span>\n  <img\n        class=\"gatsby-resp-image-image\"\n        alt=\"Beam-Pipeline\"\n        title=\"Beam-Pipeline\"\n        src=\"/static/439968fcb84e700e21b6475c5aa49214/e5715/pipeline-design.png\"\n        srcset=\"/static/439968fcb84e700e21b6475c5aa49214/a6d36/pipeline-design.png 650w,\n/static/439968fcb84e700e21b6475c5aa49214/e5715/pipeline-design.png 768w,\n/static/439968fcb84e700e21b6475c5aa49214/1132d/pipeline-design.png 1158w\"\n        sizes=\"(max-width: 768px) 100vw, 768px\"\n        style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n        loading=\"lazy\"\n      />\n    </span></p>\n<h3 id=\"pipeline\" style=\"position:relative;\"><a href=\"#pipeline\" aria-label=\"pipeline permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 
5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Pipeline:</h3>\n<p>A pipeline is the first abstraction to be created. It holds the complete data processing job from start to finish, including reading data, manipulating data, and writing data to a sink. Every pipeline takes options that indicate where and how it should run.</p>\n<h3 id=\"pcollection\" style=\"position:relative;\"><a href=\"#pcollection\" aria-label=\"pcollection permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>PCollection:</h3>\n<p>A PCollection is an abstraction of a distributed dataset. A PCollection can be bounded (finite data) or unbounded (infinite data). The initial PCollection is created by reading data from a source. 
From then on, PCollections are the input and output of every step in the pipeline.</p>\n<h3 id=\"transform\" style=\"position:relative;\"><a href=\"#transform\" aria-label=\"transform permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Transform:</h3>\n<p>A transform is a data processing operation that is applied to one or more PCollections. Complex (composite) transforms have other transforms nested within them. A transform is attached to a pipeline or a PCollection by passing it to the generic <code>apply</code> method; in the Java SDK, the transform's own logic lives in its <code>expand</code> method.</p>\n<h2 id=\"example-of-pipeline\" style=\"position:relative;\"><a href=\"#example-of-pipeline\" aria-label=\"example of pipeline permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Example of Pipeline</h2>\n<p>Let's write a pipeline that outputs every JSON record whose name starts with a vowel.</p>\n<p>Let's take a sample input. 
Name the file as <code>input.json</code></p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"json\" data-index=\"0\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk8\">&quot;abhi&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">&quot;score&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk7\">12</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk8\">&quot;virat&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">&quot;score&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk7\">23</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk8\">&quot;dhoni&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">&quot;score&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk7\">45</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk8\">&quot;rahul&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">&quot;score&quot;</span><span class=\"mtk1\">: </span><span class=\"mtk7\">156</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">: </span><span class=\"mtk8\">&quot;Edmund&quot;</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">: </span><span class=\"mtk8\">&quot;Ojha&quot;</span><span 
class=\"mtk1\">}</span></span></code></pre>\n<p>The input should be a newline delimited JSON.</p>\n<p>Include the following dependencies in your <code>pom.xml</code></p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"\" data-index=\"1\"><code class=\"grvsc-code\"><span class=\"grvsc-line\">&lt;dependency&gt;</span>\n<span class=\"grvsc-line\">    &lt;groupId&gt;org.apache.beam&lt;/groupId&gt;</span>\n<span class=\"grvsc-line\">    &lt;artifactId&gt;beam-sdks-java-core&lt;/artifactId&gt;</span>\n<span class=\"grvsc-line\">    &lt;version&gt;2.24.0&lt;/version&gt;</span>\n<span class=\"grvsc-line\">&lt;/dependency&gt;</span>\n<span class=\"grvsc-line\"></span>\n<span class=\"grvsc-line\">&lt;dependency&gt;</span>\n<span class=\"grvsc-line\">    &lt;groupId&gt;org.apache.beam&lt;/groupId&gt;</span>\n<span class=\"grvsc-line\">    &lt;artifactId&gt;beam-runners-direct-java&lt;/artifactId&gt;</span>\n<span class=\"grvsc-line\">    &lt;version&gt;2.24.0&lt;/version&gt;</span>\n<span class=\"grvsc-line\">&lt;/dependency&gt;</span></code></pre>\n<p>Let's code the beam pipeline. Follow the steps</p>\n<ol>\n<li>\n<p>Create a pipeline.</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"java\" data-index=\"2\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk10\">Pipeline</span><span class=\"mtk1\"> </span><span class=\"mtk12\">pipeLine</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">Pipeline</span><span class=\"mtk1\">.</span><span class=\"mtk11\">create</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk3\">// OR </span></span>\n<span class=\"grvsc-line\"><span class=\"mtk3\">// Pipeline pipeLine = Pipeline.create(options);</span></span></code></pre>\n<p>Create a pipeline which binds all the pcollections and transforms. 
Optionally you can pass the PipelineOptions <code>options</code> if needed.</p>\n</li>\n<li>\n<p>Read the input file</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"java\" data-index=\"3\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk10\">PCollection</span><span class=\"mtk1\">&lt;</span><span class=\"mtk10\">String</span><span class=\"mtk1\">&gt; </span><span class=\"mtk12\">inputCollection</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">pipeLine</span><span class=\"mtk1\">.</span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;Read My File&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">TextIO</span><span class=\"mtk1\">.</span><span class=\"mtk11\">read</span><span class=\"mtk1\">().</span><span class=\"mtk11\">from</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;input.json&quot;</span><span class=\"mtk1\">));</span></span></code></pre>\n<p>Use the <code>TextIO</code> transform to read the input files. 
Each line is a separate JSON record.</p>\n</li>\n<li>\n<p>Apply a transform that keeps only the records whose name starts with a vowel</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"java\" data-index=\"4\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk10\">PCollection</span><span class=\"mtk1\">&lt;</span><span class=\"mtk10\">String</span><span class=\"mtk1\">&gt; </span><span class=\"mtk12\">filteredCollection</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">inputCollection</span><span class=\"mtk1\">.</span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;Filter names starting with vowels&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">Filter</span><span class=\"mtk1\">.</span><span class=\"mtk11\">by</span><span class=\"mtk1\">(</span><span class=\"mtk15\">new</span><span class=\"mtk1\"> </span><span class=\"mtk10\">SerializableFunction</span><span class=\"mtk1\">&lt;</span><span class=\"mtk10\">String</span><span class=\"mtk1\">, </span><span class=\"mtk10\">Boolean</span><span class=\"mtk1\">&gt;() {</span></span>\n<span class=\"grvsc-line\"></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        </span><span class=\"mtk4\">public</span><span class=\"mtk1\"> </span><span class=\"mtk10\">Boolean</span><span class=\"mtk1\"> </span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk10\">String</span><span class=\"mtk1\"> </span><span class=\"mtk12\">input</span><span class=\"mtk1\">) {</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            </span><span class=\"mtk10\">ObjectMapper</span><span class=\"mtk1\"> </span><span class=\"mtk12\">jacksonObjMapper</span><span class=\"mtk1\"> = </span><span class=\"mtk15\">new</span><span class=\"mtk1\"> </span><span class=\"mtk11\">ObjectMapper</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            </span><span class=\"mtk15\">try</span><span class=\"mtk1\"> {</span></span>\n<span 
class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk10\">JsonNode</span><span class=\"mtk1\"> </span><span class=\"mtk12\">jsonNode</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">jacksonObjMapper</span><span class=\"mtk1\">.</span><span class=\"mtk11\">readTree</span><span class=\"mtk1\">(input);</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk10\">String</span><span class=\"mtk1\"> </span><span class=\"mtk12\">name</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">jsonNode</span><span class=\"mtk1\">.</span><span class=\"mtk11\">get</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;name&quot;</span><span class=\"mtk1\">).</span><span class=\"mtk11\">textValue</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk15\">return</span><span class=\"mtk1\"> </span><span class=\"mtk12\">vowels</span><span class=\"mtk1\">.</span><span class=\"mtk11\">contains</span><span class=\"mtk1\">(</span><span class=\"mtk12\">name</span><span class=\"mtk1\">.</span><span class=\"mtk11\">substring</span><span class=\"mtk1\">(</span><span class=\"mtk7\">0</span><span class=\"mtk1\">,</span><span class=\"mtk7\">1</span><span class=\"mtk1\">).</span><span class=\"mtk11\">toLowerCase</span><span class=\"mtk1\">());</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            } </span><span class=\"mtk15\">catch</span><span class=\"mtk1\"> (</span><span class=\"mtk10\">JsonProcessingException</span><span class=\"mtk1\"> </span><span class=\"mtk12\">e</span><span class=\"mtk1\">) {</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk12\">e</span><span class=\"mtk1\">.</span><span class=\"mtk11\">printStackTrace</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            
}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            </span><span class=\"mtk15\">return</span><span class=\"mtk1\"> </span><span class=\"mtk4\">false</span><span class=\"mtk1\">;</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        }</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">    }));</span></span></code></pre>\n<p>The <code>Filter</code> transform takes a <code>SerializableFunction</code> object whose <code>apply</code> method is overridden. The function runs once per input record: it parses the JSON string into a JSON tree and checks whether the first character of <code>name</code> is a vowel, using the <code>vowels</code> set defined in the complete code below. The record is retained if the function returns <code>true</code> and discarded otherwise. A record that fails to parse is also discarded.</p>\n</li>\n<li>\n<p>Write the results to a file</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"java\" data-index=\"5\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk12\">filteredCollection</span><span class=\"mtk1\">.</span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;write to file&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">TextIO</span><span class=\"mtk1\">.</span><span class=\"mtk11\">write</span><span class=\"mtk1\">().</span><span class=\"mtk11\">to</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;result&quot;</span><span class=\"mtk1\">).</span><span class=\"mtk11\">withSuffix</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;.txt&quot;</span><span class=\"mtk1\">).</span><span class=\"mtk11\">withoutSharding</span><span class=\"mtk1\">());</span></span></code></pre>\n<p>The results of the <code>Filter</code> transform are written to a text file using the <code>write</code> method of the <code>TextIO</code> transform. Because a PCollection is distributed across workers, the results are written to multiple files (shards) by default. 
To avoid this, we use <code>withoutSharding</code> where all the output is written to a single file.</p>\n</li>\n</ol>\n<p>Output:</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"json\" data-index=\"6\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">: </span><span class=\"mtk8\">&quot;Edmund&quot;</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">: </span><span class=\"mtk8\">&quot;Ojha&quot;</span><span class=\"mtk1\">}</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">{</span><span class=\"mtk12\">&quot;name&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk8\">&quot;abhi&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">&quot;score&quot;</span><span class=\"mtk1\">:</span><span class=\"mtk7\">12</span><span class=\"mtk1\">}</span></span></code></pre>\n<hr>\n<p>Complete Code:</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"java\" data-index=\"7\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk10\">Pipeline</span><span class=\"mtk1\"> </span><span class=\"mtk12\">pipeLine</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">Pipeline</span><span class=\"mtk1\">.</span><span class=\"mtk11\">create</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"></span>\n<span class=\"grvsc-line\"><span class=\"mtk4\">final</span><span class=\"mtk1\"> </span><span class=\"mtk10\">Set</span><span class=\"mtk1\">&lt;</span><span class=\"mtk10\">String</span><span class=\"mtk1\">&gt; </span><span class=\"mtk12\">vowels</span><span class=\"mtk1\"> = </span><span class=\"mtk15\">new</span><span class=\"mtk1\"> </span><span class=\"mtk10\">HashSet</span><span class=\"mtk1\">&lt;</span><span class=\"mtk10\">String</span><span 
class=\"mtk1\">&gt;(</span><span class=\"mtk12\">Arrays</span><span class=\"mtk1\">.</span><span class=\"mtk11\">asList</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;a&quot;</span><span class=\"mtk1\">,</span><span class=\"mtk8\">&quot;e&quot;</span><span class=\"mtk1\">,</span><span class=\"mtk8\">&quot;i&quot;</span><span class=\"mtk1\">,</span><span class=\"mtk8\">&quot;o&quot;</span><span class=\"mtk1\">,</span><span class=\"mtk8\">&quot;u&quot;</span><span class=\"mtk1\">));</span></span>\n<span class=\"grvsc-line\"></span>\n<span class=\"grvsc-line\"><span class=\"mtk12\">pipeLine</span><span class=\"mtk1\">.</span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;Read My File&quot;</span><span class=\"mtk1\">,</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk12\">TextIO</span><span class=\"mtk1\">.</span><span class=\"mtk11\">read</span><span class=\"mtk1\">().</span><span class=\"mtk11\">from</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;input.json&quot;</span><span class=\"mtk1\">))</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        .</span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;Filter names starting with vowels&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">Filter</span><span class=\"mtk1\">.</span><span class=\"mtk11\">by</span><span class=\"mtk1\">(</span><span class=\"mtk15\">new</span><span class=\"mtk1\"> </span><span class=\"mtk10\">SerializableFunction</span><span class=\"mtk1\">&lt;</span><span class=\"mtk10\">String</span><span class=\"mtk1\">, </span><span class=\"mtk10\">Boolean</span><span class=\"mtk1\">&gt;() {</span></span>\n<span class=\"grvsc-line\"></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            </span><span class=\"mtk4\">public</span><span class=\"mtk1\"> </span><span class=\"mtk10\">Boolean</span><span 
class=\"mtk1\"> </span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk10\">String</span><span class=\"mtk1\"> </span><span class=\"mtk12\">input</span><span class=\"mtk1\">) {</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk10\">ObjectMapper</span><span class=\"mtk1\"> </span><span class=\"mtk12\">jacksonObjMapper</span><span class=\"mtk1\"> = </span><span class=\"mtk15\">new</span><span class=\"mtk1\"> </span><span class=\"mtk11\">ObjectMapper</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk15\">try</span><span class=\"mtk1\"> {</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                    </span><span class=\"mtk10\">JsonNode</span><span class=\"mtk1\"> </span><span class=\"mtk12\">jsonNode</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">jacksonObjMapper</span><span class=\"mtk1\">.</span><span class=\"mtk11\">readTree</span><span class=\"mtk1\">(input);</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                    </span><span class=\"mtk10\">String</span><span class=\"mtk1\"> </span><span class=\"mtk12\">name</span><span class=\"mtk1\"> = </span><span class=\"mtk12\">jsonNode</span><span class=\"mtk1\">.</span><span class=\"mtk11\">get</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;name&quot;</span><span class=\"mtk1\">).</span><span class=\"mtk11\">textValue</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                    </span><span class=\"mtk15\">return</span><span class=\"mtk1\"> </span><span class=\"mtk12\">vowels</span><span class=\"mtk1\">.</span><span class=\"mtk11\">contains</span><span class=\"mtk1\">(</span><span class=\"mtk12\">name</span><span class=\"mtk1\">.</span><span class=\"mtk11\">substring</span><span class=\"mtk1\">(</span><span 
class=\"mtk7\">0</span><span class=\"mtk1\">,</span><span class=\"mtk7\">1</span><span class=\"mtk1\">).</span><span class=\"mtk11\">toLowerCase</span><span class=\"mtk1\">());</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                } </span><span class=\"mtk15\">catch</span><span class=\"mtk1\"> (</span><span class=\"mtk10\">JsonProcessingException</span><span class=\"mtk1\"> </span><span class=\"mtk12\">e</span><span class=\"mtk1\">) {</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                    </span><span class=\"mtk12\">e</span><span class=\"mtk1\">.</span><span class=\"mtk11\">printStackTrace</span><span class=\"mtk1\">();</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                }</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">                </span><span class=\"mtk15\">return</span><span class=\"mtk1\"> </span><span class=\"mtk4\">false</span><span class=\"mtk1\">;</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            }</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        }))</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        .</span><span class=\"mtk11\">apply</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;write to file&quot;</span><span class=\"mtk1\">, </span><span class=\"mtk12\">TextIO</span><span class=\"mtk1\">.</span><span class=\"mtk11\">write</span><span class=\"mtk1\">().</span><span class=\"mtk11\">to</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;result&quot;</span><span class=\"mtk1\">).</span><span class=\"mtk11\">withSuffix</span><span class=\"mtk1\">(</span><span class=\"mtk8\">&quot;.txt&quot;</span><span class=\"mtk1\">).</span><span class=\"mtk11\">withoutSharding</span><span class=\"mtk1\">());</span></span>\n<span class=\"grvsc-line\"></span>\n<span class=\"grvsc-line\"><span class=\"mtk12\">pipeLine</span><span class=\"mtk1\">.</span><span 
class=\"mtk11\">run</span><span class=\"mtk1\">().</span><span class=\"mtk11\">waitUntilFinish</span><span class=\"mtk1\">();</span></span></code></pre>\n<p>For more advanced concepts, refer to the official Apache Beam documentation at beam.apache.org.</p>\n<style class=\"grvsc-styles\">\n  .grvsc-container {\n    overflow: auto;\n    -webkit-overflow-scrolling: touch;\n    padding-top: 1rem;\n    padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem));\n    padding-bottom: 1rem;\n    padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem));\n    border-radius: 8px;\n    border-radius: var(--grvsc-border-radius, 8px);\n    font-feature-settings: normal;\n  }\n  \n  .grvsc-code {\n    display: inline-block;\n    min-width: 100%;\n  }\n  \n  .grvsc-line {\n    display: inline-block;\n    box-sizing: border-box;\n    width: 100%;\n    padding-left: 1.5rem;\n    padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem));\n    padding-right: 1.5rem;\n    padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem));\n  }\n  \n  .grvsc-line-highlighted {\n    background-color: var(--grvsc-line-highlighted-background-color, transparent);\n    box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, transparent);\n  }\n  \n  .dark-default-dark {\n    background-color: #1E1E1E;\n    color: #D4D4D4;\n  }\n  .dark-default-dark .mtk1 { color: #D4D4D4; }\n  .dark-default-dark .mtk12 { color: #9CDCFE; }\n  .dark-default-dark .mtk8 { color: #CE9178; }\n  .dark-default-dark .mtk7 { color: #B5CEA8; }\n  .dark-default-dark .mtk10 { color: #4EC9B0; }\n  .dark-default-dark .mtk11 { color: #DCDCAA; }\n  .dark-default-dark .mtk3 { color: #6A9955; }\n  .dark-default-dark .mtk15 { color: #C586C0; }\n  .dark-default-dark .mtk4 { color: #569CD6; }\n</style>","frontmatter":{"title":"Apache Beam: A Basic Guide","author":{"id":"Abhilash K R","github":"Better-Boy","avatar":null},"date":"October 16, 
2020","updated_date":null,"tags":["Engineering","Big Data","Streaming","Apache Beam","Java"],"coverImage":{"childImageSharp":{"fluid":{"aspectRatio":1.5037593984962405,"src":"/static/17c96c6367455f79bdb854c4608ebaba/ee604/main.png","srcSet":"/static/17c96c6367455f79bdb854c4608ebaba/69585/main.png 200w,\n/static/17c96c6367455f79bdb854c4608ebaba/497c6/main.png 400w,\n/static/17c96c6367455f79bdb854c4608ebaba/ee604/main.png 800w,\n/static/17c96c6367455f79bdb854c4608ebaba/f3583/main.png 1200w","sizes":"(max-width: 800px) 100vw, 800px"}}}},"fields":{"authorId":"Abhilash K R","slug":"/engineering/apache-beam/"}}}]},"authorYaml":{"id":"Abhilash K R","bio":null,"github":"Better-Boy","stackoverflow":"6849682","linkedin":null,"medium":null,"twitter":null,"avatar":null}},"pageContext":{"id":"Abhilash K R","__params":{"id":"abhilash-k-r"}}},"staticQueryHashes":["1171199041","1384082988","2100481360","23180105","528864852"]}