{"componentChunkName":"component---src-templates-tag-js","path":"/tags/coding/","result":{"data":{"site":{"siteMetadata":{"title":"LoginRadius Blog"}},"allMarkdownRemark":{"totalCount":2,"edges":[{"node":{"fields":{"slug":"/engineering/learning-how-to-code/"},"html":"<p>When you work for a tech company in an office capacity, it feels like everyone around you is speaking another language. Which they are: most of their work exists in programming languages. To feel more relevant, I signed up on <a href=\"https://www.codecademy.com/\">codecademy.com</a> and started working on their beginner courses (for free! you should try it!).</p>\n<p>Here are some things I have learned about coding:</p>\n<ol>\n<li>There are many different ways to code. Not every website or app is made using the same terms or patterns; programmers and developers have options for how they want to format their content, and how they want to communicate with their computer.</li>\n<li>Coding is more about seeing what you want in your head than about math. Deciding exactly what you want and communicating it is the big struggle. Having your website/app be capable of computing things is optional, but the layout is mandatory. (Not to mention, it is the most noticeable aspect of your site.)</li>\n<li>Code is like a language. The words and punctuation must be learned for each coding language. Though they can be linked to English words (like “p” for a new paragraph), you have to be able to pull them up quickly in your mind. As with any language, repetition is key to recognition.</li>\n<li>You don’t have to start from scratch (but you can). There are frameworks available, like Bootstrap, which have some existing templates you can draw from. 
This helps you when you are constructing your site, so that you can set up your layout more quickly, and have more time to work on your content.</li>\n<li>\n<p>There are some areas of knowledge that you don’t need to memorize, but having an understanding of what they do will be helpful in coding:</p>\n<ol>\n<li>The hexadecimal system (and how it relates to color)</li>\n<li>Binary code</li>\n<li>ASCII</li>\n</ol>\n</li>\n<li>Once you learn some of the basics, websites you see every day will start to seem simpler. You will be able to pick out some of their elements, and if you look at their bare code, you will recognize how they come together.</li>\n<li>The world of coders is large and easy to access. If you have questions, there is someone online who is willing and eager to answer. This is not a skill that anyone was born with, and with new languages appearing all the time, everyone is still learning.</li>\n</ol>\n<p>I hope that this list has contributed to your knowledge, and I would encourage you to check online for available resources to expand (or begin) your abilities in coding.</p>\n<style class=\"grvsc-styles\">\n  .grvsc-container {\n    overflow: auto;\n    -webkit-overflow-scrolling: touch;\n    padding-top: 1rem;\n    padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem));\n    padding-bottom: 1rem;\n    padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem));\n    border-radius: 8px;\n    border-radius: var(--grvsc-border-radius, 8px);\n    font-feature-settings: normal;\n  }\n  \n  .grvsc-code {\n    display: inline-block;\n    min-width: 100%;\n  }\n  \n  .grvsc-line {\n    display: inline-block;\n    box-sizing: border-box;\n    width: 100%;\n    padding-left: 1.5rem;\n    padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem));\n    padding-right: 1.5rem;\n    padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem));\n  }\n  \n  .grvsc-line-highlighted {\n    background-color: 
var(--grvsc-line-highlighted-background-color, transparent);\n    box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, transparent);\n  }\n  \n</style>","frontmatter":{"date":"December 29, 2015","updated_date":null,"title":"Learning How to Code","tags":["Learning","Coding","Learning resources"],"coverImage":{"childImageSharp":{"fluid":{"aspectRatio":1,"src":"/static/26a66e05ab78493dc6d84d3afe0d8a82/630fb/begin-code-300x300.png","srcSet":"/static/26a66e05ab78493dc6d84d3afe0d8a82/69585/begin-code-300x300.png 200w,\n/static/26a66e05ab78493dc6d84d3afe0d8a82/630fb/begin-code-300x300.png 300w","sizes":"(max-width: 300px) 100vw, 300px"}}},"author":{"id":"Carling","github":null,"avatar":null}}}},{"node":{"fields":{"slug":"/engineering/write-a-highly-efficient-python-web-crawler/"},"html":"<p>As in my previous blog post, I use the Python web crawler library Scrapy to crawl static websites. Scrapy supports custom downloader middleware, which can deal with content on the site such as JavaScript.</p>\n<p>However, Scrapy already handles much of the underlying implementation for us: for example, it uses its own dispatcher and has a pipeline for the parsing work after download. One drawback of using such a library is that strange bugs are hard to deal with, because it runs the jobs in parallel.</p>\n<p>For this tutorial, I want to show the structure of a simple and efficient web crawler.</p>\n<p>First of all, we need a scheduler that can run the jobs in parallel, because most of the time is spent on requests. I use <a href=\"http://www.gevent.org/\">gevent</a> to schedule the jobs. 
Gevent is built on an event loop library (<a href=\"http://libevent.org/\">libevent</a> in early releases, libev since 1.0) and combines lightweight cooperative threads (greenlets) with event-based I/O to run the jobs concurrently.</p>\n<p>Here is the sample code:</p>\n<pre class=\"grvsc-container dark-default-dark\" data-language=\"python\" data-index=\"0\"><code class=\"grvsc-code\"><span class=\"grvsc-line\"><span class=\"mtk15\">import</span><span class=\"mtk1\"> gevent</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk15\">from</span><span class=\"mtk1\"> gevent </span><span class=\"mtk15\">import</span><span class=\"mtk1\"> Greenlet</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk15\">from</span><span class=\"mtk1\"> gevent </span><span class=\"mtk15\">import</span><span class=\"mtk1\"> monkey</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk15\">from</span><span class=\"mtk1\"> gevent.queue </span><span class=\"mtk15\">import</span><span class=\"mtk1\"> Queue</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk15\">from</span><span class=\"mtk1\"> selenium </span><span class=\"mtk15\">import</span><span class=\"mtk1\"> webdriver</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">monkey.patch_socket()</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk4\">class</span><span class=\"mtk1\"> </span><span class=\"mtk10\">WebCrawler</span><span class=\"mtk1\">:</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">    </span><span class=\"mtk4\">def</span><span class=\"mtk1\"> </span><span class=\"mtk11\">__init__</span><span class=\"mtk1\">(</span><span class=\"mtk12\">self</span><span class=\"mtk1\">,</span><span class=\"mtk12\">urls</span><span class=\"mtk1\">=[],</span><span class=\"mtk12\">num_worker</span><span class=\"mtk1\"> = </span><span class=\"mtk7\">1</span><span class=\"mtk1\">):</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        </span><span class=\"mtk4\">self</span><span class=\"mtk1\">.url_queue = Queue()</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        </span><span class=\"mtk15\">for</span><span class=\"mtk1\"> url </span><span class=\"mtk4\">in</span><span class=\"mtk1\"> urls:  </span><span class=\"mtk3\"># enqueue the start URLs</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            </span><span class=\"mtk4\">self</span><span class=\"mtk1\">.url_queue.put(url)</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        </span><span class=\"mtk4\">self</span><span class=\"mtk1\">.num_worker = num_worker</span></span>\n<span 
class=\"grvsc-line\"><span class=\"mtk1\">    </span><span class=\"mtk4\">def</span><span class=\"mtk1\"> </span><span class=\"mtk11\">worker</span><span class=\"mtk1\">(</span><span class=\"mtk12\">self</span><span class=\"mtk1\">,</span><span class=\"mtk12\">pid</span><span class=\"mtk1\">):</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        driver = </span><span class=\"mtk4\">self</span><span class=\"mtk1\">.initializeAnImegaDisabledDriver()  </span><span class=\"mtk3\"># initialize the webdriver</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        </span><span class=\"mtk3\"># </span><span class=\"mtk4\">TODO</span><span class=\"mtk3\"> catch the exception</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        </span><span class=\"mtk15\">while</span><span class=\"mtk1\"> </span><span class=\"mtk4\">not</span><span class=\"mtk1\"> </span><span class=\"mtk4\">self</span><span class=\"mtk1\">.url_queue.empty():</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            url = </span><span class=\"mtk4\">self</span><span class=\"mtk1\">.url_queue.get()</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            driver.get(url)  </span><span class=\"mtk3\"># use this worker's own driver</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">            elem = driver.find_elements_by_xpath(</span><span class=\"mtk8\">&quot;//script | //iframe | //img&quot;</span><span class=\"mtk1\">) </span><span class=\"mtk3\"># collect the script, iframe, and img elements from the page</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">    </span><span class=\"mtk4\">def</span><span class=\"mtk1\"> </span><span class=\"mtk11\">run</span><span class=\"mtk1\">(</span><span class=\"mtk12\">self</span><span class=\"mtk1\">):</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        jobs = [gevent.spawn(</span><span class=\"mtk4\">self</span><span class=\"mtk1\">.worker,i) 
</span><span class=\"mtk15\">for</span><span class=\"mtk1\"> i </span><span class=\"mtk4\">in</span><span class=\"mtk1\"> </span><span class=\"mtk12\">range</span><span class=\"mtk1\">(</span><span class=\"mtk4\">self</span><span class=\"mtk1\">.num_worker)]</span></span>\n<span class=\"grvsc-line\"><span class=\"mtk1\">        gevent.joinall(jobs)  </span><span class=\"mtk3\"># wait for all workers to finish</span></span></code></pre>\n<p>The next part is the headless browser. I run PhantomJS with <code>--webdriver=4444 --disk-cache=true --ignore-ssl-errors=true --load-images=false --max-disk-cache-size=100000</code>. You can find the details of each option in its documentation.</p>\n<p>PhantomJS uses Selenium WebDriver as its front end to handle requests, and uses WebKit and Qt as its underlying browser and controller. It has memory leak bugs, so it will consume a ton of memory, and it can only use one core of your CPU, but you can deploy many PhantomJS instances on different ports. I first wrote a daemon process to monitor its memory and status, but later realized I could use a Perl script to check the status of the process and send it a kill signal when it exceeds a limit such as 1 GB of memory.</p>\n<p>To speed up the crawler, I use a static browser to verify each website first: if a website is badly written, a deadlock might occur, so I just skip those sites.</p>\n<style class=\"grvsc-styles\">\n  .grvsc-container {\n    overflow: auto;\n    -webkit-overflow-scrolling: touch;\n    padding-top: 1rem;\n    padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem));\n    padding-bottom: 1rem;\n    padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem));\n    border-radius: 8px;\n    border-radius: var(--grvsc-border-radius, 8px);\n    font-feature-settings: normal;\n  }\n  \n  .grvsc-code {\n    display: inline-block;\n    min-width: 100%;\n  }\n  \n  .grvsc-line {\n    display: inline-block;\n    box-sizing: border-box;\n    width: 100%;\n    padding-left: 1.5rem;\n    padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 
1.5rem));\n    padding-right: 1.5rem;\n    padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem));\n  }\n  \n  .grvsc-line-highlighted {\n    background-color: var(--grvsc-line-highlighted-background-color, transparent);\n    box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, transparent);\n  }\n  \n  .dark-default-dark {\n    background-color: #1E1E1E;\n    color: #D4D4D4;\n  }\n  .dark-default-dark .mtk15 { color: #C586C0; }\n  .dark-default-dark .mtk1 { color: #D4D4D4; }\n  .dark-default-dark .mtk4 { color: #569CD6; }\n  .dark-default-dark .mtk10 { color: #4EC9B0; }\n  .dark-default-dark .mtk11 { color: #DCDCAA; }\n  .dark-default-dark .mtk12 { color: #9CDCFE; }\n  .dark-default-dark .mtk7 { color: #B5CEA8; }\n  .dark-default-dark .mtk3 { color: #6A9955; }\n  .dark-default-dark .mtk8 { color: #CE9178; }\n</style>","frontmatter":{"date":"July 14, 2015","updated_date":null,"title":"Write a highly efficient python Web Crawler","tags":["Python","Coding"],"coverImage":{"childImageSharp":{"fluid":{"aspectRatio":1,"src":"/static/87be28baa66b2bfdbec1ee9478ca3d79/7d145/python-web-crawler.png","srcSet":"/static/87be28baa66b2bfdbec1ee9478ca3d79/69585/python-web-crawler.png 200w,\n/static/87be28baa66b2bfdbec1ee9478ca3d79/497c6/python-web-crawler.png 400w,\n/static/87be28baa66b2bfdbec1ee9478ca3d79/7d145/python-web-crawler.png 610w","sizes":"(max-width: 610px) 100vw, 610px"}}},"author":{"id":"Mark Duan","github":null,"avatar":null}}}}]}},"pageContext":{"tag":"Coding"}},"staticQueryHashes":["1171199041","1384082988","2100481360","23180105","528864852"]}