From Cloud Notebooks to Enterprise Deployment
Data science used to be performed in expensive, proprietary software such as SAS, SPSS, and Minitab. Open source languages such as R and Python then took their place. IDEs for R and Python came first, followed by open source GUIs. Now notebook systems have become standard. Several vendors offer online cloud notebooks, and many host free instances where data science projects can be built, exported, and shared. What is the value of these notebook systems for data scientists, and is there value for an enterprise as well?
A Wealth of Online Cloud Notebook Providers
Data Science Languages
Two of the most popular programming languages for data science remain R and Python. R was developed as a way of teaching statistics to college students, and many of its users have a classical background in statistical training and methodology. Free online cloud options for R users include Posit (formerly RStudio) and Quarto. Both also integrate with Python.
Python, the other very popular language for data scientists, is multipurpose: it can be used for a wider range of tasks, such as web development, data manipulation, and machine learning. Jupyter Notebooks were developed primarily for Python, although they support many other languages, including R. The drawback is that working with Jupyter Notebooks directly requires the Jupyter environment to be installed on your own computer.
Cloud Notebook Providers
Other popular cloud notebooks require nothing to be installed on your local machine. These include Google Colab, Wolfram, Kaggle (which is also a great source of sample and practice datasets), Binder (which can also open code from a GitHub repository), and Replit. One advantage of a free cloud notebook is that a data scientist can learn and practice new code and can log in from anywhere. You can import and export your notebooks in a standard format (.ipynb), and several languages are supported, including Python, R, Scala, SQL, and Julia. On some of these platforms, such as Kaggle, data scientists also share their work with a community.
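To make the .ipynb export format concrete, here is a minimal sketch using only the Python standard library. It assumes the version 4 notebook layout (a JSON object holding a list of cells); the notebook content and the helper name `code_cells` are hypothetical and chosen only for illustration.

```python
import json

# A tiny notebook in the version 4 layout: a JSON object with a list of cells.
# The cell contents are made up for illustration.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Practice notebook"]},
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["values = [1, 2, 3]\n", "print(sum(values))"],
        },
    ],
}


def code_cells(nb):
    """Return the source of each code cell as one string (hypothetical helper)."""
    sources = []
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            src = cell["source"]
            # The source field may be a single string or a list of lines.
            sources.append(src if isinstance(src, str) else "".join(src))
    return sources


# Round-trip through JSON text, which is what an exported .ipynb file contains.
exported = json.dumps(notebook, indent=1)
reloaded = json.loads(exported)
extracted = code_cells(reloaded)
print(extracted[0])
```

Because the file is plain JSON, a notebook exported from one provider can in principle be inspected or imported elsewhere without running it, which is part of what makes notebooks easy to share.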
Why Use Cloud Notebooks
Cloud notebooks are extremely beneficial to new data scientists and students for several reasons. First, they are free and can be accessed from anywhere. Second, data scientists become part of a larger community where they can see other techniques applied to real data and get feedback quickly. Third, they can see data science implemented across many verticals and industries. Finally, they can export notebooks and reuse them.
Some of the same reasons that newer data scientists find value in free cloud notebook systems apply to more seasoned ones as well. These environments are places to try new techniques, find practice or experimental data, and become part of a larger community of scientists. They also connect data scientists with peers outside their own workplace.
Notebooks in the Enterprise
One of the great advantages that proprietary software had over open source was deployability. Now, with notebooks, data scientists can code in several languages, export their work to perform several tasks, and move it into deployment. Data engineers can also use notebooks on Spark. We will look at this and at how notebooks have been adapted to this framework. We will also look at some of the challenges of taking notebooks into a deployment environment.
Apache Spark and Notebooks – Deployment in the Enterprise
First, what is Apache Spark? Spark is an analytics engine that runs on clusters. On top of the Spark Core engine it has several foundational components, including Spark Streaming, Spark SQL, Spark MLlib (machine learning), and GraphX (Day, 2022; Goyal, 2018). Together, these built-in capabilities make Apache Spark a very powerful analytics engine. Another foundation is the Resilient Distributed Dataset (RDD). Essentially, Spark starts from a reference dataset and, as transformations are applied in programmed steps, can run operations in parallel and keep intermediate computed results in distributed memory instead of stable storage such as disk (Day, 2022). This preserves speed. Spark clusters can be “spun up” or destroyed as needed. They can run in batch or in real time (streaming). Spark can read from numerous types of databases and write its output to the same. In this regard, it is a flexible analytics engine.
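The RDD idea described above can be illustrated with a toy model in plain Python. This is not the Spark API: the class `ToyRDD` and its methods are hypothetical stand-ins, threads on one machine stand in for cluster workers, and the fault tolerance that makes a real RDD "resilient" is left out. The sketch only shows data split into partitions, transformations recorded as steps, parallel execution per partition, and intermediate results kept in memory.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD (not Spark): partitions plus a recorded chain of steps."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions   # list of lists, one per pretend worker
        self.steps = tuple(steps)      # recorded transformations, not yet run
        self._cached = None            # intermediate result kept in memory

    def map(self, fn):
        # A transformation only records the step; nothing is computed yet.
        return ToyRDD(self.partitions, self.steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.steps + (("filter", pred),))

    def _run_partition(self, part):
        # Apply every recorded step, in order, to one partition.
        for kind, fn in self.steps:
            if kind == "map":
                part = [fn(x) for x in part]
            else:
                part = [x for x in part if fn(x)]
        return part

    def collect_partitions(self):
        # An action: process all partitions in parallel, then cache the result.
        if self._cached is None:
            with ThreadPoolExecutor() as pool:
                self._cached = list(pool.map(self._run_partition, self.partitions))
        return self._cached

    def reduce(self, fn):
        # An action: combine within each partition, then across partitions.
        partials = [reduce(fn, p) for p in self.collect_partitions() if p]
        return reduce(fn, partials)


numbers = ToyRDD([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = even_squares.reduce(lambda a, b: a + b)
print(total)  # 4 + 16 + 36 + 64 = 120
```

Here the `map` and `filter` calls do no work on their own; computation happens only when `reduce` is called, and a second action on `even_squares` would reuse the cached in-memory partitions rather than recompute them.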
Apache Spark in the Enterprise Cloud
Databricks, a paid managed notebook service built on Apache Spark, is generally accessed through another paid service, such as Azure or AWS. It is an example of how notebooks can be used and deployed within the enterprise. The enterprise may have a public cloud for analytics and create one or more databases or stores there, for example in Azure. The cloud is where data is kept in some type of store, whether SQL, Blob, or another. The enterprise then connects these stores to Databricks, the Apache Spark analytics engine, through a secure link (Gupta, 2022). There are two main options for securing this connection: Azure Private Link and Azure Virtual Network Service Endpoints (Kuladia et al., 2020). I use Databricks as the example because it is a managed instance of the Apache Spark analytics engine, connects to databases, and can be deployed using notebook services in several configurations.
Language Compatibility
One of the hailed advantages of Databricks and Spark is that data scientists and data engineers can program in the language of their choice and collaborate using notebooks. In a sense this is true; however, some languages are easier to use on Spark than others. The reason is that Spark Core is written in Scala, which runs on the Java Virtual Machine (JVM). It follows that the languages that work most easily within Spark, and in the notebook system, are Scala and Java (Deshpande, 2022; Kiran, 2023). SQL also works fairly well. Python and R must be adapted to connect and work with Spark Core: Python through PySpark, and R through either the sparklyr or SparkR library. You can also use RStudio Server hosted on Azure Databricks with sparklyr, which is another adaptation. These adaptations do work. However, the smoothest deployment happens with Scala or Java. When you are processing large datasets with algorithms that can slow processing down, you may want the programming done in Scala to optimize performance (Kiran, 2023). Scala has a much steeper learning curve than Python or even Java (Deshpande, 2022). The solution may be to have the early stages of the data science project done in the data scientist’s preferred language and the deployment done by a data engineer or architect, since the output will go back to another database.
In this sense, notebook systems are “language agnostic,” and many people can collaborate, both among data scientists and across the enterprise. However, when it comes to deployment, it may be advantageous to have code written in JVM-compatible languages (Microsoft, 2023).
Whatever the solution to this particular challenge, this architecture, with notebooks as the way code is managed and shared, is likely to continue to evolve and improve. As individual data scientists, data engineers, and enterprises continue to benefit from this way of sharing code, notebook systems and the applications that use them are likely to keep developing.
References:
Berman, D. (2019). Comparing Apache Hive and Spark. DZone.
Deshpande, S. (2022). Scala Vs Python Vs R Vs Java - Which language is better for Spark & Why? Knowledgehut. https://www.knowledgehut.com/blog/programming/scala-vs-python-vs-r-vs-java Retrieved April 20, 2023.
Goyal, S. (2018). Spark Architecture and Deployment Environment. Medium. https://medium.com/@goyalsaurabh66/spark-architecture-and-deployment-f713ac031a88 Retrieved April 22, 2023.
Gupta, B. (2022). Using Microsoft Databricks to Facilitate a Modern Data Architecture. Tridant. https://www.tridant.com/microsoft-databricks-facilitate-modern-data-architecture/ Retrieved April 23, 2023.
Kiran, R. (2023). Spark Java Tutorial: Your One Stop Solution to Spark in Java. Edureka! https://www.edureka.co/blog/spark-java-tutorial/#:~:text=Spark%20is%20written%20in%20Java,Hive%2C%20Scala%20and%20many%20more. Retrieved April 22, 2023.
Kuladia, B., Garg, A., and Marusan, M. (2020). Securely Accessing Azure Data Sources from Azure Databricks. Databricks. https://www.databricks.com/blog/2020/02/28/securely-accessing-azure-data-sources-from-azure-databricks.html Retrieved April 19, 2023.
Microsoft (2023). Azure Databricks for Scala developers. https://learn.microsoft.com/en-us/azure/databricks/languages/scala Retrieved April 20, 2023.