Julia language features for processing statistical data

Abstract

Julia is a programming language designed for scientific computing. It is relatively new, so most of its libraries are still under active development. In this article, the authors consider the capabilities of the language in the field of mathematical statistics. Special emphasis is placed on the technical side; in particular, the process of installing and configuring the software environment is described in detail. Since users of Julia are often not professional programmers, technical difficulties in setting up the software environment can prevent them from quickly mastering the basic features of the language. The article also describes some features of Julia that distinguish it from other popular languages used for scientific computing. The third part of the article provides an overview of the two main libraries for mathematical statistics. The emphasis is again on the technical side, in order to give the reader an idea of the general capabilities of the language in the field of mathematical statistics.

Full Text

© Gevorkyan M.N., Korolkova A.V., Kulyabov D.S., 2023
This work is licensed under a Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by-nc/4.0/legalcode

1. Introduction

In this paper we give a brief overview of the capabilities of the Julia [1] programming language in the field of mathematical statistics. Julia is a fast compiled language with dynamic typing, originally developed for scientific computing. The language is relatively new; however, it has already reached version 1.8 and its core is quite stable. An impressive number of modules have been created for Julia and several books have been written [2-4].

There are a number of arguments in favor of learning and using the Julia language:
- Just-in-time (JIT) compilation [5] makes it possible to combine high performance with the ease of use of an interpreted language. Single-threaded programs in Julia have performance comparable to that of programs in C/C++ and Fortran [6] and significantly outperform interpreted languages such as R, Python, Matlab, SciLab, etc.
- The syntax of Julia is simple, and researchers familiar with Python, Fortran and R will be able to master it at a basic level in a short time.
- The language has extensive built-in capabilities for parallel and distributed computing, which are constantly being refined.

The authors aim to give a general idea of the capabilities available in Julia in the field of mathematical statistics and to demonstrate a number of examples that allow one to quickly grasp the features of the language and move on to using it. At the beginning of the article we give a step-by-step description of the configuration of the working environment for Unix-like systems (macOS, GNU/Linux) and Windows.
We do not give a systematic description of the syntax of the language, but focus on some specific features (dynamic dispatch, custom data types) that distinguish Julia from most popular programming languages. The libraries for mathematical statistics in Julia are grouped under the general name Julia Statistics [7, 8], and a separate section is devoted to them on the official forum of the language developers [9]. In the main part of this paper, we give an overview of the modules StatsBase and Distributions [10, 11], comparing their functionality with the libraries of the R language and the scipy.stats library of Python [12].

2. Installation and configuration of Julia environment

There are several ways to develop programs in Julia.
- Using the REPL (read-eval-print loop) in interactive mode: run the julia command from the terminal and enter instructions, which are executed immediately, with the returned result shown to the user.
- Saving the program source code to files with the extension jl and then passing them to the julia JIT compiler (the same julia command) for compilation and execution.
- Using interactive shells, such as Jupyter Notebook [13] and Pluto [14].

We will describe the installation of Julia, of the Jupyter interactive shell and of all the necessary modules in the GNU/Linux and Windows environments. The installation does not require superuser rights and can be performed remotely over ssh, which is convenient if the calculations are to be performed on a remote server.

2.1. Installing Julia and the necessary packages

On the official Julia website, in the downloads section, binary files for many systems are available.
Download the archive for the 64-bit version of GNU/Linux:

    wget https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.5-linux-x86_64.tar.gz --no-check-certificate

Please note that the URL may change, since it explicitly contains the current version of the Julia distribution. Extract the files from the downloaded archive:

    tar -xvzf julia-1.8.5-linux-x86_64.tar.gz

The directory julia-d386e40c17 (or one with another alphanumeric suffix) will be created. We rename it to just julia and add its bin subdirectory to the variable PATH:

    mv julia-d386e40c17/ julia
    export PATH="$HOME/julia/bin:$PATH"

Here $HOME is used instead of ~, because the shell does not expand the tilde inside double quotes.

In the case of the Windows operating system, we will describe the installation of the portable version. In the same section of the official website, download the 64-bit (portable) version for Windows and unpack the archive, for example, into the following directory:

    E:\Program Files\julia

In this directory, we create the folder depot, and in it the folder config, in which we create an empty text file startup.jl, which we will need later. The Julia directory depot will host an index of modules from the official repository, as well as installed modules and additional libraries. The location of this directory is non-standard, so in order for the Julia JIT compiler to recognize it correctly at startup, you should create an environment variable JULIA_DEPOT_PATH and assign it the value:

    E:\Program Files\julia\depot

We also add the path to the Julia JIT compiler (the executable file julia.exe) to the variable PATH:

    E:\Program Files\julia\bin

After the installation is complete, start the Julia command shell by running the command julia in the console. In the case of Windows, it is recommended to use PowerShell or the new Windows Terminal application [15]. After launching, press the key ] to switch to package management mode and run the command update, which will download the package index.
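Before installing packages, it can be useful to confirm that Julia actually picked up the depot location configured above. The following is a minimal sketch to be run at the ordinary julia> prompt (not in package mode). DEPOT_PATH is a global defined in Julia's Base; the variable names user_depot and startup_file are chosen here for illustration only.

```julia
# DEPOT_PATH lists the depot directories Julia searches; the first entry
# is the user depot, where packages and the package index are stored.
user_depot = first(DEPOT_PATH)

# The configuration file discussed in the text lives in <depot>/config/.
startup_file = joinpath(user_depot, "config", "startup.jl")

println("user depot:       ", user_depot)
println("startup.jl:       ", startup_file,
        isfile(startup_file) ? " (found)" : " (missing)")
println("JULIA_DEPOT_PATH: ", get(ENV, "JULIA_DEPOT_PATH", "<not set>"))
```

If the printed depot is not the directory that was intended, the environment variable JULIA_DEPOT_PATH was most likely not visible to the process that started Julia.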
You can also immediately install the necessary packages using the command add, for example

    add StatsBase Distributions Pluto Plots

The built-in package manager saves all package-related files to the storage directory (depot), which is pointed to by the environment variable JULIA_DEPOT_PATH. No other directories are involved. On Unix systems, if the variable JULIA_DEPOT_PATH is not defined, the corresponding directory is created in the user's home directory and is called .julia. In it, you should also manually create a directory config with the configuration file startup.jl.

At this stage, the installation and configuration of the compiler is complete, and we proceed to the installation of additional tools that may be needed to write programs in Julia.

2.2. Jupyter installation

Julia code can be executed in the Jupyter environment. For this you need the Julia kernel for Jupyter Notebook, which is provided by the package IJulia. By default, during the installation of IJulia the built-in Julia package manager automatically downloads the Python distribution Miniconda [16], places it in the storage directory and uses it to install all the necessary Python packages, including Jupyter Notebook.

Since most Julia users probably already have a Python distribution installed on the system, we will show how to use an already installed Jupyter with Julia. Even if there is no Python distribution on the system, it seems more practical to install it separately, as this gives better control over the packages used. Next, we describe the process of installing Jupyter using the Miniconda distribution into the user's local directory.
Download the installation script from the official website:

    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh

and start the installation process:

    bash Miniconda3-latest-Linux-x86_64.sh

During the installation, you must read and accept the license agreement, using the Enter key to scroll through the text and typing the word yes to accept it. After that, the installer will prompt you to select the directory into which the distribution directory tree will be copied. By default, this is the directory miniconda3 in the user's home directory. To leave it unchanged, press Enter. The necessary files will then be downloaded, after which the installer will offer to add the path to the Python interpreter to the environment variable PATH. You should agree by typing the word yes and pressing Enter. Then check that a line of the following form has been added to the file .bashrc located in the home directory:

    export PATH="$HOME/miniconda3/bin:$PATH"

where $HOME/miniconda3/bin is the path to the Miniconda directory with executables. If the installer did not add this path automatically, you can do it manually.

After the Miniconda installation is completed, the command conda becomes available, with which you can manage the installed Python modules. Let us use this command to install the modules we need (NumPy, SciPy, Matplotlib and Jupyter):

    conda install numpy matplotlib scipy jupyter

Downloading and unpacking the required files can take considerable time, and after completion the miniconda3 directory will occupy about 2.5 GB of disk space. To check that the installation is correct, start Jupyter with the following command:

    jupyter notebook --notebook-dir=~ --port=7000

The interactive shell session will start. If it is launched on a local computer, a browser will automatically open with a list of the files and directories of the home directory (option --notebook-dir=~).
If you run it on a remote computer, you should add the option --no-browser. You can then connect to the running session remotely, either by entering the address of the remote computer on the local network in the browser address bar or by setting up an ssh tunnel.

Now, in the file startup.jl, which we previously created but left empty, we should add local environment variables that point to the locations of the executable files of the Python interpreter and the Jupyter shell:

    ENV["PYTHON"] = expanduser("~/miniconda3/bin/python")
    ENV["JUPYTER"] = expanduser("~/miniconda3/bin/jupyter")

The function expanduser replaces ~ with the path to the home directory, since Julia does not expand the tilde inside string literals. Note that the variable ENV is a dictionary to which, when the compiler is started, system and user environment variables are added, as well as a number of local Julia parameters. This dictionary can be read and modified in any Julia program, which we made use of by adding two new keys to it.

After that, start the Julia REPL with the command julia, switch to package management mode and install the necessary packages:

    add PyCall PyPlot IJulia

During the installation, the package manager will see that the variables PYTHON and JUPYTER have been assigned values, and a local copy of Miniconda will not be installed. After the installation is completed, the Julia kernel will be available when Jupyter Notebook is launched, and it will be possible to create and open interactive notebooks with code in the Julia language.

In addition to IJulia, we also installed the package PyCall, which greatly simplifies calling functions from Python modules, and the package PyPlot, which makes it possible to use the library Matplotlib in Julia. In this article we do not use these packages, but they may be useful for users who are accustomed to the standard Python scientific libraries, since they bring the familiar functionality to Julia.

2.3. Pluto shell as an alternative to Jupyter

Using Jupyter with the Julia language has a drawback that is not obvious at first glance and that is associated with a fundamental feature of the architecture of the language itself: multiple dispatch of functions. In the case of Jupyter, it manifests itself as follows. When a function is initially created in a separate cell and this cell is executed once, no problems arise. However, if the programmer decides to change the body of this function without changing the signature of the arguments, then re-executing the cell with the modified code will lead to an error. If the list of arguments has been modified, there will be no errors during execution, but a new method will be created or, in other terminology, an overloaded version of the function.

This difficulty can be overcome by restarting the kernel, but this makes the workflow uncomfortable if the cells contain resource-intensive calculations, which have to be repeated every time. There will certainly be such cells, since the first initialization of graphical libraries for data visualization in Julia is extremely slow, and the main advantage of Jupyter is precisely the interactive display of data visualization results.

The Julia development community has created an alternative shell called Pluto.jl. At the moment, the repository of this package [17] ranks second in the number of stars on GitHub among Julia projects, behind only the Julia compiler itself [18]. The Pluto.jl shell is generally similar to Jupyter, but has two key differences:
- reactivity;
- no hidden workspace state.

Reactivity means that all cells of the interactive notebook are immediately re-executed if the variables on which they depend are modified, even if these variables are defined in other cells.
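To make the Jupyter-side difficulty concrete, the following minimal sketch reproduces, in a plain Julia session, the second case described above, in which changing the argument list adds a method instead of replacing the old one. The function name area and the variable names are hypothetical and serve only as an illustration.

```julia
# First version of a "cell": a function of one argument.
area(r) = π * r^2

# The cell is edited and re-run with a different argument list.
# This does not replace the old definition; it adds a second method.
area(a, b) = a * b

# Both methods now coexist in the session, so a stale call still succeeds.
n_methods = length(methods(area))
stale_value = area(1.0)       # silently uses the old one-argument method
fresh_value = area(2.0, 3.0)  # uses the new two-argument method

println("number of methods: ", n_methods)
println("area(1.0) = ", stale_value, ", area(2.0, 3.0) = ", fresh_value)
```

In a long-lived kernel such leftover methods are a form of hidden state: code that still calls the old signature keeps working even though the definition is no longer visible in any cell.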
The absence of hidden state means that if a cell is deleted, then all the variables, functions and data structures defined in it are removed from memory and become inaccessible. It is also impossible to redefine variables and functions in other cells, which removes the problem of implicit function overloading.

To install Pluto, just run add Pluto in package management mode in the Julia REPL. No additional steps are required, since Pluto is written entirely in Julia. At the time of writing, this package has only reached version 0.19.22, but it already works quite stably. Among its disadvantages are an overly minimalistic interface and high memory requirements.

To start the shell, execute the following statements in the Julia REPL:

    import Pluto
    Pluto.run()

The browser will immediately open with a welcome page, and a link will be displayed in the console, which can be used for a remote connection.

Let us note some features of the interactive notebook. The contents of the notebook are stored in a plain text file with the extension jl. The code of all cells is stored as regular Julia code, and standard code comments are used to store meta information. This makes it possible to execute Pluto notebooks like regular Julia programs by passing them to the JIT compiler.

Pluto has built-in support for local package environments. It automatically detects the packages in use by looking at the using and import statements and downloads the necessary packages to a local directory. This is useful if you need to pass the notebooks on to other users, as well as to pin specific versions of libraries.
However, this behavior can be disabled if desired, for which the following code should be added to the first cell:

    begin
      using Pkg
      Pkg.activate()
    end

This command disables the local package manager, and Pluto will use the standard Julia environment that was created when the package update was first run.

This code snippet also illustrates another feature of Pluto: by default, each cell can contain only one expression. Several lines of code must be wrapped in the construction begin ... end. This may cause some inconvenience, but the shell itself detects such cells and offers to insert begin and end automatically.

A Pluto notebook allows you to add cells with comments in Markdown format combined with LaTeX formulas. Unlike in Jupyter notebooks, in Pluto these are not special cells, but ordinary ones containing a multiline string preceded by the md prefix, for example:

    md"""# Header
    The text of the comment and the equation $\dot{x} = f(x)$
    """

Finally, we note the companion package PlutoUI, which allows you to create interactive graphical interface elements such as sliders, drop-down lists, text input fields, etc., and bind variables to them. This adds interactivity to the notebooks, which is useful, for example, for selecting parameters of certain functions. Thanks to reactivity, when the values of the variables change, all plots that depend on them are rebuilt as well.

3. The main features of the Julia language

The Julia language was originally created for scientific programming, and its syntax is very similar to the syntax of the Fortran and Python languages, which are well known to specialists in scientific computing. At the same time, it has some specific features, and without knowing them it is difficult to use third-party libraries effectively. We will illustrate all the features with examples from probability theory and mathematical statistics.

3.1. Custom data structures

One of the distinctive features of the Julia language is the high performance of user-defined data types. Many modules define a large number of custom data types and functions.

To a first approximation, a composite data type resembles a structure in the C language. It is declared using the construct struct, inside which the fields of the structure are listed together with their type annotations. As an example, consider a structure that stores the parameters of a normal distribution.

    "Normal distributions"
    struct Normal
      "first moment"
      μ::Real
      "standard deviation"
      σ::Real
    end

Let us list some important features.
- Since Julia provides full Unicode support, Greek letters and other symbols that are standard in mathematical formulas can be used as field names.
- A composite type and its fields can be provided with documentation strings that explain the purpose of the structure and its fields. These strings are similar to Python docstrings, with the difference that they can be attached to almost any object and must be placed before, not after, the declaration.
- Julia is a dynamically typed language, but it supports type annotations, which the compiler can use for code optimization and which restrict the types of values passed to functions when they are called and to structures when they are initialized.

After defining the structure, you can create objects of type Normal using the default constructor.

    N = Normal(0, 1)
    @show typeof(N)
    @show N.μ, N.σ

The macro @show prints the expression that is passed to it together with the result of evaluating this expression. So, in the example above, the following will be printed to standard output:

    typeof(N) = Normal
    (N.μ, N.σ) = (0, 1)

The default constructor is created automatically, but a constructor can also be defined explicitly in the body of the structure, for example, if you need to restrict the range of acceptable values of the fields of the structure.
After the checks, the object itself is created with the special function new, which is available only inside the body of the structure.

    struct Normal
      μ::Real
      σ::Real
      function Normal(μ, σ)
        # the standard deviation must be strictly positive
        if σ <= 0
          throw(ArgumentError("σ > 0"))
        end
        return new(μ, σ)
      end
    end

Constructors defined in the body of the structure are called inner constructors. Additional convenience constructors are usually defined outside the structure. So, you can define a constructor without arguments, which sets the parameters of the standard normal distribution.

    function Normal()
      return Normal(0.0, 1.0)
    end

3.2. Multiple dispatch

Julia implements a multiple dispatch mechanism [5, 19, 20], which, according to the developers, is more flexible than the object-oriented approach when applied to mathematical problems. Each function in Julia can have many implementations, called methods. The implementations have the same name, but differ from each other in the number of arguments, in their types, or in both. When a function is called, the compiler analyzes the arguments passed to it and calls the appropriate implementation. Operators such as + and - are also functions and can be extended with new methods for any new data type.

To illustrate multiple dispatch, we additionally define a structure that stores the parameter of the exponential distribution. Here Distribution is an abstract supertype, which has to be declared before it is used:

    abstract type Distribution end

    struct Exponential <: Distribution
      λ::Real
      function Exponential(λ)
        if λ <= 0
          throw(ArgumentError("λ > 0"))
        end
        return new(λ)
      end
    end

Now we implement two methods that calculate the PDF of the normal and exponential distributions:

    function pdf(d::Normal, x)
      return 1/(sqrt(2*π)*d.σ) * exp(-(x-d.μ)^2 / (2*d.σ^2))
    end

    function pdf(d::Exponential, x)
      return d.λ * exp(-d.λ*x)
    end

It should be noted that the first arguments of the function are provided with type annotations.
This is done so that the compiler can call the appropriate implementation depending on the type of the first argument:

    N = Normal(0, 1)
    E = Exponential(2)
    @show pdf(N, 3) # <- calls the implementation for the normal distribution
    @show pdf(E, 2) # <- calls the implementation for the exponential distribution

In addition, Julia allows you to automatically vectorize a scalar function, that is, to apply it to each element of an array, without having to implement an additional method. To do this, it is enough to use the special dot syntax:

    pdf.(Ref(N), [1, 2, 3, 4, 5])

Here Ref(N) marks the distribution object as a scalar, so that only the array is iterated over; without it, the broadcast would try to iterate over the structure N itself and fail. To achieve a similar effect, R uses the function Vectorize, and Python uses map or a list comprehension.

4. Module StatsBase.jl overview

The module StatsBase.jl [10] implements basic functions for working with statistical samples represented as one-dimensional arrays. Since most of the functions are easy to use, we will not dwell on examples and give only a short description of the main functionality of this module.
- Vectors of sample weight coefficients (weight vectors).
- Functions that calculate means (geometric, harmonic, power and weighted arithmetic mean).
- The simplest statistical functions.
- Moments that take weight vectors into account: mathematical expectation, variance, standard deviation, skewness, kurtosis and central moments of arbitrary order.
- Standardized score (Z-score).
- Entropy calculations: standard entropy, Rényi (generalized) entropy, cross-entropy, Kullback-Leibler divergence.
- Quantiles and modes.
- Robust statistics: trimming and winsorization of the sample.
- Comparison of two samples by calculating different discrete metrics.
- Calculation of scatter, covariance and correlation matrices.
- Functions that calculate the frequency of occurrence of a particular value in the sample.
- Calculation of histograms.
- Autocorrelation and autocovariance.
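As a brief illustration of a few items from this list, here is a minimal sketch, assuming that StatsBase.jl has been installed as described in section 2.1. The sample values and weights are made up for the example; the functions weights, geomean, harmmean, zscore, countmap and mode come from StatsBase, and mean comes from the Statistics standard library, which StatsBase extends with a weighted method.

```julia
using Statistics   # standard library: mean
using StatsBase    # weights, geomean, harmmean, zscore, countmap, mode

x = [1.0, 2.0, 3.0, 4.0]            # a small illustrative sample
w = weights([1.0, 1.0, 1.0, 3.0])   # weight vector for the sample

weighted_mean  = mean(x, w)    # weighted arithmetic mean
geometric_mean = geomean(x)    # geometric mean
harmonic_mean  = harmmean(x)   # harmonic mean
z = zscore(x)                  # standardized scores (Z-scores)

labels = [1, 1, 2, 3, 3, 3]
frequencies = countmap(labels) # dictionary: value => number of occurrences
most_common = mode(labels)     # most frequent value

@show weighted_mean geometric_mean harmonic_mean
@show z frequencies most_common
```

The weighted mean here is (1 + 2 + 3 + 3·4) / 6, since the last observation carries a weight of 3.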
Functions from the module StatsBase.jl are actively used in other modules, so it is included in the list of dependencies of most statistical libraries created for Julia.

5. Module Distributions.jl overview

5.1. Brief overview of the module

The module Distributions.jl [21] implements functions and methods related to probability distributions (mainly univariate discrete and continuous ones, as well as a small number of multivariate ones). It provides the following.
- Cumulative distribution functions (CDF) and probability density functions (PDF).
- Functions for calculating statistical characteristics of distributions (expectation, variance, moments, modes, quantiles, kurtosis, etc.).
- Characteristic functions and moment-generating functions of distributions.
- Methods for fitting distribution parameters to statistical data (distribution fitting) by the maximum likelihood method and by the method of sufficient statistics.

5.2. Module installation

In order to use the module Distributions.jl, it must first be installed using the command Pkg.add("Distributions"). After that, it can be imported using the statements using or import. We use the second option to avoid mixing the namespace of the module Distributions.jl with the global scope.

    import Distributions
    const dist = Distributions

Now dist serves as a short synonym for Distributions, and all functions and variables defined in the namespace of the module are accessible via the dot operator.

5.3. Creating a probability distribution

Since Julia's custom data types are not inferior in performance to the built-in data types, probability distributions in the module Distributions.jl are implemented as separate data types. For example, to set a normal distribution, you should call the constructor Normal, passing to it two parameters:
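A minimal sketch of such a call, assuming that Distributions.jl is installed and using the alias dist introduced above. The parameter values 0.0 and 1.0 (mean and standard deviation) are illustrative and are not taken from the original listing.

```julia
import Distributions
const dist = Distributions

# Normal distribution with mean μ = 0.0 and standard deviation σ = 1.0.
N = dist.Normal(0.0, 1.0)

μ, σ = dist.params(N)                # parameters stored in the object
density_at_zero = dist.pdf(N, 0.0)   # value of the PDF at x = 0
prob_below_zero = dist.cdf(N, 0.0)   # value of the CDF at x = 0

println("params = ", (μ, σ))
println("pdf(0) = ", density_at_zero, ", cdf(0) = ", prob_below_zero)
```

Because the distribution is an ordinary Julia object, the same functions pdf and cdf work for any other distribution type from the module through multiple dispatch.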

About the authors

Migran N. Gevorkyan

Peoples’ Friendship University of Russia (RUDN University)

Email: gevorkyan-mn@rudn.ru
ORCID iD: 0000-0002-4834-4895

Docent, Candidate of Sciences in Physics and Mathematics, Associate Professor of Department of Applied Probability and Informatics

6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation

Anna V. Korolkova

Peoples’ Friendship University of Russia (RUDN University)

Email: korolkova-av@rudn.ru
ORCID iD: 0000-0001-7141-7610

Docent, Candidate of Sciences in Physics and Mathematics, Associate Professor of Department of Applied Probability and Informatics

6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation

Dmitry S. Kulyabov

Peoples’ Friendship University of Russia (RUDN University); Joint Institute for Nuclear Research

Author for correspondence.
Email: kulyabov-ds@rudn.ru
ORCID iD: 0000-0002-0877-7063

Professor, Doctor of Sciences in Physics and Mathematics, Professor at the Department of Applied Probability and Informatics

6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation; 6, Joliot-Curie St., Dubna, Moscow Region, 141980, Russian Federation

References

  1. J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, “Julia: A fresh approach to numerical computing,” SIAM Review, vol. 59, no. 1, pp. 65-98, Jan. 2017. doi: 10.1137/141000671.
  2. B. Lauwens and A. Downey, Think Julia. O’Reilly Media, Inc., 2019.
  3. T. Kwong, Hands-on design patterns and best practices with Julia. Packt Publishing, 2020.
  4. C. T. Kelley, Solving nonlinear equations with iterative methods, Solvers and Examples in Julia. SIAM, 2022.
  5. J. Bezanson, J. Chen, B. Chung, S. Karpinski, V. B. Shah, J. Vitek, and L. Zoubritzky, “Julia: dynamism and performance reconciled by design,” Proceedings of the ACM on Programming Languages, vol. 2, no. OOPSLA, pp. 1-23, Oct. 2018. doi: 10.1145/3276490.
  6. M. N. Gevorkyan, A. V. Korolkova, D. S. Kulyabov, and K. P. Lovetskiy, “Statistically significant comparative performance testing of Julia and Fortran languages in case of Runge-Kutta methods,” in Numerical methods and applications. NMA 2018, ser. Lecture Notes in Computer Science, G. Nikolov, N. Kolkovska, and K. Georgiev, Eds., vol. 11189, Cham: Springer International Publishing, 2019, ch. 45, pp. 400-407. doi: 10.1007/978-3-030-10692-8_45.
  7. “JuliaStats, Statistics and machine learning made easy in julia.” (2023), [Online]. Available: https://juliastats.org/.
  8. Y. Nazarathy and H. Klok, Statistics with Julia, Fundamentals for Data Science, Machine Learning and Artificial Intelligence. Springer International Publishing, 2021. doi: 10.1007/978-3-030-70901-3.
  9. “Julia forums.” (2023), [Online]. Available: https://discourse.julialang.org.
  10. “StatsBase.jl.” (2023), [Online]. Available: https://github.com/JuliaStats/StatsBase.jl.
  11. “Distributions.jl.” (2023), [Online]. Available: https://github.com/JuliaStats/Distributions.jl.
  12. C. Führer, J. E. Solem, and O. Verdier, Scientific computing with Python, High-performance scientific computing with NumPy, SciPy, and pandas, 2nd. Packt Publishing Ltd., 2021.
  13. D. Toomey, Learning Jupyter. Packt Publishing Ltd., 2016.
  14. “Pluto.jl - interactive Julia programming environment.” (2023), [Online]. Available: https://plutojl.org/.
  15. “Windows terminal, console and command-line repo.” (2023), [Online]. Available: https://github.com/microsoft/terminal.
  16. “Miniconda.” (2023), [Online]. Available: https://docs.conda.io/en/latest/miniconda.html.
  17. “Pluto.jl GitHub.” (2023), [Online]. Available: https://github.com/fonsp/Pluto.jl.
  18. “Julia GitHub.” (2023), [Online]. Available: https://github.com/JuliaLang/julia.
  19. A. V. Korolkova, M. N. Gevorkyan, and D. S. Kulyabov, “Implementation of hyperbolic complex numbers in Julia language,” vol. 30, no. 4, pp. 318-329, Dec. 2022. doi: 10.22363/2658-4670-2022-30-4-318-329.
  20. R. Muschevici, A. Potanin, E. Tempero, and J. Noble, “Multiple dispatch in practice,” in OOPSLA’08: Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications, ACM Press, Oct. 2008, pp. 563-582. doi: 10.1145/1449764.1449808.
  21. M. Besançon, T. Papamarkou, D. Anthoff, A. Arslan, S. Byrne, D. Lin, and J. Pearson, “Distributions.jl: Definition and modeling of probability distributions in the JuliaStats ecosystem,” Journal of Statistical Software, vol. 98, no. 16, pp. 1-30, 2021. doi: 10.18637/jss.v098.i16.

Copyright (c) 2023 Gevorkyan M.N., Korolkova A.V., Kulyabov D.S.

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
