With emphasis on data-science problems
This course is available on gitlab
Contact us: andrea.dotti@gmail.com, mancinit@infn.it
I am very lucky to have had the opportunity to see both sides of data-analysis and large data-analytics. I program in C++ and Python with the latter used (mainly) for data-analysis.
I program in C++ and Python, with the latter used for data and medical image analysis and Deep Learning applications.
Python is one of the fastest-growing programming languages (among the most popular in industry, the second most active on GitHub, and number 4 on Stack Overflow).
It is getting more and more traction for science and basic research problems (see here, here, here), thus it is a good moment to learn it.
I am not an expert in Python, but I hope to be able to give you:
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.
(From Wikipedia)
The python interpreter reads the input (interactive or in a script) and executes each line of code sequentially. A python distribution comes with a REPL (Read Evaluate Print Loop) shell. E.g.:
# Technical notes, for this course, we use conda. In each
# new terminal type:
conda activate pycourse
python
Which will give you:
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
The >>> sequence is the Python prompt: type a command and see the result, for example:
a = 3+2
print(a)
Hint: type Ctrl+D to exit, or type quit().
Other (Python) shells are available to simplify or improve the user experience, for example IPython (ipython), or GUIs (Jupyter integration).
For this course:
conda activate pycourse # if new shell
jupyter notebook
Python strongly abstracts the specific hardware details.
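Pure Python is not a good choice for performance-critical code. Use a lower-level language for those parts instead.
It is a very good prototyping language.
Python does support threads, but in the standard interpreter (CPython) the global interpreter lock (GIL) lets only one thread execute Python code at a time, so threads do not speed up CPU-bound work. Parallelism across processes is available through the multiprocessing module.
Often, data-intensive routines are written in C/C++ and Python bindings are created to call the fast code from Python (see pybind11).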
Python is excellent for data analysis and/or data manipulation (ETL).
Python usually provides a very rich set of libraries, and it supports C bindings, which allow computationally heavy parts of the code to be offloaded to optimized routines.
Hint: if you have a computationally expensive routine, check whether it is already available in a library, where it is probably well optimized (e.g. do not write your own linear algebra functions, use scipy.linalg).
Hint: some popular libraries or extensions even come with GPU support to speed up the calculations if you have access to the hardware (e.g. tensorflow vs tensorflow-gpu).
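A note on threads and the global lock mentioned above: threads are available through the standard library, and they do help when tasks spend most of their time waiting (disk, network). The sketch below uses only the standard library; the function name fake_download and the delays are made up for illustration, with time.sleep standing in for real I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Hypothetical I/O-bound task: sleeping stands in for waiting on disk or network."""
    time.sleep(delay)  # the global lock is released while a thread sleeps or waits for I/O
    return delay


delays = [0.2, 0.2, 0.2, 0.2]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs the calls in worker threads and keeps the input order
    results = list(pool.map(fake_download, delays))
elapsed = time.perf_counter() - start

print(results)
print(f"4 tasks of 0.2 s each took {elapsed:.2f} s in total")
```

The total time should be close to that of a single task rather than the sum of all four. For CPU-bound work, ProcessPoolExecutor from the same module offers the same interface but uses separate processes, at the cost of a higher start-up overhead.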
Python can be used for a rich set of applications:
Traditionally, Python is considered a glue language, used to coordinate programs (possibly written in other languages) and to manipulate the input and output passed from one to the other (a pipeline). In this respect, consider it a bash on steroids.
However, the growing number of specialized libraries (e.g. the scientific Python stack), powerful visualization tools and rich I/O capabilities have made it very popular among data scientists and for scientific computing.
Code is written in modules: a module is a file containing functions, global variables and classes. Unlike in C++ and Geant4, one module usually contains several related classes and functions (as if, in Geant4, all classes related to EM Bremsstrahlung were in a single file). Note: in Python there is no .hh/.cc distinction (no forward declarations); in C++ terminology, everything is inlined.
#Import a single module and use a function in it
import os
print(os.uname())
# It is possible to import a single function from a module (and optionally change its name)
from os import uname as un
print(un())
posix.uname_result(sysname='Darwin', nodename='Carlos-MacBook-Pro.local', release='19.6.0', version='Darwin Kernel Version 19.6.0: Tue Oct 12 18:34:05 PDT 2021; root:xnu-6153.141.43~1/RELEASE_X86_64', machine='x86_64')
posix.uname_result(sysname='Darwin', nodename='Carlos-MacBook-Pro.local', release='19.6.0', version='Darwin Kernel Version 19.6.0: Tue Oct 12 18:34:05 PDT 2021; root:xnu-6153.141.43~1/RELEASE_X86_64', machine='x86_64')
A package is a directory containing one or more modules (or sub-packages). For a regular package, the directory contains a special file __init__.py that tells Python that the directory is a package. The content of the file can tailor the package behavior (see here for details).
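As a minimal sketch of such a layout, the snippet below builds a throwaway package in a temporary directory and imports it. The names mypkg, shapes and square_area are hypothetical and exist only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "mypkg"
    pkg_dir.mkdir()

    # A module inside the package
    (pkg_dir / "shapes.py").write_text(
        "def square_area(side):\n"
        "    return side * side\n"
    )
    # __init__.py runs when the package is imported; here it re-exports one function
    (pkg_dir / "__init__.py").write_text(
        "from .shapes import square_area\n"
    )

    sys.path.insert(0, tmp)        # make the temporary directory importable
    importlib.invalidate_caches()  # make sure the new files are seen
    try:
        import mypkg               # import the package
        from mypkg import shapes   # import a module from the package

        area_from_package = mypkg.square_area(3)
        area_from_module = shapes.square_area(3)
    finally:
        sys.path.remove(tmp)

print(area_from_package, area_from_module)  # 9 9
```

Because __init__.py re-exports square_area, the function is reachable directly from the package as well as from the module that defines it.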
#Import a package
import numpy
#Import a module from a package
import numpy.random as rnd
print("Call 1:",rnd.binomial(10,0.5))
#Import a function
from numpy.random import binomial
print("Call 2:",binomial(10,0.5))
#Depending on how the __init__ file is written it is possible to:
from numpy.random import *
print("Call 3:",binomial(10,0.5))
#I do not recommend import * since you may have name clashes...
Call 1: 7
Call 2: 7
Call 3: 4
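The name-clash risk of import * can be seen with two standard-library modules that both define sqrt. This is only a small sketch; the choice of math and cmath is for illustration, and the star imports must sit at the top level of a script or notebook cell.

```python
from math import *    # brings in math.sqrt (among many other names)
root_before = sqrt(4)

from cmath import *   # silently replaces sqrt with the complex-number version
root_after = sqrt(4)

print(root_before)  # 2.0
print(root_after)   # (2+0j)

# Explicit imports keep both versions usable side by side
import math
import cmath
print(math.sqrt(4), cmath.sqrt(4))
```

The same call returns a float before the second import and a complex number after it, with no warning in between.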
Python has a built-in function help(...) that can be very useful:
help(binomial)
Help on built-in function binomial:

binomial(...) method of numpy.random.mtrand.RandomState instance
    binomial(n, p, size=None)

    Draw samples from a binomial distribution.

    Samples are drawn from a binomial distribution with specified
    parameters, n trials and p probability of success where
    n an integer >= 0 and p is in the interval [0,1]. (n may be
    input as a float, but it is truncated to an integer in use)

    .. note::
        New code should use the ``binomial`` method of a ``default_rng()``
        instance instead; see `random-quick-start`.

    Parameters
    ----------
    n : int or array_like of ints
        Parameter of the distribution, >= 0. Floats are also accepted,
        but they will be truncated to integers.
    p : float or array_like of floats
        Parameter of the distribution, >= 0 and <=1.
    size : int or tuple of ints, optional
        Output shape. If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn. If size is ``None`` (default),
        a single value is returned if ``n`` and ``p`` are both scalars.
        Otherwise, ``np.broadcast(n, p).size`` samples are drawn.

    Returns
    -------
    out : ndarray or scalar
        Drawn samples from the parameterized binomial distribution, where
        each sample is equal to the number of successes over the n trials.

    See Also
    --------
    scipy.stats.binom : probability density function, distribution or
        cumulative density function, etc.
    Generator.binomial: which should be used for new code.

    Notes
    -----
    The probability density for the binomial distribution is

    .. math:: P(N) = \binom{n}{N}p^N(1-p)^{n-N},

    where :math:`n` is the number of trials, :math:`p` is the probability
    of success, and :math:`N` is the number of successes.

    When estimating the standard error of a proportion in a population by
    using a random sample, the normal distribution works well unless the
    product p*n <=5, where p = population proportion estimate, and n =
    number of samples, in which case the binomial distribution is used
    instead. For example, a sample of 15 people shows 4 who are left
    handed, and 11 who are right handed. Then p = 4/15 = 27%. 0.27*15 = 4,
    so the binomial distribution should be used in this case.

    References
    ----------
    .. [1] Dalgaard, Peter, "Introductory Statistics with R",
           Springer-Verlag, 2002.
    .. [2] Glantz, Stanton A. "Primer of Biostatistics.", McGraw-Hill,
           Fifth Edition, 2002.
    .. [3] Lentner, Marvin, "Elementary Applied Statistics", Bogden
           and Quigley, 1972.
    .. [4] Weisstein, Eric W. "Binomial Distribution." From MathWorld--A
           Wolfram Web Resource.
           http://mathworld.wolfram.com/BinomialDistribution.html
    .. [5] Wikipedia, "Binomial distribution",
           https://en.wikipedia.org/wiki/Binomial_distribution

    Examples
    --------
    Draw samples from the distribution:

    >>> n, p = 10, .5  # number of trials, probability of each trial
    >>> s = np.random.binomial(n, p, 1000)
    # result of flipping a coin 10 times, tested 1000 times.

    A real world example. A company drills 9 wild-cat oil exploration
    wells, each with an estimated probability of success of 0.1. All nine
    wells fail. What is the probability of that happening?

    Let's do 20,000 trials of the model, and count the number that
    generate zero positive results.

    >>> sum(np.random.binomial(9, 0.1, 20000) == 0)/20000.
    # answer = 0.38885, or 38%.
Documentation is written together with the code, as documentation strings (docstrings) and comments. If you follow some specific rules (see here) you get nicely formatted documentation (tools exist to generate documentation from the code):
def foo():
    '''
    This is the documentation.
    It is written as multi-line comment
    '''
    # This is a single line comment
    return

help(foo)
Help on function foo in module __main__:

foo()
    This is the documentation.
    It is written as multi-line comment
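As a sketch of what following such rules can look like, the function below uses a NumPy-style docstring layout. This is one common convention, picked here as an assumption, and the function scale is hypothetical. The text is stored with the function and can be read back with inspect.getdoc, which returns essentially what help shows.

```python
import inspect


def scale(values, factor=2.0):
    """
    Multiply every element of a sequence by a constant.

    Parameters
    ----------
    values : list of float
        Numbers to rescale.
    factor : float, optional
        Multiplicative constant (default 2.0).

    Returns
    -------
    list of float
        A new list with the rescaled numbers.
    """
    return [v * factor for v in values]


doc = inspect.getdoc(scale)   # docstring with the indentation cleaned up
print(doc.splitlines()[0])    # Multiply every element of a sequence by a constant.
print(scale([1.0, 2.5]))      # [2.0, 5.0]
```

Sections such as Parameters and Returns are plain text to Python itself; it is the documentation tools that recognize them.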
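python to enter the interactive Python interpreter. quit() (or Ctrl+D) to quit.
python myscript.py to run a script.
python -c "print(3+2)" to execute code passed as a string.
python -m os to run a library module as a script.
python -i -m os to run the module and then stay in the interactive interpreter. -i should come before -m. Whatever follows the name of the module is passed as arguments to it!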
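To illustrate how whatever follows the module name is passed along as arguments, here is a sketch that writes a tiny module to a temporary directory and runs it with -m through subprocess. The module name show_args is hypothetical, and sys.executable is used so that the same interpreter is launched.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical module that prints the arguments it receives
    (Path(tmp) / "show_args.py").write_text(
        "import sys\n"
        "print(sys.argv[1:])\n"
    )

    # Equivalent to typing: python -c "print(3+2)"
    out_c = subprocess.run(
        [sys.executable, "-c", "print(3+2)"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # Equivalent to typing (from inside tmp): python -m show_args first --second 3
    out_m = subprocess.run(
        [sys.executable, "-m", "show_args", "first", "--second", "3"],
        capture_output=True, text=True, check=True, cwd=tmp,
    ).stdout.strip()

print(out_c)  # 5
print(out_m)  # ['first', '--second', '3']
```

Note that --second reaches the module untouched: options placed after the module name belong to the module, not to the interpreter, which is why -i has to come first.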