CSE 6040 Computing for Data Analytics: Methods and Tools

Lecture 13 – Vectorization in Numpy and R
Da Kuang, Polo Chau
Georgia Tech
Fall 2014
CSE 6040 COMPUTING FOR DATA ANALYSIS
Vectorization not possible for Python's lists
> import time
> def func(x):
>     return x**4
> arr = list(range(1048576))
> # Explicit for loop over a preallocated list
> t0 = time.time()
> arr2 = [None] * 1048576
> for i in arr:
>     arr2[i] = i ** 4
> print("Time using for loop:", time.time() - t0)
> # map() still calls func once per element; list() forces the evaluation
> t0 = time.time()
> arr3 = list(map(func, arr))
> print("Time using map:", time.time() - t0)
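To make the slide title concrete, here is a minimal sketch (not from the original slides) of what the arithmetic operators do on a plain list compared with a NumPy array. The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3]

# On a list, * means repetition, not elementwise multiplication.
repeated = values * 2

# ** is not defined for lists at all, so this raises TypeError.
try:
    values ** 4
    list_power_error = None
except TypeError as exc:
    list_power_error = exc

# A NumPy array applies the same operator to every element.
powered = np.array(values) ** 4

print(repeated)          # [1, 2, 3, 1, 2, 3]
print(list_power_error)
print(powered)           # [ 1 16 81]
```

This is why the list examples above need a loop or map(): the list type has no elementwise arithmetic to call.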
vectorize() in Numpy
> import time
> import numpy as np
> def func(x):
>     return x**4
> arr = np.arange(0, 1048576, 1, dtype=np.float64)
> arr2 = np.zeros(1048576)
> # Element-by-element loop over integer indices
> t0 = time.time()
> for i in range(len(arr)):
>     arr2[i] = arr[i] ** 4
> print("Time using for loop:", time.time() - t0)
> # vectorize() returns a function object
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print("Time using vectorize:", time.time() - t0)
> # Built-in NumPy function operating on the whole array
> t0 = time.time()
> arr4 = np.power(arr, 4)
> print("Time using numpy power:", time.time() - t0)
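The three approaches on this slide can also be compared with the standard timeit module, which repeats each measurement. The sketch below is an illustration, not the lecture's own benchmark: the array is smaller than on the slide so it runs quickly, and the helper name loop_version is hypothetical. The NumPy documentation describes np.vectorize as a convenience whose implementation is essentially a for loop, so it should not be expected to match np.power.

```python
import timeit
import numpy as np

def func(x):
    return x ** 4

# Smaller than the slide's array so the sketch runs quickly (an assumption).
arr = np.arange(0, 100_000, dtype=np.float64)
vecfunc = np.vectorize(func)

def loop_version():
    # Explicit Python loop writing into a preallocated array.
    out = np.zeros(len(arr))
    for i in range(len(arr)):
        out[i] = arr[i] ** 4
    return out

# All three approaches should produce the same values.
same = (np.allclose(loop_version(), vecfunc(arr))
        and np.allclose(vecfunc(arr), np.power(arr, 4)))
print("Results agree:", same)

# Run each callable three times and keep the best time.
timings = {
    "for loop": min(timeit.repeat(loop_version, number=1, repeat=3)),
    "vectorize": min(timeit.repeat(lambda: vecfunc(arr), number=1, repeat=3)),
    "np.power": min(timeit.repeat(lambda: np.power(arr, 4), number=1, repeat=3)),
}
for name, seconds in timings.items():
    print(f"{name:10s} {seconds:.6f} s")
```

The absolute numbers depend on the machine and the NumPy version, so only the relative ordering is meaningful.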
Vectorizing conditional statements
> def func(x):
>     if x <= 0:
>         return np.exp(x)
>     else:
>         return np.log(x)
> arr = np.random.randn(1048576)
> arr2 = np.zeros(1048576)
> t0 = time.time()
> vecfunc = np.vectorize(func)
> arr3 = vecfunc(arr)
> print("Time using vectorize:", time.time() - t0)
> # np.where evaluates both np.exp(arr) and np.log(arr) on the whole array,
> # so np.log warns about the non-positive entries before they are discarded
> t0 = time.time()
> arr4 = np.where(arr <= 0, np.exp(arr), np.log(arr))
> print("Time using numpy where:", time.time() - t0)
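Because np.where receives two fully computed arrays, np.log is applied to the non-positive entries too, which produces NaN values and runtime warnings that are then discarded. The sketch below shows two possible ways to handle this; it is not from the original slides, and the seed, array size, and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen only for reproducibility
arr = rng.standard_normal(1000)

# Variant 1: np.where, silencing the warnings np.log raises for x <= 0.
with np.errstate(invalid="ignore", divide="ignore"):
    via_where = np.where(arr <= 0, np.exp(arr), np.log(arr))

# Variant 2: boolean masks, so each function only sees valid inputs.
nonpos = arr <= 0
via_mask = np.empty_like(arr)
via_mask[nonpos] = np.exp(arr[nonpos])
via_mask[~nonpos] = np.log(arr[~nonpos])

print(np.allclose(via_where, via_mask))
```

The mask variant does less arithmetic per function call, at the cost of a few extra indexing steps. Which one is faster in practice would need to be measured.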
Vectorization in R
> a = 1:1e6
> c = 0
> # compute sum of squares using a for loop
> system.time(for (e in a) c = c + e^2)
##    user  system elapsed
##   0.832   0.001   0.833
> system.time(sum(a^2))
##    user  system elapsed
##   0.006   0.002   0.008
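For comparison with the NumPy slides, the same sum of squares can be sketched in Python. This is an illustrative analogue, not part of the original lecture. The int64 dtype and the closed-form check n(n+1)(2n+1)/6 are additions made here to keep the result exact and verifiable.

```python
import numpy as np

n = 1_000_000
a = np.arange(1, n + 1, dtype=np.int64)  # int64 is wide enough for this sum

# Loop version, analogous to: for (e in a) c = c + e^2
total_loop = 0
for e in a.tolist():   # iterate over plain Python ints
    total_loop += e ** 2

# Vectorized versions: elementwise square then sum, or a dot product.
total_sum = int(np.sum(a ** 2))
total_dot = int(np.dot(a, a))

# Closed form n(n+1)(2n+1)/6 as an independent check.
expected = n * (n + 1) * (2 * n + 1) // 6
print(total_loop == total_sum == total_dot == expected)
```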
Summary: Avoid using for-loops to manipulate vectors and matrices.