Servers: Concurrency and Performance
Jeff Chase
Duke University

Inside your server
[Figure: server application (Apache, Tomcat/Java, etc.); packet queues, listen queue, accept queue]
Measures
– offered load
– response time
– throughput
– utilization

HTTP Server
• HTTP Server
– Creates a socket (socket)
– Binds to an address
– Listens to set up the accept backlog
– Can call accept to block waiting for connections
– (Can call select to check for data on multiple sockets)
• Handle request
– GET /index.html HTTP/1.0\n
<optional body, multiple lines>\n
\n

Example: Video On Demand
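Client() {
  fd = connect("server");
  write (fd, "video.mpg");      // request the video by name
  while (!eof(fd))
  {
    read (fd, buf);
    display (buf);
  }
}
Server() {
  while (1) {
    cfd = accept();             // wait for the next client
    read (cfd, name);           // read the requested file name
    fd = open (name);
    while (!eof(fd)) {
      read(fd, block);
      write (cfd, block);       // copy the file to the client block by block
    }
    close (cfd); close (fd);
  }
}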
How many clients can the server support?
Suppose, say, 200 kb/s video on a 100 Mb/s network link?
[MIT/Morris]
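The question above is a division, sketched here in Python. The figures come from the slide; treating 1 kb as 1000 bits and ignoring protocol overhead are assumptions, and `max_streams` is a name made up for this note.

```python
# Figures from the slide; 1 kb = 1000 bits is an assumption.
LINK_BPS = 100_000_000      # 100 Mb/s network link
STREAM_BPS = 200_000        # 200 kb/s per video stream

def max_streams(capacity_bps, stream_bps):
    """Upper bound on the concurrent streams a resource can carry."""
    return capacity_bps // stream_bps

print(max_streams(LINK_BPS, STREAM_BPS))   # 500
```

The link could carry about 500 streams, while the sequential Server() loop above serves one client at a time.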
Performance “analysis”
• Server capacity:
– Network (100 Mbit/s)
– Disk (20 Mbyte/s)
• Obtained performance: one client stream
• Server is limited by software structure
• If a video is 200 Kbit/s, server should be able to support more than one client.
– 500?
[MIT/Morris]

TCP socket space
[Figure: TCP socket space; hosts 128.36.232.5 and 128.36.230.2]
state: listening
address: {*.6789, *.*}
completed connection queue:
sendbuf:
recvbuf:

state: established
address: {128.36.232.5:6789, 198.69.10.10:1500}
sendbuf:
recvbuf:

state: listening
address: {*.25, *.*}
completed connection queue:
sendbuf:
recvbuf:

WebServer Flow
• Create ServerSocket
• connSocket = accept()
• read request from connSocket
• read local file
• write file to connSocket
• close connSocket
Discussion: what does each step do and how long does it take?
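One way to make the discussion concrete is to write the flow out with real socket calls. This is a minimal sketch in Python, not the lecture's code: `make_listener`, `serve_one`, and `docroot` are names invented here, the request is assumed to arrive in a single `recv`, and there is no path sanitization or error handling for malformed requests.

```python
import os
import socket

def make_listener(port=0):
    """Create ServerSocket: socket + bind + listen (port 0 = any free port)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", port))
    srv.listen(16)
    return srv

def serve_one(srv, docroot):
    """Handle exactly one connection, one slide step per line."""
    conn, _peer = srv.accept()                       # connSocket = accept()
    with conn:                                       # close connSocket on exit
        request = conn.recv(4096).decode("latin-1")  # read request
        path = request.split()[1].lstrip("/")        # "GET /index.html HTTP/1.0"
        try:
            with open(os.path.join(docroot, path), "rb") as f:
                body = f.read()                      # read local file
            status = b"200 OK"
        except OSError:
            body, status = b"not found\n", b"404 Not Found"
        # write file to connSocket
        conn.sendall(b"HTTP/1.0 " + status + b"\r\n\r\n" + body)
```

Each commented call is a place where the process can block: `accept` and `recv` wait on the network, `open`/`read` on the disk, and `sendall` on a slow receiver.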
Web Server Processing Steps
• Accept Client Connection
• Read HTTP Request Header
• Find File
• Send HTTP Response Header
• Read File, Send Data
• Some steps may block waiting on network
• Some steps may block waiting on disk I/O
Want to be able to process requests concurrently.

Process States and Transitions
[Figure: state diagram with states running (user), running (kernel), blocked, and ready; transitions labeled interrupt/exception, trap/return, Yield, Sleep, Wakeup, Run]
Server Blocking
• accept() when no connect requests are waiting on the listen queue
– What if server has multiple ports to listen on?
• E.g., 80 for HTTP, 443 for HTTPS
• open/read/write on server files
• read() on a socket, if the client is sending too slowly
• write() on a socket, if the client is receiving too slowly
– Yup, TCP has flow control like pipes
What if the server blocks while serving one client, and another client has work to do?

Concurrency and Pipelining
[Figure: CPU, DISK, and NET activity, Before and After]

Under the Hood
[Figure: queueing network with a CPU center and an I/O device center; labels: start (arrival rate λ), I/O request, I/O completion, exit (throughput λ until some center saturates)]
Better single-server performance
• Goal: run at server’s hardware speed
– Disk or network should be bottleneck
• Method:
– Pipeline blocks of each request
– Multiplex requests from multiple clients
• Two implementation approaches:
– Multithreaded server
– Asynchronous I/O
[MIT/Morris]
Multiple Process Architecture
[Figure: Process 1 through Process N in separate address spaces, each running Accept Conn, Read Request, Find File, Send Header, Read File / Send Data]
Advantages
– Simple programming while addressing blocking issue
Disadvantages
– Many processes; large context switch overheads
– Consumes much memory
– Optimizations involving sharing information among processes (e.g., caching) harder

Using Threads
[Figure: Thread 1 through Thread N, each running Accept Conn, Read Request, Find File, Send Header, Read File / Send Data]
Advantages
– Lower context switch overheads
– Shared address space simplifies optimizations (e.g., caches)
Disadvantages
– Need kernel level threads (why?)
– Some extra memory needed to support multiple stacks
– Need thread-safe programs, synchronization

Concurrent threads or processes
• Using multiple threads/processes
– so that only the flow processing a particular request is blocked
– Java: extends Thread or implements Runnable interface
• Example: a Multi-threaded WebServer, which creates a thread for each request

Threads
• A thread is a schedulable stream of control.
• defined by CPU register values (PC, SP)
• suspend: save register values in memory
• resume: restore registers from memory
• Multiple threads can execute independently:
• They can run in parallel on multiple CPUs...
– physical concurrency
• …or arbitrarily interleaved on a single CPU.
– logical concurrency
• Each thread must have its own stack.

Multithreaded server
server() {
  while (1) {
    cfd = accept();
    read (cfd, name);
    fd = open (name);
    while (!eof(fd)) {
      read(fd, block);
      write (cfd, block);
    }
    close (cfd); close (fd);
  }
}
for (i = 0; i < 10; i++)
  threadfork (server);
• When waiting for I/O, thread scheduler runs another thread
• What about references to shared data?
• Synchronization
[MIT/Morris]

Event-Driven Programming
• One execution stream: no CPU concurrency.
• Register interest in events (callbacks).
• Event loop waits for events, invokes handlers.
• No preemption of event handlers.
• Handlers generally short-lived.
[Figure: Event Loop dispatching to Event Handlers]
[Ousterhout 1995]
Single Process Event Driven (SPED)
[Figure: one Event Dispatcher driving Accept Conn, Read Request, Find File, Send Header, Read File / Send Data]
• Single threaded
• Asynchronous (non-blocking) I/O
• Advantages
– Single address space
– No synchronization
• Disadvantages
– In practice, disk reads still block

Asynchronous Multi-Process Event Driven (AMPED)
[Figure: one Event Dispatcher driving Accept Conn, Read Request, Find File, Send Header, Read File / Send Data, with helpers (Helper 1, ...)]
• Like SPED, but use helper processes/threads for disk I/O
• Use IPC to communicate with helper process
• Advantages
– Shared address space for most web server functions
– Concurrency for disk I/O
• Disadvantages
– IPC between main thread and helper threads
This hybrid model is used by the “Flash” web server.
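The AMPED idea can be sketched with Python's asyncio: one event loop handles all requests, and blocking disk reads are handed to helpers. This is only an illustration under stated assumptions: it uses helper threads from a `ThreadPoolExecutor` where Flash uses helper processes and IPC, and `blocking_read`, `fetch_file`, and `serve_many` are names invented here.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_read(path):
    """Runs in a helper thread, so a slow disk stalls only that helper."""
    with open(path, "rb") as f:
        return f.read()

async def fetch_file(path, helpers):
    """Event-loop side: hand the read to a helper and yield until it is done."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(helpers, blocking_read, path)

async def serve_many(paths, n_helpers=4):
    """Issue all reads concurrently; results come back in request order."""
    with ThreadPoolExecutor(max_workers=n_helpers) as helpers:
        return await asyncio.gather(*(fetch_file(p, helpers) for p in paths))
```

A call such as `asyncio.run(serve_many(["a.html", "b.html"]))` (file names are placeholders) keeps the dispatcher free while up to `n_helpers` reads are in flight.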
Event-Based Concurrent Servers Using I/O Multiplexing
• Maintain a pool of connected descriptors.
• Repeat the following forever:
– Use the Unix select function to block until:
• (a) New connection request arrives on the listening
descriptor.
• (b) New data arrives on an existing connected descriptor.
– If (a), add the new connection to the pool of connections.
– If (b), read any available data from the connection
• Close connection on EOF and remove it from the pool.
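The loop described above can be sketched in Python with `select.select`, which wraps the Unix call. This is a minimal echo server under stated assumptions: `event_loop` and its `max_events` parameter are invented here (the latter only so a demo can terminate), and `sendall` is left blocking for brevity.

```python
import select
import socket

def event_loop(listener, max_events=None):
    """Echo server driven by select(); max_events bounds the loop for demos."""
    pool = set()                        # pool of connected descriptors
    handled = 0
    while max_events is None or handled < max_events:
        readable, _, _ = select.select([listener, *pool], [], [])
        for sock in readable:
            if sock is listener:        # (a) new connection request
                conn, _ = listener.accept()
                pool.add(conn)
            else:                       # (b) data on a connected descriptor
                data = sock.recv(4096)
                if data:
                    sock.sendall(data)  # may block if the peer reads slowly
                else:                   # EOF: close and remove from the pool
                    pool.discard(sock)
                    sock.close()
            handled += 1
    for sock in pool:
        sock.close()
```

A single thread serves every client, so one slow handler delays all of them; that is why handlers in this style are kept short.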
Select
• If a server has many open sockets, how does it know
when one of them is ready for I/O?
int select(int n, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
struct timeval *timeout);
• Issues with scalability: alternative event interfaces
have been offered.
[CMU 15-213]
Asynchronous I/O
struct callback {
  bool (*is_ready)();
  void (*handler)(void *arg);   // called once is_ready() returns true
  void *arg;
};
main() {
  while (1) {
    for (c = each callback) {
      if (c->is_ready())
        c->handler(c->arg);
    }
  }
}
• Code is structured as a collection of handlers
• Handlers are nonblocking
• Create new handlers for blocking operations
• When operation completes, call handler
[MIT/Morris]

Asynchronous server
init() {
  on_accept(accept_cb);
}
accept_cb() {
  cfd = accept();               // will not block: a connection is waiting
  on_readable(cfd, name_cb);
}
on_readable(fd, fn) {
  c = new callback(test_readable, fn, fd);
  add c to callback list;
}
name_cb(cfd) {
  read(cfd, name);
  fd = open(name);
  on_readable(fd, read_cb);
}
read_cb(cfd, fd) {
  read(fd, block);
  on_writable(cfd, write_cb);   // wait until the client socket can take data
}
write_cb(cfd, fd) {
  write(cfd, block);
  on_readable(fd, read_cb);
}
[MIT/Morris]
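The callback list can be tried out directly. The sketch below is a Python rendering of the same idea, not the lecture's code: readiness is tested with a zero-timeout `select`, callbacks are assumed to be one-shot (removed when fired, so handlers re-register themselves as on the slide), and `run_once` is a name invented here for one pass of the `main()` loop.

```python
import select
import socket

callbacks = []   # entries are (is_ready, handler, arg), like struct callback

def on_readable(sock, handler):
    """Register handler(sock) to run once sock has data to read."""
    def is_ready():
        return bool(select.select([sock], [], [], 0)[0])   # poll, never block
    callbacks.append((is_ready, handler, sock))

def run_once():
    """One pass of the main loop: fire and unregister every ready callback."""
    fired = 0
    for entry in list(callbacks):
        is_ready, handler, arg = entry
        if is_ready():
            callbacks.remove(entry)
            handler(arg)
            fired += 1
    return fired
```

Calling `run_once()` in a `while True` loop reproduces the busy polling of `main()`; a production loop would block in `select` instead of spinning.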
Multithreaded vs. Async
Multithreaded
• Hard to program
– Locking code
– Need to know what blocks
• Coordination explicit
• State stored on thread’s stack
– Memory allocation implicit
• Context switch may be expensive
• Multiprocessors
Async
• Hard to program
– Callback code
– Need to know what blocks
• Coordination implicit
• State passed around explicitly
– Memory allocation explicit
• Lightweight context switch
• Uniprocessors
[MIT/Morris]

Coordination example
• Threaded server:
– Thread for network interface
– Interrupt wakes up network thread
– Buffer shared between server threads and network thread, protected by locks and condition variables
• Asynchronous I/O
– Poll for packets
• How often to poll?
– Or, interrupt generates an event
• Be careful: disable interrupts when manipulating callback queue.
[MIT/Morris]
One View
Should You Abandon Threads?
• No: important for high-end servers (e.g. databases).
• But, avoid threads wherever possible:
– Use events, not threads, for GUIs, distributed systems, low-end servers.
– Only use threads where true CPU concurrency is needed.
– Where threads needed, isolate usage in threaded application kernel: keep most of code single-threaded.
[Figure: Event-Driven Handlers, Threaded Kernel]
Threads!
[Ousterhout 1995]
Another view
Control Flow
• Events obscure control flow
– For programmers and tools
Threads
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  read_request(&s);
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(s);                       // s is already a pointer
  if( !in_cache(s) )
    read_file(s);
}
Events
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  …; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if( !in_cache(s) ) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
...
ExitHandler(struct session *s) {
  …; unpin(s); free_session(s);
}
[Figure: Web Server stages: Accept Conn., Read Request, Pin Cache, Read File, Write Response, Exit]
[von Behren]
Exceptions
• Exceptions complicate control flow
– Harder to understand program flow
– Cause bugs in cleanup code
State Management
• Events require manual state management
• Hard to know when to free
– Use GC or risk bugs
Threads
thread_main(int sock) {
  struct session s;
  accept_conn(sock, &s);
  if( !read_request(&s) )
    return;
  pin_cache(&s);
  write_response(&s);
  unpin(&s);
}
pin_cache(struct session *s) {
  pin(s);                       // s is already a pointer
  if( !in_cache(s) )
    read_file(s);
}
Events
AcceptHandler(event e) {
  struct session *s = new_session(e);
  RequestHandler.enqueue(s);
}
RequestHandler(struct session *s) {
  …; if( error ) return; CacheHandler.enqueue(s);
}
CacheHandler(struct session *s) {
  pin(s);
  if( !in_cache(s) ) ReadFileHandler.enqueue(s);
  else ResponseHandler.enqueue(s);
}
...
ExitHandler(struct session *s) {
  …; unpin(s); free_session(s);
}
[Figure: Web Server stages: Accept Conn., Read Request, Pin Cache, Read File, Write Response, Exit]
[von Behren]
Internet Growth and Scale
[Figure: The Internet; Thread 1 through Thread N, each running Accept Conn, Read Request, Find File, Send Header, Read File / Send Data]
How to handle all those client requests raining on your server?

Response Time Components
• Wire time (request) +
• Queuing time +
• Service demand +
• Wire time (response)
Depends on
• Cost/length of request
• Load conditions at server
[Figure: latency vs. offered load]

Servers Under Stress
[Figure: Performance vs. Load (concurrent requests, or arrival rate). Regions: Ideal; Peak: some resource at max; Overload: some resource thrashing]
[Von Behren]
Queuing Theory for Busy People
• Big Assumptions
– Queue is First-Come-First-Served (FIFO, FCFS).
– Request arrivals are independent (Poisson arrivals).
– Requests have independent service demands.
– i.e., interarrival time and service demand are exponentially distributed (noted as “M”).

Utilization
• What is the probability that the center is busy?
– Answer: some number between 0 and 1.
• What percentage of the time is the center busy?
– Answer: some number between 0 and 100
• These are interchangeable: called utilization U
• If the center is not saturated, i.e., it completes all its requests in some bounded time, then:
• U = λD = (arrivals/T * service demand)
• “Utilization Law”
• The probability that the service center is idle is 1-U.
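A small numeric check of the Utilization Law, as a Python sketch. The arrival rate and service demand below are made-up values, and `utilization` is a name invented for this note.

```python
def utilization(arrival_rate, service_demand):
    """Utilization Law: U = λD, valid only while the center is unsaturated."""
    u = arrival_rate * service_demand
    if u >= 1.0:
        raise ValueError("saturated: λD must be below 1")
    return u

# Hypothetical load: 40 requests/s, each needing 10 ms of service.
u = utilization(arrival_rate=40.0, service_demand=0.010)
print(f"busy {u:.0%} of the time, idle {1 - u:.0%}")
```

With these numbers the center is busy about 40% of the time; at 100 requests/s or more the same center would be saturated and the law no longer applies.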
[Figure: M/M/1 Service Center. Offered load: request stream @ arrival rate λ; requests wait here (queue); process for mean service demand D]

Little’s Law
• For an unsaturated queue in steady state, mean response time R and mean queue length N are governed by:
Little’s Law: N = λR
• Suppose a task T is in the system for R time units.
• During that time:
– λR new tasks arrive.
– N tasks depart (all tasks ahead of T).
• But in steady state, the flow in balances flow out, so N = λR.
– Note: this means that throughput X = λ.

Inverse Idle Time “Law”
[Figure: response time R plotted against utilization U, up to U = 1 (100%)]
Service center saturates as 1/λ approaches D: small increases in λ cause large increases in the expected response time R.
Little’s Law gives response time R = D/(1 - U).
Intuitively, each task T’s response time R = D + DN.
Substituting λR for N: R = D + D λR
Substituting U for λD: R = D + UR
R - UR = D --> R(1 - U) = D --> R = D/(1 - U)
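To see how sharply R grows near saturation, the formula can be tabulated. This Python sketch assumes the M/M/1 conditions stated earlier; the 10 ms service demand is a made-up value, and `response_time` and `queue_length` are names invented here.

```python
def response_time(service_demand, arrival_rate):
    """Mean response time R = D / (1 - U), with U = λD (unsaturated M/M/1)."""
    u = arrival_rate * service_demand
    if u >= 1.0:
        raise ValueError("saturated: no steady state")
    return service_demand / (1.0 - u)

def queue_length(service_demand, arrival_rate):
    """Little's Law: N = λR."""
    return arrival_rate * response_time(service_demand, arrival_rate)

D = 0.010                                   # 10 ms per request (made up)
for lam in (50, 80, 90, 95, 99):            # requests per second
    r = response_time(D, lam)
    print(f"U={lam * D:.2f}  R={r * 1000:.0f} ms  N={queue_length(D, lam):.1f}")
```

R is about 20 ms at U = 0.5 and roughly 1 s at U = 0.99, a fifty-fold increase for less than a doubling of the arrival rate.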
Under the Hood
[Figure: queueing network with a CPU center and an I/O device center; labels: start (arrival rate λ), I/O request, I/O completion, exit (throughput λ until some center saturates)]
What does this tell us about server behavior at saturation?
Common Bottlenecks
• No more File Descriptors
• Sockets stuck in TIME_WAIT
• High Memory Use (swapping)
• CPU Overload
• Interrupt (IRQ) Overload

Scaling Server Sites: Clustering
[Figure: Clients, virtual IP addresses (VIPs), smart switch (L4: TCP; L7: HTTP, SSL, etc.), server array]
Goals
• server load balancing
• failure detection
• access control filtering
• priorities/QoS
• request locality
• transparent caching
What to switch/filter on?
• L3 source IP and/or VIP
• L4 (TCP) ports etc.
• L7 URLs and/or cookies
• L7 SSL session IDs
[Aaron Bannert]
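As an illustration of "what to switch on", here are two toy selection policies in Python. Neither is taken from the slides: the server addresses are hypothetical, and `pick_by_source_ip` and `pick_round_robin` are names invented here. Hashing the L3 source address keeps a client on the same server (useful for request locality), while round-robin ignores the request and only spreads load.

```python
import hashlib
from itertools import count

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]   # hypothetical server array

def pick_by_source_ip(src_ip, servers=SERVERS):
    """L3 policy: hash the client address so a client sticks to one server."""
    digest = hashlib.sha256(src_ip.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

_next = count()

def pick_round_robin(servers=SERVERS):
    """Content-blind policy: rotate through the array to spread load."""
    return servers[next(_next) % len(servers)]
```

An L7 switch would apply the same idea to a URL, cookie, or SSL session ID instead of the source address, at the cost of parsing more of each request.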
Scaling Services: Replication
[Figure: Client reaching Site A or Site B across the Internet]
Distribute service load across multiple sites.
How to select a server site for each client or request?
Is it scalable?

Extra Slides
(Any new information on the following slides will not be tested.)
Problems of Multi-Thread Server
• High resource usage, context switch overhead, contended locks
• Too many threads → throughput meltdown, response time explosion
• Solution: bound total number of threads
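Bounding the number of threads is commonly done with a fixed-size pool. The sketch below uses Python's `ThreadPoolExecutor` as one possible realization; `handle`, `serve`, and the `max_requests` parameter (present only so a demo can stop) are names invented here.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def handle(conn):
    """Per-request work: echo one message back in upper case."""
    with conn:
        conn.sendall(conn.recv(1024).upper())

def serve(listener, max_threads=8, max_requests=None):
    """Accept loop that hands each connection to a bounded pool of threads."""
    served = 0
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        while max_requests is None or served < max_requests:
            conn, _ = listener.accept()
            pool.submit(handle, conn)   # queues if all workers are busy
            served += 1
```

At most `max_threads` requests run at once; further connections wait in the executor's queue, which is itself unbounded, so a real server would also cap that queue or stop accepting under overload.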
Event-Driven Programming
• Event-driven programming, also called asynchronous I/O
• Using Finite State Machines (FSM) to monitor the progress of requests
• Yields efficient and scalable concurrency
• Many examples: Click router, Flash web server, TP Monitors, etc.
• Java: asynchronous I/O
– for an example see: http://www.cafeaulait.org/books/jnp3/examples/12/

Traditional Processes
• Expensive and “heavyweight”
• One system call per process
• Fork overhead
• Coordination

Events
• Need async I/O
• Need select
• Wasn’t originally available
• Not standardized
• Immature
• But efficient
• Code is distributed all through the program
• Harder to debug and understand

Threads
• Separate interface and implementation
• Pthreads interface
• Implementation is user-level or kernel (native)
• If user-level, needs async I/O
• But hide the abstraction behind the thread interface
Reference
The State of the Art in Locally Distributed Webserver Systems
Valeria Cardellini, Emiliano Casalicchio, Michele Colajanni and Philip S. Yu