[WIP] Enable concurrent read from IPFS in replay #425
base: master
Conversation
@ibnesayeed Why were the rates with threading so much lower? I would expect them to be higher on average.
ipwb/replay.py
Outdated
fetchPayload.start()
fetchHeader.join(IPFSTIMEOUT)
fetchPayload.join(IPFSTIMEOUT - (time.time() - fetch_start))
header = message['header']
Please remove superfluous space to comply with pycodestyle.
ipwb/replay.py
Outdated
if os.name != 'nt':  # Bug #310
    signal.alarm(0)
message = {'header': None, 'payload': None}
fetchHeader = Thread(target=load_from_ipfs,
Please remove superfluous space to comply with pycodestyle.
ipwb/replay.py
Outdated
# asynchronous nature of threads which is being utilized to call this function.
def load_from_ipfs(digest, message, key):
    message[key] = IPFS_API.cat(digest)
An extra blank line is needed here to comply with pycodestyle (two blank lines before a top-level function definition).
ipwb/replay.py
Outdated
# The key here could either be 'header' or 'payload'.
# Using the mutable 'message' dict instead of returning a value due to the
# asynchronous nature of threads which is being utilized to call this function.
def load_from_ipfs(digest, message, key):
We have been using camelCaseFunctionNames whereas here you used an_underscore_delimited_function_name. Please make this consistent to match the style of the rest of the package.
We are not consistent about camelCasing. I initially named it that way, but found some functions defined with underscores, so I changed it just before committing the code. Changing it back to camelCase now.
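For illustration, the helper with the camelCase rename and the spacing fixes applied would look roughly like this (the exact final name in the code may differ):

# The key here could either be 'header' or 'payload'.
# Using the mutable 'message' dict instead of returning a value due to the
# asynchronous nature of the threads used to call this function.
def loadFromIPFS(digest, message, key):
    message[key] = IPFS_API.cat(digest)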
I am not sure about that yet. I did note in my initial comment that it is counter-intuitive. It could be due to some threading overhead, or our current thread implementation might not be how it should be. More fine-grained profiling is needed to find out which step is taking how much time.
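As a starting point, a minimal sketch of such fine-grained timing, assuming the existing IPFS_API client and IPFSTIMEOUT constant; timedCat and the digest variables are illustrative placeholders, not code from this PR:

import time
from threading import Thread

def timedCat(digest, message, timings, key):
    start = time.time()
    message[key] = IPFS_API.cat(digest)  # blocking call into the IPFS daemon
    timings[key] = time.time() - start

message, timings = {}, {}
overall = time.time()
threads = [Thread(target=timedCat, args=(digest, message, timings, key))
           for digest, key in ((headerDigest, 'header'),
                               (payloadDigest, 'payload'))]
for t in threads:
    t.start()
for t in threads:
    t.join(IPFSTIMEOUT)  # shared timeout budget simplified for brevity
print(timings, 'total:', time.time() - overall)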
Codecov Report

@@            Coverage Diff             @@
##           master     #425      +/-   ##
==========================================
+ Coverage   23.29%   23.38%   +0.09%
==========================================
  Files           6        6
  Lines        1112     1116       +4
  Branches      169      167       -2
==========================================
+ Hits          259      261       +2
- Misses        836      838       +2
  Partials       17       17

Continue to review full report at Codecov.
@ibnesayeed Should we wait until we can show that this is a more efficient solution before merging, or go ahead and do it? This PR should resolve #310, but I would like to verify that on a Windows machine before merging anyway.
This PR brings some functional changes, so we should sit on it a little longer and test it in different ways in different environments first. Also, it is important to profile the code to identify the cause of the unexpected slowdown. If the slowdown is due to thread overhead, then we can find out how to make it more performant. If it is due to the fact that the IPFS server is running on the same machine, then it might perform better when a lookup is performed in the broader IPFS network.
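One way to do that profiling is Python's built-in cProfile around a single replay lookup; replayOneMemento below is a hypothetical stand-in for whichever code path is being measured:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
replayOneMemento(urir)  # hypothetical placeholder for the code path under test
profiler.disable()

# Show the 20 functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(20)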
I have been trying to profile this and observed some really strange behaviors. We need to isolate certain things in a separate script and test them. Also, before merging we need to move some exception handling into the threads, because the main thread would not be aware of exceptions raised there.
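A minimal sketch of what that could look like, assuming the shared dict is reused to carry an exception back to the main thread; the 'errors' key is an illustrative choice, not part of this PR:

def loadFromIPFS(digest, message, key):
    try:
        message[key] = IPFS_API.cat(digest)
    except Exception as e:  # e.g. IPFS client errors or timeouts
        message[key] = None
        message.setdefault('errors', {})[key] = e  # surface it to the main thread

# After joining the threads, the main thread can check and re-raise:
if message.get('errors'):
    raise list(message['errors'].values())[0]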
Please document them here.
Which things? As we talked about in person, the Py threading mechanism ought to be compared against a baseline, but that is likely not the culprit.
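One way to make that comparison in isolation is a micro-benchmark with a no-op target, so that only thread start/join overhead is measured; this is a hypothetical illustration, separate from the benchmarking repository mentioned in the next comment:

import timeit
from threading import Thread

def noop():
    pass

def threadedPair():
    a, b = Thread(target=noop), Thread(target=noop)
    a.start()
    b.start()
    a.join()
    b.join()

print('sequential:', timeit.timeit(lambda: (noop(), noop()), number=1000))
print('threaded:  ', timeit.timeit(threadedPair, number=1000))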
Created a separate benchmarking repository (IPFS API Concurrency Test) and reported the observation in the API repo (ipfs-shipyard/py-ipfs-http-client#131).
Maybe I'm missing a point, but have you considered putting a WSGI daemon in front of ipwb?
No, currently we are serving directly from Flask's built-in server. However, I am not sure what the choice of web server has to do with how the content is read from the IPFS server.
Not at all. From your benchmark I had the impression that the aim is to improve response time to web requests; a threading WSGI host process would facilitate that. If you are after reading the header and payload concurrently, trio is an option for Python 3.5+. (I know, but maybe that would motivate the port to Python 3.)
Thanks for the input, @funkyfuture. I was pushing for Py3 support from day one, but @machawk1 felt there were many people still using systems that only support Py2. However, we are now well-motivated to completely drop support for Py2 (as per #51). Once we are on Py3, we will explore asyncio for sure.
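For the record, a rough sketch of what an asyncio variant could look like once the code is on Py3, running the blocking IPFS client calls in the default executor; IPFS_API is the existing client object and the digest variables are placeholders:

import asyncio

async def loadBoth(headerDigest, payloadDigest):
    loop = asyncio.get_event_loop()
    # Run the two blocking cat() calls concurrently in the default thread pool.
    header, payload = await asyncio.gather(
        loop.run_in_executor(None, IPFS_API.cat, headerDigest),
        loop.run_in_executor(None, IPFS_API.cat, payloadDigest))
    return header, payload

header, payload = asyncio.get_event_loop().run_until_complete(
    loadBoth(headerDigest, payloadDigest))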
Note that we are trying to optimize asynchronous requests to IPFS, not the Web. Thanks for your input, @funkyfuture, and as @ibnesayeed said, we will look into using that library as applicable when we can get the rest of the code ported to Py3.
This is an initial implementation towards #379. Currently, multi-threading is only implemented in replay. However, the initial benchmark results are counter-intuitive.
[Benchmark results: Without Threading vs. With Threading]
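For reference, the threaded fetch introduced here works roughly as sketched below, simplified from the diff fragments above; headerDigest and payloadDigest stand in for the two IPFS hashes taken from the index record, the camelCase name follows the review discussion, and error handling is omitted:

import time
from threading import Thread

# The key here could either be 'header' or 'payload'. The mutable 'message'
# dict is used instead of a return value because this runs in worker threads.
def loadFromIPFS(digest, message, key):
    message[key] = IPFS_API.cat(digest)

message = {'header': None, 'payload': None}
fetchStart = time.time()
fetchHeader = Thread(target=loadFromIPFS, args=(headerDigest, message, 'header'))
fetchPayload = Thread(target=loadFromIPFS, args=(payloadDigest, message, 'payload'))
fetchHeader.start()
fetchPayload.start()
fetchHeader.join(IPFSTIMEOUT)
# The second join only gets whatever remains of the overall timeout budget.
fetchPayload.join(IPFSTIMEOUT - (time.time() - fetchStart))
header = message['header']
payload = message['payload']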