#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
     Reporter:  Willem Van Onsem     |                    Owner:  nobody
         Type:  New feature          |                   Status:  closed
    Component:  Database layer       |                  Version:  5.0
  (models, ORM)                      |
     Severity:  Normal               |               Resolution:  duplicate
     Keywords:  db, bulk_update,     |             Triage Stage:
  case, when                         |  Unreviewed
    Has patch:  1                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
Comment (by Willem Van Onsem):

 I created a small test case that shows a speedup of 53.2% ±4.073% (p=95%):

 {{{
 import random
 import timeit

 def benching(n=100, m=10):
     def benchmark():
         EventStatistics.objects.bulk_update(es, fields=('scans_in',))

     def benchmark2():
         EventStatistics.objects.bulk_update2(es, fields=('scans_in',))

     es = list(EventStatistics.objects.order_by('?')[:n])
     for e in es:
         e.scans_in = random.randint(0, m)
     without = timeit.timeit(benchmark, number=15)

     es = list(EventStatistics.objects.order_by('?')[:n])
     for e in es:
         e.scans_in = random.randint(0, m)
     _with = timeit.timeit(benchmark2, number=15)

     # percentage of the new one compared to the old
     speedup = round(100 * _with / without)
     print(m, n, without, _with, f'{speedup}%')
     return speedup
 }}}

 Here `bulk_update2` is a method I added to the `QuerySet` containing the
 proposed change, whereas `bulk_update` is the original implementation.
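 The grouping at the heart of the proposal can be sketched in plain Python
 (a sketch only; `grouped_case_pairs` is a hypothetical helper, and in the
 real patch each `(value, pks)` pair would become a
 `When(pk__in=pks, then=Value(value))` instead of one `When` per row):

 {{{
 from collections import defaultdict

 def grouped_case_pairs(objs, field):
     # Map each distinct value of `field` to the pks sharing it, so a
     # single WHEN branch can cover the whole group of rows.
     groups = defaultdict(list)
     for obj in objs:
         groups[getattr(obj, field)].append(obj.pk)
     return list(groups.items())
 }}}

 With `m` possible values and `n` rows, this shrinks the `CASE` expression
 from `n` branches to at most `m`, which is where the speedup would come
 from.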

 Here `scans_in` is an `IntegerField`, `m` is the number of different
 values (determined randomly, so an upper bound), and `n` is the number of
 records to update. I used random records so that they are not all located
 near each other, and furthermore used a different set for each variant, to
 avoid one variant benefiting from cached indexes or some other speedup.
 Every test ran 15 times with the `timeit` module, with the following
 results (the first two columns are `m` and `n`, then the timings in
 seconds, and finally the new time as a percentage of the original):

 {{{
 10 100 0.23681159999978263 0.1789207000110764 76%
 10 500 1.103456400000141 0.5554370999889215 50%
 10 1000 2.4681710000004387 1.3000370000081602 53%
 10 5000 13.48228929999459 6.651690699989558 49%
 10 10000 33.66046780000033 15.352994000000763 46%
 20 100 0.25399490000563674 0.17354469999554567 68%
 20 500 1.041674799998873 0.5452321000047959 52%
 20 1000 2.2131319000036456 1.3127163000026485 59%
 20 5000 26.524631100008264 10.656895499996608 40%
 20 10000 36.67004110000562 14.241265099990414 39%
 50 100 0.21403420000569895 0.13851690001320094 65%
 50 500 1.2294649999967078 0.6158274999907007 50%
 50 1000 2.1560433999984525 1.1467072000086773 53%
 50 5000 14.895891699998174 7.622537400005967 51%
 50 10000 34.32338119999622 15.66809479999938 46%
 100 100 0.24263419999624602 0.15937580000900198 66%
 100 500 1.225836999990861 0.634526800000458 52%
 100 1000 2.245729800008121 1.1381603000045288 51%
 100 5000 13.832920399989234 6.757217299993499 49%
 100 10000 39.29845810000552 19.19906909999554 49%
 }}}

 or as a table with the "speedup" (smaller is a **stronger** speedup):

 {{{
 ---  ---  ---  ----  ----  -----
 x    100  500  1000  5000  10000
 10   76%  50%  53%   49%   46%
 20   68%  52%  59%   40%   39%
 50   65%  50%  53%   51%   46%
 100  66%  52%  51%   49%   49%
 ---  ---  ---  ----  ----  -----
 }}}

 I also ran these in the opposite direction, to make sure the first set of
 queries did not somehow make the database respond faster for the second
 (the PostgreSQL database probably keeps code pages, indexes, etc. in its
 cache). Then we achieve the following results:

 {{{
 10 100 0.16778360000171233 0.26211979999789037 156%
 10 500 0.575189399998635 1.1075479999999516 193%
 10 1000 1.0994390999985626 2.76130669999111 251%
 10 5000 6.549138200003654 14.862665100008599 227%
 10 10000 14.595775299996603 41.87046229999396 287%
 20 100 0.19783109999843873 0.26739639999868814 135%
 20 500 0.8581540000013774 1.7171591000078479 200%
 20 1000 1.173938799998723 2.096230099996319 179%
 20 5000 7.9105590999970445 15.384056599999894 194%
 20 10000 16.410469500013278 36.77300239999022 224%
 50 100 0.22504739998839796 0.22487839999666903 100%
 50 500 0.6390990999934729 1.1038762999960454 173%
 50 1000 1.2254422000114573 2.0602585000015097 168%
 50 5000 6.814096200003405 13.404201700002886 197%
 50 10000 18.954565500011086 41.927377500003786 221%
 100 100 0.2481432000058703 0.22796000000380445 92%
 100 500 0.7834087000082945 1.3490471999975853 172%
 100 1000 1.5713398999942 2.1639006000041263 138%
 100 5000 7.152072100012447 13.975013000002946 195%
 100 10000 16.15594609999971 36.78454979999515 228%
 }}}

 or as a table over the parameters (greater is a **stronger** speedup);
 here we get 186.5% ±21.048% (p=95%):

 {{{
 ---  ----  ----  ----  ----  -----
 x    100   500   1000  5000  10000
 10   156%  193%  251%  227%  287%
 20   135%  200%  179%  194%  224%
 50   100%  173%  168%  197%  221%
 100  92%   172%  138%  195%  228%
 ---  ----  ----  ----  ----  -----
 }}}
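
 The ±-figures quoted here appear consistent with a normal-approximation
 95% confidence interval over the 20 table cells; a minimal sketch of that
 computation (assuming a z-value of 1.96):

 {{{
 import statistics

 def mean_ci95(values):
     # Mean and 95% confidence half-width under a normal approximation:
     # mean ± 1.96 * s / sqrt(len(values)), with s the sample stdev.
     mean = statistics.mean(values)
     half = 1.96 * statistics.stdev(values) / len(values) ** 0.5
     return mean, half
 }}}

 Feeding it the 20 speedups in the table above reproduces the quoted
 186.5% ±21.048%.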

 Finally, I also moved the generation of random numbers into the test case,
 because once the values have been written, a repeated update to the same
 values might take less time, since it requires no writing to disk:

 {{{
 def benching(n=100, m=10):
     def benchmark():
         es = list(EventStatistics.objects.order_by('?')[:n])
         for e in es:
             e.scans_in = random.randint(0, m)
         EventStatistics.objects.bulk_update(es, fields=('scans_in',))

     def benchmark2():
         es = list(EventStatistics.objects.order_by('?')[:n])
         for e in es:
             e.scans_in = random.randint(0, m)
         EventStatistics.objects.bulk_update2(es, fields=('scans_in',))

     without = timeit.timeit(benchmark, number=15)
     _with = timeit.timeit(benchmark2, number=15)

     # percentage of the new one compared to the old
     speedup = round(100 * _with / without)
     print(m, n, without, _with, f'{speedup}%')
     return speedup
 }}}

 This will "flatten out" the speedup, since the time spent assigning values
 now counts as well, and that part is the same for both versions. With the
 original `bulk_update` first, we get the following results (76.55%
 ±10.293%, p=95%):

 {{{
 10 100 2.384173699989333 2.2892083000042476 96%
 10 500 3.9282099999982165 2.949110200002906 75%
 10 1000 5.0259445999981835 5.3350902999955 106%
 10 5000 20.10881380000501 11.846710900004837 59%
 10 10000 50.26291349998792 24.37554049999744 48%
 20 100 1.7966537000029348 2.1353922999987844 119%
 20 500 2.8477611999987857 3.5514096999977482 125%
 20 1000 4.702102600000217 3.8196782000013627 81%
 20 5000 19.417894899990642 12.532883399995626 65%
 20 10000 45.69321430000127 20.617325500003062 45%
 50 100 1.9369860000006156 1.8516613000101643 96%
 50 500 3.1590656000043964 2.6837016999925254 85%
 50 1000 4.705449500004761 2.9762405999936163 63%
 50 5000 17.380039900002885 10.148739099997329 58%
 50 10000 40.880203600012464 20.52081229999021 50%
 100 100 1.7478849000035552 1.6249813000031281 93%
 100 500 3.1730380000080913 2.2462558999977773 71%
 100 1000 4.052220900004613 3.3273870999983046 82%
 100 5000 17.978366199997254 11.050199799996335 61%
 100 10000 38.403917099989485 20.502074199990602 53%
 ---  ----  ----  ----  ----  -----
 x    100   500   1000  5000  10000
 10   96%   75%   106%  59%   48%
 20   119%  125%  81%   65%   45%
 50   96%   85%   63%   58%   50%
 100  93%   71%   82%   61%   53%
 ---  ----  ----  ----  ----  -----
 }}}

 and with the "improved" version first (142.9% ±19.563%, p=95%):

 {{{
 10 100 2.5029061999957776 1.902948799994192 76%
 10 500 2.3854384000005666 3.409526500006905 143%
 10 1000 3.062690600010683 4.10313469999528 134%
 10 5000 10.875457200003439 17.695201499998802 163%
 10 10000 21.629492600011872 48.996730700004264 227%
 20 100 2.12547889999405 2.262495199989644 106%
 20 500 2.699372699993546 3.4100137000059476 126%
 20 1000 3.840364799994859 4.572843499990995 119%
 20 5000 10.895366700002342 19.433951699989848 178%
 20 10000 22.678348500005086 42.50577279999561 187%
 50 100 3.069858699993347 2.5085584000044037 82%
 50 500 2.7575027999992017 3.3038902999978745 120%
 50 1000 3.572029700007988 4.635487000006833 130%
 50 5000 10.223477299994556 20.650777300004847 202%
 50 10000 30.925161999999546 56.15645150000637 182%
 100 100 3.82010420000006 2.949311400006991 77%
 100 500 3.6471351999934996 6.216737800001283 170%
 100 1000 6.9234703999973135 9.566761500012944 138%
 100 5000 26.763996299996506 25.526094300003024 95%
 100 10000 25.562641399999848 51.87259610000183 203%
 ---  ----  ----  ----  ----  -----
 x    100   500   1000  5000  10000
 10   76%   143%  134%  163%  227%
 20   106%  126%  119%  178%  187%
 50   82%   120%  130%  202%  182%
 100  77%   170%  138%  95%   203%
 ---  ----  ----  ----  ----  -----
 }}}

 An extra optimization idea is to put the `When` with the largest number of
 pks first, since that branch is then tried first, sparing the database
 from walking over all the cases, which is essentially a linear search.
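
 If the update is built as one `When` per distinct value, each covering a
 list of pks, that ordering is a one-line sort of the `(value, pks)` pairs
 (a sketch; the helper name is hypothetical):

 {{{
 def order_when_branches(pairs):
     # Put the branch matching the most pks first: the database scans the
     # CASE arms linearly, so the most common value should be tried first.
     return sorted(pairs, key=lambda value_pks: len(value_pks[1]),
                   reverse=True)
 }}}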
-- 
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:2>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
