Source code of the project application

Author	Message
Natalia Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 103 Credit: 1,974,727 RAC: 4	Message 111 - Posted: 19 Oct 2017, 20:05:26 UTC Last modified: 19 Oct 2017, 20:12:46 UTC The source code of the project application is available at GitHub; the main computing module is in RakeDiagSearch directory. -- Исходный код приложения проекта доступен в GitHub; основной расчетный модуль находится в директории RakeDiagSearch. ID: 111 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 115 - Posted: 20 Oct 2017, 23:10:31 UTC Nice! I checked it and I have one request: could you translate all comments in code to English? Russian ones are not very helpful. BTW, it looks that code has potential for optimization. I tried to optimize one method, and I was able to make it about 4 times faster. I assumed that values in Matrix are from 0 to 8 (inclusive). union U { uint8_t cnt[Rank]; struct { uint64_t val1; uint8_t val2; } __attribute__((packed)) s; } __attribute__((packed)); int Square::IsLatin() { U u1, u2; // check rows and cols for (int r = 0; r < Rank; ++r) { u1.s.val1 = 0; u1.s.val2 = 0; u2.s.val1 = 0; u2.s.val2 = 0; for (int c = 0; c < Rank; ++c) { ++u1.cnt[Matrix[r][c]]; ++u2.cnt[Matrix[c][r]]; } if ((u1.s.val1 != 0x0101010101010101) \|\| (u1.s.val2 != 1)) return 0; if ((u2.s.val1 != 0x0101010101010101) \|\| (u2.s.val2 != 1)) return 0; } return 1; } ID: 115 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 116 - Posted: 21 Oct 2017, 9:28:26 UTC Update: setting to 1 works even better than incrementing, making function about 5 times faster than original: u1.cnt[Matrix[r][c]] = 1; u2.cnt[Matrix[c][r]] = 1; ID: 116 · Rating: 0 · rate: / Reply Quote

Natalia Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 103 Credit: 1,974,727 RAC: 4	Message 117 - Posted: 21 Oct 2017, 9:46:47 UTC - in response to Message 116. Dear Daniel, wow, it looks very impressive! Thank you for your help! We will incorporate your code and translate the comments, too :) ID: 117 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 118 - Posted: 21 Oct 2017, 19:29:21 UTC Last modified: 21 Oct 2017, 20:02:20 UTC Thanks! I tried one more approach and found solution that is ~6.5 times faster than original: int Square::IsLatin() { uint16_t val1, val2; // check rows and cols for (int r = 0; r < Rank; ++r) { val1 = 0; val2 = 0; for (int c = 0; c < Rank; ++c) { val1 \|= 1 << Matrix[r][c]; val2 \|= 1 << Matrix[c][r]; } if ((val1 != 0x1ff) \|\| (val2 != 0x1ff)) return 0; } return 1; } ID: 118 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 119 - Posted: 21 Oct 2017, 20:42:31 UTC Update: I was testing my code on SandyBridge i7 CPU. When I moved on Haswell Xeon, speedup was "only" 5 times. But when I recompiled code with BMI2 instruction set enabled, it was faster ~9 times! It will be worth to have code compiled with and without BMI2 enabled. You can to this by having two app versions and defining different plan classes for them, or by having two function versions and choosing at runtime which one should be used (pointer to function initialized at app startup will be most efficient approach). Node: on SandyBridge I was compiling code using gcc 5.4.0, on Haswell using gcc 4.8.5. ID: 119 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 120 - Posted: 22 Oct 2017, 11:53:40 UTC Last modified: 22 Oct 2017, 12:16:19 UTC Bonus version: with AVX2 and BMI2 function is almost 14 times faster! #include "immintrin.h" int Square::IsLatin() { const __m256i v1 = _mm256_set1_epi32(1); __m256i val = _mm256_set1_epi32(0); for (int r = 0; r < Rank; ++r) { __m256i row = _mm256_loadu_si256((__m256i const *)&Matrix[r][0]); val = _mm256_or_si256(val, _mm256_sllv_epi32(v1, row)); } val = _mm256_cmpeq_epi32(val, _mm256_set1_epi32(0x1ff)); int v = _mm256_movemask_epi8(val); if (v != 0xffffffff) return 0; uint16_t val1 = 0, val2; // check rows and 9th col for (int r = 0; r < Rank; ++r) { val2 = 0; val1 \|= 1 << Matrix[r][8]; for (int c = 0; c < Rank; ++c) { val2 \|= 1 << Matrix[c][r]; } if (val2 != 0x1ff) return 0; } if (val1 != 0x1ff) return 0; return 1; } This code can be compiled with AVX2 only, but of course is a bit slower. On Intel CPU AVX2 and BMI2 both are supported starting from Haswell, earlier CPU versions do not support any of these sets. I do not know about AMD CPUs, maybe some model supports only of these instruction sets. In such case it may be worth to have special app version for it. BTW, AMD CPUs support XOP instruction set which provides instruction similar to vpsllvd from AVX2. I do not have experience with it, but someone else may try to use it. Update: it seems that it is enough to test that values in (Rank) rows and (Rank-1) columns (or vice-versa) are unique to tell if square is Latin or not. If you can prove it formally, this function can be simplified a bit - variable val1 and corresponding calculations can be removed. This will make it a bit faster too. ID: 120 · Rating: 0 · rate: / Reply Quote

hoarfrost Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 678 Credit: 25,973,732 RAC: 26,832	Message 123 - Posted: 22 Oct 2017, 17:58:06 UTC - in response to Message 120. Last modified: 22 Oct 2017, 18:01:29 UTC Daniel, bitwise ariphmetics is a great idea. It is possible that it can be useful in other places of program. ID: 123 · Rating: 0 · rate: / Reply Quote

Natalia Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 103 Credit: 1,974,727 RAC: 4	Message 124 - Posted: 22 Oct 2017, 18:50:29 UTC The comments in the main computing module RakeDiagSearch are now in English. ID: 124 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 126 - Posted: 22 Oct 2017, 21:53:10 UTC - in response to Message 124. The comments in the main computing module RakeDiagSearch are now in English. Thanks! Daniel, bitwise ariphmetics is a great idea. It is possible that it can be useful in other places of program. Yes, of course. But there are also other ways to optimize, e.g. pass object to function by reference instead of by value. Simple change to function declaration makes it almost twice as fast: int Square::OrthoDegree(const Square& a, const Square& b) ID: 126 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 145 - Posted: 2 Nov 2017, 19:05:40 UTC Last modified: 2 Nov 2017, 19:29:34 UTC I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love. Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes. ID: 145 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 147 - Posted: 4 Nov 2017, 16:40:50 UTC - in response to Message 145. I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love. Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes. I was able to optimize MovePairSearch::MoveRows(), now it takes "only" ~69% of CPU time. Performance increase varies from WU to WU, but on average it is about 2 times faster. Some WUs are slower, some faster - e.g. this one completed over 3 times faster than wingman! http://rake.boincfast.ru/rakesearch/workunit.php?wuid=34933. So far there are no validation errors, so looks that I did not broke anything. ID: 147 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 149 - Posted: 4 Nov 2017, 22:02:05 UTC - in response to Message 147. Last modified: 4 Nov 2017, 22:08:33 UTC I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love. Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes. I was able to optimize MovePairSearch::MoveRows(), now it takes "only" ~69% of CPU time. Performance increase varies from WU to WU, but on average it is about 2 times faster. Some WUs are slower, some faster - e.g. this one completed over 3 times faster than wingman! http://rake.boincfast.ru/rakesearch/workunit.php?wuid=36493. So far there are no validation errors, so looks that I did not broke anything. Sometimes speedup may be close to 4 times, e.g. for http://rake.boincfast.ru/rakesearch/workunit.php?wuid=36493. What is surprising is that i7 SandyBridbe/Win10/MinGW app (AVX only) is so much faster than official MSVC/Win app running on Xeon Haswell (AVX2+BMI2)! ID: 149 · Rating: 0 · rate: / Reply Quote

hoarfrost Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 678 Credit: 25,973,732 RAC: 26,832	Message 150 - Posted: 5 Nov 2017, 10:54:58 UTC Daniel, great work! What changes in code led to the acceleration of application? ID: 150 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 151 - Posted: 5 Nov 2017, 15:05:58 UTC - in response to Message 150. Daniel, great work! What changes in code led to the acceleration of application? I used bitfields instead of arrays where possible, optimized calculations and used AVX/SSE for copying and initializing data in arrays. Additionally I replaced -O2 with -O3 -ftree-vectorize to enable better optimization. I have forked your project and pushed my changes on new branch. Here is my commit with all changes: https://github.com/sirzooro/RakeSearch/commit/d89124edfd910d28a79bfcaca270e1f82e80edc7 ID: 151 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 152 - Posted: 6 Nov 2017, 10:35:35 UTC A bit more optimized app can work up to 4.8 times faster :) This is the highest result which I found so far: http://rake.boincfast.ru/rakesearch/workunit.php?wuid=49600 I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly. ID: 152 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 153 - Posted: 6 Nov 2017, 19:16:41 UTC Last modified: 6 Nov 2017, 19:17:16 UTC I have found that MovePairSearch::ProcessOrthoSquare() calls b.IsLatin() twice. Is this call left by mistake after copy-paste, or is it a bug? I also repeat my previous request: could you create workunit file which will need up to 5 minutes to process? Current one probably needs hours to complete, It would let me to quickly check if modified app produces correct results. Now I have to install this app in BOINC and wait at least hour to get some results. ID: 153 · Rating: 0 · rate: / Reply Quote

hoarfrost Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 678 Credit: 25,973,732 RAC: 26,832	Message 154 - Posted: 7 Nov 2017, 23:03:57 UTC - in response to Message 153. Last modified: 7 Nov 2017, 23:06:53 UTC Daniel, my congratulations! I have found that MovePairSearch::ProcessOrthoSquare() calls b.IsLatin() twice. Is this call left by mistake after copy-paste, or is it a bug? Yes, it's a mistake. Must be a.IsLatin(). But really, if (b.IsLatin()), a.IsLatin() also. Moreover, this check may be disabled (but it still needs to be checked). Thank you! I also repeat my previous request: could you create workunit file which will need up to 5 minutes to process? Current one probably needs hours to complete, It would let me to quickly check if modified app produces correct results. Now I have to install this app in BOINC and wait at least hour to get some results. I try to make workunit on several minutes. It take some time for review new version of code. Next official application code, I hope! Thank you! ID: 154 · Rating: 0 · rate: / Reply Quote

hoarfrost Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Help desk expert Send message Joined: 11 Aug 17 Posts: 678 Credit: 25,973,732 RAC: 26,832	Message 155 - Posted: 7 Nov 2017, 23:22:07 UTC - in response to Message 152. I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly. Yes. Basic Idea is a very simple - we generate a Diagonal Latin Square and check - would we built by rows permutations of this square a new diagonal latin square, which will be orthogonal to original square. Permutation of rows in Latin Square cannot break its, but for Diagonal Latins Square, in common case it is not true due to changes on diagonals. ID: 155 · Rating: 0 · rate: / Reply Quote

[B@P] Daniel Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0	Message 156 - Posted: 8 Nov 2017, 13:24:47 UTC - in response to Message 155. I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly. Yes. Basic Idea is a very simple - we generate a Diagonal Latin Square and check - would we built by rows permutations of this square a new diagonal latin square, which will be orthogonal to original square. Permutation of rows in Latin Square cannot break its, but for Diagonal Latins Square, in common case it is not true due to changes on diagonals. Thanks for explanation. In the meantime I pushed one more commit with few more tweaks, please take a look on it too: https://github.com/sirzooro/RakeSearch/commit/b82642409f3fdd4761e89d0514ad8e9b91d8777f ID: 156 · Rating: 0 · rate: / Reply Quote