Message boards :
Science :
Source code of the project application
Message board moderation
Author | Message |
---|---|
Send message Joined: 11 Aug 17 Posts: 103 Credit: 1,973,929 RAC: 15 |
|
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Nice! I checked it and I have one request: could you translate all comments in code to English? Russian ones are not very helpful. BTW, it looks that code has potential for optimization. I tried to optimize one method, and I was able to make it about 4 times faster. I assumed that values in Matrix are from 0 to 8 (inclusive). union U { uint8_t cnt[Rank]; struct { uint64_t val1; uint8_t val2; } __attribute__((packed)) s; } __attribute__((packed)); int Square::IsLatin() { U u1, u2; // check rows and cols for (int r = 0; r < Rank; ++r) { u1.s.val1 = 0; u1.s.val2 = 0; u2.s.val1 = 0; u2.s.val2 = 0; for (int c = 0; c < Rank; ++c) { ++u1.cnt[Matrix[r][c]]; ++u2.cnt[Matrix[c][r]]; } if ((u1.s.val1 != 0x0101010101010101) || (u1.s.val2 != 1)) return 0; if ((u2.s.val1 != 0x0101010101010101) || (u2.s.val2 != 1)) return 0; } return 1; } |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Update: setting to 1 works even better than incrementing, making function about 5 times faster than original: u1.cnt[Matrix[r][c]] = 1; u2.cnt[Matrix[c][r]] = 1; |
Send message Joined: 11 Aug 17 Posts: 103 Credit: 1,973,929 RAC: 15 |
Dear Daniel, wow, it looks very impressive! Thank you for your help! We will incorporate your code and translate the comments, too :) |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Thanks! I tried one more approach and found solution that is ~6.5 times faster than original: int Square::IsLatin() { uint16_t val1, val2; // check rows and cols for (int r = 0; r < Rank; ++r) { val1 = 0; val2 = 0; for (int c = 0; c < Rank; ++c) { val1 |= 1 << Matrix[r][c]; val2 |= 1 << Matrix[c][r]; } if ((val1 != 0x1ff) || (val2 != 0x1ff)) return 0; } return 1; } |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Update: I was testing my code on SandyBridge i7 CPU. When I moved on Haswell Xeon, speedup was "only" 5 times. But when I recompiled code with BMI2 instruction set enabled, it was faster ~9 times! It will be worth to have code compiled with and without BMI2 enabled. You can to this by having two app versions and defining different plan classes for them, or by having two function versions and choosing at runtime which one should be used (pointer to function initialized at app startup will be most efficient approach). Node: on SandyBridge I was compiling code using gcc 5.4.0, on Haswell using gcc 4.8.5. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Bonus version: with AVX2 and BMI2 function is almost 14 times faster! #include "immintrin.h" int Square::IsLatin() { const __m256i v1 = _mm256_set1_epi32(1); __m256i val = _mm256_set1_epi32(0); for (int r = 0; r < Rank; ++r) { __m256i row = _mm256_loadu_si256((__m256i const *)&Matrix[r][0]); val = _mm256_or_si256(val, _mm256_sllv_epi32(v1, row)); } val = _mm256_cmpeq_epi32(val, _mm256_set1_epi32(0x1ff)); int v = _mm256_movemask_epi8(val); if (v != 0xffffffff) return 0; uint16_t val1 = 0, val2; // check rows and 9th col for (int r = 0; r < Rank; ++r) { val2 = 0; val1 |= 1 << Matrix[r][8]; for (int c = 0; c < Rank; ++c) { val2 |= 1 << Matrix[c][r]; } if (val2 != 0x1ff) return 0; } if (val1 != 0x1ff) return 0; return 1; } This code can be compiled with AVX2 only, but of course is a bit slower. On Intel CPU AVX2 and BMI2 both are supported starting from Haswell, earlier CPU versions do not support any of these sets. I do not know about AMD CPUs, maybe some model supports only of these instruction sets. In such case it may be worth to have special app version for it. BTW, AMD CPUs support XOP instruction set which provides instruction similar to vpsllvd from AVX2. I do not have experience with it, but someone else may try to use it. Update: it seems that it is enough to test that values in (Rank) rows and (Rank-1) columns (or vice-versa) are unique to tell if square is Latin or not. If you can prove it formally, this function can be simplified a bit - variable val1 and corresponding calculations can be removed. This will make it a bit faster too. |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,416,960 RAC: 12,737 |
Daniel, bitwise ariphmetics is a great idea. It is possible that it can be useful in other places of program. |
Send message Joined: 11 Aug 17 Posts: 103 Credit: 1,973,929 RAC: 15 |
The comments in the main computing module RakeDiagSearch are now in English. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
The comments in the main computing module RakeDiagSearch are now in English. Thanks! Daniel, bitwise ariphmetics is a great idea. It is possible that it can be useful in other places of program. Yes, of course. But there are also other ways to optimize, e.g. pass object to function by reference instead of by value. Simple change to function declaration makes it almost twice as fast: int Square::OrthoDegree(const Square& a, const Square& b) |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love. Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love. I was able to optimize MovePairSearch::MoveRows(), now it takes "only" ~69% of CPU time. Performance increase varies from WU to WU, but on average it is about 2 times faster. Some WUs are slower, some faster - e.g. this one completed over 3 times faster than wingman! http://rake.boincfast.ru/rakesearch/workunit.php?wuid=34933. So far there are no validation errors, so looks that I did not broke anything. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love. Sometimes speedup may be close to 4 times, e.g. for http://rake.boincfast.ru/rakesearch/workunit.php?wuid=36493. What is surprising is that i7 SandyBridbe/Win10/MinGW app (AVX only) is so much faster than official MSVC/Win app running on Xeon Haswell (AVX2+BMI2)! |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,416,960 RAC: 12,737 |
Daniel, great work! What changes in code led to the acceleration of application? |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
Daniel, great work! What changes in code led to the acceleration of application? I used bitfields instead of arrays where possible, optimized calculations and used AVX/SSE for copying and initializing data in arrays. Additionally I replaced -O2 with -O3 -ftree-vectorize to enable better optimization. I have forked your project and pushed my changes on new branch. Here is my commit with all changes: https://github.com/sirzooro/RakeSearch/commit/d89124edfd910d28a79bfcaca270e1f82e80edc7 |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
A bit more optimized app can work up to 4.8 times faster :) This is the highest result which I found so far: http://rake.boincfast.ru/rakesearch/workunit.php?wuid=49600 I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have found that MovePairSearch::ProcessOrthoSquare() calls b.IsLatin() twice. Is this call left by mistake after copy-paste, or is it a bug? I also repeat my previous request: could you create workunit file which will need up to 5 minutes to process? Current one probably needs hours to complete, It would let me to quickly check if modified app produces correct results. Now I have to install this app in BOINC and wait at least hour to get some results. |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,416,960 RAC: 12,737 |
Daniel, my congratulations! I have found that MovePairSearch::ProcessOrthoSquare() calls b.IsLatin() twice. Is this call left by mistake after copy-paste, or is it a bug? Yes, it's a mistake. Must be a.IsLatin(). But really, if (b.IsLatin()), a.IsLatin() also. Moreover, this check may be disabled (but it still needs to be checked). Thank you! I also repeat my previous request: could you create workunit file which will need up to 5 minutes to process? Current one probably needs hours to complete, It would let me to quickly check if modified app produces correct results. Now I have to install this app in BOINC and wait at least hour to get some results. I try to make workunit on several minutes. It take some time for review new version of code. Next official application code, I hope! Thank you! |
Send message Joined: 11 Aug 17 Posts: 645 Credit: 22,416,960 RAC: 12,737 |
I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly. Yes. Basic Idea is a very simple - we generate a Diagonal Latin Square and check - would we built by rows permutations of this square a new diagonal latin square, which will be orthogonal to original square. Permutation of rows in Latin Square cannot break its, but for Diagonal Latins Square, in common case it is not true due to changes on diagonals. |
Send message Joined: 8 Sep 17 Posts: 99 Credit: 402,603,726 RAC: 0 |
I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly. Thanks for explanation. In the meantime I pushed one more commit with few more tweaks, please take a look on it too: https://github.com/sirzooro/RakeSearch/commit/b82642409f3fdd4761e89d0514ad8e9b91d8777f |
©2024 The searchers team, Karelian Research Center of the Russian Academy of Sciences