Source code of the project application

Message boards : Science : Source code of the project application
Message board moderation

To post messages, you must log in.

1 · 2 · Next

AuthorMessage
Profile Natalia
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 103
Credit: 1,973,929
RAC: 15
Message 111 - Posted: 19 Oct 2017, 20:05:26 UTC
Last modified: 19 Oct 2017, 20:12:46 UTC

The source code of the project application is available at GitHub; the main computing module is in RakeDiagSearch directory.
--
Исходный код приложения проекта доступен в GitHub; основной расчетный модуль находится в директории RakeDiagSearch.
ID: 111 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 115 - Posted: 20 Oct 2017, 23:10:31 UTC

Nice! I checked it and I have one request: could you translate all comments in code to English? Russian ones are not very helpful.

BTW, it looks that code has potential for optimization. I tried to optimize one method, and I was able to make it about 4 times faster. I assumed that values in Matrix are from 0 to 8 (inclusive).

union U
{
    uint8_t cnt[Rank];
    struct
    {
        uint64_t val1;
        uint8_t val2;
    } __attribute__((packed)) s;
} __attribute__((packed));

int Square::IsLatin()
{
    U u1, u2;
    
    // check rows and cols
    for (int r = 0; r < Rank; ++r)
    {
        u1.s.val1 = 0;
        u1.s.val2 = 0;
        u2.s.val1 = 0;
        u2.s.val2 = 0;
        
        for (int c = 0; c < Rank; ++c)
        {
            ++u1.cnt[Matrix[r][c]];
            ++u2.cnt[Matrix[c][r]];
        }
        
        if ((u1.s.val1 != 0x0101010101010101) || (u1.s.val2 != 1))
            return 0;
        if ((u2.s.val1 != 0x0101010101010101) || (u2.s.val2 != 1))
            return 0;
    }
    
    return 1;
}

ID: 115 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 116 - Posted: 21 Oct 2017, 9:28:26 UTC

Update: setting to 1 works even better than incrementing, making function about 5 times faster than original:
            u1.cnt[Matrix[r][c]] = 1;
            u2.cnt[Matrix[c][r]] = 1;

ID: 116 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Natalia
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 103
Credit: 1,973,929
RAC: 15
Message 117 - Posted: 21 Oct 2017, 9:46:47 UTC - in response to Message 116.  

Dear Daniel,

wow, it looks very impressive! Thank you for your help!
We will incorporate your code and translate the comments, too :)
ID: 117 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 118 - Posted: 21 Oct 2017, 19:29:21 UTC
Last modified: 21 Oct 2017, 20:02:20 UTC

Thanks!

I tried one more approach and found solution that is ~6.5 times faster than original:
int Square::IsLatin()
{
    uint16_t val1, val2;
    
    // check rows and cols
    for (int r = 0; r < Rank; ++r)
    {
        val1 = 0;
        val2 = 0;
        
        for (int c = 0; c < Rank; ++c)
        {
            val1 |= 1 << Matrix[r][c];
            val2 |= 1 << Matrix[c][r];
        }
        
        if ((val1 != 0x1ff) || (val2 != 0x1ff))
            return 0;
    }
    
    return 1;
}

ID: 118 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 119 - Posted: 21 Oct 2017, 20:42:31 UTC

Update: I was testing my code on SandyBridge i7 CPU. When I moved on Haswell Xeon, speedup was "only" 5 times. But when I recompiled code with BMI2 instruction set enabled, it was faster ~9 times! It will be worth to have code compiled with and without BMI2 enabled. You can to this by having two app versions and defining different plan classes for them, or by having two function versions and choosing at runtime which one should be used (pointer to function initialized at app startup will be most efficient approach).

Node: on SandyBridge I was compiling code using gcc 5.4.0, on Haswell using gcc 4.8.5.
ID: 119 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 120 - Posted: 22 Oct 2017, 11:53:40 UTC
Last modified: 22 Oct 2017, 12:16:19 UTC

Bonus version: with AVX2 and BMI2 function is almost 14 times faster!

#include "immintrin.h"

int Square::IsLatin()
{
    const __m256i v1 = _mm256_set1_epi32(1);
    __m256i val = _mm256_set1_epi32(0);
    
    for (int r = 0; r < Rank; ++r)
    {
        __m256i row = _mm256_loadu_si256((__m256i const *)&Matrix[r][0]);
        val = _mm256_or_si256(val, _mm256_sllv_epi32(v1, row));
    }
    
    val = _mm256_cmpeq_epi32(val, _mm256_set1_epi32(0x1ff));
    int v = _mm256_movemask_epi8(val);
    if (v != 0xffffffff)
        return 0;
    
    uint16_t val1 = 0, val2;
    
    // check rows and 9th col
    for (int r = 0; r < Rank; ++r)
    {
        val2 = 0;
        
        val1 |= 1 << Matrix[r][8];
        for (int c = 0; c < Rank; ++c)
        {
            val2 |= 1 << Matrix[c][r];
        }
        
        if (val2 != 0x1ff)
            return 0;
    }
    
    if (val1 != 0x1ff)
        return 0;
    
    return 1;
}


This code can be compiled with AVX2 only, but of course is a bit slower. On Intel CPU AVX2 and BMI2 both are supported starting from Haswell, earlier CPU versions do not support any of these sets. I do not know about AMD CPUs, maybe some model supports only of these instruction sets. In such case it may be worth to have special app version for it.

BTW, AMD CPUs support XOP instruction set which provides instruction similar to vpsllvd from AVX2. I do not have experience with it, but someone else may try to use it.

Update: it seems that it is enough to test that values in (Rank) rows and (Rank-1) columns (or vice-versa) are unique to tell if square is Latin or not. If you can prove it formally, this function can be simplified a bit - variable val1 and corresponding calculations can be removed. This will make it a bit faster too.
ID: 120 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 645
Credit: 22,416,873
RAC: 12,733
Message 123 - Posted: 22 Oct 2017, 17:58:06 UTC - in response to Message 120.  
Last modified: 22 Oct 2017, 18:01:29 UTC

Daniel, bitwise ariphmetics is a great idea. It is possible that it can be useful in other places of program.
ID: 123 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Natalia
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 103
Credit: 1,973,929
RAC: 15
Message 124 - Posted: 22 Oct 2017, 18:50:29 UTC

The comments in the main computing module RakeDiagSearch are now in English.
ID: 124 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 126 - Posted: 22 Oct 2017, 21:53:10 UTC - in response to Message 124.  

The comments in the main computing module RakeDiagSearch are now in English.


Thanks!

Daniel, bitwise ariphmetics is a great idea. It is possible that it can be useful in other places of program.


Yes, of course. But there are also other ways to optimize, e.g. pass object to function by reference instead of by value. Simple change to function declaration makes it almost twice as fast:
int Square::OrthoDegree(const Square& a, const Square& b)

ID: 126 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 145 - Posted: 2 Nov 2017, 19:05:40 UTC
Last modified: 2 Nov 2017, 19:29:34 UTC

I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love.

Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes.
ID: 145 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 147 - Posted: 4 Nov 2017, 16:40:50 UTC - in response to Message 145.  

I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love.

Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes.


I was able to optimize MovePairSearch::MoveRows(), now it takes "only" ~69% of CPU time. Performance increase varies from WU to WU, but on average it is about 2 times faster. Some WUs are slower, some faster - e.g. this one completed over 3 times faster than wingman! http://rake.boincfast.ru/rakesearch/workunit.php?wuid=34933. So far there are no validation errors, so looks that I did not broke anything.
ID: 147 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 149 - Posted: 4 Nov 2017, 22:02:05 UTC - in response to Message 147.  
Last modified: 4 Nov 2017, 22:08:33 UTC

I have performed a small profiling and found the bottleneck - app spends about 92% of time in MovePairSearch::MoveRows(). This function definitely needs some love.

Edit: could you create workunit file which will need up to 5 minutes to process? Current one probably needs many hours to complete, I stopped it after few minutes.


I was able to optimize MovePairSearch::MoveRows(), now it takes "only" ~69% of CPU time. Performance increase varies from WU to WU, but on average it is about 2 times faster. Some WUs are slower, some faster - e.g. this one completed over 3 times faster than wingman! http://rake.boincfast.ru/rakesearch/workunit.php?wuid=36493. So far there are no validation errors, so looks that I did not broke anything.

Sometimes speedup may be close to 4 times, e.g. for http://rake.boincfast.ru/rakesearch/workunit.php?wuid=36493. What is surprising is that i7 SandyBridbe/Win10/MinGW app (AVX only) is so much faster than official MSVC/Win app running on Xeon Haswell (AVX2+BMI2)!
ID: 149 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 645
Credit: 22,416,873
RAC: 12,733
Message 150 - Posted: 5 Nov 2017, 10:54:58 UTC

Daniel, great work! What changes in code led to the acceleration of application?
ID: 150 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 151 - Posted: 5 Nov 2017, 15:05:58 UTC - in response to Message 150.  

Daniel, great work! What changes in code led to the acceleration of application?

I used bitfields instead of arrays where possible, optimized calculations and used AVX/SSE for copying and initializing data in arrays. Additionally I replaced -O2 with -O3 -ftree-vectorize to enable better optimization. I have forked your project and pushed my changes on new branch. Here is my commit with all changes: https://github.com/sirzooro/RakeSearch/commit/d89124edfd910d28a79bfcaca270e1f82e80edc7
ID: 151 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 152 - Posted: 6 Nov 2017, 10:35:35 UTC

A bit more optimized app can work up to 4.8 times faster :) This is the highest result which I found so far: http://rake.boincfast.ru/rakesearch/workunit.php?wuid=49600

I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly.
ID: 152 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 153 - Posted: 6 Nov 2017, 19:16:41 UTC
Last modified: 6 Nov 2017, 19:17:16 UTC

I have found that MovePairSearch::ProcessOrthoSquare() calls b.IsLatin() twice. Is this call left by mistake after copy-paste, or is it a bug?

I also repeat my previous request: could you create workunit file which will need up to 5 minutes to process? Current one probably needs hours to complete, It would let me to quickly check if modified app produces correct results. Now I have to install this app in BOINC and wait at least hour to get some results.
ID: 153 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 645
Credit: 22,416,873
RAC: 12,733
Message 154 - Posted: 7 Nov 2017, 23:03:57 UTC - in response to Message 153.  
Last modified: 7 Nov 2017, 23:06:53 UTC

Daniel, my congratulations!

I have found that MovePairSearch::ProcessOrthoSquare() calls b.IsLatin() twice. Is this call left by mistake after copy-paste, or is it a bug?

Yes, it's a mistake. Must be a.IsLatin(). But really, if (b.IsLatin()), a.IsLatin() also. Moreover, this check may be disabled (but it still needs to be checked). Thank you!

I also repeat my previous request: could you create workunit file which will need up to 5 minutes to process? Current one probably needs hours to complete, It would let me to quickly check if modified app produces correct results. Now I have to install this app in BOINC and wait at least hour to get some results.

I try to make workunit on several minutes.

It take some time for review new version of code. Next official application code, I hope! Thank you!
ID: 154 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
hoarfrost
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Help desk expert

Send message
Joined: 11 Aug 17
Posts: 645
Credit: 22,416,873
RAC: 12,733
Message 155 - Posted: 7 Nov 2017, 23:22:07 UTC - in response to Message 152.  

I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly.

Yes. Basic Idea is a very simple - we generate a Diagonal Latin Square and check - would we built by rows permutations of this square a new diagonal latin square, which will be orthogonal to original square.

Permutation of rows in Latin Square cannot break its, but for Diagonal Latins Square, in common case it is not true due to changes on diagonals.
ID: 155 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile [B@P] Daniel

Send message
Joined: 8 Sep 17
Posts: 99
Credit: 402,603,726
RAC: 0
Message 156 - Posted: 8 Nov 2017, 13:24:47 UTC - in response to Message 155.  

I have general question about MovePairSearch::MoveRows: is it generating all permutations of rows, and discarding ones with duplicated values on diagonals? I am looking for way how to vectorize this function, and I am not sure if I understand its code properly.

Yes. Basic Idea is a very simple - we generate a Diagonal Latin Square and check - would we built by rows permutations of this square a new diagonal latin square, which will be orthogonal to original square.

Permutation of rows in Latin Square cannot break its, but for Diagonal Latins Square, in common case it is not true due to changes on diagonals.

Thanks for explanation.

In the meantime I pushed one more commit with few more tweaks, please take a look on it too: https://github.com/sirzooro/RakeSearch/commit/b82642409f3fdd4761e89d0514ad8e9b91d8777f
ID: 156 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
1 · 2 · Next

Message boards : Science : Source code of the project application

©2024 The searchers team, Karelian Research Center of the Russian Academy of Sciences