C/C++ Programming


Author

Message

Time

HdxBmx27

So, self-modifying code. That or inline compiling.
To get to the point, I'm doing CheckRevision. I'm trying to find an efficient way of doing it. Right now I'm doing it like everyone else has: strip out the values for A, B, and C and the operations {^-+/*}, and then I have a big switch statement in the main loop. That's eww.
So I was thinking, I *could* modify the code at runtime by writing over it in memory. But then there's a problem. What exactly should I write?
Aren't the math operations different depending on what platform you're compiled on?
Pseudocode:
[code]void doMath(uint32_t S){
A += S;
B += C;
C += A;
A += B;
}

switch(operator1){
case '-': WriteMemory(&doMath + 1, THE_SUB_ASM_BYTE, 1); break;
case '+': WriteMemory(&doMath + 1, THE_ADD_ASM_BYTE, 1); break;
case '^': WriteMemory(&doMath + 1, THE_XOR_ASM_BYTE, 1); break;
}
switch(operator2){
case '-': WriteMemory(&doMath + 5, THE_SUB_ASM_BYTE, 1); break;
case '+': WriteMemory(&doMath + 5, THE_ADD_ASM_BYTE, 1); break;
case '^': WriteMemory(&doMath + 5, THE_XOR_ASM_BYTE, 1); break;
}

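/* feed the data to doMath one 32-bit word at a time */
for(x = 0; x < data.length; x += 4){
doMath(*(uint32_t*)(&data+x)); /* dereference: doMath takes the word, not a pointer */
}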
[/code]
The other idea would be runtime compiling.
[code]void doMath(uint32_t S){
/* like 10 operations worth of NOPs */
}

sIn = "A=A^S B=B-C C=C+A A=A+B";
code = Compile(sIn);
WriteMemory(&doMath, code, len(code));
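/* feed the data to the freshly written doMath one 32-bit word at a time */
for(x = 0; x < data.length; x += 4){
doMath(*(uint32_t*)(&data+x)); /* dereference: doMath takes the word, not a pointer */
}[/code]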

I'm just kinda ranting here, but if you have suggestions feel free to post.
Also, if anyone has a high-resolution timer in C I can snag, that'd be great {I wanna time some functions}.

November 22, 2008, 12:22 AM
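
On the timer request above: a minimal sketch, assuming a C11 toolchain where `timespec_get` is available. The helper names `now_ns` and `elapsed_ms` are hypothetical. Note that `TIME_UTC` is wall-clock time, so a clock adjustment during the measurement would skew the result; platform-specific monotonic clocks (`clock_gettime(CLOCK_MONOTONIC, ...)` on POSIX, `QueryPerformanceCounter` on Windows) avoid that.

```c
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in nanoseconds, or 0 if the clock could not be read.
   now_ns is a hypothetical helper name, not part of any library. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    if (timespec_get(&ts, TIME_UTC) != TIME_UTC)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Difference between two now_ns() readings, in milliseconds. */
static double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    return (double)(end_ns - start_ns) / 1e6;
}

/* Typical use:
 *     uint64_t t0 = now_ns();
 *     ... code under test, ideally repeated many times ...
 *     double ms = elapsed_ms(t0, now_ns());
 */
```

Timing a single short call mostly measures noise, so it usually makes sense to run the function under test in a loop and divide the total by the iteration count.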

BreW

First off, you'd have to write the rest of the routine in assembly. Unless that is, you have some way to get your compiler to use one specific register for your operations.
Second of all, this is just retarded. So you're saying you'd like to replace a switch statement of operations with a switch statement to write operations ?... There's a better way to do what you want to do: Stick with what you have right now.
And yes, it is very architecture dependent...

November 22, 2008, 3:08 AM

HdxBmx27

Moving the switch statement outside the main loop gives ~35% speed increase.
I just think I could get even MORE of an increase if I was able to get rid of the jmp/rets.

November 22, 2008, 3:12 AM

BreW

[quote author=Hdx link=topic=17719.msg180531#msg180531 date=1227323524]
Moving the switch statement outside the main loop gives ~35% speed increase.
I just think I could get even MORE of an increase if I was able to get rid of the jmp/rets.
[/quote]
...
How did you get 35%? Did you pull that number out of your ass? Tell me how you were able to move the switch statement out? Wasn't that what you were asking in your first post?

November 22, 2008, 4:11 AM

Barabajagal

It's sort of easy... you use a switch statement before the loop and pointers to functions. Increases the speed by about 35% from the tests he and I have been doing most of today.

If you can get it any more efficient than this, I'd love to know (PB does allow inline assembly).

November 22, 2008, 4:18 AM
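
The approach described in this post (a switch before the loop, function pointers inside it) could look roughly like the following in C. This is a sketch under assumptions, not the posters' actual code: the names `binop_fn`, `pick_op`, `state_t`, and `checksum` are hypothetical, and the four-statement shape `A=A?S B=B?C C=C?A A=A?B` is taken from the example formula in the first post.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*binop_fn)(uint32_t, uint32_t);

static uint32_t op_add(uint32_t x, uint32_t y) { return x + y; }
static uint32_t op_sub(uint32_t x, uint32_t y) { return x - y; }
static uint32_t op_xor(uint32_t x, uint32_t y) { return x ^ y; }

/* The switch runs once per operator, before the main loop. */
static binop_fn pick_op(char op)
{
    switch (op) {
    case '+': return op_add;
    case '-': return op_sub;
    case '^': return op_xor;
    default:  return NULL;   /* unsupported operator */
    }
}

typedef struct { uint32_t A, B, C; } state_t;

/* ops[0..3] are the operators of the four statements
   A=A?S  B=B?C  C=C?A  A=A?B, in that order.
   Returns 0 on success, -1 if an operator is not supported. */
static int checksum(state_t *st, const char ops[4],
                    const uint32_t *data, size_t nwords)
{
    binop_fn f[4];
    for (int i = 0; i < 4; i++)
        if ((f[i] = pick_op(ops[i])) == NULL)
            return -1;

    /* The loop itself contains no switch, only indirect calls. */
    for (size_t i = 0; i < nwords; i++) {
        st->A = f[0](st->A, data[i]);
        st->B = f[1](st->B, st->C);
        st->C = f[2](st->C, st->A);
        st->A = f[3](st->A, st->B);
    }
    return 0;
}
```

Each statement now costs an indirect call and return, which is the jmp/ret overhead discussed earlier in the thread.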

BreW

[quote author=Andy link=topic=17719.msg180533#msg180533 date=1227327496]
If you can get it any more efficient than this, I'd love to know (PB does allow inline assembly).
[/quote]

Write it in assembly. Don't bother with the stack for the operation functions.

November 22, 2008, 5:19 AM

HdxBmx27

Odd, I finally got home where I can test it in my C implementation, and having the switches in the loop is 2x faster than having them outside.
This is odd, because as Andy said, we got a ~35% decrease in time in his PB implementation.
Maybe PB just sucks? Without getting the EXE info, my implementation takes 179ms for wc3. I'll write the PE loader and then work on making it faster later.

I am still curious about SMC in C. It would be awesome if I could learn to do it.

November 22, 2008, 5:40 AM

BreW

Yeah, the idea is valid to a point but flawed. If he's having to call the operations, and you're worried about the call and ret, then why call? I guess I like your idea about precalculating the calls- but why can't they be jumps?
That'd make for some serious performance gains. The processor wouldn't have to bother pushing/popping eip, or jumping back to the next instruction after the call. Also, I recommend using Intel's C compiler if you're going for the fastest code possible with C. And yes, PB just does suck.

By the way, I think you're stressing about the cost of a jump way too much. Perhaps PB wasn't smart enough to make a jump table, and the switch statement was just compiled into a bunch of sequential jumps. Or more likely, it's the file operations. In that main loop you're opening each file, reading, etc etc. A lot of performance nasty stuff. I'd try optimizing that.

November 22, 2008, 1:17 PM

Barabajagal

The file reading is done all at once, and it takes almost no time. I thought it was the file operations at first, too... but it's not.

November 22, 2008, 6:36 PM

Yegg

Out of curiosity, will this method be even the least bit noticeable with a language like C? Are you just doing this for the sake of it being more efficient? Is this to make slower languages process the information faster?

November 22, 2008, 7:51 PM

Barabajagal

I don't think he cares about slower languages... He's just trying to find the best and fastest way to do it.

November 22, 2008, 8:54 PM

HdxBmx27

[quote author=Yegg link=topic=17719.msg180540#msg180540 date=1227383513]
Out of curiosity, will this method be even the least bit noticeable with a language like C? Are you just doing this for the sake of it being more efficient? Is this to make slower languages process the information faster?
[/quote]In higher languages, yes, it will be noticeable. It'll be noticeably slower. With the switch inside the main loop I average 180ms. With it outside I average 370ms. With a little bit of looking at it, it was obvious that switches inside would be faster than push/call/ret/pop.
But it is valid for other languages {PB for example, where I got my ~35% from}.

I'm working on loading PE files now. Mapping out the sections is annoying me -.-

November 23, 2008, 12:39 AM
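
For comparison, the switch-inside-the-loop variant that this post measures as faster in C might be sketched as follows. The names `apply` and `checksum_switch` are hypothetical, and the statement order again follows the example formula from the first post. The helper is small enough that a compiler can inline it, which removes the push/call/ret/pop sequence mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply one operator; small enough for the compiler to inline. */
static inline uint32_t apply(char op, uint32_t x, uint32_t y)
{
    switch (op) {
    case '+': return x + y;
    case '-': return x - y;
    case '^': return x ^ y;
    default:  return x;      /* unknown operator: leave x unchanged */
    }
}

/* abc[0..2] hold A, B, C; ops[0..3] are the operators of
   A=A?S  B=B?C  C=C?A  A=A?B, in that order. */
static void checksum_switch(uint32_t abc[3], const char ops[4],
                            const uint32_t *data, size_t nwords)
{
    uint32_t A = abc[0], B = abc[1], C = abc[2];
    for (size_t i = 0; i < nwords; i++) {
        A = apply(ops[0], A, data[i]);
        B = apply(ops[1], B, C);
        C = apply(ops[2], C, A);
        A = apply(ops[3], A, B);
    }
    abc[0] = A; abc[1] = B; abc[2] = C;
}
```

Whether this beats the function-pointer version depends on the compiler and CPU, as the conflicting C and PB measurements in the thread show.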

Kp

Ignoring the overhead costs of preparing each method, you will get better performance by a large margin if you generate the code at runtime, for the simple reason that it avoids both the jumps/calls inherent in the switch and the repeated loads of the operator to identify what statement should be next. For a large enough data set, the preparation overhead will be lost in the time spent executing the computation itself. Depending on the data set, it's possible that fetching the data to checksum will evict the control string, making repeated switches even more expensive. By generating the checksum function, you can keep the generated code in the icache, so it is protected from getting evicted by data loads.

If not for the sheer number of possible control strings, you could unroll the switch and get performance equal to, if not greater than (due to reduced overhead), the performance of doing runtime generation.

Although it might be possible, it's very likely not worth the trouble to come up with a platform-independent way to do runtime generation. One way to minimize your platform dependence would be to create a set of platform-specific helper functions that know the opcodes for each operation. Then the main, independent loop can do:

switch(op1) {
    case '+': add_op_plus(...); break;
    case '-': add_op_minus(...); break;
    ...
}

Porting to a new platform then becomes a matter of updating the add_op_* functions, and writing something to generate new prolog/epilog.

November 23, 2008, 2:19 AM
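
The helper-function split suggested in this post can be sketched for 32-bit x86 register operations as below. Everything beyond the names `add_op_plus` and `add_op_minus` is an assumption made for illustration: the argument lists (elided in the post), `add_op_xor`, `codebuf_t`, `emit_rr`, `add_statement`, `gen_body`, and the register assignment A=eax, B=ecx, C=edx, S=ebx. The sketch only emits the instruction bytes for the four statements; it does not generate a prolog/epilog, copy the bytes into executable memory, or run them, all of which are platform specific.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register assignment, using x86 register numbers. */
enum { REG_A = 0 /* eax */, REG_B = 1 /* ecx */,
       REG_C = 2 /* edx */, REG_S = 3 /* ebx */ };

typedef struct {
    uint8_t buf[32];
    size_t  len;
} codebuf_t;

/* Append "<opcode> dst, src" in register-direct form (ModRM mod = 11). */
static int emit_rr(codebuf_t *cb, uint8_t opcode, int dst, int src)
{
    if (cb->len + 2 > sizeof cb->buf)
        return -1;                                   /* buffer full */
    cb->buf[cb->len++] = opcode;
    cb->buf[cb->len++] = (uint8_t)(0xC0 | (src << 3) | dst);
    return 0;
}

/* Platform-specific helpers: the only code that knows x86 opcodes. */
static int add_op_plus (codebuf_t *cb, int dst, int src) { return emit_rr(cb, 0x01, dst, src); } /* ADD r/m32, r32 */
static int add_op_minus(codebuf_t *cb, int dst, int src) { return emit_rr(cb, 0x29, dst, src); } /* SUB r/m32, r32 */
static int add_op_xor  (codebuf_t *cb, int dst, int src) { return emit_rr(cb, 0x31, dst, src); } /* XOR r/m32, r32 */

/* Platform-independent dispatch, run once per statement. */
static int add_statement(codebuf_t *cb, char op, int dst, int src)
{
    switch (op) {
    case '+': return add_op_plus(cb, dst, src);
    case '-': return add_op_minus(cb, dst, src);
    case '^': return add_op_xor(cb, dst, src);
    default:  return -1;                             /* unsupported operator */
    }
}

/* Emit the body for A=A?S  B=B?C  C=C?A  A=A?B. */
static int gen_body(codebuf_t *cb, const char ops[4])
{
    static const int dst[4] = { REG_A, REG_B, REG_C, REG_A };
    static const int src[4] = { REG_S, REG_C, REG_A, REG_B };
    cb->len = 0;
    for (int i = 0; i < 4; i++)
        if (add_statement(cb, ops[i], dst[i], src[i]) != 0)
            return -1;
    return 0;
}
```

For the control string `^-++` this emits `xor eax, ebx`, `sub ecx, edx`, `add edx, eax`, `add eax, ecx`. Porting to another architecture would mean rewriting only `emit_rr` and the `add_op_*` helpers, which is the point of the split.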

Valhalla Legends Forums Archive | C/C++ Programming | [C]SMC