FAQ
My threading code looks like this, with standard static compute
threads within a class:

for (int x = 0; x < 4; x++)
    pthread_create(&threads[x], NULL, &cm::computeX, &simulation);
for (int x = 0; x < 4; x++)
    pthread_join(threads[x], NULL);

The 4 compute threads are completely independent, and the compute is really
long, so the overhead from starting the threads is low in comparison.

The threaded result is the same speed as the non-threaded result. Any
suggestions?

--
You received this message because you are subscribed to the Google Groups "android-ndk" group.
To view this discussion on the web visit https://groups.google.com/d/msg/android-ndk/-/63HlQObR6Q0J.
To post to this group, send email to android-ndk@googlegroups.com.
To unsubscribe from this group, send email to android-ndk+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/android-ndk?hl=en.


  • Shervin Emami at Oct 8, 2012 at 11:59 pm
    When you say the compute is really long, do you mean on the order of
    microseconds, milliseconds or seconds? Depending on the circumstances, the
    system might not power up all 4 cores until it has been doing something
    CPU-intensive for tens or hundreds of milliseconds. And if your code is
    mostly waiting on something else, such as GPU / RAM / SD card / network /
    other threads, then it probably doesn't need to use multiple cores.

    Cheers,
    Shervin.
    Senior Systems Engineer, NVIDIA.

    On Friday, October 5, 2012 11:09:59 PM UTC-7, llynx wrote:

    My threading code looks like this, with standard static compute
    threads within a class:

    for (int x = 0; x < 4; x++)
        pthread_create(&threads[x], NULL, &cm::computeX, &simulation);
    for (int x = 0; x < 4; x++)
        pthread_join(threads[x], NULL);

    The 4 compute threads are completely independent, and the compute is really
    long, so the overhead from starting the threads is low in comparison.

    The threaded result is the same speed as the non-threaded result. Any
    suggestions?
  • Angel Segura at Oct 9, 2012 at 8:24 pm
    Quite interesting. I asked the same question months ago and got no answer.
    Shervin's reply suggests that explicitly creating threads, whatever your
    purpose, does not by itself matter. Supposing that is true, how can we know
    the conditions under which the system decides to bring other cores online
    so that we gain the potential of threading? Could it be that the
    scheduling policies are configured in a particular way? I don't know, but
    it would be interesting if someone could fill in the details.

    My observation long ago was that the main thread monopolized most of the
    execution time, while spawned threads were left with whatever time
    remained to do their work. Try measuring the time of your threads and you
    will see.
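
One way to follow the "measure the time of your threads" suggestion is to record, inside each thread, both the CPU time it consumed and the wall-clock time it took. This is only a sketch under assumptions: `ThreadTiming`, `timedWork`, `timeThreads` and `printTimings` are made-up names, the busy loop stands in for the real compute function, and it relies on the POSIX clocks `CLOCK_THREAD_CPUTIME_ID` and `CLOCK_MONOTONIC` being available on the target.

```cpp
#include <pthread.h>
#include <time.h>
#include <cassert>
#include <cstdio>

// Hypothetical per-thread record (not from the original post).
struct ThreadTiming {
    long iterations;      // input: amount of busy work to do
    double cpu_seconds;   // output: CPU time this thread consumed
    double wall_seconds;  // output: elapsed time from thread start to finish
};

static double readClock(clockid_t clock) {
    timespec ts;
    clock_gettime(clock, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// Stand-in for the real compute function: a busy loop timed on two clocks.
static void* timedWork(void* arg) {
    ThreadTiming* t = static_cast<ThreadTiming*>(arg);
    double wall0 = readClock(CLOCK_MONOTONIC);
    double cpu0 = readClock(CLOCK_THREAD_CPUTIME_ID);
    volatile double sink = 0.0;
    for (long i = 0; i < t->iterations; i++) sink = sink + i * 0.5;
    t->cpu_seconds = readClock(CLOCK_THREAD_CPUTIME_ID) - cpu0;
    t->wall_seconds = readClock(CLOCK_MONOTONIC) - wall0;
    return NULL;
}

// Runs `count` threads (1 to 16), one per entry in `timings`, and waits for
// them. Returns true if every thread was created.
static bool timeThreads(ThreadTiming* timings, int count) {
    assert(timings != NULL);
    pthread_t threads[16];
    if (count < 1 || count > 16) return false;
    int started = 0;
    while (started < count &&
           pthread_create(&threads[started], NULL, &timedWork,
                          &timings[started]) == 0)
        started++;
    for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
    return started == count;
}

// Prints one line per thread so the two clocks can be compared.
static void printTimings(const ThreadTiming* timings, int count) {
    for (int i = 0; i < count; i++)
        std::printf("thread %d: cpu %.3f s, wall %.3f s\n", i,
                    timings[i].cpu_seconds, timings[i].wall_seconds);
}
```

If a thread's wall time is much larger than its CPU time, it spent part of its life waiting for a core (or blocked on something else) rather than computing.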

  • Shervin Emami at Oct 10, 2012 at 3:12 am
    It's true that spawning multiple threads on a multi-core CPU does not
    guarantee that they will run on multiple cores; the hardware and/or OS
    decide that for you. Generally this happens automatically without you
    worrying about it, but if you do some multi-threaded processing for a
    short duration and expect it to always use all cores, you can get
    confusing results. This often happens when you time a single iteration of
    single-threaded vs multi-threaded code and are surprised that
    multi-threading is not faster. But if you run the same test for a longer
    duration (e.g. 1 or 2 seconds), it is safe to say that intensive
    multi-threaded code will be spread across all 4 cores. You might be lucky
    and find that your 20 ms of code runs on multiple cores (e.g. if they were
    already powered up because a heavy multi-core app such as the camera or web
    browser was running at the same time), but for measuring performance you
    should use a long interval (this is recommended for normal performance
    testing anyway, including single-core code).

    To put it into perspective, let's say for simplicity that the OS scheduler
    runs just once every 10 milliseconds (this is common). If you create new
    threads, they probably won't even get a chance to start for roughly that
    long. Both the OS and the CPU hardware then have to detect, based on
    recent history, that it is worth powering up more cores rather than just
    increasing the clock frequency of the current cores (more cores will not be
    powered up unless it really looks worth it, since that results in higher
    power draw). If they do get powered up, there is a delay until the extra
    cores are ready, and only then do the threads you created start being
    moved onto them. So if each of these steps happens at, say, 10 millisecond
    intervals, it's not surprising that it can take hundreds of milliseconds
    for your code to be fully spread across 4 cores.

    Like I said, running a test for at least 1 or 2 seconds should be a safe bet
    (either by repeating your test multiple times or by using bigger data).
    Depending on how parallel the code is, you can definitely get very close to
    a 4x speedup by using 4 cores, for example in camera image processing.

    Cheers,
    Shervin.
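
The advice to measure over a sustained interval can be sketched as a fixed-duration test: every thread spins until a shared deadline and counts how many chunks of work it finished, so the total reflects throughput over the whole interval rather than one short burst. The names `SpinTask`, `spin` and `sustainedThroughput` are hypothetical, and the inner loop is a placeholder for real work; this is not the code from the post.

```cpp
#include <pthread.h>
#include <time.h>
#include <cassert>

// Hypothetical task record: each thread spins until `deadline`.
struct SpinTask {
    double deadline;            // CLOCK_MONOTONIC time (seconds) at which to stop
    unsigned long long chunks;  // output: chunks of work completed
};

static double monotonicSeconds() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void* spin(void* arg) {
    SpinTask* task = static_cast<SpinTask*>(arg);
    volatile unsigned long long sink = 0;
    while (monotonicSeconds() < task->deadline) {
        // One "chunk" of placeholder work between clock checks.
        for (int i = 0; i < 10000; i++) sink = sink + i;
        task->chunks++;
    }
    return NULL;
}

// Runs `threadCount` threads (1 to 16) for roughly `seconds` of wall time and
// returns the total number of chunks completed; returns 0 for a bad count.
static unsigned long long sustainedThroughput(int threadCount, double seconds) {
    const int kMaxThreads = 16;
    if (threadCount < 1 || threadCount > kMaxThreads) return 0;
    pthread_t threads[kMaxThreads];
    SpinTask tasks[kMaxThreads];
    double deadline = monotonicSeconds() + seconds;
    int started = 0;
    for (; started < threadCount; started++) {
        tasks[started].deadline = deadline;
        tasks[started].chunks = 0;
        if (pthread_create(&threads[started], NULL, &spin,
                           &tasks[started]) != 0)
            break;
    }
    unsigned long long total = 0;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        total += tasks[i].chunks;
    }
    return total;
}
```

Comparing `sustainedThroughput(4, 2.0)` against `sustainedThroughput(1, 2.0)` gives a rough ratio; a value near 1 over a 2 second run would suggest the threads are still sharing a single core.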

  • Fadden at Oct 10, 2012 at 8:24 pm

    On Oct 9, 8:12 pm, Shervin Emami wrote:
    Like I said, running a test for at least 1 or 2 seconds should be a safe bet
    (either by doing your test multiple times or on bigger data), and depending
    on how parallel the code is, you can definitely get very close to 4x
    speedup by using 4 cores, such as for camera image processing, etc.
    Something like this: http://bigflake.com/cpu-spinner.c.txt

  • Jeff shanab at Oct 12, 2012 at 12:49 pm
    That it depends on the OS is very true. On the desktop it is more obvious
    (Win32: all threads on the same core, period; Mac OS, mostly 64-bit, shares
    cores very nicely; Linux usually also shares nicely).
    The affinity within a process is to use the same core. Synchronization
    primitives are less expensive on the same core.
    One way to allow better core balance might be to refactor the code into
    multiple processes instead of multiple threads.
    On one project I am using ZMQ. It lets you use a message queue to
    separate async tasks as threads, processes or different machines,
    simultaneously or with a single-line code change. I have not tried
    compiling ZMQ for Android yet.
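
To see where threads actually land, each thread can record the CPU index it is running on before and after its work. This sketch assumes the C library provides `sched_getcpu()`, which is a GNU extension and returns -1 where it is unsupported; `CpuSample`, `sampleCpu`, `sampleThreads` and `printSamples` are illustrative names only.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // sched_getcpu() is a GNU extension
#endif
#include <pthread.h>
#include <sched.h>
#include <cassert>
#include <cstdio>

// Hypothetical record of where one thread ran.
struct CpuSample {
    long iterations;   // input: amount of busy work between the two samples
    int cpu_at_start;  // output: CPU index when the thread began (-1 if unknown)
    int cpu_at_end;    // output: CPU index when the thread finished (-1 if unknown)
};

static void* sampleCpu(void* arg) {
    CpuSample* s = static_cast<CpuSample*>(arg);
    s->cpu_at_start = sched_getcpu();
    volatile double sink = 0.0;
    for (long i = 0; i < s->iterations; i++) sink = sink + i * 0.5;
    s->cpu_at_end = sched_getcpu();
    return NULL;
}

// Runs `count` threads (1 to 16) and fills in one sample per thread.
// Returns true if every thread was created.
static bool sampleThreads(CpuSample* samples, int count) {
    assert(samples != NULL);
    pthread_t threads[16];
    if (count < 1 || count > 16) return false;
    int started = 0;
    while (started < count &&
           pthread_create(&threads[started], NULL, &sampleCpu,
                          &samples[started]) == 0)
        started++;
    for (int i = 0; i < started; i++) pthread_join(threads[i], NULL);
    return started == count;
}

static void printSamples(const CpuSample* samples, int count) {
    for (int i = 0; i < count; i++)
        std::printf("thread %d: started on cpu %d, ended on cpu %d\n", i,
                    samples[i].cpu_at_start, samples[i].cpu_at_end);
}
```

If all four threads report the same CPU index at both points, they were most likely time-sliced on one core for the duration of the work.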
  • Llynx at Feb 7, 2013 at 9:35 am
    This was exactly the issue. My PC implementation using Boost threads
    spawned threads and they were immediately distributed to all cores.
    On Android, however, threads are only sent to other cores if a thread
    does a 'significant' amount of work.

    I had to make my threads run indefinitely instead of terminating once a
    job was done, using a mutex to control when the next job runs.

    I ignored the threading issue for 3 months and improved my algorithm's
    efficiency so that it runs at 75 fps on 1 Tegra core in power-saver mode,
    up from 40 fps on 1 core in high-performance mode. Now I'm able to
    quadruple my simulation size and I'm still not maxing out the Tegra chip!

    So the "answer" is: make sure your pthreads last long enough to be
    distributed to other cores! (Preferably they never terminate, for realtime
    simulations!)
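
A rough sketch of the long-lived-thread approach described above, not the poster's actual code: four workers are created once, sleep on a condition variable, and wake up each time a new round of work is published. Everything here (`WorkerPool`, `startPool`, `runRound`, `stopPool`, `incrementSlot`) is a hypothetical name, and a condition variable is assumed alongside the mutex the post mentions.

```cpp
#include <pthread.h>
#include <cassert>

static const int kThreads = 4;  // matches the 4 compute threads in the question

struct WorkerPool;
struct Worker { WorkerPool* pool; int index; pthread_t thread; };

// Hypothetical pool: all names here are illustrative, not from the thread.
struct WorkerPool {
    pthread_mutex_t lock;
    pthread_cond_t workReady;  // a new round started, or the pool is stopping
    pthread_cond_t workDone;   // the last busy worker finished the round
    unsigned generation;       // incremented once per round
    int pending;               // workers still busy in the current round
    int started;               // workers successfully created
    bool shuttingDown;
    void (*job)(int index, void* context);
    void* context;
    Worker workers[kThreads];
};

static void* workerMain(void* arg) {
    Worker* self = static_cast<Worker*>(arg);
    WorkerPool* pool = self->pool;
    unsigned seen = 0;
    pthread_mutex_lock(&pool->lock);
    for (;;) {
        // Sleep until a round this worker has not run yet is published.
        while (!pool->shuttingDown && pool->generation == seen)
            pthread_cond_wait(&pool->workReady, &pool->lock);
        if (pool->shuttingDown) break;
        seen = pool->generation;
        void (*job)(int, void*) = pool->job;
        void* context = pool->context;
        pthread_mutex_unlock(&pool->lock);
        job(self->index, context);  // the compute runs outside the lock
        pthread_mutex_lock(&pool->lock);
        if (--pool->pending == 0) pthread_cond_signal(&pool->workDone);
    }
    pthread_mutex_unlock(&pool->lock);
    return NULL;
}

// Creates the workers once. On failure, still call stopPool() to clean up.
static bool startPool(WorkerPool* pool) {
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->workReady, NULL);
    pthread_cond_init(&pool->workDone, NULL);
    pool->generation = 0;
    pool->pending = 0;
    pool->started = 0;
    pool->shuttingDown = false;
    pool->job = NULL;
    pool->context = NULL;
    for (int i = 0; i < kThreads; i++) {
        pool->workers[i].pool = pool;
        pool->workers[i].index = i;
        if (pthread_create(&pool->workers[i].thread, NULL, &workerMain,
                           &pool->workers[i]) != 0)
            return false;
        pool->started++;
    }
    return true;
}

// Publishes one round of work and blocks until every worker has finished it.
static void runRound(WorkerPool* pool, void (*job)(int, void*), void* context) {
    assert(job != NULL);
    pthread_mutex_lock(&pool->lock);
    pool->job = job;
    pool->context = context;
    pool->pending = pool->started;
    pool->generation++;
    pthread_cond_broadcast(&pool->workReady);
    while (pool->pending > 0) pthread_cond_wait(&pool->workDone, &pool->lock);
    pthread_mutex_unlock(&pool->lock);
}

static void stopPool(WorkerPool* pool) {
    pthread_mutex_lock(&pool->lock);
    pool->shuttingDown = true;
    pthread_cond_broadcast(&pool->workReady);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->started; i++)
        pthread_join(pool->workers[i].thread, NULL);
    pthread_cond_destroy(&pool->workDone);
    pthread_cond_destroy(&pool->workReady);
    pthread_mutex_destroy(&pool->lock);
}

// Example job: each worker increments its own slot of an int array.
static void incrementSlot(int index, void* context) {
    static_cast<int*>(context)[index]++;
}
```

In a simulation loop, `startPool` would be called once at startup, `runRound` once per frame, and `stopPool` at exit, so the threads stay alive between frames instead of being created and joined every time.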

  • Jérôme Baril at Oct 12, 2012 at 9:59 am
    It seems to be an algorithmic problem more than an NDK problem. The time
    to create the threads could be longer than doing the job in one thread. If
    the performance curve stays flat whatever the number of threads you use,
    your process is not parallelizable (from a programming point of view). Try
    benchmarking the algorithm with a varying number of threads (1 to 4) to be
    sure that it is a candidate for multi-threaded processing.

    Good luck,
    Jérôme
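
The suggested benchmark over 1 to 4 threads could look roughly like this: a fixed amount of work is split into equal slices, and the wall time is measured for each thread count. The summation is only a placeholder workload, and `Slice`, `sumSlice`, `timeWithThreads` and `printScaling` are hypothetical names rather than anything from the thread.

```cpp
#include <pthread.h>
#include <time.h>
#include <cassert>
#include <cstdio>

// Hypothetical unit of work: sum 1/(1+i) over [begin, end).
struct Slice {
    long long begin;
    long long end;
    double partial;  // output: this slice's share of the sum
};

static void* sumSlice(void* arg) {
    Slice* s = static_cast<Slice*>(arg);
    double acc = 0.0;
    for (long long i = s->begin; i < s->end; i++) acc += 1.0 / (1.0 + i);
    s->partial = acc;
    return NULL;
}

static double wallSeconds() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

// Splits [0, totalItems) across `threadCount` threads (1 to 16). Returns the
// elapsed wall time in seconds and stores the sum in *result, or returns -1
// on bad arguments or if a thread could not be created.
static double timeWithThreads(int threadCount, long long totalItems,
                              double* result) {
    const int kMaxThreads = 16;
    assert(result != NULL);
    if (threadCount < 1 || threadCount > kMaxThreads || totalItems < 0)
        return -1.0;
    pthread_t threads[kMaxThreads];
    Slice slices[kMaxThreads];
    double start = wallSeconds();
    int started = 0;
    for (; started < threadCount; started++) {
        slices[started].begin = totalItems * started / threadCount;
        slices[started].end = totalItems * (started + 1) / threadCount;
        slices[started].partial = 0.0;
        if (pthread_create(&threads[started], NULL, &sumSlice,
                           &slices[started]) != 0)
            break;
    }
    double sum = 0.0;
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        sum += slices[i].partial;
    }
    double elapsed = wallSeconds() - start;
    if (started != threadCount) return -1.0;
    *result = sum;
    return elapsed;
}

// Prints the time and speedup relative to one thread for 1 to 4 threads.
static void printScaling(long long totalItems) {
    double sum = 0.0;
    double base = timeWithThreads(1, totalItems, &sum);
    for (int n = 1; n <= 4; n++) {
        double t = timeWithThreads(n, totalItems, &sum);
        if (base > 0.0 && t > 0.0)
            std::printf("%d thread(s): %.3f s, speedup %.2fx\n", n, t, base / t);
    }
}
```

`totalItems` should be large enough that one run takes a second or more; with a short run, a flat curve may only reflect the core power-up delay discussed earlier in the thread rather than a lack of parallelism.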


Discussion Overview
group: android-ndk
categories: android
posted: Oct 8, '12 at 5:34a
active: Feb 7, '13 at 9:35a
posts: 8
users: 6
website: developer.android.com...
