hm2 read(), write() functions need significant time #70
Comments
hm, the above samples were from the RIP build which is compiled with
played with
I'll have a look at the hm2_soc_read and hm2_soc_write functions tomorrow. I think this byte-wise reading is not necessary. |
If you want to burst read, we need to make sure the firmware can handle it. For example, I don't think the AXI-Lite interface I implemented can do burst reads out of the box. At the driver level there is a translation RAM (tram), which I think is an attempt to avoid calling write repeatedly. We could expose a function that reads the entire range specified by the tram instead of issuing individual reads and writes, if you think the cost is function call overhead from hostmot2 down to the hm2 llio. |
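To make the batching idea concrete, here is a minimal sketch that compares one low-level read per tram entry with a single read over the whole address range, followed by a scatter into the per-entry buffers. Everything in it is hypothetical: `window` is a plain array standing in for the mapped registers, and `llio_read`, `tram_entry_t`, `tram_read_individual` and `tram_read_batched` are illustration names, not the hostmot2 API. It also assumes that reading the gaps between entries has no side effects and that the firmware tolerates the larger access, which is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the mapped register window (not real hardware). */
#define WINDOW_SIZE 256u
#define MAX_SPAN    128u
static uint8_t window[WINDOW_SIZE];
static unsigned llio_read_calls;   /* counts trips down to the low-level read */

/* Hypothetical low-level read: one call per bus transaction group. */
static void llio_read(size_t addr, void *buf, size_t size)
{
    assert(addr <= WINDOW_SIZE && size <= WINDOW_SIZE - addr);
    llio_read_calls++;
    memcpy(buf, &window[addr], size);
}

/* Hypothetical tram entry: where to read, how much, and where to put it. */
typedef struct {
    size_t   addr;
    size_t   size;
    uint8_t *buf;
} tram_entry_t;

/* One low-level call per entry. */
static void tram_read_individual(const tram_entry_t *e, size_t n)
{
    for (size_t i = 0; i < n; i++)
        llio_read(e[i].addr, e[i].buf, e[i].size);
}

/* One low-level call for the whole span, then scatter into the entry buffers.
 * Returns 0 if the span does not fit the local staging buffer. */
static int tram_read_batched(const tram_entry_t *e, size_t n)
{
    if (n == 0)
        return 1;

    size_t lo = e[0].addr;
    size_t hi = e[0].addr + e[0].size;
    for (size_t i = 1; i < n; i++) {
        size_t end = e[i].addr + e[i].size;
        if (e[i].addr < lo) lo = e[i].addr;
        if (end > hi)       hi = end;
    }
    if (hi - lo > MAX_SPAN)
        return 0;

    uint8_t span[MAX_SPAN];
    llio_read(lo, span, hi - lo);          /* gaps between entries are read too */
    for (size_t i = 0; i < n; i++)
        memcpy(e[i].buf, &span[e[i].addr - lo], e[i].size);
    return 1;
}
```

With two entries, the individual variant makes two low-level calls and the batched variant makes one. Whether that saves anything on the real bus depends on the firmware accepting the larger access.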
One other thing to note: The AXI bus and avalon bus are clocked much slower than the processor. For instance, I can't go any faster than 100MHz on the bus without timing closure issues. That's probably what is causing the hold up. I'm not sure what the avalon bus is clocked at, but I imagine it is something similar. |
batching might help, yes - not sure how invasive that would be. For now I'm concentrating on dumbing-down: replacing these supersmart loops with memcpy (which is super-optimized anyway) shaves another 10-20% off hm2_soc_read/hm2_soc_write, so with that we're in the 70uS ballpark for all four functs - that's ok for now |
Wow, that's a lot. With compiler optimizations turned on? The loops were a guard of some kind to protect against misaligned reads, but I don't think they're entirely necessary. I'm a fan of the memcpy replacement. |
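For readers who have not looked at the driver, the change under discussion has roughly the following shape. This is a sketch, not the actual hm2_soc_read code: `fake_window`, `read_bytewise` and `read_memcpy` are hypothetical names, and a plain array stands in for the mmap()ed register region. On a real device mapping, the access width memcpy ends up using is decided by the C library, so whether the swap is safe depends on the platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the mapped hm2 register window. */
#define FAKE_WINDOW_SIZE 256u
static uint8_t fake_window[FAKE_WINDOW_SIZE];

/* Returns nonzero if [addr, addr + size) lies inside the window. */
static int in_window(size_t addr, size_t size)
{
    return addr <= FAKE_WINDOW_SIZE && size <= FAKE_WINDOW_SIZE - addr;
}

/* Byte-wise variant: alignment-agnostic, but one access per byte. */
static int read_bytewise(size_t addr, void *buffer, size_t size)
{
    assert(buffer != NULL);
    if (!in_window(addr, size))
        return 0;
    uint8_t *dst = buffer;
    for (size_t i = 0; i < size; i++)
        dst[i] = fake_window[addr + i];
    return 1;
}

/* memcpy variant: same result, copy strategy left to the C library. */
static int read_memcpy(size_t addr, void *buffer, size_t size)
{
    assert(buffer != NULL);
    if (!in_window(addr, size))
        return 0;
    memcpy(buffer, &fake_window[addr], size);
    return 1;
}
```

Both variants fill the buffer with the same bytes; the difference is only in how many individual accesses are issued to get there.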
yes, |
perf is a great tool, recommended (I use linux-perf-4.6 from jessie-backports on the Altera) |
Is it necessary to call both ioport_update and tram_read / write? I think that just duplicates the same access to the hm2 firmware. |
This is what I'm driving at https://github.com/machinekit/machinekit/blob/master/src/hal/drivers/mesa-hostmot2/hostmot2.c#L78. The read/write functions automatically pull/push in the GPIO status. Is this in your config?
Would that be a duplicate GPIO read write per cycle to the firmware? |
yes, here not sure, maybe @cdsteinkuehler or PCW know more? or Michael @micges for sure? |
Yeah, I'm willing to bet that's a duplicate interaction. I don't call the gpio functions directly in my configs because the pins are updated automatically. I think the gpio update functions are there for when you don't need all of the other stuff hm2 provides and want to save the read time because of the slower bus speeds I mentioned (or even worse, bus speed across slow ethernet). Maybe try removing those two calls. The other thing I do is remove any extra modules from the config firmware, which reduces the amount of data passed back and forth across the slower buses. E.g., instead of 10 stepgens, only instantiate 4, 5, or 6. |
well, our website has the answer..
summary: if one calls hm2....read, hm2...read_gpio is not needed, same for write/write_gpio |
it's interesting to compare the IRQ- and non-IRQ versions of perf output
nevertheless a couple of things can be read out of the below stats:
IRQ version:
for comparison, without IRQ - task at 5kHz:
|
This aligns with what I've seen in the firmware and the driver code. It makes sense intuitively: the more pins/modules you add, the more data you need to swap back and forth, and the cost grows linearly.
I wonder if this is from the driver or from some interaction with the interrupt controller. For instance, all of these interrupts are shared interrupts. If you have more than one interrupt registered, based on that document you linked from Linus, all registered interrupt handlers get notified, and they check if it is theirs to handle. Perhaps by turning on our uio interrupt, it's showing the additional overhead of that notification framework along with the overhead of the uio driver. That leads me to wonder: What are the debug config options in your kernel config? Lots of debugging turned on? From your testing and my testing, it looks like we sacrifice some processing time for less interrupt jitter - a fair trade in my mind. |
@dkhughes - that is the stock cv kernel - I think perf should work just the same on the zynq. The tradeoff is ok - I'm just exploring options. Assume we were to do an alternative polling solution for external timing - let's think through if/how that would be possible. Naive approach:
does that make sense? would we cause an unclaimed IRQ making the kernel complain ".. nobody cared" ? |
No, that should be totally fine, just don't register the IRQ with the controller. If the interrupt isn't registered it shouldn't cause any trouble. That being said, we probably need to go back to the polling version of the hm2_uio driver that was dkms built, just to be safe. Then, we need to poll bit '0' of the IRQ status register (hm2's status register, not the ARM interrupt controller) in waitirq in a tight spin loop. When it goes true (or times out) we break out, and write clear interrupt to the hm2 firmware just like we are doing now. It will eat an entire thread, but with the 2 cores I think we're fine. We might have our cake and eat it too in that case, since it would skip the uio_irq portion entirely and we just deal directly with the firmware. |
Might be worth a shot of turning down the debug level, and maybe turn on the quiet option. |
The picture is a bit different on the zynq; not sure I see a common pattern yet which warrants action. Enable jessie-backports and install linux-perf. irqtest.hal/5kHz,
|
That aligns with what I mentioned before about the slow bus speed at the processor -> fpga interface. The zynq side is configured with a 100MHz axi interface. I'd have to double check, but I think the cv interface is only clocked at 50MHz, which would double your transfer time before any (if there is any) extra overhead incurred by an avalon->axi merge. |
re the tight spin loop: starvation is always an issue with an RT thread and spinning, and the cycles are lost for good. One way could be to adopt an old idea of PCW and do the following:
|
On 8/22/2016 10:23 AM, Michael Haberler wrote:
The way to make things fast is to avoid reads. If it seriously becomes an issue to read the required data from the
The overhead of doing bus-mastering DMA is a pain, but it provides There are some memory semantics you need to be careful with when doing For now, switching to memcpy seems like the low-hanging fruit. :) Charles Steinkuehler |
I don't need the extra performance yet, but I have to say, full blown dma between the host and the fpga sounds like fun. I'd have to read the TRM a lot more for the zynq first... Maybe after some of the other wishlist things get ticked off...like allowing aux gpio when cutter comp is turned on... |
observed running irqtest.hal, on terasic de0, times are in nS:
read() and write() are almost 60uS - together that's 120uS or 12% of the cycle budget at 1kHz; the waitirq() time basically shows the budget left for motion, kins etc
at 5kHz we have only about 50uS left for actual work:
this suggests looking into where the time is spent
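The budget arithmetic above can be checked with a few lines of C. This is a back-of-the-envelope sketch with hypothetical helper names. It counts only the read() and write() time, so at 5kHz it gives an upper bound on what is left, and the measured figure is lower because other per-cycle work also runs.

```c
#include <assert.h>

/* Thread period in microseconds for a thread rate given in Hz. */
static double period_us(double rate_hz)
{
    assert(rate_hz > 0.0);
    return 1e6 / rate_hz;
}

/* Share of the period spent in I/O, in percent. */
static double io_share_percent(double io_us, double rate_hz)
{
    return 100.0 * io_us / period_us(rate_hz);
}

/* Time left in the period after I/O, in microseconds. */
static double remaining_us(double io_us, double rate_hz)
{
    return period_us(rate_hz) - io_us;
}
```

With 120uS of combined read() and write() time, `io_share_percent(120.0, 1000.0)` gives the 12% quoted for 1kHz. At 5kHz the period is 200uS, so the same 120uS is 60% of the cycle before anything else has run.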