This repository has been archived by the owner on Sep 7, 2020. It is now read-only.

refactor ap_metrics #1467

Open
RanRegev wants to merge 2 commits into master from refactor/ap_link_metrics_isolated_tests

Conversation

RanRegev (Collaborator):

Refactored AP metrics query and response by collecting all responses together before sending them to the controller.

@VladyslavTupikin (Collaborator) left a comment:


combined_infra_metrics starting
Starting tcpdump, output file test_combined_infra_metrics.pcap
Capturing on 'br-a8d5d8787c22'
...
Checking for CMDU ACK (0x8000) from 0e:0c:da:fb:f8:be
Checking for CMDU AP metrics response (0x800c) from 0e:0c:da:fb:f8:be
Check AP Metrics Response message has AP Metrics TLV
Check AP metrics response has STA Link Metrics
FAIL: STA metrics with wrong MAC 51:a1:10:20:00:01 instead of 51:a1:10:20:00:09
...
Terminating tcpdump
combined_infra_metrics failed

Please take a look at the results of test_flows.py in CI.

@RanRegev RanRegev force-pushed the refactor/ap_link_metrics_isolated_tests branch from 62f62c6 to d1731ea Compare June 23, 2020 14:35
@RanRegev RanRegev requested a review from vitalybu June 23, 2020 14:36
@morantr (Collaborator) commented Jun 24, 2020:

I would be really happy if no refactoring were done on the backhaul manager until I finish moving all flows to tasks as part of the unified agent epic. It is making the task more difficult.

@RanRegev (Collaborator, Author):

> I would be really happy if no refactoring were done on the backhaul manager until I finish moving all flows to tasks as part of the unified agent epic. It is making the task more difficult.

@arnout ? @vitalybu ?

@vitalybu (Collaborator):

> I would be really happy if no refactoring were done on the backhaul manager until I finish moving all flows to tasks as part of the unified agent epic. It is making the task more difficult.

@morantr when do you plan to complete your changes?

@RanRegev RanRegev force-pushed the refactor/ap_link_metrics_isolated_tests branch from d1731ea to c459d9a Compare June 24, 2020 07:05
@morantr (Collaborator) commented Jun 24, 2020:

> I would be really happy if no refactoring were done on the backhaul manager until I finish moving all flows to tasks as part of the unified agent epic. It is making the task more difficult.

> @morantr when do you plan to complete your changes?

It will not be very quick, it will take some time. It is a little hard to estimate right now, but I have already started moving the topology flow to a different task, and I plan to do the same for the link metrics right afterward.

@arnout (Collaborator) commented Jun 24, 2020:

> I would be really happy if no refactoring were done on the backhaul manager until I finish moving all flows to tasks as part of the unified agent epic. It is making the task more difficult.

> @morantr when do you plan to complete your changes?

> It will not be very quick, it will take some time. It is a little hard to estimate right now, but I have already started moving the topology flow to a different task, and I plan to do the same for the link metrics right afterward.

I understand the feeling, but we can't stop the world while waiting for that to be finished...

But point taken, we'll try to focus on different parts of the code.

@morantr (Collaborator) commented Jun 24, 2020:

> I would be really happy if no refactoring were done on the backhaul manager until I finish moving all flows to tasks as part of the unified agent epic. It is making the task more difficult.

> @morantr when do you plan to complete your changes?

> It will not be very quick, it will take some time. It is a little hard to estimate right now, but I have already started moving the topology flow to a different task, and I plan to do the same for the link metrics right afterward.

> I understand the feeling, but we can't stop the world while waiting for that to be finished...

> But point taken, we'll try to focus on different parts of the code.

@arnout, I'm not asking to "stop the world", only to hold off on refactoring tasks.
If it is a bug fix or an important feature that is needed for certification or something like that, go ahead, but refactoring tasks are by definition something that can be delayed until the right moment.

@RanRegev RanRegev force-pushed the refactor/ap_link_metrics_isolated_tests branch from 655f403 to 5aa619a Compare June 24, 2020 13:30
@RanRegev RanRegev added the ready for merge PR is ready to be merged (automatically) label Jun 25, 2020
@arnout arnout dismissed stale reviews from vitalybu and VladyslavTupikin June 29, 2020 08:05

Superseded

@mariomaz (Collaborator) left a comment:

Not finished yet.

m_expected_ap_metrics_response.reset_to_new_mid(mid);

// compare two mac addresses using string compare on their octets
auto mac_comp = [](const sMacAddr &addr1, const sMacAddr &addr2) -> bool {
Collaborator:

Sorting MAC addresses looks like a very handy and recurrent operation. Please consider defining operator< for sMacAddr in tlvftypes.h

Collaborator:

So my point of view is that it is a bad idea to define a comparison operator on things that are not ordered. mac1 < mac2 has no meaning at all. It would be the same as defining operator+ on MAC addresses. The only thing that defining such an operator does is to motivate people to use std::map or std::set, which are not appropriate containers in most cases.

The only reason why a comparison operation is defined here is that the standard library developers are too lazy :-) to implement an intersection algorithm that works on unordered containers (probably because that would be highly inefficient). And of course, for this particular case, there is a good reason to use an ordered container, because it makes the intersection operation efficient.

However, if both Ran and Mario have no qualms about defining an operator that has only implementation semantics and no real-world semantics, I'm not going to stop it.

Collaborator:

If it is going to be defined somewhere, it needs to be defined in "tlvtypes.h" in the framework.
And in a separate preparatory commit, of course.
Also, please don't use memcmp(); replace it with addr1 == addr2.

Collaborator:

We need memcmp() because we are implementing operator<

Collaborator:

The memcmp is in the internal lambda function. I could compare it however you want. I don't see how it is related to operator<.

Collaborator:

Please note that the result of memcmp() is compared against 0 with operator<, not with operator==
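For illustration, a minimal sketch of what such an operator< could look like; the header location and the oct member layout are assumed from the snippets in this thread, so treat it as a sketch rather than the framework's actual code:

#include <cstring> // memcmp

// Sketch: a strict weak ordering over the raw octets, which is all that
// std::sort, std::set and std::set_intersection need from sMacAddr.
inline bool operator<(const sMacAddr &lhs, const sMacAddr &rhs)
{
    return std::memcmp(lhs.oct, rhs.oct, sizeof(lhs.oct)) < 0;
}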

int i = 0;
for (const auto &mac : required_bssids) {
auto list = query->bssid_list(i);
std::get<0>(list) = true;
Collaborator:

This assignment is pointless. The first element in the tuple is only used when reading, to know whether the second element is valid.


if (!message_com::send_cmdu(socket->slave, cmdu_tx)) {
LOG(ERROR) << "Failed forwarding AP_METRICS_QUERY_MESSAGE message to son_slave";
ret &= false;
Collaborator:

If the CMDU fails to be forwarded to one slave, the backhaul manager will never get the expected response from that slave.
It's pointless to forward the CMDU to the rest of the slaves, therefore the loop should break.
See my previous comment on this very same issue.

Collaborator (Author):

I don't know. Technically you are right. However, we might change the response implementation in the future to send back "whatever was returned". In that case, stopping on the first failure would be wrong.
Moreover, this loop creates an asynchronous set of requests. Stopping on failure may produce different results between runs just because the order of the sockets is different.
All in all, I am not sure that stopping on error is correct here.

}

return ret;
Collaborator:

If this method returns false, the CMDU is going to be forwarded to all the slaves as per handle_cmdu().
I think this should be avoided by always returning true.

@mariomaz (Collaborator) left a comment:

Even though tests are passing, I think there is at least one scenario not properly considered.

uint16_t length = message_com::get_uds_header(cmdu_rx)->length;
cmdu_rx.swap(); //swap back before forwarding
return send_cmdu_to_bus(cmdu_rx, controller_bridge_mac, bridge_info.mac, length);
if (m_expected_ap_metrics_response.get_mid() != mid) {
Collaborator:

Consider the following scenario:

  • Controller sends AP Metrics Query message
  • backhaul manager forwards message to slaves
  • All slaves process message and send response to backhaul manager
  • backhaul manager gathers all responses and sends full AP Metrics Response message to controller
  • One monitor/slave sends AP Metrics Response message with mid = 0 because channel utilization crossed threshold.

Since this mid equal to 0 does not match the mid of the previously processed AP Metrics Query message, this last response will wrongly be discarded.

Also see @arnout's comment

Collaborator (Author):

We had a long discussion about how to handle AP Metrics Responses correctly for all cases: a specific request, repeated requests, and a spontaneous AP metrics response - taking into account that mid=0 is legitimate in all cases.
There was no correct and simple solution.
Options were either simple or correct, but not both.
Anyhow, I moved the unconditional forward on mid=0 to be the first check.

Collaborator:

You shall not unconditionally forward responses from slaves to the controller if the mid is equal to 0.
If you do so, responses to the query in the case of periodic metrics reporting are not gathered together into a single response.
The proof is that tests are now failing.

Please check it, but I think that restoring the code to what it was when I wrote my comment, and calling m_expected_ap_metrics_response.reset_to_new_mid(0); if m_expected_ap_metrics_response.is_expected_bssid_empty(), right before sending the AP Metrics Response message, will do the job. Well, at least if the mid in the AP Metrics Query message was not also 0, but that will happen once in a blue moon only.
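A rough sketch of that suggestion, with the member names taken from the comment above and the surrounding control flow assumed:

// Sketch only: once all expected per-slave responses have arrived, re-arm
// the expected mid to 0 (for unsolicited responses) before forwarding the
// combined AP Metrics Response to the controller.
if (m_expected_ap_metrics_response.is_expected_bssid_empty()) {
    m_expected_ap_metrics_response.reset_to_new_mid(0);
    // ... send the combined AP Metrics Response message here ...
}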

> There was no correct and simple solution.

My suggestion is yet another patch on top of the list of patches used to implement this flow. The problem we are trying to solve is, or closely resembles, a well-known problem called stack ripping. I've been thinking about how to correctly implement this flow by addressing the root cause of the issue, and I think that I've come up with a solution: using one new sequence byte in the UDS header called rid (for "request ID") would allow us to correctly match queries and responses. The trick is that, unlike the mid value, the rid would not be allowed to be 0, as this value would be reserved for responses without a query (for the channel utilization use case).

Maybe it's too late for this PR, but I had to write this just in case it can be used to implement a similar flow in the future.
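To make the rid idea a bit more concrete, here is a rough sketch; the struct and helper below are hypothetical illustrations of the scheme, not existing prplMesh code:

// Hypothetical sketch of the "request ID" scheme described above.
struct sUdsHeaderWithRid {
    uint8_t rid; // 0 is reserved for unsolicited responses (e.g. channel
                 // utilization crossing its threshold); queries use 1..255.
    // ... the existing UDS header fields would follow here ...
};

// Allocate the next rid, skipping 0 so a real query can never be confused
// with an unsolicited response.
uint8_t allocate_rid(uint8_t &last_rid)
{
    last_rid = static_cast<uint8_t>(last_rid % 255) + 1;
    return last_rid;
}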

Vladyslav Tupikin and others added 2 commits June 29, 2020 17:12
…SAGE

Create a new class, ExpectedApMetricsResponse, for
easy handling of the logic of the AP Metrics query and response.
The class saves the expected bssid responses and
collects them together before sending the combined
response to the controller.

Signed-off-by: Vladyslav Tupikin <[email protected]>
Signed-off-by: Ran Regev <[email protected]>
added test for ap metrics query/response
tests that there is exactly one response from
each agent within the time of the interval

MAP-4.7.4_ETH_FH24G:netgear-rax40
MAP-4.7.4_ETH_FH5GL:netgear-rax40
MAP-4.7.4_ETH_FH5GH:netgear-rax40
MAP-4.7.5_ETH_FH24G:netgear-rax40
MAP-4.7.5_ETH_FH5GL:netgear-rax40
MAP-4.7.5_ETH_FH5GH:netgear-rax40
MAP-4.7.6_ETH_FH24G:netgear-rax40
MAP-4.7.6_ETH_FH5GL:netgear-rax40
MAP-4.7.6_ETH_FH5GH:netgear-rax40

Signed-off-by: Ran Regev <[email protected]>
@RanRegev RanRegev force-pushed the refactor/ap_link_metrics_isolated_tests branch from 5aa619a to a0892f1 Compare June 29, 2020 14:25
@RanRegev RanRegev requested a review from mariomaz June 29, 2020 14:26
@morantr (Collaborator) left a comment:

Please rebase to #1504, and use available data from the AgentDB.


/**
* @brief stores expected bssids
* @details adds the bssids in the range [first,last) to the already existing internal list
@morantr (Collaborator) commented Jul 12, 2020:

Unbalanced parentheses: [first,last)
Edit: I see now that a lot of places on documentation websites do it the way you did, not sure why. So feel free to ignore this comment.

Collaborator:

It's mathematical notation for a range that includes the beginning (square bracket) but doesn't include the end (round bracket).

Your comment, however, indicates that this is not knowledge that we can assume everyone has, so indeed it's better to replace it with

Suggested change
* @details adds the bssids in the range [first,last) to the already existing internal list
*
* Adds the bssids in the range @a first (inclusive) to @a last (exclusive) to the already existing internal list

Additional notes:

  • capitalization, as @morantr already indicated;
  • use @a to indicate an argument;
  • don't use @details to separate from the brief, instead just leave an empty line.
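Putting those notes together, the whole doc block could read roughly like this (a sketch; the parameter descriptions are paraphrased, not taken from the actual code):

/**
 * @brief Stores expected BSSIDs.
 *
 * Adds the bssids in the range @a first (inclusive) to @a last (exclusive)
 * to the already existing internal list.
 *
 * @param first iterator to the first source element
 * @param last iterator one past the last source element
 */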


* @param last iterator to the source last
* @return - nothing
*/
template <class FirstIt, class LastIt> void add_expected_bssid(FirstIt first, LastIt last);
Collaborator:

Why do you use a template here? There is only one use of this function, so please replace it with a non-template function.

Collaborator:

A template is good here, because different containers have different iterator types.

However, first and last must be the same type, so

Suggested change
template <class FirstIt, class LastIt> void add_expected_bssid(FirstIt first, LastIt last);
template <class Iterator> void add_expected_bssid(Iterator first, Iterator last);

Collaborator:

But here it is redundant because we don't have different containers. We have a single container.

* @param bssid the bssid to remove
* @return - nothing
*/
void remove_expected_bssid(const sMacAddr &);
Collaborator:

Please add argument name:
const sMacAddr &bssid
Also on find_expected_bssid().

Comment on lines +85 to +86
uint8_t m_response_buffer[beerocks::message::MESSAGE_BUFFER_LENGTH] = {0};
ieee1905_1::CmduMessageTx m_response = {m_response_buffer, sizeof(m_response_buffer)};
Collaborator:

Why would you define it? It is a waste of memory. If you create the instance in the backhaul manager, you can pass the backhaul manager's cmdu_tx as a reference and use it.
Please remove it, declare a reference to cmdu_tx instead, and initialize it in the class constructor.
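A sketch of the reference-member pattern being suggested here (the member name is illustrative; whether this fits the flow is debated in the replies below):

// Illustrative only: the class keeps a reference to the backhaul manager's
// cmdu_tx instead of owning a second MESSAGE_BUFFER_LENGTH buffer.
class ExpectedApMetricsResponse {
public:
    explicit ExpectedApMetricsResponse(ieee1905_1::CmduMessageTx &cmdu_tx)
        : m_response(cmdu_tx)
    {
    }

private:
    ieee1905_1::CmduMessageTx &m_response;
};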

Collaborator:

The response message is built up from every partial response that we get from the slaves. If we used cmdu_tx from the backhaul manager for it, it would be cleared any time a different CMDU arrives at the backhaul manager.

Collaborator:

That is not true. Arriving messages go into cmdu_rx, which has a different buffer than cmdu_tx.
Anyway, it doesn't matter; the whole class is redundant and can be inlined in the backhaul manager code, which in that case would use the cmdu_tx of the backhaul.

Collaborator:

It's used here: https://github.com/prplfoundation/prplMesh/pull/1467/files#diff-80c48bbadac679bad34f8ee37e30f772R2661

There are several responses coming from several slaves. Each of them goes into handle_slave_ap_metrics_response, which adds a TLV to the existing CMDU. Only when all of them have been received (line 2743) is the CMDU actually sent.

@morantr (Collaborator) commented Jul 13, 2020:

I think that your new comment here #1467 (comment) got me to understand why I thought this whole class was useless.
What you wrote here is not what was implemented in the PR. I saw that each time we create a new message.

So this whole thing is a tradeoff between adding all the information to some container and, once all the messages from the son_slave are done, copying the information into TLVs, versus adding a TLV for each newly received piece of data on a side cmdu_tx and sending it when it finishes???

I'm not sure which option I dislike the most. 😑
@arnout, could SharedDB somehow help avoid these two options?

Collaborator:

Since SharedDB is not actually shared with the monitor, I don't see how it would help.

The original code indeed added the information to a container instead of adding it to a CMDU. I thought the latter would be easier to understand - but apparently not :-)

if (!bssid_list.empty()) {
send_slave_ap_metric_query_message(UINT16_MAX, bssid_list);
send_slave_ap_metric_query_message(0, bssid_list);
@morantr (Collaborator) commented Jul 12, 2020:

It is probably best if the mid is an optional parameter with 0 as the default.
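A sketch of what that could look like; the return type and exact signature are assumed from the call sites above, and since a defaulted parameter has to sit at the end of the parameter list in C++, the mid moves behind the bssid list:

// Sketch only: default the mid to 0 so the common call site can omit it.
void send_slave_ap_metric_query_message(const std::vector<sMacAddr> &bssid_list,
                                        uint16_t mid = 0);

// The call above then simply becomes:
send_slave_ap_metric_query_message(bssid_list);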

m_expected_ap_metrics_response.reset_to_new_mid(mid);

// compare two mac addresses using string compare on their octets
auto mac_comp = [](const sMacAddr &addr1, const sMacAddr &addr2) -> bool {
Collaborator:

If it is going to be defined somewhere, it needs to be defined in "tlvtypes.h" in the framework.
And in a separate preparatory commit, of course.
Also, please don't use memcmp(); replace it with addr1 == addr2.

Comment on lines +1263 to +1264
std::vector<sMacAddr> given_bssid_list = bssid_list;
std::sort(given_bssid_list.begin(), given_bssid_list.end(), mac_comp);
Collaborator:

If you need it sorted, change the type of bssid_list to std::set in backhaul_fsm_main().
It is guaranteed to be sorted, with the added benefit that no reallocation occurs on every push_back.
That way you also don't need to copy the input argument into a local variable.
(Although it was totally OK to define it as a non-const reference, and then you wouldn't have had to copy it anyway.)
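A sketch of that suggestion, assuming an operator< for sMacAddr is available (see the earlier thread) and that bssid_list is declared as a std::set at the caller; current_socket_bssids is the per-socket set built in the next hunk:

#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Sketch only: with bssid_list already a std::set<sMacAddr> it is sorted and
// deduplicated, so no copy and no explicit std::sort are needed before the
// intersection with the per-socket BSSIDs.
std::vector<sMacAddr> required_bssids;
std::set_intersection(bssid_list.begin(), bssid_list.end(),
                      current_socket_bssids.begin(), current_socket_bssids.end(),
                      std::back_inserter(required_bssids));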

Comment on lines +1278 to +1284
std::vector<sMacAddr> current_socket_bssids;
std::transform(socket->vaps_list.vaps, socket->vaps_list.vaps + beerocks::IFACE_TOTAL_VAPS,
std::back_inserter(current_socket_bssids),
[](const beerocks_message::sVapInfo &vap_info) { return vap_info.mac; });

auto list = query->bssid_list(0);
std::get<0>(list) = true;
std::get<1>(list) = mac;
// sort current vaps bssid
std::sort(current_socket_bssids.begin(), current_socket_bssids.end(), mac_comp);
Collaborator:

Also here, make it a std::set.
And I don't think that std::transform is a great choice; std::fill could fit better.
One more thing: when you iterate over the vaps, you need to check that the vap is valid. Some elements in the array are empty.

std::set<sMacAddr> current_socket_bssids;
for (const auto &vap : socket->vaps_list.vaps) {
    if (vap.mac == network_utils::ZERO_MAC) {
        continue;
    }
    current_socket_bssids.insert(vap.mac);
}

}
}
// fill a list of the common bssids: given list with current vaps
std::vector<sMacAddr> required_bssids;
Collaborator:

std::vector is good to use if you need to push elements only once or a few times.
Since every call of this function pushes new things, it is not the best container to choose here.
Please change it to std::deque; it is just like a vector but without the nasty reallocation.

Collaborator:

Begging to differ here...

std::vector is most efficient for small containers, especially if you only append at the end, not in the middle. Reallocation doesn't actually happen that often, because it will over-allocate a bit. A std::list, for instance, will usually have a lot more allocations. A deque will have slightly fewer allocations, but use a lot more space. Also, even though both vector and deque have O(1) access, deque access takes twice as much time (it needs two pointer dereferences instead of just one).

So, the only good reasons to use a deque are:

  • you did profiling and it turns out to be faster for realistic use cases;
  • the semantics we need are more like a deque than a general container.

I think neither is true here.

@morantr (Collaborator) commented Jul 13, 2020:

I kind of get the feeling that std::vector is the default container everyone uses without giving much thought to whether it is the right container for the job.
It is good to use if you push or insert elements once or very few times. This is not the case here.

Here is a link to a flow chart that explains how to choose a container:
https://stackoverflow.com/a/471461/6039092
I think we should adopt it and add it to our coding style document.

Collaborator:

I agree that we should adopt something.

However, I'm kind of a fan of the guidance in C++ Core Guidelines SL.con.2: use std::array if the size is constant, std::vector if not. The implicit corollary is: only use one of the other containers if profiling proves it's more efficient. I like that as a programmer I don't have to think about it; I can just use std::vector. Simpler is better IMO. But I'm not against using the flow chart either.

Note that for this particular case, the flow chart will point to std::vector (order is not important, no key lookup, no persistent positions, size doesn't vary wildly) (although I'm not entirely sure what "size varies wildly" means).

Collaborator:

"Size doesn't vary wildly" means that you push or insert elements once or a few times.
In this case, the vector will be filled each time handling the message, so its size varies wildly.
Also, a case when you push elements one by one is a case of "size varies wildly" (Unless you know how much you are going to push and reserve the size first).

I don't agree with what is said on "CppGuideline SL.con.2". It is too general and could miss bigtime.

metric.channel_utilization = ap_metrics_tlv->channel_utilization();
metric.number_of_stas_currently_associated =
// prepare ap metrics tlv
auto &ap_metrics_tx_message = m_expected_ap_metrics_response.create_tx_message();
Collaborator:

This is wrong: create_tx_message will delete all previously received metrics. The create itself should be done when the expected ap metrics response is initialized.


// compare two mac addresses using string compare on their octets
auto mac_comp = [](const sMacAddr &addr1, const sMacAddr &addr2) -> bool {
return memcmp((char *)addr1.oct, (char *)addr2.oct, sizeof(sMacAddr)) < 0;
Collaborator:

BTW, casting to (char*) is not required.
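For reference, the same lambda without the casts (memcmp takes const void*, so the octet arrays can be passed directly); a sketch, using the array size rather than sizeof(sMacAddr):

// compare two mac addresses using memcmp on their octets
auto mac_comp = [](const sMacAddr &addr1, const sMacAddr &addr2) -> bool {
    return memcmp(addr1.oct, addr2.oct, sizeof(addr1.oct)) < 0;
};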

@arnout arnout removed the ready for merge PR is ready to be merged (automatically) label Jul 15, 2020