Skip to content

Conversation

@bobrippling
Copy link
Contributor

for example, if we're deleting or delete-pending a device

closes #1574

@bobrippling
Copy link
Contributor Author

The endpoint is currently safe - on the issue (#1574) there was the idea to skip endpoint creation here and instead return early. I can include that in this PR too if desired

if (!endpoint) {
logger.debug(
`Data is from unknown endpoint '${payload.endpoint}' from device with network address '${payload.address}', creating it...`,
NS,
);
endpoint = device.createEndpoint(payload.endpoint);
}

@Nerivec
Copy link
Collaborator

Nerivec commented Dec 6, 2025

Should check against === undefined not null here.
This would not fix all possible race conditions though, just the specific issue linked. I think this needs to be done entirely in Controller (caller), we should not call the Device class when the device is deleted, and we should short-circuit the bottom processing in Controller (no trying to identify, no endpoint creation, no emitting, nothing really).

@bobrippling
Copy link
Contributor Author

Should check against === undefined not null here.

Thanks for the quick review - I went for == null since it's equivalent to === undefined || === null, to guard against future changes which might return null (since we don't have this protection from the types). But I think the new approach (.isDeleted()) is better, do you agree?

This would not fix all possible race conditions though, just the specific issue linked. I think this needs to be done entirely in Controller (caller), we should not call the Device class when the device is deleted, and we should short-circuit the bottom processing in Controller (no trying to identify, no endpoint creation, no emitting, nothing really).

Sounds good, I've made those changes. Will sort out linting/format/tests if this change is more along the right lines

@Nerivec
Copy link
Collaborator

Nerivec commented Dec 6, 2025

Have to use device.isDeleted (no parens, it's a getter). It's also called too late.

But I think we need a more complex mechanism to properly check the state.
The RCs happen because we retrieve the device, so we have a ref to it, but by the time we make a call (like on an endpoint), the device is deleted (ref still valid though). So, a pre-emptive check alone would do no good, technically, it's already done (find, by methods don't return a deleted device unless specifically requested).

There are several things to fix though:

  • if a device is deleted identifyUnknownDevice should return early, it's not unknown (have to add a method in Device to check against the static maps since it can be by ieee or nwk address).
  • should not create endpoint if isDeleted
  • should not call any other device/endpoint function if device isDeleted and said function is not appropriate for deleted

For the RCs, we need to come up with a good way to test this, so we can catch at least the major edge-cases, have a better idea of where things can go wrong exactly. i.e. write tests to find the bugs
RCs are hard to catch 😰
We could splatter isDeleted checks everywhere, but that's rather ugly 😬

@Koenkk any ideas?

@Koenkk
Copy link
Owner

Koenkk commented Dec 6, 2025

I think it will be very hard to prevent all the race conditions unless we decide to also return deleted device in

return includeDeleted ? (Device.deletedDevices.get(ieeeAddr) ?? Device.devices.get(ieeeAddr)) : Device.devices.get(ieeeAddr);
, maybe that's the more safe and simple solution? (I did not really check the impact of what that change would be)

@Nerivec
Copy link
Collaborator

Nerivec commented Dec 6, 2025

Can't return deleted device there, we'd end up triggering logic for devices we don't want to.

In Controller ZCL listener, we have a valid Device:

if (!device) {
logger.debug(`Data is from unknown device with address '${payload.address}', skipping...`, NS);
return;

we then grab the Endpoint (or create it):
let endpoint = device.getEndpoint(payload.endpoint);

we call the ZCL listener for the Device, we pass a valid Endpoint:
await device.onZclData(payload, frame, endpoint);

At some point after first snippet but before here, the device moved from Device.devices to Device.deletedDevices

we call getDevice() from the Endpoint:

const disableTuyaDefaultResponse = endpoint.getDevice().manufacturerName?.startsWith("_TZ") && process.env.DISABLE_TUYA_DEFAULT_RESPONSE;

getDevice() only grabs non-deleted Device BUT assumes Device is always present (save for the extra logging we added):
public getDevice(): Device {
const device = Device.byIeeeAddr(this.deviceIeeeAddress);
if (!device) {
logger.error(`Tried to get unknown/deleted device ${this.deviceIeeeAddress} from endpoint ${this.ID}.`, NS);
// biome-ignore lint/style/noNonNullAssertion: ignored using `--suppress`
logger.debug(new Error().stack!, NS);
}
// biome-ignore lint/style/noNonNullAssertion: ignored using `--suppress`
return device!;
}

Since Endpoint should never be present without a Device, it's a good assumption, except when we carry an Endpoint ref for too long without checking if the Device got deleted.


Of course the simplest of fix for this scenario, would just be to (not sure why it isn't like that in the first place btw):

- endpoint.getDevice().manufacturerName
+ this.manufacturerName

But it won't solve the RC problem (nor the other possible RCs). It won't fail, but it will keep processing stuff for a deleted device (and potentially make a mess in Z2M/ZHC since they expect emitted stuff & co to be from valid devices, not to mention network requests to a device that no longer is present).

Maybe we start by fixing the first two points I mentioned in previous post, plus add a return if (!Device.devices.has(this.ieeeAddr)) at the top of Device.onZclData.
We can review other cases as they show up. I think the problem will be largely limited to emit scenario (where a ref can stay for extended period).

@Koenkk
Copy link
Owner

Koenkk commented Dec 6, 2025

we'd end up triggering logic for devices we don't want to.

What kind of logic?

@Nerivec
Copy link
Collaborator

Nerivec commented Dec 6, 2025

I'm not sure I understood what you meant, because this would be everything really, from unnecessary/invalid MQTT publishing, to requests to devices that left the network (i.e. 100% fail)...

for example, if we're deleting or delete-pending a device
@Nerivec Nerivec force-pushed the fix/guard-dev-delete branch from 465805d to 2c1d081 Compare December 7, 2025 15:41
@Nerivec
Copy link
Collaborator

Nerivec commented Dec 7, 2025

I think this should cover most cases. It's not perfect, still probably several ways to RC, but it should better handle the case of Device.onZclData and the identify.

Note: Device.onZclData should be refactored a bit too (another PR), seems it's checking a bunch of if needlessly.

@bobrippling
Copy link
Contributor Author

Since Endpoint should never be present without a Device, it's a good assumption, except when we carry an Endpoint ref for too long without checking if the Device got deleted.

I guess this isn't really something we can enforce either - I had thought about forcing access to go through a sync function:

-device.getEndpoint(...)
+device.withEndpoint(endpoint => { // can't be async by the typesystem
+    ...
+})

but the same problem exists in that the endpoint ref can leak away. But it might help people think twice?

@Koenkk Koenkk merged commit 069666f into Koenkk:master Dec 7, 2025
1 check passed
@Koenkk
Copy link
Owner

Koenkk commented Dec 7, 2025

Thanks!

@bobrippling bobrippling deleted the fix/guard-dev-delete branch December 7, 2025 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

onZclData throws Cannot read properties of undefined (reading 'manufacturerName')

3 participants