For weeks now, we’ve been running into a seriously blocking problem. We use the Fly.io integration and rely on it to update our deployments. It suddenly started breaking, and we now receive daily emails, most often minutes after re-enabling it, saying that the integration has been disabled due to a timeout to Fly.
The tooltip says the following:
We were unable to sync secret your secrets to your cloud provider as the requests resulted in a timeout.
Really sorry that you’re running into trouble here! I’m not seeing this when testing on a project I have there. Could you try creating a simple test Fly.io app and Doppler project with a couple of simple test secrets and see if setting that up works? I wonder if there may be something specific about the app, or the secrets you’re syncing to it, that is causing problems here.
When it comes to the deployment, all we do is fetch the Machine spec and then update the spec with what we got back to trigger a deploy (i.e., we don’t modify it in any way). I did just notice that it appears as though they’ve updated their Secrets page UI to indicate whether a secret has been deployed or not. It seems like it’s assuming that Machines will only pick these up when a full deploy happens. We don’t trigger a deploy though – just force the Machine spec to update to pick up the new secrets. In my tests just now, this works as expected and the Machines are restarted, coming up with the updated secrets as you would expect.
Shouldn’t you edit the secrets on the app instead of the machine spec? The machines then follow the app’s secrets and therefore need to wait for a deployment, but you can then specify a deployment strategy. Right now you can’t really use it like this in production. Maybe you could add a dropdown to the integration to specify the deployment strategy?
I am unsure of the API, because it seems to not be publicly documented? I can only see this, which targets an app:
Machines inherit secrets from the app. Existing Machines must be updated to pick up secrets set after the Machine was created.
This is impacting us really hard right now in production
Edit: nvm I misread. They even say in this link you need to update it…. I wonder how fly does it in their UI. Do you have a good contact at fly to ask these questions? Or should I also open a thread there?
Just saying: my remark is not about the syncing not working when it updates the spec. It’s about fully killing all of the application’s instances and restarting. This leads to a relatively short unavailability time.
This is what we do. The secrets get updated on the app and then we fetch the machine specs and perform a no-op update to ensure they restart to get the updated secrets (otherwise, the machines would not receive the updated secrets without manual intervention). This is the exact behavior that their CLI uses when you set a secret as well from what I could tell (it sets the secret on the app and then proceeds to restart all the machines to pick up the secret change).
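For anyone following along, the "no-op update" described above could be sketched roughly like this. This is only an illustration, not Doppler's actual code: the thread never says which API the sync calls, so the sketch assumes Fly's public Machines REST API (list Machines, then POST each Machine's config back unchanged). The `send` transport and `recording_transport` helper are hypothetical names used to keep the example runnable without network access.

```python
from typing import Any, Callable, List, Tuple

# Assumption: the public Machines REST API base URL. The thread does not
# confirm that this is the API the sync actually uses.
API_BASE = "https://api.machines.dev/v1"

# Hypothetical transport: (method, url, json_body) -> parsed JSON response.
Transport = Callable[[str, str, Any], Any]


def noop_update_machines(app: str, send: Transport) -> List[str]:
    """Re-submit every Machine's current config unchanged.

    The config is sent back exactly as fetched; the update itself is what
    makes the Machine restart and pick up the app's current secrets.
    """
    machines = send("GET", f"{API_BASE}/apps/{app}/machines", None)
    updated = []
    for machine in machines:
        url = f"{API_BASE}/apps/{app}/machines/{machine['id']}"
        send("POST", url, {"config": machine["config"]})
        updated.append(machine["id"])
    return updated


def recording_transport(calls: List[Tuple[str, str, Any]]) -> Transport:
    """Hypothetical stand-in for real HTTP calls; records each request."""
    def send(method: str, url: str, body: Any) -> Any:
        calls.append((method, url, body))
        if method == "GET":
            # Made-up fleet of two Machines for demonstration purposes.
            return [
                {"id": "m1", "config": {"image": "app:1"}},
                {"id": "m2", "config": {"image": "app:1"}},
            ]
        return {"ok": True}
    return send


if __name__ == "__main__":
    calls: List[Tuple[str, str, Any]] = []
    print(noop_update_machines("demo-app", recording_transport(calls)))
    print(len(calls), "requests issued")
```

Note that the loop updates every Machine one after another without waiting for any of them to come back up, which is consistent with the short unavailability window mentioned earlier in the thread.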
I would recommend opening a thread about this in their community forum. As far as I can tell, we’re using the recommended way to restart machines so they pick up the updated secrets (we actually talked with someone on their support team to fix an issue – we had previously actually been setting these in the machine spec and backed that out so we’re just doing a no-op update to trigger a restart – and this was the method that they recommended). If there’s a better way to perform these, we’re open to making improvements!
Can you tell me what API you’re using and with what parameters? I am currently looking through the code and their CLI explicitly does a “restart only” via their GraphQL API. Locally this triggers a deployment with a blue green strategy.
Are you manually calling their GraphQL API? Or is it the HTTP API you’re using?
Looking at the flyctl source code I gained a new view on the process of Fly. I didn’t know that the blue green strategy deployment is totally orchestrated by their CLI. There is no server side orchestrator it seems that spawns the new machines, monitors them until they run green, marks them as active and then removes the old ones. It’s all done by the local CLI:
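To make the sequence concrete, here is a tiny in-memory model of the four steps as I understand them (spawn, monitor, mark active, remove old). It is not taken from flyctl: the `Machine` class, the `is_healthy` callback, and the "keep the old machines if any new one fails" behavior are my own assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Machine:
    id: str
    color: str            # "blue" = current generation, "green" = new one
    healthy: bool = False
    active: bool = False


def blue_green(old: List[Machine],
               is_healthy: Callable[[Machine], bool]) -> List[Machine]:
    """Return the fleet that serves traffic after one blue-green attempt."""
    # 1. Spawn one green Machine per existing blue Machine.
    green = [Machine(id=f"{m.id}-green", color="green") for m in old]

    # 2. Monitor the green Machines (here: a single health probe each).
    for m in green:
        m.healthy = is_healthy(m)

    # 3. Assumption: if any green Machine is unhealthy, keep blue untouched.
    if not all(m.healthy for m in green):
        return old

    # 4. Mark green as active, then retire the blue Machines.
    for m in green:
        m.active = True
    for m in old:
        m.active = False
    return green


if __name__ == "__main__":
    fleet = [Machine("a", "blue", healthy=True, active=True),
             Machine("b", "blue", healthy=True, active=True)]
    serving = blue_green(fleet, lambda m: True)
    print([m.id for m in serving])
```

The point is that whoever runs this loop owns the whole rollout: something has to stay alive for the duration of steps 1 to 4, which is what the CLI does locally.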
Understanding this I guess you only really patch the existing Machine and the app entities?
I think you should definitely not reimplement all of that. You should be using their Go package and leverage its existing secret update method. The process seems to have a huge amount of complexity (e.g., fetching the update strategy the app uses). This way you don’t need to control any deployment-specific options in your secret management tool.
Or you go the other route and set the app secrets as staged (as far as I can tell, the CLI does this implicitly by just applying the secrets to the app without deploying them: https://github.com/superfly/flyctl/blob/c32266d6acea2714d57c8bd6b88c75509aab409a/internal/command/secrets/secrets.go#L84) and tell the user to trigger the new deployment as they please. But I guess the first option fits better with your product perspective of Doppler actively rolling keys without the user’s interaction, right?
@fermentfan Our initial implementation of this sync only used the secret update API call that you linked to. In the V1 of their platform, this worked how you’d expect and it ensured your app was rolled to pick up the changes. In V2 of the platform, their Machines don’t pick up secret updates until they’re manually deployed (which is all handled by the CLI now as you noted). We specifically got requests from Fly.io users to make sure Machines would be updated. So, the sync process now is as follows:
Update each Machine with the spec we received from the API without modifying it (to ensure the Machine is restarted and picks up the updated secrets).
I think the real problem here is that Fly.io should be handling this server-side and all we should really need to do is update the secrets via the API, which should, in turn, trigger the appropriate deploy/update procedure for Machines.
I mean, I understand where you’re coming from. I’d totally want the same thing to happen if I were in your position. And I also dislike behavioral changes in APIs, but Fly gave plenty of notice that the old system was being shut down.
After all, this is now a question of whether your product works with Fly in production or not. I don’t think Fly will suddenly change their whole process architecture for deploying machines just because of you. If you advertise interoperability with Fly, you should not stop halfway through. If this isn’t fixed, it would be a showstopper for our product. And time is running out on that decision.
I will make sure to raise this point on the Fly forums, but please also try to contact them about this and be open to change.
Which is why we made the changes we did prior to the shutdown of the V1 platform – to ensure that machines would get updated after secrets were synced from Doppler and not remain running with stale secrets.
Maybe, maybe not. I suspect this might be a rough edge they’re working on though. The main issue here is that our sync system is not designed to manage an entire deployment rollout from start to finish via API. Pretty much no other platform we’ve created a sync for works in this fashion. We’re looking at options for what we can do here, but it’s possible that the only thing we’ll be able to do short term is provide an option to not perform the machine restarts automatically, so secrets would get synced, but you’d then need to manually roll your machines for them to pick up the updated secrets.
I’ve already reached out to them to see what their thoughts are on this. As I mentioned earlier, we’re definitely open to making improvements to fix this. I’m just not sure if implementing logic for managing third party deployments via API is in scope for this on our end. We’ll see what we can do though!
@fermentfan My contact at Fly never got back to me, so I posted in their forum. Unfortunately, they confirmed that they’re intentionally (somewhat bewilderingly) baking all the deployment logic into their CLI. There’s no API endpoint to trigger a deploy at all, so the only way to do that would be to bundle their CLI with our application and shell out during the sync process – which is a non-starter for us. That being the case, the only other option here is to add a checkbox when setting up the sync that would disable the machine restarts, which would then mean you’d have to manually trigger a deploy via the CLI (which does negate a decent amount of the convenience of the sync to begin with). Is that option something that might be useful to you given the current scenario?
@fermentfan Okay, the change is now deployed that gives you the option to disable the restart. You’ll have to remove your existing sync and create a new one to take advantage of the option.
Regarding webhooks – you could definitely use those to trigger restarts. We have some improvements in the pipeline for webhooks that should make that even easier (e.g., allowing you to choose specifically which configs trigger the webhook, custom authorization headers, etc.).
Yay, cool. Thank you! I just tried it out, and I am now using the webhook to trigger an action in Pipedream, which gives me better control over the authorization and body sent to the GitHub dispatch API. Works great!
It would be nice if everything for that were natively possible in the Doppler webhook settings (a custom JSON body instead of the predefined one?). Looking forward to this! Then I could remove Pipedream from the stack again.
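In case it helps someone else, the relay step is roughly the following. This is a sketch of the idea rather than my exact Pipedream code: it only builds the request for GitHub's "create a repository dispatch event" endpoint and leaves the actual HTTP call to whatever runs it. The event name `doppler-secrets-updated`, the `source` key, and the sample payload are made up, and the incoming webhook body is passed through untouched because I don't want to assume its field names.

```python
import json
from typing import Any, Dict, Tuple


def build_dispatch_request(
    owner: str,
    repo: str,
    token: str,
    webhook_payload: Dict[str, Any],
    event_type: str = "doppler-secrets-updated",  # hypothetical event name
) -> Tuple[str, Dict[str, str], bytes]:
    """Translate an incoming webhook body into a repository_dispatch request."""
    url = f"https://api.github.com/repos/{owner}/{repo}/dispatches"
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    body = {
        "event_type": event_type,
        # The original webhook body is forwarded as-is; no field names
        # from it are relied upon here.
        "client_payload": {"source": webhook_payload},
    }
    return url, headers, json.dumps(body).encode("utf-8")


if __name__ == "__main__":
    url, headers, data = build_dispatch_request(
        "my-org", "my-app", "<token>", {"example": "payload"}
    )
    print(url)
    print(data.decode("utf-8"))
```

A workflow in the repository listening for `repository_dispatch` with that event type can then run the deploy with whatever strategy the app is configured for.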
@fermentfan Nice! Yep, we just shipped the ability to select which configs the webhook triggers for (it used to only allow an entire environment, so every branch config in the environment would trigger it). Custom authorization headers and bodies are in the pipeline!