The thing is, I’ve had pretty URLs set up following the documentation this whole time. And when I go to domain.com it does not show Domain.com any more. And when I go to sub-pages the index.php isn’t part of the hyperlink.
HOWEVER when I directly go to Domain.com or Domain.com it does NOT redirect to a URL without index.php, and index.php is there.
So Google is indexing bad pages, and so far as I can tell the .htaccess rewrite presented is not sufficient. I’ve tried a few permutations (which I have since forgotten the details of, sorry), and I had one configuration that seemed to work, but when I went to the dashboard to modify the attributes of a page, I got page load errors. Those errors went away when I removed the modifications.
I couldn’t quite land on a single reliable configuration, so I was hoping someone could chime in. I’m also on v8.5.12, and upgrading to v9 is not currently possible due to addon limitations.
Anyways, is anyone able to help with the rewrite contents I need in .htaccess please?
My sitemap.xml is generating properly. Even before checking that, I figured this is probably more of an Apache configuration question (maybe one specific to Concrete CMS?).
Yeah, I have canonical URLs set up. I have followed the documentation and steps needed for all of that. A human clicking around the website never seems to encounter index.php. However, if I manually go to Domain.com the page still loads with index.php in the URL, which is likely how Google is indexing/finding those pages. The sitemap doesn’t even list any entries for index.php in any way!
And yes the .htaccess file does include the code that Concrete CMS spits out. Also keep in mind I’m using v8.5.12, so I don’t know if what is generated in v9.x is different or not (there’s addons blocking me from upgrading to v9 right now).
Yeah I’m still not seeing why any htaccess or other changes I made are actually allowing the /index.php content, and I’m actually now seeing this with 3x Concrete CMS websites (all v8.5.x).
I’ve found some threads alluding to this being a Concrete problem (from many years ago) and I’m really not sure what I can actually do at this point to improve this. This is actually hurting the SEO of these websites, so I really do need a solution here! Anyone?
You mention the .htaccess includes what Concrete CMS spits out - but does it contain other stuff, or have you modified it? Just wondering if trying just exactly what Concrete provides would be worth testing. Because yeah, otherwise I’m stumped.
Here are the contents of .htaccess (there is content below this, redacted due to sensitivity, that performs “Redirect 301” and “Redirect 410” functions):
(Sorry that I took so long to reply, lost track of this thread)
Nothing really other than it definitely looks custom compared to the default stuff generated by the core. Might also be somewhere in the server configuration. I don’t believe this happens for most installations so that would lead me to believe somewhere along the lines it’s a configuration issue.
Okay, but is there any particular direction you can point me to put effort into next? I’ve worked through a substantial number of configuration permutations for the reverse proxy in front of the host (nginx) and the HTTPD on the host itself (apache2), as well as adjustments to how the rewrite is done within .htaccess. I tried both the lines Concrete CMS itself generates “normally” and oodles of examples found online on the index.php rewrite topic. So far I have not found a reliable solution, so I reverted to the rewrite rules Concrete CMS outputs. Oh, and I believe I have also tried with caching turned both on and off within Concrete CMS.
As for the customisations within .htaccess itself, that’s in response to website analysis tools and their recommendations.
Full disclosure I’m not an Apache / Nginx / webserver expert, but my hunch is that your problem might not actually be at the webserver level.
What I’m thinking is you might have a hardcoded internal link or something in your site that contains index.php, and Google sees that and is like “oh, index.php you say? What else is under here? Why, it’s the whole site! Let us index!”
Or something equally simple. So my recommendation would be to restart your investigation from the outside in - start with Google, instead of your webserver.
As a triage measure, I think you can set things in that tool to not index, so you might be able to say “don’t index anything under index.php, Google” and it will honor that and you at least stop the problem at the surface level, even if the underlying cause is still present.
But again, even though I don’t know for certain, I would say odds are this isn’t a Concrete-specific issue, because otherwise we’d be hearing about it from left, right and center, you know?
No worries about not knowing “everything about everything”
One of the example hosts isn’t publicly accessible (dev system), the other is (and is Google Indexed). And in my testing both present/render websites with the /index.php/ aspect being included. So I hear you on the Google seeing it and being all “GIMMEEE GOBLBBLEE GOBBLE GOBBLE”, I am still perplexed by the dev system results too.
I have tried a bunch of noindex methods in robots.txt, with limited success. It has improved the indexing aspect, but it still doesn’t feel like the “best” solution. I even tried 301s for index.php, and from what I remember that started causing undesirable problems (but I forget which at the moment).
I hear you on the “why aren’t more people talking about this?” angle, and I’ve asked that myself lots. I’ve dug so deep all over the inter-butts (even many Concrete-specific threads) assuming I’m doing it wrong to come to the conclusion that I really don’t think I am.
Hardcoded links? Yeah, that’s def possible on the publicly-accessible site, but on the Dev one, I really do not believe I have that going on. I doubt I have it going on for the pub-access website either, but I don’t have the same level of confidence in that as I do for the Dev site.
Either way, thanks for your thoughts. Someday I might figure it out, dunno.
You could modify your /index.php file. I did this years ago before canonical URLs existed on 5.x legacy sites. Interesting things can be done here, but no one seems to talk about it.
I’d start with the dev site (the comment block can be removed) …
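A minimal sketch of this kind of early redirect, for anyone following along. This is not the poster's original snippet, just one way the idea could look. It assumes PHP 7.1+, a site installed at the web root, and that `REQUEST_URI` carries the URL the client originally requested (check this if a reverse proxy sits in front). The helper name `clean_url_target` is made up for illustration and is not part of Concrete CMS.

```php
<?php
// Hypothetical helper (not part of Concrete CMS): returns the clean URL for a
// request that contains a literal /index.php, or null if no redirect is needed.
// Assumes the site lives at the web root, not in a subdirectory.
function clean_url_target(string $requestUri): ?string
{
    // Capture the optional path after /index.php and the optional query string.
    if (!preg_match('#^/index\.php(?:/([^?]*))?(\?.*)?$#', $requestUri, $m)) {
        return null;
    }
    return '/' . ($m[1] ?? '') . ($m[2] ?? '');
}

// Pretty URLs that the webserver rewrites to index.php internally keep their
// original REQUEST_URI, so they do not match and cannot cause a redirect loop.
$method = $_SERVER['REQUEST_METHOD'] ?? '';
$target = clean_url_target($_SERVER['REQUEST_URI'] ?? '');

// Only redirect GET/HEAD so that form posts and AJAX POSTs are left alone.
if ($target !== null && in_array($method, ['GET', 'HEAD'], true)) {
    header('Location: ' . $target, true, 301);
    exit;
}

// ...the existing contents of index.php would continue below, unchanged.
```

With this in place, a request for `/index.php/about` would be answered with a 301 to `/about`, and `/index.php?cID=1` with a 301 to `/?cID=1`. Since an earlier redirect attempt in this thread caused dashboard load errors, it would be worth exercising the dashboard on the dev site first, and excluding any dashboard or tool paths that turn out to rely on index.php URLs.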
While that may be an option, I really would rather not modify core Concrete CMS files (and yes I’m sure index.php doesn’t really change). If the issue originates within Concrete CMS, well then we should have that changed upstream by the CMS Devs. If not, well then change something else (.htaccess? HTTPD config? dunno!)
I do appreciate you sharing, as it may help others (yay!) but I myself am likely to opt out of that method.
I’ve never seen it changed in over 10 years, and I wouldn’t consider it part of the core either. In this case I’d look at it more like adding a custom route.
This is the perfect place to do it - why involve the overhead of loading the CMS when you are just doing a redirect? Besides, it’s only 4 lines of PHP code.
You could try a feature or pull request, but I would be surprised if it made it into the core.
Again, I’ve done some very interesting/custom things in index.php in the past, among them logging DDoS attacks and stopping them from crashing the web/database server.
It’s also not specific to the type of webserver (Apache/Nginx/IIS/etc).
I should also point out that https://SOMESITE/index.php?cID=1 could also be indexed by a crawler (and is not dependent on Pretty URLs).
I’d love to hear @andrew’s opinion on this subject.